raffaelecarelle/ai-code-review-bot


AI Code Review Bot


Minimal, extensible AI-assisted code review tool for PHP projects.

  • Analyzes unified diffs (from Pull/Merge Requests or files)
  • Produces normalized findings (machine-readable JSON or human summary)
  • Loads a simple YAML/JSON config with provider/policy settings and an optional coding guidelines file
  • Safe defaults: deterministic Mock AI provider; no network calls unless configured

Official documentation: Docs

Table of Contents

    1. Objectives and scope
    2. Architecture and main modules
    3. Quick start
    4. Configuration (.aicodereview.yml)
    5. VCS adapters (GitHub/GitLab/Bitbucket)
    6. Coding guidelines file
    7. AI providers and token budgeting
    8. Security & Performance
    9. Output formats
    10. Development and QA
    11. Credits
    12. License

1. Objectives and scope

  • Functional
    • Analyze diffs and produce review findings for coding standard violations and simple risk patterns.
    • Dynamic configuration for providers, policy, token budget, rules, and VCS.
    • Post results back to PR/MR via platform adapters when requested.
  • Non-functional
    • Safe defaults: no external calls by default (mock provider) and no PR comments unless --comment.
    • Modular design to plug real LLM providers and VCS platforms.

2. Architecture and main modules (PHP)

  • bin/aicr: CLI entry point (Symfony Console) running the review command in single-command mode.
  • src/Command/ReviewCommand.php: Orchestrates reading config, loading diff (from file or git), running Pipeline, and optional PR/MR commenting. Uses Symfony Process for git.
  • src/Config.php: Loads YAML/JSON config, merges with defaults, expands ${ENV} variables, exposes sections (providers, context, policy, vcs, prompts).
  • src/DiffParser.php: Minimal unified diff parser returning added lines per file with accurate line numbers.
  • src/Pipeline.php: End-to-end pipeline: parse diff, build AI provider, chunk with token budget, apply policy, and render output.
  • src/Adapters/: VcsAdapter interface and GithubAdapter/GitlabAdapter/BitbucketAdapter implementations (resolve branches from PR/MR id and post comments).
  • src/Providers/: AIProvider interface and concrete providers (OpenAI, Gemini, Anthropic, Ollama, Mock).
  • src/Support/: Core utility classes for enhanced functionality:
    • ChunkBuilder: Intelligent diff chunking with semantic analysis and optimization
    • TokenBudget: Advanced token management with compression and per-file caps
    • ResourceManager: Safe resource handling with automatic cleanup
    • ApiCache: Response caching with TTL and size management
    • InputSanitizer: Security-focused input validation and sanitization
    • DiffProcessor: Enhanced diff processing with filtering capabilities
    • SemanticChunker: Context-aware code chunking for better AI analysis
  • src/Config/Constants: Centralized configuration constants replacing magic numbers and strings.

3. Quick start

  • Install dependencies via Composer:
composer install
  • Option A: Analyze an existing diff file
    • Create or use a unified diff, e.g., examples/sample.diff.
    • Run:
php bin/aicr review --diff-file examples/sample.diff --output summary
php bin/aicr review --diff-file examples/sample.diff --output json
php bin/aicr review --diff-file examples/sample.diff --output summary --provider openai
  • Option B: Analyze a PR/MR by ID using git
    • Configure vcs.platform in .aicodereview.yml (github, gitlab, or bitbucket) and set the required identifiers/tokens.
    • Then run (the command fetches branches, computes diff, and analyzes it):
php bin/aicr review --id 123 --output summary
php bin/aicr review --id 123 --output summary --provider gemini
  • To also post a comment back to the PR/MR, add --comment:
php bin/aicr review --id 123 --output summary --comment
php bin/aicr review --id 123 --output summary --comment --provider anthropic

Notes

  • Provide --config <path> to use a non-default config file.
  • Use --provider <name> to override the default provider from config (e.g., openai, gemini, anthropic, ollama, mock).
  • Without --diff-file, --id is required and branches are resolved via the configured adapter.

4. Configuration (.aicodereview.yml)

Example (see .aicodereview.yml in this repo and examples/config.*.yml):

version: 1
providers:
  # Safe deterministic provider by default
  default: mock
context:
  diff_token_limit: 8000
  overflow_strategy: trim
  per_file_token_cap: 2000
  enable_semantic_chunking: true
  enable_diff_compression: true
policy:
  min_severity_to_comment: info
  max_comments: 50
  redact_secrets: true
  consolidate_similar_findings: true
  max_findings_per_file: 5
  severity_limits:
    error: 10
    warning: 10
    info: 5
guidelines_file: null
vcs:
  # Set one of: github | gitlab | bitbucket
  platform: null
  # GitHub: owner/repo (optional if GH_REPO env or remote origin is GitHub)
  repo: null
  # GitLab: numeric id or full path namespace/repo (optional if GL_PROJECT_ID or remote origin is GitLab)
  project_id: null
  # GitLab: override API base for self-hosted instances (e.g., https://gitlab.example.com/api/v4)
  api_base: null
  # Bitbucket: workspace name (required for Bitbucket)
  workspace: null
  # Bitbucket: repository name (required for Bitbucket)
  repository: null
  # Bitbucket: access token for authentication (required for Bitbucket)
  accessToken: null
  # Bitbucket: API request timeout in seconds (optional, defaults to 30)
  timeout: 30
prompts:
  # Optional: append additional instructions to the base prompts used by the LLM
  # You can use single strings or lists of strings
  system_append: "Prefer concise findings and avoid duplicates."
  user_append:
    - "Prioritize security and performance related issues."
  extra:
    - "If a secret or key is detected, suggest redaction."
excludes:
  # Array of paths to exclude from code review
  # Each element is treated as glob, regex, or relative path from project root
  # Examples:
  - "*.md"           # Exclude all markdown files (glob)
  - "composer.lock"  # Exclude specific files (exact match)
  - "tests/*.php"    # Exclude files in specific directories with patterns (glob)
  - "vendor"         # Exclude entire vendor directory (directory)
  - "node_modules"   # Exclude node_modules directory (directory)
  - "build"          # Exclude build artifacts (directory)
  - "dist"           # Exclude distribution files (directory)

Notes

  • Environment variable expansion works in any string value using the ${VAR_NAME} syntax (see the sketch after these notes).
  • Tokens and IDs are read from the environment when not set in the config: GH_TOKEN/GITHUB_TOKEN, GL_TOKEN/GITLAB_TOKEN, GH_REPO, GL_PROJECT_ID.
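For example, a minimal sketch that reads values from the environment at load time (GH_REPO comes from the list above; GUIDELINES_PATH is a hypothetical variable shown only to illustrate the syntax):

vcs:
  platform: github
  # Expanded from the GH_REPO environment variable when the config is loaded
  repo: "${GH_REPO}"
# Hypothetical variable, shown only to demonstrate expansion inside any string value
guidelines_file: "${GUIDELINES_PATH}"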

5. VCS adapters (GitHub/GitLab/Bitbucket)

  • Configure vcs.platform and the platform-specific parameters as needed (a minimal sketch follows this list).
  • The review command supports a single --id option (PR number for GitHub, MR IID for GitLab, PR ID for Bitbucket).
  • Behavior when --diff-file is omitted:
    1. Resolve base/head branches from the ID via platform API.
    2. git fetch --all; fetch base/head; compute git diff base...head.
    3. Run the analysis pipeline on that diff.
  • --comment posts the summary back via the adapter.
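For reference, a minimal GitHub-oriented vcs section reusing the keys from section 4 (the repository value is a placeholder):

vcs:
  platform: github
  # Placeholder owner/repo; may also come from GH_REPO or be inferred from the git remote
  repo: my-org/my-repo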

6. Coding guidelines file

  • You can provide a project coding standard or style guide via guidelines_file in .aicodereview.yml (see the example after this list).
  • When set, its content is embedded into the LLM prompts as a base64 string. The prompt explicitly instructs the model to base64-decode the guidelines and follow them strictly during the review.
  • No provider-specific file uploads are performed: all supported providers (OpenAI, Gemini, Anthropic, Ollama) receive the same base64-embedded guidelines in the prompt.
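For example, a minimal sketch with a hypothetical guidelines path (only the guidelines_file key is real; the path is illustrative):

# .aicodereview.yml
# The file's content is base64-encoded and embedded into the LLM prompts at review time.
guidelines_file: docs/coding-guidelines.md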

7. AI providers and token budgeting

  • Supported providers in this repository: openai, gemini, anthropic, ollama, mock.
  • Select via providers.default and configure each provider section accordingly (see src/Providers/* for options).
  • Token budgeting is approximate (roughly one token per four characters). Global and per-file caps are configurable; overflow_strategy defaults to trim (see the sketch below).
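As a rough illustration using the example values from section 4 and the chars/4 heuristic:

context:
  diff_token_limit: 8000    # roughly 32,000 characters of diff in total (8000 x 4)
  per_file_token_cap: 2000  # roughly 8,000 characters per file (2000 x 4)
  overflow_strategy: trim   # content beyond the budget is trimmed rather than sent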

7.1 Advanced Token Optimization Features

The system includes sophisticated token cost optimization capabilities:

  • Semantic Chunking: Enable with enable_semantic_chunking: true to group related code changes by context (classes, methods, etc.)
  • Diff Compression: Enable with enable_diff_compression: true to intelligently compress diffs while maintaining semantic meaning
  • Trivial Change Filtering: Automatically filters out whitespace-only changes, TODO comments, and import statements
  • Similar Finding Consolidation: Set consolidate_similar_findings: true to aggregate similar issues across multiple files
  • Per-file Limits: Control review scope with max_findings_per_file to prevent overwhelming output
  • Severity Limits: Fine-tune output with severity_limits to cap the number of findings by severity level

These optimizations can reduce token usage by 30-50% for input and 40-60% for output while maintaining review quality. See docs/token-cost-optimization.md for a detailed implementation guide; the sketch below groups the relevant options.
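Put together, a cost-conscious configuration might enable these options as follows (values mirror the example in section 4):

context:
  enable_semantic_chunking: true   # group related changes by class/method context
  enable_diff_compression: true    # compress large diffs before sending them to the provider
policy:
  consolidate_similar_findings: true
  max_findings_per_file: 5
  severity_limits:
    error: 10
    warning: 10
    info: 5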

8. Security & Performance

The tool includes significant enhancements focused on security hardening, performance optimization, and code quality:

Security Enhancements

  • InputSanitizer: Comprehensive input validation and sanitization for all external data
    • Branch name, repository name, and file path validation
    • API response sanitization to prevent injection attacks
    • URL and commit SHA validation with strict patterns
  • Resource Management: Safe resource handling with automatic cleanup
    • Temporary file and directory management
    • Resource leak prevention with shutdown handlers
    • Exception-safe cleanup with try-finally patterns

Performance Optimizations

  • Intelligent Chunking: Enhanced ChunkBuilder with semantic analysis
    • Batch processing for better memory management
    • Parallel-friendly architecture for large diffs
    • Context-aware chunking for improved AI analysis
  • Advanced Token Management: Improved TokenBudget with compression
    • Per-file token caps to prevent oversized chunks
    • Diff compression for large files
    • Smart budget allocation and overflow handling
  • API Response Caching: New ApiCache system for improved performance
    • TTL-based caching with automatic expiration
    • Size-limited cache with LRU eviction
    • Request deduplication and response reuse

Code Quality Improvements

  • Constants Centralization: All magic numbers and strings moved to Constants class
  • Enhanced Error Handling: Standardized exception handling across all providers
  • Improved Documentation: Comprehensive PHPDoc comments and inline documentation
  • Security Audit: Fixed potential security issues identified in code review

Configuration Enhancements

New configuration options are available:

context:
  enable_semantic_chunking: true    # Enable context-aware chunking
  enable_diff_compression: true     # Enable diff compression for large files
  cache_ttl: 3600                  # API response cache TTL in seconds
  max_cache_size: 52428800         # Maximum cache size in bytes (50MB)

9. Output formats

  • json (default): machine-readable findings array.
  • summary: human-readable bulleted list. This is also the format used for PR/MR comments.
  • markdown: structured markdown format with emojis, metadata, and organized findings by severity and file.

10. Development and QA

  • Requires PHP and Composer.
  • Run unit and E2E tests with PHPUnit:
./vendor/bin/phpunit
  • Coding standards and static analysis:
composer analyse
  • The codebase uses declare(strict_types=1) and Symfony components (Console, YAML, Filesystem, Process).

11. Credits

  • Author: Raffaele Carelle
  • Contributors: Thanks to everyone who reports issues or submits PRs.

12. License

This project is open-sourced under the MIT License. See the LICENSE file for details.