iamgerwin/php-pdf-to-markdown-parser

A lightweight PHP library to convert PDF documents into clean, structured Markdown. Supports text extraction, headings, lists, tables, diagrams and code blocks for easier content reuse and publishing.

v0.0.1 2025-09-30 16:56 UTC

This package is auto-updated.

Last update: 2025-09-30 17:02:51 UTC


README



Because sometimes PDFs just need to chill out and become Markdown.

Features

  • 📝 Text Extraction with Styling - Preserves headings, bold, italic, and strikethrough formatting
  • 📊 Table Parsing - Extracts tables with proper headers and body formatting
  • 🎨 Diagram Support - Converts diagrams to Mermaid and dbdiagram.io formats
    • Flowcharts
    • Sequence diagrams
    • Entity Relationship Diagrams (ERD)
    • Gantt charts
    • Class diagrams
    • State diagrams
    • Pie charts
  • 📋 List Detection - Automatically converts bullet points and numbered lists
  • 💻 Code Block Recognition - Identifies and formats code snippets
  • 🚀 PHP 8.3 Compatible - Built with modern PHP features
  • PSR-12 Compliant - Follows PHP coding standards

Installation

You can install the package via composer:

composer require iamgerwin/php-pdf-to-markdown-parser

Usage

Basic Usage

use Iamgerwin\PdfToMarkdownParser\PdfToMarkdownParser;

$parser = new PdfToMarkdownParser();

// Parse a PDF file
$markdown = $parser->parseFile('path/to/document.pdf');

// Parse PDF content
$pdfContent = file_get_contents('path/to/document.pdf');
$markdown = $parser->parseContent($pdfContent);

// Output the markdown
echo $markdown;
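
For more than one file, the calls above can be wrapped in a small helper. The following is a minimal sketch, not part of the package: `convertDirectory()` is a hypothetical function, and it takes the conversion step as a callable so that it assumes nothing about the parser beyond `parseFile()` returning a Markdown string, as in the example above.

```php
<?php

/**
 * Convert every *.pdf in $sourceDir to a *.md file in $targetDir.
 *
 * $convert is any callable that maps a PDF path to a Markdown string,
 * for example a closure around $parser->parseFile().
 *
 * @return array<string, string> written Markdown path => source PDF path
 */
function convertDirectory(string $sourceDir, string $targetDir, callable $convert): array
{
    // Create the target directory on demand (the second is_dir() guards a race).
    if (!is_dir($targetDir) && !mkdir($targetDir, 0777, true) && !is_dir($targetDir)) {
        throw new RuntimeException("Cannot create directory: {$targetDir}");
    }

    $written = [];

    // glob() returns false on error, so fall back to an empty list.
    foreach (glob(rtrim($sourceDir, '/') . '/*.pdf') ?: [] as $pdfPath) {
        $target = rtrim($targetDir, '/') . '/' . basename($pdfPath, '.pdf') . '.md';
        file_put_contents($target, $convert($pdfPath));
        $written[$target] = $pdfPath;
    }

    return $written;
}
```

With the `$parser` from the basic example, a call might look like `convertDirectory('docs/pdf', 'docs/md', fn (string $path): string => $parser->parseFile($path));`. The directory names here are placeholders.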

Working with Tables

The parser automatically detects and converts tables in your PDF:

| Header 1 | Header 2 | Header 3 |
| --- | --- | --- |
| Row 1 Col 1 | Row 1 Col 2 | Row 1 Col 3 |
| Row 2 Col 1 | Row 2 Col 2 | Row 2 Col 3 |
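
To make the output shape concrete, here is a minimal sketch of how a header row and body rows can be rendered as a pipe table like the one above. `renderMarkdownTable()` is a hypothetical helper written for illustration; it is not the package's own table code and makes no claim about how that works internally.

```php
<?php

/**
 * Render a header row and body rows as a Markdown pipe table.
 *
 * @param list<string>       $header
 * @param list<list<string>> $rows
 */
function renderMarkdownTable(array $header, array $rows): string
{
    $columns = count($header);

    $formatRow = static function (array $cells) use ($columns): string {
        // Pad short rows and truncate long ones so every row has the same width.
        $cells = array_slice(array_pad($cells, $columns, ''), 0, $columns);

        // Escape literal pipes so they are not read as column separators.
        $cells = array_map(
            static fn (string $cell): string => str_replace('|', '\\|', trim($cell)),
            $cells
        );

        return '| ' . implode(' | ', $cells) . ' |';
    };

    $lines = [
        $formatRow($header),
        '| ' . implode(' | ', array_fill(0, $columns, '---')) . ' |',
    ];

    foreach ($rows as $row) {
        $lines[] = $formatRow($row);
    }

    return implode("\n", $lines);
}
```

Padding each row to the header width keeps the table valid even when the source PDF yields rows with missing cells.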

Diagram Extraction

Diagrams are automatically detected and converted to appropriate formats:

Mermaid Flowcharts:

```mermaid
flowchart TD
    Start --> Process --> End
```

ERD (dbdiagram.io format):

```dbdiagram
Table users {
  id int
  name varchar
  email varchar
}
```

Sequence Diagrams:

```mermaid
sequenceDiagram
    User->>System: Request
    System->>Database: Query
    Database->>System: Response
    System->>User: Result
```

Text Styling

The parser preserves text styling from PDFs:

  • Headings (H1-H6) based on font size and formatting
  • **Bold text**
  • *Italic text*
  • ~~Strikethrough text~~
  • Lists (bulleted and numbered)
  • Code blocks
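
As an illustration of size-based heading detection, the sketch below maps a line's font size, relative to the body font size, to a heading level. Everything in it is an assumption made for the example: `headingLevel()` and `toMarkdownLine()` are hypothetical names, and the ratio thresholds are illustrative guesses rather than values taken from the package.

```php
<?php

/**
 * Map a line's font size to a Markdown heading level (1-6), or null for body text.
 * The ratio thresholds are illustrative guesses, not values used by the package.
 */
function headingLevel(float $fontSize, float $bodyFontSize): ?int
{
    if ($bodyFontSize <= 0.0) {
        throw new InvalidArgumentException('Body font size must be positive.');
    }

    $ratio = $fontSize / $bodyFontSize;

    // Ordered from largest to smallest: the first threshold reached decides the level.
    $thresholds = [1 => 2.0, 2 => 1.7, 3 => 1.45, 4 => 1.25, 5 => 1.15, 6 => 1.05];

    foreach ($thresholds as $level => $minRatio) {
        if ($ratio >= $minRatio) {
            return $level;
        }
    }

    return null;
}

/** Prefix the text with the matching number of "#" characters, if it is a heading. */
function toMarkdownLine(string $text, float $fontSize, float $bodyFontSize = 12.0): string
{
    $level = headingLevel($fontSize, $bodyFontSize);

    return $level === null ? $text : str_repeat('#', $level) . ' ' . $text;
}
```

With the default 12 pt body size, `toMarkdownLine('Installation', 18.0)` returns `### Installation`, while a 12 pt line is returned unchanged.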

Advanced Configuration

Custom Extractors

You can extend the parser with custom extractors:

```php
use Iamgerwin\PdfToMarkdownParser\PdfToMarkdownParser;
use Iamgerwin\PdfToMarkdownParser\Extractors\TextExtractor;
use Iamgerwin\PdfToMarkdownParser\Extractors\TableExtractor;
use Iamgerwin\PdfToMarkdownParser\Extractors\DiagramExtractor;

$parser = new PdfToMarkdownParser();

// The parser uses these extractors internally:
// - TextExtractor: Handles text and styling
// - TableExtractor: Processes tables
// - DiagramExtractor: Converts diagrams
```

Testing

Run the test suite:

composer test

Run tests with coverage:

composer test-coverage

Run PHPStan static analysis:

composer analyse

Format code with Laravel Pint:

composer format

Requirements

  • PHP 8.3 or higher
  • ext-mbstring

How It Works

The parser uses a multi-stage extraction process:

  1. PDF Parsing - Uses the robust smalot/pdfparser library to extract raw content
  2. Text Analysis - Identifies text styling, headings, and formatting patterns
  3. Table Detection - Recognizes table structures (pipe, tab, or space-separated)
  4. Diagram Recognition - Detects diagram patterns and converts to Mermaid/dbdiagram formats
  5. Markdown Generation - Combines all elements into properly formatted Markdown
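
Step 3 names three separators: pipes, tabs, and spaces. The sketch below shows one way a single line of extracted text could be split into cells under that description. `splitTableRow()` is a hypothetical helper, and treating a run of two or more spaces as a column gap is an assumption of this example, not documented behavior of the package.

```php
<?php

/**
 * Split one line of extracted text into table cells, or return null when the
 * line does not look like a table row. Tries pipes first, then tabs, then
 * runs of two or more spaces (an assumed heuristic).
 *
 * @return list<string>|null
 */
function splitTableRow(string $line): ?array
{
    $line = trim($line);
    if ($line === '') {
        return null;
    }

    if (str_contains($line, '|')) {
        // Drop the optional outer pipes before splitting on the inner ones.
        $cells = explode('|', trim($line, '|'));
    } elseif (str_contains($line, "\t")) {
        $cells = explode("\t", $line);
    } else {
        $cells = preg_split('/ {2,}/', $line) ?: [];
    }

    $cells = array_map('trim', $cells);

    // A single cell means no separator was found, so treat the line as prose.
    return count($cells) >= 2 ? $cells : null;
}
```

Single spaces are deliberately not treated as separators, otherwise every ordinary sentence would be read as a table row.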

Limitations

  • Images: Currently, images are not extracted (coming in future versions)
  • Complex Layouts: Multi-column layouts may require manual adjustment
  • Font Styling: Bold/italic detection is basic because font metadata parsing is limited
  • Diagrams: Pattern matching may not catch all diagram types
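
The diagram limitation is easier to see with a concrete pattern. The sketch below turns arrow chains such as `Start -> Process -> End` into Mermaid flowchart source. It is an illustration only: `arrowsToMermaid()` and `mermaidNode()` are hypothetical helpers, and the set of recognized arrows is an assumption of this example, not the package's actual rule set.

```php
<?php

/**
 * Convert lines such as "Start -> Process -> End" into Mermaid flowchart source.
 * Returns null when no line contains an arrow chain.
 */
function arrowsToMermaid(string $text): ?string
{
    $edges = [];

    foreach (preg_split('/\R/', $text) ?: [] as $line) {
        // Accept "->", "-->", "=>" and "→" as arrows between node labels.
        $labels = preg_split('/\s*(?:-{1,2}>|=>|→)\s*/u', trim($line)) ?: [];

        // Skip lines without an arrow and lines with a dangling arrow.
        if (count($labels) < 2 || in_array('', $labels, true)) {
            continue;
        }

        // Each pair of neighbouring labels becomes one edge.
        for ($i = 0; $i < count($labels) - 1; $i++) {
            $edges[] = sprintf(
                '    %s --> %s',
                mermaidNode($labels[$i]),
                mermaidNode($labels[$i + 1])
            );
        }
    }

    return $edges === [] ? null : "flowchart TD\n" . implode("\n", $edges);
}

/** Build a Mermaid node such as Review_order["Review order"] from a free-text label. */
function mermaidNode(string $label): string
{
    // Node ids may not contain spaces, so derive an id and keep the label quoted.
    $id = trim(preg_replace('/[^A-Za-z0-9]+/', '_', $label), '_');

    return sprintf('%s["%s"]', $id === '' ? 'node' : $id, str_replace('"', "'", $label));
}
```

A matcher like this only sees diagrams whose structure survives as text in one of the expected notations. A diagram drawn purely as vector shapes, or one using a different arrow style, produces no match, which is why pattern-based detection cannot cover every diagram.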

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Security

If you discover any security related issues, please email iamgerwin@live.com instead of using the issue tracker.

Credits

License

The MIT License (MIT). Please see License File for more information.

Acknowledgments

Built with inspiration from the PHP community and the need to make PDF content more accessible and reusable. Special thanks to the maintainers of smalot/pdfparser for their excellent PDF parsing library.