insightbase/invoice-parser-nette

Nette package for parsing invoice/accounting documents from PDF (including scanned PDFs) using Azure Document Intelligence, LLM normalization and Czech-specific validation.

Package info

github.com/insightbase/InvoiceParser-nette

pkg:composer/insightbase/invoice-parser-nette

v1.0.1 2026-03-15 11:42 UTC


README

Nette package for extracting data from invoices and accounting documents in PDF (including scans) using:

  • Azure Document Intelligence (OCR + structured extraction)
  • LLM normalization (Azure OpenAI)
  • Czech regex fallbacks (VS = variable symbol, DUZP = date of taxable supply, IČO = company ID, DIČ = tax ID)
  • a validation layer and an asynchronous worker pattern

Installation

composer require insightbase/invoice-parser-nette

Registering the extension

extensions:
    invoiceParser: InsightBase\InvoiceParserNette\DI\InvoiceParserExtension

invoiceParser:
    azureDi:
        endpoint: %env(AZURE_DI_ENDPOINT)%
        apiKey: %env(AZURE_DI_KEY)%
        model: prebuilt-invoice
        apiVersion: '2023-07-31'  # quoted, because NEON parses a bare date as a DateTime
        maxPollAttempts: 25
        pollIntervalMs: 1000
    llm:
        enabled: true
        endpoint: %env(AZURE_OPENAI_ENDPOINT)%
        deployment: %env(AZURE_OPENAI_DEPLOYMENT)%
        apiKey: %env(AZURE_OPENAI_KEY)%
        apiVersion: '2024-10-21'  # quoted, because NEON parses a bare date as a DateTime
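
The maxPollAttempts and pollIntervalMs options suggest that the package polls Azure Document Intelligence until the analysis finishes. The README does not show that loop, so the following is only a minimal sketch of how such polling could work. The function pollUntilDone and the $fetchStatus callable are hypothetical names, not part of the package; the status values are the ones the Document Intelligence REST API reports for an analyze operation.

```php
<?php

declare(strict_types=1);

/**
 * Polls $fetchStatus until the operation reaches a terminal state
 * or the attempts run out.
 *
 * $fetchStatus is a hypothetical callable that performs one GET on the
 * operation URL and returns the decoded JSON body as an array.
 *
 * @param callable(): array<string, mixed> $fetchStatus
 * @return array<string, mixed> the final operation payload
 */
function pollUntilDone(callable $fetchStatus, int $maxPollAttempts = 25, int $pollIntervalMs = 1000): array
{
    for ($attempt = 1; $attempt <= $maxPollAttempts; $attempt++) {
        $operation = $fetchStatus();
        $status = $operation['status'] ?? null;

        if ($status === 'succeeded') {
            return $operation;
        }
        if ($status === 'failed') {
            throw new RuntimeException('Document analysis failed.');
        }

        // Still "notStarted" or "running": wait before the next attempt,
        // but do not sleep after the last one.
        if ($attempt < $maxPollAttempts) {
            usleep($pollIntervalMs * 1000);
        }
    }

    throw new RuntimeException("Analysis did not finish after {$maxPollAttempts} attempts.");
}
```

In this sketch the values from the configuration above (25 attempts, 1000 ms) give up after roughly 24 seconds of waiting plus request time; large scanned documents may need a higher limit.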

Usage

<?php

declare(strict_types=1);

use InsightBase\InvoiceParserNette\Parser\InvoiceParser;

final class InvoiceService
{
    public function __construct(
        private InvoiceParser $invoiceParser,
    ) {
    }

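    public function parse(string $pdfPath): array
    {
        // file_get_contents() returns false on failure; fail loudly instead of
        // casting it to an empty string and sending that to the parser.
        $pdfContent = file_get_contents($pdfPath);
        if ($pdfContent === false) {
            throw new \RuntimeException("Cannot read PDF file: {$pdfPath}");
        }

        $result = $this->invoiceParser->parsePdf($pdfContent);

        return $result->invoice->toArray();
    }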
}

Asynchronous worker (Contributte RabbitMQ)

The library ships a worker service, InvoiceParseWorker::process(array $message).

Example message payload:

{
  "pdfPath": "/data/invoices/invoice-2026-001.pdf"
}

Or:

{
  "pdfBase64": "JVBERi0xLjQKJ..."
}

A sample integration is provided in examples/rabbitmq.neon and examples/InvoiceConsumer.php.
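
The two payload shapes above can be reduced to raw PDF bytes before parsing. The following is a minimal sketch of that step, not the package's actual worker code: resolvePdfContent is a hypothetical helper, and preferring pdfBase64 when both keys are present is an assumption made here.

```php
<?php

declare(strict_types=1);

/**
 * Resolves a worker message to raw PDF bytes.
 * Accepts either "pdfBase64" or "pdfPath", as in the payload examples.
 *
 * @param array<string, mixed> $message
 */
function resolvePdfContent(array $message): string
{
    if (isset($message['pdfBase64']) && is_string($message['pdfBase64'])) {
        // Strict mode rejects characters outside the Base64 alphabet.
        $content = base64_decode($message['pdfBase64'], true);
        if ($content === false) {
            throw new InvalidArgumentException('pdfBase64 is not valid Base64.');
        }
    } elseif (isset($message['pdfPath']) && is_string($message['pdfPath'])) {
        $content = @file_get_contents($message['pdfPath']);
        if ($content === false) {
            throw new RuntimeException("Cannot read PDF file: {$message['pdfPath']}");
        }
    } else {
        throw new InvalidArgumentException('Message must contain "pdfPath" or "pdfBase64".');
    }

    // A PDF file normally begins with the "%PDF-" signature; reject anything else early.
    if (!str_starts_with($content, '%PDF-')) {
        throw new InvalidArgumentException('Payload does not look like a PDF.');
    }

    return $content;
}
```

The returned string could then be passed to InvoiceParser::parsePdf(), as in the usage example. Checking the signature before calling Azure avoids paying for a request that cannot succeed.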

Notes

  • For scanned PDFs, OCR is handled on the Azure Document Intelligence side.
  • The regex fallback is a supplement for cases where DI/LLM return incomplete data.
  • The validator checks basic consistency of amounts and dates.
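
To illustrate what a Czech-specific regex fallback with basic validation might look like, here is a minimal sketch. It is not the package's implementation: the function names, the field keys and the label patterns are assumptions chosen for this example. The IČO check uses the standard modulo-11 checksum for Czech company IDs.

```php
<?php

declare(strict_types=1);

/** Validates an 8-digit IČO using the modulo-11 checksum. */
function isValidIco(string $ico): bool
{
    if (preg_match('/^\d{8}\z/', $ico) !== 1) {
        return false;
    }

    // Weights 8..2 apply to the first seven digits.
    $sum = 0;
    for ($i = 0; $i < 7; $i++) {
        $sum += (int) $ico[$i] * (8 - $i);
    }

    return (int) $ico[7] === (11 - $sum % 11) % 10;
}

/**
 * Extracts VS, IČO, DIČ and DUZP from OCR text (hypothetical fallback).
 *
 * @return array{variableSymbol: ?string, ico: ?string, dic: ?string, duzp: ?string}
 */
function extractCzechFields(string $text): array
{
    $fields = ['variableSymbol' => null, 'ico' => null, 'dic' => null, 'duzp' => null];

    // Variable symbol: at most 10 digits after "VS" or "variabilní symbol".
    if (preg_match('/(?:variabiln[íi]\s+symbol|\bVS)\s*[:.]?\s*(\d{1,10})\b/iu', $text, $m) === 1) {
        $fields['variableSymbol'] = $m[1];
    }
    // IČO: 8 digits, accepted only if the checksum holds.
    if (preg_match('/\bI[ČC]O?\s*[:.]?\s*(\d{8})\b/iu', $text, $m) === 1 && isValidIco($m[1])) {
        $fields['ico'] = $m[1];
    }
    // DIČ: "CZ" followed by 8 to 10 digits.
    if (preg_match('/\bDI[ČC]\s*[:.]?\s*(CZ\d{8,10})\b/iu', $text, $m) === 1) {
        $fields['dic'] = strtoupper($m[1]);
    }
    // DUZP: Czech d.m.yyyy date, normalized to ISO 8601 if it is a real date.
    if (
        preg_match('/\bDUZP\s*[:.]?\s*(\d{1,2})\.\s*(\d{1,2})\.\s*(\d{4})\b/iu', $text, $m) === 1
        && checkdate((int) $m[2], (int) $m[1], (int) $m[3])
    ) {
        $fields['duzp'] = sprintf('%04d-%02d-%02d', (int) $m[3], (int) $m[2], (int) $m[1]);
    }

    return $fields;
}
```

A fallback like this would only fill fields that DI/LLM left empty; values that fail a check (bad checksum, impossible date) stay null rather than being guessed.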