labrodev/document-sampler

Extracts a structured representative sample from long documents for downstream AI processing.

Maintainers

Package info

github.com/labrodev/document-sampler

pkg:composer/labrodev/document-sampler

Statistics

Installs: 0

Dependents: 0

Suggesters: 0

Stars: 0

Open Issues: 0

1.0.0 2026-04-24 09:28 UTC

This package is auto-updated.

Last update: 2026-04-24 09:33:10 UTC


README

Pure PHP library that extracts a structured, representative sample from a document of any length. No framework dependency, no HTTP calls, no AI — just text processing.

Designed as the input layer for downstream AI-powered packages such as relevance checkers, prompt injection detectors, and depersonalisation services.

Requirements

  • PHP ^8.5

Installation

composer require labrodev/document-sampler

Basic usage

use Labrodev\DocumentSampler\DocumentSampler;

$result = (new DocumentSampler())->sample($rawText);

$result->intro             // opening chars — title and introduction
$result->outline           // extracted section headings from anywhere in the document
$result->middle            // fixed window centred on the document midpoint
$result->tail              // closing chars — conclusion and sign-off
$result->text              // all samples joined with separators
$result->charCount         // character count of the combined sample
$result->originalCharCount // character count of the original document

Custom window sizes

By default each zone uses the window defined on the DocumentPart enum. Pass any subset to the constructor to override:

// Override specific zones — unset zones use the enum defaults
$sampler = new DocumentSampler(
    intro:   2000,
    middle:  300,
);

$result = $sampler->sample($rawText);

How it works

The sampler partitions every document into four fixed-size windows regardless of document length:

Zone Default window What it captures
intro 1000 chars Title, abstract, opening paragraphs
outline 500 chars Section headings (# Markdown, 1.1 Numbered, ALL-CAPS lines)
middle 500 chars Window centred on the document midpoint
tail 500 chars Closing paragraphs, conclusion, signature

Windows are fixed — a 400-page PDF gets the same sized sample as a one-page memo. The goal is a compact, representative fingerprint of the document, not a summary.

Exporting results

JSON

$result->toJson();
{
    "meta": {
        "originalCharCount": 50000,
        "sampledCharCount": 2300
    },
    "samples": {
        "intro": "...",
        "outline": "...",
        "middle": "...",
        "tail": "..."
    }
}

Markdown

$result->toMd();
## Document Sample

**Original size:** 50,000 chars
**Sampled size:** 2,300 chars

### Intro
...

### Outline
...

### Middle
...

### Tail
...

Empty zones are omitted from both outputs.

Default window sizes

Window sizes are defined on the DocumentPart enum and can be read at runtime:

use Labrodev\DocumentSampler\Enums\DocumentPart;

DocumentPart::Intro->chars();   // 1000
DocumentPart::Outline->chars(); // 500
DocumentPart::Middle->chars();  // 500
DocumentPart::Tail->chars();    // 500

When to use this

  • Before calling an AI API — reduce a large document to a structured excerpt that fits in a context window without losing structural information.
  • Relevance checking — feed $result->text to a classifier to decide whether a document is relevant before processing it in full.
  • Prompt injection detection — scan a compact sample for malicious instructions before passing untrusted documents to an LLM.
  • Depersonalisation — run PII detection over a representative sample before deciding whether to redact the full document.
  • Document classification — use the outline and intro zones to classify document type without reading the entire file.

Testing

composer test

Static analysis

composer analyse

Author

Petro Lashyncontact@labrodev.com

License

MIT