kreuzberg-dev / kreuzcrawl
High-performance Rust web crawling engine — PHP bindings
Package info
github.com/kreuzberg-dev/kreuzcrawl
Language: Rust
Type: php-ext
Ext name: ext-kreuzcrawl
pkg:composer/kreuzberg-dev/kreuzcrawl
0.3.0-rc.19
2026-04-30 07:09 UTC
Requires
- php: ^8.4
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.95
- phpstan/phpstan: ^2.1
- phpunit/phpunit: ^13.1
Versions
- dev-main
- 0.3.0-rc.19
- 0.3.0-rc.18
- 0.3.0-rc.17
- 0.3.0-rc.16
- 0.3.0-rc.15
- 0.3.0-rc.14
- 0.3.0-rc.13
- 0.3.0-rc.12
- 0.3.0-rc.11
- 0.3.0-rc.10
- 0.3.0-rc.9
- 0.3.0-rc.8
- 0.3.0-rc.7
- 0.3.0-rc.6
- 0.3.0-rc.5
- 0.3.0-rc.4
- 0.3.0-rc.3
- 0.3.0-rc.2
- 0.3.0-rc.1
- 0.2.0
- 0.1.2
- 0.1.1
- 0.1.0
- 0.1.0-rc.10
- 0.1.0-rc.9
- 0.1.0-rc.8
- 0.1.0-rc.7
- 0.1.0-rc.6
- 0.1.0-rc.5
- 0.1.0-rc.4
- 0.1.0-rc.3
- 0.1.0-rc.2
- 0.1.0-rc.1
- dev-fix/elixir-force-build-config
This package is auto-updated.
Last update: 2026-05-02 07:30:47 UTC
README
High-performance Rust web crawling engine for structured data extraction. Scrape, crawl, and map websites with native bindings for 10 languages — same engine, identical results across every runtime.
Key Features
- Structured extraction — Text, metadata, links, images, assets, JSON-LD, Open Graph, hreflang, favicons, headings, and response headers
- Markdown conversion — Clean Markdown output with citations, document structure, and fit-content mode
- Concurrent crawling — Depth-first, breadth-first, or best-first traversal with configurable depth, page limits, and concurrency
- 10 language bindings — Rust, Python, Node.js, Ruby, Go, Java, C#, PHP, Elixir, and WebAssembly
- Smart filtering — BM25 relevance scoring, URL include/exclude patterns, robots.txt compliance, and sitemap discovery
- Browser rendering — Optional headless browser for JavaScript-heavy SPAs with WAF detection and bypass
- Batch operations — Scrape or crawl hundreds of URLs concurrently with partial failure handling
- Streaming — Real-time crawl events via async streams for progress tracking
- Authentication — HTTP Basic, Bearer token, and custom header auth with persistent cookie jars
- Rate limiting — Per-domain request throttling with configurable delays
- Asset download — Download, deduplicate, and filter images, documents, and other linked assets
- MCP server — Model Context Protocol integration for AI agents
- REST API — HTTP server with OpenAPI spec
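The BM25 relevance scoring mentioned above is what lets best-first crawls prioritize pages that match a query. BM25 is a standard ranking function; the sketch below is a generic, simplified version in plain Python (whitespace tokenization, default `k1`/`b` parameters) and is not the engine's actual implementation.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query terms with BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency for each distinct query term
    df = {t: sum(1 for d in tokenized if t in d) for t in set(query.lower().split())}
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term, dfreq in df.items():
            if tf[term] == 0:
                continue
            # Smoothed inverse document frequency
            idf = math.log(1 + (n - dfreq + 0.5) / (dfreq + 0.5))
            # Term-frequency saturation with length normalization
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```

In a best-first crawl, scores like these would rank the frontier so higher-relevance pages are fetched first; pages sharing no terms with the query score zero.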
Installation
| Language | Package | Install |
|---|---|---|
| Python | kreuzcrawl | pip install kreuzcrawl |
| Node.js | @kreuzberg/kreuzcrawl | npm install @kreuzberg/kreuzcrawl |
| Rust | kreuzcrawl | cargo add kreuzcrawl |
| Go | pkg.go.dev | go get github.com/kreuzberg-dev/kreuzcrawl/packages/go |
| Java | Maven Central | See README |
| C# | NuGet | dotnet add package Kreuzcrawl |
| Ruby | kreuzcrawl | gem install kreuzcrawl |
| PHP | kreuzberg-dev/kreuzcrawl | composer require kreuzberg-dev/kreuzcrawl |
| Elixir | kreuzcrawl | {:kreuzcrawl, "~> 0.2"} |
| WASM | @kreuzberg/kreuzcrawl-wasm | npm install @kreuzberg/kreuzcrawl-wasm |
| C FFI | GitHub Releases | C header + shared library |
| CLI | crates.io | cargo install kreuzcrawl-cli |
| CLI (Homebrew) | kreuzberg-dev/tap | brew install kreuzberg-dev/tap/kreuzcrawl |
Quick Start
Python — Full docs
```python
from kreuzcrawl import create_engine, scrape

engine = create_engine()
result = scrape(engine, "https://example.com")
print(result.metadata.title)
print(result.markdown.content)
print(len(result.links))
```
Node.js / TypeScript — Full docs
```typescript
import { createEngine, scrape } from "@kreuzberg/kreuzcrawl";

const engine = createEngine();
const result = await scrape(engine, "https://example.com");
console.log(result.metadata.title);
console.log(result.markdown.content);
console.log(result.links.length);
```
Rust — Full docs
```rust
let engine = kreuzcrawl::create_engine(None)?;
let result = kreuzcrawl::scrape(&engine, "https://example.com").await?;
println!("{}", result.metadata.title);
println!("{}", result.markdown.content);
println!("{}", result.links.len());
```
Go — Full docs
```go
engine, _ := kcrawl.CreateEngine()
result, _ := kcrawl.Scrape(engine, "https://example.com")
fmt.Println(result.Metadata.Title)
fmt.Println(result.Markdown.Content)
fmt.Println(len(result.Links))
```
Java — Full docs
```java
var engine = Kreuzcrawl.createEngine(null);
var result = Kreuzcrawl.scrape(engine, "https://example.com");
System.out.println(result.metadata().title());
System.out.println(result.markdown().content());
System.out.println(result.links().size());
```
C# — Full docs
```csharp
var engine = KreuzcrawlLib.CreateEngine(null);
var result = await KreuzcrawlLib.Scrape(engine, "https://example.com");
Console.WriteLine(result.Metadata.Title);
Console.WriteLine(result.Markdown.Content);
Console.WriteLine(result.Links.Count);
```
Ruby — Full docs
```ruby
engine = Kreuzcrawl.create_engine(nil)
result = Kreuzcrawl.scrape(engine, "https://example.com")
puts result.metadata.title
puts result.markdown.content
puts result.links.length
```
PHP — Full docs
```php
$engine = Kreuzcrawl::createEngine(null);
$result = Kreuzcrawl::scrape($engine, "https://example.com");
echo $result->metadata->title;
echo $result->markdown->content;
echo count($result->links);
```
Elixir — Full docs
```elixir
{:ok, engine} = Kreuzcrawl.create_engine(nil)
{:ok, result} = Kreuzcrawl.scrape(engine, "https://example.com")
IO.puts(result.metadata.title)
IO.puts(result.markdown.content)
IO.puts(length(result.links))
```
Platform Support
| Language | Linux x86_64 | Linux aarch64 | macOS ARM64 | Windows x64 |
|---|---|---|---|---|
| Python | ✅ | ✅ | ✅ | ✅ |
| Node.js | ✅ | ✅ | ✅ | ✅ |
| WASM | ✅ | ✅ | ✅ | ✅ |
| Ruby | ✅ | ✅ | ✅ | — |
| Elixir | ✅ | ✅ | ✅ | ✅ |
| Go | ✅ | ✅ | ✅ | ✅ |
| Java | ✅ | ✅ | ✅ | ✅ |
| C# | ✅ | ✅ | ✅ | ✅ |
| PHP | ✅ | ✅ | ✅ | ✅ |
| Rust | ✅ | ✅ | ✅ | ✅ |
| C (FFI) | ✅ | ✅ | ✅ | ✅ |
| CLI | ✅ | ✅ | ✅ | ✅ |
Architecture
Your Application (Python, Node.js, Ruby, Java, Go, C#, PHP, Elixir, ...)
│
Language Bindings (PyO3, NAPI-RS, Magnus, ext-php-rs, Rustler, cgo, Panama, P/Invoke)
│
Rust Core Engine (async, concurrent, SIMD-optimized)
│
├── HTTP Client (reqwest + tower middleware stack)
├── HTML Parser (html5ever + lol_html)
├── Markdown Converter (html-to-markdown-rs)
├── Content Extraction (metadata, JSON-LD, Open Graph, readability)
├── Link Discovery (robots.txt, sitemaps, anchor analysis)
└── Browser Rendering (optional headless Chrome/Firefox)
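The HTTP client's middleware stack is where cross-cutting concerns like the per-domain rate limiting from the feature list live. As a conceptual illustration only (a minimal stdlib sketch, not the engine's tower-based implementation), a per-domain minimum-delay limiter can look like this:

```python
import time
from urllib.parse import urlparse

class PerDomainRateLimiter:
    """Enforce a minimum delay between consecutive requests to the same domain."""

    def __init__(self, min_delay: float = 1.0):
        self.min_delay = min_delay
        self.last_request: dict[str, float] = {}  # domain -> monotonic timestamp

    def wait(self, url: str) -> None:
        """Block until it is safe to request this URL's domain again."""
        domain = urlparse(url).netloc
        last = self.last_request.get(domain)
        if last is not None:
            remaining = self.min_delay - (time.monotonic() - last)
            if remaining > 0:
                time.sleep(remaining)
        self.last_request[domain] = time.monotonic()
```

Requests to different domains proceed without waiting on each other, which is why per-domain (rather than global) throttling pairs well with concurrent crawling.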
Contributing
Contributions are welcome! See our Contributing Guide.