dprmc / biz-journals
A PHP library to interface with the BizJournals.com website.
v0.4
2026-03-25 23:03 UTC
Requires
- php: ^8.1
- symfony/browser-kit: ^6.4 || ^7.3
- symfony/css-selector: ^6.4 || ^7.3
- symfony/dom-crawler: ^6.4 || ^7.3
- symfony/http-client: ^6.4 || ^7.3
- symfony/mime: ^6.4 || ^7.3
Requires (Dev)
- phpunit/phpunit: ^11.5
README
A PHP library for authenticating with BizJournals, crawling section pages, and returning article data as structured JSON.
Scope
This repository now includes a first-pass scraping framework for:
- establishing an authenticated BizJournals session
- crawling one or more root section index pages
- discovering article URLs from those section pages
- fetching a specific article page
- normalizing article content into JSON-serializable models
The initial root URL targeted by the example CLI is:
https://www.bizjournals.com/news/commercial-real-estate
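As a rough illustration of how a root section URL could be expanded into a bounded list of index pages, here is a minimal sketch based on the `?page=N` pagination format described in the Notes. The function name `buildPageUrls()` is hypothetical, and treating the bare root URL as page 1 is an assumption. The library's real implementation may differ.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library's documented API):
 * expand a root section URL into page URLs using the ?page=N format.
 * Assumption: page 1 is the root URL itself, without a page parameter.
 *
 * @return list<string>
 */
function buildPageUrls(string $rootUrl, int $pageLimit): array
{
    if ($pageLimit < 1) {
        throw new InvalidArgumentException('pageLimit must be at least 1.');
    }

    $urls = [$rootUrl];

    // Use "&" when the root URL already carries a query string.
    $separator = str_contains($rootUrl, '?') ? '&' : '?';

    for ($page = 2; $page <= $pageLimit; $page++) {
        $urls[] = $rootUrl . $separator . 'page=' . $page;
    }

    return $urls;
}

// Usage: list the first three index pages of the example section.
$urls = buildPageUrls('https://www.bizjournals.com/news/commercial-real-estate', 3);
echo implode(PHP_EOL, $urls), PHP_EOL;
```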
Install
composer install
Install development dependencies and run the test suite:
composer test
Usage
Set credentials if the crawl requires an authenticated session:
export BIZJOURNALS_EMAIL="you@example.com"
export BIZJOURNALS_PASSWORD="your-password"
Run the example spider:
php bin/bizjournals-spider
php bin/bizjournals-spider https://www.bizjournals.com/news/commercial-real-estate 3 25
php bin/bizjournals-spider https://www.bizjournals.com/news/commercial-real-estate 3 25 --debug
php bin/bizjournals-article https://www.bizjournals.com/boston/news/2026/03/25/lender-s-95m-offer-is-winning-bid-for-back-bay-of.html
php bin/bizjournals-article https://www.bizjournals.com/boston/news/2026/03/25/lender-s-95m-offer-is-winning-bid-for-back-bay-of.html --debug --debug-dir=/tmp/bizjournals-debug
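The article URL used above follows a dated path of the form `/{market}/news/YYYY/MM/DD/{slug}.html`. The sketch below shows what a URL heuristic for telling article pages apart from section pages might look like. The function name `looksLikeArticleUrl()` and the exact pattern are assumptions inferred from that single example URL. They are not the library's actual rules.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical heuristic (pattern inferred from one example URL only):
 * treat a URL as an article when it is on a bizjournals.com host and its
 * path looks like /{market}/news/YYYY/MM/DD/{slug}.html.
 */
function looksLikeArticleUrl(string $url): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    $path = parse_url($url, PHP_URL_PATH);

    // parse_url() returns null or false when a component is missing or malformed.
    if (!is_string($host) || !is_string($path)) {
        return false;
    }

    // Accept bizjournals.com and any of its subdomains.
    if (preg_match('/(^|\.)bizjournals\.com$/i', $host) !== 1) {
        return false;
    }

    return preg_match('#^/[a-z0-9-]+/news/\d{4}/\d{2}/\d{2}/[^/]+\.html$#i', $path) === 1;
}

// Usage: the article URL matches, the section index does not.
var_dump(looksLikeArticleUrl('https://www.bizjournals.com/boston/news/2026/03/25/lender-s-95m-offer-is-winning-bid-for-back-bay-of.html'));
var_dump(looksLikeArticleUrl('https://www.bizjournals.com/news/commercial-real-estate'));
```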
Architecture
- Dprmc\BizJournals\Http\BizJournalsSession: owns the HTTP client, cookies, and login flow.
- Dprmc\BizJournals\Http\ChromiumLoginAuthenticator: performs the login flow in a real Chromium browser so JavaScript executes and inputs are typed into the page.
- Dprmc\BizJournals\Crawler\BizJournalsSpider: exposes crawlIndex() for section indexes and crawlArticle() for individual story pages.
- Dprmc\BizJournals\Debug\DebugArtifactRecorder: saves response HTML and screenshots for each unique loaded URL when debug mode is enabled.
- Dprmc\BizJournals\Parser\CategoryPageParser: extracts article URLs from a section page.
- Dprmc\BizJournals\Parser\ArticleParser: extracts normalized article data from a story page.
- Dprmc\BizJournals\Model\Story and SpiderResult: JSON-ready output objects.
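To make the article-parsing step more concrete, the following sketch shows one way to prefer JSON-LD metadata and fall back to the DOM when it is missing, which is the strategy the Notes describe for article extraction. It uses PHP's built-in DOM extension rather than the library's own parser classes. The function name `extractHeadline()`, the `headline` key, and the `<h1>` fallback are illustrative assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: read the headline from JSON-LD when present,
 * otherwise fall back to the first <h1> in the document.
 * Only a top-level JSON-LD object is handled; arrays and @graph
 * structures are ignored to keep the example short.
 */
function extractHeadline(string $html): ?string
{
    if (trim($html) === '') {
        return null; // DOMDocument::loadHTML() rejects empty input.
    }

    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them for messy HTML.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // 1. Preferred source: JSON-LD metadata blocks.
    foreach ($xpath->query('//script[@type="application/ld+json"]') as $script) {
        $data = json_decode($script->textContent, true);
        if (is_array($data) && isset($data['headline']) && is_string($data['headline'])) {
            return trim($data['headline']);
        }
    }

    // 2. Fallback: a plain DOM lookup.
    $h1 = $xpath->query('//h1')->item(0);

    return $h1 !== null ? trim($h1->textContent) : null;
}
```

The same two-step pattern could be repeated per field (author, publication date, body), with each field choosing its own fallback selector.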
Testing
- PHPUnit is configured through phpunit.xml.dist.
- The sample login test is in tests/Http/BizJournalsSessionTest.php.
- The login test uses Symfony's mocked HTTP client, so it validates the session flow without making live requests to BizJournals.
Notes
- The login form field names and success detection are configurable through Dprmc\BizJournals\Config\LoginConfig.
- The default login URL now matches the captured BizJournals login page in development/login.html: https://www.bizjournals.com/bizjournals/login?r=%2F.
- Authentication now uses a real Chromium-driven login flow instead of raw form posts, so JavaScript can render the email/password steps and the automation can type into the page before cookies are imported back into the crawler session.
- Story discovery currently uses URL heuristics for BizJournals article paths.
- Index crawling supports a pageLimit value and currently expands pages using BizJournals' ?page=N pagination format.
- Debug mode can be enabled with --debug; it saves .html and .png files for each newly loaded URL and reports the output directory in the index JSON.
- Article extraction prefers JSON-LD metadata when available, then falls back to DOM selectors.
- If BizJournals changes its login flow or introduces JavaScript-only auth or bot mitigation, the session layer is the place to swap in a browser automation implementation.
- On March 25, 2026, the commercial real estate section returned a Cloudflare mitigation response (403 with cf-mitigated: challenge) during verification. The current framework now throws an explicit access-blocked exception in that case instead of returning an empty crawl.
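The Cloudflare check from the last note can be sketched as a small guard that inspects the status code and headers before any parsing happens. The class name `AccessBlockedException` and the function `assertNotBlocked()` are hypothetical, since this README does not name the library's actual exception. The header array shape (lower-cased names mapping to lists of values) is an assumption modeled on what Symfony's HTTP client returns from `getHeaders()`.

```php
<?php
declare(strict_types=1);

/** Hypothetical exception type; the library's real class name is not shown here. */
final class AccessBlockedException extends RuntimeException
{
}

/**
 * Throw when a response looks like a Cloudflare challenge:
 * HTTP 403 together with a "cf-mitigated: challenge" header.
 *
 * @param array<string, list<string>> $headers lower-cased header names mapped to their values
 */
function assertNotBlocked(int $statusCode, array $headers, string $url): void
{
    $mitigated = array_map('strtolower', $headers['cf-mitigated'] ?? []);

    if ($statusCode === 403 && in_array('challenge', $mitigated, true)) {
        throw new AccessBlockedException(
            sprintf('Access blocked by a Cloudflare challenge for %s', $url)
        );
    }
}

// Usage: a challenged response raises an explicit error instead of yielding an empty crawl.
try {
    assertNotBlocked(
        403,
        ['cf-mitigated' => ['challenge']],
        'https://www.bizjournals.com/news/commercial-real-estate'
    );
} catch (AccessBlockedException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```

Failing loudly here keeps a blocked request distinguishable from a section page that simply has no articles.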