topshelfcraft / scraper
Easily fetch, parse, and rejigger HTML or XML from anywhere.
Type: craft-plugin
Requires
- craftcms/cms: ^4.2.1
- fabpot/goutte: ^v4.0.2
- topshelfcraft/base: ^4.0.1
This package is auto-updated.
Last update: 2024-11-08 01:47:28 UTC
README
Easily fetch, slice, dice, and output HTML (or XML) content from anywhere.
A Top Shelf Craft creation
Michael Rog, Proprietor
Installation
- From your project directory, use Composer to require the plugin package:
  composer require topshelfcraft/scraper
- In the Control Panel, go to Settings → Plugins and click the “Install” button for Scraper.
- There is no Step 3.
Scraper is also available for installation via the Craft CMS Plugin Store.
Usage
The Scraper plugin exposes a full-featured crawler object to your Twig template, allowing you to fetch, parse, and filter DOM elements from a remote source document.
Instantiating a client
When invoking the plugin, you can choose whether to use SimpleHtmlDom or Symfony components to instantiate your crawler:
{% set crawler = craft.scraper.using('symfony').get('https://zombo.com') %}
{% set crawler = craft.scraper.using('simplehtmldom').get('https://zombo.com') %}
I generally recommend using the Symfony components; they are more powerful and resilient to malformed source code. (The SimpleHtmlDom crawler is included to provide backwards compatibility with Craft 2 projects.)
Using the Symfony client
When you opt for Symfony components, the get method instantiates a full BrowserKit client, giving you access to all the BrowserKit and DomCrawler methods.
You can iterate over the DOM elements from your source document like this:
{% for node in crawler.filter('h2 > a') %}
    {{ node.text() }}
{% endfor %}
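As a further sketch, here is one way to pull both the link text and the href out of each matched element. It assumes the crawler variable is a Symfony DomCrawler object, as described above, and uses only the standard DomCrawler methods filter, count, eq, attr, and text. The variable names (links, total, link) and the list markup are illustrative, not part of the plugin.

```twig
{# Assumption: `crawler` is a Symfony DomCrawler instance returned by the plugin. #}
{% set crawler = craft.scraper.using('symfony').get('https://zombo.com') %}

{# Narrow the document down to the links inside second-level headings. #}
{% set links = crawler.filter('h2 > a') %}
{% set total = links.count() %}

{# Guard against an empty result before building the index range. #}
{% if total > 0 %}
    <ul>
    {% for i in 0..(total - 1) %}
        {# eq(i) returns a crawler wrapping just the i-th match. #}
        {% set link = links.eq(i) %}
        <li><a href="{{ link.attr('href') }}">{{ link.text() }}</a></li>
    {% endfor %}
    </ul>
{% else %}
    <p>No matching links found.</p>
{% endif %}
```

Checking the count first matters because text() and attr() throw an exception when called on an empty crawler.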
Using the SimpleHtmlDom client
When you opt for the SimpleHtmlDom crawler, the get method instantiates a SimpleHtmlDom client, giving you access to all the SimpleHtmlDom methods.
You can iterate over the DOM elements from your source document like this:
{% for node in crawler.find('h1') %}
    {{ node.innertext() }}
{% endfor %}
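A similar sketch for the SimpleHtmlDom client lists every link with its target. It assumes the bundled library matches the upstream Simple HTML DOM API, where find returns an array of nodes and each node offers getAttribute and innertext. The variable name link and the list markup are illustrative.

```twig
{# Assumption: the bundled Simple HTML DOM exposes find(), getAttribute(), and innertext() as upstream does. #}
{% set crawler = craft.scraper.using('simplehtmldom').get('https://zombo.com') %}

<ul>
{% for link in crawler.find('a') %}
    {# innertext() returns inner HTML, so strip any nested tags before output. #}
    <li>{{ link.getAttribute('href') }}: {{ link.innertext()|striptags }}</li>
{% else %}
    <li>No links found.</li>
{% endfor %}
</ul>
```

The else branch of the for tag runs when find returns no nodes, so the template still renders a sensible list for pages without links.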
This is great! I still have questions.
Ask a question on StackExchange, and ping me with a URL via email or Discord.
What are the system requirements?
Craft 4.2.1+
I found a bug.
Please open a GitHub Issue, submit a PR to the 4.x.dev branch, or just email me.
Contributors:
- Plugin development: Michael Rog / @michaelrog
- Includes the "Simple HTML DOM" library, created by S. C. Chen
- Includes the Symfony DomCrawler via Goutte, created by Fabien Potencier / @fabpot
- Icon: "Upright vacuum cleaner" by Creaticca Creative Agency, via The Noun Project