renanbr / bibtex-parser
BibTex Parser provides an API to read .bib files programmatically
Requires
- php: >=5.6.0
Requires (Dev)
- phpunit/phpunit: >=5.7
- ryakad/pandoc-php: ^1.0
Suggests
- ryakad/pandoc-php: Needed to support LaTeX decoder in class RenanBr\BibTexParser\Processor\LatexToUnicodeProcessor
- ueberdosis/pandoc: Alternate Pandoc PHP package which (if available) will be preferred over ryakad/pandoc-php
This is a BibTeX parser written in PHP.
You are browsing the documentation of BibTeX Parser 2.x, the latest version.
Installing
composer require renanbr/bibtex-parser
Usage
use RenanBr\BibTexParser\Listener;
use RenanBr\BibTexParser\Parser;
use RenanBr\BibTexParser\Processor;

require 'vendor/autoload.php';

$bibtex = <<<BIBTEX
@article{einstein1916relativity,
  title={Relativity: The Special and General Theory},
  author={Einstein, Albert},
  year={1916}
}
BIBTEX;

// Create and configure a Listener
$listener = new Listener();
$listener->addProcessor(new Processor\TagNameCaseProcessor(CASE_LOWER));
// $listener->addProcessor(new Processor\NamesProcessor());
// $listener->addProcessor(new Processor\KeywordsProcessor());
// $listener->addProcessor(new Processor\DateProcessor());
// $listener->addProcessor(new Processor\FillMissingProcessor([/* ... */]));
// $listener->addProcessor(new Processor\TrimProcessor());
// $listener->addProcessor(new Processor\UrlFromDoiProcessor());
// $listener->addProcessor(new Processor\LatexToUnicodeProcessor());
// ... you can append as many Processors as you want

// Create a Parser and attach the listener
$parser = new Parser();
$parser->addListener($listener);

// Parse the content, then read processed data from the Listener
$parser->parseString($bibtex); // or parseFile('/path/to/file.bib')
$entries = $listener->export();

print_r($entries);
This will output:
Array
(
[0] => Array
(
[_type] => article
[citation-key] => einstein1916relativity
[title] => Relativity: The Special and General Theory
[author] => Einstein, Albert
[year] => 1916
)
)
Vocabulary
BibTeX is all about "entry", "tag's name" and "tag's content".
A BibTeX entry consists of the type (the word after @), a citation-key and a number of tags which define various characteristics of the specific BibTeX entry. (...) A BibTeX tag is specified by its name followed by an equals sign, and the content.
Source: http://www.bibtex.org/Format/
Note: This library considers "type" and "citation-key" as tags. This behavior can be changed by implementing your own Listener.
Processors
Processor is a callable that receives an entry as argument and returns a modified entry.
This library contains three main parts:
- Parser class, responsible for detecting units inside a BibTeX input;
- Listener class, responsible for gathering units and transforming them into a list of entries;
- Processor classes, responsible for manipulating entries.
Although you can't configure the Parser, you can append as many Processors as you want to the Listener through Listener::addProcessor() before exporting the contents.
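Conceptually, appending processors builds a pipeline: each entry passes through every processor in the order it was added. The sketch below imitates that flow in plain PHP, without the library. `applyProcessors` and the two sample callables are hypothetical names used only for illustration, and nothing here claims to mirror how Listener is implemented internally.

```php
<?php

/**
 * Hypothetical helper: runs an entry through a list of callables, in order.
 * It only illustrates the idea of chained processors.
 */
function applyProcessors(array $entry, array $processors)
{
    foreach ($processors as $processor) {
        $entry = $processor($entry);
    }

    return $entry;
}

$processors = [
    // First processor: trim every string value
    static function (array $entry) {
        return array_map(static function ($value) {
            return is_string($value) ? trim($value) : $value;
        }, $entry);
    },
    // Second processor: add a default year when the tag is missing
    static function (array $entry) {
        return $entry + ['year' => 'n.d.'];
    },
];

$result = applyProcessors(['title' => '  BibTeX rocks  '], $processors);

print_r($result);
```

Since a processor is just a callable, it can be tried on a plain array like this before being attached to a real Listener.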
Be aware that Listener provides, by default, these features:
- Found entries are reachable through the Listener::export() method;
- Tag content concatenation;
  - e.g. hello # " world" tag's content will generate hello world string
- Tag content abbreviation handling;
  - e.g. @string{foo="bar"} @misc{bar=foo} will make $entries[1]['bar'] assume bar as value
- Publication's type exposed as _type tag;
- Citation key exposed as citation-key tag;
- Original entry text exposed as _original tag.
This project ships some useful processors.
Tag name case
In BibTeX, tag names aren't case-sensitive.
This library exposes entries as arrays, whose keys are case-sensitive.
To avoid mismatches, you can force the character case of tag names using TagNameCaseProcessor.
Usage
use RenanBr\BibTexParser\Processor\TagNameCaseProcessor;

$listener->addProcessor(new TagNameCaseProcessor(CASE_UPPER)); // or CASE_LOWER

@article{
  title={BibTeX rocks}
}
Array
(
[0] => Array
(
[TYPE] => article
[TITLE] => BibTeX rocks
)
)
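Because PHP array keys are case-sensitive, the same tag written as `Title` or `title` ends up under different keys. The snippet below shows the effect on a hand-written entry and normalizes it with PHP's built-in `array_change_key_case()`, the function that defines the `CASE_LOWER` and `CASE_UPPER` constants. Whether TagNameCaseProcessor relies on that function internally is an assumption.

```php
<?php

// Hand-written entry, standing in for what a listener might export
$entry = ['Title' => 'BibTeX rocks', 'YEAR' => '2000'];

// Case-sensitive keys: looking up "title" fails before normalization
$foundBefore = isset($entry['title']);

// After normalization every key is lower case
$normalized = array_change_key_case($entry, CASE_LOWER);
$foundAfter = isset($normalized['title']);

var_dump($foundBefore, $foundAfter);
```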
Authors and editors
BibTeX recognizes four parts of an author's name: First Von Last Jr.
If you would like to parse the author and editor tags included in your entries, you can use the NamesProcessor class.
Usage
use RenanBr\BibTexParser\Processor\NamesProcessor;

$listener->addProcessor(new NamesProcessor());

@article{
  title={Relativity: The Special and General Theory},
  author={Einstein, Albert}
}
Array
(
[0] => Array
(
[type] => article
[title] => Relativity: The Special and General Theory
[author] => Array
(
[0] => Array
(
[first] => Albert
[von] =>
[last] => Einstein
[jr] =>
)
)
)
)
Keywords
The keywords tag contains a list of expressions represented as a single string. You might want to read them as an array instead.
Usage
use RenanBr\BibTexParser\Processor\KeywordsProcessor;

$listener->addProcessor(new KeywordsProcessor());

@misc{
  title={The End of Theory: The Data Deluge Makes the Scientific Method Obsolete},
  keywords={big data, data deluge, scientific method}
}
Array
(
[0] => Array
(
[type] => misc
[title] => The End of Theory: The Data Deluge Makes the Scientific Method Obsolete
[keywords] => Array
(
[0] => big data
[1] => data deluge
[2] => scientific method
)
)
)
Date
It adds a new tag _date as a DateTimeImmutable.
This processor adds the new tag if and only if both the month and year tags are present.
Usage
use RenanBr\BibTexParser\Processor\DateProcessor;

$listener->addProcessor(new DateProcessor());

@misc{
  month="1~oct",
  year=2000
}
Array
(
[0] => Array
(
[type] => misc
[month] => 1~oct
[year] => 2000
[_date] => DateTimeImmutable Object
(
[date] => 2000-10-01 00:00:00.000000
[timezone_type] => 3
[timezone] => UTC
)
)
)
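Once entries carry a `_date` tag, they can be compared as dates rather than as strings. The sketch below sorts hand-built entries chronologically in plain PHP. The sample data is made up, and placing entries without `_date` last is a choice made for this example, not library behavior.

```php
<?php

$utc = new DateTimeZone('UTC');

// Hand-built entries standing in for exported ones; only some have _date
$entries = [
    ['title' => 'B', '_date' => new DateTimeImmutable('2000-10-01', $utc)],
    ['title' => 'No date'],
    ['title' => 'A', '_date' => new DateTimeImmutable('1916-01-01', $utc)],
];

usort($entries, static function (array $a, array $b) {
    $hasA = isset($a['_date']);
    $hasB = isset($b['_date']);
    if (!$hasA || !$hasB) {
        // Entries without a date sort after dated ones
        return (int) $hasB - (int) $hasA;
    }
    if ($a['_date'] == $b['_date']) {
        return 0;
    }

    return $a['_date'] < $b['_date'] ? -1 : 1;
});

$titles = array_column($entries, 'title');

print_r($titles);
```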
Fill missing tag
It sets a default value for each configured tag that is missing from an entry.
Usage
use RenanBr\BibTexParser\Processor\FillMissingProcessor;

$listener->addProcessor(new FillMissingProcessor([
    'title' => 'This entry has no title',
    'year' => 1970,
]));

@misc{
}
@misc{
  title="I do exist"
}
Array
(
[0] => Array
(
[type] => misc
[title] => This entry has no title
[year] => 1970
)
[1] => Array
(
[type] => misc
[title] => I do exist
[year] => 1970
)
)
Trim tags
Apply trim() to all tags.
Usage
use RenanBr\BibTexParser\Processor\TrimProcessor;

$listener->addProcessor(new TrimProcessor());

@misc{
  title=" too much space "
}
Array
(
[0] => Array
(
[type] => misc
[title] => too much space
)
)
Determine URL from the DOI
Sets the url tag to the DOI's resolver address (https://doi.org/ followed by the DOI) if the doi tag is present and the url tag is missing.
Usage
use RenanBr\BibTexParser\Processor\UrlFromDoiProcessor;

$listener->addProcessor(new UrlFromDoiProcessor());

@misc{
  doi="qwerty"
}
@misc{
  doi="azerty",
  url="http://example.org"
}
Array
(
[0] => Array
(
[type] => misc
[doi] => qwerty
[url] => https://doi.org/qwerty
)
[1] => Array
(
[type] => misc
[doi] => azerty
[url] => http://example.org
)
)
LaTeX to unicode
BibTeX files store LaTeX contents.
You might want to read them as unicode instead.
The LatexToUnicodeProcessor class solves this problem, but before adding the processor to the listener you must:
- install Pandoc in your system; and
- add ryakad/pandoc-php or ueberdosis/pandoc as a dependency of your project.
Usage
use RenanBr\BibTexParser\Processor\LatexToUnicodeProcessor;

$listener->addProcessor(new LatexToUnicodeProcessor());

@article{
  title={Caf\'{e}s and bars}
}
Array
(
[0] => Array
(
[type] => article
[title] => Cafés and bars
)
)
Note: Order matters; add this processor last.
Custom
The Listener::addProcessor() method expects a callable as argument.
In the example shown below, we append the text with laser to the title tags for all entries.
Usage
$listener->addProcessor(static function (array $entry) {
    $entry['title'] .= ' with laser';

    return $entry;
});

@article{
title={BibTeX rocks}
}
Array
(
[0] => Array
(
[type] => article
[title] => BibTeX rocks with laser
)
)
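The callable above assumes every entry has a title. For an entry without one, the `.=` would act on an undefined key and PHP would raise a notice or warning. A slightly more defensive variant, sketched here in plain PHP and tried on hand-written entries, only touches entries that have the tag. `$appendLaser` is a hypothetical name.

```php
<?php

$appendLaser = static function (array $entry) {
    // Leave entries without a title untouched
    if (isset($entry['title'])) {
        $entry['title'] .= ' with laser';
    }

    return $entry;
};

// The callable can be tried on plain arrays before attaching it with
// $listener->addProcessor($appendLaser)
$withTitle = $appendLaser(['title' => 'BibTeX rocks']);
$withoutTitle = $appendLaser(['year' => '2000']);

print_r($withTitle);
print_r($withoutTitle);
```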
Handling errors
This library throws two types of exception: ParserException and ProcessorException.
The first one may happen during the data extraction.
When it occurs, it probably means the parsed BibTeX isn't valid.
The second exception may happen during the data processing.
When it occurs, it means the listener's processors can't properly handle the data found.
Both implement ExceptionInterface.
use RenanBr\BibTexParser\Exception\ExceptionInterface;
use RenanBr\BibTexParser\Exception\ParserException;
use RenanBr\BibTexParser\Exception\ProcessorException;

try {
    // ... parser and listener configuration
    $parser->parseFile('/path/to/file.bib');
    $entries = $listener->export();
} catch (ParserException $exception) {
    // The BibTeX isn't valid
} catch (ProcessorException $exception) {
    // Listener's processors aren't able to handle data found
} catch (ExceptionInterface $exception) {
    // Alternatively, you can use this exception to catch all of them at once
}
Advanced usage
The core of this library contains these main classes:
- RenanBr\BibTexParser\Parser responsible for detecting units inside a BibTeX input;
- RenanBr\BibTexParser\ListenerInterface responsible for treating units found.
You can attach listeners to the parser through Parser::addListener().
The parser is able to detect BibTeX units, such as "type", "tag's name", "tag's content".
As the parser finds a unit, it triggers the listeners attached to it.
You can code your own listener! All you have to do is handle units.
namespace RenanBr\BibTexParser;

interface ListenerInterface
{
    /**
     * Called when a unit is found.
     *
     * @param string $text    The original content of the unit found.
     *                        Escape character will not be sent.
     * @param string $type    The type of unit found.
     *                        It can assume one of Parser's constant value.
     * @param array  $context Contains details of the unit found.
     */
    public function bibTexUnitFound($text, $type, array $context);
}
$type may assume one of these values:
- Parser::TYPE
- Parser::CITATION_KEY
- Parser::TAG_NAME
- Parser::RAW_TAG_CONTENT
- Parser::BRACED_TAG_CONTENT
- Parser::QUOTED_TAG_CONTENT
- Parser::ENTRY
$context is an array with these keys:
- offset contains the $text's beginning position. It may be useful, for example, to seek on a file pointer;
- length contains the original $text's length. It may differ from the length of the string sent to the listener because there may be escaped characters.
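As a rough illustration of the offset and length keys, the sketch below reads a unit's original text back from a stream. The `$context` array is written by hand with values computed for this sample input, not produced by the parser, and treating the offset as a byte position is an assumption.

```php
<?php

$source = '@misc{key, title={BibTeX rocks}}';

// Hand-made context for the content "BibTeX rocks"; a real one would be
// passed to bibTexUnitFound() by the parser
$context = [
    'offset' => strpos($source, 'BibTeX rocks'),
    'length' => strlen('BibTeX rocks'),
];

// Put the sample in an in-memory stream so it can be treated like a file
$handle = fopen('php://memory', 'r+');
fwrite($handle, $source);

// Seek to the unit and read exactly its original length
fseek($handle, $context['offset']);
$original = fread($handle, $context['length']);
fclose($handle);

echo $original, PHP_EOL;
```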