scriptotek / simplemarcparser
A simple MARC21/XML parser
Requires
- php: >=5.6
- danmichaelo/quitesimplexmlelement: ~1.0
- illuminate/support: ~4.1|~5.0
- mrjgreen/php-cli: 1.*
- nesbot/carbon: 1.*
Requires (Dev)
- phpunit/phpunit: ~5.7|~6.0
- satooshi/php-coveralls: ~1.0
README
SimpleMarcParser is currently a minimal MARC21/XML parser for use with QuiteSimpleXMLElement, with support for the MARC21 Bibliographic, Authority and Holdings formats.
Note: This project is not actively developed anymore, but I will still process issues. The aim of this project was to produce “simple” JSON representations of MARC21 records. I'm now working on php-marc, a wrapper for File_MARC.
Example:
    require_once('vendor/autoload.php');

    use Danmichaelo\QuiteSimpleXMLElement\QuiteSimpleXMLElement,
        Scriptotek\SimpleMarcParser\Parser;

    $data = file_get_contents('http://sru.bibsys.no/search/biblio?' . http_build_query(array(
        'version' => '1.2',
        'operation' => 'searchRetrieve',
        'recordSchema' => 'marcxchange',
        'query' => 'bs.isbn="0-521-43291-x"'
    )));

    $doc = new QuiteSimpleXMLElement($data);
    $doc->registerXPathNamespaces(array(
        'srw' => 'http://www.loc.gov/zing/srw/',
        'marc' => 'http://www.loc.gov/MARC21/slim',
        'd' => 'http://www.loc.gov/zing/srw/diagnostic/'
    ));

    $parser = new Parser();
    $record = $parser->parse($doc->first('/srw:searchRetrieveResponse/srw:records/srw:record/srw:recordData/marc:record'));

    print $record->title;

    foreach ($record->subjects as $subject) {
        print $subject['term'] . '(' . $subject['system'] . ')';
    }
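To make the request in the example easier to inspect, the URL can be built separately before fetching it. The following is only a sketch: `buildSruUrl` is a hypothetical helper, not part of SimpleMarcParser, and it uses only PHP's built-in `http_build_query` with the same endpoint and parameters as the example.

```php
<?php
// Hypothetical helper (not part of SimpleMarcParser): builds the SRU
// searchRetrieve URL used in the example from an endpoint and a CQL query.
function buildSruUrl($endpoint, $cql)
{
    $params = array(
        'version' => '1.2',
        'operation' => 'searchRetrieve',
        'recordSchema' => 'marcxchange',
        'query' => $cql,
    );

    // Pass '&' explicitly so the result does not depend on the
    // arg_separator.output ini setting.
    return $endpoint . '?' . http_build_query($params, '', '&');
}

$url = buildSruUrl('http://sru.bibsys.no/search/biblio', 'bs.isbn="0-521-43291-x"');
print $url . "\n";
```

The `=` and `"` characters in the CQL query are percent-encoded (`%3D`, `%22`), so the query can contain them without breaking the URL.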
Transformation/normalization
This parser aims to produce machine-actionable output, and applies some non-reversible transformations to achieve this. The transformation rules expect AACR2-like records and are tested mainly against the Norwegian version of AACR2 (Norske katalogregler), but they might also work well with other editions.
Examples:
- title is a combination of 245 $a and $b, separated by " : ".
- year is an integer taken from 260 $c by extracting the first four-digit integer found (c2013 → 2013, 2009 [i.e. 2008] → 2009). This might be a bit rough.
- pages is an integer extracted from 300 $a. The raw value, useful for e.g. non-verbal content, is stored in extent.
- creators[].name values are transformed from inverted order ('Surname, Forename') to direct order ('Forename Surname').
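The normalizations listed above could be sketched roughly as follows. These helpers are hypothetical and are not taken from SimpleMarcParser's source: the function names, the exact regular expressions, and the reading of the name transformation as 'Surname, Forename' to 'Forename Surname' are all assumptions made for illustration.

```php
<?php
// Hypothetical helpers illustrating the rules above.
// None of these functions exist in SimpleMarcParser.

// Join subfields $a and $b with ' : ', skipping an empty $b.
function joinTitle($a, $b = null)
{
    $a = trim($a);
    $b = ($b === null) ? '' : trim($b);

    return ($b === '') ? $a : $a . ' : ' . $b;
}

// Return the first run of exactly four digits as an integer, or null.
// (Assumption: longer digit runs are not treated as years.)
function extractYear($value)
{
    if (preg_match('/(?<!\d)\d{4}(?!\d)/', $value, $m)) {
        return (int) $m[0];
    }

    return null;
}

// Return the first Arabic-numeral integer in an extent statement, or null.
// (Assumption: the real rule may be stricter than "first integer".)
function extractPages($extent)
{
    if (preg_match('/\d+/', $extent, $m)) {
        return (int) $m[0];
    }

    return null;
}

// Turn 'Surname, Forename' into 'Forename Surname'.
// Names without a comma are returned trimmed but otherwise unchanged.
function directOrderName($name)
{
    $parts = explode(',', $name, 2);
    if (count($parts) !== 2) {
        return trim($name);
    }

    return trim(trim($parts[1]) . ' ' . trim($parts[0]));
}
```

Because `extractYear` and `extractPages` discard everything except the first match, the original strings cannot be recovered from their results. This is the sense in which such transformations are non-reversible, and why keeping the raw value alongside the integer is useful.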
Form and material
Form and material are encoded in the leader and in control fields 006, 007 and 008. Encoding this information in a format that makes sense is a work in progress.
Electronic and printed material are currently distinguished using the boolean-valued electronic key.
Printed book:
{ "material": "book", "electronic": false }
Electronic book:
{ "material": "book", "electronic": true }
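A consumer of this output might branch on the two keys like this. The sketch works only on the JSON shown above, decoded with PHP's built-in `json_decode`; `describeForm` is a hypothetical helper and not part of SimpleMarcParser.

```php
<?php
// The two JSON objects shown above, decoded into associative arrays.
$records = array(
    json_decode('{ "material": "book", "electronic": false }', true),
    json_decode('{ "material": "book", "electronic": true }', true),
);

// Hypothetical helper: describe a record using only the
// 'material' and 'electronic' keys.
function describeForm(array $record)
{
    $prefix = !empty($record['electronic']) ? 'electronic' : 'printed';

    return $prefix . ' ' . $record['material'];
}

foreach ($records as $record) {
    print describeForm($record) . "\n";
}
// prints "printed book" and then "electronic book"
```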