mdsills / cccedict
Parser for CC-CEDICT Chinese-English dictionary
pkg:composer/mdsills/cccedict
Requires
- php: >=7.0
 
This package is auto-updated.
Last update: 2025-10-05 14:25:11 UTC
README
PHP Version
This parser requires PHP 7.0 or later. It will not work on PHP 5.
Demo
Download the current CC-CEDICT file from https://www.mdbg.net/chinese/dictionary?page=cc-cedict into the demo folder.
cd demo
composer install
wget -O cedict.gz https://www.mdbg.net/chinese/export/cedict/cedict_1_0_ts_utf-8_mdbg.txt.gz
php -f index.php
About
Reads from a CC-CEDICT Chinese dictionary file, and outputs structured data.
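To make "structured data" concrete, here is a minimal sketch of how a single dictionary line breaks down. It relies only on the published CC-CEDICT line layout (traditional form, simplified form, bracketed pinyin, slash-delimited glosses). The function `parseCedictLine()` and its array keys are hypothetical and are not this package's API or implementation.

```php
<?php
// Hypothetical helper, NOT part of mdsills/cccedict. Splits one line of the
// form "Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/" into fields.
function parseCedictLine(string $line)
{
    $line = trim($line);
    // Comment lines in the CC-CEDICT file start with "#"; skip them and blanks.
    if ($line === '' || $line[0] === '#') {
        return null;
    }
    // Capture groups: 1 = traditional, 2 = simplified, 3 = pinyin, 4 = glosses.
    if (!preg_match('/^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+\/(.*)\/$/u', $line, $m)) {
        return null; // line does not follow the expected layout
    }
    return [
        'traditional'      => $m[1],
        'simplified'       => $m[2],
        'pinyin'           => $m[3],
        'english'          => explode('/', $m[4]),
        // Split into single UTF-8 characters.
        'traditionalChars' => preg_split('//u', $m[1], -1, PREG_SPLIT_NO_EMPTY),
        'simplifiedChars'  => preg_split('//u', $m[2], -1, PREG_SPLIT_NO_EMPTY),
    ];
}

$entry = parseCedictLine('中國 中国 [Zhong1 guo2] /China/Middle Kingdom/');
```

Returning `null` for comments and malformed lines mirrors the idea of "skipped lines" described under Returned data, although the package may decide what to skip differently.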
Options
Required settings
- setFilePath() sets path of file to extract and process
 
Optional settings
- setBlockSize(int) sets block size to read and parse at a time
 - setStartLine(int) in case you don't want to start from the beginning
 - setNumberOfBlocks(float) in case you don't want to read all the way to the end. Pass INF to place no limit on the number of blocks.
 - setOptions(array) define which data you want returned (see below)
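The block size, start line, and number of blocks interact roughly as in the sketch below. This is an illustration only: `readBlocks()` is a hypothetical generator, not the package's code, and treating the start line as 0-based is an assumption. It does show why the block count is typed as a float, since `INF` is a float constant in PHP.

```php
<?php
// Hypothetical generator, NOT the package's implementation. Yields arrays of
// up to $blockSize lines. $startLine is treated as 0-based (an assumption).
function readBlocks(string $path, int $blockSize, int $startLine = 0, float $numberOfBlocks = INF)
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        // Skip everything before the requested start line.
        for ($i = 0; $i < $startLine && fgets($handle) !== false; $i++) {
            // nothing to do, the line is discarded
        }
        $block = [];
        $yielded = 0;
        while ($yielded < $numberOfBlocks && ($line = fgets($handle)) !== false) {
            $block[] = rtrim($line, "\r\n");
            if (count($block) === $blockSize) {
                yield $block;
                $block = [];
                $yielded++;
            }
        }
        if ($block !== []) {
            yield $block; // final, shorter block at end of file
        }
    } finally {
        fclose($handle);
    }
}

// Demo on a throwaway seven-line file.
$path = tempnam(sys_get_temp_dir(), 'cedict');
file_put_contents($path, implode("\n", ['l0', 'l1', 'l2', 'l3', 'l4', 'l5', 'l6']) . "\n");

$limited = iterator_to_array(readBlocks($path, 3, 1, 2), false); // start at l1, two blocks
$all     = iterator_to_array(readBlocks($path, 3), false);       // INF: read to the end
unlink($path);
```

Reading in blocks keeps memory use bounded, which matters because the full dictionary file contains well over a hundred thousand lines.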
 
Returned data
The parser will return an array with:
- an array of Entry objects filled with data as per your configuration (see below)
 - an array of any skipped lines
 - the number of parsed lines
 - the number of skipped lines
 
Basic Entry object
By default, the parser will fill the Entry object with:
- an array of English translations from the dictionary entry
 - an array of traditional characters from the dictionary entry
 - an array of simplified characters from the dictionary entry
 
Customising the Entry object
With setOptions(array) (see above), you can change the data included in the Entry object. If any options are set, the Entry will not include any data that is not specified with setOptions()!
- Entry::F_ORIGINAL includes the original unparsed line from CC-CEDICT
- Entry::F_TRADITIONAL includes a string with the dictionary entry in traditional characters
- Entry::F_SIMPLIFIED same as above but in simplified characters
- Entry::F_PINYIN includes a string of pinyin as formatted in CC-CEDICT (numeric but with idiosyncrasies)
- Entry::F_PINYIN_NUMERIC includes a string of pinyin converted to numeric Hanyu Pinyin
- Entry::F_PINYIN_DIACRITIC includes a string of pinyin converted to Hanyu Pinyin with diacritics
- Entry::F_ENGLISH includes a string with all the English translations for the dictionary entry
- Entry::F_ENGLISH_EXPANDED includes an array with the above English translations
- Entry::F_TRADITIONAL_CHARS includes an array of all traditional characters in the dictionary entry
- Entry::F_SIMPLIFIED_CHARS same as above but with simplified characters
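As an illustration of what the conversion from CC-CEDICT pinyin to diacritic pinyin involves, the sketch below applies the usual Hanyu Pinyin placement rule: a or e takes the tone mark, in "ou" the o takes it, and otherwise the last vowel does. It also handles two CC-CEDICT conventions, "u:" for ü and tone 5 for the neutral tone. `syllableToDiacritic()` is a hypothetical helper, not the package's implementation, and it only handles lowercase vowels.

```php
<?php
// Hypothetical helper, NOT the package's implementation. Converts one
// CC-CEDICT syllable such as "guo2" or "nu:3" to diacritic pinyin.
// Simplification: only lowercase vowels are handled.
function syllableToDiacritic(string $syllable): string
{
    $marks = [
        'a' => ['ā', 'á', 'ǎ', 'à'],
        'e' => ['ē', 'é', 'ě', 'è'],
        'i' => ['ī', 'í', 'ǐ', 'ì'],
        'o' => ['ō', 'ó', 'ǒ', 'ò'],
        'u' => ['ū', 'ú', 'ǔ', 'ù'],
        'ü' => ['ǖ', 'ǘ', 'ǚ', 'ǜ'],
    ];
    // CC-CEDICT writes ü as "u:".
    $syllable = str_replace('u:', 'ü', $syllable);
    if (!preg_match('/^(.*?)([1-5])$/u', $syllable, $m)) {
        return $syllable; // no tone digit, leave untouched
    }
    $body = $m[1];
    $tone = (int) $m[2];
    if ($tone === 5) {
        return $body; // neutral tone carries no mark
    }
    // Placement rule, tried in order: a or e; the o in "ou"; the last vowel.
    $patterns = ['/[ae]/u', '/o(?=u)/u', '/[iouü](?=[^aeiouü]*$)/u'];
    foreach ($patterns as $pattern) {
        $result = preg_replace_callback($pattern, function (array $hit) use ($marks, $tone) {
            return $marks[$hit[0]][$tone - 1];
        }, $body, 1, $count);
        if ($count > 0) {
            return $result;
        }
    }
    return $body; // no vowel to mark, e.g. "r" or "m"
}

$converted = implode(' ', array_map('syllableToDiacritic', explode(' ', 'Zhong1 guo2')));
```

A complete converter would also need to cover capitalised vowels and the other irregularities in the source data, which this sketch leaves out.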
Limitations, bugs, roadmap
Opportunities for improvement
- The parser could offer other output formats (e.g. JSON) instead of only arrays
 - Any further Chinese in the English translation (references, alternative spellings, or full forms of abbreviations) could be structured and nested
 - getFull() still needs to be described (and made accessible, or removed)