README

Helix is a PHP library for counting short DNA sequences for use in bioinformatics. It consists of tools for data extraction and an ultra-low memory hash table called DNA Hash that is specialized for counting DNA sequences. DNA Hash stores sequence counts keyed by their up2bit encoding, a two-way hash that exploits the fact that each DNA base needs only 2 bits to be fully encoded. Accordingly, DNA Hash uses less memory than a lookup table that stores raw gene sequences. In addition, DNA Hash's novel layered Bloom filter eliminates the need to explicitly store counts for sequences that have been seen only once.

Ultra-low memory footprint
Compatible with FASTA and FASTQ formats
Supports canonical sequence counting
Open-source and free to use commercially

Note: The maximum sequence length is platform dependent. On a 64-bit machine, the maximum length is 31 bases. On a 32-bit machine, it is 15 bases.
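
To illustrate where limits like these can come from, here is a minimal sketch of a 2-bit encoding that packs a sequence into one PHP integer. It is not Helix's own code: the function names, the base-to-bits mapping, and the leading sentinel bit are assumptions made for this example. With one bit reserved for the sentinel and one for the sign, a 64-bit integer has room for 31 bases and a 32-bit integer for 15.

```php
<?php

declare(strict_types=1);

// Hypothetical base-to-bits mapping, chosen for illustration only.
const BASE_BITS = ['A' => 0, 'C' => 1, 'T' => 2, 'G' => 3];

/**
 * Pack a DNA sequence into a single integer using 2 bits per base.
 */
function encodeSequence(string $sequence): int
{
    // One bit is taken by the sentinel and one by the sign, the rest hold bases.
    $maxLength = intdiv(PHP_INT_SIZE * 8 - 2, 2);

    if (strlen($sequence) > $maxLength) {
        throw new InvalidArgumentException("Sequence is longer than $maxLength bases.");
    }

    // Start with a sentinel bit so that leading 'A' bases (00) are not lost.
    $hash = 1;

    for ($i = 0; $i < strlen($sequence); ++$i) {
        $base = $sequence[$i];

        if (!array_key_exists($base, BASE_BITS)) {
            throw new InvalidArgumentException("Invalid base: $base.");
        }

        // Shift the previous bases left and append the 2 bits for this base.
        $hash = ($hash << 2) | BASE_BITS[$base];
    }

    return $hash;
}

/**
 * Unpack an integer produced by encodeSequence() back into its sequence.
 */
function decodeSequence(int $hash): string
{
    $bases = array_flip(BASE_BITS);

    $sequence = '';

    // Read 2 bits at a time until only the sentinel bit remains.
    while ($hash > 1) {
        $sequence = $bases[$hash & 3] . $sequence;

        $hash >>= 2;
    }

    return $sequence;
}
```

Because the encoding is reversible, `decodeSequence(encodeSequence('AAAC'))` returns `'AAAC'`, so a table keyed by the integer never needs to keep the original string.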

Note: Due to the probabilistic nature of the Bloom filter, DNA Hash may overcount sequences at a bounded rate.
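
The sketch below shows, in simplified form, how a Bloom filter placed in front of a count table can lead to overcounting. It is a toy single-layer version written for this explanation, not DNA Hash's layered implementation: the class names, the filter size, and the use of `crc32` as the hash function are all assumptions. The first sighting of a sequence is recorded only in the filter, and an explicit count is created on the second sighting. If the filter reports a false positive, a sequence seen once is treated as seen twice.

```php
<?php

declare(strict_types=1);

/**
 * A toy Bloom filter backed by a sparse PHP array (illustrative only).
 */
final class ToyBloomFilter
{
    /** @var array<int,bool> */
    private array $bits = [];

    private int $size;

    private int $numHashes;

    public function __construct(int $size = 1024, int $numHashes = 3)
    {
        $this->size = $size;
        $this->numHashes = $numHashes;
    }

    /** Return true if every bit for the item is already set. */
    public function exists(string $item): bool
    {
        foreach ($this->offsets($item) as $offset) {
            if (!isset($this->bits[$offset])) {
                return false;
            }
        }

        return true;
    }

    /** Insert the item and report whether it appeared to be present already. */
    public function existsOrInsert(string $item): bool
    {
        $exists = $this->exists($item);

        foreach ($this->offsets($item) as $offset) {
            $this->bits[$offset] = true;
        }

        return $exists;
    }

    /** @return int[] one bit offset per hash function */
    private function offsets(string $item): array
    {
        $offsets = [];

        for ($i = 0; $i < $this->numHashes; ++$i) {
            // Seed crc32 with the hash index to simulate independent hashes.
            $offsets[] = (crc32("$i:$item") & 0x7fffffff) % $this->size;
        }

        return $offsets;
    }
}

/**
 * Counts sequences, storing explicit counts only from the second sighting on.
 */
final class ToyCounter
{
    private ToyBloomFilter $filter;

    /** @var array<string,int> */
    private array $counts = [];

    public function __construct(ToyBloomFilter $filter)
    {
        $this->filter = $filter;
    }

    public function increment(string $sequence): void
    {
        if (isset($this->counts[$sequence])) {
            ++$this->counts[$sequence];
        } elseif ($this->filter->existsOrInsert($sequence)) {
            // Second sighting, or a false positive that causes an overcount.
            $this->counts[$sequence] = 2;
        }
    }

    public function count(string $sequence): int
    {
        if (isset($this->counts[$sequence])) {
            return $this->counts[$sequence];
        }

        return $this->filter->exists($sequence) ? 1 : 0;
    }
}
```

In this sketch a false positive can only inflate a count, never lower it, which is why the error shows up as overcounting.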

Installation

Install into your project using Composer:

$ composer require andrewdalpino/helix

Requirements

PHP 7.4 or above

Example

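use Helix\DNAHash;
use Helix\Extractors\FASTA;
use Helix\Tokenizers\Canonical;
use Helix\Tokenizers\Kmer;

// Read sequences from a FASTA file.
$extractor = new FASTA('example.fa');

// Split each sequence into 25-mers and convert them to canonical form.
$tokenizer = new Canonical(new Kmer(25));

$hashTable = new DNAHash(0.001);

foreach ($extractor as $sequence) {
    $tokens = $tokenizer->tokenize($sequence);

    // Count every k-mer found in the sequence.
    foreach ($tokens as $token) {
        $hashTable->increment($token);
    }
}

// Fetch the 10 most frequent sequences along with their counts.
$top10 = $hashTable->top(10);

print_r($top10);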

Array
(
    [GCTATAAAAAGAAAATTTTGGAATA] => 19
    [ATTCCAAAATTTTCTTTTTATAGCC] => 19
    [TAAAAAGAAAATTTTGGAATAAAAA] => 18
    [ATAAAAAGAAAATTTTGGAATAAAA] => 18
    [TATAAAAAGAAAATTTTGGAATAAA] => 18
    [CTATAAAAAGAAAATTTTGGAATAA] => 18
    [AAATAATTTCAATTTTCTATCTCAA] => 17
    [AAAATAATTTCAATTTTCTATCTCA] => 17
    [CAAAATAATTTCAATTTTCTATCTC] => 17
    [AGATAGAAAATTGAAATTATTTTGA] => 17
)
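
For readers unfamiliar with the terms used by the tokenizers above, the following sketch shows k-mer extraction and canonical form in plain PHP. It does not reproduce Helix's `Kmer` or `Canonical` classes. The function names are hypothetical, and the sketch assumes the common convention that a canonical k-mer is the lexicographically smaller of a sequence and its reverse complement.

```php
<?php

declare(strict_types=1);

/**
 * Return the reverse complement of a DNA sequence.
 */
function reverseComplement(string $sequence): string
{
    // Swap each base for its complement, then reverse the string.
    return strrev(strtr($sequence, 'ACGT', 'TGCA'));
}

/**
 * Return the canonical form of a sequence (lexicographic minimum convention).
 */
function canonical(string $sequence): string
{
    $reverse = reverseComplement($sequence);

    return strcmp($sequence, $reverse) <= 0 ? $sequence : $reverse;
}

/**
 * Yield every overlapping k-mer of a sequence.
 */
function kmers(string $sequence, int $k): Generator
{
    $n = strlen($sequence) - $k + 1;

    for ($i = 0; $i < $n; ++$i) {
        yield substr($sequence, $i, $k);
    }
}

// The 3-mers of GATTACA are GAT, ATT, TTA, TAC, and ACA.
$tokens = array_map('canonical', iterator_to_array(kmers('GATTACA', 3), false));

print_r($tokens); // ATC, AAT, TAA, GTA, ACA
```

Counting canonical k-mers means that a sequence and its reverse complement, which describe the same stretch of double-stranded DNA read from opposite strands, share a single count.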

Testing

To run the unit tests:

$ composer test

Static Analysis

To run static code analysis:

$ composer analyze

Benchmarks

To run the benchmarks:

$ composer benchmark

References

[1] https://github.com/JohnLonginotto/ACGTrie/blob/master/docs/UP2BIT.md.
[2] P. Melsted et al. (2011). Efficient counting of k-mers in DNA sequences using a bloom filter.
[3] S. Deorowicz et al. (2015). KMC 2: fast and resource-frugal k-mer counting.
