Bioinformatics data

I recently had a chance to learn a little about Bioinformatics, and ended up browsing the NIH's database of genomes here. Inside the genome data for any particular strain of a species, you'll find various files with file extensions like "ffa", "fna", "ffn" and "frn". These are FASTA files.

If you'd like an example, here's the genomic data for a certain strain of E-coli.

The file format of FASTA files is described pretty well on the Wikipedia link. I immediately wondered how difficult it would be to read the entire files and import them into a relational database. The difficult part of this work is, of course, parsing the FASTA files. In order to support that, I wrote an ANLTR4 grammar for FASTA files. The result is here. Once the parser is built, it's trivial to walk the AST and insert appropriate rows.

If you're interested, the human genome is here, listed by chromosome. However, those files are in GenBank format, which is a grammar for another day.

Update: the link to the source on the Antlr4 git: antlr/grammars-v4

Leave a Reply