April 2014

Bioinformatics data

I recently had a chance to learn a little about Bioinformatics, and ended up browsing the NIH’s database of genomes here.  Inside the genome data for any particular strain of a species, you’ll find various files with file extensions like “ffa”, “fna”, “ffn” and “frn”.  These are FASTA files. If you’d like an example, here’s the genomic data for a certain strain of E-coli. The file format of FASTA files is described pretty well on the Wikipedia link.  I immediately wondered how difficult it would be to read the entire files and import them into a relational database.  The difficult part of this work is, of course, parsing the FASTA files.  In order to support that, I wrote an ANLTR4 grammar for FASTA files.  The result is here.  Once the parser is built, it’s trivial to walk the AST and insert appropriate rows. If you’re interested, the human genome is here, listed by chromosome.  However, those files are in GenBank format, which is a grammar for another day. Update: the link to the source on the Antlr4 git: antlr/grammars-v4

Reading Paradox Files

Back in the day, Paradox was a pretty amazing database.  I recently had a reason to read some Paradox files for a friend, who had a client with a Paradox database.  They needed their data out of the database to insert into something more modern. Googling for the file format of a Paradox DB didn’t turn up much, other than this excellent document written in 1996.  It was enough to give me a good start.  I also found some sample DB files, interestingly from the Paradox documentation. The end result was paradoxReader, which you can download here.  It handles most of the data types other than BLOBs, so far.  The documentation on how to use it is here.  It uses the visitor pattern, which means all you need to do is pass it an InputStream to a .db file, and implement a single interface which is called once per row, with the row data. The default implementation currently outputs CSV for each table, however it wouldn’t be difficult to have it output SQL or even just connect to a JDBC database and insert the data into a table.

FreeBSD Wandboard, new build with multicore

The FreeBSD ARM team have completed the work to support multicore on ARM.  You can read the announcement here.  I’ve made a new build of FreeBSD ARM for Wandboard, which you can download here. The dmesg output is here, and here is the sysctl showing four cores [Kroot@wandboard:~ # sysctl -a | grep hw.ncpu hw.ncpu: 4