Java

khubla.com Java code released to Maven Central

I was recently asked to release some of my code to Maven Central, and therefore had to figure out how to do it.  I’ve now released these khubla.com libraries: cBean Pragmatach antlr4test-maven-plugin OLMReader simpleIOC ParadoxReader The maven coordinates for each are documented on the github pages.  

Outlook for Mac Archives

I recently had a reason to parse a large data set, for another project.  I decided that an ideal “large data set” would be my Outlook mail saved archives.  Sadly, Outlook for Mac doesn’t output PST files, it outputs OLM archives, which are, essentially, giant zip files full of XML.  I was coding this all in Java, so I needed a Java library to parse OLM files. The resulting source code is here.  Schema for OLM XML is here.    

Building a simple RIAK ORM

I’ve been interested in RIAK for a while, and ORM’s are nothing short of fascinating. I decided to try writing an ORM for Riak, and the results are here: https://github.com/teverett/cbean My ORM is not an ORM of course, because RIAK is not relational. However it is ORM-like; I can store POJO’s and retrieve them. The features I wanted for my ORM were those that I am accustomed to with Hibernate, or eBean. Ability to store POJO’s and retrieve them Ability to store object trees of POJO’s which contain POJO’s Support for Lists of POJO’s composed inside a POJO Lazy Loading In the end, I ended up with a ORM-like layer that can store data into any Key-Value store. There is a plugin which supports RIAK, and there is an emulated Key-Value store built on a filesystem, which is useful for development purposes. Theoretically cBean could work on any Key-Value store, such as MongoDB, but I haven’t built that support yet. Adding support for a Key-Value store is as simple as implementing the the interface KVService. Supported Types All java simple types (int, String, long, etc) All java wrapper types (Integer, Long, etc) Contained POJOs java.util.List of POJOs java.util.UUID java.util.Date Annotations Similar to Hibernate or eBean, cBean POJO’s must be annotated. I didn’t chose to use the JPA annotations, but instead defined my own. They are here.  There numerous cBean annotated POJO’s in the tests, here. @Entity Similar to JPA, the @Entity annotation simply marks the POJO as one which is of interest to cBean. @Id Every POJO must have an id field and it must be a String or UUID.  The @Id field is a little different in cBean than in JPA.  Inserting a new object with the same JPA @Id as one that already exists in the RDBMS is an error.  In cBean, you simply overwrite the existing object. @Property The @Property is used to indicate that a specific property (simple type, wrapper type, object type or list type) will be persisted.  POJO fields without the @Property annotation are ignored.  There are a number of properties which are valid on an @Property annotation. cascadeSave cascadeDelete cascadeLoad ignore nullable cascadeSave, cascadeDelete, and cascadeDelete are only relevant for POJO fields which are themselves POJOs, or lists of POJOs.  Setting “cascadeLoad=false”, naturally, indicates that the field is lazy-loaded. @Version An POJO can define an Integer field and annotate it with @Version.  cBean will increment the annotated field by 1 each time the POJO is saved. Referential Integrity Frameworks like Hibernate or eBean provide referential integrity because RDBMS’s provide referential integrity.  Key-Value stores, such as RIAK do not provide referential integrity, and therefore neither does cBean.  Therefore is it entirely possible to persist a POJO which contains a POJO, and have the contained POJO be deleted underneath the parent.  In a RDBMS this can be prevented with a foreign key; there is no such protection in cBean.  Therefore application code using cBean must be aware that POJOs in lists, for example, may not be resolvable when the list is reloaded. The strategy that cBean uses for handling broken “foreign keys” is two-fold: If a POJO contains a child POJO and the child is deleted, that Object will be set to null on reload If a POJO contains a list of POJO’s and one of the elements is deleted, the element Object will be set to null.  The List size() will remain the same as when it was persisted. Example Code There is a working example at https://github.com/teverett/cbean/tree/master/example.

MusicBrainzTagger

I’ve tried a couple different mp3 taggers to tag my mp3 library, however, most seem to have trouble with large mp3 libraries.  So, after doing some reading about AcoustID and MusicBrainz I decided to quickly code up my own tagger, MusicBrainzTagger. MusicBrainzTagger is a command-line application which recurses a directory of mp3 files and tags each one, one by one.  This approach allows it to handle very large libraries; it only processes one file at a time.  File processing consists of reading any ID3 tags in the input mp3, and then calculating the Acoustic ID fingerprint.  The fingerprint is then resolved to a MusicBrainz ID which is used to look up the recording. MusicBrainzTagger then tags the file, renames it, and moves it to a new directory.

TNT

If you haven’t read this book, I highly recommend it.  I discovered it in high school and finally purchased my first copy at the now-gone Duthie Books in Kitsilano.  Without going into the details of the book, the author uses a simple Peano arithmetic called Typographical Number Theory  (TNT) to illustrate some of his points. An example expression in TNT could look like this (from Wikipedia): ∀a:∀b:(a + b) = (b + a) Which means “for every number a and every number b, a plus b equals b plus a” I decided to write a simple ANTLR grammar for TNT, which you can find here.

Parsing HTML

I’ve been interested in HTML parsing for a while now.  There are a number of reasons to do this, such as: Validating that what claims to be HTML, is HTML Finding every style sheet and script in an HTML file Pretty-printing Syntax highlighting Linting Translating between markup languages, for example generating JSPs from PHP, or perhaps generating JSPs from ASPs. One of the most difficult aspects of modern web programming is that any example server-side markup file likely contains 4 programming languages: HTML CSS Javascript The markup language, such as PHP, JSP or C#.  Or maybe VB. So, if you’re going to write an HTML parser, you need to be able to not only parse the HTML, but also to find the style and script sections, and pull them out.  You also need to be able to find the scriptlets where the markup is generated. Additionally, there is the fact that modern HTML is messy.  It’s perfectly valid to have missing end-tags, or attribute values that aren’t quoted.  These edge cases just add to the difficultly in writing the parser. If the end goal is to read .php source and emit similar .jsp source, then one needs an HTML parser that can do all of the above.  The .php source will have to be pulled out of each scriptlet, and fed to another parser, which can parse the PHP.  Strange as it may sound, this is not actually as difficult as it seems.  It’s not hard to imagine doing something similar with legacy .asp pages. There are perfectly legitimate reasons to convert source from one language to another.  For example, an organization may have  significant investment in an application that works, but is in an outdated language such as ASP.  Re-writing the application is an option, however it’s usually an expensive option.  Conversion from one language to another might be cheaper, and approaches of that sort have been used before. The tree of ANTLR4 grammars didn’t have a HTML parser, and I like ANTLR, so I wrote an HTML grammar for ANTLR4 which, I believe, does all of the above.  You can take a look here. In order to show the parser working, I wrote a quick java program that reads an HTML input file and dumps all scripts and styles to the console.  It’s here. If you’re interested to see what the generated AST looks like for an HTML page, here’s the front page of reddit this morning, as an AST.

Bioinformatics data

I recently had a chance to learn a little about Bioinformatics, and ended up browsing the NIH’s database of genomes here.  Inside the genome data for any particular strain of a species, you’ll find various files with file extensions like “ffa”, “fna”, “ffn” and “frn”.  These are FASTA files. If you’d like an example, here’s the genomic data for a certain strain of E-coli. The file format of FASTA files is described pretty well on the Wikipedia link.  I immediately wondered how difficult it would be to read the entire files and import them into a relational database.  The difficult part of this work is, of course, parsing the FASTA files.  In order to support that, I wrote an ANLTR4 grammar for FASTA files.  The result is here.  Once the parser is built, it’s trivial to walk the AST and insert appropriate rows. If you’re interested, the human genome is here, listed by chromosome.  However, those files are in GenBank format, which is a grammar for another day. Update: the link to the source on the Antlr4 git: antlr/grammars-v4

Reading Paradox Files

Back in the day, Paradox was a pretty amazing database.  I recently had a reason to read some Paradox files for a friend, who had a client with a Paradox database.  They needed their data out of the database to insert into something more modern. Googling for the file format of a Paradox DB didn’t turn up much, other than this excellent document written in 1996.  It was enough to give me a good start.  I also found some sample DB files, interestingly from the Paradox documentation. The end result was paradoxReader, which you can download here.  It handles most of the data types other than BLOBs, so far.  The documentation on how to use it is here.  It uses the visitor pattern, which means all you need to do is pass it an InputStream to a .db file, and implement a single interface which is called once per row, with the row data. The default implementation currently outputs CSV for each table, however it wouldn’t be difficult to have it output SQL or even just connect to a JDBC database and insert the data into a table.

Pragmatch version 1.38 Released

Pragmatach version 1.38 has been released Here are the notable changes: Numerous bugs that were identified by Findbugs have been resolved. Upgraded from Antlr3 to Antlr4 Upgraded to scannotation 1.03 Added url_for API to make generation of urls in templates simpler