More ANTLR Grammars
A complete list of my ANTLR grammars: HTML, Redcode, gff3, 6502 Assembler, fasta, BASIC, Creole, Logo, brainfuck.
If you haven’t read this book, I highly recommend it. I discovered it in high school and finally purchased my first copy at the now-gone Duthie Books in Kitsilano. Without going into the details of the book, the author uses a simple formal system for Peano arithmetic called Typographical Number Theory (TNT) to illustrate some of his points. An example expression in TNT could look like this (from Wikipedia):

    ∀a:∀b:(a + b) = (b + a)

which means “for every number a and every number b, a plus b equals b plus a”.

I decided to write a simple ANTLR grammar for TNT, which you can find here.
I’ve been interested in HTML parsing for a while now. There are a number of reasons to do this, such as:

- Validating that what claims to be HTML is HTML
- Finding every style sheet and script in an HTML file
- Pretty-printing
- Syntax highlighting
- Linting
- Translating between markup languages, for example generating JSPs from PHP, or perhaps generating JSPs from ASPs

One of the most difficult aspects of modern web programming is that any example server-side markup file likely contains 4 programming languages:

- HTML
- CSS
- JavaScript
- The markup language, such as PHP, JSP or C#. Or maybe VB.

So, if you’re going to write an HTML parser, you need to be able to not only parse the HTML, but also to find the style and script sections, and pull them out. You also need to be able to find the scriptlets where the markup is generated. Additionally, there is the fact that modern HTML is messy. It’s perfectly valid to have missing end-tags, or attribute values that aren’t quoted. These edge cases just add to the difficulty of writing the parser.

If the end goal is to read .php source and emit similar .jsp source, then one needs an HTML parser that can do all of the above. The PHP code has to be pulled out of each scriptlet and fed to another parser, which can parse the PHP. Strange as it may sound, this is not as difficult as it seems. It’s not hard to imagine doing something similar with legacy .asp pages.

There are perfectly legitimate reasons to convert source from one language to another. For example, an organization may have a significant investment in an application that works, but is written in an outdated language such as ASP. Rewriting the application is an option, but it’s usually an expensive one. Conversion from one language to another might be cheaper, and approaches of that sort have been used before.

The tree of ANTLR4 grammars didn’t have an HTML parser, and I like ANTLR, so I wrote an HTML grammar for ANTLR4 which, I believe, does all of the above.
You can take a look here. To show the parser working, I wrote a quick Java program that reads an HTML input file and dumps all scripts and styles to the console. It’s here. If you’re interested in seeing what the generated AST looks like for an HTML page, here’s the front page of reddit this morning, as an AST.
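As a rough illustration of the “dump all scripts and styles” step, here is a minimal Java sketch. It is not the ANTLR-based program linked above. It swaps in a regular expression as a stand-in, which only copes with well-formed script and style blocks and would miss the messy cases the grammar is meant to handle. The names ScriptStyleSketch and extract are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScriptStyleSketch {
    // Matches <script ...>body</script> and <style ...>body</style>.
    // A regex stand-in for a real parser: assumes the blocks are well-formed.
    private static final Pattern BLOCK = Pattern.compile(
            "<(script|style)\\b[^>]*>(.*?)</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed bodies of every block whose tag name equals {@code tag}. */
    static List<String> extract(String html, String tag) {
        List<String> bodies = new ArrayList<>();
        Matcher m = BLOCK.matcher(html);
        while (m.find()) {
            if (m.group(1).equalsIgnoreCase(tag)) {
                bodies.add(m.group(2).trim());
            }
        }
        return bodies;
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style>"
                + "<script type=\"text/javascript\">alert(1);</script></head></html>";
        System.out.println("scripts: " + extract(html, "script"));
        System.out.println("styles:  " + extract(html, "style"));
    }
}
```

A grammar-based parser earns its keep exactly where this sketch fails: unquoted attributes, missing end-tags and embedded scriptlets.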
I recently had a chance to learn a little about Bioinformatics, and ended up browsing the NIH’s database of genomes here. Inside the genome data for any particular strain of a species, you’ll find various files with file extensions like “faa”, “fna”, “ffn” and “frn”. These are FASTA files. If you’d like an example, here’s the genomic data for a certain strain of E. coli. The file format of FASTA files is described pretty well on the Wikipedia link.

I immediately wondered how difficult it would be to read the entire files and import them into a relational database. The difficult part of this work is, of course, parsing the FASTA files. To support that, I wrote an ANTLR4 grammar for FASTA files. The result is here. Once the parser is built, it’s trivial to walk the AST and insert appropriate rows.

If you’re interested, the human genome is here, listed by chromosome. However, those files are in GenBank format, which is a grammar for another day.

Update: the link to the source on the Antlr4 git: antlr/grammars-v4
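To give a feel for the shape of the data, here is a minimal Java sketch of reading FASTA records. It is a hand-written line reader, not the ANTLR4 grammar mentioned above, and it assumes the simplest form of the format: a header line starting with “>” followed by one or more sequence lines. The class and field names (FastaSketch, Entry, header, sequence) are invented for the example. Each Entry is roughly what one database row might hold.

```java
import java.util.ArrayList;
import java.util.List;

final class FastaSketch {
    /** One FASTA record: the header text after '>' and the joined sequence lines. */
    static final class Entry {
        final String header;
        final String sequence;

        Entry(String header, String sequence) {
            this.header = header;
            this.sequence = sequence;
        }
    }

    /** Splits FASTA text into records; blank lines are ignored. */
    static List<Entry> parse(String text) {
        List<Entry> entries = new ArrayList<>();
        String header = null;
        StringBuilder sequence = new StringBuilder();
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                // A new header closes the previous record, if there was one.
                if (header != null) {
                    entries.add(new Entry(header, sequence.toString()));
                }
                header = line.substring(1);
                sequence.setLength(0);
            } else if (header != null) {
                sequence.append(line);
            }
        }
        if (header != null) {
            entries.add(new Entry(header, sequence.toString()));
        }
        return entries;
    }

    public static void main(String[] args) {
        String sample = ">seq1 demo record\nACGT\nTTGA\n>seq2\nGGCC\n";
        for (Entry e : parse(sample)) {
            System.out.println(e.header + " -> " + e.sequence.length() + " residues");
        }
    }
}
```

For genome-sized files a streaming reader would be a better fit than a single String; the String input here just keeps the sketch short.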
Back in the day, Paradox was a pretty amazing database. I recently had a reason to read some Paradox files for a friend, who had a client with a Paradox database. They needed their data out of the database to insert into something more modern. Googling for the file format of a Paradox DB didn’t turn up much, other than this excellent document written in 1996. It was enough to give me a good start. I also found some sample DB files, interestingly from the Paradox documentation. The end result was paradoxReader, which you can download here. It handles most of the data types other than BLOBs, so far. The documentation on how to use it is here. It uses the visitor pattern, which means all you need to do is pass it an InputStream to a .db file, and implement a single interface which is called once per row, with the row data. The default implementation currently outputs CSV for each table, however it wouldn’t be difficult to have it output SQL or even just connect to a JDBC database and insert the data into a table.
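The “one callback per row” idea can be sketched in a few lines of Java. This is only an illustration of the pattern under stated assumptions: the names RowListener, CsvListener and readAll are hypothetical and are not paradoxReader’s actual API, and the rows come from an in-memory list rather than from a real .db file.

```java
import java.util.Arrays;
import java.util.List;

final class RowCallbackSketch {
    /** Hypothetical callback, invoked once per table row. */
    interface RowListener {
        void onRow(List<String> values);
    }

    /** Example listener that accumulates the rows as CSV text. */
    static final class CsvListener implements RowListener {
        final StringBuilder out = new StringBuilder();

        @Override
        public void onRow(List<String> values) {
            for (int i = 0; i < values.size(); i++) {
                if (i > 0) {
                    out.append(',');
                }
                out.append(escape(values.get(i)));
            }
            out.append('\n');
        }

        // Quote a field only when it contains a comma, a quote or a newline.
        private static String escape(String value) {
            if (value.contains(",") || value.contains("\"") || value.contains("\n")) {
                return "\"" + value.replace("\"", "\"\"") + "\"";
            }
            return value;
        }
    }

    /** Stand-in for the file reader: hands each row to the listener in order. */
    static void readAll(List<List<String>> rows, RowListener listener) {
        for (List<String> row : rows) {
            listener.onRow(row);
        }
    }

    public static void main(String[] args) {
        CsvListener csv = new CsvListener();
        readAll(Arrays.asList(
                Arrays.asList("1", "Smith, J"),
                Arrays.asList("2", "Ann")), csv);
        System.out.print(csv.out);
    }
}
```

Swapping CsvListener for a listener that builds INSERT statements, or writes through JDBC, would not require touching the reading code, which is the appeal of the design.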
Pragmatach version 1.38 has been released. Here are the notable changes:

- Numerous bugs that were identified by Findbugs have been resolved
- Upgraded from Antlr3 to Antlr4
- Upgraded to scannotation 1.03
- Added url_for API to make generation of URLs in templates simpler
Antlr4IDE is an Eclipse plugin for editing Antlr4 grammars. One of the features I used to really like in the original Antlr IDE was the ability to generate railroad diagrams. It turns out that Antlr4IDE has this feature, so I decided to give it a try. My jvmBasic grammar is here, and you can take a look at the railroad diagram here.
Commons Math is an Apache library which includes a variety of mathematical tools, including first- and second-order ODE solvers. To familiarize myself with the ODE solvers, I wrote a simple program to solve an RC charging circuit. The circuit has these parameters:

- R (ohms), the resistance of the resistor
- C (farads), the capacitance of the capacitor
- V (volts), the supply voltage

Differential equation solvers need initial conditions, so I’ve supplied the voltage across the capacitor as zero. That is, the capacitor starts uncharged.

Commons Math requires that all equations are provided in the form

    y’ = f(y, t)

That is, the equation must give the derivative with respect to time in terms of the time t and the current state y. Note that both y’ and y can be vectors. In my case, the equation I’ve supplied is:

    Vc’ = -(Vc - Vin) / (RC)

where

- Vc’ is the derivative with respect to time of the capacitor voltage
- Vc is the current capacitor voltage
- Vin is the supply voltage (V above)

This equation is coded as an implementation of FirstOrderDifferentialEquations. Note that I could have supplied multiple equations using this interface, and I could have done something more complex using SecondOrderDifferentialEquations. My implementation is here.

Once the equations are defined, there are multiple implementations of ODE solvers to choose from. For simplicity I chose ClassicalRungeKuttaIntegrator, however any of the implementations of AdaptiveStepsizeIntegrator might be faster. My code to invoke the integrator with a step size of 1e-6 seconds is here.
The output of the test using

- R = 100 kOhm
- V = 12 V
- C = 10 nF
- Vc(0) = 0 V
- step size 1e-6 seconds
- T(0) = 0 seconds
- T(N) = 0.1 seconds

starts like this (time in seconds, Vc in volts):

    1.0E-6,0.0011999400019999497
    2.0E-6,0.0023997600159991993
    3.0E-6,0.0035994600539959497
    4.0E-6,0.0047990401279872
    4.9999999999999996E-6,0.005998500249968752
    5.999999999999999E-6,0.007197840431935207
    6.999999999999999E-6,0.008397060685879965
    8.0E-6,0.009596161023795232
    9.0E-6,0.010795141457672009
    1.0E-5,0.0119940019995001
    1.1000000000000001E-5,0.01319274266126811
    1.2000000000000002E-5,0.014391363454963448
    1.3000000000000003E-5,0.01558986439257232
    1.4000000000000003E-5,0.016788245486079736
    1.5000000000000004E-5,0.017986506747469506
    1.6000000000000003E-5,0.019184648188724243
    1.7000000000000003E-5,0.020382669821825364
    1.8000000000000004E-5,0.021580571658753083
    1.9000000000000004E-5,0.02277835371148642
    2.0000000000000005E-5,0.02397601599200319
    2.1000000000000006E-5,0.025173558512280026

and ends like this:

    0.09998900000007933,11.99945460123407
    0.09999000000007933,11.99945465577122
    0.09999100000007934,11.999454710302917
    0.09999200000007934,11.99945476482916
    0.09999300000007934,11.999454819349952
    0.09999400000007934,11.999454873865291
    0.09999500000007934,11.99945492837518
    0.09999600000007934,11.999454982879616
    0.09999700000007934,11.999455037378603
    0.09999800000007934,11.99945509187214
    0.09999900000007934,11.999455146360228

Pretty much exactly what you’d expect; the capacitor charges to ~12V, exponentially. One caveat: these numbers follow Vc(t) = 12(1 - e^(-t/RC)) with RC = 10 ms, whereas 100 kOhm and 10 nF give RC = 1 ms, so one of the component values listed above is probably off by a factor of ten.
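For readers without Commons Math at hand, the same calculation can be sketched in plain Java. This is a hand-rolled classical fourth-order Runge-Kutta step, written for illustration only: it does not use or reproduce the Commons Math classes named above, and the names RcChargingSketch, derivative, rk4Step and integrate are made up. It uses the component values as listed (R = 100 kOhm, C = 10 nF, Vin = 12 V) and compares the numerical result with the closed-form solution Vc(t) = Vin(1 - e^(-t/RC)).

```java
final class RcChargingSketch {
    /** Right-hand side of Vc' = -(Vc - Vin) / (R * C). */
    static double derivative(double vc, double vin, double r, double c) {
        return -(vc - vin) / (r * c);
    }

    /** One classical fourth-order Runge-Kutta step of size h. */
    static double rk4Step(double vc, double h, double vin, double r, double c) {
        double k1 = derivative(vc, vin, r, c);
        double k2 = derivative(vc + 0.5 * h * k1, vin, r, c);
        double k3 = derivative(vc + 0.5 * h * k2, vin, r, c);
        double k4 = derivative(vc + h * k3, vin, r, c);
        return vc + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4);
    }

    /** Integrates from t = 0 to tEnd with fixed step h, starting at vc0. */
    static double integrate(double vc0, double tEnd, double h,
                            double vin, double r, double c) {
        double vc = vc0;
        long steps = Math.round(tEnd / h);
        for (long i = 0; i < steps; i++) {
            vc = rk4Step(vc, h, vin, r, c);
        }
        return vc;
    }

    public static void main(String[] args) {
        double r = 100e3;   // ohms
        double c = 10e-9;   // farads
        double vin = 12.0;  // volts
        double tEnd = 5e-3; // seconds, five time constants for these values

        double numeric = integrate(0.0, tEnd, 1e-6, vin, r, c);
        double exact = vin * (1.0 - Math.exp(-tEnd / (r * c)));
        System.out.println("RK4:   " + numeric);
        System.out.println("exact: " + exact);
    }
}
```

Because the equation is linear with a known solution, the analytic value makes a convenient sanity check on any integrator and step size.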
I’ve recently become interested in porting legacy PHP sites to JSPs. It seemed to me that one of the hardest parts of this problem was parsing the PHP code. Once a parse tree was created, the next step would be to emit equivalent JSP code. I went looking for an ANTLR4 grammar for PHP, but could only find an ANTLR3 grammar, so I went to work updating the ANTLR3 grammar to ANTLR4 and writing a very simple validation suite. The GitHub project is here, and the resulting grammar is here.
I’ve always had a fascination with compilers. As a Java geek, I’m also quite interested in the JVM. To learn a little more about both, and as a way to contribute to the open source world, I decided to implement a compiler for BASIC. jvmBasic consumes BASIC code and emits .class files.

The first step was to build a parser and lexer for BASIC. I decided to define an ANTLR4 grammar and use it to generate the lexer and parser. BASIC is a fairly simple language, so the grammar was not difficult to define. However, there are numerous BASIC dialects, so I had to pick a simple dialect. jvmBASIC syntax looks much like Integer BASIC, but could easily be extended to parse GW-Basic, or maybe VB. The resulting grammar is here.

Once ANTLR has generated a parser and lexer, it’s possible to generate a parse tree for any BASIC input and then walk the tree emitting bytecode. I used ASM to emit the bytecode. An example BAS input file looks like this:

    100 PRINT "Hello world"

The generated parse tree from the jvmBASIC debug output looks like this:

    - [1 line]
    - [3 linenumber]
    - [120 NUMBER] 100
    - [4 amprstmt]
    - [5 statement]
    - [7 printstmt1]
    - [4 'PRINT'] PRINT
    - [8 printlist]
    - [66 expression]
    - [60 func]
    - [118 STRINGLITERAL] "Hello world"
    - [122 CR]

Because there is no concept of functions, methods or classes in BASIC, I chose to enclose the generated code in a single method of a single class. The class name is the name of the BASIC input file, and the single method is:

    public static void main(String[] args)

The class has two fields:

    public InputStream inputStream;
    public OutputStream outputStream;

The default values of inputStream and outputStream are System.in and System.out respectively. However, in the case of jvmbasicwww, I replace them with HTTP input and output streams.

BASIC doesn’t have new, delete, malloc, or free, or really any analogue of those. Additionally, built-in functions such as MID$ or VAL have their own semantics and behaviour.
To emulate BASIC as closely as possible, I implemented jvmbasicrt. Inside jvmbasicrt are implementations of each BASIC function, as well as a class called ExecutionContext. ExecutionContext includes the “guts” of a BASIC runtime:

- A stack. Similar to many programming languages, BASIC needs a stack.
- All variables. This is simply a hashtable of Values, keyed on the variable name.

Additionally there is Value, which implements a variable with BASIC semantics.

There is a Maven mojo which wraps jvmbasicc. The mojo, jvmbasicmojo, compiles all BASIC files in “/src/main/bas” and produces a .class file for each one. This mojo can be used to incorporate BASIC files into any normal Maven project and then link them into a .jar file. An additional example BASIC file is:

    10 REM this is a comment
    20 PRINT "13"
    30 PRINT "hi"
    40 PRINT 10
    50 PRINT 15.55
    60 LET x = 12
    70 PRINT "hihi"
    80 PRINT x
    90 LET y = 1+2
    100 LET z = 3*6
    110 LET d= y+z
    120 PRINT d

The Maven pom file that uses jvmbasicmojo is here. The javap output for the generated .class file is:

    public class EXAMPLE1 {
      public com.khubla.jvmbasic.jvmbasicrt.ExecutionContext executionContext;
      public java.io.InputStream inputStream;
      public java.io.PrintStream outputStream;
      public EXAMPLE1();
      public static void main(java.lang.String[]);
      public void program() throws java.lang.Exception;
    }

There isn’t a big demand, that I’m aware of, for bytecode compilers for BASIC. Two potential applications that come to mind are:

- Running VB code on the JVM. Theoretically it would be possible to extend the grammar to include VB, and then to emit bytecode for VB programs. This would form the foundation of technology to run .asp applications on the JVM. The VB standard library would have to be implemented too.
- Cross-compilation. Again, theoretically, it should be possible to use the grammar file to implement a cross compiler which consumes VB code and emits JSP code, or even PHP code.
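To make the “stack plus hashtable of variables” description concrete, here is a small Java sketch of such a runtime evaluating lines 90 to 110 of the example program. It is an illustration under assumptions, not the real jvmbasicrt code: the nested ExecutionContext and Value classes only borrow the names from the text, they handle numbers only, and the rule that an unset variable reads as 0 is an assumption about BASIC semantics rather than something taken from jvmbasicrt.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class BasicRuntimeSketch {
    /** Hypothetical stand-in for a BASIC value; numeric only in this sketch. */
    static final class Value {
        final double number;

        Value(double number) {
            this.number = number;
        }
    }

    /** Hypothetical runtime state: an operand stack plus variables keyed by name. */
    static final class ExecutionContext {
        private final Deque<Value> stack = new ArrayDeque<>();
        private final Map<String, Value> variables = new HashMap<>();

        void push(double n) {
            stack.push(new Value(n));
        }

        void add() {
            double right = stack.pop().number;
            double left = stack.pop().number;
            push(left + right);
        }

        void multiply() {
            double right = stack.pop().number;
            double left = stack.pop().number;
            push(left * right);
        }

        /** Pops the top of the stack into the named variable (LET). */
        void store(String name) {
            variables.put(name, stack.pop());
        }

        /** Pushes the named variable's value onto the stack. */
        void load(String name) {
            push(get(name));
        }

        double get(String name) {
            Value v = variables.get(name);
            return v == null ? 0.0 : v.number; // assumption: unset variables read as 0
        }
    }

    /** Replays LET y = 1+2, LET z = 3*6, LET d = y+z as stack operations. */
    static ExecutionContext runExample() {
        ExecutionContext ctx = new ExecutionContext();
        ctx.push(1); ctx.push(2); ctx.add(); ctx.store("y");
        ctx.push(3); ctx.push(6); ctx.multiply(); ctx.store("z");
        ctx.load("y"); ctx.load("z"); ctx.add(); ctx.store("d");
        return ctx;
    }

    public static void main(String[] args) {
        System.out.println(runExample().get("d")); // prints 21.0
    }
}
```

A compiler that emits bytecode can keep the generated code small by emitting calls into a context object like this, leaving the BASIC-specific behaviour in ordinary Java.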