Parsing HTML

I've been interested in HTML parsing for a while now.  There are a number of reasons to parse HTML, such as:

  • Validating that what claims to be HTML, is HTML
  • Finding every style sheet and script in an HTML file
  • Pretty-printing
  • Syntax highlighting
  • Linting
  • Translating between markup languages, for example generating JSPs from PHP, or perhaps generating JSPs from ASPs.

One of the most difficult aspects of modern web programming is that any example server-side markup file likely contains 4 programming languages: HTML, CSS, JavaScript, and a server-side scripting language such as PHP, JSP or ASP.

So, if you're going to write an HTML parser, you need to be able to not only parse the HTML, but also to find the style and script sections, and pull them out.  You also need to be able to find the scriptlets where the markup is generated.

Additionally, there is the fact that modern HTML is messy.  It's perfectly valid to have missing end-tags, or attribute values that aren't quoted.  These edge cases just add to the difficulty of writing the parser.

If the end goal is to read .php source and emit similar .jsp source, then one needs an HTML parser that can do all of the above.  The .php source will have to be pulled out of each scriptlet, and fed to another parser, which can parse the PHP.  Strange as it may sound, this is not actually as difficult as it seems.  It's not hard to imagine doing something similar with legacy .asp pages.

There are perfectly legitimate reasons to convert source from one language to another.  For example, an organization may have a significant investment in an application that works, but is written in an outdated language such as ASP.  Re-writing the application is an option, but it's usually an expensive one.  Conversion from one language to another might be cheaper, and approaches of that sort have been used before.

The tree of ANTLR4 grammars didn't have an HTML parser, and I like ANTLR, so I wrote an HTML grammar for ANTLR4 which, I believe, does all of the above.  You can take a look here.

In order to show the parser working, I wrote a quick Java program that reads an HTML input file and dumps all scripts and styles to the console.  It's here.
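To give a feel for what "dump all scripts and styles" means, here is a minimal sketch.  It is not the ANTLR4-based Java program mentioned above; it swaps in Python's standard html.parser module instead, and the class name, function name and sample page are made up for illustration.

```python
from html.parser import HTMLParser


class ScriptStyleCollector(HTMLParser):
    """Collects the text found inside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self.current = None   # name of the element we are inside, or None
        self.buffer = []      # text chunks seen inside the current element
        self.found = []       # list of (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.current:
            self.found.append((tag, "".join(self.buffer).strip()))
            self.current = None


def extract_scripts_and_styles(html):
    """Return (tag, text) pairs for every script and style block."""
    collector = ScriptStyleCollector()
    collector.feed(html)
    collector.close()
    return collector.found


# A made-up page with a missing </p> end-tag, which is valid HTML.
sample = """<html><head>
<style>p { color: red }</style>
<script>var x = 1 < 2;</script>
</head><body><p>hello</body></html>"""

for tag, text in extract_scripts_and_styles(sample):
    print(tag, "=>", text)
```

Note that this only covers the script and style sections; finding server-side scriptlets still needs a grammar that knows about them.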

If you're interested to see what the generated AST looks like for an HTML page, here's the front page of reddit this morning, as an AST.

Bioinformatics data

I recently had a chance to learn a little about Bioinformatics, and ended up browsing the NIH's database of genomes here.  Inside the genome data for any particular strain of a species, you'll find various files with file extensions like "faa", "fna", "ffn" and "frn".  These are FASTA files.

If you'd like an example, here's the genomic data for a certain strain of E. coli.

The file format of FASTA files is described pretty well on Wikipedia.  I immediately wondered how difficult it would be to read the entire files and import them into a relational database.  The difficult part of this work is, of course, parsing the FASTA files.  In order to support that, I wrote an ANTLR4 grammar for FASTA files.  The result is here.  Once the parser is built, it's trivial to walk the AST and insert appropriate rows.
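As a rough illustration of the "parse, then insert rows" idea, here is a small sketch.  It does not use the ANTLR4 grammar; it is a hand-written reader for the simplest form of FASTA (a ">" header line followed by sequence lines), and the table name, column names and sample records are all invented for the example.

```python
import sqlite3


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def load_into_db(conn, records):
    """Insert (header, sequence) pairs into a hypothetical 'sequences' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences (header TEXT, residues TEXT)"
    )
    conn.executemany("INSERT INTO sequences VALUES (?, ?)", records)
    conn.commit()


# Made-up records, not real genome data.
sample = """>seq1 hypothetical example record
ACGTACGT
TTGA
>seq2 another made-up record
GGCC
""".splitlines()

conn = sqlite3.connect(":memory:")
load_into_db(conn, parse_fasta(sample))
for header, residues in conn.execute(
    "SELECT header, residues FROM sequences ORDER BY rowid"
):
    print(header, len(residues))
```

An in-memory SQLite database stands in for whatever relational database you actually want to load.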

If you're interested, the human genome is here, listed by chromosome.  However, those files are in GenBank format, which is a grammar for another day.

Update: the link to the source on the Antlr4 git: antlr/grammars-v4

Reading Paradox Files

Back in the day, Paradox was a pretty amazing database.  I recently had a reason to read some Paradox files for a friend, who had a client with a Paradox database.  They needed their data out of the database to insert into something more modern.

Googling for the file format of a Paradox DB didn't turn up much, other than this excellent document written in 1996.  It was enough to give me a good start.  I also found some sample DB files, interestingly from the Paradox documentation.

The end result was paradoxReader, which you can download here.  It handles most of the data types other than BLOBs, so far.  The documentation on how to use it is here.  It uses the visitor pattern, which means all you need to do is pass it an InputStream to a .db file, and implement a single interface which is called once per row, with the row data.

The default implementation currently outputs CSV for each table, however it wouldn't be difficult to have it output SQL or even just connect to a JDBC database and insert the data into a table.
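The per-row callback design is easy to sketch.  The code below is not paradoxReader's API (which is Java); every class and function name here is hypothetical, the rows are hard-coded instead of being decoded from a .db file, and SQLite stands in for a JDBC database.

```python
import csv
import io
import sqlite3


class RowListener:
    """Hypothetical callback interface: called once per row."""

    def on_row(self, values):
        raise NotImplementedError


class CsvListener(RowListener):
    """Writes each row as a line of CSV."""

    def __init__(self, out):
        self.writer = csv.writer(out, lineterminator="\n")

    def on_row(self, values):
        self.writer.writerow(values)


class SqliteListener(RowListener):
    """Inserts each row into an existing table using placeholders."""

    def __init__(self, conn, table, column_count):
        placeholders = ", ".join("?" * column_count)
        self.conn = conn
        self.sql = f"INSERT INTO {table} VALUES ({placeholders})"

    def on_row(self, values):
        self.conn.execute(self.sql, values)


def read_table(rows, listener):
    """Stand-in for the file reader: hands each decoded row to the listener."""
    for row in rows:
        listener.on_row(row)


rows = [(1, "Alice"), (2, "Bob")]   # made-up table contents

csv_out = io.StringIO()
read_table(rows, CsvListener(csv_out))
print(csv_out.getvalue(), end="")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT)")
read_table(rows, SqliteListener(conn, "people", 2))
print(conn.execute("SELECT COUNT(*) FROM people").fetchone()[0])
```

The reader never needs to know where the data ends up, which is why switching from CSV to database inserts only means writing another listener.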

Bootstrapping pkgng on FreeBSD ARM

There is no official pkgng repo for FreeBSD-arm, yet.  There is an unofficial one, here, however in order to use it, you have to have pkgng installed.  As far as I can tell, the only way to install pkgng on ARM is to build and install it from source.
On the platform I'm using, Wandboard, the mmc device isn't working 100% yet, so I decided to compile on a RAM disk.  I did this to create the RAM disk:

mkdir /mnt/tmpdisk
mount -t tmpfs tmpfs /mnt/tmpdisk
cd /mnt/tmpdisk

The second step is to fetch the pkg source.  Once you have the source, untar it, build and install:

tar xvf pkg-1.1.tar.xz
cd pkg-1.1
make install

In my case, "make install" on FreeBSD-Current failed due to lack of certain directories.  This may help:

mkdir -p /usr/local/lib
mkdir -p /usr/local/man/man8
mkdir -p /usr/local/libdata
mkdir -p /usr/local/sbin

Once you've installed pkgng, you should be able to verify that it's available:

root@wandboard:/mnt/tmpfs/pkg-1.1.4 # /usr/local/sbin/pkg -v

From here, there are a couple of options.

  • You can use the unofficial repo provided here.
  • You can download the packages you need from here, and install them.

Building my own wireless point

I got interested in building my own wireless point after seeing some of the wireless firmware issues like this.  Besides, I've always been interested in embedded devices and FreeBSD.

So, the first step was a device.  I chose to use a Wandboard.  I'm a committer to Crochet-FreeBSD, so I built out the device support for Crochet-FreeBSD.  You can take a look here.

For the wireless interface I used a Cisco AE1000 USB wireless adapter. The AE1000 uses the run driver.

Starting the wireless interface and scanning for wireless points looks like this:

ifconfig wlan0 create wlandev run0
ifconfig wlan0 up scan

On this board I have two interfaces:

  • ffec0.  The wired interface
  • run0.  The Cisco USB wireless

ffec0 is configured to get an IP, gateway and DNS via DHCP, in /etc/rc.conf


I had these design criteria.

  • I already have a DHCP server, so I didn't want to assign IP leases on the wireless point; I want to delegate to my existing DHCP server
  • I prefer to use WPA Personal for authentication
  • I'd like to install as little software as possible; this doesn't need to be complicated
  • It would be great to automatically firewall any IPs that fail to log in more than a couple times
  • A simple web administration interface would be very helpful

Of course, I'm not interested in connecting to an existing wireless point; instead I want to be the wireless point.  I need only one piece of software installed to function as a wireless point: hostapd.  Fortunately hostapd is part of the base FreeBSD install.

There are a couple kernel features I needed, so I loaded them at boot time.  My /boot/loader.conf looks like:





# run driver

# bridge

# set wandboard to use 1 cpu

These options give me various wlan capabilities, the pf devices, the bridge device, and altq.  I've also loaded the kernel module for the run driver.

The strategy I want to use for leveraging my existing DHCP server and existing network is to configure my wireless point as a transparent bridge. The bridge device provides me exactly what I want, by enabling me to bridge the ffec0 and wlan0 interfaces.

My /etc/rc.conf includes:

# hostname

# services

# pf

# lan

# turn off sendmail

# wireless
create_args_wlan0="wlanmode hostap mode 11g"
ifconfig_wlan0="ssid snagglepuss11 channel 11"

# bridge
ifconfig_bridge0="addm ffec0 addm wlan0 up"

This configuration sets up the lan interface on DHCP, the wifi interface as an 11g access point on channel 11, and then bridges the two interfaces.  At this point, we have a working wifi interface.  However, it's not secured yet.

My /etc/hostapd.conf file, the configuration file for hostapd, looks like this:

wpa_pairwise=CCMP TKIP

It's pretty simple; I have WPA authentication on the interface wlan0, with the ssid khublacom1.

Finally, I decided to implement some simple packet filtering.  /etc/pf.conf looks like this:

# interfaces

# options
set block-policy return
set optimization conservative
# skip filtering on lo (options must come before normalization and filter rules)
set skip on lo

# normalization
scrub in all
scrub out all

# anti-spoof
antispoof for $lan_if inet

# default, deny everything
block in log all

# out is ok
pass out quick

# pass inet4 and inet6 traffic in on wifi and lan
pass in on $wifi_if inet
pass in on $wifi_if inet6
pass in on $lan_if inet
pass in on $lan_if inet6

# icmp all good
pass out inet proto icmp from any to any keep state
pass in quick inet proto icmp from any to any keep state

I allow all IPv4 and IPv6 traffic in on both the wifi and lan interfaces.

I don't have a web interface yet; I've had some trouble reliably compiling on the current builds of FreeBSD ARM.  However, I'm sure that'll be worked out shortly.

Pragmatach-1.32 Released

The latest release of the Pragmatach Framework is version 1.32.  Changes in this release include:

  • Bug fix in JMX registration.  Web contexts can now be reloaded by the container and the JMX beans are gracefully unregistered and re-registered
  • Addition of the @Startup and @Shutdown annotations to allow methods to be run at container shutdown and startup.  An example is here.
  • Addition of @BeforeInvoke and @AfterInvoke annotations which can be used to register methods which are run before and after controller method invocations.  Example is here.
  • Fixed a null pointer exception when POSTing empty forms
  • Updated various libraries to the latest including
    • org.apache.httpcomponents.httpclient
    • org.apache.httpcomponents.httpmime
    • org.hibernate.hibernate-core
    • org.apache.openjpa.openjpa
    • org.thymeleaf.thymeleaf
    • com.thoughtworks.xstream.xstream
    • org.yaml.snakeyaml
    • commons-beanutils.commons-beanutils
    • commons-fileupload.commons-fileupload
    • org.antlr.antlr
    • commons-codec.commons-codec
    • asm.asm-util
  • Updated the default J2EE containers to the latest versions including
    • jetty
    • tomcat7
    • tomcat6
    • tomee
    • jboss-as
    • glassfish
  • Updated the supported databases to the latest version including
    • hsqldb
    • derby
    • h2
  • Added prototypical MongoDB support
  • Switched from cglib to javassist for proxy generation


How does the Crochet-FreeBSD ARM boot work?


This is my 3rd blog posting on the topic of the Crochet-FreeBSD ARM boot process.  The other two are here and here.  At long last, I have the Crochet-FreeBSD build for Wandboard working properly, with U-Boot and ubldr.  This article will serve, I hope, to document the entire process and give others a place to start in booting FreeBSD on other embedded ARM devices.  If you want to follow along via the boot log, it's here.  Bootable images are here.

Booting FreeBSD on an ARM device has three primary steps:

  1. U-Boot
  2. ubldr
  3. FreeBSD kernel

Diagrammatically, it looks like this:

[Figure: boot process]

When the Wandboard starts, it loads a boot image from the mmc, at a known location on the disk.  In the case of Crochet-FreeBSD this image is U-Boot.  After U-Boot starts it loads and runs ubldr, which in turn loads the FreeBSD kernel and boots it.

Disk Layout

The Wandboard documentation here shows the basic requirements of the disk layout on the mmc card.  In particular:

  • The MBR is at the start of the disk, and is about 0x200 bytes
  • The U-boot boot loader is expected to be at offset 0x400 (1024) on the disk.

In the case of the Crochet-FreeBSD image for Wandboard, I've chosen this layout

  • MBR at the start of the disk, about 0x200 bytes long
  • At offset 0x400, is my U-boot boot loader, as required by the Wandboard
  • A FAT partition 50MB in size at offset 16384 (0x4000) on the disk.   This is the partition that the U-Boot configuration file, and ubldr will live on.
  • A UFS partition which is the remainder of the disk.  This will be the FreeBSD root filesystem.

The offset 0x4000 is counted in 512-byte sectors, so the FAT partition starts 8MB into the disk; this matches the 8.0M of free space before the first slice in the gpart output below.  Given that U-Boot is about a 300KB binary loaded at offset 0x400, we can be quite sure that the FAT partition will not overlap with U-Boot.  ubldr compiles to a 250KB binary and is stored on a 50MB partition; also plenty of space.

The disk layout looks like this

root@wandboard:~ # gpart show
=> 63 30678977 mmcsd0 MBR (15G)
 63 16380 - free - (8.0M)
 16443 102375 1 !12 [active] (50M)
 118818 1881180 2 freebsd (919M)
 1999998 28679042 - free - (14G)
=> 0 1881180 mmcsd0s2 BSD (919M)
 0 1881180 1 freebsd-ufs (919M)
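
The numbers in the gpart output can be cross-checked with a little arithmetic.  This is only a sketch: it assumes gpart is reporting 512-byte sectors, takes the 300KB U-Boot size from the text as an approximation, and the function and variable names are mine.

```python
SECTOR_SIZE = 512  # bytes per sector; an assumption, typical for SD/MMC cards


def sectors_to_mib(sectors, sector_size=SECTOR_SIZE):
    """Convert a sector count to mebibytes."""
    return sectors * sector_size / (1024 * 1024)


# (description, start sector, size in sectors), copied from the gpart output
layout = [
    ("free gap before slice 1", 63, 16380),
    ("slice 1 (FAT)", 16443, 102375),
    ("slice 2 (freebsd)", 118818, 1881180),
]

for name, start, size in layout:
    print(f"{name}: starts at {sectors_to_mib(start):.2f} MiB, "
          f"size {sectors_to_mib(size):.1f} MiB")

# U-Boot sits at byte offset 0x400 and is roughly 300KB long.
uboot_end = 0x400 + 300 * 1024
fat_start = 16443 * SECTOR_SIZE
print("U-Boot overlaps the FAT slice:", uboot_end > fat_start)
```

Each entry also starts exactly where the previous one ends (63 + 16380 = 16443, and 16443 + 102375 = 118818), so there are no hidden gaps between the slices.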


U-Boot

U-Boot is a very capable boot loader that can boot a variety of architectures, one of which is ARM.  In the case of Wandboard, however, a couple of changes are needed to the configuration.  The patch files are here.  The primary requirements are:

  • U-Boot needs to be configured to read ELF binaries
  • U-Boot needs to be configured to include the U-Boot API, a feature which ubldr requires.
  • The Makefile needs to be changed, to include libc specifically

When U-Boot starts, it looks for the file "uEnv.txt" on the file system.  It's very important that the first partition on the disk be FAT, since U-Boot doesn't include UFS support.  The contents of uEnv.txt are instructions to U-Boot to load ubldr and start it.  Specifically:

uenvcmd=fatload mmc 0:1 88000000 ubldr;bootelf 88000000;

These U-Boot commands mean:

  • From the FAT disk unit "mmc 0" on the 1st slice load "ubldr" into RAM location 0x88000000.
  • Boot the ELF image at 0x88000000
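
To make the structure of that line explicit, here is a small sketch that takes it apart.  The parsing function is hypothetical and only handles this simple "name=cmd;cmd;" shape; the one U-Boot fact it relies on is that numeric command arguments such as the load address are read as hexadecimal.

```python
def parse_uenv_command(line):
    """Split 'name=cmd1;cmd2;' into (name, [argv, argv, ...])."""
    name, _, body = line.partition("=")
    commands = [part.split() for part in body.split(";") if part.strip()]
    return name, commands


line = "uenvcmd=fatload mmc 0:1 88000000 ubldr;bootelf 88000000;"
name, commands = parse_uenv_command(line)

# fatload <interface> <device:partition> <address> <filename>
fatload = commands[0]
interface, dev_part, address, filename = fatload[1:]
load_address = int(address, 16)   # U-Boot treats this argument as hex

print(name)
print(f"load {filename} from {interface} {dev_part} to {load_address:#x}")
print("then run:", " ".join(commands[1]))
```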

From here, ubldr will start.  The reason we compiled U-Boot with ELF support is that ubldr is an ELF binary, so we need the U-Boot command "bootelf".

The choice of memory address 0x88000000 is mostly arbitrary.  According to the manuals, the RAM starts at address 0x10000000, so this number has to be larger than 0x10000000, below the end of the physical RAM, and not conflict with the memory address that U-Boot was loaded at.  I suspect the Wandboard loaded U-Boot at 0x10000000.


ubldr

ubldr is an ARM implementation of loader(8).  It's not technically necessary to use ubldr; U-Boot could just boot the kernel directly, but having an implementation of loader(8) is quite useful.  For example, it provides a serial console for kernel debugging, and it allows passing flags to the kernel at startup time.  Some drivers, such as urtwn, require passing flags to the kernel to accept licensing terms for binary blobs.

Since ubldr is not relocatable, it's necessary to compile it with the memory address that it will be loaded at.  If you look here, you will notice that the address 0x88000000 was passed to the compile command.

An important aspect of ubldr for Wandboard is that the FDT is compiled into the kernel.  ubldr can use external device blobs, or it can use device blobs that are compiled into the kernel.  If you look at the kernel config for Wandboard here, you can see that it specifies compiled-in device blobs.  This message from ubldr shows that it's using a device tree blob (DTB) compiled into the kernel:

Booting [/boot/kernel/kernel]...               

Using DTB compiled into kernel.

When ubldr starts, it'll mount the UFS root file system, and read the file /boot/loader.conf.  In that file we can configure the boot loader, including passing kernel parameters, setting the serial console speed, configuring boot loader menus, etc.   It's worth noting that ubldr configurations are coded in FORTH.

Another aspect of ubldr which is important is that it uses the U-Boot API.  If you're interested to know how that works, the ubldr source is here.  The Makefile pulls in U-Boot deps, and they are referred to in conf.c.

On the topic of ubldr, there is an excellent blog post here that is worth reading.

FreeBSD Kernel

Finally, ubldr will boot the kernel.  The default location for the kernel on the root filesystem is /boot/kernel/kernel.  The file /etc/fstab should be configured to mount appropriate file systems when the kernel starts.  If you look at the kernel configuration for Wandboard here, you can see that the kernel is configured to find the root file system on the first mmc device, on the second slice.  Specifically:

# U-Boot stuff lives on slice 1, FreeBSD on slice 2.
options ROOTDEVNAME=\"ufs:mmcsd0s2a\"


One trick that was used for Crochet-FreeBSD's Beaglebone build was to configure /etc/fstab to mount the FAT partition to /boot/msdos.  The specific configuration is:

/dev/mmcsd0s1   /boot/msdos     msdosfs rw,noatime      0 0

This is interesting because uEnv.txt is on this partition. We can modify it if we want to try new configurations without rebuilding the image.