I've been interested in HTML parsing for a while now. There are a number of reasons to parse HTML, such as:
- Validating that what claims to be HTML actually is HTML
- Finding every style sheet and script in an HTML file
- Syntax highlighting
- Translating between markup languages, for example generating JSPs from PHP, or perhaps from ASPs.
One of the most difficult aspects of modern web programming is that any given server-side markup file likely contains four languages:
- HTML itself
- CSS, in the style sections
- JavaScript, in the script sections
- The server-side language, such as PHP, JSP or C#. Or maybe VB.
So, if you're going to write an HTML parser, you need to be able to not only parse the HTML, but also to find the style and script sections, and pull them out. You also need to be able to find the scriptlets where the markup is generated.
Additionally, there is the fact that modern HTML is messy. It's perfectly valid to have missing end-tags, or attribute values that aren't quoted. These edge cases just add to the difficulty of writing the parser.
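To make the messiness concrete, here is a minimal sketch in Python using the standard library's `html.parser`, not the ANTLR4 grammar discussed in this post. The `EventLogger` class and `tokenize` function are names I made up for illustration. It feeds in a list with unquoted attribute values and no `</li>` end-tags, and records what the tokenizer reports.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: record parse events for lenient input."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text to keep the log short.
        if data.strip():
            self.events.append(("text", data.strip()))


def tokenize(html):
    """Run the logger over a string and return the recorded events."""
    logger = EventLogger()
    logger.feed(html)
    logger.close()
    return logger.events


# Unquoted attribute values, and no </li> end-tags at all.
messy = "<ul><li class=item>one<li class=item>two</ul>"
for event in tokenize(messy):
    print(event)
```

Note that the tokenizer accepts the unquoted `class=item` but never reports an end event for `li`. Working out where each `li` implicitly closes is left to whatever builds the tree, which is part of what makes a full HTML parser harder than it first looks.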
If the end goal is to read .php source and emit similar .jsp source, then one needs an HTML parser that can do all of the above. The .php source will have to be pulled out of each scriptlet, and fed to another parser, which can parse the PHP. Strange as it may sound, this is not actually as difficult as it seems. It's not hard to imagine doing something similar with legacy .asp pages.
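As a rough illustration of pulling scriptlets out of a page, here is a minimal Python sketch. It assumes scriptlets are delimited by `<?php ... ?>` only, and it uses a regular expression rather than a real grammar, so a `?>` inside a PHP string literal would fool it. The `split_scriptlets` name is hypothetical.

```python
import re

# Assumption: only <?php ... ?> scriptlets; short tags such as <?= are ignored.
SCRIPTLET = re.compile(r"<\?php\b(.*?)\?>", re.DOTALL)


def split_scriptlets(source):
    """Split a page into ('html', text) and ('php', code) segments, in order."""
    segments = []
    pos = 0
    for match in SCRIPTLET.finditer(source):
        if match.start() > pos:
            # Markup between the previous scriptlet and this one.
            segments.append(("html", source[pos:match.start()]))
        segments.append(("php", match.group(1).strip()))
        pos = match.end()
    if pos < len(source):
        # Trailing markup after the last scriptlet.
        segments.append(("html", source[pos:]))
    return segments


page = "<p>Hello, <?php echo $name; ?>!</p>"
for kind, text in split_scriptlets(page):
    print(kind, repr(text))
```

Each `php` segment is what would be handed to the second parser, while the `html` segments go to the HTML parser. A converter would then emit the equivalent JSP scriptlet in place of each `php` segment.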
There are perfectly legitimate reasons to convert source from one language to another. For example, an organization may have a significant investment in an application that works, but is written in an outdated language such as ASP. Rewriting the application is an option, but it's usually an expensive one. Conversion from one language to another might be cheaper, and approaches of that sort have been used before.
The tree of ANTLR4 grammars didn't have an HTML parser, and I like ANTLR, so I wrote an HTML grammar for ANTLR4 which, I believe, does all of the above. You can take a look here.
In order to show the parser working, I wrote a quick Java program that reads an HTML input file and dumps all scripts and styles to the console. It's here.
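For readers who want to try the same idea without ANTLR, here is a small Python sketch in the same spirit. It is not the Java program mentioned above: it uses the standard library's `html.parser`, and the `ScriptStyleCollector` class and `collect` function are hypothetical names of my own. It gathers inline and external scripts, inline styles, and linked style sheets.

```python
from html.parser import HTMLParser


class ScriptStyleCollector(HTMLParser):
    """Hypothetical helper: collect (tag, attributes, inline body) tuples."""

    def __init__(self):
        super().__init__()
        self.found = []
        self._current = None  # [tag, attrs, body] while inside script/style

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._current = [tag, attrs, ""]
        elif tag == "link" and (attrs.get("rel") or "").lower() == "stylesheet":
            # External style sheets have no inline body.
            self.found.append(("link", attrs, ""))

    def handle_data(self, data):
        if self._current is not None:
            # The body may arrive in several chunks, so accumulate it.
            self._current[2] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current[0]:
            self.found.append(tuple(self._current))
            self._current = None


def collect(html):
    """Return every script and style sheet found in the given HTML string."""
    collector = ScriptStyleCollector()
    collector.feed(html)
    collector.close()
    return collector.found


page = """<html><head>
<link rel="stylesheet" href="site.css">
<style>body { margin: 0 }</style>
<script src="app.js"></script>
</head><body><script>var x = 1;</script></body></html>"""

for tag, attrs, body in collect(page):
    print(tag, attrs, repr(body))
```

To run it against a file instead of the inline sample, read the file into a string and pass it to `collect`.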
If you're interested in seeing what the generated AST looks like for an HTML page, here's the front page of reddit this morning, as an AST.