Joe Walnes
  Blog




About Joe Walnes

I am a software engineer for Google, based in London.


Looking back at the SiteMesh HTML parser

Before talking about how the new SiteMesh HTML processor works (to be released in SiteMesh 3), I thought I'd write a bit about how the current parser has evolved since its first attempt in 1999 - purely in the interest of nostalgia.

The original version used a bunch of regular expressions to extract the necessary chunks of text from the document. This was easy to get running, but very error-prone, as the matches had no context about where they were in the document. For example, a <title> element in a <head> block is very important to SiteMesh, but sometimes one appears elsewhere, such as in a comment, <script> or <xml> block.
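To make that failure concrete, here is a minimal sketch of the kind of context-free match described above. It is not the original SiteMesh code; the class name, method name and pattern are made up for illustration. The regular expression has no idea that the first <title> it finds sits inside a comment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: not taken from any SiteMesh release.
public class NaiveTitleExtractor {

    // Context-free pattern: matches the first <title>...</title> anywhere in the text.
    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the text of the first <title> match, or null if there is none. */
    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String page =
                "<html><head>\n"
              + "<!-- <title>Old title</title> -->\n"
              + "<title>Real title</title>\n"
              + "</head><body>Hello</body></html>";
        // The commented-out title wins, because the pattern knows nothing about comments.
        System.out.println(extractTitle(page)); // prints "Old title"
    }
}
```

You could bolt on more patterns to strip comments and scripts first, but every special case is another chance to mismatch, which is roughly why this approach got dumped.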

This was dumped in favour of a DOM-based parser, which initially used JTidy to convert HTML to XHTML so it could be traversed as a standard DOM tree. Much nicer, but very slooow. Too slow, so I switched to OpenXML, an XML parser that was tolerant of nasty HTML, giving a slight boost to performance. I was much happier with OpenXML - even though it still added a fair amount of overhead and rewrote bits of HTML that I didn't want it to.

Annoyingly, not long after that, the OpenXML project merged with the IBM XML4J parser project, rebranded itself as the mighty Apache Xerces and promptly dropped support for HTML parsing. So now I was dependent on a library that no longer existed.

By this time, SiteMesh had been open-sourced, and along came Victor Salaman, who was the third user to discover it (after Mike Cannon-Brookes and Joseph Ottinger). He saw the potential but hated the parser. About three hours later, he'd produced his own version that used low-level string manipulation. It wasn't pretty, but it went like the clappers - twelve times faster than the OpenXML one, with the bonus feature of not rewriting great chunks of the document. This brought SiteMesh into the mainstream as it was now ready for use on high-traffic sites. 1.0 was released.
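For a flavour of what low-level string manipulation can look like, here is a rough sketch. It is emphatically not Victor's code; the class and method names are hypothetical, and it only knows how to skip comments (not <script> or <xml> blocks). The point is that it walks the text once, builds no tree and leaves the original document untouched.

```java
// Hypothetical illustration only: not the SiteMesh parser.
public class StringScanTitleExtractor {

    /** Case-insensitive indexOf, so the original string never has to be copied or lower-cased. */
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i <= s.length() - needle.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /** Returns the first <title> text that is not inside an HTML comment, or null. */
    public static String extractTitle(String html) {
        int pos = 0;
        while (true) {
            int lt = html.indexOf('<', pos);
            if (lt < 0) {
                return null; // no more tags
            }
            if (html.startsWith("<!--", lt)) {
                // Jump straight past the comment; anything inside it is ignored.
                int end = html.indexOf("-->", lt + 4);
                if (end < 0) {
                    return null; // unterminated comment
                }
                pos = end + 3;
            } else if (html.regionMatches(true, lt, "<title>", 0, 7)) {
                int start = lt + 7;
                int close = indexOfIgnoreCase(html, "</title>", start);
                return close < 0 ? null : html.substring(start, close);
            } else {
                pos = lt + 1; // some other tag, keep scanning
            }
        }
    }

    public static void main(String[] args) {
        String page =
                "<html><head>\n"
              + "<!-- <title>Old title</title> -->\n"
              + "<title>Real title</title>\n"
              + "</head><body>Hello</body></html>";
        System.out.println(extractTitle(page)); // prints "Real title"
    }
}
```

Even in this toy form you can see the trade-off: it is quick and doesn't rewrite anything, but every new rule means more hand-written index juggling.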

This parser really is the core of SiteMesh. It's been our friend thanks to its speed and reliability. It's been our enemy because of its awkwardness to understand and change. For a couple of years it remained almost untouched, except when we occasionally poked at it from afar with a long pointy stick for the odd change. Three years later, Chris Miller and Hani Suleiman took the plunge and gave its guts an overhaul - making it six times faster! Very brave.

Despite its awkwardness, it proudly lived on and is still the primary ingredient of SiteMesh today. It's even been ported to VB.Net!

I've kept my eye on other HTML parsers, such as HotSAX, NekoHTML and TagSoup, always with the intention of implementing an easier-to-maintain parser, but I just couldn't get the performance to be anything like what Victor, Chris and Hani achieved.

The problem is that most HTML parsers try to represent an HTML document as a tree of nodes, like XML. This makes sense, as that's what HTML is meant to be. However, to do this, every single tag in a document must be analysed and balanced accordingly. This is hard, error-prone and adds a lot of overhead.

There's another approach though. The new parser focusses on ease-of-use and ability to customize, without compromising on performance and robustness. I hope you'll like it...

Update: Sorry, I forgot to mention Hani in the original posting of this. How could I forget!

Comments

Brett

Hi,

I've created a class file named parser.vb and pasted the VB.Net code (from the VB.Net link above) into it. I get an error on the second line here:

Namespace Parser
Public Class HtmlPageParser : Implements IPageParser


The error is "Type 'IPageParser' is not defined"

Any idea why the above occurs? Are there other files I need?

Thanks,
Brett


© 2001-2004, Joe Walnes

Powered by SiteMesh and Moveable Type.