Well. The W3C HTML Mailing List is going nuts. I’m somewhat worried that I have contributed to this, but I think this was bound to happen because of the disconnect between a lot of web developers and the WHATWG. It’s not that the WHATWG has been insular, closed and secret, but that a lot of web developers haven’t been interested enough to keep track of a lot of the decisions that were hammered out carefully there. And so we have a lot of issues being brought up about these decisions.
The one I care about most is error recovery. The WHATWG spec defines exactly how a document claiming to be HTML is to be parsed, even if that document contains gross errors. This is the polar opposite of the XML approach, which mandates that a conforming parser must not recover from an error. The XML approach (“draconian error handling”) is a terrible idea and is the absolute worst part of the whole XML spec.
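To make the contrast concrete, here is a minimal sketch in Python using only the standard library. It feeds the same malformed fragment to a strict XML parser and to a lenient HTML tokenizer. The names `MALFORMED`, `parse_as_xml`, `parse_as_html` and `TextCollector` are made up for this illustration, and Python's `html.parser` is only a forgiving tokenizer, not an implementation of the WHATWG parsing algorithm, so treat this as a picture of the two attitudes rather than of the spec itself.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample input: the <b> element is never closed.
MALFORMED = "<p>Hello <b>world</p>"


def parse_as_xml(markup):
    """Return True if the strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Draconian handling: the first well-formedness error is fatal.
        return False


class TextCollector(HTMLParser):
    """Records start tags and text, and keeps going past errors."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def parse_as_html(markup):
    """Return (start tags seen, concatenated text) from lenient parsing."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags, "".join(collector.text)


if __name__ == "__main__":
    print("XML parser accepted it:", parse_as_xml(MALFORMED))
    print("HTML tokenizer recovered:", parse_as_html(MALFORMED))
```

With the XML parser the reader gets nothing at all from this document. With the lenient parser the reader still gets the text, and the missing end tag is a problem for the author and a conformance checker to sort out.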
Proponents of draconian error handling seem to think that we can create a “clean, lean” language and rid the web of all its ills. They are worried that new browser vendors will have to live with the legacy of bad HTML that is so predominant on the web. They view the new HTML spec as a throwback to the days of tag soup. Unfortunately, these arguments do not pan out.
- There already exists a “clean, lean” syntax: XML. It hasn’t worked on the web, for a number of reasons (not the least of which is its draconian error handling). The last attempt at creating a “clean, lean” semantic language was XHTML 2, and we’ve seen how well that’s worked.
- There’s a large body of documents on the web that are crappy, horrible HTML. This is not going to change no matter what some standards body does.
- Because of the previous point, browser vendors are going to have to live with the legacy of bad HTML one way or the other. Adding another language to deal with will simply make their job harder.
- While the WHATWG spec does give exact handling for erroneous markup, it does not suggest or recommend that authors generate erroneous markup. Conformance checkers will reject bad markup.
Furthermore, there are a number of problems with the draconian stance.
- It’s bad practice. We’ve known since before the web that the best policy in these matters is Postel’s law: be conservative in what you produce and liberal in what you consume. Error handling is something that exists on the client (or consumer) side. The draconian approach is conservative in what it consumes, so it’s the wrong thing to do. Now, it’s important to recognize that this doesn’t mean that we should be lax in what we produce. We should produce very strict HTML. That’s best practice.
- It’s a bad division of labor. It puts the burden of conformance checking where it doesn’t belong. It doesn’t make sense to have a browser running all the overhead of a conformance checker when its user doesn’t even know what conformance checking (or HTML) is.
- It ignores the possibility of bugs. If a producer puts best effort into outputting strict HTML and fails, it is not an appropriate response to deny service to the consumer. Firstly, this is bad for business. I can’t imagine eBay adopting HTML5 if it means that a bug in their frontend causes lost business. Secondly, the draconian approach doesn’t necessarily eliminate bugs before they are deployed, because it is just an instance of testing. Testing is remarkably good at showing the presence of bugs and remarkably bad at showing their absence.
- In the case of HTML, it is problematic from an implementation standpoint. For one, it obviously requires the addition of another “rendering mode” (or actually a parsing mode) alongside existing modes. This in and of itself is difficult for user agents to maintain. Remember that the old modes can’t be eliminated, because doing so would (and always will) break compatibility with the rest of the web. So this is a burden placed on user agents. However, the most problematic issue is that of triggering the new mode. Some have suggested switching on the doctype, but the big browser makers have made it clear that this is not a workable solution. The alternative is to switch using a MIME type, but this completely breaks backwards compatibility, since current user agents won’t know how to handle it. So it’s impracticable.
- In the end, it doesn’t prevent tag soup because market pressure compels consumers to follow Postel’s law. This has already been demonstrated with XML. A lot of the XML producers on the web screw up, and XML consumers have followed suit by disobeying the standard. And of course, since there is no standard governing this disobedience, they all do it differently (as is done with HTML today), which means interoperability is lost.
The WHATWG approach:
- Allows the use of XML or the more liberal HTML syntax,
- Specifies a method of consuming the tag soup that exists on the web,
- Is one language that is well specified and therefore easy to implement,
- Encourages best practices for producers and consumers,
- Follows Postel’s law,
- Does not require User Agents to perform the unnecessary labor of conformance checking, nor to pester their users with error messages they don’t and shouldn’t care about,
- Acknowledges that authors are not and never will be perfect, and defines how best to deal with the inevitable problems that will come up,
- Does not require the User Agent to implement yet another mode, and
- Does all this in an interoperable way.
Finally, it’s important to realize that all this debate is moot. If the spec adopts draconian error handling, it will be DOA, as none of the major browser manufacturers will implement it.