one of the joys of Atom is that processing text in Atom is not quite as easy as it seems. the reason is that Atom supports three types of content for text constructs: text, html, and xhtml.
if you want to upconvert Atom feeds to pure XHTML (i.e., turn escaped HTML into proper XHTML), you have to deal with the possibility that HTML content may not only be non-XML, it may actually be broken HTML that has been produced manually or by broken tools.
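to make the three cases concrete, here is a minimal sketch in python (standard library only) that looks at the type attribute of an Atom text construct and returns what a processor would have to deal with. the function name `describe_text_construct` and the sample feed are made up for illustration; the only things taken from the Atom spec are the namespace URIs, the three type values, and the rule that a missing type means text.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
XHTML = "{http://www.w3.org/1999/xhtml}"


def describe_text_construct(element):
    """Return (kind, payload) for an Atom text construct such as atom:title.

    Hypothetical helper, not part of any library.
    """
    kind = element.get("type", "text")  # a missing type attribute means "text"
    if kind == "xhtml":
        # the payload is a single xhtml:div child that is already well-formed XML
        return kind, element.find(XHTML + "div")
    # for "text" and "html" the payload is a character string; for "html" it is
    # escaped markup that still has to be parsed, and it may well be broken
    return kind, element.text or ""


# made-up sample feed with one text construct of each type
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title type="text">plain &amp; simple</title>
  <subtitle type="html">&lt;p&gt;unclosed &lt;b&gt;bold</subtitle>
  <rights type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>clean</p></div></rights>
</feed>"""

feed = ET.fromstring(FEED)
title = describe_text_construct(feed.find(ATOM + "title"))
subtitle = describe_text_construct(feed.find(ATOM + "subtitle"))
rights = describe_text_construct(feed.find(ATOM + "rights"))
```

the text and xhtml cases are harmless: one is a plain string, the other is already an XML tree. the html case is the one that hurts, because the XML parser hands back a string of markup that nobody has checked.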
browsers have to deal with broken HTML all the time; studies of real-world HTML pages show that the overwhelming majority of pages on the web are broken. my guess is that this ratio will be much better for Atom, but there probably still are quite a number of feeds that contain HTML snippets generated by hand or by broken tools.
what this comes down to is that a good Atom implementation should process HTML content in a fault-tolerant way. browsers implement their own proprietary parsing, which not only leads to various interpretations of the same (broken) HTML, but also makes it hard to decide on the right way to fix broken HTML.
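as a rough illustration of what fault-tolerant can mean, here is a small python sketch built on the standard library's `html.parser.HTMLParser`. it is one possible repair strategy, not the HTML 5 algorithm and not what any browser does: it closes elements that were left open, drops stray end tags, writes void elements as empty elements, and re-escapes text and attribute values. the names `XhtmlFixer` and `to_xhtml` are invented for this example, and plenty of cases (invalid attribute names, nesting rules such as a p inside a p, comments) are simply ignored.

```python
import xml.etree.ElementTree as ET
from html import escape
from html.parser import HTMLParser

# elements that never have content and are written as <br /> etc.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class XhtmlFixer(HTMLParser):
    """Hypothetical helper: serialize tolerant HTML parsing events as XHTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments
        self.open_tags = []  # stack of elements that still need an end tag

    def handle_starttag(self, tag, attrs):
        seen = set()
        parts = []
        for name, value in attrs:
            if name in seen:
                continue  # XML forbids duplicate attributes; keep the first
            seen.add(name)
            # a valueless attribute such as "disabled" gets its name as value
            value = name if value is None else value
            parts.append(' %s="%s"' % (name, escape(value, quote=True)))
        if tag in VOID:
            self.out.append("<%s%s />" % (tag, "".join(parts)))
        else:
            self.out.append("<%s%s>" % (tag, "".join(parts)))
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: drop it
        # close everything that was opened after the matching start tag
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        # character references were already decoded, so escape again for XML
        self.out.append(escape(data, quote=False))

    def result(self):
        self.close()  # flush any buffered text
        while self.open_tags:
            self.out.append("</%s>" % self.open_tags.pop())
        return "".join(self.out)


def to_xhtml(html_text):
    fixer = XhtmlFixer()
    fixer.feed(html_text)
    return fixer.result()


fixed = to_xhtml("<p>unclosed <b>bold")
# wrapping the fragment in a div and parsing it checks well-formedness
ET.fromstring("<div>%s</div>" % fixed)
```

the catch is exactly the one described above: this sketch makes its own decisions about how to repair the input, and a browser might make different ones for the same broken snippet.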
luckily, HTML 5 introduces its own parsing model. it starts with parsing a sequence of bytes, and has a second phase which works on unicode characters. if you are operating in an XML environment, however, you already have proper unicode to work with, so an Atom processor can skip the byte parsing phase.
my initial idea was to try to implement that algorithm in XSLT, because it would be the ideal candidate for turning HTML in Atom into XHTML, so that such a cleanup process could be based entirely on XML tools. however, so far the specification of the parsing process looks pretty much impenetrable to me; it reads mostly like spaghetti code that has been translated into english (actually, the writing style of that part of the HTML spec reminds me a lot of the XML Schema spec, which also does not really excel at clarity).
i have some general doubts about using XSLT for a job like that, because this kind of parsing probably does not fit XSLT's language design too well. but at least i would be interested to see whether it's possible. or more generally, has anybody implemented the HTML 5 parsing algorithm? in any language? since it is presented in a rather unstructured way, it is hard to validate by eye. is there any validation that it works in principle? maybe a state machine or something along these lines? and is there some assurance that the text truthfully and completely describes the algorithm?
don't get me wrong, i really think it would be great to have a well-defined way of parsing HTML; that would be the right step towards making browser behavior more predictable. but the current spec may need some improvement. or maybe i just have to spend more time reading it? or maybe somebody else is interested in implementing it in XSLT?