Re: What route for the XHTML importer?

From: Ryan Pavlik <abiryan_at_ryand.net>
Date: Tue May 18 2004 - 01:30:37 CEST

Hubert Figuiere wrote:

>Hi,
>
>On Fri, 2004-05-14 at 14:31 +1000, Martin Sevior wrote:
>
>> As it currently stands the XHTML importer is very fragile and
>>very strict. If an HTML file doesn't exactly fit the XHTML spec we barf
>>on it.
>>
>>Now as you all know there are a lot of broken HTML files that render
>>just fine in IE, Mozilla and many other browsers.
>>
>>So my question is:
>>
>>Should we attempt to import broken HTML files or just barf on them and
>>say "Illegal document"?
>>
>>I would MUCH rather attempt to import them as well as possible.
>
>Here is my 0.02 CAD opinion:
>
>We should not limit ourselves to XHTML. Why? Simply because Joe Average
>wants to import "HTML documents" made by crappy software, and there is
>simply too much of it out there. What about HTML? We should be as
>permissive as we can.
>
>Things we can assume will fail:
>-frames
>-script
>-HTML markup generated by scripts
>
>Things we must eat:
>-mixed-case tags
>-unclosed tags
>-inconsistent tags
>-tags in the wrong context
>-some extensions
>
>Parsing HTML is a lot of work. I'd much prefer us "stealing" some
>code from another Free software project, something that would come from
>Lynx, links, w3m, khtml (or Apple's incarnation), gtkhtml, etc. Even
>Mozilla, though I'm not sure it wouldn't bring in too much.
>
>So in short: don't barf too soon on a document.
>
>For the test bed, just use wget <sigh>
>
>Hub
Can we somehow use tidylib (from HTML Tidy) to pretty up documents into
XHTML? I've had a fair amount of success using it stand-alone to clean
up some pretty lousy data, and it's good at cleaning up exported
HTML-like stuff from meaningless little applications like Word. Even if
only as some sort of pre-processing plugin, it might prove a good solution.
http://tidy.sourceforge.net/
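
To make the pre-processing idea concrete, here is a rough sketch of a
lenient "tag soup in, well-formed markup out" pass. It is not tidylib and
not AbiWord code: it uses only Python's standard-library html.parser, and
the names TagSoupToXhtml, normalize, is_well_formed and VOID_TAGS are
made up for this illustration. It covers only a few items from the list
above (mixed-case tags, unclosed tags, stray close tags) and assumes that
dropping comments and doctypes is acceptable.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Hypothetical, deliberately short list of elements that never take a close tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagSoupToXhtml(HTMLParser):
    """Re-emit sloppy HTML as well-formed markup (illustrative sketch only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments
        self.open_tags = []  # stack of elements still waiting for a close tag

    def handle_starttag(self, tag, attrs):
        # html.parser already lowercases tag and attribute names.
        # dict() drops duplicate attributes, which XML does not allow.
        attr_text = "".join(
            " %s=%s" % (name, quoteattr(name if value is None else value))
            for name, value in dict(attrs).items()
        )
        if tag in VOID_TAGS:
            self.out.append("<%s%s />" % (tag, attr_text))
        else:
            self.out.append("<%s%s>" % (tag, attr_text))
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray close tag: ignore it instead of failing
        # Close everything opened since the matching start tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape for XML.
        self.out.append(escape(data))


def normalize(html_text):
    """Return a well-formed rendering of html_text, closing leftover tags."""
    parser = TagSoupToXhtml()
    parser.feed(html_text)
    parser.close()
    while parser.open_tags:
        parser.out.append("</%s>" % parser.open_tags.pop())
    return "".join(parser.out)


def is_well_formed(xml_text):
    """Check that xml_text parses as XML with a single root element."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


fixed = normalize("<P>Hello <B>world<BR>bye</P>")
print(fixed)  # <p>Hello <b>world<br />bye</b></p>
print(is_well_formed(fixed))  # True
```

This only guarantees balanced tags. It knows nothing about HTML's
implied-end-tag rules (a second <LI> ends up nested inside the first),
about tags in the wrong context, or about frames and scripts, which is
exactly the kind of work a real cleaner such as HTML Tidy takes on.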

Ryan
Received on Tue May 18 01:24:44 2004
