Thursday, April 30, 2009

Structured Content: PDF to HTML

A while back I included the following as one of the areas of interest of the PDF/D Consortium:

Structured Documents and Single Sourcing: improving round-trips to document software
What did I mean by Structured Documents? For years Solid Documents has been converting PDF files to Word documents with a focus on retaining format and layout to allow customers to repurpose the content. While this is a great solution for a large amount of customers, it is not the only type of reconstruction that is interesting.

PDF is by nature a "document" format: the layout is in the form of pages. Content also needs to exist in alternate formats like a continuously flowing stream. Use cases for continuously flowing content include:
  • conversion to HTML to reflow for form factors other than "pages"
  • conversion to content management systems where structure is more important than layout and formatting
  • conversion for alternate readers for people with disabilities (text to speech, etc)
Reconstruction for these use cases focuses more on the structure of the document than on the layout and formatting. For example, we need to take unstructured PDF files and recognize columns, tables, lists, headers and footers, etc. This allows us to organize the content in a logical structure. Ultimately, we'll recognize topics and sections too so that we can produce logical hierarchies from plain old non-tagged PDF files.

One great example of where conventional PDF pages are not the most appropriate way to read a document are on small screens of handheld devices. For example, the typical Blackberry has a 3"x2" screen with a resolution something like 320x240 pixels.

In this diagram the little rectangles represent the viewing area on a Blackberry when viewing a document laid out on 8.5"x11" pages.

For 100% zoom we get about 100 pixels per inch. Think bad quality fax machine resolution.

For 50% we get a mere 50 pixels per inch which is worse than really bad fax quality. However, because of the layout, you need to move your little screen "window" both left-to-right and top-to-bottom to scroll the page. With or without columns, the amount of scrolling to read a single page is quite tedious.

There is already a much better format for reading documents at lower resolution. This format is HTML. Back in the 90's when the internet was becoming popular for web browsing, screen resolutions for desktop machines were in the same ball park as handheld device resolutions today. Even with a 640x480 pixel handheld screen resolution, the physical size is still a limitation, typically still 3"x2".

Assuming one can reconstruct PDF files as continuously flowing documents, then the next step would be to convert them to HTML. If the target device is a handheld, then the complexity of the HTML should be kept to a minimum. This means simplifying the fonts, using CSS for styles and using HTML elements that look great even in the simplest browsers. Based on experimentation we've seen that XHTML 1.0 is well supported by the HTML viewers on most handheld devices.

To see how well our PDF to HTML reconstruction works, you can experiment with it at without needing a mobile device.

Next, we want to make it really easy to use from any handheld device. Assuming you receive an e-mail on your Blackberry with a PDF document attached to it, simply forward it to
The service will convert it to HTML and e-mail it back. Alternatively, if you have a handheld device with an e-mail client that renders HTML then you can forward your e-mail to - it will be returned as an HTML e-mail rather than an HTML attachment.

We're interested in your feedback ( on our conversion and our HTML format. This PDF to HTML conversion functionality will be available for other uses in the next release of Solid Converter PDF.

No comments:

Post a Comment