Thursday, April 30, 2009

Structured Content: PDF to HTML

A while back I included the following as one of the areas of interest of the PDF/D Consortium:

Structured Documents and Single Sourcing: improving round-trips to document software
What did I mean by Structured Documents? For years Solid Documents has been converting PDF files to Word documents with a focus on retaining format and layout to allow customers to repurpose the content. While this is a great solution for a large amount of customers, it is not the only type of reconstruction that is interesting.

PDF is by nature a "document" format: the layout is in the form of pages. Content also needs to exist in alternate formats like a continuously flowing stream. Use cases for continuously flowing content include:
  • conversion to HTML to reflow for form factors other than "pages"
  • conversion to content management systems where structure is more important than layout and formatting
  • conversion for alternate readers for people with disabilities (text to speech, etc)
Reconstruction for these use cases focuses more on the structure of the document than on the layout and formatting. For example, we need to take unstructured PDF files and recognize columns, tables, lists, headers and footers, etc. This allows us to organize the content in a logical structure. Ultimately, we'll recognize topics and sections too so that we can produce logical hierarchies from plain old non-tagged PDF files.

One great example of where conventional PDF pages are not the most appropriate way to read a document are on small screens of handheld devices. For example, the typical Blackberry has a 3"x2" screen with a resolution something like 320x240 pixels.

In this diagram the little rectangles represent the viewing area on a Blackberry when viewing a document laid out on 8.5"x11" pages.


For 100% zoom we get about 100 pixels per inch. Think bad quality fax machine resolution.

For 50% we get a mere 50 pixels per inch which is worse than really bad fax quality. However, because of the layout, you need to move your little screen "window" both left-to-right and top-to-bottom to scroll the page. With or without columns, the amount of scrolling to read a single page is quite tedious.

There is already a much better format for reading documents at lower resolution. This format is HTML. Back in the 90's when the internet was becoming popular for web browsing, screen resolutions for desktop machines were in the same ball park as handheld device resolutions today. Even with a 640x480 pixel handheld screen resolution, the physical size is still a limitation, typically still 3"x2".

Assuming one can reconstruct PDF files as continuously flowing documents, then the next step would be to convert them to HTML. If the target device is a handheld, then the complexity of the HTML should be kept to a minimum. This means simplifying the fonts, using CSS for styles and using HTML elements that look great even in the simplest browsers. Based on experimentation we've seen that XHTML 1.0 is well supported by the HTML viewers on most handheld devices.

To see how well our PDF to HTML reconstruction works, you can experiment with it at www.pdf2mobile.com without needing a mobile device.

Next, we want to make it really easy to use from any handheld device. Assuming you receive an e-mail on your Blackberry with a PDF document attached to it, simply forward it to convert@pdf2mobile.com.
The service will convert it to HTML and e-mail it back. Alternatively, if you have a handheld device with an e-mail client that renders HTML then you can forward your e-mail to detach@pdf2mobile.com - it will be returned as an HTML e-mail rather than an HTML attachment.

We're interested in your feedback (standards@soliddocuments.com) on our conversion and our HTML format. This PDF to HTML conversion functionality will be available for other uses in the next release of Solid Converter PDF.

XML Comments in XMP

Nowhere in the XMP or RDF specifications is any mention of XML comments.


On validating our vast set of PDF files gathered from the wild, thanks to sites like www.freepdftoword.org, www.pdf2mobile.com and www.validatepdfa.com we have run into a multitude of cases where XMP produced by reputable (read "Adobe") products includes XML comments.

After consulting with collegues at Adobe, Solid Documents and PDFlib we reached consesus on this topic. 

Two conclusions:
  1. Since XML comments are legal XML and not explicitly prohibited, we conclude that they are allowed.
  2. XML comments may be dropped when converting PDF files based on this clause from the XML specification:
"an XML processor MAY, but need not, make it possible for an application to retrieve the text of comments"

Tuesday, April 28, 2009

Non-1.4 Features in PDF/A

Can PDF/A files include features from PDF 1.5, 1.6 or 1.7 (ISO 32000-1)?

This is a recurring FAQ that comes from a superficial understanding of the PDF/A ISO-19005-1 specification. While PDF/A-1 was defined based on the PDF Reference for 1.4 of the format, it quite clearly allows non-1.4 features. One of the great features of PDF is that "unknown things" are ignored by conforming readers. This feature is part of all PDF specifications that have ever existed, including 1.4 on which PDF/A-1 is based.

To quote Leonard Rosenthol, PDF Standards Architect for Adobe Systems:

"There is no question about this by the committee. In fact, we just rediscussed this last week at our meeting in Germany due to a comment from one of the various national bodies about potentially changing this position (aka allowing 'private data' or 'unknown keys') and the agreement was that we are still in agreement that 'unknown things' are allowed PROVIDED THEY DO NOT CHANGE the visual appearance of the page."
The ISO 19005-1 Application Notes from AIIM provide answers to many of these issues. To quote the Application Notes:

“A conforming PDF/A file has three kinds of content:
  • content that affects the final visual reproduction of the composite entity;
  • other visual content such as annotations, form fields, etc.
  • non-printing content such as bookmarks, metadata, etc.
The PDF/A-1 standard states that a conforming file may include valid PDF features beyond those described in the standard provided they do not affect final visual reproduction of the composite entity and are included as part of PDF Reference 1.4.”


What does this mean?


One of the goals of PDF/A is to make the appearance of PDF files predictable and reproducible over time. Including "private" features does not affect this goal or break PDF/A-1 assuming the rules are followed.

Examples

If a feature from a more recent version of PDF can be ignored by a PDF/A-1 Compliant Reader without affecting the visual appearance of the PDF then this goal is met. An example of such a feature would be /PrintScaling (PDF 1.6) in the /ViewerPreferences dictionary. Any “future” feature that does not affect appearance is exactly the same as a “private” feature between reader and writer: conforming PDF/A Readers should ignore it.

Features from later versions of PDF which do indeed affect visual appearance are explicitly prohibited in the PDF/A 19005-1 specification. For example, 6.5.2 states that annotation types not defined in PDF Reference 1.4 are prohibited (along with a few that are defined in 1.4). This means newer annotation types like 3D are clearly prohibited. Another good example of a "feature from the future" that clearly alters appearance is /UserUnits in the /Page dictionary which is obviously prohibited because it certainly affects appearance.

Other features are implicitly forbidden. For example, an image compressed using JPXDecode filter (JPEG2000 - PDF 1.5) would be ignored by a conforming PDF/A-1 reader but the absence of this image would affect the visual appearance of the PDF. Hence, JPEG2000 should not be used in PDF/A-1 files. Another example is setting BitsPerPixel to 16 for images (PDF 1.5): since this value was introduced after 1.4, it is obviously forbidden because ignoring it would lead to undefined behavior in readers capable only of rendering 1.4. Many of these cases are covered explicitly in the PDF/A Competence Center's Isartor Test Suite and also the PDF/D Consortium extensions to the PDF/A test suite: PDF/D Compliance Tests.

Comments and questions most welcome.

Tuesday, April 21, 2009

Ambiguities in PDF/A Extension Schemas

The PDF/A XMP Technotes are not clear on the subject of optional/required for properties of the extension schemas.


Discussion with engineers from other PDF/A companies has resulted in the "if it doesn't say 'Optional' then it must be 'Required' assumption" which most of us are trying to abide by. The only properties in any of the extension schemas marked as Optional are:
  • pdfaSchema:schema - Optional description of schema
  • pdfaType:field - Optional description of the structured fields
That's it!  All the rest must thus be 'Required'.  Not so fast!

If this were true then both pdfaSchema:property and pdfaSchema:valueType are always required which means that all extension schemas must include both properties and custom value types. When we were creating RDF definitions for all the pre-defined PDF/A schemas, we noticed this issue because it made it impossible to correctly define the "Dimensions" valueType schema: this schema has one custom value type and no properties.

Exception #1: at least one of pdfaSchema:property and pdfaSchema:valueType should be present.

We've noticed with our vast test set accumulated through our free online services like www.freepdftoword.org and www.validatepdfa.com that several Adobe products create schemas which omit one or more of pdfaProperty:description, pdfaType:description and pdfaField:description. All three of these properties are purely descriptive in the same sense as the two properties mentioned about as 'Optional'.  We believe that these fields should also be optional but, for now, our validator still flags their absence as an error (not a fatalError though since we can add these fields to the schemas, containing filler content, to "fix" the issue).

Proposed Exception #2: pdfaProperty:description, pdfaType:description and pdfaField:description should be 'Optional' properties. Existing PDF/A creators are omitting them and it makes sense.

A value type containing fields is required to have a pdfaType:namespaceURI property. We've noticed customer samples created by reputable products which omit this field. In the case of the omission, the assumed namespace for the value type is simply the same as the namespace of the schema with a slash and the name of the type appended to it.  Our validator marks this issue as an Error to (and not a fatalError) since it can easily be repaired by explicitly inserting the assumed namespace.

Example:
Schema namespace:
  http://www.acme.com/ns/email/1/
Value type name:
  mailaddress
Assumed namespace of value type if pdfaType:namespaceURI  is absent:
   http://www.acme.com/ns/email/1/mailaddress/

Proposed Exception #3: if pdfaType:namespaceURI is absent, construct a default namespace for the property as described above.