Saturday, September 19, 2009

PDF/A: what's new in Solid PDF Tools v6

Last week Solid Documents released an upgrade to Solid PDF Tools. In a nutshell, with Solid PDF Tools you can:

For a complete list visit the Solid PDF Tools features page.

With version 6, the product now exports PDF/A validation and conversion reports as per the specifications from the PDF/D Consortium. The validator and converter have also been greatly improved to:
  • provide much improved support for XMP validation
  • pass 100% of the Bavaria Test Suite cases (v5 already passed the Isartor cases)
With the work we've done to improve our PDF/A technology, we think version 6 is now one of the best PDF/A tools on the market. This PDF/A functionality is also available to both .NET and C++ developers through our Solid Framework SDK product.

Thursday, September 10, 2009

RDF for PDF/A-1 Predefined XMP Schemas Updated

Today we shared the latest update of the pdfa.rdf (now 1.1) schema used by the Solid Documents PDF/A Validator and PDF to PDF/A Converter.


At over 2,500 lines, this is probably the largest use of the PDF/A extension schema definitions on the planet. Our own pdfaValidate schema has been updated too. It now includes some new properties ('default', 'subst', 'predefined' and 'count') to help us use this RDF schema to build our data-driven PDF/A XMP validator.

Feel free to use this data to build your own PDF/A XMP validator but remember to give back: if you have corrections or improvements, please share them with the PDF/A community. Better still, join the PDF/D Consortium.

Friday, July 24, 2009

Info Dictionary vs XMP Metadata

The PDF/A-1 specification goes to great lengths to describe a mapping between the Entries in the legacy PDF Document Information Dictionary and their corresponding values in the more modern Document XMP Metadata. Section 3.4 in TechNote 0003 describes how the values in the Document Information Dictionary must be mirrored in XMP. Section 3.3 describes the requirements for Document Information Entries.


However, these requirements are not symmetric. What I mean by this is that it is perfectly legal for a PDF/A document to contain Document XMP Metadata and not to include a Document Information Dictionary.

Keeping the entries of the legacy and the more modern structures in sync is a headache for the software developer, and this pursuit is littered with ambiguous scenarios. For example, many of the XMP Metadata fields can have multiple values: multiple dc:title values for multiple languages, or a seq of multiple authors for dc:creator rather than a single author. Each Entry in the legacy Document Information Dictionary, by contrast, is a simple string. There are no conventions on how to order or delimit these strings when mapping multiple fields from XMP to these single string values.
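To make the ambiguity concrete, here is a minimal Python sketch. The author names, the helper functions and both delimiters are assumptions invented for illustration; none of them comes from the PDF/A specification or from any Solid Documents product. It flattens a multi-valued dc:creator into a single string, as an Info dictionary entry would require, and then tries to recover the original list.

```python
# Hypothetical dc:creator values; the names and delimiters are illustrative only.
creators = ["Smith, John", "Doe, Jane"]

def flatten(values, delimiter):
    """Join several XMP values into one Info-dictionary style string."""
    return delimiter.join(values)

def unflatten(text, delimiter):
    """Naively recover the list by splitting on the same delimiter."""
    return text.split(delimiter)

by_comma = flatten(creators, ", ")
by_semicolon = flatten(creators, "; ")

print(by_comma)      # Smith, John, Doe, Jane
print(by_semicolon)  # Smith, John; Doe, Jane

# The comma round trip yields four "authors"; the semicolon one happens to work.
print(unflatten(by_comma, ", ") == creators)      # False
print(unflatten(by_semicolon, "; ") == creators)  # True
```

The semicolon only survives because this particular data contains none. Without an agreed convention, two tools can pick different delimiters and each will misread the other's output.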

The solution is simple: don't use Document Information Dictionaries! Accept that Document XMP Metadata is the way forward and move on.

We'll be adding this to PDF/D as a constraint: the Info dictionary will be illegal in PDF/D - legacy software be damned.

Wednesday, July 1, 2009

Solid Framework v6 includes PDF/A Validation

It was a quiet month here at Pragmatic PDF central. We've been hard at work on the finishing touches of our Solid Framework v6 upgrade.


Major changes include:
  • new enterprise license model (in addition to republisher model)
  • PDF/A Validation
  • PDF to PDF/A Conversion
  • PDF to flowing HTML conversion
  • support for 64 bit Windows
We also have a much more elaborate set of sample code than before in the form of some free applications. Solid PDF Navigator is 100% free and illustrates what can be achieved with the Free license of SolidFramework. Solid PDF Mechanic uses the new free Developer license to allow exploration of all the premium Solid Framework features. All features are fully functional but include watermarks and "not for resale" text. To take advantage of either the Free or Developer license, simply download Solid Framework and start using it immediately.


PDF viewer including Page Pane and standard navigation controls similar to Acrobat Reader.

Explorer view allows navigation and examination of PDF internals.

Monday, May 11, 2009

PDF: right up there with COBOL

And this is a good thing.


PDF is an amazing document format: it is both backward and forward compatible:
  • PDF 1.1 files from 1993 can still be perfectly understood by today's PDF tools
  • PDF files created by today's tools can still be viewed by older PDF software
Could this be one of the reasons, along with technological soundness, why PDF is ubiquitous?

What other parts of our industry can claim such success without leaving data or customers behind every three years after "upgrade season"?

Not Google
A few years back Google offered a very simple Google SOAP Search API to allow 3rd parties to easily use the Google search engine to add native search to their websites. By native, I mean no ads from Google and 100% custom UI. We used this API as a quick fix to get search on the Solid Documents web site. In 2006, Google "deprecated" this API and required web developers to migrate to their new and improved AJAX version of the same thing. In August 2009, the API will cease to function altogether.

To be fair, the service was free.  However, that's supposed to be the benefit of going with Google rather than Microsoft.  It is hardly a benefit if they pull the rug out from under you. The least they could have done was provide some sort of legacy wrapper for the new API.

If you cannot rely on an API to exist for the life of your business, then it would be foolish to build your infrastructure on it. Luckily search was a cheap way for us to learn to steer well clear of any "enterprise" offerings from Google in future. No, we will not be using Google Apps (the "enterprise" version of GMail plus Google Docs). And we certainly will not be building anything using the Google App Engine. I don't care how cool it is: I'm willing to bet that your app will no longer be running 10 years from now. This Blog uses a free service acquired by Google. Hmm....

Not Microsoft
What set me off on this tirade was our hosted Exchange upgrade this week. We drank the Kool-Aid and outsourced 'generic' parts of our IT including our e-mail. This week they upgraded us from Exchange 2003 to Exchange 2007. 

On the positive side, they didn't lose my e-mail. However, the transition has been anything but smooth. It included instructions like clearing your Blackberry to 'out of box' state. In other words, assuming that the only thing you do with your Blackberry is use it as a client for their e-mail server. Most people I know have at least one other app that they regularly use on their Blackberry ("telephone" anyone?). So, plenty of time was wasted backing up and restoring address books and re-installing 3rd party applications. 

Pretty much the only thing that worked after the transition was e-mail. One of the primary reasons we originally switched from our own simple open source e-mail server to Exchange was to take advantage of collaborative features of Outlook like shared calendars and address books. None of that worked after the transition.

If it ain't broke ..
.. don't fix it! One of the key features expected from any "Enterprise Solution" should be longevity. Just like railways and roads, one should expect a bit of maintenance over the lifetime of the tool but one does not expect to have to toss the whole thing out and replace it every 4 years. Some of the open source projects deal with this issue a little better but that's not all roses either: anyone else remember the upgrade to PHP 5 or is it just me?

I understand that sometimes you need to throw out the legacy to make progress. Shutting down analog TV in the US is a great example of this. However, when it comes to expectations for enterprise business solutions, 4 years is a very low bar. For Exchange, part of the blame goes to Apptix and part to Microsoft:
  • When I look for Exchange 2003 on Microsoft's site I'm redirected to the Exchange 2010 pages. You have to dig deep on technet to find 2003 info. Even then, it is not clear how long Microsoft intends to support it.
  • Apptix should have offered the 2007 migration as an option rather than a compulsory disruption to all of their clients and their businesses. Part of their plan should have been to keep running Exchange 2003 for Luddites like me. Remind me again what the benefit of the 2007 upgrade was?
In the event that breaking changes to an API, file format or service are unavoidable, a responsible enterprise service provider will provide a smooth transition path to their customers.

Back to Solid PDF
Aside from one small change in the way table reconstruction worked in a very early version of Solid Converter PDF, the publicly exposed APIs of our SDK have remained constant for 7 years now. That first minor change we made taught us our lesson: even as we've migrated from a COM SDK to our more recent .NET Solid Framework, we've taken great care to avoid breaking customer apps that rely on our older APIs.

When we released Solid Script, our command line syntax for our desktop applications had to change but we offered a legacy wrapper that translates old command lines into the newer scripts. Even this is not a big issue though since the software we created 7 years ago still works just as well as it did the day it was purchased. No forced upgrades due to changing file formats or 'deprecated' APIs.

When PDF/A was announced in 2005 we immediately recognized the value this added to an already awesome file format and decided to make archiving functionality one of the pillars of our business. The PDF/A standard underlines the already proven long term vision we have for both customer documents and PDF products:
  • Think 40 years, not 4 years
  • Think incremental non-breaking improvements, not disruptive change
Wouldn't it be grand if the bigger players had a similar definition of long term? With all the focus today on sustainability and conservation, why do they continue to waste our time, money and energy?

Thursday, April 30, 2009

Structured Content: PDF to HTML

A while back I included the following as one of the areas of interest of the PDF/D Consortium:

Structured Documents and Single Sourcing: improving round-trips to document software
What did I mean by Structured Documents? For years Solid Documents has been converting PDF files to Word documents with a focus on retaining format and layout to allow customers to repurpose the content. While this is a great solution for a large number of customers, it is not the only type of reconstruction that is interesting.

PDF is by nature a "document" format: the layout is in the form of pages. Content also needs to exist in alternate formats like a continuously flowing stream. Use cases for continuously flowing content include:
  • conversion to HTML to reflow for form factors other than "pages"
  • conversion to content management systems where structure is more important than layout and formatting
  • conversion for alternate readers for people with disabilities (text to speech, etc)
Reconstruction for these use cases focuses more on the structure of the document than on the layout and formatting. For example, we need to take unstructured PDF files and recognize columns, tables, lists, headers and footers, etc. This allows us to organize the content in a logical structure. Ultimately, we'll recognize topics and sections too so that we can produce logical hierarchies from plain old non-tagged PDF files.

One great example of where conventional PDF pages are not the most appropriate way to read a document is the small screen of a handheld device. For example, the typical Blackberry has a 3"x2" screen with a resolution of something like 320x240 pixels.

In this diagram the little rectangles represent the viewing area on a Blackberry when viewing a document laid out on 8.5"x11" pages.


For 100% zoom we get about 100 pixels per inch. Think bad quality fax machine resolution.

At 50% zoom we get a mere 50 pixels per inch, which is worse than really bad fax quality. Even so, because of the page layout, you need to move your little screen "window" both left-to-right and top-to-bottom to scroll the page. With or without columns, the amount of scrolling to read a single page is quite tedious.
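The arithmetic behind these figures can be sketched in a few lines of Python. This is a rough estimate: it assumes the 320x240 pixel, 3"x2" screen and the 8.5"x11" page mentioned above, uses only the horizontal pixel density, and ignores margins and overlap between screens. The function name view_stats is made up for this example.

```python
import math

SCREEN_PX = (320, 240)   # assumed screen resolution in pixels
SCREEN_IN = (3.0, 2.0)   # assumed physical screen size in inches
PAGE_IN = (8.5, 11.0)    # letter-size page in inches

def view_stats(zoom):
    """Return (pixels per page inch, screens needed to cover one page)."""
    # Horizontal pixels available for each inch of the page at this zoom.
    ppi = SCREEN_PX[0] / SCREEN_IN[0] * zoom
    # Area of the page visible in one screen, in page inches.
    window_w = SCREEN_IN[0] / zoom
    window_h = SCREEN_IN[1] / zoom
    # Number of distinct screen positions needed to see the whole page.
    screens = math.ceil(PAGE_IN[0] / window_w) * math.ceil(PAGE_IN[1] / window_h)
    return ppi, screens

for zoom in (1.0, 0.5):
    ppi, screens = view_stats(zoom)
    print(f"{zoom:.0%} zoom: about {ppi:.0f} pixels per inch, {screens} screens per page")
```

Under these assumptions a page needs 18 screen positions at 100% zoom (3 across by 6 down) and 6 at 50% zoom (2 across by 3 down).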

There is already a much better format for reading documents at lower resolution. This format is HTML. Back in the 90's when the internet was becoming popular for web browsing, screen resolutions for desktop machines were in the same ball park as handheld device resolutions today. Even with a 640x480 pixel handheld screen resolution, the physical size is still a limitation, typically still 3"x2".

Assuming one can reconstruct PDF files as continuously flowing documents, then the next step would be to convert them to HTML. If the target device is a handheld, then the complexity of the HTML should be kept to a minimum. This means simplifying the fonts, using CSS for styles and using HTML elements that look great even in the simplest browsers. Based on experimentation we've seen that XHTML 1.0 is well supported by the HTML viewers on most handheld devices.

To see how well our PDF to HTML reconstruction works, you can experiment with it at www.pdf2mobile.com without needing a mobile device.

Next, we want to make it really easy to use from any handheld device. Assuming you receive an e-mail on your Blackberry with a PDF document attached to it, simply forward it to convert@pdf2mobile.com.
The service will convert it to HTML and e-mail it back. Alternatively, if you have a handheld device with an e-mail client that renders HTML then you can forward your e-mail to detach@pdf2mobile.com - it will be returned as an HTML e-mail rather than an HTML attachment.

We're interested in your feedback (standards@soliddocuments.com) on our conversion and our HTML format. This PDF to HTML conversion functionality will be available for other uses in the next release of Solid Converter PDF.

XML Comments in XMP

Nowhere in the XMP or RDF specifications is there any mention of XML comments.


While validating our vast set of PDF files gathered from the wild (thanks to sites like www.freepdftoword.org, www.pdf2mobile.com and www.validatepdfa.com), we have run into a multitude of cases where XMP produced by reputable (read "Adobe") products includes XML comments.

After consulting with colleagues at Adobe, Solid Documents and PDFlib we reached consensus on this topic.

Two conclusions:
  1. Since XML comments are legal XML and not explicitly prohibited, we conclude that they are allowed.
  2. XML comments may be dropped when converting PDF files based on this clause from the XML specification:
"an XML processor MAY, but need not, make it possible for an application to retrieve the text of comments"
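Both conclusions can be seen with an ordinary XML parser. The Python sketch below uses the standard library's xml.etree.ElementTree, not our validator, and the XMP packet is a cut-down sample invented for this example. By default the parser accepts the comment and silently drops it. Asking the tree builder to keep comments (insert_comments, available from Python 3.8) preserves it.

```python
import xml.etree.ElementTree as ET

# Minimal, made-up XMP packet containing an XML comment.
XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <!-- a comment written by the producing application -->
    <rdf:Description rdf:about="" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
      <pdfaid:part>1</pdfaid:part>
      <pdfaid:conformance>B</pdfaid:conformance>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

PDFAID = "{http://www.aiim.org/pdfa/ns/id/}"

# Default parser: the comment is legal input but is not kept in the tree.
root = ET.fromstring(XMP)
default_output = ET.tostring(root, encoding="unicode")

# Parser configured to keep comments (Python 3.8 or later).
keeping = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root_with_comments = ET.fromstring(XMP, parser=keeping)
kept_output = ET.tostring(root_with_comments, encoding="unicode")

print("<!--" in default_output)            # False
print("<!--" in kept_output)               # True
print(root.find(f".//{PDFAID}part").text)  # 1
```

The metadata itself is unaffected either way: the pdfaid:part value reads the same whether or not the comment survives.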