Pragmatic PDF: 2009

Saturday, September 19, 2009

PDF/A: what's new in Solid PDF Tools v6

Last week Solid Documents released an upgrade to Solid PDF Tools. In a nutshell, with Solid PDF Tools you can:

convert from PDF to Word
export tables from PDF to Excel
scan directly to Word
edit PDF files (page manipulation, text touchup, etc.)
validate PDF/A
convert PDF to PDF/A
create structured PDF files from Office applications

For a complete list visit the Solid PDF Tools features page.

With version 6, the product now exports PDF/A validation and conversion reports as per the specifications from the PDF/D Consortium. The validator and converter have also been greatly improved to:

provide much improved support for XMP validation
pass 100% of the Bavaria Test Suite cases (v5 already passed the Isartor cases)

With the work we've done to improve our PDF/A technology, we think version 6 is now one of the best PDF/A tools on the market. This PDF/A functionality is also available to both .NET and C++ developers through our Solid Framework SDK product.

Thursday, September 10, 2009

RDF for PDF/A-1 Predefined XMP Schemas Updated

Today we shared the latest update of the pdfa.rdf (now 1.1) schema used by the Solid Documents PDF/A Validator and PDF to PDF/A Converter.

At over 2500 lines long this is probably the largest use of the PDF/A extension schema definitions on the planet. Our own pdfaValidate schema has been updated too. It now includes some new properties ('default', 'subst', 'predefined' and 'count') to help us use this RDF schema to build our data-driven PDF/A XMP validator.

Feel free to use this data to build your own PDF/A XMP validator but remember to give back: if you have corrections or improvements, please share them with the PDF/A community. Better still, join the PDF/D Consortium.

Friday, July 24, 2009

Info Dictionary vs XMP Metadata

The PDF/A-1 specification goes to great lengths to describe a mapping between the Entries in the legacy PDF Document Information Dictionary and their corresponding values in the more modern Document XMP Metadata. Section 3.4 in TechNote 0003 describes how the values in the Document Information Dictionary must be mirrored in XMP. Section 3.3 describes the requirements for Document Information Entries.

However, these requirements are not symmetric. What I mean by this is that it is perfectly legal for a PDF/A document to contain Document XMP Metadata and not to include a Document Information Dictionary.

Keeping the entries of the legacy and the more modern structures in sync is a headache for the software developer and this pursuit is littered with ambiguous scenarios. For example, many of the XMP Metadata fields can have multiple values: multiple dc:title values for multiple languages, or a seq of multiple authors for dc:creator rather than a single author. Each Entry in the legacy Document Information Dictionary is a simple string. There are no conventions on how to order or delimit these strings when mapping multiple fields from XMP to these single string values.
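To make the ambiguity concrete, here is a minimal Python sketch. The function name flatten_creators, the sample authors and both delimiters are made up for illustration; no specification defines any of them, which is exactly the problem.

```python
# Hypothetical illustration: flattening an XMP dc:creator sequence into the
# single Author string of the Info dictionary. The delimiters are arbitrary
# choices, not something any specification defines.

def flatten_creators(creators, delimiter):
    """Join a list of author names into one Info-dictionary style string."""
    return delimiter.join(creators)

creators = ["Smith, John", "Doe, Jane"]  # made-up dc:creator values

with_comma = flatten_creators(creators, ", ")
with_semicolon = flatten_creators(creators, "; ")

# Two writers that pick different delimiters produce different Author strings
# for the same XMP, and splitting on the comma does not recover the original.
round_trip = with_comma.split(", ")
```

The same XMP yields two different Author strings, and the comma-delimited one cannot be split back into the original two authors.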

The solution is simple: don't use Document Information Dictionaries! Accept that Document XMP Metadata is the way forward and move on.

We'll be adding this to PDF/D as a constraint: the Info dictionary will be illegal in PDF/D - legacy software be damned.

Wednesday, July 1, 2009

Solid Framework v6 includes PDF/A Validation

It was a quiet month here at Pragmatic PDF central. We've been hard at work on the finishing touches of our Solid Framework v6 upgrade.

Major changes include:

new enterprise license model (in addition to republisher model)
PDF/A Validation
PDF to PDF/A Conversion
PDF to flowing HTML conversion
support for 64 bit Windows

We also have a much more elaborate set of sample code than before in the form of some free applications. Solid PDF Navigator is 100% free and illustrates what can be achieved with the Free license of SolidFramework. Solid PDF Mechanic uses the new free Developer license to allow exploration of all the premium Solid Framework features. All features are fully functional but include watermarks and "not for resale" text. To take advantage of either the Free or Developer license, simply download Solid Framework and start using it immediately.

PDF viewer including Page Pane and standard navigation controls similar to Acrobat Reader.

Explorer view allows navigation and examination of PDF internals.

Monday, May 11, 2009

PDF: right up there with COBOL

And this is a good thing.

PDF is an amazing document format: it is both backward and forward compatible:

PDF 1.1 files from 1993 can still be perfectly understood by today's PDF tools
PDF files created by today's tools can still be viewed by older PDF software

Could this be one of the reasons, along with technological soundness, why PDF is ubiquitous?

What other parts of our industry can claim such success without leaving data or customers behind every three years after "upgrade season"?

Not Google

A few years back Google offered a very simple Google SOAP Search API to allow 3rd parties to easily use the Google search engine to add native search to their websites. By native, I mean no ads from Google and 100% custom UI. We used this API as a quick fix to get search on the Solid Documents web site. In 2006, Google "deprecated" this API and required web developers to migrate to their new and improved AJAX version of the same thing. In August 2009, the API will cease to function altogether.

To be fair, the service was free. However, that's supposed to be the benefit of going with Google rather than Microsoft. It is hardly a benefit if they pull the rug out from under you. The least they could have done was provide some sort of legacy wrapper for the new API.

If you cannot rely on an API to exist for the life of your business, then it would be foolish to build your infrastructure on it. Luckily search was a cheap way for us to learn to steer well clear of any "enterprise" offerings from Google in future. No, we will not be using Google Apps (the "enterprise" version of GMail plus Google Docs). And we certainly will not be building anything using the Google App Engine. I don't care how cool it is: I'm willing to bet that your app will no longer be running in 10 years from now. This Blog uses a free service acquired by Google. Hmm....

Not Microsoft

What set me off on this tirade was our hosted Exchange upgrade this week. We drank the Kool-Aid and outsourced 'generic' parts of our IT including our e-mail. This week they upgraded us from Exchange 2003 to Exchange 2007.

On the positive side, they didn't lose my e-mail. However, the transition has been anything but smooth. It included instructions like clearing your Blackberry to 'out of box' state. In other words, assuming that the only thing you do with your Blackberry is use it as a client for their e-mail server. Most people I know have at least one other app that they regularly use on their Blackberry ("telephone" anyone?). So, plenty of time was wasted backing up and restoring address books and re-installing 3rd party applications.

Pretty much the only thing that worked after the transition was e-mail. One of the primary reasons we originally switched from our own simple open source e-mail server to Exchange was to take advantage of collaborative features of Outlook like shared calendars and address books. None of that worked after the transition.

If it ain't broke ..

.. don't fix it! One of the key features expected from any "Enterprise Solution" should be longevity. Just like railways and roads, one should expect a bit of maintenance over the lifetime of the tool but one does not expect to have to toss the whole thing out and replace it every 4 years. Some of the open source projects deal with this issue a little better but that's not all roses either: anyone else remember the upgrade to PHP 5 or is it just me?

I understand that sometimes you need to throw out the legacy to make progress. Shutting down analog TV in the US is a great example of this. However, when it comes to expectations for enterprise business solutions, 4 years is a very low bar. For Exchange, part of the blame goes to Apptix and part to Microsoft:

When I look for Exchange 2003 on Microsoft's site I'm redirected to the Exchange 2010 pages. You have to dig deep on technet to find 2003 info. Even then, it is not clear how long Microsoft intends to support it.
Apptix should have offered the 2007 migration as an option rather than a compulsory disruption to all of their clients and their businesses. Part of their plan should have been to keep running Exchange 2003 for Luddites like me. Remind me again what the benefit of the 2007 upgrade was?

In the event that breaking changes to an API, file format or service are unavoidable, a responsible enterprise service provider will provide a smooth transition path to their customers.

Back to Solid PDF

Aside from one small change in the way table reconstruction worked in a very early version of Solid Converter PDF, the publicly exposed APIs of our SDK have remained constant for 7 years now. That first minor change we made taught us our lesson: even as we've migrated from a COM SDK to our more recent .NET Solid Framework, we've taken great care to avoid breaking customer apps that rely on our older APIs.

When we released Solid Script, our command line syntax for our desktop applications had to change but we offered a legacy wrapper that translates old command lines into the newer scripts. Even this is not a big issue though since the software we created 7 years ago still works just as well as it did the day it was purchased. No forced upgrades due to changing file formats or 'deprecated' APIs.

When PDF/A was announced in 2005 we immediately recognized the value this added to an already awesome file format and decided to make archiving functionality one of the pillars of our business. The PDF/A standard underlines the already proven long term vision we have for both customer documents and PDF products:

Think 40 years, not 4 years
Think incremental non-breaking improvements, not disruptive change

Wouldn't it be grand if the bigger players had a similar definition of long term? With all the focus today on sustainability and conservation, why do they continue to waste our time, money and energy?

Thursday, April 30, 2009

Structured Content: PDF to HTML

A while back I included the following as one of the areas of interest of the PDF/D Consortium:

Structured Documents and Single Sourcing: improving round-trips to document software

What did I mean by Structured Documents? For years Solid Documents has been converting PDF files to Word documents with a focus on retaining format and layout to allow customers to repurpose the content. While this is a great solution for a large number of customers, it is not the only type of reconstruction that is interesting.

PDF is by nature a "document" format: the layout is in the form of pages. Content also needs to exist in alternate formats like a continuously flowing stream. Use cases for continuously flowing content include:

conversion to HTML to reflow for form factors other than "pages"
conversion to content management systems where structure is more important than layout and formatting
conversion for alternate readers for people with disabilities (text to speech, etc)

Reconstruction for these use cases focuses more on the structure of the document than on the layout and formatting. For example, we need to take unstructured PDF files and recognize columns, tables, lists, headers and footers, etc. This allows us to organize the content in a logical structure. Ultimately, we'll recognize topics and sections too so that we can produce logical hierarchies from plain old non-tagged PDF files.

One great example of where conventional PDF pages are not the most appropriate way to read a document is on the small screens of handheld devices. For example, the typical Blackberry has a 3"x2" screen with a resolution something like 320x240 pixels.

In this diagram the little rectangles represent the viewing area on a Blackberry when viewing a document laid out on 8.5"x11" pages.

For 100% zoom we get about 100 pixels per inch. Think bad quality fax machine resolution.

For 50% we get a mere 50 pixels per inch which is worse than really bad fax quality. However, because of the layout, you need to move your little screen "window" both left-to-right and top-to-bottom to scroll the page. With or without columns, the amount of scrolling to read a single page is quite tedious.
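The scrolling cost can be estimated with a little arithmetic. This is a rough sketch using the screen and page sizes quoted above; it assumes that 100% zoom shows the page at physical size, and the function name screens_per_page is made up for illustration.

```python
import math

PAGE_W_IN, PAGE_H_IN = 8.5, 11.0      # US Letter page, from the text
SCREEN_W_IN, SCREEN_H_IN = 3.0, 2.0   # typical Blackberry screen, from the text
SCREEN_W_PX = 320                     # horizontal resolution, from the text

def screens_per_page(zoom):
    """Return (pixels per page inch, number of screenfuls) at a zoom factor.

    Assumes 100% zoom shows the page at physical size, so the screen
    covers SCREEN_W_IN / zoom by SCREEN_H_IN / zoom inches of the page.
    """
    ppi = SCREEN_W_PX / SCREEN_W_IN * zoom
    cols = math.ceil(PAGE_W_IN / (SCREEN_W_IN / zoom))
    rows = math.ceil(PAGE_H_IN / (SCREEN_H_IN / zoom))
    return ppi, cols * rows

full = screens_per_page(1.0)   # roughly 107 pixels per inch, 3 x 6 = 18 screenfuls
half = screens_per_page(0.5)   # roughly 53 pixels per inch, 2 x 3 = 6 screenfuls
```

Under these assumptions a single page takes 18 screenfuls at 100% zoom, and still 6 screenfuls at a barely legible 50%.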

There is already a much better format for reading documents at lower resolution. This format is HTML. Back in the 90's when the internet was becoming popular for web browsing, screen resolutions for desktop machines were in the same ball park as handheld device resolutions today. Even with a 640x480 pixel handheld screen resolution, the physical size is still a limitation, typically still 3"x2".

Assuming one can reconstruct PDF files as continuously flowing documents, then the next step would be to convert them to HTML. If the target device is a handheld, then the complexity of the HTML should be kept to a minimum. This means simplifying the fonts, using CSS for styles and using HTML elements that look great even in the simplest browsers. Based on experimentation we've seen that XHTML 1.0 is well supported by the HTML viewers on most handheld devices.

To see how well our PDF to HTML reconstruction works, you can experiment with it at www.pdf2mobile.com without needing a mobile device.

Next, we want to make it really easy to use from any handheld device. Assuming you receive an e-mail on your Blackberry with a PDF document attached to it, simply forward it to convert@pdf2mobile.com.
The service will convert it to HTML and e-mail it back. Alternatively, if you have a handheld device with an e-mail client that renders HTML then you can forward your e-mail to detach@pdf2mobile.com - it will be returned as an HTML e-mail rather than an HTML attachment.

We're interested in your feedback (standards@soliddocuments.com) on our conversion and our HTML format. This PDF to HTML conversion functionality will be available for other uses in the next release of Solid Converter PDF.

XML Comments in XMP

Nowhere in the XMP or RDF specifications is any mention of XML comments.

On validating our vast set of PDF files gathered from the wild, thanks to sites like www.freepdftoword.org, www.pdf2mobile.com and www.validatepdfa.com we have run into a multitude of cases where XMP produced by reputable (read "Adobe") products includes XML comments.

After consulting with colleagues at Adobe, Solid Documents and PDFlib we reached consensus on this topic.

Two conclusions:

Since XML comments are legal XML and not explicitly prohibited, we conclude that they are allowed.
XML comments may be dropped when converting PDF files based on this clause from the XML specification:

"an XML processor MAY, but need not, make it possible for an application to retrieve the text of comments"
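Many XML libraries already behave this way. As a small illustration (the XMP fragment below is made up), Python's standard xml.etree.ElementTree parser discards comments by default, so a parse followed by a re-serialization drops them without any extra effort:

```python
import xml.etree.ElementTree as ET

# Made-up XMP fragment containing an XML comment.
xmp = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    '<!-- written by some producer -->'
    '<rdf:Description rdf:about=""/>'
    '</rdf:RDF>'
)

# ElementTree's default parser discards comments, so a parse followed by a
# re-serialization silently drops them.
root = ET.fromstring(xmp)
rewritten = ET.tostring(root, encoding="unicode")
```

A converter built on such a parser will lose comments unless it takes explicit steps to keep them, which the clause above permits.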

Tuesday, April 28, 2009

Non-1.4 Features in PDF/A

Can PDF/A files include features from PDF 1.5, 1.6 or 1.7 (ISO 32000-1)?

This is a recurring FAQ that comes from a superficial understanding of the PDF/A ISO-19005-1 specification. While PDF/A-1 was defined based on the PDF Reference for 1.4 of the format, it quite clearly allows non-1.4 features. One of the great features of PDF is that "unknown things" are ignored by conforming readers. This feature is part of all PDF specifications that have ever existed, including 1.4 on which PDF/A-1 is based.

To quote Leonard Rosenthol, PDF Standards Architect for Adobe Systems:

"There is no question about this by the committee. In fact, we just rediscussed this last week at our meeting in Germany due to a comment from one of the various national bodies about potentially changing this position (aka allowing 'private data' or 'unknown keys') and the agreement was that we are still in agreement that 'unknown things' are allowed PROVIDED THEY DO NOT CHANGE the visual appearance of the page."

The ISO 19005-1 Application Notes from AIIM provide answers to many of these issues. To quote the Application Notes:

“A conforming PDF/A file has three kinds of content:
content that affects the final visual reproduction of the composite entity;
other visual content such as annotations, form fields, etc.
non-printing content such as bookmarks, metadata, etc.
The PDF/A-1 standard states that a conforming file may include valid PDF features beyond those described in the standard provided they do not affect final visual reproduction of the composite entity and are included as part of PDF Reference 1.4.”

What does this mean?

One of the goals of PDF/A is to make the appearance of PDF files predictable and reproducible over time. Including "private" features does not affect this goal or break PDF/A-1 assuming the rules are followed.

Examples

If a feature from a more recent version of PDF can be ignored by a PDF/A-1 Compliant Reader without affecting the visual appearance of the PDF then this goal is met. An example of such a feature would be /PrintScaling (PDF 1.6) in the /ViewerPreferences dictionary. Any “future” feature that does not affect appearance is exactly the same as a “private” feature between reader and writer: conforming PDF/A Readers should ignore it.

Features from later versions of PDF which do indeed affect visual appearance are explicitly prohibited in the PDF/A 19005-1 specification. For example, 6.5.2 states that annotation types not defined in PDF Reference 1.4 are prohibited (along with a few that are defined in 1.4). This means newer annotation types like 3D are clearly prohibited. Another good example of a "feature from the future" that clearly alters appearance is /UserUnit in the /Page dictionary, which is prohibited because it affects appearance.

Other features are implicitly forbidden. For example, an image compressed using JPXDecode filter (JPEG2000 - PDF 1.5) would be ignored by a conforming PDF/A-1 reader but the absence of this image would affect the visual appearance of the PDF. Hence, JPEG2000 should not be used in PDF/A-1 files. Another example is setting BitsPerComponent to 16 for images (PDF 1.5): since this value was introduced after 1.4, it is forbidden because ignoring it would lead to undefined behavior in readers capable only of rendering 1.4. Many of these cases are covered explicitly in the PDF/A Competence Center's Isartor Test Suite and also the PDF/D Consortium extensions to the PDF/A test suite: PDF/D Compliance Tests.

Comments and questions most welcome.

Tuesday, April 21, 2009

Ambiguities in PDF/A Extension Schemas

The PDF/A XMP Technotes are not clear on the subject of optional/required for properties of the extension schemas.

Discussion with engineers from other PDF/A companies has resulted in the "if it doesn't say 'Optional' then it must be 'Required'" assumption, which most of us are trying to abide by. The only properties in any of the extension schemas marked as Optional are:

pdfaSchema:schema - Optional description of schema
pdfaType:field - Optional description of the structured fields

That's it! All the rest must thus be 'Required'. Not so fast!

If this were true then both pdfaSchema:property and pdfaSchema:valueType are always required which means that all extension schemas must include both properties and custom value types. When we were creating RDF definitions for all the pre-defined PDF/A schemas, we noticed this issue because it made it impossible to correctly define the "Dimensions" valueType schema: this schema has one custom value type and no properties.

Exception #1: at least one of pdfaSchema:property and pdfaSchema:valueType should be present.
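As a rough sketch of how Exception #1 might look in a validator, the check below treats a schema description as a plain Python dict keyed by pdfaSchema property names. The dict layout, the function name check_schema and the placeholder URI and prefix are assumptions made for this illustration; this is not how our validator is actually structured.

```python
# Keys follow the pdfaSchema property names; the dict layout is an assumption
# made for this sketch, not the structure of any real validator.
# pdfaSchema:schema is marked Optional, so it is not checked at all.
ALWAYS_REQUIRED = {"pdfaSchema:namespaceURI", "pdfaSchema:prefix"}
ONE_OF = {"pdfaSchema:property", "pdfaSchema:valueType"}  # Exception #1

def check_schema(schema):
    """Return a list of error strings for one extension schema description."""
    errors = [f"missing {key}" for key in sorted(ALWAYS_REQUIRED) if key not in schema]
    if not any(key in schema for key in ONE_OF):
        errors.append("need pdfaSchema:property or pdfaSchema:valueType")
    return errors

# A schema like "Dimensions": one custom value type and no properties.
dimensions_like = {
    "pdfaSchema:namespaceURI": "http://example.com/ns/dim/",  # placeholder URI
    "pdfaSchema:prefix": "dim",                                # placeholder prefix
    "pdfaSchema:valueType": ["Dimensions"],
}

dimensions_errors = check_schema(dimensions_like)
```

With the "at least one of" rule, a value-type-only schema produces no errors, whereas the strict "everything not Optional is Required" reading would reject it.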

We've noticed with our vast test set accumulated through our free online services like www.freepdftoword.org and www.validatepdfa.com that several Adobe products create schemas which omit one or more of pdfaProperty:description, pdfaType:description and pdfaField:description. All three of these properties are purely descriptive in the same sense as the two properties mentioned above as 'Optional'. We believe that these fields should also be optional but, for now, our validator still flags their absence as an error (not a fatalError though since we can add these fields to the schemas, containing filler content, to "fix" the issue).

Proposed Exception #2: pdfaProperty:description, pdfaType:description and pdfaField:description should be 'Optional' properties. Existing PDF/A creators are omitting them and it makes sense.

A value type containing fields is required to have a pdfaType:namespaceURI property. We've noticed customer samples created by reputable products which omit this field. In the case of the omission, the assumed namespace for the value type is simply the same as the namespace of the schema with a slash and the name of the type appended to it. Our validator marks this issue as an Error too (and not a fatalError) since it can easily be repaired by explicitly inserting the assumed namespace.

Example:

Schema namespace:

http://www.acme.com/ns/email/1/

Value type name:

mailaddress

Assumed namespace of value type if pdfaType:namespaceURI is absent:

http://www.acme.com/ns/email/1/mailaddress/

Proposed Exception #3: if pdfaType:namespaceURI is absent, construct a default namespace for the property as described above.
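The construction in the example can be written as a one-line helper. This is a minimal sketch; the function name assumed_type_namespace is hypothetical, and it assumes slash-terminated namespaces as in the example (a schema namespace ending in '#' is not handled).

```python
def assumed_type_namespace(schema_namespace, type_name):
    """Build the fallback namespace for a value type lacking pdfaType:namespaceURI."""
    # Avoid a double slash when the schema namespace already ends with one.
    base = schema_namespace if schema_namespace.endswith("/") else schema_namespace + "/"
    return f"{base}{type_name}/"

ns = assumed_type_namespace("http://www.acme.com/ns/email/1/", "mailaddress")
```

For the acme example this yields http://www.acme.com/ns/email/1/mailaddress/, matching the assumed namespace shown above.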

Wednesday, March 4, 2009

Flatness: Ambiguity in ISO 32000-1

From the ISO 32000-1 specification:

Table 53, 8.4.1 describing initialization of graphic state at the start of each page:

The precision with which curves shall be rendered on the output device (see 10.6.2, "Flatness Tolerance"). The value of this parameter (positive number) gives the maximum error tolerance, measured in output device pixels; smaller numbers give smoother curves at the expense of more computation and memory use. Initial value: 1.0.

Table 57, 8.4.4 describing the "i" operator:

Set the flatness tolerance in the graphics state (see 10.6.2, "Flatness Tolerance"). flatness is a number in the range 0 to 100; a value of 0 shall specify the output device's default flatness tolerance.

Table 58, 8.4.5 describing the graphic state parameter dictionary entry FL:

Number, (Optional; PDF 1.3) The flatness tolerance (see 10.6.2, "Flatness Tolerance").

10.6.2 Flatness Tolerance

The flatness tolerance controls the maximum permitted distance in device pixels between the mathematically correct path and an approximation constructed from straight line segments, as shown in Figure 54. Flatness may be specified as the operand of the i operator (see Table 57) or as the value of the FL entry in a graphics state parameter dictionary (see Table 58). It shall be a positive number.

Observation:

It appears to me that the above clauses are referring to exactly the same thing. If that is correct, then the range and default value for flatness tolerance are ambiguous:

Either the default is 1.0 or it is 0: pick one.

Either the range is 0 to 100, or it is any positive number (any value > 0): pick one.
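A validator author has to pick one reading, and the two do not agree. The sketch below encodes both readings as I understand them from the quoted clauses; the function names and the sample values are made up for illustration.

```python
def valid_per_table_57(flatness):
    """Table 57 reading: a number in the range 0 to 100, 0 meaning device default."""
    return 0 <= flatness <= 100

def valid_per_10_6_2(flatness):
    """Clause 10.6.2 reading: any positive number."""
    return flatness > 0

# Values on which the two readings disagree.
disagreements = [v for v in (0, 0.5, 1.0, 100, 250)
                 if valid_per_table_57(v) != valid_per_10_6_2(v)]
```

Of the sample values, 0 and 250 are each accepted by one reading and rejected by the other.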

Comments?

Thursday, February 26, 2009

Anomalous Situations - Best Practices

PDF ISO-32000 has a note in clause 12.6.2 that is just dying to get the PDF/D Best Practices treatment:

"Conforming readers should attempt to provide reasonable behavior in anomalous situations. For example, self-referential actions should not be executed more than once, and actions that close the document or otherwise render the next action impossible should terminate the execution sequence."

How about insisting that the Next entry in Action dictionaries shall only contain acyclic graphs of actions? When would endless loops of action sequences ever be a good thing?
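Checking for this is cheap. Here is a minimal sketch of a cycle check, with actions modelled as plain Python dicts; the function name has_action_cycle and the sample actions are assumptions for illustration, not part of any PDF library.

```python
def has_action_cycle(action, _path=None):
    """Return True if following Next entries from `action` revisits an action.

    Actions are modelled as plain dicts; "Next" may hold one action dict or a
    list of them, mirroring the single-dictionary-or-array form in the PDF spec.
    """
    path = _path if _path is not None else set()
    if id(action) in path:
        return True                      # already on the current chain: a loop
    path.add(id(action))
    nxt = action.get("Next", [])
    children = nxt if isinstance(nxt, list) else [nxt]
    found = any(has_action_cycle(child, path) for child in children)
    path.discard(id(action))             # leaving this branch of the graph
    return found

# Made-up actions for illustration.
goto = {"S": "GoTo"}
chain = {"S": "JavaScript", "Next": [goto, goto]}   # shared target, no loop
loop = {"S": "GoTo"}
loop["Next"] = loop                                  # self-referential action
```

Only actions on the current chain count as a loop, so an action that is shared by two branches (as in chain) is still accepted as acyclic.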

Preferred prefix for Colorant Basic Value Type

xapG vs xmpG

From the XMP Specification (2005):

From the XMP Specification Part 2 (2008):

I Googled Adobe's site for clarification on this change, hoping to find a note on the subject: nada.

For the purposes of our XMP validator we're obviously going to assume that the most recent version is correct. The reason I made this blog post is so that it will pop up in Google when the next person stumbles into this question, wondering if it is a typo or a deliberate change.

Wednesday, February 25, 2009

Open Source PDF/A RDF Schemas

Inspired by the Isartor test set for validating PDF/A compliance we are working on a similar style set of negative tests for basic XMP compliance (PDF/A XMP TechNotes).

While it is clear that this work needs to be done, nobody appears to be tackling it. PDF/A 19005-1 is now heading into its 3rd year so we're attempting to fill this gap.

While each vendor will obviously implement their own XMP validator for PDF/A validation and conversion, there are some areas where we can easily collaborate. We believe that it is in all our interests to openly share an RDF and PDF/A compliant XMP implementation of the pre-defined schemas required to validate PDF/A files.

Today we released our first version of the PDF/A pre-defined schemas in RDF form. You can find these resources at the PDF/D website.

Monday, February 23, 2009

Isartor Truth

As promised, we've posted more tools for standardized compliance testing.

Today we added:

- Isartor Truth: an XML file with the expected results of the Isartor PDF/A tests

- CompareReports.exe: a tool to compare the above truth file to output from a validator

For more on our efforts to improve mechanical comparison of compliance testing reports, please visit the PDF/D site.

Friday, February 20, 2009

XMP: bag vs Bag, seq vs Seq

The RDF specification clearly uses "Bag", "Alt" and "Seq" for the names of these container elements. This is a requirement for the names of these array container elements:

rdf:Bag, rdf:Alt and rdf:Seq

Starting with the XMP Specification Part 1, the use of "bag " (as in "bag Text") was introduced as a notation to describe array types in schemas. This document is consistent in using the lowercase variant for type descriptions only.

I believe that the titlecase variant of this notation, first seen in XMP Specification Part 2, was introduced in error (example: XMP Media Management property definition for xmpMM:Ingredients is "Bag ResourceRef").

This inconsistency really didn't matter while it was limited to being used as a notation format only in documentation. The arrival of PDF/A extension schemas changed all that. Specifically, as mentioned in TechNote 0009 clause 4.5, this notation is now used in the PDF/A extension schemas for the pdfaProperty:valueType and the pdfaField:valueType properties.

Our validator will support both variants but will generate warnings for the titlecase version. In other words, we are recommending the use of the lowercase variants as a best practice for PDF/D.
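A lenient parser for this notation could look something like the sketch below. The function name parse_value_type and the warning text are made up for this example; it only illustrates the accept-but-warn policy, not our actual implementation.

```python
CONTAINERS = {"bag", "seq", "alt"}

def parse_value_type(notation):
    """Split a valueType string such as "bag Text" into (container, type, warnings).

    Container is None for a plain type. Titlecase containers are accepted but
    reported, matching the lenient-with-warning policy described above.
    """
    warnings = []
    first, _, rest = notation.strip().partition(" ")
    if first.lower() in CONTAINERS and rest:
        if first != first.lower():
            warnings.append(f"use '{first.lower()}' instead of '{first}'")
        return first.lower(), rest.strip(), warnings
    return None, notation.strip(), warnings

lower = parse_value_type("bag Text")
title = parse_value_type("Bag ResourceRef")
plain = parse_value_type("Text")
```

Both variants normalize to the same lowercase container, and only the titlecase one carries a warning.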

XMP pdfaValidate Schema

In building our new and improved validator we decided to use the pdfaExtension schema (and friends) to define all the schemas we are validating including all the pre-defined schemas. This process of eating our own dogfood has exposed numerous holes in both the XMP Specification and the PDF/A Specification.

The most obvious hole, which has already been discussed within the PDF/A Competence Center Working Group (TWG), is the loose nature of the definition of basic types in XMP. As mentioned earlier in my blog, one example is "Choice of " and "Open Choice of ". Another issue raised in TWG discussions is the ambiguous use of case (seq vs Seq, bag vs Bag, etc).

The XMP Specification makes provision for extending existing Properties with Qualifier Properties that are ignored by applications that are not aware of them. We used this feature and the pdfaValidate schema to extend pdfaProperty and add validation information. When defining the schemas we wish to validate, we now add the following attributes:

status

Description: used by validator to flag errors of omission, inclusion or raise warnings.

Type: Closed Choice of Text

'deprecated' is similar to 'prohibited' only it is flagged as a warning and not an error by validators.

constraint

Description: Regular expression used to constrain "Closed Choice of " values. We still need a way to flag Open vs Closed.

Regular expressions always need to match all input (start with '^' and end with '$'). Other valid constraint values include:

'base64': used to validate Thumbnail xapGImg:image property for example.

Numeric ranges like: '[0,255]', '(0,)', '[-128,127]', etc.

Type: Text
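To show how the three kinds of constraint value described above could be evaluated, here is a rough Python sketch. The function name satisfies and the dispatch order are assumptions for this example; it relies on the rule that regular expressions always start with '^' and end with '$' to tell them apart from numeric ranges.

```python
import base64
import binascii
import re

# Numeric range such as [0,255], (0,) or [-128,127]; either bound may be empty.
RANGE = re.compile(r"^([\[(])(-?\d+|),(-?\d+|)([\])])$")

def satisfies(constraint, value):
    """Check a text value against a pdfaValidate-style constraint string."""
    if constraint.startswith("^") and constraint.endswith("$"):
        return re.fullmatch(constraint, value) is not None
    if constraint == "base64":
        try:
            base64.b64decode(value, validate=True)
        except (binascii.Error, ValueError):
            return False
        return True
    m = RANGE.match(constraint)
    if m is None:
        raise ValueError(f"unrecognised constraint: {constraint!r}")
    open_b, low, high, close_b = m.groups()
    try:
        number = float(value)
    except ValueError:
        return False
    # Square brackets include the bound, round brackets exclude it.
    if low and (number < float(low) or (open_b == "(" and number == float(low))):
        return False
    if high and (number > float(high) or (close_b == ")" and number == float(high))):
        return False
    return True
```

For example, satisfies("[0,255]", "256") and satisfies("(0,)", "0") are both False, while satisfies("^(A|B)$", "A") is True.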

standard

Description: This value determines which specification is violated when constraints are not met.

Type: Closed Choice of Text

Values: pdf|pdfa|pdfd|xmp

clause

Description: This is the clause in the specification which is violated when constraints are not met.

Type: Text

Value: string, typically dot delimited integers

We are continuing to work on our full set of these schemas for validation of PDF/A. These will then be available to PDF/D Consortium members. During this process, we may add more features to the pdfaValidate schema.

Sunday, February 15, 2009

XMP Validator

I've been working on building a better XMP validator. My idea was to define all the pre-defined schemas as pdfaExtension schemas and pre-load them into my validator. With this approach, I only need one validator (that validates pdfaExtension schemas) to validate all the pre-defined schemas as well as any user defined schemas.

Part of pulling this off requires that I have RDF schemas for all the common pre-defined schemas. I thought I'd start with the PDF/A identification schema since it appeared almost trivial. It didn't take long before I ran into "undefined" ground. I thought I could use "Closed Choice of Integer" to define the part property (only one choice: 1) and "Closed Choice of Text" to define the conformance property (two choices: A or B). So, using the samples I found in the tech notes on the PDF/A Competence Center site, I set out to create my first pdfaExtension schema.

Soon I discovered that how to define a "Choice" is not defined in these tech notes. Next step was to wade through XMP documentation at Adobe. This doesn't really help much because, being new to this domain, it is not easy to tell when something is specific to XMP, RDF or pdfaExtension. On page 62 of the XMP Specification a Closed Choice is described. A vocabulary and lists are mentioned. I can only assume this means defining a list of values using Bag, Alt or Seq. An example would really help to clarify.

I'm all ears ..

(Here is my work in progress: sample.rdf)

Next Idea: pdfaValidate Schema

There are not a lot of examples out there. Simple examples showing how to define a Closed Choice field would be great. The same goes for defining "Property Qualifiers". From what I read in the XMP specification they would be an ideal solution for me:

"Property qualifiers allow values to be extended without breaking existing usage."

The specification has pretty block diagrams but no sample code.

In the absence of decent implementation documentation I decided to just take a swing at it and came up with something that I think is probably what the XMP Specification describes as "Property Qualifiers". I created an RDF schema with two properties for validation:

status: Closed Choice of Text - required|prohibited|restricted|recommended|ignored
constraint: Text - regular expression for constraining simple literal fields for PDF/A compliance.

Here it is as RDF. I included the definition of pdfaValidate schema and included a "constrained" version of my pdfaid RDF schema definition as an example: pdfaValidate.rdf

Now I have what I need to make simple "constrained" RDF definitions for all the pre-defined schemas that we need to validate for PDF/A compliance. Moving right along ..

Monday, February 9, 2009

More on Numbers

Earlier I discussed Numbers in a general post about improving PDF for easier parsing.

I have two more notes to add on the subject of numbers.

"." is not a number

PDF ISO-32000-1:2008 states that:

A real value shall be written as one or more decimal digits with an optional sign and a leading, trailing, or embedded PERIOD (2Eh) (decimal point).

Adobe Acrobat Reader 9 clearly ignores this and accepts a single period as zero. This example (7-3-3-t01-fail-b.pdf) from our PDF 1.4 test set clearly shows that the colors red (on the RGB page) and black (on the CMYK page) were parsed with no problem.

1 0 . rg 72 72 72 72 re f

0 1 0 rg 72 216 72 72 re f

0 0 1 rg 72 360 72 72 re f

0 1 1 rg 72 72 72 72 re f

1 0 1 rg 72 216 72 72 re f

1 1 0 rg 72 360 72 72 re f

. 0 0 rg 72 504 72 72 re f

Numbers in PDF/D

In addition to earlier notes on parsing numbers, the above behavior will be considered an error in PDF/D.
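The rule quoted above fits in a single regular expression. The Python sketch below is one possible reading of it (the names PDF_NUMBER and is_pdf_number are made up for this example): an optional sign, then digits with an optional leading, trailing or embedded period, with at least one digit required.

```python
import re

# Two alternatives, both of which need at least one digit:
#   digits, optional period, optional digits  ->  "42", "42.", "4.2"
#   period followed by digits                 ->  ".5"
PDF_NUMBER = re.compile(r"[+-]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)")

def is_pdf_number(token):
    """True if the whole token is an integer or real as described above."""
    return PDF_NUMBER.fullmatch(token) is not None
```

A lone period fails this test, as does a sign with no digits after it.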

Also, in our tens of thousands of test files we have often seen numeric arguments in content streams that are terminated only by the operator itself, like this:

... 2 0 0 2 0 0cm ...

Acrobat does not tolerate this but we have seen other PDF software (including our own) look past this error. PDF/D will require delimiters or whitespace to terminate number tokens.
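A checker for this could be as simple as the following Python sketch. It splits a content stream on whitespace and reports tokens in which a number runs straight into operator characters. The function name is hypothetical, and the sketch assumes string objects have already been removed from the input, since text inside a string could otherwise be reported by mistake.

```python
import re

# A number immediately followed by operator characters, e.g. "0cm".
GLUED = re.compile(r"([+-]?(?:[0-9]+\.?[0-9]*|\.[0-9]+))([A-Za-z*']+)")

def find_glued_operators(content):
    """Return (number, operator) pairs for tokens missing a separator."""
    found = []
    for token in content.split():
        match = GLUED.fullmatch(token)
        if match:
            found.append((match.group(1), match.group(2)))
    return found
```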

Sunday, February 8, 2009

Resources

Resources for a Page's Contents entry are defined in the Resources dictionary of that Page or inherited from one of the ancestor nodes of that Page in the page node tree.
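The lookup for a Page can be sketched in a few lines of Python. This assumes a toy model in which page tree nodes are plain dicts and the Parent entry has already been resolved to the parent node; find_resources is a name invented for the example.

```python
def find_resources(page):
    """Return the Resources dictionary that applies to a page, or None."""
    node = page
    while node is not None:
        # The nearest definition wins, even if it is an empty dictionary.
        if "Resources" in node:
            return node["Resources"]
        node = node.get("Parent")
    return None
```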

For XObjects, patterns, Type 3 Fonts and annotations that have content streams, the Resources dictionary will be included in the Content stream's dictionary. Unlike early versions of PDF, Resources cannot be inherited from the page tree for these objects (PDF 32000-1 mentions this obsolete functionality too).

ProcSets are obsolete and are excluded from PDF/D Resource dictionaries.

BX and EX

The last time a content operator was added to PDF was with PDF 1.2.

Since we are defining a file format and not the behaviour of a conforming reader, it falls within the PDF/D philosophy of minimizing the cruft to drop these operators. In the unlikely event that a future version of PDF adds new operators, we can add them back in a future iteration of PDF/D so that conforming writers can use the new operators.

For now, no need for BX and EX: as with PDF/A any operator in a content stream that is not defined in PDF ISO-32000 is considered an error.

Thursday, February 5, 2009

Defining the Undefined

Despite being such an enormous specification, PDF ISO 32000-1 still has some holes in it. Each time I encounter such a scenario I'm going to write about it and start to lock down behavior for PDF/D. Please correct me if I miss something and if the scenario I'm describing is actually defined.

Empty Object

The specification does not mention the meaning of empty indirect objects like:

10 0 obj

endobj

I've tried to read between the lines to fathom the meaning of this emptiness but it simply is not defined. An obvious choice would be to treat such an object as the null object. I believe Acrobat Reader does this.

Variations on this theme that are defined include an indirect object containing an empty dictionary or an indirect object that is simply the null object:

11 0 obj

<<>>

endobj

12 0 obj

null

endobj

In addition, indirect references to undefined objects are treated as the null object (7.3.10) and a dictionary entry whose value is null shall be treated the same as if the entry does not exist (7.3.7).

PDF/D

To simplify the specification and reduce unnecessary bloat, the only null object shall be:

null

Illegal in PDF/D:

indirect references to undefined objects
empty indirect objects

Best Practices in PDF/D:

In addition, we consider it a best practice to omit an entry from a dictionary rather than to include an entry with a null value.
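As a rough illustration, both rules are easy to apply mechanically. The Python sketch below uses hypothetical helper names and makes two assumptions: the file is scanned as raw bytes without skipping stream data or comments, and dictionaries are modelled as Python dicts with None standing in for the PDF null object.

```python
import re

# "N G obj" followed by nothing but whitespace before "endobj".
EMPTY_OBJECT = re.compile(rb"([0-9]+)\s+([0-9]+)\s+obj\s*endobj")

def find_empty_objects(data):
    """Return the object numbers of empty indirect objects in raw PDF bytes."""
    return [int(m.group(1)) for m in EMPTY_OBJECT.finditer(data)]

def drop_null_entries(dictionary):
    """Best practice: omit entries whose value is null (modelled as None)."""
    return {key: value for key, value in dictionary.items() if value is not None}
```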

"Zero" Object

Another illegal behavior I've seen in customer files is indirect references that look like this:

... 0 0 R ...

Sometimes I've also seen this object actually defined like:

0 0 obj

...

endobj

ISO 32000-1 clearly states that this is illegal which means it is obviously illegal for PDF/D too. In 7.3.10 the object number is defined as a positive integer. Last time I checked, 0 is not one of the positive integers.
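Spotting such references is straightforward. The following Python sketch (find_zero_references is a made-up name) scans raw bytes for indirect references whose object number is zero. It assumes the input contains no strings or stream data that happen to look like a reference.

```python
import re

# An indirect reference: object number, generation number, then the keyword R.
REFERENCE = re.compile(rb"(?<![0-9])([0-9]+)\s+([0-9]+)\s+R(?![A-Za-z0-9])")

def find_zero_references(data):
    """Return the byte offsets of references to object number 0."""
    return [m.start() for m in REFERENCE.finditer(data) if int(m.group(1)) == 0]
```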

Tuesday, February 3, 2009

XRef stream vs xref

That didn't take long! I've been urged to compromise on legacy features already.

Members of the PDF/A camp, including one of my software engineers (Sergey), are concerned that dropping the old style xref tables means that no PDF/A-1 file can possibly be PDF/D compliant. I was looking forward with the hope that PDF/A-2 support would be good enough but they disagree.

So, I've decided to take a step back and allow old style xref tables to exist in PDF/D but with plenty of constraints:

only a single xref table (no Prev field in Trailer)
no hybrid files (no XRefStm in trailer)
no deleted objects (no f type in the xref table except for the first entry)
generation numbers always zero
only one section (implies consecutive object numbers starting at 1)

These simplifications mean that the end of the PDF file will always look like some variant of this:

xref

0 4

0000000000 65535 f

0000000009 00000 n

0000000122 00000 n

0000000175 00000 n

trailer

<<

/Size 4

/Root 2 0 R

>>

startxref

226

%%EOF

The other valid PDF/D entries in the Trailer are ID, Info and Encrypt.

As with my earlier PDF/D constraints, incremental updates and the dead objects that come with them are eliminated. So is linearization.

There you have it: the minimum required functionality of old style xref tables to make it possible for PDF/A-1 files to be PDF/D compliant.
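The constraints listed above lend themselves to a mechanical check. Here is a minimal Python sketch under some assumptions: the xref section has already been cut out of the file as text, the trailer dictionary has been parsed elsewhere and is passed in as a collection of key names, and check_xref is a hypothetical helper rather than anything from our products.

```python
import re

HEADER = re.compile(r"([0-9]+) ([0-9]+)")
ENTRY = re.compile(r"([0-9]{10}) ([0-9]{5}) ([nf])")

def check_xref(xref_text, trailer_keys):
    """Return a list of violations of the xref constraints listed above."""
    lines = [line.strip() for line in xref_text.splitlines() if line.strip()]
    if len(lines) < 2 or lines[0] != "xref":
        return ["missing xref keyword or section header"]
    header = HEADER.fullmatch(lines[1])
    if header is None:
        return ["malformed section header"]
    problems = []
    first, count = int(header.group(1)), int(header.group(2))
    if first != 0:
        problems.append("the single section must start at object 0")
    entries = lines[2:]
    if len(entries) != count:
        problems.append("entry count differs from header (extra section?)")
    for number, line in enumerate(entries):
        entry = ENTRY.fullmatch(line)
        if entry is None:
            problems.append(f"object {number}: malformed entry")
            continue
        generation, kind = entry.group(2), entry.group(3)
        if number == 0:
            if (generation, kind) != ("65535", "f"):
                problems.append("object 0 must be the free entry, generation 65535")
        elif kind == "f":
            problems.append(f"object {number}: deleted objects are not allowed")
        elif generation != "00000":
            problems.append(f"object {number}: generation must be zero")
    for key in ("Prev", "XRefStm"):
        if key in trailer_keys:
            problems.append(f"trailer must not contain /{key}")
    return problems
```

A second section header would show up here as a malformed entry, which is enough to reject the file.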

Sunday, February 1, 2009

Parsing PDF/D

We’ve spent a lot of energy optimizing the C++ parser behind the Solid Documents products. Our C# parser has also been a learning experience. Once a PDF parser is well optimized, it will always end up spending the majority of the time in parsing numbers. This has proven to be true for both our native and managed code parsers.

Instead of focusing on the parser, I decided to take a look at the other half of the problem: the file format itself. What if I could change the PDF format to make it easier to parse? PDF/D will be constraining the features of PDF to a subset so why not also make some improvements that will make parsing not just faster but also more reliable?

More reliable, you ask? Yes. Removing multiple ways of doing things obviously has minor performance benefits but the bigger benefit is simplification of the code needed to deal with multiple variations of essentially the same thing.

EOL

PDF defines end-of-line as one or two characters that may be 0x0D, 0x0A or 0x0D followed by 0x0A. ISO 32000-1 and ISO 19005-1 go to some effort to constrain the end of line characters more tightly surrounding the data of streams.

Why not just define end-of-line as 0x0A and call it good? That would still be 100% PDF/A compliant too.

WhiteSpace

At the parser-level, whitespace is a special case of a delimiter for PDF. In string objects, it is data and, including UTF-16BE, there are at least 15 valid data whitespace characters. What I’m talking about here is at the parser-level and not the string objects.

While we are getting rid of 0x0d as an end-of-line character, we may as well get rid of a few of the whitespace alternatives too. Who needs tab (0x09) and form feed (0x0C) when space (0x20) and our new end-of-line (0x0A) will do just fine?

Comments

Comments are a pain for the PDF parser developer. They can appear anywhere whitespace is legal and they continue to the next end-of-line. More importantly, who cares? Aside from the pseudo comments used for the PDF file header and end-of-file tokens, comments serve no purpose whatsoever. There are other ways of putting application specific data in PDF files if that was what you were thinking, so let's toss comments out too.

Numbers

The + character will have to go. It adds no value. Most PDF parsers attack numbers first as integers and then switch to a real mode as soon as a decimal point is encountered. Integer parsing is more efficient than real parsing. For this reason, whole numbers should be presented as integers and not as reals. For example, favor 42 over 42. or 42.0
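The integer-first strategy described above looks roughly like this in Python. It is a simplified sketch, not our production parser: parse_number is a hypothetical name, the token is assumed to have been isolated already, and a leading "+" is rejected to match the proposal.

```python
def parse_number(token):
    """Parse a number token: int while possible, float after a decimal point."""
    pos = 0
    negative = token.startswith("-")  # a leading "+" is rejected on purpose
    if negative:
        pos = 1
    value = 0
    digits = 0
    # Integer mode: cheap accumulation of decimal digits.
    while pos < len(token) and "0" <= token[pos] <= "9":
        value = value * 10 + (ord(token[pos]) - ord("0"))
        pos += 1
        digits += 1
    if pos == len(token):
        if digits == 0:
            raise ValueError(f"not a number: {token!r}")
        return -value if negative else value
    if token[pos] != ".":
        raise ValueError(f"not a number: {token!r}")
    pos += 1
    # Real mode: only entered once a decimal point has been seen.
    fraction = 0
    scale = 1
    while pos < len(token) and "0" <= token[pos] <= "9":
        fraction = fraction * 10 + (ord(token[pos]) - ord("0"))
        scale *= 10
        pos += 1
        digits += 1
    if pos != len(token) or digits == 0:
        raise ValueError(f"not a number: {token!r}")
    result = value + fraction / scale
    return -result if negative else result
```

A writer that emits 42 rather than 42.0 keeps the reader on the first, cheaper path.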

Strings

A lot can be done to simplify string parsing. We can start by removing the escaped end-of-line for allowing multiple line strings. In addition, we can drop the idea of “matched parentheses” and simply escape all parentheses.

Hex strings are useful for representing byte strings as plain text and little else. Hex strings start with the same delimiter as dictionaries making parsing more complex than if they each used unique delimiters: <

Since most PDF files are binary anyway, regular strings can be used to represent byte strings and hex strings are no longer needed.
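Writing a byte string under these rules could look like the Python sketch below. The function name is hypothetical. Escaping the carriage return is an extra step not mentioned above: it is included because an unescaped end-of-line inside a literal string is read back as 0x0A, which would corrupt binary data.

```python
def escape_literal_string(data):
    """Encode bytes as a PDF literal string, escaping every parenthesis."""
    out = bytearray(b"(")
    for byte in data:
        if byte in b"\\()":
            # Backslash and both parentheses always get a backslash prefix.
            out += b"\\" + bytes([byte])
        elif byte == 0x0D:
            # A raw CR would be read back as LF, so spell it out as \r.
            out += b"\\r"
        else:
            out.append(byte)
    out += b")"
    return bytes(out)
```

Because no unescaped parenthesis can appear inside the string, a reader can stop at the first one it sees instead of counting nesting depth.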

Fixed Formats

We should fix the format of the header and end-of-file comments. This way we can search for them as strings rather than parsing. Given \n as 0x0A, something like “%PDF-1.5\n%ÿÿÿÿ\n” for the header and “\n%%EOF\n” for the end-of-file marker should be fine.

In addition, we should lock down the syntax surrounding the ‘obj’ and ‘endobj’ keywords so that damaged PDF files can be repaired more reliably. For example, “\nendobj\n\d+ 0 obj\n” makes an easy target for a regular expression search, where “\d+” matches the object number.
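With the syntax locked down like this, a repair tool can rebuild object offsets with a single search. The Python sketch below shows the idea; find_object_starts is a made-up helper, and it assumes generation numbers are always zero and that the file follows the fixed layout exactly.

```python
import re

# The end of one object followed directly by the start of the next.
BOUNDARY = re.compile(rb"\nendobj\n([0-9]+) 0 obj\n")

def find_object_starts(data):
    """Map object number -> byte offset of its 'N 0 obj' line."""
    return {int(m.group(1)): m.start(1) for m in BOUNDARY.finditer(data)}
```

The first object in the file is not preceded by an endobj, so it has to be located from the fixed header instead.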

So, any feedback or input? More ideas for putting a PDF parser on a diet? Comment here or find contact details at PDF/D.
