Thursday, February 26, 2009

Anomalous Situations - Best Practices

PDF ISO-32000 has a note in clause 12.6.2 that is just dying to get the PDF/D Best Practices treatment:


"Conforming readers should attempt to provide reasonable behavior in anomalous situations. For example, self-referential actions should not be executed more than once, and actions that close the document or otherwise render the next action impossible should terminate the execution sequence."

How about insisting that the Next entry in Action dictionaries shall only contain acyclic graphs of actions?  When would endless loops of action sequences ever be a good thing?
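
A conforming writer could check the acyclic rule before saving. Here is a minimal Python sketch of that check, not a definitive implementation. It assumes the action dictionaries have already been parsed into a table keyed by object number, and that each Next entry holds either one object number or a list of them. The table layout and the name has_action_cycle are made up for illustration.

```python
def has_action_cycle(actions, start):
    """Return True if following Next entries from `start` can revisit an action.

    `actions` is a hypothetical table: object number -> action dictionary,
    where "Next" is a single object number or a list of object numbers.
    """
    IN_PROGRESS, DONE = 1, 2
    state = {}

    def next_ids(obj_id):
        nxt = actions.get(obj_id, {}).get("Next", [])
        return [nxt] if isinstance(nxt, int) else list(nxt)

    def visit(obj_id):
        state[obj_id] = IN_PROGRESS
        for child in next_ids(obj_id):
            if state.get(child) == IN_PROGRESS:
                return True          # back edge: the sequence loops
            if child not in state and visit(child):
                return True
        state[obj_id] = DONE
        return False

    return visit(start)


looping = {1: {"S": "GoTo", "Next": 2}, 2: {"S": "Named", "Next": [3, 1]}, 3: {"S": "URI"}}
shared = {1: {"Next": [2, 3]}, 2: {"Next": 3}, 3: {}}
print(has_action_cycle(looping, 1), has_action_cycle(shared, 1))
```

Note that an action reachable by two different paths (object 3 in the second table) is not a cycle; only a path that leads back to an action still being executed is.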

Preferred prefix for Colorant Basic Value Type

xapG vs xmpG

I Googled Adobe's site for clarification on this change, hoping to find a note on the subject: nada.
For the purposes of our XMP validator we're obviously going to assume that the most recent version is correct. The reason I made this blog post is so that it will pop up in Google when the next person stumbles into this question, wondering if it is a typo or a deliberate change.

Wednesday, February 25, 2009

Open Source PDF/A RDF Schemas

Inspired by the Isartor test set for validating PDF/A compliance we are working on a similar style set of negative tests for basic XMP compliance (PDF/A XMP TechNotes).


While it is clear that this work needs to be done, nobody appears to be tackling it. PDF/A 19005-1 is now heading into its 3rd year so we're attempting to fill this gap.

While each vendor will obviously implement their own XMP validator for PDF/A validation and conversion, there are some areas where we can easily collaborate. We believe that it is in all our interests to openly share an RDF and PDF/A compliant XMP implementation of the pre-defined schemas required to validate PDF/A files.

Today we released our first version of the PDF/A pre-defined schemas in RDF form. You can find these resources at the PDF/D website.

Monday, February 23, 2009

Isartor Truth

As promised, we've posted more tools for standardized compliance testing.


Today we added:
- Isartor Truth: an XML file with the expected results of the Isartor PDF/A tests
- CompareReports.exe: a tool to compare the above truth file to output from a validator

For more on our efforts to improve mechanical comparison of compliance testing reports, please visit the PDF/D site.

Friday, February 20, 2009

XMP: bag vs Bag, seq vs Seq

The RDF specification clearly uses "Bag", "Alt" and "Seq" for the names of these container elements. This is a requirement for the names of these array container elements:

rdf:Bag, rdf:Alt and rdf:Seq

Starting with the XMP Specification Part 1, the use of "bag " (as in "bag Text") was introduced as a notation to describe array types in schemas. This document is consistent in using the lowercase variant for type descriptions only.

I believe that the titlecase variant of this notation, first seen in XMP Specification Part 2, was introduced in error (example: XMP Media Management property definition for xmpMM:Ingredients is "Bag ResourceRef").  

This inconsistency really didn't matter while it was limited to being used as a notation format only in documentation. The arrival of PDF/A extension schemas changed all that. Specifically, as mentioned in TechNote 0009 clause 4.5, this notation is now used in the PDF/A extension schemas for the pdfaProperty:valueType and the pdfaField:valueType properties.

Our validator will support both variants but will generate warnings for the titlecase version. In other words, we are recommending the use of the lowercase variants as a best practice for PDF/D.
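
The "accept both, warn on titlecase" rule is easy to sketch. The Python below is only an illustration of the idea and not our validator's actual code; the function name check_value_type and the warning text are invented for the example.

```python
ARRAY_FORMS = ("bag", "seq", "alt")


def check_value_type(value_type):
    """Return (normalised value type, list of warnings).

    Only the leading array keyword is touched; the element type that
    follows it (for example "ResourceRef") is left exactly as written.
    """
    first, sep, rest = value_type.partition(" ")
    warnings = []
    if sep and first.lower() in ARRAY_FORMS and first != first.lower():
        warnings.append(f"'{first}' should be written '{first.lower()}' in a valueType")
        first = first.lower()
    return first + sep + rest, warnings


print(check_value_type("Bag ResourceRef"))
print(check_value_type("bag Text"))
```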

XMP pdfaValidate Schema

In building our new and improved validator we decided to use the pdfaExtension schema (and friends) to define all the schemas we are validating including all the pre-defined schemas. This process of eating our own dogfood has exposed numerous holes in both the XMP Specification and the PDF/A Specification.


The most obvious hole, which has already been discussed within the PDF/A Competence Center Working Group (TWG), is the loose nature of the definition of basic types in XMP. As mentioned earlier in my blog, one example is "Choice of " and "Open Choice of ". Another issue raised in TWG discussions is the ambiguous use of case (seq vs Seq, bag vs Bag, etc).

The XMP Specification makes provision for extending existing Properties with Qualifier Properties that are ignored by applications that are not aware of them. We used this feature and the pdfaValidate schema to extend pdfaProperty and add validation information. When defining the schemas we wish to validate, we now add the following attributes:

status
Description: used by validator to flag errors of omission, inclusion or raise warnings.
Type: Closed Choice of Text
Values: required|prohibited|deprecated|restricted|recommended|ignored
'deprecated' is similar to 'prohibited' only it is flagged as a warning and not an error by validators.

constraint
Description: Regular expression used to constrain "Closed Choice of " values. We still need a way to flag Open vs Closed.
Regular expressions always need to match all input (start with '^' and end with '$'). Other valid constraint values include:
'base64': used to validate Thumbnail xapGImg:image property for example.
Numeric ranges like: '[0,255]',  '(0,)', '[-128,127]', etc.
Type: Text

standard
Description: This value determines which specification is violated when constraints are not met.
Type: Closed Choice of Text
Values: pdf|pdfa|pdfd|xmp

clause
Description: This is the clause in the specification which is violated when constraints are not met.
Type: Text
Value: string, typically dot delimited integers
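
To show how the three kinds of constraint value could be told apart, here is a rough Python sketch. It is written under assumptions that the schema itself does not spell out: numeric ranges are read as interval notation (square bracket inclusive, round bracket exclusive, a missing bound meaning unbounded), range values are integers, and anything that starts with '^' and ends with '$' is a regular expression. The name satisfies is hypothetical.

```python
import base64
import binascii
import re

# Assumed range syntax: '[' or '(' , optional integer , ',' , optional integer , ']' or ')'
RANGE = re.compile(r"([\[(])(-?\d+)?,(-?\d+)?([\])])")


def satisfies(constraint, value):
    """Return True if the text `value` meets the pdfaValidate `constraint`."""
    if constraint == "base64":
        try:
            base64.b64decode(value, validate=True)
            return True
        except (binascii.Error, ValueError):
            return False

    m = RANGE.fullmatch(constraint)
    if m:
        try:
            number = int(value)
        except ValueError:
            return False
        opening, low, high, closing = m.groups()
        if low is not None and (number < int(low) or (opening == "(" and number == int(low))):
            return False
        if high is not None and (number > int(high) or (closing == ")" and number == int(high))):
            return False
        return True

    if constraint.startswith("^") and constraint.endswith("$"):
        return re.fullmatch(constraint, value) is not None

    raise ValueError(f"unrecognised constraint: {constraint!r}")


print(satisfies("^(A|B)$", "A"), satisfies("[0,255]", "256"), satisfies("(0,)", "1"))
```

Requiring the '^' and '$' anchors is what keeps a range such as '[0,255]' from being mistaken for a regular expression character class.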

We are continuing to work on our full set of these schemas for validation of PDF/A. These will then be available to PDF/D Consortium members. During this process, we may add more features to the pdfaValidate schema.

Sunday, February 15, 2009

XMP Validator

I've been working on building a better XMP validator. My idea was to define all the pre-defined schemas as pdfaExtension schemas and pre-load them into my validator. With this approach, I only need one validator (that validates pdfaExtension schemas) to validate all the pre-defined schemas as well as any user defined schemas.


Part of pulling this off requires that I have RDF schemas for all the common pre-defined schemas. I thought I'd start with the PDF/A identification schema since it appeared almost trivial. It didn't take long before I ran into "undefined" ground. I thought I could use "Closed Choice of Integer" to define the part property (only one choice: 1) and "Closed Choice of Text" to define the conformance property (two choices: A or B). So, using the samples I found in the tech notes on the PDF/A Competence Center site, I set out to create my first pdfaExtension schema.

Soon I discovered that how to define a "Choice" is not defined in these tech notes. Next step was to wade through XMP documentation at Adobe. This doesn't really help much because, being new to this domain, it is not easy to tell when something is specific to XMP, RDF or pdfaExtension. On page 62 of the XMP Specification a Closed Choice is described. A vocabulary and lists are mentioned.  I can only assume this means defining a list of values using Bag, Alt or Seq. An example would really help to clarify.

I'm all ears ..

(Here is my work in progress: sample.rdf)

Next Idea: pdfaValidate Schema
There are not a lot of examples out there. Simple examples showing how to define a Closed Choice field would be great. The same goes for defining "Property Qualifiers". From what I read in the XMP specification they would be an ideal solution for me:
"Property qualifiers allow values to be extended without breaking existing usage."
The specification has pretty block diagrams but no sample code.

In the absence of decent implementation documentation I decided to just take a swing at it and came up with something that I think is probably what the XMP Specification describes as "Property Qualifiers". I created an RDF schema with two properties for validation:
  1. status: Closed Choice of Text - required|prohibited|restricted|recommended|ignored
  2. constraint: Text - regular expression for constraining simple literal fields for PDF/A compliance.

Here it is as RDF. I included the definition of pdfaValidate schema and included a "constrained" version of my pdfaid RDF schema definition as an example: pdfaValidate.rdf

Now I have what I need to make simple "constrained" RDF definitions for all the pre-defined schemas that we need to validate for PDF/A compliance. Moving right along ..

Monday, February 9, 2009

More on Numbers

Earlier I discussed Numbers in a general post about improving PDF for easier parsing.


I have two more notes to add on the subject of numbers.

"." is not a number

PDF ISO-32000-1:2008 states that:

A real value shall be written as one or more decimal digits with an optional sign and a leading, trailing, or embedded PERIOD (2Eh) (decimal point).

Adobe Acrobat Reader 9 clearly ignores this and accepts a single period as zero. This example (7-3-3-t01-fail-b.pdf) from our PDF 1.4 test set clearly shows that the colors red (on the RGB page) and black (on the CMYK page) were parsed with no problem.

1 0 . rg 72 72 72 72 re f
0 1 0 rg 72 216 72 72 re f
0 0 1 rg 72 360 72 72 re f

..

0 1 1 rg 72 72 72 72 re f
1 0 1 rg 72 216 72 72 re f
1 1 0 rg 72 360 72 72 re f
. 0 0 rg 72 504 72 72 re f

Numbers in PDF/D

In addition to earlier notes on parsing numbers, the above behavior will be considered an error in PDF/D. 
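
A strict reading of the sentence quoted from the specification (optional sign, at least one decimal digit, at most one period) fits in a single regular expression. The Python below is a sketch of that reading only; the name is_pdf_number is made up, and it takes a token that has already been split out of the stream.

```python
import re

# Optional sign, then either digits with an optional period and more digits,
# or a leading period followed by at least one digit. A lone "." has no
# digit at all, so it cannot match.
PDF_NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)")


def is_pdf_number(token):
    return PDF_NUMBER.fullmatch(token) is not None


operands = "1 0 . rg".split()[:-1]          # drop the rg operator
print([t for t in operands if not is_pdf_number(t)])
```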

Also, in our tens of thousands of test files we have often seen number arguments in content streams terminated directly by the operator, like this:

... 2 0 0 2 0 0cm ...

Acrobat does not tolerate this but we have seen other PDF software (including our own) look past this error. PDF/D will require delimiters or whitespace to terminate number tokens.
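
Spotting these fused tokens is straightforward once the stream is split on whitespace and delimiters. The following Python is a rough sketch under simplifying assumptions: it works on decoded content stream text, and it does not skip string literals or inline image data, so a real checker would need a proper tokenizer in front of it. The names are hypothetical.

```python
import re

# Regular characters: anything that is neither PDF whitespace nor a delimiter.
REGULAR = r"[^\x00\t\n\x0c\r ()<>\[\]{}/%]"
# A token is either a name (leading slash) or a run of regular characters.
TOKEN = re.compile(rf"/{REGULAR}*|{REGULAR}+")
# A number immediately followed by operator characters, with nothing between.
FUSED = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)[A-Za-z'\"*]+")


def find_fused_numbers(content):
    """Return tokens where a number runs straight into an operator."""
    return [t for t in TOKEN.findall(content) if FUSED.fullmatch(t)]


print(find_fused_numbers("q 2 0 0 2 0 0cm /F1 12 Tf Q"))
```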

Sunday, February 8, 2009

Resources

Resources for a Page's Contents entry are defined in the Resources dictionary of that Page or inherited from one of the ancestor nodes of that Page in the page node tree.


For XObjects, patterns, Type 3 Fonts and annotations that have content streams, the Resources dictionary will be included in the Content stream's dictionary. Unlike early versions of PDF, Resources cannot be inherited from the page tree for these objects (PDF 32000-1 mentions this obsolete functionality too).

ProcSets are obsolete and are excluded from PDF/D Resource dictionaries.

BX and EX

The last time a content stream operator was added to PDF was with PDF 1.3 (sh, the shading operator).


Since we are defining a file format and not the behaviour of a conforming reader, it falls within the PDF/D philosophy of minimizing the cruft to drop these operators. In the unlikely event that a future version of PDF adds new operators, we can add them back in a future iteration of PDF/D so that conforming writers can use the new operators.

For now, no need for BX and EX: as with PDF/A any operator in a content stream that is not defined in PDF ISO-32000 is considered an error.

Thursday, February 5, 2009

Defining the Undefined

Despite being such an enormous specification, PDF ISO 32000-1 still has some holes in it.  Each time I encounter such a scenario I'm going to write about it and start to lock down behavior for PDF/D. Please correct me if I miss something and if the scenario I'm describing is actually defined.


Empty Object
The specification does not mention the meaning of empty indirect objects like:

10 0 obj
endobj

I've tried to read between the lines to fathom the meaning of this emptiness but it simply is not defined. An obvious choice would be to treat such an object as the null object. I believe Acrobat Reader does this.

Variations on this theme that are defined include an indirect object containing an empty dictionary or an indirect object that is simply the null object:

11 0 obj
<<>>
endobj

12 0 obj
null
endobj

In addition, indirect references to undefined objects are treated as the null object (7.3.10) and a dictionary entry whose value is null shall be treated the same as if the entry does not exist (7.3.7). 

PDF/D
To simplify the specification and reduce unnecessary bloat, the only null object shall be:

null

Illegal in PDF/D:
  • indirect references to undefined objects
  • empty indirect objects
Best Practices in PDF/D:

In addition, we consider it a best practice to omit an entry from a dictionary rather than to include an entry with a null value.

"Zero" Object
Another illegal behavior I've seen in customer files is indirect references that look like this:

... 0 0 R ...

Sometimes I've also seen this object actually defined like:

0 0 obj
...
endobj

ISO 32000-1 clearly states that this is illegal which means it is obviously illegal for PDF/D too. In 7.3.10 the object number is defined as a positive integer. Last time I checked, 0 is not one of the positive integers.

Tuesday, February 3, 2009

XRef stream vs xref

That didn't take long! I've been urged to compromise on legacy features already.


Members of the PDF/A camp, including one of my software engineers (Sergey), are concerned that dropping the old style xref tables means that no PDF/A-1 file can possibly be PDF/D compliant.  I was looking forward with the hope that PDF/A-2 support would be good enough but they disagree.

So, I've decided to take a step back and allow old style xref tables to exist in PDF/D but with plenty of constraints:
  • only a single xref table (no Prev field in Trailer)
  • no hybrid files (no XRefStm in trailer)
  • no deleted objects (no f type in the xref table except for the first entry)
  • generation numbers always zero
  • only one section (implies consecutive object numbers starting at 1)
These simplifications mean that the end of the PDF file will always look like some variant of this:

xref
0 4
0000000000 65535 f 
0000000009 00000 n 
0000000122 00000 n 
0000000175 00000 n 
trailer
<<
  /Size 4
  /Root 2 0 R
>>
startxref
226
%%EOF

The other valid PDF/D entries in the Trailer are ID, Info and Encrypt.

As with my earlier PDF/D constraints, incremental updates and the dead objects that come with them are eliminated. So is linearization.

There you have it: the minimum required functionality of old style xref tables to make it possible for PDF/A-1 files to be PDF/D compliant.
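
The constraints listed above are mechanical enough to check in a few lines. Here is a Python sketch that takes the tail of a file as text (from the xref keyword onward) and reports violations. It is an illustration under assumptions, not our validator: the trailer is searched as plain text rather than parsed as a dictionary, and the function name check_xref and its messages are invented.

```python
import re

ENTRY = re.compile(r"(\d{10}) (\d{5}) ([nf])")


def check_xref(tail):
    """Return a list of problems found in an old style xref table."""
    lines = [ln.strip() for ln in tail.splitlines() if ln.strip()]
    if not lines or lines[0] != "xref":
        return ["missing xref keyword"]
    if "trailer" not in lines:
        return ["missing trailer keyword"]
    end = lines.index("trailer")

    header = re.fullmatch(r"(\d+) (\d+)", lines[1])
    if header is None or header.group(1) != "0":
        return ["the single section must start at object 0"]

    problems = []
    entries = lines[2:end]
    if len(entries) != int(header.group(2)):
        problems.append("entry count mismatch (more than one section?)")
    for number, line in enumerate(entries):
        m = ENTRY.fullmatch(line)
        if m is None:
            problems.append(f"object {number}: malformed entry")
            continue
        # Object 0 is the only free entry; everything else is in use, generation 0.
        expected = ("65535", "f") if number == 0 else ("00000", "n")
        if (m.group(2), m.group(3)) != expected:
            problems.append(f"object {number}: expected {expected[0]} {expected[1]}")

    trailer = " ".join(lines[end + 1:])
    for key in ("/Prev", "/XRefStm"):
        if key in trailer:
            problems.append(f"trailer must not contain {key}")
    return problems


good = """xref
0 4
0000000000 65535 f 
0000000009 00000 n 
0000000122 00000 n 
0000000175 00000 n 
trailer
<<
  /Size 4
  /Root 2 0 R
>>
startxref
226
%%EOF
"""
print(check_xref(good))
```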

Sunday, February 1, 2009

Parsing PDF/D

We’ve spent a lot of energy optimizing the C++ parser behind the Solid Documents products. Our C# parser has also been a learning experience. Once a PDF parser is well optimized, it ends up spending most of its time parsing numbers. This has proven to be true for both our native and managed code parsers.

Instead of focusing on the parser, I decided to take a look at the other half of the problem: the file format itself. What if I could change the PDF format to make it easier to parse? PDF/D will be constraining the features of PDF to a subset so why not also make some improvements that will make parsing not just faster but also more reliable?

More reliable, you ask?  Yes. Removing multiple ways of doing things obviously has minor performance benefits but the bigger benefit is simplification of the code needed to deal with multiple variations of essentially the same thing.

EOL

PDF defines end-of-line as one or two characters that may be 0x0D, 0x0A or 0x0D followed by 0x0A.  ISO 32000-1 and ISO 19005-1 go to some effort to constrain the end of line characters more tightly surrounding the data of streams.

Why not just define end-of-line as 0x0A and call it good? That would still be 100% PDF/A compliant too. 

WhiteSpace

At the parser-level, whitespace is a special case of a delimiter for PDF. In string objects, it is data and, including UTF-16BE, there are at least 15 valid data whitespace characters. What I’m talking about here is at the parser-level and not the string objects.

While we are getting rid of 0x0d as an end-of-line character, we may as well get rid of a few of the whitespace alternatives too. Who needs tab (0x09) and form feed (0x0C) when space (0x20) and our new end-of-line (0x0A) will do just fine?

Comments

Comments are a pain for the PDF parser developer. They can appear anywhere whitespace is legal and they continue to the next end-of-line. More importantly, who cares? Aside from the pseudo comments used for the PDF file header and end-of-file tokens, comments serve no purpose whatsoever. There are other ways of putting application specific data in PDF files if that was what you were thinking, so let's toss comments out too.

Numbers

The + character will have to go. It adds no value. Most PDF parsers attack numbers first as integers and then switch to a real mode as soon as a decimal point is encountered. Integer parsing is more efficient than real parsing. For this reason, whole numbers should be presented as integers and not as reals. For example, favor 42 over 42. or 42.0
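
On the writer side these rules amount to a small formatting function. The Python below is a sketch of one way to do it, with assumptions of my own: five decimal places of precision, and negative zero written as plain 0. The name format_number is hypothetical.

```python
import math


def format_number(value, places=5):
    """Write a number with no '+' sign, no exponent, and no needless decimal point."""
    if not math.isfinite(value):
        raise ValueError("PDF numbers must be finite")
    text = f"{value:.{places}f}"                # fixed notation, never an exponent
    if "." in text:
        text = text.rstrip("0").rstrip(".")     # 42.00000 -> 42, 0.50000 -> 0.5
    return "0" if text == "-0" else text


print(format_number(42.0), format_number(0.5), format_number(-3.25))
```

Whole numbers come out as integers, so a reader that starts in integer mode never has to switch to the slower real path for them.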

Strings

A lot can be done to simplify string parsing. We can start by removing the escaped end-of-line for allowing multiple line strings. In addition, we can drop the idea of “matched parentheses” and simply escape all parentheses.

Hex strings are useful for representing byte strings as plain text and little else. Hex strings start with the same delimiter as dictionaries making parsing more complex than if they each used unique delimiters: <

Since most PDF files are binary anyway, regular strings can be used to represent byte strings and hex strings are no longer needed.
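
Here is a sketch in Python of what "escape all parentheses" buys the parser: the string ends at the first unescaped closing parenthesis, with no nesting depth to track. One detail is my own addition rather than something proposed above: carriage return and line feed are also escaped, because ISO 32000-1 reads a raw end-of-line inside a literal string as 0x0A, which would corrupt byte strings. The function names are made up, and the reader assumes well-formed input.

```python
ESCAPES = {0x5C: b"\\\\", 0x28: b"\\(", 0x29: b"\\)", 0x0D: b"\\r", 0x0A: b"\\n"}
UNESCAPES = {0x5C: 0x5C, 0x28: 0x28, 0x29: 0x29, ord("r"): 0x0D, ord("n"): 0x0A}


def write_literal_string(data):
    """Encode arbitrary bytes as a literal string with every parenthesis escaped."""
    body = b"".join(ESCAPES.get(byte, bytes([byte])) for byte in data)
    return b"(" + body + b")"


def read_literal_string(src):
    """Decode a string produced by write_literal_string (no nesting, no octal)."""
    out = bytearray()
    i = 1                                   # skip the opening parenthesis
    while src[i] != 0x29:                   # first unescaped ')' ends the string
        if src[i] == 0x5C:                  # backslash: take the next byte as an escape
            i += 1
            out.append(UNESCAPES[src[i]])
        else:
            out.append(src[i])
        i += 1
    return bytes(out)


print(write_literal_string(b"a(b)c"))
```

Since every byte value survives the round trip, this form can carry the binary data that hex strings are used for today.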

Fixed Formats

We should fix the format of the header and end-of-file comments. This way we can search for them as strings rather than parsing. Given \n as 0x0A, something like “%PDF-1.5\n%ÿÿÿÿ\n” for the header and “\n%%EOF\n” for the end-of-file marker should be fine.

In addition, we should lock down the syntax surrounding ‘obj’ and ‘endobj’ identifiers so that repairing of damaged PDF files can be done more reliably. For example, “\nendobj\n\d+ 0 obj\n” makes an easy target for a regular expression search where “\d+” is the object number.
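
To give an idea of how cheap repair becomes with a fixed layout, here is a Python sketch that rebuilds object offsets with one regular expression search over the raw bytes. It assumes the exact layout suggested above (endobj on its own line, followed directly by the next "N 0 obj" line) and it only finds objects that follow another object, so the first object in the file has to be located separately. The name object_offsets is hypothetical.

```python
import re

BOUNDARY = re.compile(rb"\nendobj\n(\d+) 0 obj\n")


def object_offsets(data):
    """Map object number -> byte offset of its 'N 0 obj' line."""
    return {int(m.group(1)): m.start(1) for m in BOUNDARY.finditer(data)}


sample = (b"%PDF-1.5\n1 0 obj\n<<>>\nendobj\n"
          b"2 0 obj\nnull\nendobj\n"
          b"3 0 obj\n(x)\nendobj\n")
print(object_offsets(sample))
```

The offsets recovered this way are exactly what a rebuilt xref table needs.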

So, any feedback or input? More ideas for putting a PDF parser on a diet? Comment here or find contact details at PDF/D.