Sunday, February 1, 2009

Parsing PDF/D

We’ve spent a lot of energy optimizing the C++ parser behind the Solid Documents products. Our C# parser has also been a learning experience. Once a PDF parser is well optimized, it ends up spending most of its time parsing numbers. This has proven to be true for both our native and managed code parsers.

Instead of focusing on the parser, I decided to take a look at the other half of the problem: the file format itself. What if I could change the PDF format to make it easier to parse? PDF/D will be constraining the features of PDF to a subset, so why not also make some improvements that will make parsing not just faster but also more reliable?

More reliable, you ask? Yes. Removing multiple ways of doing things obviously has minor performance benefits, but the bigger benefit is simpler code: the parser no longer has to deal with multiple variations of essentially the same thing.

EOL

PDF defines end-of-line as one or two characters that may be 0x0D, 0x0A or 0x0D followed by 0x0A.  ISO 32000-1 and ISO 19005-1 go to some effort to constrain the end of line characters more tightly surrounding the data of streams.

Why not just define end-of-line as 0x0A and call it good? That would still be 100% PDF/A compliant too. 

WhiteSpace

At the parser-level, whitespace is a special case of a delimiter for PDF. In string objects, it is data and, including UTF-16BE, there are at least 15 valid data whitespace characters. What I’m talking about here is at the parser-level and not the string objects.

While we are getting rid of 0x0D as an end-of-line character, we may as well get rid of a few of the whitespace alternatives too. Who needs tab (0x09) and form feed (0x0C) when space (0x20) and our new end-of-line (0x0A) will do just fine?
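To make the difference concrete, here is a minimal sketch in Python (chosen for brevity; it is not code from either of our parsers). The names PDF_WHITESPACE, PDFD_WHITESPACE and skip_whitespace are made up for illustration, and the reduced set is only the proposal from this post, not anything standardized.

```python
# The six delimiter whitespace bytes in standard PDF: NUL, tab, line feed,
# form feed, carriage return and space.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "

# Hypothetical reduced set proposed in this post: line feed and space only.
PDFD_WHITESPACE = b"\n "


def skip_whitespace(data: bytes, pos: int, whitespace: bytes = PDFD_WHITESPACE) -> int:
    """Return the index of the first byte at or after pos that is not whitespace."""
    while pos < len(data) and data[pos] in whitespace:
        pos += 1
    return pos
```

With the reduced set, a tab is no longer skipped as whitespace, so a stricter parser can reject it outright instead of quietly accepting one more variation.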

Comments

Comments are a pain for the PDF parser developer. They can appear anywhere whitespace is legal and they continue to the next end-of-line. More importantly, who cares? Aside from the pseudo comments used for the PDF file header and end-of-file tokens, comments serve no purpose whatsoever. There are other ways of putting application-specific data in PDF files, if that was what you were thinking, so let’s toss comments out too.

Numbers

The + character will have to go. It adds no value. Most PDF parsers attack numbers first as integers and then switch to a real mode as soon as a decimal point is encountered. Integer parsing is more efficient than real parsing. For this reason, whole numbers should be presented as integers and not as reals. For example, favor “42” over “42.” or “42.0”.
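The integer-first strategy can be sketched roughly as follows. This is an illustration in Python, not our production code; parse_number is a hypothetical name, and rejecting a leading + reflects the proposal above rather than standard PDF, which allows it.

```python
def parse_number(data: bytes, pos: int = 0):
    """Parse a number at data[pos]; return (value, next_pos).

    Digits are accumulated as an integer. Only when a decimal point shows
    up does the parser fall back to the slower real-number path.
    """
    start = pos
    negative = False
    if pos < len(data) and data[pos] == 0x2D:  # '-' (a leading '+' is rejected)
        negative = True
        pos += 1
    value = 0
    digits = 0
    while pos < len(data) and 0x30 <= data[pos] <= 0x39:  # '0'..'9'
        value = value * 10 + (data[pos] - 0x30)
        pos += 1
        digits += 1
    if pos < len(data) and data[pos] == 0x2E:  # '.': switch to real mode
        pos += 1
        frac_start = pos
        while pos < len(data) and 0x30 <= data[pos] <= 0x39:
            pos += 1
        if digits == 0 and pos == frac_start:
            raise ValueError("expected a number")
        return float(data[start:pos].decode("ascii")), pos
    if digits == 0:
        raise ValueError("expected a number")
    return (-value if negative else value), pos
```

Here “42” never leaves the integer loop, while “42.” and “42.0” both pay for the real path only to produce the same whole value.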

Strings

A lot can be done to simplify string parsing. We can start by removing the escaped end-of-line that allows multi-line strings. In addition, we can drop the idea of “matched parentheses” and simply require that all parentheses inside a string be escaped.
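If every parenthesis inside a string is escaped, the parser no longer needs a nesting counter: the string ends at the first unescaped “)”. A minimal sketch, assuming that rule (read_string is a hypothetical name, and the body is returned raw, without decoding the escapes):

```python
def read_string(data: bytes, pos: int):
    """Read a literal string starting at data[pos] == '('.

    Returns (raw_body, next_pos). Assumes the proposed rule that every
    parenthesis inside the string is escaped with a backslash.
    """
    if data[pos:pos + 1] != b"(":
        raise ValueError("expected '('")
    pos += 1
    start = pos
    while pos < len(data):
        byte = data[pos]
        if byte == 0x5C:    # backslash: skip it and the byte it escapes
            pos += 2
        elif byte == 0x29:  # first unescaped ')' closes the string
            return data[start:pos], pos + 1
        else:
            pos += 1
    raise ValueError("unterminated string")
```

For example, read_string(b"(a\\)b) 1", 0) stops at the second “)”, because the first one is preceded by a backslash.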

Hex strings are useful for representing byte strings as plain text and little else. Hex strings start with the same delimiter character as dictionaries (< versus <<), which makes parsing more complex than if each used a unique delimiter.

Since most PDF files are binary anyway, regular strings can be used to represent byte strings and hex strings are no longer needed.

Fixed Formats

We should fix the format of the header and end-of-file comments. This way we can search for them as strings rather than parsing. Given \n as 0x0A, something like “%PDF-1.5\n%ÿÿÿÿ\n” for the header and “\n%%EOF\n” for the end-of-file marker should be fine.
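With both markers fixed, validation reduces to two byte comparisons. A small sketch, assuming the “ÿ” characters above stand for 0xFF bytes and using the example strings from this post (HEADER, TRAILER and has_fixed_markers are illustrative names only):

```python
# Example marker values taken from the text above; 0xFF for each "ÿ" is an assumption.
HEADER = b"%PDF-1.5\n%\xff\xff\xff\xff\n"
TRAILER = b"\n%%EOF\n"


def has_fixed_markers(data: bytes) -> bool:
    """True if data begins with the fixed header and ends with the fixed trailer."""
    return data.startswith(HEADER) and data.endswith(TRAILER)
```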

In addition, we should lock down the syntax surrounding the ‘obj’ and ‘endobj’ keywords so that damaged PDF files can be repaired more reliably. For example, “\nendobj\n\d+ 0 obj\n” makes an easy target for a regular expression search, where “\d+” matches the object number.
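As a rough illustration of that kind of repair scan, here is a Python sketch that searches raw bytes for the boundary pattern. The names OBJ_BOUNDARY and find_object_starts are hypothetical, and the pattern assumes generation number 0 exactly as in the example above.

```python
import re

# One 'endobj' line immediately followed by the next 'N 0 obj' line.
OBJ_BOUNDARY = re.compile(rb"\nendobj\n(\d+) 0 obj\n")


def find_object_starts(data: bytes):
    """Return (object_number, body_offset) for each object that follows an endobj."""
    return [(int(m.group(1)), m.end()) for m in OBJ_BOUNDARY.finditer(data)]
```

Note that the first object in a file follows the header rather than an ‘endobj’, so a real repair tool would need one extra rule to pick it up.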

So, any feedback or input? More ideas for putting a PDF parser on a diet? Comment here or find contact details at PDF/D.
