Friday, February 20, 2009

XMP pdfaValidate Schema

In building our new and improved validator, we decided to use the pdfaExtension schema (and friends) to define every schema we validate, including all the pre-defined schemas. Eating our own dogfood in this way has exposed numerous holes in both the XMP Specification and the PDF/A Specification.


The most obvious hole, which has already been discussed within the PDF/A Competence Center Technical Working Group (TWG), is the loose definition of basic types in XMP. As mentioned earlier in my blog, one example is "Choice of " and "Open Choice of ". Another issue raised in TWG discussions is the ambiguous use of case (seq vs Seq, bag vs Bag, etc).

The XMP Specification makes provision for extending existing Properties with Qualifier Properties that are ignored by applications that are not aware of them. We used this feature and the pdfaValidate schema to extend pdfaProperty and add validation information. When defining the schemas we wish to validate, we now add the following attributes:

status
Description: Used by the validator to flag errors of omission or inclusion, or to raise warnings.
Type: Closed Choice of Text
Values: required|prohibited|deprecated|restricted|recommended|ignored
'deprecated' is similar to 'prohibited', except that validators flag it as a warning rather than an error.
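
As a rough illustration, the status values could be turned into findings along the following lines. This is a minimal Python sketch, not the validator's actual code: the names Severity and check_presence are hypothetical, and the handling of 'recommended' (a warning when absent) and of 'restricted' and 'ignored' (no presence finding) is an assumption, since only 'deprecated' versus 'prohibited' is spelled out above.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    ERROR = "error"
    WARNING = "warning"


# The six values of the closed choice listed above.
STATUS_VALUES = {
    "required", "prohibited", "deprecated",
    "restricted", "recommended", "ignored",
}


def check_presence(status: str, present: bool) -> Optional[Severity]:
    """Return the finding for a property's presence, or None if there is none."""
    if status not in STATUS_VALUES:
        raise ValueError(f"unknown status: {status!r}")
    if status == "required" and not present:
        return Severity.ERROR      # error of omission
    if status == "prohibited" and present:
        return Severity.ERROR      # error of inclusion
    if status == "deprecated" and present:
        return Severity.WARNING    # like 'prohibited', but only a warning
    if status == "recommended" and not present:
        return Severity.WARNING    # assumption: a missing recommendation warns
    return None                    # assumption: 'restricted' and 'ignored' never fire here
```

For example, check_presence("deprecated", True) yields Severity.WARNING, while check_presence("required", True) yields no finding.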

constraint
Description: Regular expression used to constrain "Closed Choice of " values. We still need a way to flag Open vs Closed choices.
Regular expressions must always match the entire input (start with '^' and end with '$'). Other valid constraint values include:
'base64': used, for example, to validate the Thumbnail xapGImg:image property.
Numeric ranges such as '[0,255]', '(0,)', '[-128,127]', etc.
Type: Text
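
The three kinds of constraint value can be told apart by their shape, so a checker might dispatch on them roughly as follows. This is a Python sketch under stated assumptions, not the validator's implementation: the function name satisfies is hypothetical, square brackets are assumed to mean inclusive bounds and parentheses exclusive bounds (the usual interval notation), a missing bound is assumed to mean unbounded, and whitespace inside base64 data is assumed to be allowed.

```python
import base64
import binascii
import re

# Interval syntax such as '[0,255]' or '(0,)'; either bound may be omitted.
_RANGE = re.compile(
    r"([\[(])\s*(-?\d+(?:\.\d+)?)?\s*,\s*(-?\d+(?:\.\d+)?)?\s*([\])])"
)
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def satisfies(constraint: str, value: str) -> bool:
    """Check a property value against one constraint string."""
    # Regular expressions are recognised by their mandatory anchors.
    if constraint.startswith("^") and constraint.endswith("$"):
        return re.fullmatch(constraint, value) is not None

    if constraint == "base64":
        try:
            # Assumption: line breaks inside the encoded data are tolerated.
            base64.b64decode("".join(value.split()), validate=True)
            return True
        except (binascii.Error, ValueError):
            return False

    m = _RANGE.fullmatch(constraint)
    if m is None:
        raise ValueError(f"unrecognised constraint: {constraint!r}")
    opener, low, high, closer = m.groups()
    if _NUMBER.fullmatch(value) is None:
        return False
    x = float(value)
    if low is not None and (x < float(low) or (opener == "(" and x == float(low))):
        return False
    if high is not None and (x > float(high) or (closer == ")" and x == float(high))):
        return False
    return True
```

Under these assumptions, satisfies("[0,255]", "256") and satisfies("(0,)", "0") are both false, while satisfies("[-128,127]", "-128") is true.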

standard
Description: Identifies which specification is violated when constraints are not met.
Type: Closed Choice of Text
Values: pdf|pdfa|pdfd|xmp

clause
Description: The clause of the specification that is violated when constraints are not met.
Type: Text
Value: string, typically dot delimited integers
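
Taken together, standard and clause give a validator enough to say which rule a failing property broke. The Python sketch below shows one way such a finding could be represented and sorted. The Violation class, its field names, and the sample clause numbers used with it are hypothetical and purely illustrative. Because clause is only typically dot delimited integers, anything else is kept as-is and simply sorts first.

```python
import re
from dataclasses import dataclass

# The four values of the 'standard' closed choice listed above.
STANDARDS = {"pdf", "pdfa", "pdfd", "xmp"}
_DOTTED = re.compile(r"\d+(?:\.\d+)*")


@dataclass(frozen=True)
class Violation:
    property_name: str   # hypothetical: the property that failed its constraint
    standard: str
    clause: str

    def __post_init__(self) -> None:
        if self.standard not in STANDARDS:
            raise ValueError(f"unknown standard: {self.standard!r}")

    def clause_key(self) -> tuple:
        """Numeric sort key so that 6.7.10 sorts after 6.7.9."""
        if _DOTTED.fullmatch(self.clause):
            return tuple(int(part) for part in self.clause.split("."))
        return ()  # non-numeric clauses sort first

    def message(self) -> str:
        return f"{self.property_name}: violates {self.standard} clause {self.clause}"
```

Sorting a list of findings with key=Violation.clause_key then reports them in clause order rather than in plain string order.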

We are continuing to work on our full set of these schemas for validation of PDF/A. These will then be available to PDF/D Consortium members. During this process, we may add more features to the pdfaValidate schema.
