Pragmatic PDF

Wednesday, November 17, 2010

Solid Documents technology included in Acrobat X

Adobe has licensed Solid Framework SDK for Adobe® Acrobat® X.

Adobe Acrobat X takes advantage of Solid Documents’ PDF to Word and Excel conversion capabilities, allowing Acrobat X users to easily reuse and repurpose PDF content.

"After reviewing the available options, we chose to use Solid Framework technology for the conversion of PDF files to Microsoft® Word and Excel in Adobe® Acrobat® X. The document reconstruction quality is very good and the Solid Documents team has been a pleasure to work with on this project," said Aman Deep Nagpal, Senior Product Manager, Acrobat Solutions, at Adobe.

Wednesday, April 28, 2010

PDF to Word for the Mac

We reached a major milestone at Solid Documents today with the release of Solid PDF to Word for Mac.
For the last 8 months our engineering group has been hard at work porting our core technology first to 64-bit and then to OSX. Although the UI of the Mac product is Cocoa (Objective-C++), the underlying engine is our new and improved portable Solid Framework Nucleus. This is our first application that uses our SDK the same way an SDK customer would.

In our NUnit-like automated test environment we run Solid Framework managed code using Mono on OSX. The C# ecosystem provides an excellent cross-platform experience.

Here is a sneak preview of some of the significant improvements to Solid Framework that you can expect to see soon in the v7 release. This list is not exhaustive .. we like surprises ..

Single DLL
We now wrap all the native code inside a single managed assembly called SolidFramework.dll for ease of deployment. The native code is automatically extracted on first run and integrators have some flexibility regarding this thanks to SolidFramework.Configuration.Installer.
64-Bit
On both Windows and OSX we now support x64 and x86 native code. If you build your C# project as Any CPU, it will automatically use the correct native code for the current platform.
Single Threaded
The portable subset (Solid Framework Nucleus) that we used to provide the conversion functionality in Solid PDF to Word for Mac is single threaded. This is really helpful when building enterprise applications or in a server environment.
Office Open XML
Solid Framework now creates .docx and .xlsx without needing Office to be present. Also useful in server environments. These formats are now understood by the majority of word processors including Word 2003/2007/2010, Corel WordPerfect, Open Office, Google Docs and iWork Pages.
Geometric NSE
A new mechanism to resolve "non-standard encoding" issues for glyph to character mapping has been developed which does not rely on OCR. Once again, a useful change for when you don't want to depend on Office being present.

Saturday, September 19, 2009

PDF/A: what's new in Solid PDF Tools v6

Last week Solid Documents released an upgrade to Solid PDF Tools. In a nutshell, with Solid PDF Tools you can:

convert from PDF to Word
export tables from PDF to Excel
scan directly to Word
edit PDF files (page manipulation, text touchup, etc.)
validate PDF/A
convert PDF to PDF/A
create structured PDF files from Office applications

For a complete list visit the Solid PDF Tools features page.

With version 6, the product now exports PDF/A validation and conversion reports as per the specifications from the PDF/D Consortium. The validator and converter have also been greatly improved to:

provide much improved support for XMP validation
pass 100% of the Bavaria Test Suite cases (v5 already passed the Isartor cases)

With the work we've done to improve our PDF/A technology, we think version 6 is now one of the best PDF/A tools on the market. This PDF/A functionality is also available to both .NET and C++ developers through our Solid Framework SDK product.

Thursday, September 10, 2009

RDF for PDF/A-1 Predefined XMP Schemas Updated

Today we shared the latest update of the pdfa.rdf (now 1.1) schema used by the Solid Documents PDF/A Validator and PDF to PDF/A Converter.

At over 2500 lines long this is probably the largest use of the PDF/A extension schema definitions on the planet. Our own pdfaValidate schema has been updated too. It now includes some new properties ('default', 'subst', 'predefined' and 'count') to help us use this RDF schema to build our data-driven PDF/A XMP validator.

Feel free to use this data to build your own PDF/A XMP validator but remember to give back: if you have corrections or improvements, please share them with the PDF/A community. Better still, join the PDF/D Consortium.

Friday, July 24, 2009

Info Dictionary vs XMP Metadata

The PDF/A-1 specification goes to great lengths to describe a mapping between the Entries in the legacy PDF Document Information Dictionary and their corresponding values in the more modern Document XMP Metadata. Section 3.4 in TechNote 0003 describes how the values in the Document Information Dictionary must be mirrored in XMP. Section 3.3 describes the requirements for Document Information Entries.

However, these requirements are not symmetric. What I mean by this is that it is perfectly legal for a PDF/A document to contain Document XMP Metadata and not to include a Document Information Dictionary.
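
The asymmetry can be sketched in a few lines of Python. This is a simplified illustration, not our validator: the function name is hypothetical, the metadata is modelled as plain dictionaries of strings, and it ignores the real-world wrinkles (date formats differ between the two structures, dc:title is a language alternative, dc:creator is a sequence). The key-to-property table follows the mapping in the PDF/A-1 specification as I understand it.

```python
# Info dictionary entry -> corresponding XMP property (PDF/A-1 mapping).
INFO_TO_XMP = {
    "Title": "dc:title",
    "Author": "dc:creator",
    "Subject": "dc:description",
    "Keywords": "pdf:Keywords",
    "Creator": "xmp:CreatorTool",
    "Producer": "pdf:Producer",
    "CreationDate": "xmp:CreateDate",
    "ModDate": "xmp:ModifyDate",
}


def mirror_problems(info, xmp):
    """List Info dictionary entries that are not mirrored in XMP.

    info: dict of Info dictionary entries, or None if the file has none.
    xmp:  dict of XMP property name -> string value (a simplification).
    """
    if info is None:
        # No Info dictionary at all: nothing needs mirroring, so XMP-only
        # documents pass this particular check.
        return []

    problems = []
    for key, value in info.items():
        prop = INFO_TO_XMP.get(key)
        if prop is None:
            continue  # entry not covered by the mapping table
        if prop not in xmp:
            problems.append(f"{key} has no matching {prop} in XMP")
        elif xmp[prop] != value:
            problems.append(f"{key} differs from {prop}")
    return problems
```

Note that the check only runs in one direction: every mapped Info entry needs an XMP counterpart, but an XMP property without an Info entry is never reported.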

Keeping the entries of the legacy and the more modern structures in sync is a headache for the software developer and this pursuit is littered with ambiguous scenarios. For example, many of the XMP Metadata fields can have multiple values: multiple dc:title values for multiple languages, or a seq of multiple authors for dc:creator rather than a single author. Each Entry in the legacy Document Information Dictionary, by contrast, is a simple string. There are no conventions on how to order or delimit these strings when mapping multiple fields from XMP to these single string values.
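
A tiny Python sketch shows why the missing delimiter convention matters. The author names and both helper functions are invented for the illustration; nothing here comes from a specification.

```python
def flatten(values, delimiter):
    """Collapse a multi-valued XMP field into one Info dictionary string."""
    return delimiter.join(values)


def unflatten(text, delimiter):
    """Try to recover the original values from the flattened string."""
    return text.split(delimiter)


# Made-up dc:creator sequence with two authors in "Last, First" form.
authors = ["Smith, John", "Doe, Jane"]

for delimiter in (", ", "; "):
    flat = flatten(authors, delimiter)
    recovered = unflatten(flat, delimiter)
    status = "round-trips" if recovered == authors else "ambiguous"
    print(repr(flat), "->", recovered, status)
```

With a comma delimiter the two authors come back as four names. A semicolon happens to work for this data, but only until a value contains a semicolon itself, and a reader of the Info dictionary has no way of knowing which delimiter the writer chose.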

The solution is simple: don't use Document Information Dictionaries! Accept that Document XMP Metadata is the way forward and move on.

We'll be adding this to PDF/D as a constraint: the Info dictionary will be illegal in PDF/D - legacy software be damned.

Wednesday, July 1, 2009

Solid Framework v6 includes PDF/A Validation

It was a quiet month here at Pragmatic PDF central. We've been hard at work on the finishing touches of our Solid Framework v6 upgrade.

Major changes include:

new enterprise license model (in addition to republisher model)
PDF/A Validation
PDF to PDF/A Conversion
PDF to flowing HTML conversion
support for 64 bit Windows

We also have a much more elaborate set of sample code than before in the form of some free applications. Solid PDF Navigator is 100% free and illustrates what can be achieved with the Free license of SolidFramework. Solid PDF Mechanic uses the new free Developer license to allow exploration of all the premium Solid Framework features. All features are fully functional but include watermarks and "not for resale" text. To take advantage of either the Free or Developer license, simply download Solid Framework and start using it immediately.

PDF viewer including Page Pane and standard navigation controls similar to Acrobat Reader.

Explorer view allows navigation and examination of PDF internals.

Monday, May 11, 2009

PDF: right up there with COBOL

And this is a good thing.

PDF is an amazing document format. It is both backward and forward compatible:

PDF 1.0 files from 1993 can still be perfectly understood by today's PDF tools
PDF files created by today's tools can still be viewed by older PDF software

Could this be one of the reasons, along with technological soundness, why PDF is ubiquitous?

What other parts of our industry can claim such success without leaving data or customers behind every three years after "upgrade season"?

Not Google

A few years back Google offered a very simple Google SOAP Search API to allow 3rd parties to easily use the Google search engine to add native search to their websites. By native, I mean no ads from Google and 100% custom UI. We used this API as a quick fix to get search on the Solid Documents web site. In 2006, Google "deprecated" this API and required web developers to migrate to their new and improved AJAX version of the same thing. In August 2009, the API will cease to function altogether.

To be fair, the service was free. However, that's supposed to be the benefit of going with Google rather than Microsoft. It is hardly a benefit if they pull the rug out from under you. The least they could have done was provide some sort of legacy wrapper for the new API.

If you cannot rely on an API to exist for the life of your business, then it would be foolish to build your infrastructure on it. Luckily, search was a cheap way for us to learn to steer well clear of any "enterprise" offerings from Google in future. No, we will not be using Google Apps (the "enterprise" version of Gmail plus Google Docs). And we certainly will not be building anything using the Google App Engine. I don't care how cool it is: I'm willing to bet that your app will no longer be running 10 years from now. This blog uses a free service acquired by Google. Hmm....

Not Microsoft

What set me off on this tirade was our hosted Exchange upgrade this week. We drank the Kool-Aid and outsourced 'generic' parts of our IT including our e-mail. This week our hosting provider upgraded us from Exchange 2003 to Exchange 2007.

On the positive side, they didn't lose my e-mail. However, the transition has been anything but smooth. It included instructions like clearing your BlackBerry to 'out of box' state. In other words, assuming that the only thing you do with your BlackBerry is use it as a client for their e-mail server. Most people I know have at least one other app that they regularly use on their BlackBerry ("telephone" anyone?). So, plenty of time was wasted backing up and restoring address books and re-installing 3rd party applications.

Pretty much the only thing that worked after the transition was e-mail. One of the primary reasons we originally switched from our own simple open source e-mail server to Exchange was to take advantage of collaborative features of Outlook like shared calendars and address books. None of that worked after the transition.

If it ain't broke ..

.. don't fix it! One of the key features expected from any "Enterprise Solution" should be longevity. Just like railways and roads, one should expect a bit of maintenance over the lifetime of the tool but one does not expect to have to toss the whole thing out and replace it every 4 years. Some of the open source projects deal with this issue a little better but that's not all roses either: anyone else remember the upgrade to PHP 5 or is it just me?

I understand that sometimes you need to throw out the legacy to make progress. Shutting down analog TV in the US is a great example of this. However, when it comes to expectations for enterprise business solutions, 4 years is a very low bar. For Exchange, part of the blame goes to Apptix and part to Microsoft:

When I look for Exchange 2003 on Microsoft's site I'm redirected to the Exchange 2010 pages. You have to dig deep on technet to find 2003 info. Even then, it is not clear how long Microsoft intends to support it.
Apptix should have offered the 2007 migration as an option rather than a compulsory disruption to all of their clients and their businesses. Part of their plan should have been to keep running Exchange 2003 for Luddites like me. Remind me again what the benefit of the 2007 upgrade was?

In the event that breaking changes to an API, file format or service are unavoidable, a responsible enterprise service provider will provide a smooth transition path to their customers.

Back to Solid PDF

Aside from one small change in the way table reconstruction worked in a very early version of Solid Converter PDF, the publicly exposed APIs of our SDK have remained constant for 7 years now. That first minor change we made taught us our lesson: even as we've migrated from a COM SDK to our more recent .NET Solid Framework, we've taken great care to avoid breaking customer apps that rely on our older APIs.

When we released Solid Script, our command line syntax for our desktop applications had to change but we offered a legacy wrapper that translates old command lines into the newer scripts. Even this is not a big issue though since the software we created 7 years ago still works just as well as it did the day it was purchased. No forced upgrades due to changing file formats or 'deprecated' APIs.

When PDF/A was announced in 2005 we immediately recognized the value this added to an already awesome file format and decided to make archiving functionality one of the pillars of our business. The PDF/A standard underlines the already proven long term vision we have for both customer documents and PDF products:

Think 40 years, not 4 years
Think incremental non-breaking improvements, not disruptive change

Wouldn't it be grand if the bigger players had a similar definition of long term? With all the focus today on sustainability and conservation, why do they continue to waste our time, money and energy?