Introduction

PDF files are ubiquitous in today’s digital world and HexaPDF provides easy but fully-featured access to all those PDF files. The only thing HexaPDF won’t implement is rendering of PDF documents.

About PDF

PDF, the Portable Document Format, is a file format created by Adobe for representing digital documents independently of applications, operating systems or hardware. It is the de facto standard for digital documents and for their interchange. It can contain not only text and graphics but also annotations, links, form fields, layers, rich media like video and much more.

While the PDF specification started out as a proprietary, though open, document format at Adobe, the PDF 1.7 specification became an ISO standard (32000-1:2008) in 2008. It then took nine years for the next version of the specification, PDF 2.0, to be published in 2017.

Because the original ISO standard was nearly identical to the Adobe PDF 1.7 specification, which was already publicly available at the time, it is one of the few ISO specifications that is freely available to the public on Adobe’s website.

While it was not publicly available from the beginning, the PDF Association has managed to make the PDF 2.0 specification freely available via sponsors. Although it is more evolution than revolution, it is better to get it while it is easily available since it has better and more detailed explanations for a few sections and fixes and corrections for previously underspecified functionality. I suggest getting it for first-hand knowledge about PDF topics.

You will find that the API documentation has many references to applicable sections of the PDF 2.0 specification. Having the specification at hand will therefore allow you to dive deeper into a certain topic.

HexaPDF API Design

HexaPDF was designed with ease of use and performance in mind. To this end the API follows some guidelines:

Apart from these guidelines concerning the API, care has been taken to make sure that HexaPDF performs well and doesn’t use much memory. Most parts of HexaPDF are therefore already well optimized, and various benchmarks ensure that HexaPDF keeps getting faster over time.

The library is also thoroughly tested with 100% code coverage.

General Usage Pattern

As stated above you will only need to remember the class name HexaPDF::Document for creating a new document or loading an existing one:

doc = HexaPDF::Document.new
# or
doc = HexaPDF::Document.open(pdf_file)

You can optionally set some configuration options when instantiating the main class or later via the HexaPDF::Document#config method. The configuration options allow you to fine-tune internal behaviour to your liking. For example, by default HexaPDF is quite forgiving when it comes to corrupt or invalid PDF files and can handle or recover from many of them. You can change this through the appropriate configuration options, such as parser.on_correctable_error.

Next you work with the document: add, delete or change pages, handle annotations, fill out or create interactive forms and much more.

The final step is to write out the document:

doc.write('output.pdf', optimize: true)

Optimizing the resulting file is optional but highly recommended because it produces considerably smaller PDF files. The default optimization should be fine for most cases. However, if you need more control, you can invoke the task HexaPDF::Task::Optimize yourself before writing out the document.

Additionally, before writing out the document, the validation routine HexaPDF::Document#validate is called to validate and possibly auto-correct problems. Disabling it is not advised. Uncorrectable validation problems lead to an exception. If you want to handle this part yourself, e.g. by customizing your reaction to validation problems, pass validate: false to HexaPDF::Document#write and invoke #validate yourself before writing.

The smallest HexaPDF application which writes out a minimal PDF is:

HexaPDF::Document.new.write('output.pdf')

HexaPDF doesn’t automatically add any content to a newly created document, not even a page. However, if you look at the resulting PDF you will see that it has a single, blank page. This is because a PDF needs at least one page to be valid, and the validation routine ensures that.