PDF Objects

A PDF file essentially consists of PDF objects in serialized form; the additional information in the file is just needed to locate and load these objects.

These PDF objects define everything, from the meta data needed for a page to how certain parts of a page are defined as form fields.

Basic Object Types

The PDF specification defines several basic object types and most of them map directly to native Ruby classes:

Booleans

Represented by true and false.

Numerics

Integers like 123 and floats like 123.45.

Strings

Represented by Ruby’s String class and the special HexaPDF::DictionaryFields::PDFByteString class. Strings can be pure ASCII strings, Unicode strings or binary strings. There are two serialization formats: one uses parentheses, e.g. (Test), the other uses angle brackets with hex-encoding, e.g. <54657374>.

Names

Work like symbols in Ruby and are therefore mapped to them. PDF names are serialized by prefixing a slash to the name, e.g. /Name.

Arrays

Represented by Array or HexaPDF::PDFArray and serialized by using brackets around the values, e.g. [123 (Test) /Name].

The HexaPDF::PDFArray class provides, among other things, automatic dereferencing of values (see below).

Dictionaries

Represented by Hash or HexaPDF::Dictionary but can only have name objects as keys. Serialization is done using double angle brackets where each key is followed by its value, e.g. <</Key (Value) /AnotherKey 12345>>.

HexaPDF uses HexaPDF::Dictionary instead of plain hashes where possible because the dictionary class provides various methods that make it much more convenient to use. For example, accessing a value automatically dereferences it, so the referenced indirect object is returned rather than the reference itself (see below).

Null

Represented by nil and serialized as null.

Streams

A byte sequence of potentially unlimited length. Represented by the HexaPDF::Stream class and serialized as a dictionary followed by stream\n...stream bytes...\nendstream. A stream is always an indirect object (see below).

Since the stream data can amount to many mebibytes, the stream data itself is lazily loaded on first access.

Indirect objects

An object of any of the above types that is additionally assigned an object identifier consisting of an object number (a positive integer) and a generation number (a non-negative integer). Represented by the HexaPDF::Object class and serialized by putting the object between OID GEN obj and endobj, like this: 4 0 obj (SomeObject) endobj. Can be referenced in serialized form from another object like this: 4 0 R.

Indirect objects are special in that they don’t define a separate type but allow an object of any other type to be referenced. This reference mechanism allows HexaPDF to provide lazy loading of indirect objects, i.e. only those indirect objects that are actually accessed are loaded.

Sometimes a direct object is also represented by a subclass of HexaPDF::Object (e.g. to work with the object using convenience methods). In such cases the object number 0 is used to indicate that the object is a direct object. Use HexaPDF::Object#indirect? to determine whether an object is direct or indirect.
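The serialized forms listed above can be sketched with a few lines of plain Ruby. This is a toy illustration of the syntax rules only and not HexaPDF's serializer: the names to_pdf, indirect and Reference are made up for this example, binary strings are assumed to be the ones written in hex form, and details such as float formatting or escaping of non-printable characters are ignored.

```ruby
# Toy serializer for the basic object types (illustrative only, NOT HexaPDF's code).

# Hypothetical stand-in for a reference to an indirect object.
Reference = Struct.new(:oid, :gen)

def to_pdf(value)
  case value
  when true, false    then value.to_s
  when nil            then "null"
  when Integer, Float then value.to_s
  when Symbol         then "/#{value}"
  when String
    if value.encoding == Encoding::BINARY
      # Assumption of this sketch: binary strings use the hex form.
      "<#{value.unpack1('H*')}>"
    else
      # Literal form: escape parentheses and backslashes.
      "(#{value.gsub(/[()\\]/) { |c| "\\#{c}" }})"
    end
  when Array
    "[#{value.map { |v| to_pdf(v) }.join(' ')}]"
  when Hash
    entries = value.map do |key, val|
      raise ArgumentError, "dictionary keys must be names" unless key.is_a?(Symbol)
      "#{to_pdf(key)} #{to_pdf(val)}"
    end
    "<<#{entries.join(' ')}>>"
  when Reference
    "#{value.oid} #{value.gen} R"
  else
    raise ArgumentError, "unsupported object type: #{value.class}"
  end
end

# Wraps a serialized object in "OID GEN obj ... endobj".
def indirect(oid, gen, value)
  "#{oid} #{gen} obj #{to_pdf(value)} endobj"
end

puts to_pdf([123, "Test", :Name])
puts to_pdf({Key: "Value", AnotherKey: 12345})
puts indirect(4, 0, "SomeObject")
puts to_pdf({Parent: Reference.new(4, 0)})
```

The case statement mirrors the type list above one to one, which is the point of the mapping: plain Ruby values already carry enough information to choose the serialized form.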

Since most of the PDF object types map perfectly to Ruby classes, working with PDF objects is very easy because you don’t need to do anything special in nearly all cases. As an example, the following code creates a new PDF document, manually assembles a page dictionary and then adds it to the document’s page tree:

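require 'hexapdf'

doc = HexaPDF::Document.new
# Add the page dictionary to the document as an indirect object
page = doc.add({Type: :Page, MediaBox: [0, 0, 100, 100]})
# Content stream: move to (0, 0), draw a line to (100, 100), stroke it
page.contents = "0 0 m 100 100 l S"
# Append the page to the document's page tree
doc.pages << page
doc.write("sample.pdf")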

Note that the HexaPDF::Document#add call actually returns a HexaPDF::Type::Page object and not a simple dictionary, allowing the use of its #contents method. See below for details.

Dictionary Types

While specifying an object as an indirect object gives you access to it from anywhere in the PDF file, the meaning of this indirect object may not be apparent. This is where the dictionary object type comes into play.

The PDF specification uses dictionary objects to describe various dictionary types, like pages, fonts or annotations. It defines each key of such a dictionary type, together with additional information like the allowed object types of the value, a possible default value, the earliest PDF version in which the key is available and, naturally, a description.

These dictionary types are implemented as subclasses of HexaPDF::Dictionary and use HexaPDF::Dictionary::define_field to define the fields described in the PDF specification together with the mentioned meta data about them.

This meta data allows HexaPDF to do things like validating values and mapping objects to more specific classes (see the next section).
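To make the idea of field definitions with attached meta data more concrete, here is a minimal pure-Ruby sketch. It only mimics the concept: ToyDictionary, ToyPage, the Field struct and the reduced set of options (type, required, default, version) are assumptions for this example and do not reproduce the signature or behaviour of HexaPDF::Dictionary::define_field.

```ruby
# Toy model of dictionary types with field meta data (illustrative only).
class ToyDictionary
  Field = Struct.new(:type, :required, :default, :version, keyword_init: true)

  # Each class has its own field table, starting with a copy of its parent's.
  def self.fields
    @fields ||= (superclass.respond_to?(:fields) ? superclass.fields.dup : {})
  end

  # Records the meta data for one key of the dictionary type.
  def self.define_field(name, type:, required: false, default: nil, version: "1.0")
    fields[name] = Field.new(type: type, required: required,
                             default: default, version: version)
  end

  def initialize(value = {})
    @value = value
  end

  # Returns the stored value or, if the key is missing, the field's default.
  def [](name)
    @value.fetch(name) { self.class.fields[name]&.default }
  end

  # Returns a list of problems found by checking values against the meta data.
  def validate
    self.class.fields.flat_map do |name, field|
      value = @value[name]
      if value.nil?
        field.required ? ["#{name} is required"] : []
      elsif Array(field.type).none? { |type| value.is_a?(type) }
        ["#{name} has wrong type #{value.class}"]
      else
        []
      end
    end
  end
end

class ToyPage < ToyDictionary
  define_field :Type, type: Symbol, required: true, default: :Page
  define_field :MediaBox, type: Array, required: true
  define_field :Rotate, type: Integer, default: 0
  define_field :UserUnit, type: Numeric, version: "1.6"
end

page = ToyPage.new({Type: :Page, MediaBox: "oops"})
p page.validate
p page[:Rotate]
```

Because the meta data lives on the class, validation and default handling need no per-type code; a new dictionary type only has to declare its fields.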

Mapping Dictionary Types to Classes

Most of the dictionary types have a special /Type key with which an object can be recognized. For example, the main PDF object, the catalog, has the type /Catalog.

While many dictionary types require the /Type key to be present, sometimes it is optional. And there are also dictionary types that don’t have a /Type key at all. In such cases the dictionary type can be inferred via the object from which it is referenced. For example, the viewer preferences type doesn’t have a /Type key but because it is referenced from the document catalog via the /ViewerPreferences key we know how to interpret it.

This allows HexaPDF to provide an automatic mapping of objects to more specific classes! For example, a page object would normally be represented by HexaPDF::Dictionary. However, since there is a more specific subclass HexaPDF::Type::Page registered for it, this subclass is used.

Internally, this is made possible by HexaPDF::Object not actually storing the (indirect) object’s data but just a HexaPDF::PDFData object that holds everything related to a PDF object. So it doesn’t matter whether a HexaPDF::Dictionary or a HexaPDF::Type::Page object is used as wrapper as long as they use the same HexaPDF::PDFData object. This increases memory usage but the gains are worth it.
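The effect of sharing one data object between several wrappers can be sketched in plain Ruby. The classes below (SketchData, SketchObject, SketchDictionary, SketchPage) are hypothetical stand-ins and not the real HexaPDF classes; only the sharing idea, and the convention that object number 0 marks a direct object, are taken from the text above.

```ruby
# Toy model of wrappers sharing one data object (illustrative only).

# Holds everything that belongs to one PDF object.
SketchData = Struct.new(:value, :oid, :gen)

class SketchObject
  attr_reader :data

  def initialize(data)
    @data = data
  end

  def value
    data.value
  end

  # Object number 0 marks a direct object.
  def indirect?
    data.oid != 0
  end
end

class SketchDictionary < SketchObject
  def [](key)
    value[key]
  end

  def []=(key, new_value)
    value[key] = new_value
  end
end

# A more specific wrapper that only adds convenience methods.
class SketchPage < SketchDictionary
  def media_box
    self[:MediaBox]
  end
end

data = SketchData.new({Type: :Page, MediaBox: [0, 0, 100, 100]}, 4, 0)
generic = SketchDictionary.new(data)
page = SketchPage.new(data)

# A change made through one wrapper is visible through the other.
page[:MediaBox] = [0, 0, 200, 200]
p generic[:MediaBox]
p page.indirect?
```

Since the wrappers hold no state of their own, an object can be re-wrapped in a more specific class at any time without copying or losing data.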

The automatic mapping happens in two places:

HexaPDF::Document#wrap

Whenever an object is loaded from a PDF or manually added through HexaPDF::Document#add, the #wrap method is called. This method contains the logic to map a hash to a concrete dictionary type class, mainly based on the contents of the hash.

HexaPDF::Dictionary#[]

The only required meta data item of a dictionary field definition is the type of the field. When the #[] method is used to retrieve a value, this type information is used to wrap the value in the correct dictionary class. Internally, this is done by the HexaPDF::DictionaryFields::DictionaryConverter module.

How the mapping from dictionary object to concrete implementation class is done can be configured via the global configuration object (see HexaPDF::GlobalConfiguration). The default configuration uses all the classes shipped with HexaPDF. However, you can easily replace a class or add a new mapping by changing the configuration.
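The two mapping paths and the configurable lookup can be illustrated with a small pure-Ruby sketch. All names here (MapDict, MapCatalog, MapViewerPreferences, MapPage, TYPE_MAP, wrap, field_types) are invented for the example; the real logic in HexaPDF::Document#wrap and HexaPDF::Dictionary#[] is more involved, and the real configuration is accessed through HexaPDF::GlobalConfiguration rather than a constant.

```ruby
# Toy model of mapping hashes to dictionary classes (illustrative only).
class MapDict
  attr_reader :value

  def initialize(value)
    @value = value
  end

  # Per-class table: key => class used for the value stored under that key.
  def self.field_types
    @field_types ||= {}
  end

  # Path 2: a nested hash is wrapped according to the field's declared type.
  def [](key)
    item = value[key]
    target = self.class.field_types[key]
    item.is_a?(Hash) && target ? target.new(item) : item
  end
end

class MapViewerPreferences < MapDict; end
class MapPage < MapDict; end

class MapCatalog < MapDict
  # Viewer preferences have no /Type key; the referencing key decides the class.
  field_types[:ViewerPreferences] = MapViewerPreferences
end

# Stand-in for the configurable mapping from /Type values to classes.
TYPE_MAP = {Catalog: MapCatalog, Page: MapPage}

# Path 1: choose the class from the contents of the hash, here only /Type.
def wrap(hash, type_map: TYPE_MAP)
  type_map.fetch(hash[:Type], MapDict).new(hash)
end

catalog = wrap({Type: :Catalog, ViewerPreferences: {HideToolbar: true}})
p catalog.class
p catalog[:ViewerPreferences].class

# Replacing a mapping only requires a different table.
class CustomPage < MapPage; end
custom = wrap({Type: :Page}, type_map: TYPE_MAP.merge(Page: CustomPage))
p custom.class
```

Keeping the mapping in a lookup table instead of hard-coding it in the wrapping logic is what makes it replaceable: swapping the class for a type is a data change, not a code change.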