PDF Objects

A PDF file essentially consists of PDF objects in serialized form; the additional information in the file is just needed to locate and load these objects.

These PDF objects define everything, from the metadata needed for a page to how certain parts of a page are defined as form fields.

The PDF specification defines several basic object types and most of them map directly to native Ruby classes:

Booleans

Represented by true and false.

Numerics

Integers like 123 and floats like 123.45.

Strings

Represented by Ruby’s String class and the special HexaPDF::DictionaryFields::PDFByteString class. Strings can be pure ASCII strings, Unicode strings or binary strings. There are two serialization formats: one uses parentheses, e.g. (Test), the other uses angle brackets with hex-encoding, e.g. <54657374>.

Names

Work like symbols in Ruby and are therefore mapped to them. PDF names are serialized by prefixing a slash to the name, e.g. /Name.

Arrays

Represented by Array and serialized by using brackets around the values, e.g. [123 (Test) /Name].

Dictionaries

Represented by Hash or HexaPDF::Dictionary but can only have name objects as keys. Serialization is done using double angle brackets where each key is followed by its value, e.g. <</Key (Value) /AnotherKey 12345>>.

Note that it is better to use the HexaPDF::Dictionary class instead of a plain hash because it provides various convenience methods. For example, accessing a value automatically dereferences it, so that the referenced indirect object is returned rather than the reference itself (see below).

Null

Represented by nil and serialized as null.

Streams

A byte sequence of potentially unlimited length. Represented by the HexaPDF::Stream class and serialized as a dictionary followed by stream\n...stream bytes...\nendstream. A stream is always an indirect object (see below).

Since the stream data can amount to many mebibytes, the stream data itself is lazily loaded on first access.

Indirect objects

An object of any of the above types that is additionally assigned an object identifier consisting of an object number (a positive integer) and a generation number (a non-negative integer). Represented by the HexaPDF::Object class and serialized by putting the object between OID GEN obj and endobj, e.g. 4 0 obj (SomeObject) endobj. Can be referenced in serialized form from another object like this: 4 0 R.

Indirect objects are special in that they don’t define a separate type but allow an object of any other type to be referenced. This reference mechanism allows HexaPDF to provide lazy loading of indirect objects, i.e. only those indirect objects that are actually accessed are loaded.

Sometimes a direct object is also represented by a subclass of HexaPDF::Object (e.g. to work with the object using convenience methods). In such cases the object number 0 is used to indicate that the object is a direct object.
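The serialized forms listed above can be illustrated with a small toy serializer. This is only a sketch in plain Ruby and not HexaPDF's own serializer: the names pdf_serialize, pdf_hex_string, pdf_indirect and Ref are made up for this example, and details such as string encodings, name escaping and float formatting are ignored.

```ruby
# Hypothetical stand-in for a reference to an indirect object (not a HexaPDF class).
Ref = Struct.new(:oid, :gen)

# Maps plain Ruby values to their PDF syntax (toy version, assumptions as stated above).
def pdf_serialize(obj)
  case obj
  when Ref then "#{obj.oid} #{obj.gen} R"
  when true, false, Integer, Float then obj.to_s
  when nil then "null"
  when Symbol then "/#{obj}"
  when String
    # Literal string form: escape backslashes and parentheses.
    "(" + obj.gsub(/[\\()]/) { |c| "\\#{c}" } + ")"
  when Array
    "[" + obj.map { |v| pdf_serialize(v) }.join(" ") + "]"
  when Hash
    # Keys are expected to be symbols, i.e. PDF names.
    "<<" + obj.map { |k, v| "#{pdf_serialize(k)} #{pdf_serialize(v)}" }.join(" ") + ">>"
  else
    raise ArgumentError, "unsupported object: #{obj.inspect}"
  end
end

# Alternative hex-encoded form for strings.
def pdf_hex_string(str)
  "<" + str.unpack1("H*") + ">"
end

# Wraps a value as an indirect object definition.
def pdf_indirect(oid, gen, value)
  "#{oid} #{gen} obj\n#{pdf_serialize(value)}\nendobj"
end

puts pdf_serialize([123, "Test", :Name])                   # [123 (Test) /Name]
puts pdf_serialize({Key: "Value", AnotherKey: 12345})      # <</Key (Value) /AnotherKey 12345>>
puts pdf_hex_string("Test")                                # <54657374>
puts pdf_serialize({Parent: Ref.new(4, 0), Count: nil})    # <</Parent 4 0 R /Count null>>
```

The point of the sketch is that each basic PDF type has exactly one natural Ruby counterpart, which is why no special wrapper objects are needed for most values.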

Since most of the PDF object types map perfectly to Ruby classes, working with PDF objects is very easy because you don’t need to do anything special in nearly all cases. As an example, the following code creates a new PDF document, manually assembles a page dictionary and then adds it to the document’s page tree:

require 'hexapdf'

doc = HexaPDF::Document.new
page = doc.add({Type: :Page, MediaBox: [0, 0, 100, 100]})
page.contents = "0 0 m 100 100 l S"
doc.pages << page
doc.write("sample.pdf")

Note that the HexaPDF::Document#add call actually returns a HexaPDF::Type::Page object and not a simple dictionary, allowing the use of its #contents method. See below for details.

PDF Types

While specifying an object as indirect object gives you access to it from anywhere in the PDF file, the meaning of this indirect object may not be apparent. This is where the PDF types come into play.

The PDF specification uses dictionary objects to describe various PDF types, like pages, fonts or annotations. Most of these types have a special /Type key with which an object can be recognized. For example, the main PDF object, the catalog, has the type /Catalog.

While many PDF types require the /Type key to be present in an object, sometimes it is optional. And there are also PDF types that don’t have a /Type key at all. In such cases the PDF type of an object can be inferred via the object from which it is referenced. For example, the viewer preferences type doesn’t have a /Type key but because it is referenced from the document catalog via the /ViewerPreferences key we know how to interpret it.

Mapping PDF Types to Classes

HexaPDF uses HexaPDF::Dictionary instead of plain hashes where possible. One reason is the added convenience methods. Another reason is the automatic mapping of PDF objects to specific subclasses of HexaPDF::Dictionary.

For example, a page object is a PDF dictionary and would normally be represented by HexaPDF::Dictionary. However, since there is a more specific subclass HexaPDF::Type::Page registered for it, this subclass is used.

Internally, this is made possible by HexaPDF::Object not actually storing the (indirect) object’s data but just a HexaPDF::PDFData object that holds everything related to a PDF object. So it doesn’t matter whether a HexaPDF::Object or a HexaPDF::Type::Page object is used as wrapper as long as they use the same HexaPDF::PDFData object. This increases memory usage but the gains are worth it.
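The idea of several wrappers sharing one data object can be sketched in plain Ruby. The classes below (ObjectData, GenericWrapper, PageWrapper) are hypothetical stand-ins invented for this illustration; they are not HexaPDF's actual classes and leave out everything except the sharing itself.

```ruby
# Holds everything that belongs to one PDF object (stand-in for the data holder).
ObjectData = Struct.new(:oid, :gen, :value)

# Generic wrapper: only delegates to the shared data object.
class GenericWrapper
  attr_reader :data

  def initialize(data)
    @data = data
  end

  def [](key)
    @data.value[key]
  end

  def []=(key, val)
    @data.value[key] = val
  end
end

# More specific wrapper adding a convenience method.
class PageWrapper < GenericWrapper
  def media_box
    self[:MediaBox]
  end
end

data = ObjectData.new(1, 0, {Type: :Page, MediaBox: [0, 0, 100, 100]})
generic = GenericWrapper.new(data)
page = PageWrapper.new(data)

# A change made through one wrapper is visible through the other
# because both point to the same data object.
generic[:Rotate] = 90
puts page[:Rotate]            # 90
puts page.media_box.inspect   # [0, 0, 100, 100]
```

Because no wrapper owns the data, an object can be re-wrapped in a more specific class at any time without copying anything.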

These specific classes use HexaPDF::Dictionary::define_field to define the fields described in the PDF specification as well as some metadata about them. One metadata item is the type of the field and this information is used, among other things, to provide the automatic mapping when using HexaPDF::Dictionary#[] and HexaPDF::Document#wrap.

How the mapping is done can be configured via the global configuration object (see HexaPDF::GlobalConfiguration). The default configuration uses all the classes shipped with HexaPDF. However, you can easily replace a class or add a new mapping by changing the configuration.