PDF Document Structure

The PDF specification defines a single entry point into the document structure, the file trailer dictionary, from which all other objects are referenced. (A PDF file may also contain objects that are not referenced from anywhere, but no standard PDF reader would be able to use them.) The file trailer can therefore be thought of as the root of a tree of PDF objects.

Although HexaPDF provides abstractions and convenience methods for working with the most important PDF objects, basic knowledge of the structure of a PDF file helps a lot. For in-depth information or for information about parts that are not covered, please consult the relevant parts of the PDF specification.

File Trailer

The file trailer dictionary (implemented by HexaPDF::Type::Trailer) is of interest mainly to the PDF library itself rather than to the library user, because it contains the information needed to properly parse a PDF file, for example, encryption information.

Additionally, it provides access to the document catalog via the /Root key and to the information dictionary via the /Info key.

If you need to access it, use HexaPDF::Document#trailer.

Document Catalog

Although the file trailer provides the entry point to all objects, the document catalog (see HexaPDF::Type::Catalog) is the real root of the document tree.

It contains references to all the important parts of a PDF file, for example, the page tree, the objects for interactive form support and the outline.

Additionally, it can be used to specify how the PDF document should be displayed through the keys /ViewerPreferences, /PageLayout and /PageMode.

The document catalog can be accessed via HexaPDF::Document#catalog.

Page Tree

The page tree is a tree-like object structure that contains references to all the pages of a PDF document.

The PDF specification could have used a simple array with references to the pages instead of the page tree. However, when a PDF document contains many pages and is viewed on a device with limited memory, a tree structure is better suited.
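To illustrate the idea, here is a simplified, hypothetical model of a page tree in plain Ruby. It is not HexaPDF's implementation, and the names `Leaf` and `Node` are made up for this sketch. Each node caches the number of pages below it, analogous to the /Count entry of a page tree node, so a lookup can skip whole subtrees without visiting their pages.

```ruby
# Hypothetical stand-in for a page object.
Leaf = Struct.new(:name)

# Hypothetical stand-in for a page tree node.
class Node
  attr_reader :kids, :count

  def initialize(kids)
    @kids = kids
    # Cached number of pages below this node, analogous to /Count.
    @count = kids.sum { |kid| kid.is_a?(Node) ? kid.count : 1 }
  end

  # Returns the page at the given zero-based index or nil if out of range.
  def page(index)
    return nil if index < 0 || index >= count

    kids.each do |kid|
      size = kid.is_a?(Node) ? kid.count : 1
      if index < size
        # The page is inside this kid: descend or return the leaf itself.
        return kid.is_a?(Node) ? kid.page(index) : kid
      end
      # The page is further right: skip the whole subtree.
      index -= size
    end
    nil
  end
end

tree = Node.new([
  Node.new([Leaf.new('p1'), Leaf.new('p2')]),
  Leaf.new('p3'),
  Node.new([Leaf.new('p4'), Leaf.new('p5'), Leaf.new('p6')]),
])

fifth = tree.page(4) # the leaf named "p5"
```

The cached counts are also what makes manual edits risky: inserting a leaf without updating every ancestor's count would make lookups return the wrong page.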

Since the object structure contains several redundant fields that help locate the right page object quickly, and since these fields need to be kept in sync, manually altering the structure by inserting or deleting pages is not advised. HexaPDF can recover from such modifications, but only if explicitly told to do so through its validation feature.

Because of this complexity, the class HexaPDF::Type::PageTreeNode, which implements the nodes of the tree, provides all the necessary convenience methods for adding, retrieving and deleting pages as well as for getting the zero-based index of a page.

To make working with pages even easier, HexaPDF provides a convenience wrapper, HexaPDF::Document::Pages, that can be accessed via HexaPDF::Document#pages. This wrapper allows you to use standard method names like #add, #delete and #[] when working with pages. If you still want to access the page tree itself, use HexaPDF::Type::Catalog#pages.

Page

For each page in a PDF document, there is one page object that holds all the information needed for displaying the page.

The most important information is stored in the following keys:

Media box (key /MediaBox)

Defines the size of the page.

Content streams (key /Contents)

Holds references to one or more content streams that define the contents of the page.

Resource dictionary (key /Resources)

Contains references to resources that may be used by the page, like fonts or images.

There are many other keys for specifying things like page transitions, annotations or actions.

Page objects are represented by HexaPDF::Type::Page. This class provides all the necessary convenience methods to work with pages.

To access an existing page object you can use HexaPDF::Document::Pages#[] with a zero-based index; to add a new one use HexaPDF::Document::Pages#add.