PDF Document Structure
The PDF specification defines a single entry point into the document structure, the file trailer dictionary, from which all other objects are referenced (n.b.: It is possible to store objects in a PDF file without any reference to them. However, no standard PDF reader would be able to use them). This means that the file trailer can be thought of as the root of a tree of PDF objects.
Although HexaPDF provides abstractions and convenience methods for working with the most important PDF objects, basic knowledge of the structure of a PDF file helps a lot. For in-depth information or for information about parts that are not covered, please consult the relevant parts of the PDF specification.
The file trailer dictionary (implemented by HexaPDF::Type::Trailer) is not really useful for the library user but for the PDF library itself because it contains all the information to properly parse a PDF file, for example, encryption information.
Additionally, it provides access to the document catalog via the
/Root key and to the information
dictionary via the
If you need to access it, use HexaPDF::Document#trailer.
Although the file trailer provides the entry point to all objects, the document catalog (see HexaPDF::Type::Catalog) is the real root of the document tree.
It contains references to all the important parts of a PDF file, for example, the page tree, the objects for interactive form support and the outline.
Additionally, it can be used to specify how the PDF document should be displayed through the keys
The document catalog can be accessed via HexaPDF::Document#catalog.
The page tree is a tree-like object structure that contains references to all the pages of a PDF document.
The PDF specification could have used a simple array with references to the pages instead of the page tree. However, when a PDF document contains many pages and is viewed on a device with limited memory, a tree structure is better suited.
Since the object structure contains several redundant fields to aid in quickly getting the right page object and since these fields need to be in sync, it is not advised to manually alter the structure by inserting or deleting pages. HexaPDF can recover from such modifications but only if explicitly told so through its validation feature.
Because of this complexity the class HexaPDF::Type::PageTreeNode which implements nodes of the tree provides all the necessary convenience methods for adding, retrieving and deleting pages as well as getting the zero-based index of a page.
To make it still easier to work with pages, HexaPDF provides a convenience wrapper
HexaPDF::Document::Pages that can be accessed via HexaPDF::Document#pages. This wrapper allows
you to use standard methods names like
# when working with pages. If you
still want to access the page tree itself, use HexaPDF::Type::Catalog#pages.
For each page in a PDF document exists one page object that holds all the needed information for displaying the page.
The most important information is stored in the following keys:
- Media box (key
Defines the size of the page.
- Content streams (key
Holds references to one or more content streams that define the contents of page.
- Ressource dictionary (key
Contains reference to ressources that may be used by the page, like fonts or images.
There are many other keys for specifying things like page transitions, annotations or actions.
Page pbjects are represented by HexaPDF::Type::Page. This class provides all the necessary convenience methods to work with pages, for example:
The HexaPDF::Type::Page#canvas method gives you access to a HexaPDF::Content::Canvas object for drawing on the page. You could manually assemble the content streams but this is error prone and very tedious - better rely on the canvas object.
The HexaPDF::Type::Page#box method allows you to view or change the various page boxes like the media box.