Analysing PDFs
This how-to guide shows how to use the hexapdf inspect command to analyse PDF files.
How hexapdf inspect Works
The hexapdf command line tool comes with a variety of sub-commands, one of them being the inspect command. This command lets you inspect and analyse a PDF file.
The default mode is the interactive mode, which is used when no command line arguments besides the PDF file are given. The interactive mode loads the PDF file once and allows running one or more commands against it, which is useful if the PDF file in question is huge.
If additional arguments are provided on the command line, they are interpreted as interactive mode commands and executed.
Let’s see those two ways of running hexapdf inspect in action:
- First we invoke the interactive mode and run three commands: h for showing the help, pages for displaying all pages and quit for quitting the interactive mode.
- Then we use command line arguments to run the same three commands (although quit wouldn’t be necessary). Note how we have to add a semicolon (escaped because otherwise it would be interpreted by the shell) after the pages command as command separator because otherwise it would use q as argument.
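To make the escaping rule concrete, here is a minimal sketch of what the shell hands to a program when the semicolon is escaped. The show_args function is a hypothetical stand-in for the hexapdf executable so that the example runs without HexaPDF installed, and the argument list is an assumption reconstructed from the description above, not a verbatim copy of the original session.

```shell
# Hypothetical stand-in for the hexapdf executable: prints every argument
# it receives, one per line, so we can see what the program would get.
show_args() { printf '[%s]\n' "$@"; }

# Escaped semicolon: the shell passes ";" through as an ordinary argument,
# so the tool can treat it as the separator between `pages` and `q`.
show_args inspect hello-world.pdf h pages \; q

# Unescaped semicolon (left commented out): the shell would end the command
# after `pages` and then try to run `q` as a separate program.
# show_args inspect hello-world.pdf h pages ; q
```

The first call prints six lines, with `[;]` on a line of its own between `[pages]` and `[q]`, which shows that the separator reaches the program only when it is escaped.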
Those are all the basics you need. Let’s dive into various usage scenarios and explore the inspect command’s functionality.
Traversing the PDF Tree Structure
The key topic on PDF’s document structure explains the main structures used in PDF files. Basic knowledge of the places where certain information can be found is useful when analysing PDF files.
The trailer, object, catalog and stream commands can help you navigate through this tree structure by showing the objects in question. Let’s see them in action:
- The trailer or t command shows the trailer dictionary. This dictionary may have an associated object identifier but since that is not always the case this special command is available. Interesting objects in the trailer dictionary are the /Info dictionary with basic metadata, the /Encrypt dictionary with encryption information and the catalog dictionary under the /Root key containing the main PDF objects.
- Next we use the short form of the object command by just providing the object number referenced in the /Root entry. This will show us the catalog dictionary. We then use the catalog command to show the same object. So if you want to start at the catalog, it’s faster to use the provided catalog command.
- Finally we navigate through the page tree, which only contains one page, and then show the contents of the page (more on that below).
The sample PDF used here is very simple; in typical PDFs the catalog dictionary contains many more keys for various things like annotations or form fields.
Another way to view the tree structure of a PDF is to use the recursive command. This command will not only show the requested object but also all objects referenced from it. Note that this will often show you the whole document tree even when only one page is requested, since page nodes have a reference to their parent. See Comparing PDF File Structure below for details.
Analysing Pages
In the last section you got a glimpse of inspecting pages by navigating from the catalog dictionary. Since inspecting pages is often necessary, the special pages command exists. It will display all page object identifiers in the correct order, together with the object identifiers for their content streams:
- In the example you can see that the PDF file has one page and that the page has one content stream. Using two object commands at once we show the respective objects.
- From the page object we see that the page has size A4 (from the /MediaBox key, knowing that the numbers represent PDF points with 72 points per inch) and that it references (but may not actually use) one font (in the /Resources → /Font dictionary).
- The stream object itself is rather plain; the interesting part is the stream contents. A page’s content stream is written with the same syntax as a PDF file and contains the render instructions for the viewer.
  Since we use a very simple PDF as example (see the tutorial “Creating a PDF Document from Scratch” for how it was created), deciphering the instructions is not that hard:
  - The first line selects a font from the resource dictionary and sets the font size to 50 points.
  - The second line sets the “text color” (actually the fill color) to an RGB value of (0, 128, 255).
  - The next four lines draw the text “Hello World” at the location (150, 396).
Page objects often contain much more information, for example references to image resources, annotations or form fields.
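The unit conversions behind these observations can be checked with a small sketch. It assumes the /MediaBox of the sample page is 595 by 842 points (the usual rounded A4 values; the exact numbers are not stated in this guide), and the helper names pt_to_mm and rgb_to_pdf are hypothetical. Colour components in a content stream are written as numbers between 0 and 1, so an 8-bit value has to be divided by 255.

```shell
# Convert PDF points to millimetres (72 points per inch, 25.4 mm per inch).
pt_to_mm() { awk -v pt="$1" 'BEGIN { printf "%.1f\n", pt / 72 * 25.4 }'; }

# Convert an 8-bit colour component (0-255) to the 0-1 range used by
# colour operators in a content stream.
rgb_to_pdf() { awk -v c="$1" 'BEGIN { printf "%.3f\n", c / 255 }'; }

pt_to_mm 595     # assumed /MediaBox width  -> 209.9
pt_to_mm 842     # assumed /MediaBox height -> 297.0
rgb_to_pdf 128   # green component of (0, 128, 255) -> 0.502
```

The results, roughly 210 mm by 297 mm, match the A4 paper size, and 0.502 is the kind of value you can expect to see in the content stream for the green component.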
Searching for Data
Sometimes you need to find an object with some specific data in it, for example when an error message provides the name of a dictionary key or (part of) a value, or when you want to find all objects referencing a specific indirect object.
In such cases the search command comes in handy: it searches through all indirect objects and prints those matching the given argument (a regular expression). Here is an example:
- First we search for occurrences of “hexa”, finding the information dictionary where the search string appears under the /Producer key.
- Then we search for all references to the indirect object (2,0) by using the search string \b2 0 R\b (in this case it is important to use the \b anchors for word boundaries, since otherwise a reference such as 12 0 R would also match). This results in two objects being shown, the catalog dictionary and the dictionary of the only page, because the object (2,0) refers to the root of the page tree.
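The effect of the \b anchors can be tried out independently of HexaPDF. The following sketch uses grep on a few made-up lines that merely resemble dictionary entries; it assumes a grep that supports \b (GNU grep does). The search command itself takes a regular expression, so the word boundary idea carries over, although grep's regex dialect is not identical to the one HexaPDF uses.

```shell
# Hypothetical lines resembling entries in PDF dictionaries.
sample='/Pages 2 0 R
/Parent 2 0 R
/Contents 12 0 R'

# Without anchors the pattern also matches inside "12 0 R": prints 3.
printf '%s\n' "$sample" | grep -c -E '2 0 R'

# With \b anchors only genuine references to object (2,0) match: prints 2.
printf '%s\n' "$sample" | grep -c -E '\b2 0 R\b'
```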
Comparing PDF File Structure
Comparing the structures of two PDF files allows you to analyse “behind the scenes” changes made by a program. If you want to compare two PDF files visually, i.e. the appearance of the page content, we recommend using a tool like DiffPDF.
To compare PDF structures, use the recursive command to output the whole structure of the PDFs in question and then compare the output. By using the process substitution feature of the shell you don’t even need to create temporary files.
Here is a simple example showing the difference between the hello-world.pdf we have used throughout this guide and the optimized version:
- First we produce the optimized version using the hexapdf optimize command which compresses the PDF down to about 50% of the original file size.
- Then we use the recursive command in hexapdf inspect’s command line mode for hello-world.pdf and hello-world-opt.pdf and use their output directly with vimdiff to show the differences.
- There are only three differences shown:
  - The second part of the /ID key changed. This is expected as this part should always change when an existing file is modified.
  - The /ModDate field also changed to reflect the date of the change.
  - And although the file is smaller now it contains two more objects: one cross-reference stream and one object stream. These stream objects are never referenced from the main structure since they only provide a different way of storing data in the PDF file. Therefore they also don’t appear in the recursive output.
  If the PDF file was created by another program and not HexaPDF, the /Producer line would also have changed.
- The meaning of this is that nothing essential really changed when the PDF file was optimized, which was expected.
If you try this with bigger files and ones not created with HexaPDF, the output will probably show many more changes because HexaPDF also removes unneeded key-value pairs of dictionaries.
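The comparison workflow described in this section relies on process substitution, which is available in shells such as bash and zsh. The sketch below shows only the mechanics: dump_structure is a hypothetical stand-in that prints made-up output, where a real run would call hexapdf inspect with the recursive command for each file. It also uses plain diff instead of vimdiff so that it works non-interactively.

```shell
# Hypothetical stand-in for the structure dump of a file; the printed lines
# are invented sample data, not real hexapdf inspect output.
dump_structure() {
  case "$1" in
    hello-world.pdf)     printf '%s\n' '/Producer (HexaPDF)' '/ModDate (D:20240101)' ;;
    hello-world-opt.pdf) printf '%s\n' '/Producer (HexaPDF)' '/ModDate (D:20240102)' ;;
  esac
}

# Each <(...) runs a command and presents its output as a file name,
# so diff can compare both dumps without any temporary files.
# diff exits with status 1 when it finds differences, hence the "|| true".
diff <(dump_structure hello-world.pdf) <(dump_structure hello-world-opt.pdf) || true
```

With the sample data only the /ModDate line is reported as changed. Swapping diff for vimdiff gives the side-by-side view used in this guide.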