Analysing PDFs

This how-to guide shows how to use the hexapdf inspect command to analyse PDF files.

How hexapdf inspect Works

The hexapdf command line tool comes with a variety of sub-commands, one of them being the inspect command. This command lets you inspect and analyse the internal structure of a PDF file.

The default mode is the interactive mode, which is used when no command line arguments besides the PDF file are given. It loads the PDF file once and lets you run one or more commands against it. This is useful if the PDF file in question is huge, because the file does not have to be loaded again for every command.

If additional arguments are provided on the command line, they are interpreted as interactive mode commands and executed.

Let’s see those two ways of running hexapdf inspect in action:

Showing the basics of the inspect command
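In text form, the two invocation styles look roughly like the following sketch. The function names are hypothetical wrappers introduced only for illustration, and hello-world.pdf is assumed to exist in the current directory.

```shell
# Interactive mode: only the PDF file is given, commands are then typed
# at the prompt of the inspect command.
inspect_interactive() {
  hexapdf inspect "$1"
}

# One-shot mode: the arguments after the file name are interpreted as
# interactive mode commands and executed (here: the catalog command).
inspect_catalog() {
  hexapdf inspect "$1" catalog
}

# Usage (assumes hexapdf is installed):
#   inspect_interactive hello-world.pdf
#   inspect_catalog hello-world.pdf
```

The one-shot form is convenient in scripts and pipelines, while the interactive form suits exploratory work on a single file.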

Those are all the basics needed. Let’s dive into various usage scenarios and explore the inspect command’s functionality.

Traversing the PDF Tree Structure

The key topic on PDF’s document structure explains the main structures used in PDF files. Basic knowledge of the places where certain information can be found is useful when analysing PDF files.

The trailer, object, catalog and stream commands can help you navigate through this tree structure by showing the objects in question. Let’s see them in action:

Navigating the PDF tree structure
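A possible walk through the tree using these four commands is sketched below. The helper name and its parameters are hypothetical, and passing an object number as the argument of the object and stream commands is an assumption about the command syntax. The object numbers themselves have to be read from the output of the previous steps.

```shell
# Hypothetical helper: show the entry points of the tree, then one chosen
# object and one chosen stream.
# $1 = PDF file
# $2 = number of an indirect object (assumed argument form)
# $3 = number of a stream object (assumed argument form)
show_tree_entry() {
  hexapdf inspect "$1" trailer       # the trailer, where the tree starts
  hexapdf inspect "$1" catalog       # the document catalog
  hexapdf inspect "$1" object "$2"   # a single indirect object
  hexapdf inspect "$1" stream "$3"   # the stream data of a stream object
}

# Usage (object numbers are placeholders, take them from your own file):
#   show_tree_entry hello-world.pdf 2 4
```

In practice you would usually run these one at a time in interactive mode, since each output tells you which object to look at next.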

The sample PDF used here is very simple. In typical PDFs the catalog dictionary contains many more keys for various things like outlines or form fields.

Another way to view the tree structure of a PDF is to use the recursive command. This command shows not only the requested object but also all objects referenced from it. Note that this will often show you the whole document tree as soon as a single page is reachable from the requested object, since page nodes have a reference to their parent. See Comparing PDF File Structure below for a practical use of this command.

Analysing Pages

In the last section you got a glimpse of inspecting pages by navigating from the catalog dictionary. Since inspecting pages is often necessary, the dedicated pages command exists. It displays all page object identifiers in the correct order, together with the object identifiers of their content streams:

Analysing a page
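As a rough sketch, listing the pages and then looking at one of the listed page objects could be scripted as follows. The helper name is hypothetical, and the object number passed as second argument is assumed to be copied from the pages listing.

```shell
# Hypothetical helper: list all pages, optionally followed by one page object.
# $1 = PDF file
# $2 = object number of a page, taken from the listing (optional)
inspect_pages() {
  hexapdf inspect "$1" pages
  if [ -n "${2:-}" ]; then
    hexapdf inspect "$1" object "$2"
  fi
}

# Usage (the object number is a placeholder):
#   inspect_pages hello-world.pdf
#   inspect_pages hello-world.pdf 3
```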

Page objects often contain much more information, for example references to image resources, annotations or form fields.

Searching for Data

Sometimes you need to find an object with some specific data in it, for example when an error message provides the name of a dictionary key or (part of) a value, or when you want to find all objects referencing a specific indirect object.

In such cases the search command comes in handy. It searches through all indirect objects and prints those matching the given argument, which is interpreted as a regular expression. Here is an example:

Searching for data in a PDF
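The two use cases mentioned above could look like the following sketch. The helper name is hypothetical. The second usage line additionally assumes that references appear in the searched text in the usual PDF notation, such as 4 0 R.

```shell
# Hypothetical helper: print all indirect objects matching a regular expression.
# $1 = PDF file
# $2 = regular expression; it is quoted so that a pattern containing spaces
#      reaches hexapdf as a single argument
find_objects() {
  hexapdf inspect "$1" search "$2"
}

# Usage:
#   find_objects hello-world.pdf 'Font'      # objects mentioning a key or value
#   find_objects hello-world.pdf '4 0 R'     # objects referencing object 4 (assumed notation)
```

Quoting the pattern in the calling shell matters as well, because characters such as spaces, parentheses or asterisks would otherwise be interpreted by the shell before hexapdf sees them.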

Comparing PDF File Structure

Comparing the structures of two PDF files allows you to analyse “behind the scenes” changes made by a program. If you want to compare two PDF files visually, i.e. the appearance of the page content, we recommend using a tool like DiffPDF.

To compare PDF structures, use the recursive command to output the whole structure of each PDF in question and then compare the outputs. By using the process substitution feature of the shell, you don’t even need to create temporary files.

Here is a simple example showing the difference between the hello-world.pdf we have used throughout this guide and an optimized version of it:

Comparing two PDF files
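A minimal sketch of such a comparison is shown below. It needs a shell with process substitution, such as bash or zsh. The helper name is hypothetical, and so is the assumption that the recursive command takes the number of the starting object as its argument. The same number is used for both files here, which only makes sense if the starting object has the same number in both.

```shell
# Hypothetical helper: diff the recursively printed structure of two PDFs.
# $1, $2 = PDF files to compare
# $3     = number of the object to start from in both files (assumption)
compare_structure() {
  # <(...) is process substitution: each command's output is handed to diff
  # as if it were a file, so no temporary files are created.
  diff -u <(hexapdf inspect "$1" recursive "$3") \
          <(hexapdf inspect "$2" recursive "$3")
}

# Usage (file names and object number are placeholders):
#   compare_structure hello-world.pdf optimized.pdf 1
```

As usual for diff, the exit status is 0 if the two outputs are identical and 1 if they differ, so the helper can also be used in scripts.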