Optimization Benchmark

One of the ways to use the hexapdf command is to optimize a PDF file in terms of its file size. This involves reading and writing the PDF file and performing the optimization. Sometimes the word “optimization” is used when a PDF file is linearized for faster display on web sites. However, here it always means file size optimization.

There are various ways to optimize the file size of a PDF file, and they can be divided into two groups: lossless and lossy operations. Since all benchmarked applications perform only lossless optimizations, only those are covered here:

Removing unused and deleted objects

A PDF file can store multiple revisions of an object but only the last one is used. So all other versions can safely be deleted.

Using object and cross-reference streams

A PDF file can be thought of as a collection of random-access objects that are stored sequentially in an ASCII-based format. Object streams take those objects and store them compressed in a binary format. And cross-reference streams store the file offsets to the objects in a compressed manner, instead of the standard ASCII-based format.

Recompressing page content streams

The content of a PDF page is described in an ASCII-based format. Some PDF producers don’t optimize their output, which can lead to content streams that are bigger than necessary, or they don’t store the content streams in a compressed format.

There are some more techniques for reducing the file size like font subsetting/merging/deduplication or object and image deduplication. However, those are rather advanced and not implemented in most PDF libraries because it is hard to get them right.
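
To illustrate why object streams reduce the file size, the following sketch compares storing many small objects as plain ASCII text with deflating them together in one stream. It only uses Ruby's Zlib library; the object contents are made up for the illustration and this is not how HexaPDF or any other application implements object streams.

```ruby
require 'zlib'

# Hypothetical, simplified stand-ins for small non-stream PDF objects.
objects = (1..200).map do |i|
  "#{i} 0 obj\n<< /Type /Annot /Subtype /Link /Rect [0 0 #{i} #{i}] /Border [0 0 0] >>\nendobj\n"
end

# Plain storage: every object is written as uncompressed ASCII text.
plain_size = objects.sum(&:bytesize)

# Object-stream-like storage: the objects are concatenated and deflated
# together, so the keys repeated in every object compress well.
packed = Zlib::Deflate.deflate(objects.join, Zlib::BEST_COMPRESSION)
packed_size = packed.bytesize

puts "plain:  #{plain_size} bytes"
puts "packed: #{packed_size} bytes"
```

The same idea explains why files with many non-stream objects tend to profit most from the S operation.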

Benchmark Setup

There are many applications that can perform some or all of the optimizations mentioned above. Since this benchmark is intended to be run on Linux, only command line applications that are readily available on this platform are used.

Since the abilities of the applications vary, the following table lists the keys used to describe the various operations:

| Key | Operation |
|-----|-----------|
| C | Compacting by removing unused and deleted objects |
| S | Usage of object and cross-reference streams |
| P | Recompression of page content streams |

The list of the benchmarked applications:

hexapdf

Homepage: http://hexapdf.gettalong.org
Version: Latest version
Abilities: Any combination of C, S and P

We want to benchmark hexapdf with increasing levels of compression, using the following invocations:

None of C, S, or P:
    hexapdf optimize INPUT --no-compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts OUTPUT
C:
    hexapdf optimize INPUT --compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts OUTPUT
CS (the standard mode of operation):
    hexapdf optimize INPUT OUTPUT
CSP:
    hexapdf optimize INPUT --compress-pages OUTPUT

Note that a lot of time is spent deflating when recompressing pages. This is because HexaPDF uses the highest deflate compression level by default. By changing the configuration option ‘filter.flate.compression’ to something lower than 9, it is possible to trade file size for compression speed.
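
The effect of the deflate level can be sketched with Ruby's Zlib library alone. The snippet below deflates a made-up stand-in for an uncompressed page content stream at levels 1, 6 and 9 and prints the resulting size and the time taken. The content and the chosen levels are assumptions for illustration; the snippet does not call HexaPDF and the numbers it prints say nothing about the benchmark files.

```ruby
require 'zlib'
require 'benchmark'

# Made-up stand-in for an uncompressed page content stream.
content = (1..20_000).map do |i|
  "BT /F1 12 Tf #{i % 500} #{i % 700} Td (Line #{i}) Tj ET\n"
end.join

results = {}
[1, 6, 9].each do |level|
  compressed = nil
  # Measure only the time spent deflating at this level.
  seconds = Benchmark.realtime { compressed = Zlib::Deflate.deflate(content, level) }
  results[level] = { seconds: seconds, bytes: compressed.bytesize }
  puts format('level %d: %8d bytes in %.1f ms', level, compressed.bytesize, seconds * 1000)
end
```

Higher levels usually produce somewhat smaller output at the cost of more CPU time, which is the trade-off that the configuration option exposes.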

combine_pdf

Homepage: https://github.com/boazsegev/combine_pdf
Version: 1.0.29
Abilities: ?

CombinePDF is a tool for merging PDF files, written in Ruby.

The combine_pdf.rb script can be invoked like ruby combine_pdf.rb INPUT OUTPUT.

pdftk

Homepage: https://gitlab.com/marcvinyals/pdftk
Version: 3.3.3
Abilities: C

pdftk is probably one of the best known applications because, like hexapdf, it allows for many different operations on PDFs. It is based on the Java iText library. Prior versions were compiled to native code using GCJ, but GCJ was deprecated and this fork of pdftk now uses Java.

The application doesn’t have options for optimizing a PDF file but it can be assumed that it removes unused and deleted objects when invoked like pdftk INPUT output OUTPUT.

qpdf

Homepage: http://qpdf.sourceforge.net/
Version: 10.4.0
Abilities: C, CS

QPDF is a command line application, written in C++, for transforming PDF files.

The standard C mode of operation is invoked with qpdf INPUT OUTPUT, whereas the CS mode needs the additional option --object-streams=generate.

cpdf

Homepage: http://www.coherentpdf.com/
Version: 2.8
Abilities: CS, CSP

This is a commercial application but it can be used for evaluation purposes. There is no way to configure the operations performed, but judging from its output it seems to perform all of the lossless operations.

Invocation is done like this: cpdf -squeeze -squeeze-no-pagedata INPUT -o OUTPUT (for CS; the -squeeze-no-pagedata option is removed for CSP).

The standard files used in the benchmark (not available in the HexaPDF distribution) vary in file size and internal structure:

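Sizes are given in bytes; a dot is used as the thousands separator.

| Name | Size | Objects | Pages | Details |
|------|------|---------|-------|---------|
| a.pdf | 53.056 | 36 | 4 | Very simple one page file |
| b.pdf | 11.520.218 | 4.161 | 439 | Many non-stream objects |
| c.pdf | 14.399.980 | 5.263 | 620 | Linearized, many streams |
| d.pdf | 8.107.348 | 34.513 | 20 | |
| e.pdf | 21.788.087 | 2.296 | 52 | Huge content streams, many pictures, object streams, encrypted with default password |
| f.pdf | 154.752.614 | 287.977 | 28.365 | Very big file |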
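
A harness for collecting such measurements could look roughly like the following sketch. It is not the script used for this benchmark: the `measure` helper is a hypothetical name, only wall-clock time and output file size are recorded (memory usage is left out), and a small Ruby one-liner that copies the input stands in for a real optimizer so that the example runs without any of the applications installed.

```ruby
require 'benchmark'
require 'tmpdir'
require 'rbconfig'

# Hypothetical helper: runs a command given as an argument array and
# returns the wall-clock time in milliseconds plus the output file size.
def measure(command, output_path)
  ok = nil
  seconds = Benchmark.realtime do
    ok = system(*command, out: File::NULL, err: File::NULL)
  end
  {
    ok: ok == true,
    time_ms: (seconds * 1000).round,
    # A failed run is reported with a size of 0.
    size: ok && File.exist?(output_path) ? File.size(output_path) : 0
  }
end

Dir.mktmpdir do |dir|
  input = File.join(dir, 'in.pdf')
  output = File.join(dir, 'out.pdf')
  File.binwrite(input, "%PDF-1.7\n")
  # Stand-in command; a real run would use one of the invocations listed
  # above, for example ['hexapdf', 'optimize', input, output].
  copy = [RbConfig.ruby, '-e', 'IO.copy_stream(ARGV[0], ARGV[1])', input, output]
  p measure(copy, output)
end
```

Passing the command as an array avoids shell quoting problems with file names, and reporting failed runs with zeroed values mirrors how errors appear in the results.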

Results

These benchmark results are from 2025-01-04.

[Figure: benchmark graphic]

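Modes use the keys C, S and P defined above. File sizes are given in bytes and a dot is used as the thousands separator. Rows marked ERR denote runs that failed.

| Status | Application | Mode | File | Time | Memory | File size |
|--------|-------------|------|------|------|--------|-----------|
| | hexapdf | none | a.pdf | 286ms | 24.540KiB | 52.339 |
| | hexapdf | C | a.pdf | 277ms | 24.780KiB | 52.315 |
| | hexapdf | CS | a.pdf | 292ms | 25.676KiB | 49.258 |
| | hexapdf | CSP | a.pdf | 303ms | 25.680KiB | 48.330 |
| | combinepdf | ? | a.pdf | 153ms | 21.004KiB | 53.229 |
| | pdftk | C? | a.pdf | 169ms | 63.580KiB | 53.144 |
| | qpdf | C | a.pdf | 23ms | 7.772KiB | 53.179 |
| | qpdf | CS | a.pdf | 14ms | 8.244KiB | 49.287 |
| | cpdf | CS | a.pdf | 24ms | 10.132KiB | 49.116 |
| | cpdf | CSP | a.pdf | 17ms | 10.504KiB | 48.236 |
| | hexapdf | none | b.pdf | 603ms | 40.160KiB | 11.231.437 |
| | hexapdf | C | b.pdf | 525ms | 42.480KiB | 11.358.718 |
| | hexapdf | CS | b.pdf | 595ms | 47.044KiB | 11.053.653 |
| | hexapdf | CSP | b.pdf | 2.243ms | 61.292KiB | 11.035.513 |
| | combinepdf | ? | b.pdf | 755ms | 72.456KiB | 11.526.138 |
| | pdftk | C? | b.pdf | 527ms | 112.624KiB | 11.564.056 |
| | qpdf | C | b.pdf | 197ms | 23.052KiB | 11.273.690 |
| | qpdf | CS | b.pdf | 219ms | 23.200KiB | 11.124.676 |
| | cpdf | CS | b.pdf | 141ms | 35.436KiB | 11.104.344 |
| | cpdf | CSP | b.pdf | 1.609ms | 58.084KiB | 11.096.229 |
| | hexapdf | none | c.pdf | 748ms | 46.084KiB | 14.384.737 |
| | hexapdf | C | c.pdf | 733ms | 47.768KiB | 14.347.310 |
| | hexapdf | CS | c.pdf | 822ms | 50.540KiB | 13.182.721 |
| | hexapdf | CSP | c.pdf | 2.488ms | 71.812KiB | 13.104.056 |
| | combinepdf | ? | c.pdf | 1.009ms | 73.044KiB | 14.329.423 |
| | pdftk | C? | c.pdf | 2.195ms | 157.348KiB | 14.439.011 |
| | qpdf | C | c.pdf | 591ms | 95.444KiB | 14.432.647 |
| | qpdf | CS | c.pdf | 785ms | 95.672KiB | 13.221.450 |
| | cpdf | CS | c.pdf | 386ms | 65.360KiB | 13.168.968 |
| | cpdf | CSP | c.pdf | 1.614ms | 77.888KiB | 13.081.657 |
| | hexapdf | none | d.pdf | 1.497ms | 85.724KiB | 7.774.816 |
| | hexapdf | C | d.pdf | 1.441ms | 75.232KiB | 7.036.577 |
| | hexapdf | CS | d.pdf | 1.609ms | 71.924KiB | 6.530.436 |
| | hexapdf | CSP | d.pdf | 1.560ms | 83.024KiB | 5.503.967 |
| | combinepdf | ? | d.pdf | 1.783ms | 69.020KiB | 7.243.073 |
| | pdftk | C? | d.pdf | 3.176ms | 251.736KiB | 7.279.035 |
| | qpdf | C | d.pdf | 998ms | 69.304KiB | 7.209.305 |
| | qpdf | CS | d.pdf | 1.139ms | 69.484KiB | 6.702.580 |
| | cpdf | CS | d.pdf | 875ms | 74.800KiB | 6.566.625 |
| | cpdf | CSP | d.pdf | 1.256ms | 74.628KiB | 5.529.140 |
| | hexapdf | none | e.pdf | 571ms | 50.300KiB | 21.784.690 |
| | hexapdf | C | e.pdf | 628ms | 94.516KiB | 21.850.643 |
| | hexapdf | CS | e.pdf | 646ms | 110.004KiB | 21.769.015 |
| | hexapdf | CSP | e.pdf | 11.289ms | 186.176KiB | 21.204.227 |
| ERR | combinepdf | ? | e.pdf | 0ms | 0KiB | 0 |
| | pdftk | C? | e.pdf | 667ms | 200.516KiB | 21.874.883 |
| | qpdf | C | e.pdf | 244ms | 31.372KiB | 21.802.439 |
| | qpdf | CS | e.pdf | 243ms | 31.812KiB | 21.787.322 |
| ERR | cpdf | CS | e.pdf | 0ms | 0KiB | 0 |
| ERR | cpdf | CSP | e.pdf | 0ms | 0KiB | 0 |
| | hexapdf | none | f.pdf | 15.883ms | 576.692KiB | 154.077.468 |
| | hexapdf | C | f.pdf | 17.445ms | 490.304KiB | 153.949.744 |
| | hexapdf | CS | f.pdf | 19.923ms | 545.820KiB | 117.647.969 |
| ERR | hexapdf | CSP | f.pdf | 0ms | 0KiB | 0 |
| ERR | combinepdf | ? | f.pdf | 0ms | 0KiB | 0 |
| | pdftk | C? | f.pdf | 34.796ms | 719.972KiB | 157.850.353 |
| | qpdf | C | f.pdf | 13.954ms | 959.732KiB | 157.723.936 |
| | qpdf | CS | f.pdf | 18.601ms | 975.292KiB | 118.023.718 |
| | cpdf | CS | f.pdf | 18.790ms | 928.720KiB | 114.098.009 |
| ERR | cpdf | CSP | f.pdf | 0ms | 0KiB | 0 |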
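
To compare the applications it can help to express the output sizes relative to the original file size. The following sketch does this for d.pdf, using the original size and three of the result sizes from the tables above. The helper names `parse_size` and `reduction_percent` are made up for this example.

```ruby
# Parses a number that uses a dot as thousands separator, e.g. "8.107.348".
def parse_size(text)
  text.delete('.').to_i
end

# Returns by how many percent the optimized file is smaller than the original.
def reduction_percent(original, optimized)
  ((1 - parse_size(optimized).to_f / parse_size(original)) * 100).round(1)
end

# Values for d.pdf taken from the tables above.
original = '8.107.348'
results = {
  'hexapdf CSP' => '5.503.967',
  'qpdf CS' => '6.702.580',
  'cpdf CSP' => '5.529.140'
}

results.each do |name, size|
  puts "#{name}: #{reduction_percent(original, size)}% smaller"
end
```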