Optimization Benchmark

One of the ways to use the hexapdf command is to optimize a PDF file in terms of its file size. This involves reading and writing the PDF file and performing the optimization. Sometimes the word “optimization” is used when a PDF file is linearized for faster display on web sites. However, here it always means file size optimization.

There are various ways to optimize the file size of a PDF file and they can be divided into two groups: lossless and lossy operations. Since all used applications perform only lossless optimizations, we only look at those:

Removing unused and deleted objects

A PDF file can store multiple revisions of an object but only the last one is used. So all other versions can safely be deleted.

Using object and cross-reference streams

A PDF file can be thought of as a collection of random-access objects that are stored sequentially in an ASCII-based format. Object streams take those objects and store them compressed in a binary format. And cross-reference streams store the file offsets to the objects in a compressed manner, instead of the standard ASCII-based format.

Recompressing page content streams

The content of a PDF page is described in an ASCII-based format. Some PDF producers don’t optimize their output which can lead to bigger than necessary content streams or don’t store it in a compressed format.

There are some more techniques for reducing the file size like font subsetting/merging/deduplication or object and image deduplication. However, those are rather advanced and not implemented in most PDF libraries because it is hard to get them right.

Benchmark Setup

There are many applications that can perform some or all of the optimizations mentioned above. Since this benchmark is intended to be run on Linux we will use command line applications that are readily available on this platform.

Since the abilities of the applications vary, following is a table of keys used to describe the various operations:

Key Operation
C Compacting by removing unused and deleted objects
S Usage of object and cross-reference streams
P Recompression of page content streams

The list of the benchmarked applications:

hexapdf

Homepage: http://hexapdf.gettalong.org
Version: Latest version
Abilities: Any combination of C, S and P

We want to benchmark hexapdf with increasing levels of compression, using the following invocations:

None of C, S, or P
hexapdf optimize INPUT --no-compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts OUTPUT
C
hexapdf optimize INPUT --compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts OUTPUT
CS (so this would be the standard mode of operation)
hexapdf optimize INPUT OUTPUT
CSP
hexapdf optimize INPUT --compress-pages OUTPUT

Note that a lot of time is spent deflating when recompressing pages. This is because HexaPDF uses the highest deflate compression level by default. By changing the configuration option ‘filter.flate.compression’ to something lower than 9, it is possible to trade compression speed with file size.

origami

Homepage: https://github.com/gdelugre/origami
Version: 2.1.0
Abilities: ?

Similar to HexaPDF Origami is a framework for manipulating PDF files. Since it is also written in Ruby, it makes for a good comparison.

The origami.rb script can be invoked like ruby origami.rb INPUT OUTPUT.

combine_pdf

Homepage: https://github.com/boazsegev/combine_pdf
Version: 1.0.23
Abilities: ?

CombinePDF is a tool for merging PDF files, written in Ruby.

The combine_pdf.rb script can be invoked like ruby combine_pdf.rb INPUT OUTPUT.

pdftk

Homepage: https://gitlab.com/marcvinyals/pdftk
Version: 3.3.2
Abilities: C

pdftk is probably one of the best known applications because, like hexapdf it allows for many different operations on PDFs. It is based on the Java iText library. Prior version have been compiled to native code using GCJ but GCJ was deprecated and this fork of pdftk now uses Java.

The application doesn’t have options for optimizing a PDF file but it can be assumed that it removes unused and deleted objects when invoked like pdftk INPUT output OUTPUT.

qpdf

Homepage: http://qpdf.sourceforge.net/
Version: 10.4.0
Abilities: C, CS

QPDF is a command line application for transforming PDF files written in C++.

The standard C mode of operation is invoked with qpdf INPUT OUTPUT whereas the CS mode would need an additional option --object-streams=generate.

smpdf
Homepage: http://www.coherentpdf.com/compression.html
Version: 1.4.1
Abilities: CSP

This is a commercial application but can be used for evaluation purposes. There is no way to configure the operations done but judging from its output it seems it does all of the lossless operations.

Invocation is done like this: smpdf INPUT -o OUTPUT.

The standard files used in the benchmark (not available in the HexaPDF distribution) vary in file size and internal structure:

Name Size Objects Pages Details
a.pdf 53.056 36 4 Very simple one page file
b.pdf 11.520.218 4.161 439 Many non-stream objects
c.pdf 14.399.980 5.263 620 Linearized, many streams
d.pdf 8.107.348 34.513 20  
e.pdf 21.788.087 2.296 52 Huge content streams, many pictures, object streams, encrypted with default password
f.pdf 154.752.614 287.977 28.365 Very big file

Results

These benchmark results are from 2023-08-03.

benchmark graphic

    Time Memory File size
hexapdf a.pdf 199ms 34.944KiB 52.299
hexapdf C a.pdf 200ms 35.200KiB 52.277
hexapdf CS a.pdf 207ms 35.328KiB 49.224
hexapdf CSP a.pdf 214ms 36.096KiB 48.297
origami a.pdf 250ms 44.684KiB 52.111
combinepdf a.pdf 124ms 30.848KiB 53.263
pdftk C? a.pdf 152ms 56.304KiB 53.144
qpdf C a.pdf 12ms 8.064KiB 53.179
qpdf CS a.pdf 12ms 8.320KiB 49.287
smpdf CSP a.pdf 14ms 7.936KiB 48.329
hexapdf b.pdf 497ms 51.616KiB 11.222.968
hexapdf C b.pdf 515ms 54.392KiB 11.350.258
hexapdf CS b.pdf 564ms 58.112KiB 11.045.214
hexapdf CSP b.pdf 2.366ms 68.428KiB 11.027.080
ERR origami b.pdf 0ms 0KiB 0
combinepdf b.pdf 959ms 152.336KiB 11.526.172
pdftk C? b.pdf 497ms 105.976KiB 11.564.056
qpdf C b.pdf 215ms 22.912KiB 11.273.690
qpdf CS b.pdf 229ms 23.040KiB 11.126.861
smpdf CSP b.pdf 1.785ms 49.408KiB 11.092.465
hexapdf c.pdf 780ms 55.168KiB 14.382.696
hexapdf C c.pdf 744ms 57.216KiB 14.345.269
hexapdf CS c.pdf 826ms 60.032KiB 13.180.713
hexapdf CSP c.pdf 2.549ms 76.416KiB 13.102.033
origami c.pdf 3.490ms 136.808KiB 14.338.126
combinepdf c.pdf 1.008ms 154.400KiB 14.329.457
pdftk C? c.pdf 1.544ms 158.212KiB 14.439.011
qpdf C c.pdf 707ms 95.744KiB 14.432.647
qpdf CS c.pdf 817ms 95.488KiB 13.228.102
smpdf CSP c.pdf 1.668ms 74.240KiB 13.076.598
hexapdf d.pdf 1.658ms 80.648KiB 7.662.939
hexapdf C d.pdf 1.596ms 84.776KiB 6.924.700
hexapdf CS d.pdf 1.822ms 84.520KiB 6.418.482
hexapdf CSP d.pdf 1.737ms 99.468KiB 5.391.919
origami d.pdf 4.053ms 144.000KiB 7.498.876
combinepdf d.pdf 2.156ms 144.552KiB 7.243.107
pdftk C? d.pdf 2.140ms 211.976KiB 7.279.035
qpdf C d.pdf 1.045ms 69.248KiB 7.209.305
qpdf CS d.pdf 1.206ms 69.248KiB 6.703.374
smpdf CSP d.pdf 1.734ms 70.604KiB 5.528.352
hexapdf e.pdf 512ms 59.720KiB 21.766.847
hexapdf C e.pdf 584ms 113.592KiB 21.832.869
hexapdf CS e.pdf 620ms 112.812KiB 21.751.196
hexapdf CSP e.pdf 11.397ms 156.636KiB 21.186.414
ERR origami e.pdf 0ms 0KiB 0
ERR combinepdf e.pdf 0ms 0KiB 0
pdftk C? e.pdf 687ms 198.568KiB 21.874.883
qpdf C e.pdf 268ms 31.420KiB 21.802.439
qpdf CS e.pdf 267ms 31.660KiB 21.787.558
smpdf CSP e.pdf 20.680ms 608.428KiB 21.188.516
hexapdf f.pdf 20.748ms 435.072KiB 153.972.520
hexapdf C f.pdf 23.003ms 474.052KiB 153.844.796
hexapdf CS f.pdf 26.267ms 552.272KiB 117.545.255
ERR hexapdf CSP f.pdf 0ms 0KiB 0
origami f.pdf 63.483ms 1.567.124KiB 152.614.156
ERR combinepdf f.pdf 0ms 0KiB 0
pdftk C? f.pdf 22.076ms 792.892KiB 157.850.353
qpdf C f.pdf 14.567ms 959.820KiB 157.723.936
qpdf CS f.pdf 19.236ms 975.392KiB 118.114.521
ERR smpdf CSP f.pdf 0ms 0KiB 0