Optimization Benchmark

One of the ways to use the hexapdf command is to optimize a PDF file in terms of its file size. This involves reading and writing the PDF file and performing the optimization. Sometimes the word “optimization” is used when a PDF file is linearized for faster display on web sites. However, here it always means file size optimization.

There are various ways to optimize the file size of a PDF file and they can be divided into two groups: lossless and lossy operations. Since all used applications perform only lossless optimizations, we only look at those:

Removing unused and deleted objects

A PDF file can store multiple revisions of an object but only the last one is used. So all other versions can safely be deleted.

Using object and cross-reference streams

A PDF file can be thought of as a collection of random-access objects that are stored sequentially in an ASCII-based format. Object streams take those objects and store them compressed in a binary format. And cross-reference streams store the file offsets to the objects in a compressed manner, instead of the standard ASCII-based format.

Recompressing page content streams

The content of a PDF page is described in an ASCII-based format. Some PDF producers don’t optimize their output which can lead to bigger than necessary content streams or don’t store it in a compressed format.

There are some more techniques for reducing the file size like font subsetting/merging/deduplication or object and image deduplication. However, those are rather advanced and not implemented in most PDF libraries because it is hard to get them right.

Benchmark Setup

There are many applications that can perform some or all of the optimizations mentioned above. Since this benchmark is intended to be run on Linux we will use command line applications that are readily available on this platform.

Since the abilities of the applications vary, following is a table of keys used to describe the various operations:

Key Operation
C Compacting by removing unused and deleted objects
S Usage of object and cross-reference streams
P Recompression of page content streams

The list of the benchmarked applications:

hexapdf

Homepage: http://hexapdf.gettalong.org
Version: Latest version
Abilities: Any combination of C, S and P

We want to benchmark hexapdf with increasing levels of compression, using the following invocations:

None of C, S, or P
hexapdf optimize INPUT --no-compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts OUTPUT
C
hexapdf optimize INPUT --compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts OUTPUT
CS (so this would be the standard mode of operation)
hexapdf optimize INPUT OUTPUT
CSP
hexapdf optimize INPUT --compress-pages OUTPUT
origami

Homepage: https://github.com/gdelugre/origami
Version: 2.1.0
Abilities: ?

Similar to HexaPDF Origami is a framework for manipulating PDF files. Since it is also written in Ruby, it makes for a good comparison.

The origami.rb script can be invoked like ruby origami.rb INPUT OUTPUT.

combine_pdf

Homepage: https://github.com/boazsegev/combine_pdf
Version: 1.0.15
Abilities: ?

CombinePDF is a tool for merging PDF files, written in Ruby.

The combine_pdf.rb script can be invoked like ruby combine_pdf.rb INPUT OUTPUT.

pdftk

Homepage: https://gitlab.com/marcvinyals/pdftk
Version: 3.0
Abilities: C

pdftk is probably one of the best known applications because, like hexapdf it allows for many different operations on PDFs. It is based on the Java iText library. Prior version have been compiled to native code using GCJ but GCJ was deprecated and this fork of pdftk now uses Java.

The application doesn’t have options for optimizing a PDF file but it can be assumed that it removes unused and deleted objects when invoked like pdftk INPUT output OUTPUT.

qpdf

Homepage: http://qpdf.sourceforge.net/
Version: 8.2.1
Abilities: C, CS

QPDF is a command line application for transforming PDF files written in C++.

The standard C mode of operation is invoked with qpdf INPUT OUTPUT whereas the CS mode would need an additional option --object-streams=generate.

smpdf
Homepage: http://www.coherentpdf.com/compression.html
Version: 1.4.1
Abilities: CSP

This is a commercial application but can be used for evaluation purposes. There is no way to configure the operations done but judging from its output it seems it does all of the lossless operations.

Invocation is done like this: smpdf INPUT -o OUTPUT.

The standard files used in the benchmark (not available in the HexaPDF distribution) vary in file size and internal structure:

Name Size Objects Pages Details
a.pdf 53.056 36 4 Very simple one page file
b.pdf 11.520.218 4.161 439 Many non-stream objects
c.pdf 14.399.980 5.263 620 Linearized, many streams
d.pdf 8.107.348 34.513 20  
e.pdf 21.788.087 2.296 52 Huge content streams, many pictures, object streams, encrypted with default password
f.pdf 154.752.614 287.977 28.365 Very big file

Results

These benchmark results are from 2020-12-27.

benchmark graphic

    Time Memory File size
hexapdf a.pdf 209ms 27.356KiB 52.314
hexapdf C a.pdf 172ms 27.564KiB 52.291
hexapdf CS a.pdf 172ms 27.696KiB 49.152
hexapdf CSP a.pdf 187ms 28.012KiB 48.223
origami a.pdf 253ms 36.824KiB 52.111
combinepdf a.pdf 136ms 27.340KiB 53.263
pdftk C? a.pdf 168ms 38.824KiB 53.144
qpdf C a.pdf 13ms 7.992KiB 53.179
qpdf CS a.pdf 15ms 8.168KiB 49.287
smpdf CSP a.pdf 21ms 8.404KiB 48.329
hexapdf b.pdf 672ms 45.228KiB 11.456.456
hexapdf C b.pdf 712ms 45.692KiB 11.406.392
hexapdf CS b.pdf 816ms 48.276KiB 11.045.146
hexapdf CSP b.pdf 4.893ms 58.784KiB 11.027.010
origami b.pdf 1.585ms 101.664KiB 11.482.405
combinepdf b.pdf 3.393ms 116.112KiB 11.526.172
pdftk C? b.pdf 905ms 97.452KiB 11.501.669
qpdf C b.pdf 486ms 26.956KiB 11.500.308
qpdf CS b.pdf 585ms 26.684KiB 11.124.779
smpdf CSP b.pdf 2.579ms 50.012KiB 11.092.428
hexapdf c.pdf 1.278ms 47.656KiB 14.382.785
hexapdf C c.pdf 1.299ms 49.568KiB 14.347.142
hexapdf CS c.pdf 1.512ms 51.900KiB 13.180.384
hexapdf CSP c.pdf 5.168ms 66.008KiB 13.101.710
origami c.pdf 5.603ms 154.124KiB 14.338.126
combinepdf c.pdf 1.721ms 151.140KiB 14.329.457
pdftk C? c.pdf 2.706ms 139.712KiB 14.439.011
qpdf C c.pdf 1.913ms 132.088KiB 14.432.647
qpdf CS c.pdf 2.623ms 132.116KiB 13.228.102
smpdf CSP c.pdf 2.568ms 74.704KiB 13.076.598
hexapdf d.pdf 2.890ms 76.556KiB 7.662.939
hexapdf C d.pdf 2.880ms 73.948KiB 6.924.700
hexapdf CS d.pdf 3.216ms 74.508KiB 6.418.439
hexapdf CSP d.pdf 3.434ms 87.980KiB 5.476.431
origami d.pdf 6.880ms 165.192KiB 7.498.876
combinepdf d.pdf 3.924ms 152.500KiB 7.243.107
pdftk C? d.pdf 3.814ms 150.692KiB 7.279.035
qpdf C d.pdf 2.402ms 87.344KiB 7.209.305
qpdf CS d.pdf 2.542ms 86.728KiB 6.703.374
smpdf CSP d.pdf 2.318ms 71.228KiB 5.528.352
hexapdf e.pdf 626ms 55.548KiB 21.766.894
hexapdf C e.pdf 715ms 95.268KiB 21.832.831
hexapdf CS e.pdf 761ms 95.848KiB 21.751.207
hexapdf CSP e.pdf 19.666ms 154.684KiB 21.186.460
ERR origami e.pdf 0ms 0KiB 0
ERR combinepdf e.pdf 0ms 0KiB 0
pdftk C? e.pdf 808ms 136.056KiB 21.874.883
qpdf C e.pdf 731ms 34.168KiB 21.802.439
qpdf CS e.pdf 711ms 34.084KiB 21.787.558
smpdf CSP e.pdf 32.107ms 608.964KiB 21.188.516
hexapdf f.pdf 35.169ms 472.228KiB 153.972.520
hexapdf C f.pdf 40.082ms 531.304KiB 153.844.796
hexapdf CS f.pdf 45.294ms 585.956KiB 117.545.255
ERR hexapdf CSP f.pdf 0ms 0KiB 0
origami f.pdf 92.667ms 1.698.016KiB 152.614.156
ERR combinepdf f.pdf 0ms 0KiB 0
pdftk C? f.pdf 44.340ms 612.372KiB 157.850.353
qpdf C f.pdf 31.821ms 1.252.352KiB 157.723.936
qpdf CS f.pdf 39.942ms 1.269.532KiB 118.114.521
ERR smpdf CSP f.pdf 0ms 0KiB 0