Optimization Benchmark

One of the ways to use the hexapdf command is to optimize a PDF file in terms of its file size. This involves reading and writing the PDF file and performing the optimization. Sometimes the word “optimization” is used when a PDF file is linearized for faster display on web sites. However, here it always means file size optimization.

There are various ways to optimize the file size of a PDF file and they can be divided into two groups: lossless and lossy operations. Since all of the benchmarked applications perform only lossless optimizations, we only look at those:

Removing unused and deleted objects

A PDF file can store multiple revisions of an object but only the last one is used. So all other versions can safely be deleted.
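As an illustration, compacting can be sketched as "take the newest revision of every object, then keep only what is reachable from the document root". The model below is a deliberate simplification and an assumption of this example: objects are plain integers mapped to the object numbers they reference, and `None` marks a deleted object. It is not how hexapdf or any other benchmarked tool represents a PDF internally.

```python
def latest_objects(revisions):
    """Merge revisions so that later entries override earlier ones."""
    merged = {}
    for revision in revisions:
        merged.update(revision)
    # Drop objects that the newest revision marks as deleted (None).
    return {num: refs for num, refs in merged.items() if refs is not None}


def compact(revisions, root):
    """Keep only the newest objects that are reachable from root."""
    objects = latest_objects(revisions)
    seen, stack = set(), [root]
    while stack:
        num = stack.pop()
        if num in seen or num not in objects:
            continue
        seen.add(num)
        stack.extend(objects[num])
    return {num: objects[num] for num in sorted(seen)}


# Hypothetical file with two revisions:
# 1 = catalog, 2 = page tree, 3 = page, 4 = old content stream, 5 = old font.
revisions = [
    {1: [2], 2: [3], 3: [2, 4], 4: [], 5: []},
    {3: [2, 6], 6: [], 5: None},  # page now uses stream 6, object 5 deleted
]
result = compact(revisions, root=1)
print(sorted(result))
```

Object 4 is dropped because nothing references it any more, and object 5 is dropped because it was deleted in the second revision.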

Using object and cross-reference streams

A PDF file can be thought of as a collection of random-access objects that are stored sequentially in an ASCII-based format. Object streams take those objects and store them compressed in a binary format. And cross-reference streams store the file offsets to the objects in a compressed manner, instead of the standard ASCII-based format.
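The effect of both stream types can be approximated with a few lines of Python. This is only a sketch under stated assumptions: the page objects are made up, a real object stream stores an index instead of the `obj`/`endobj` keywords, and the field widths `1 3 1` for the binary cross-reference entry are one possible choice, not a fixed rule.

```python
import zlib

# Hypothetical page objects written in the plain-text PDF object syntax.
objects = [
    f"{num} 0 obj\n<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] >>\nendobj\n".encode("ascii")
    for num in range(3, 103)
]
plain = b"".join(objects)

# An object stream stores many such objects together in one compressed stream.
packed = zlib.compress(plain, 9)
print(len(plain), len(packed))

# A classic cross-reference entry is a fixed 20-byte ASCII line ...
offset = 17
ascii_entry = b"%010d %05d n \n" % (offset, 0)
# ... whereas a cross-reference stream stores binary fields, here assumed to be
# 1 byte for the type, 3 bytes for the offset and 1 byte for the generation.
binary_entry = bytes([1]) + offset.to_bytes(3, "big") + bytes([0])
print(len(ascii_entry), len(binary_entry))
```

The binary entries are additionally compressed as part of the cross-reference stream, so the real saving per entry is usually larger than this comparison suggests.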

Recompressing page content streams

The content of a PDF page is described in an ASCII-based format. Some PDF producers don’t optimize their output, which can lead to content streams that are bigger than necessary, and some don’t store the content streams in a compressed format at all.
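A minimal sketch of the recompression step, assuming the stream uses the Flate (zlib) filter and using a made-up content stream; the helper name `recompress` is hypothetical. Real tools may also rewrite the operators themselves, which is not shown here.

```python
import zlib

# Hypothetical content stream that paints the same image at 200 positions.
content = b"".join(
    b"q 100 0 0 100 %d %d cm /Im0 Do Q\n" % (step * 10, step * 5)
    for step in range(200)
)


def recompress(stream_data, was_compressed):
    """Return stream_data compressed with the highest zlib level."""
    raw = zlib.decompress(stream_data) if was_compressed else stream_data
    return zlib.compress(raw, 9)


# Case 1: the producer stored the stream without any compression.
from_plain = recompress(content, was_compressed=False)
# Case 2: the producer used a fast but weak compression level.
from_fast = recompress(zlib.compress(content, 1), was_compressed=True)

print(len(content), len(from_plain), len(from_fast))
```

Both cases end with the same compressed bytes because the uncompressed content is identical; only the starting point differs.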

There are some more techniques for reducing the file size like font subsetting/merging/deduplication or object and image deduplication. However, those are rather advanced and not implemented in most PDF libraries because it is hard to get them right.

Benchmark Setup

There are many applications that can perform some or all of the optimizations mentioned above. Since this benchmark is intended to be run on Linux, we will use command line applications that are readily available on this platform.

Since the abilities of the applications vary, the following table lists the keys used to describe the various operations:

Key Operation
C Compacting by removing unused and deleted objects
S Usage of object and cross-reference streams
P Recompression of page content streams

The list of the benchmarked applications:

hexapdf

Homepage: http://hexapdf.gettalong.org
Version: Latest version
Abilities: Any combination of C, S and P

We want to benchmark hexapdf with increasing levels of compression, using the following invocations:

None of C, S, or P
hexapdf optimize INPUT --no-compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts OUTPUT
C
hexapdf optimize INPUT --compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts OUTPUT
CS (the standard mode of operation)
hexapdf optimize INPUT OUTPUT
CSP
hexapdf optimize INPUT --compress-pages OUTPUT
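For scripting the benchmark, the four invocations above can be generated from the mode key. The helper below is a hypothetical convenience wrapper, not part of hexapdf; it only reassembles the command lines listed above as argument lists suitable for `subprocess.run`.

```python
# Options shared by the two modes that must leave existing streams untouched.
PRESERVE = [
    "--object-streams=preserve",
    "--xref-streams=preserve",
    "--streams=preserve",
    "--no-optimize-fonts",
]

# Extra options per mode key, taken from the invocations listed above.
MODE_OPTIONS = {
    "none": ["--no-compact"] + PRESERVE,
    "C": ["--compact"] + PRESERVE,
    "CS": [],
    "CSP": ["--compress-pages"],
}


def hexapdf_command(mode, input_path, output_path):
    """Build the hexapdf argument list for one of the benchmark modes."""
    return ["hexapdf", "optimize", input_path] + MODE_OPTIONS[mode] + [output_path]


for mode in ("none", "C", "CS", "CSP"):
    print(" ".join(hexapdf_command(mode, "INPUT", "OUTPUT")))
```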
origami

Homepage: https://github.com/gdelugre/origami
Version: 2.1.0
Abilities: ?

Similar to HexaPDF, Origami is a framework for manipulating PDF files. Since it is also written in Ruby, it makes for a good comparison.

The origami.rb script can be invoked like ruby origami.rb INPUT OUTPUT.

combine_pdf

Homepage: https://github.com/boazsegev/combine_pdf
Version: 1.0.15
Abilities: ?

CombinePDF is a tool for merging PDF files, written in Ruby.

The combine_pdf.rb script can be invoked like ruby combine_pdf.rb INPUT OUTPUT.

pdftk

Homepage: https://gitlab.com/marcvinyals/pdftk
Version: 3.0
Abilities: C

pdftk is probably one of the best-known applications because, like hexapdf, it allows for many different operations on PDFs. It is based on the Java iText library. Prior versions were compiled to native code using GCJ, but GCJ was deprecated and this fork of pdftk now runs on Java.

The application doesn’t have options for optimizing a PDF file but it can be assumed that it removes unused and deleted objects when invoked like pdftk INPUT output OUTPUT.

qpdf

Homepage: http://qpdf.sourceforge.net/
Version: 8.2.1
Abilities: C, CS

QPDF is a command line application, written in C++, for transforming PDF files.

The standard C mode of operation is invoked with qpdf INPUT OUTPUT whereas the CS mode would need an additional option --object-streams=generate.

smpdf

Homepage: http://www.coherentpdf.com/compression.html
Version: 1.4.1
Abilities: CSP

This is a commercial application but can be used for evaluation purposes. There is no way to configure the operations it performs but, judging from its output, it seems to do all of the lossless operations.

Invocation is done like this: smpdf INPUT -o OUTPUT.

The standard files used in the benchmark (not available in the HexaPDF distribution) vary in file size and internal structure. Sizes are given in bytes, and a period is used as the thousands separator:

Name Size Objects Pages Details
a.pdf 53.056 36 4 Very simple one page file
b.pdf 11.520.218 4.161 439 Many non-stream objects
c.pdf 14.399.980 5.263 620 Linearized, many streams
d.pdf 8.107.348 34.513 20  
e.pdf 21.788.087 2.296 52 Huge content streams, many pictures, object streams, encrypted with default password
f.pdf 154.752.614 287.977 28.365 Very big file
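Each run in the results reports the elapsed time, the memory usage and the size of the output file. The text does not say how these values were collected, so the following is only one possible way to do it, assuming Linux (where `ru_maxrss` is reported in KiB) and a hypothetical helper named `measure`. The command used in the demonstration is a stand-in; in a real run it would be one of the invocations listed above.

```python
import os
import resource
import subprocess
import sys
import time


def measure(command, output_path=None):
    """Run command and return (elapsed ms, peak memory in KiB, output size)."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Peak resident set size over all child processes waited for so far.
    peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    size = os.path.getsize(output_path) if output_path else None
    return elapsed_ms, peak_kib, size


# Stand-in command so that the sketch runs without any PDF tool installed.
elapsed_ms, peak_kib, size = measure([sys.executable, "-c", "pass"])
print(f"{elapsed_ms:.0f}ms {peak_kib}KiB {size}")
```

Because `RUSAGE_CHILDREN` reports the maximum over all children of the calling process, each measurement would need to run in a fresh process to get a per-command value.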

Results

These benchmark results are from 2018-12-31.

benchmark graphic

Application File Time Memory File size
hexapdf a.pdf 159ms 15.268KiB 52.313
hexapdf C a.pdf 123ms 15.244KiB 52.289
hexapdf CS a.pdf 125ms 15.824KiB 49.152
hexapdf CSP a.pdf 137ms 15.884KiB 48.223
origami a.pdf 184ms 23.664KiB 52.312
combinepdf a.pdf 106ms 14.720KiB 53.263
pdftk C? a.pdf 161ms 38.528KiB 53.144
qpdf C a.pdf 8ms 4.968KiB 53.179
qpdf CS a.pdf 11ms 4.920KiB 49.287
smpdf CSP a.pdf 17ms 8.612KiB 48.329
hexapdf b.pdf 555ms 30.884KiB 11.456.455
hexapdf C b.pdf 548ms 26.924KiB 11.406.391
hexapdf CS b.pdf 621ms 37.384KiB 11.045.145
hexapdf CSP b.pdf 4.061ms 42.328KiB 11.027.010
origami b.pdf 1.332ms 86.420KiB 11.482.769
combinepdf b.pdf 3.606ms 116.632KiB 11.526.172
pdftk C? b.pdf 704ms 96.532KiB 11.501.669
qpdf C b.pdf 354ms 21.112KiB 11.500.308
qpdf CS b.pdf 439ms 21.612KiB 11.124.779
smpdf CSP b.pdf 2.341ms 49.944KiB 11.092.428
hexapdf c.pdf 983ms 34.056KiB 14.382.785
hexapdf C c.pdf 1.052ms 36.880KiB 14.347.139
hexapdf CS c.pdf 1.195ms 39.616KiB 13.180.381
hexapdf CSP c.pdf 4.572ms 55.588KiB 13.101.708
origami c.pdf 4.736ms 137.052KiB 14.338.614
combinepdf c.pdf 1.573ms 121.688KiB 14.329.457
pdftk C? c.pdf 2.092ms 139.756KiB 14.439.011
qpdf C c.pdf 1.293ms 93.496KiB 14.432.647
qpdf CS c.pdf 1.743ms 93.164KiB 13.228.102
smpdf CSP c.pdf 2.345ms 74.696KiB 13.076.598
hexapdf d.pdf 2.574ms 57.604KiB 7.662.938
hexapdf C d.pdf 2.554ms 56.492KiB 6.924.699
hexapdf CS d.pdf 2.835ms 59.048KiB 6.418.438
hexapdf CSP d.pdf 3.005ms 84.748KiB 5.476.431
origami d.pdf 5.625ms 126.840KiB 7.499.298
combinepdf d.pdf 3.376ms 157.520KiB 7.243.107
pdftk C? d.pdf 2.939ms 145.432KiB 7.279.035
qpdf C d.pdf 1.720ms 78.084KiB 7.209.305
qpdf CS d.pdf 1.794ms 78.380KiB 6.703.374
smpdf CSP d.pdf 2.101ms 71.196KiB 5.528.352
hexapdf e.pdf 512ms 46.128KiB 21.766.893
hexapdf C e.pdf 591ms 81.364KiB 21.832.849
hexapdf CS e.pdf 603ms 79.140KiB 21.751.115
hexapdf CSP e.pdf 16.591ms 163.108KiB 21.186.356
ERR origami e.pdf 0ms 0KiB 0
ERR combinepdf e.pdf 0ms 0KiB 0
ERR pdftk C? e.pdf 0ms 0KiB 0
qpdf C e.pdf 583ms 30.180KiB 21.802.439
qpdf CS e.pdf 556ms 30.236KiB 21.787.558
smpdf CSP e.pdf 28.837ms 609.116KiB 21.188.516
hexapdf f.pdf 30.714ms 490.472KiB 153.972.519
hexapdf C f.pdf 33.299ms 526.688KiB 153.844.795
hexapdf CS f.pdf 39.533ms 612.468KiB 117.545.254
ERR hexapdf CSP f.pdf 0ms 0KiB 0
ERR origami f.pdf 0ms 0KiB 0
ERR combinepdf f.pdf 0ms 0KiB 0
qpdf C f.pdf 23.090ms 917.592KiB 157.723.936
qpdf CS f.pdf 29.309ms 942.096KiB 118.114.521
ERR smpdf CSP f.pdf 0ms 0KiB 0
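To compare the results across files of very different sizes, it helps to convert the absolute file sizes into relative savings. The sketch below uses a handful of values copied from the tables above; the helper names are hypothetical. It also shows how to parse the numbers, which use a period as the thousands separator.

```python
def parse_count(text):
    """Parse a number such as '154.752.614' that uses '.' to group thousands."""
    return int(text.replace(".", ""))


def reduction_percent(original, optimized):
    """Return how much smaller the optimized file is, in percent."""
    return (1 - parse_count(optimized) / parse_count(original)) * 100


# (original size, optimized size) pairs taken from the tables above.
samples = {
    "hexapdf CS f.pdf": ("154.752.614", "117.545.254"),
    "qpdf CS f.pdf": ("154.752.614", "118.114.521"),
    "hexapdf CSP d.pdf": ("8.107.348", "5.476.431"),
    "smpdf CSP d.pdf": ("8.107.348", "5.528.352"),
}

for label, (original, optimized) in samples.items():
    print(f"{label}: {reduction_percent(original, optimized):.1f}% smaller")
```

For these samples the savings are roughly a quarter of the original size for f.pdf in CS mode and roughly a third for d.pdf in CSP mode.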