Optimization Benchmark

One of the ways to use the hexapdf command is to optimize a PDF file in terms of its file size. This involves reading and writing the PDF file and performing the optimization. Sometimes the word “optimization” is used when a PDF file is linearized for faster display on web sites. However, here it always means file size optimization.

There are various ways to optimize the file size of a PDF file and they can be divided into two groups: lossless and lossy operations. Since all of the benchmarked applications perform only lossless optimizations, we only look at those:

Removing unused and deleted objects

A PDF file can store multiple revisions of an object but only the last one is used. So all other versions can safely be deleted.
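As a rough illustration of this compaction step, the following Ruby sketch works on a toy model, not on real PDF data. The names `Ref`, `compact_objects` and `references` are hypothetical and do not come from any of the benchmarked tools. Revisions are plain hashes, a later revision overrides an earlier one, `nil` marks a deleted object, and everything that cannot be reached from the root object is dropped.

```ruby
# Toy model only: Ref, compact_objects and references are hypothetical names.
Ref = Struct.new(:oid)

# Collect the object numbers referenced from a value (recursively).
def references(value)
  case value
  when Ref   then [value.oid]
  when Hash  then value.values.flat_map { |v| references(v) }
  when Array then value.flat_map { |v| references(v) }
  else []
  end
end

def compact_objects(revisions, root)
  # Later revisions override earlier ones; nil marks a deleted object.
  current = revisions.reduce({}) { |merged, rev| merged.merge(rev) }.compact

  # Mark every object reachable from the root object.
  reachable = []
  pending = [root]
  until pending.empty?
    oid = pending.pop
    next if reachable.include?(oid) || !current.key?(oid)
    reachable << oid
    pending.concat(references(current[oid]))
  end

  # Keep only the reachable objects.
  current.select { |oid, _| reachable.include?(oid) }
end

revisions = [
  { 1 => { pages: Ref.new(2) }, 2 => { kids: [Ref.new(3)] }, 3 => "page A", 4 => "old font" },
  { 3 => "page A, edited", 4 => nil, 5 => "orphan" }
]
result = compact_objects(revisions, 1)
puts result.keys.sort.inspect
```

In this example only the latest version of object 3 survives, the deleted object 4 disappears, and object 5 is dropped because nothing refers to it.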

Using object and cross-reference streams

A PDF file can be thought of as a collection of random-access objects that are stored sequentially in an ASCII-based format. Object streams take those objects and store them compressed in a binary format. And cross-reference streams store the file offsets to the objects in a compressed manner, instead of the standard ASCII-based format.
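To get a feeling for why this saves space, here is a small Ruby sketch using only the standard `zlib` library. It is a simplification under stated assumptions: the page dictionaries are made up, and a real object stream also contains an index of object numbers and offsets, which is left out here.

```ruby
require "zlib"

# Twenty made-up page dictionaries in the plain, ASCII-based PDF syntax.
bodies = (1..20).map do |n|
  "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] /Contents #{100 + n} 0 R >>"
end

# Stored the classic way: one "N 0 obj ... endobj" wrapper per object.
plain = bodies.each_with_index.map { |body, i| "#{i + 10} 0 obj\n#{body}\nendobj\n" }.join

# Stored together and deflated, which is the core idea of an object stream.
packed = Zlib::Deflate.deflate(bodies.join("\n"))

puts "plain: #{plain.bytesize} bytes, packed: #{packed.bytesize} bytes"
```

Because the dictionaries of similar objects share most of their keys, compressing them together tends to work much better than storing each one separately.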

Recompressing page content streams

The content of a PDF page is described in an ASCII-based format. Some PDF producers don’t optimize their output, which can lead to content streams that are bigger than necessary, and some don’t store the content streams in a compressed format at all.
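The compression half of this operation can be sketched in Ruby with the standard `zlib` library. The helper name `recompress` and the sample content stream are assumptions made for this example. The sketch only deflates the data again with the best compression level and does not rewrite the drawing operators themselves, which a real implementation may also do.

```ruby
require "zlib"

# Hypothetical helper: return the stream data deflated with the best level.
# If the data is already deflated, it is inflated first.
def recompress(stream_data, compressed:)
  raw = compressed ? Zlib::Inflate.inflate(stream_data) : stream_data
  Zlib::Deflate.deflate(raw, Zlib::BEST_COMPRESSION)
end

# A made-up, repetitive content stream that a producer left uncompressed.
content = "BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n" * 50

# The same content as written by a producer that favours speed over size.
weak = Zlib::Deflate.deflate(content, Zlib::BEST_SPEED)

from_plain = recompress(content, compressed: false)
from_weak = recompress(weak, compressed: true)
puts "uncompressed: #{content.bytesize}, recompressed: #{from_plain.bytesize}"
```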

There are some more techniques for reducing the file size like font subsetting/merging/deduplication or object and image deduplication. However, those are rather advanced and not implemented in most PDF libraries because it is hard to get them right.

Benchmark Setup

There are many applications that can perform some or all of the optimizations mentioned above. Since this benchmark is intended to be run on Linux, we use command line applications that are readily available on this platform.

Since the abilities of the applications vary, the following table lists the keys used to describe the various operations:

Key  Operation
C    Compacting by removing unused and deleted objects
S    Usage of object and cross-reference streams
P    Recompression of page content streams

The list of the benchmarked applications:

hexapdf

Homepage: http://hexapdf.gettalong.org
Version: Latest version
Abilities: Any combination of C, S and P

We want to benchmark hexapdf with increasing levels of compression, using the following invocations:

None of C, S, or P:
    hexapdf optimize INPUT --no-compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts OUTPUT
C:
    hexapdf optimize INPUT --compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts OUTPUT
CS (the standard mode of operation):
    hexapdf optimize INPUT OUTPUT
CSP:
    hexapdf optimize INPUT --compress-pages OUTPUT

origami

Homepage: https://github.com/gdelugre/origami
Version: 2.1.0
Abilities: ?

Similar to HexaPDF, Origami is a framework for manipulating PDF files. Since it is also written in Ruby, it makes for a good comparison.

The origami.rb script can be invoked like ruby origami.rb INPUT OUTPUT.

combine_pdf

Homepage: https://github.com/boazsegev/combine_pdf
Version: 1.0.22
Abilities: ?

CombinePDF is a tool for merging PDF files, written in Ruby.

The combine_pdf.rb script can be invoked like ruby combine_pdf.rb INPUT OUTPUT.

pdftk

Homepage: https://gitlab.com/marcvinyals/pdftk
Version: 3.0.9
Abilities: C

pdftk is probably one of the best known applications because, like hexapdf, it allows for many different operations on PDFs. It is based on the Java iText library. Prior versions were compiled to native code using GCJ, but GCJ was deprecated and this fork of pdftk now runs on Java.

The application doesn’t have options for optimizing a PDF file but it can be assumed that it removes unused and deleted objects when invoked like pdftk INPUT output OUTPUT.

qpdf

Homepage: http://qpdf.sourceforge.net/
Version: 10.4.0
Abilities: C, CS

QPDF is a command line application, written in C++, for transforming PDF files.

The standard C mode of operation is invoked with qpdf INPUT OUTPUT, whereas the CS mode needs the additional option --object-streams=generate.

smpdf

Homepage: http://www.coherentpdf.com/compression.html
Version: 1.4.1
Abilities: CSP

This is a commercial application but it can be used for evaluation purposes. There is no way to configure the operations it performs but, judging from its output, it seems to perform all of the lossless operations.

Invocation is done like this: smpdf INPUT -o OUTPUT.
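The invocations listed for the individual applications can be collected in one place, for example like in the following Ruby sketch. The labels match the rows of the results table. The names `COMMANDS`, `command_for` and `run_timed` are hypothetical, the position of the qpdf option before the file names is an assumption, and the timing helper is only one possible way to measure wall-clock time, not necessarily the method used for this benchmark.

```ruby
# Command templates copied from the invocations given for each application.
# INPUT and OUTPUT are placeholders that command_for replaces.
COMMANDS = {
  "hexapdf"     => %w[hexapdf optimize INPUT --no-compact --object-streams=preserve
                      --xref-streams=preserve --streams=preserve --no-optimize-fonts OUTPUT],
  "hexapdf C"   => %w[hexapdf optimize INPUT --compact --object-streams=preserve
                      --xref-streams=preserve --streams=preserve --no-optimize-fonts OUTPUT],
  "hexapdf CS"  => %w[hexapdf optimize INPUT OUTPUT],
  "hexapdf CSP" => %w[hexapdf optimize INPUT --compress-pages OUTPUT],
  "origami"     => %w[ruby origami.rb INPUT OUTPUT],
  "combinepdf"  => %w[ruby combine_pdf.rb INPUT OUTPUT],
  "pdftk C?"    => %w[pdftk INPUT output OUTPUT],
  # Assumption: the option is placed before the file names.
  "qpdf C"      => %w[qpdf INPUT OUTPUT],
  "qpdf CS"     => %w[qpdf --object-streams=generate INPUT OUTPUT],
  "smpdf CSP"   => %w[smpdf INPUT -o OUTPUT]
}.freeze

# Build the argument list for one application and one pair of files.
def command_for(label, input, output)
  substitutions = { "INPUT" => input, "OUTPUT" => output }
  COMMANDS.fetch(label).map { |arg| substitutions.fetch(arg, arg) }
end

# Hypothetical helper: run a command and return [success, seconds elapsed].
def run_timed(command)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  success = system(*command)
  [success, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start]
end

puts command_for("qpdf CS", "a.pdf", "out.pdf").join(" ")
```

Passing the command as an argument list avoids shell quoting problems with unusual file names. Note that the lowercase `output` in the pdftk command is a keyword of pdftk and is therefore not replaced.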

The standard files used in the benchmark (not available in the HexaPDF distribution) vary in file size and internal structure:

Name   Size (bytes)  Objects  Pages   Details
a.pdf  53.056        36       4       Very simple one page file
b.pdf  11.520.218    4.161    439     Many non-stream objects
c.pdf  14.399.980    5.263    620     Linearized, many streams
d.pdf  8.107.348     34.513   20
e.pdf  21.788.087    2.296    52      Huge content streams, many pictures, object streams, encrypted with default password
f.pdf  154.752.614   287.977  28.365  Very big file

Results

These benchmark results are from 2022-12-30.

[Figure: benchmark graphic]

Application      File   Time      Memory        File size (bytes)
hexapdf          a.pdf  285ms     34.624KiB     52.300
hexapdf C        a.pdf  267ms     34.740KiB     52.278
hexapdf CS       a.pdf  278ms     35.168KiB     49.226
hexapdf CSP      a.pdf  297ms     35.868KiB     48.297
origami          a.pdf  340ms     43.616KiB     52.111
combinepdf       a.pdf  173ms     30.920KiB     53.263
pdftk C?         a.pdf  171ms     38.056KiB     53.144
qpdf C           a.pdf  31ms      8.052KiB      53.179
qpdf CS          a.pdf  12ms      8.092KiB      49.287
smpdf CSP        a.pdf  20ms      8.316KiB      48.329
hexapdf          b.pdf  742ms     51.944KiB     11.222.968
hexapdf C        b.pdf  735ms     54.100KiB     11.350.258
hexapdf CS       b.pdf  792ms     57.592KiB     11.045.210
hexapdf CSP      b.pdf  3.417ms   65.056KiB     11.027.076
ERR origami      b.pdf  0ms       0KiB          0
combinepdf       b.pdf  1.372ms   131.220KiB    11.526.172
pdftk C?         b.pdf  744ms     104.780KiB    11.564.056
qpdf C           b.pdf  340ms     22.928KiB     11.273.690
qpdf CS          b.pdf  367ms     23.052KiB     11.126.861
smpdf CSP        b.pdf  2.744ms   49.676KiB     11.092.465
hexapdf          c.pdf  1.017ms   54.916KiB     14.382.696
hexapdf C        c.pdf  1.090ms   54.712KiB     14.345.270
hexapdf CS       c.pdf  1.201ms   59.664KiB     13.180.716
hexapdf CSP      c.pdf  3.788ms   73.668KiB     13.102.032
origami          c.pdf  5.111ms   134.020KiB    14.338.126
combinepdf       c.pdf  1.520ms   147.584KiB    14.329.457
pdftk C?         c.pdf  3.683ms   139.412KiB    14.439.011
qpdf C           c.pdf  914ms     95.572KiB     14.432.647
qpdf CS          c.pdf  1.224ms   95.592KiB     13.228.102
smpdf CSP        c.pdf  2.596ms   74.548KiB     13.076.598
hexapdf          d.pdf  2.368ms   80.268KiB     7.662.939
hexapdf C        d.pdf  2.312ms   84.340KiB     6.924.700
hexapdf CS       d.pdf  2.672ms   84.228KiB     6.418.483
hexapdf CSP      d.pdf  2.475ms   99.964KiB     5.391.920
origami          d.pdf  5.573ms   141.088KiB    7.498.876
combinepdf       d.pdf  2.957ms   145.392KiB    7.243.107
pdftk C?         d.pdf  5.156ms   152.192KiB    7.279.035
qpdf C           d.pdf  1.625ms   69.484KiB     7.209.305
qpdf CS          d.pdf  1.747ms   69.264KiB     6.703.374
smpdf CSP        d.pdf  2.218ms   70.980KiB     5.528.352
hexapdf          e.pdf  620ms     58.712KiB     21.766.847
hexapdf C        e.pdf  740ms     111.516KiB    21.832.860
hexapdf CS       e.pdf  790ms     116.448KiB    21.751.180
hexapdf CSP      e.pdf  16.726ms  156.972KiB    21.186.414
ERR origami      e.pdf  0ms       0KiB          0
ERR combinepdf   e.pdf  0ms       0KiB          0
pdftk C?         e.pdf  863ms     138.460KiB    21.874.883
qpdf C           e.pdf  407ms     31.288KiB     21.802.439
qpdf CS          e.pdf  405ms     31.388KiB     21.787.558
smpdf CSP        e.pdf  32.419ms  608.908KiB    21.188.516
hexapdf          f.pdf  25.283ms  433.488KiB    153.972.520
hexapdf C        f.pdf  27.345ms  472.464KiB    153.844.796
hexapdf CS       f.pdf  31.618ms  551.468KiB    117.545.254
ERR hexapdf CSP  f.pdf  0ms       0KiB          0
origami          f.pdf  75.431ms  1.432.276KiB  152.614.156
ERR combinepdf   f.pdf  0ms       0KiB          0
pdftk C?         f.pdf  63.202ms  611.476KiB    157.850.353
qpdf C           f.pdf  19.822ms  959.732KiB    157.723.936
qpdf CS          f.pdf  26.195ms  975.212KiB    118.114.521
ERR smpdf CSP    f.pdf  0ms       0KiB          0
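
The numbers in the tables use a period as the thousands separator, so they need a small conversion step before any arithmetic. The following Ruby sketch, with the hypothetical helpers `parse_count` and `reduction_percent`, shows one way to compute the relative file size reduction from an original size and a result size taken from the tables.

```ruby
# Parse a number such as "154.752.614" (period as thousands separator).
def parse_count(text)
  text.delete(".").to_i
end

# Relative size reduction in percent, rounded to one decimal place.
def reduction_percent(original, optimized)
  ratio = parse_count(optimized).to_f / parse_count(original)
  ((1 - ratio) * 100).round(1)
end

# hexapdf CS on f.pdf: original size versus optimized size.
puts reduction_percent("154.752.614", "117.545.254")

# hexapdf CSP on d.pdf: original size versus optimized size.
puts reduction_percent("8.107.348", "5.391.920")
```

For these two rows the reduction comes out at roughly a quarter of the file size for f.pdf and roughly a third for d.pdf.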