Optimization Benchmark

One of the ways to use the hexapdf command is to optimize a PDF file in terms of its file size. This involves reading and writing the PDF file and performing the optimization. Sometimes the word “optimization” is used when a PDF file is linearized for faster display on web sites. However, here it always means file size optimization.

There are various ways to optimize the file size of a PDF file and they can be divided into two groups: lossless and lossy operations. Since all benchmarked applications perform only lossless optimizations, only those are covered here:

Removing unused and deleted objects

A PDF file can store multiple revisions of an object but only the last one is used. So all other versions can safely be deleted.

Using object and cross-reference streams

A PDF file can be thought of as a collection of random-access objects that are stored sequentially in an ASCII-based format. Object streams take those objects and store them compressed in a binary format. And cross-reference streams store the file offsets to the objects in a compressed manner, instead of the standard ASCII-based format.

Recompressing page content streams

The content of a PDF page is described in an ASCII-based format. Some PDF producers don’t optimize their output, which can lead to content streams that are bigger than necessary, and some don’t store the content streams in a compressed format at all.
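To make the effect of compressing ASCII-based data more tangible, here is a minimal Ruby sketch that uses only the standard zlib library. The sample content stream and the helper name flate_compress are made up for illustration and are not taken from the benchmark files or from any of the benchmarked applications; real-world savings depend on the actual content.

```ruby
require 'zlib'

# Hypothetical helper: deflates the data with the highest compression level.
# Deflate is the algorithm behind PDF's FlateDecode filter.
def flate_compress(data)
  Zlib::Deflate.deflate(data, Zlib::BEST_COMPRESSION)
end

# Made-up, repetitive page content stream; PDF content operators are plain ASCII.
content = "BT /F1 12 Tf 72 720 Td (Hello World) Tj ET\n" * 200
compressed = flate_compress(content)

puts "uncompressed: #{content.bytesize} bytes"
puts "compressed:   #{compressed.bytesize} bytes"
puts format('ratio:        %.1f%%', 100.0 * compressed.bytesize / content.bytesize)
```

Highly repetitive input like this compresses far better than typical page content, so the printed ratio should be read as an upper bound on what recompression can achieve.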

There are some more techniques for reducing the file size like font subsetting/merging/deduplication or object and image deduplication. However, those are rather advanced and not implemented in most PDF libraries because it is hard to get them right.

Benchmark Setup

There are many applications that can perform some or all of the optimizations mentioned above. Since this benchmark is intended to be run on Linux, we use command line applications that are readily available on this platform.

Since the abilities of the applications vary, the following table lists the keys used to describe the various operations:

Key Operation
C Compacting by removing unused and deleted objects
S Usage of object and cross-reference streams
P Recompression of page content streams

The list of the benchmarked applications:

hexapdf

Homepage: http://hexapdf.gettalong.org
Version: Latest version
Abilities: Any combination of C, S and P

We want to benchmark hexapdf with increasing levels of compression, using the following invocations:

None of C, S, or P
hexapdf optimize INPUT --no-compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts OUTPUT
C
hexapdf optimize INPUT --compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts OUTPUT
CS (so this would be the standard mode of operation)
hexapdf optimize INPUT OUTPUT
CSP
hexapdf optimize INPUT --compress-pages OUTPUT
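The four invocations above differ only in their options, so they can be generated from a small table. The following Ruby sketch shows one way to do that; the constant and method names are hypothetical, and all options are copied from the invocations listed above.

```ruby
# Options that tell hexapdf to leave streams and fonts untouched
# (copied from the "None" and "C" invocations above).
PRESERVE = %w[
  --object-streams=preserve --xref-streams=preserve
  --streams=preserve --no-optimize-fonts
].freeze

# Maps each benchmark mode to the options passed to "hexapdf optimize".
HEXAPDF_MODES = {
  'none' => ['--no-compact', *PRESERVE],
  'C'    => ['--compact', *PRESERVE],
  'CS'   => [],
  'CSP'  => ['--compress-pages']
}.freeze

# Hypothetical helper: returns the argument vector for the given mode.
def hexapdf_command(mode, input, output)
  ['hexapdf', 'optimize', input, *HEXAPDF_MODES.fetch(mode), output]
end

HEXAPDF_MODES.each_key do |mode|
  puts hexapdf_command(mode, 'INPUT', 'OUTPUT').join(' ')
end
```

Keeping the command as an array of arguments instead of a single string avoids quoting problems when file names contain spaces.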

origami

Homepage: https://github.com/gdelugre/origami
Version: 2.1.0
Abilities: ?

Similar to HexaPDF, Origami is a framework for manipulating PDF files. Since it is also written in Ruby, it makes for a good comparison.

The origami.rb script can be invoked like ruby origami.rb INPUT OUTPUT.

combine_pdf

Homepage: https://github.com/boazsegev/combine_pdf
Version: 1.0.15
Abilities: ?

CombinePDF is a tool for merging PDF files, written in Ruby.

The combine_pdf.rb script can be invoked like ruby combine_pdf.rb INPUT OUTPUT.

pdftk

Homepage: https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
Version: 2.02
Abilities: C

pdftk is probably one of the best-known applications because, like hexapdf, it allows for many different operations on PDFs. It is based on the Java iText library, which has been compiled to native code using GCJ.

The application doesn’t have options for optimizing a PDF file but it can be assumed that it removes unused and deleted objects when invoked like pdftk INPUT output OUTPUT.

Note that GCJ was deprecated and newer versions of Ubuntu don’t include the pdftk package anymore!

qpdf

Homepage: http://qpdf.sourceforge.net/
Version: 8.0.2
Abilities: C, CS

QPDF is a command line application, written in C++, for transforming PDF files.

The standard C mode of operation is invoked with qpdf INPUT OUTPUT, whereas the CS mode needs the additional option --object-streams=generate.

smpdf

Homepage: http://www.coherentpdf.com/compression.html
Version: 1.4.1
Abilities: CSP

This is a commercial application but it can be used for evaluation purposes. The operations it performs cannot be configured but, judging from its output, it seems to perform all of the lossless operations.

Invocation is done like this: smpdf INPUT -o OUTPUT.
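The invocations of the non-Ruby applications described above can be collected and timed with a few lines of Ruby. This is only a sketch under stated assumptions and not the script that produced the results: the names EXTERNAL_COMMANDS, build_command and measure are hypothetical, the position of the qpdf option before the file names is an assumption, and memory usage is not measured.

```ruby
# Command line templates as described above; INPUT and OUTPUT are placeholders.
EXTERNAL_COMMANDS = {
  'pdftk C?'  => %w[pdftk INPUT output OUTPUT],
  'qpdf C'    => %w[qpdf INPUT OUTPUT],
  # Assumption: the option is placed before the file names.
  'qpdf CS'   => %w[qpdf --object-streams=generate INPUT OUTPUT],
  'smpdf CSP' => %w[smpdf INPUT -o OUTPUT]
}.freeze

# Replaces the INPUT and OUTPUT placeholders with real file names.
def build_command(template, input, output)
  placeholders = { 'INPUT' => input, 'OUTPUT' => output }
  template.map { |arg| placeholders.fetch(arg, arg) }
end

# Runs the command and returns the wall-clock time in milliseconds together
# with the size of the written file in bytes (nil if the command failed).
def measure(argv, output)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  success = system(*argv)
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  size = success && File.exist?(output) ? File.size(output) : nil
  { time_ms: (elapsed * 1000).round, size: size }
end

# Only runs the applications when called as "ruby SCRIPT INPUT OUTPUT".
if __FILE__ == $PROGRAM_NAME && ARGV.size == 2
  input, output = ARGV
  EXTERNAL_COMMANDS.each do |name, template|
    result = measure(build_command(template, input, output), output)
    puts "#{name}: #{result[:time_ms]}ms, #{result[:size].inspect} bytes"
  end
end
```

A monotonic clock is used so that the measured time is not affected by adjustments of the system clock during a run.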

The standard files used in the benchmark (not available in the HexaPDF distribution) vary in file size and internal structure:

Name   Size (bytes)  Objects  Pages   Details
a.pdf        53,056       36       4  Very simple one page file
b.pdf    11,520,218    4,161     439  Many non-stream objects
c.pdf    14,399,980    5,263     620  Linearized, many streams
d.pdf     8,107,348   34,513      20
e.pdf    21,788,087    2,296      52  Huge content streams, many pictures, object streams, encrypted with default password
f.pdf   154,752,614  287,977  28,365  Very big file

Results

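These benchmark results are from 2017-12-31. Rows marked ERR denote runs that failed; their time, memory and file size values are reported as 0.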

[Figure: benchmark graphic]

Application (mode) File Time Memory File size (bytes)
hexapdf a.pdf 149ms 14,788KiB 52,313
hexapdf C a.pdf 148ms 14,768KiB 52,289
hexapdf CS a.pdf 150ms 15,180KiB 49,152
hexapdf CSP a.pdf 165ms 15,556KiB 48,223
origami a.pdf 235ms 22,572KiB 52,312
combinepdf a.pdf 113ms 14,004KiB 53,697
pdftk C? a.pdf 43ms 54,528KiB 53,144
qpdf C a.pdf 8ms 4,652KiB 53,179
qpdf CS a.pdf 11ms 4,652KiB 49,287
smpdf CSP a.pdf 23ms 8,436KiB 48,329
hexapdf b.pdf 836ms 32,072KiB 11,456,455
hexapdf C b.pdf 857ms 25,844KiB 11,406,391
hexapdf CS b.pdf 984ms 29,196KiB 11,044,875
hexapdf CSP b.pdf 7,779ms 40,312KiB 11,026,744
origami b.pdf 2,133ms 87,952KiB 11,482,769
combinepdf b.pdf 6,552ms 115,764KiB 11,498,874
pdftk C? b.pdf 480ms 69,552KiB 11,501,669
qpdf C b.pdf 295ms 11,636KiB 11,500,308
qpdf CS b.pdf 365ms 11,852KiB 11,124,779
smpdf CSP b.pdf 3,349ms 49,836KiB 11,092,428
hexapdf c.pdf 1,709ms 37,532KiB 14,382,785
hexapdf C c.pdf 1,836ms 37,800KiB 14,347,139
hexapdf CS c.pdf 2,013ms 40,304KiB 13,180,380
hexapdf CSP c.pdf 8,370ms 58,520KiB 13,101,710
origami c.pdf 7,912ms 135,052KiB 14,338,614
combinepdf c.pdf 2,428ms 119,144KiB 14,329,496
pdftk C? c.pdf 1,674ms 101,932KiB 14,439,611
qpdf C c.pdf 751ms 34,064KiB 14,432,647
qpdf CS c.pdf 998ms 34,528KiB 13,228,102
smpdf CSP c.pdf 3,090ms 74,572KiB 13,076,598
hexapdf d.pdf 4,269ms 63,148KiB 7,662,938
hexapdf C d.pdf 4,150ms 57,864KiB 6,924,699
hexapdf CS d.pdf 4,721ms 53,688KiB 6,418,429
hexapdf CSP d.pdf 4,782ms 83,568KiB 5,476,424
origami d.pdf 9,205ms 137,612KiB 7,499,298
combinepdf d.pdf 4,852ms 154,732KiB 7,243,135
pdftk C? d.pdf 2,351ms 103,652KiB 7,279,035
qpdf C d.pdf 1,514ms 40,580KiB 7,209,305
qpdf CS d.pdf 1,629ms 40,808KiB 6,703,374
smpdf CSP d.pdf 2,831ms 71,068KiB 5,528,352
hexapdf e.pdf 783ms 48,700KiB 21,766,894
hexapdf C e.pdf 905ms 84,280KiB 21,832,850
hexapdf CS e.pdf 969ms 74,392KiB 21,751,114
hexapdf CSP e.pdf 28,415ms 166,092KiB 21,186,359
origami e.pdf 1,969ms 132,376KiB 21,800,150
ERR combinepdf e.pdf 0ms 0KiB 0
pdftk C? e.pdf 693ms 123,684KiB 21,874,883
qpdf C e.pdf 998ms 63,988KiB 21,802,439
qpdf CS e.pdf 1,154ms 64,368KiB 21,787,558
smpdf CSP e.pdf 37,382ms 646,888KiB 21,188,516
hexapdf f.pdf 44,889ms 473,780KiB 153,972,519
hexapdf C f.pdf 48,766ms 507,772KiB 153,844,795
hexapdf CS f.pdf 55,066ms 543,556KiB 117,542,999
ERR hexapdf CSP f.pdf 0ms 0KiB 0
ERR origami f.pdf 0ms 0KiB 0
ERR combinepdf f.pdf 0ms 0KiB 0
pdftk C? f.pdf 32,260ms 689,900KiB 157,850,354
qpdf C f.pdf 18,270ms 478,816KiB 157,723,936
qpdf CS f.pdf 22,787ms 484,512KiB 118,114,521
ERR smpdf CSP f.pdf 0ms 0KiB 0
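The absolute file sizes are easier to compare when expressed as a reduction relative to the original file. The following Ruby sketch does this for a few rows; the sizes are copied from the tables above, while the constant and method names are hypothetical.

```ruby
# Original sizes and selected result sizes in bytes, copied from the tables above.
ORIGINAL_SIZES = { 'd.pdf' => 8_107_348, 'f.pdf' => 154_752_614 }.freeze

RESULTS = [
  ['hexapdf CSP', 'd.pdf', 5_476_424],
  ['smpdf CSP',   'd.pdf', 5_528_352],
  ['hexapdf CS',  'f.pdf', 117_542_999],
  ['qpdf CS',     'f.pdf', 118_114_521]
].freeze

# Returns by how many percent the optimized file is smaller than the original.
def reduction_percent(original, optimized)
  100.0 * (original - optimized) / original
end

RESULTS.each do |app, file, size|
  percent = reduction_percent(ORIGINAL_SIZES.fetch(file), size)
  puts format('%-12s %-6s %5.1f%% smaller', app, file, percent)
end
```

A negative value would mean that the output is larger than the input, which does happen for some of the rows above, for example the pdftk result for f.pdf.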