Optimization - HexaPDF

Optimization Benchmark
Benchmark Setup
Results

Optimization Benchmark

One of the ways to use the hexapdf command is to optimize a PDF file in terms of its file size. This involves reading and writing the PDF file and performing the optimization. Sometimes the word “optimization” is used when a PDF file is linearized for faster display on web sites. However, here it always means file size optimization.

There are various ways to optimize the file size of a PDF file and they can be divided into two groups: lossless and lossy operations. Since all used applications perform only lossless optimizations, we only look at those:

Removing unused and deleted objects: A PDF file can store multiple revisions of an object but only the last one is used. So all other versions can safely be deleted.
Using object and cross-reference streams: A PDF file can be thought of as a collection of random-access objects that are stored sequentially in an ASCII-based format. Object streams take those objects and store them compressed in a binary format. And cross-reference streams store the file offsets to the objects in a compressed manner, instead of the standard ASCII-based format.
Recompressing page content streams: The content of a PDF page is described in an ASCII-based format. Some PDF producers don’t optimize their output which can lead to bigger than necessary content streams or don’t store it in a compressed format.

There are some more techniques for reducing the file size like font subsetting/merging/deduplication or object and image deduplication. However, those are rather advanced and not implemented in most PDF libraries because it is hard to get them right.

Benchmark Setup

There are many applications that can perform some or all of the optimizations mentioned above. Since this benchmark is intended to be run on Linux we will use command line applications that are readily available on this platform.

Since the abilities of the applications vary, following is a table of keys used to describe the various operations:

Key	Operation
C	Compacting by removing unused and deleted objects
S	Usage of object and cross-reference streams
P	Recompression of page content streams

The list of the benchmarked applications:

hexapdf

Homepage: http://hexapdf.gettalong.org
Version: Latest version
Abilities: Any combination of C, S and P

We want to benchmark hexapdf with increasing levels of compression, using the following invocations:

None of C, S, or P: hexapdf optimize INPUT --no-compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts OUTPUT
C: hexapdf optimize INPUT --compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts OUTPUT
CS (so this would be the standard mode of operation): hexapdf optimize INPUT OUTPUT
CSP: hexapdf optimize INPUT --compress-pages OUTPUT

Note that a lot of time is spent deflating when recompressing pages. This is because HexaPDF uses the highest deflate compression level by default. By changing the configuration option ‘filter.flate.compression’ to something lower than 9, it is possible to trade compression speed with file size.

combine_pdf

Homepage: https://github.com/boazsegev/combine_pdf
Version: 1.0.29
Abilities: ?

CombinePDF is a tool for merging PDF files, written in Ruby.

The combine_pdf.rb script can be invoked like ruby combine_pdf.rb INPUT OUTPUT.

pdftk

Homepage: https://gitlab.com/marcvinyals/pdftk
Version: 3.3.3
Abilities: C

pdftk is probably one of the best known applications because, like hexapdf it allows for many different operations on PDFs. It is based on the Java iText library. Prior version have been compiled to native code using GCJ but GCJ was deprecated and this fork of pdftk now uses Java.

The application doesn’t have options for optimizing a PDF file but it can be assumed that it removes unused and deleted objects when invoked like pdftk INPUT output OUTPUT.

qpdf

Homepage: http://qpdf.sourceforge.net/
Version: 10.4.0
Abilities: C, CS

QPDF is a command line application for transforming PDF files written in C++.

The standard C mode of operation is invoked with qpdf INPUT OUTPUT whereas the CS mode would need an additional option --object-streams=generate.

cpdf

Homepage: http://www.coherentpdf.com/
Version: 2.8
Abilities: CS, CSP

This is a commercial application but can be used for evaluation purposes. There is no way to configure the operations done but judging from its output it seems it does all of the lossless operations.

Invocation is done like this: cpdf -squeeze -squeeze-no-pagedata INPUT -o OUTPUT (for CS, the -squeeze-no-pagedata is removed for CSP).

The standard files used in the benchmark (not available in the HexaPDF distribution) vary in file size and internal structure:

Name	Size	Objects	Pages	Details
a.pdf	53.056	36	4	Very simple one page file
b.pdf	11.520.218	4.161	439	Many non-stream objects
c.pdf	14.399.980	5.263	620	Linearized, many streams
d.pdf	8.107.348	34.513	20
e.pdf	21.788.087	2.296	52	Huge content streams, many pictures, object streams, encrypted with default password
f.pdf	154.752.614	287.977	28.365	Very big file

Results

These benchmark results are from 2025-01-04.

benchmark graphic

		Time	Memory	File size
hexapdf	a.pdf	286ms	24.540KiB	52.339
hexapdf C	a.pdf	277ms	24.780KiB	52.315
hexapdf CS	a.pdf	292ms	25.676KiB	49.258
hexapdf CSP	a.pdf	303ms	25.680KiB	48.330
combinepdf	a.pdf	153ms	21.004KiB	53.229
pdftk C?	a.pdf	169ms	63.580KiB	53.144
qpdf C	a.pdf	23ms	7.772KiB	53.179
qpdf CS	a.pdf	14ms	8.244KiB	49.287
cpdf CS	a.pdf	24ms	10.132KiB	49.116
cpdf CSP	a.pdf	17ms	10.504KiB	48.236
hexapdf	b.pdf	603ms	40.160KiB	11.231.437
hexapdf C	b.pdf	525ms	42.480KiB	11.358.718
hexapdf CS	b.pdf	595ms	47.044KiB	11.053.653
hexapdf CSP	b.pdf	2.243ms	61.292KiB	11.035.513
combinepdf	b.pdf	755ms	72.456KiB	11.526.138
pdftk C?	b.pdf	527ms	112.624KiB	11.564.056
qpdf C	b.pdf	197ms	23.052KiB	11.273.690
qpdf CS	b.pdf	219ms	23.200KiB	11.124.676
cpdf CS	b.pdf	141ms	35.436KiB	11.104.344
cpdf CSP	b.pdf	1.609ms	58.084KiB	11.096.229
hexapdf	c.pdf	748ms	46.084KiB	14.384.737
hexapdf C	c.pdf	733ms	47.768KiB	14.347.310
hexapdf CS	c.pdf	822ms	50.540KiB	13.182.721
hexapdf CSP	c.pdf	2.488ms	71.812KiB	13.104.056
combinepdf	c.pdf	1.009ms	73.044KiB	14.329.423
pdftk C?	c.pdf	2.195ms	157.348KiB	14.439.011
qpdf C	c.pdf	591ms	95.444KiB	14.432.647
qpdf CS	c.pdf	785ms	95.672KiB	13.221.450
cpdf CS	c.pdf	386ms	65.360KiB	13.168.968
cpdf CSP	c.pdf	1.614ms	77.888KiB	13.081.657
hexapdf	d.pdf	1.497ms	85.724KiB	7.774.816
hexapdf C	d.pdf	1.441ms	75.232KiB	7.036.577
hexapdf CS	d.pdf	1.609ms	71.924KiB	6.530.436
hexapdf CSP	d.pdf	1.560ms	83.024KiB	5.503.967
combinepdf	d.pdf	1.783ms	69.020KiB	7.243.073
pdftk C?	d.pdf	3.176ms	251.736KiB	7.279.035
qpdf C	d.pdf	998ms	69.304KiB	7.209.305
qpdf CS	d.pdf	1.139ms	69.484KiB	6.702.580
cpdf CS	d.pdf	875ms	74.800KiB	6.566.625
cpdf CSP	d.pdf	1.256ms	74.628KiB	5.529.140
hexapdf	e.pdf	571ms	50.300KiB	21.784.690
hexapdf C	e.pdf	628ms	94.516KiB	21.850.643
hexapdf CS	e.pdf	646ms	110.004KiB	21.769.015
hexapdf CSP	e.pdf	11.289ms	186.176KiB	21.204.227
ERR combinepdf	e.pdf	0ms	0KiB	0
pdftk C?	e.pdf	667ms	200.516KiB	21.874.883
qpdf C	e.pdf	244ms	31.372KiB	21.802.439
qpdf CS	e.pdf	243ms	31.812KiB	21.787.322
ERR cpdf CS	e.pdf	0ms	0KiB	0
ERR cpdf CSP	e.pdf	0ms	0KiB	0
hexapdf	f.pdf	15.883ms	576.692KiB	154.077.468
hexapdf C	f.pdf	17.445ms	490.304KiB	153.949.744
hexapdf CS	f.pdf	19.923ms	545.820KiB	117.647.969
ERR hexapdf CSP	f.pdf	0ms	0KiB	0
ERR combinepdf	f.pdf	0ms	0KiB	0
pdftk C?	f.pdf	34.796ms	719.972KiB	157.850.353
qpdf C	f.pdf	13.954ms	959.732KiB	157.723.936
qpdf CS	f.pdf	18.601ms	975.292KiB	118.023.718
cpdf CS	f.pdf	18.790ms	928.720KiB	114.098.009
ERR cpdf CSP	f.pdf	0ms	0KiB	0