Optimization Benchmark
One of the ways to use the `hexapdf` command is to optimize a PDF file in terms of its file size.
This involves reading and writing the PDF file and performing the optimization. Sometimes the word
“optimization” is used when a PDF file is linearized for faster display on web sites. However, here
it always means file size optimization.
There are various ways to optimize the file size of a PDF file, and they can be divided into two groups: lossless and lossy operations. Since all benchmarked applications perform only lossless optimizations, only those are covered here:
- Removing unused and deleted objects

  A PDF file can store multiple revisions of an object, but only the last one is used. So all other versions can safely be deleted.

- Using object and cross-reference streams

  A PDF file can be thought of as a collection of random-access objects that are stored sequentially in an ASCII-based format. Object streams take those objects and store them compressed in a binary format. Cross-reference streams store the file offsets of the objects in a compressed manner instead of in the standard ASCII-based format.

- Recompressing page content streams

  The content of a PDF page is described in an ASCII-based format. Some PDF producers don't optimize their output, which can lead to bigger-than-necessary content streams, or they don't store the content in a compressed format.
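The effect of recompressing a content stream can be illustrated with Ruby's standard `zlib` library. This is only a minimal sketch: the sample content and the helper name `deflate_stream` are made up for illustration, and it works on a plain string rather than on a real PDF file.

```ruby
require 'zlib'

# Hypothetical helper: compresses the bytes of a content stream with Deflate,
# the algorithm behind PDF's FlateDecode filter, at the highest level.
def deflate_stream(data)
  Zlib::Deflate.deflate(data, Zlib::BEST_COMPRESSION)
end

# Made-up, uncompressed page content: repetitive ASCII operators, similar in
# spirit to what an unoptimized PDF producer might write.
content = +''
200.times do |i|
  content << "BT /F1 12 Tf 72 #{720 - i} Td (Line #{i}) Tj ET\n"
end

compressed = deflate_stream(content)

# Inflating must return the original bytes, i.e. the operation is lossless.
restored = Zlib::Inflate.inflate(compressed).force_encoding(content.encoding)

puts "original:   #{content.bytesize} bytes"
puts "compressed: #{compressed.bytesize} bytes"
puts "lossless:   #{restored == content}"
```

Because the operators in a content stream repeat heavily, such data usually compresses well, which is why this step can pay off for files written without compression.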
There are some more techniques for reducing the file size like font subsetting/merging/deduplication or object and image deduplication. However, those are rather advanced and not implemented in most PDF libraries because it is hard to get them right.
Benchmark Setup
There are many applications that can perform some or all of the optimizations mentioned above. Since this benchmark is intended to be run on Linux, we will use command line applications that are readily available on this platform.
Since the abilities of the applications vary, the following table lists the keys used to describe the various operations:
Key | Operation |
---|---|
C | Compacting by removing unused and deleted objects |
S | Usage of object and cross-reference streams |
P | Recompression of page content streams |
The list of the benchmarked applications:
- hexapdf

  Homepage: http://hexapdf.gettalong.org
  Version: Latest version
  Abilities: Any combination of C, S and P

  We want to benchmark `hexapdf` with increasing levels of compression, using the following invocations:

  - None of C, S, or P

    `hexapdf optimize INPUT --no-compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts OUTPUT`

  - C

    `hexapdf optimize INPUT --compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts OUTPUT`

  - CS (so this would be the standard mode of operation)

    `hexapdf optimize INPUT OUTPUT`

  - CSP

    `hexapdf optimize INPUT --compress-pages OUTPUT`

  Note that a lot of time is spent deflating when recompressing pages. This is because HexaPDF uses the highest deflate compression level by default. By changing the configuration option `filter.flate.compression` to something lower than 9, it is possible to trade file size for compression speed.

- origami

  Homepage: https://github.com/gdelugre/origami
  Version: 2.1.0
  Abilities: ?

  Similar to HexaPDF, Origami is a framework for manipulating PDF files. Since it is also written in Ruby, it makes for a good comparison.

  The `origami.rb` script can be invoked like `ruby origami.rb INPUT OUTPUT`.

- combine_pdf

  Homepage: https://github.com/boazsegev/combine_pdf
  Version: 1.0.23
  Abilities: ?

  CombinePDF is a tool for merging PDF files, written in Ruby.

  The `combine_pdf.rb` script can be invoked like `ruby combine_pdf.rb INPUT OUTPUT`.

- pdftk

  Homepage: https://gitlab.com/marcvinyals/pdftk
  Version: 3.3.2
  Abilities: C

  `pdftk` is probably one of the best known applications because, like `hexapdf`, it allows for many different operations on PDFs. It is based on the Java iText library. Prior versions were compiled to native code using GCJ, but GCJ was deprecated and this fork of pdftk now uses Java.

  The application doesn't have options for optimizing a PDF file, but it can be assumed that it removes unused and deleted objects when invoked like `pdftk INPUT output OUTPUT`.

- qpdf

  Homepage: http://qpdf.sourceforge.net/
  Version: 10.4.0
  Abilities: C, CS

  QPDF is a command line application for transforming PDF files, written in C++.

  The standard `C` mode of operation is invoked with `qpdf INPUT OUTPUT`, whereas the CS mode needs the additional option `--object-streams=generate`.

- smpdf

  Homepage: http://www.coherentpdf.com/compression.html
  Version: 1.4.1
  Abilities: CSP

  This is a commercial application but can be used for evaluation purposes. There is no way to configure the operations done, but judging from its output it seems to perform all of the lossless operations.

  Invocation is done like this: `smpdf INPUT -o OUTPUT`.
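The note on `filter.flate.compression` in the hexapdf entry describes a trade-off between compression time and output size. The sketch below illustrates that trade-off with Ruby's standard `zlib` library only; it does not call HexaPDF, and the helper name `deflate_at_levels` as well as the sample data are assumptions made for illustration.

```ruby
require 'zlib'
require 'benchmark'

# Hypothetical helper: deflates +data+ once per level and records the output
# size in bytes and the elapsed time in milliseconds.
def deflate_at_levels(data, levels)
  levels.to_h do |level|
    compressed = nil
    seconds = Benchmark.realtime { compressed = Zlib::Deflate.deflate(data, level) }
    [level, { bytes: compressed.bytesize, ms: (seconds * 1000).round(2) }]
  end
end

# Made-up sample data standing in for page content streams.
data = Array.new(20_000) { |i| "BT /F1 12 Tf 72 #{i % 800} Td (Item #{i}) Tj ET\n" }.join

deflate_at_levels(data, [1, 6, 9]).each do |level, result|
  puts format('level %d: %d bytes in %.2f ms', level, result[:bytes], result[:ms])
end
```

The exact numbers depend on the data and the machine, so the output is best read as a relative comparison between levels rather than as a prediction for a particular PDF file.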
The standard files used in the benchmark (not available in the HexaPDF distribution) vary in file size and internal structure:
Name | Size | Objects | Pages | Details |
---|---|---|---|---|
a.pdf | 53.056 | 36 | 4 | Very simple one page file |
b.pdf | 11.520.218 | 4.161 | 439 | Many non-stream objects |
c.pdf | 14.399.980 | 5.263 | 620 | Linearized, many streams |
d.pdf | 8.107.348 | 34.513 | 20 | |
e.pdf | 21.788.087 | 2.296 | 52 | Huge content streams, many pictures, object streams, encrypted with default password |
f.pdf | 154.752.614 | 287.977 | 28.365 | Very big file |
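This section does not describe how the measurements are taken, so the following is only a rough sketch of how run time and output size of one invocation could be recorded. The helper name `measure` is hypothetical, and the command used in the usage part is a stand-in that merely copies a file; in a real run it would be replaced by one of the invocations listed above.

```ruby
require 'benchmark'
require 'rbconfig'
require 'tmpdir'

# Hypothetical helper: runs a command (given as an argument array) and returns
# its wall-clock time and the size of the file it produced.
# Returns nil if the command fails or produces no output file.
def measure(command, output_path)
  ok = nil
  seconds = Benchmark.realtime do
    ok = system(*command, out: File::NULL, err: File::NULL)
  end
  return nil unless ok && File.exist?(output_path)

  { time_ms: (seconds * 1000).round, size: File.size(output_path) }
end

# Usage with a stand-in command that copies INPUT to OUTPUT.
Dir.mktmpdir do |dir|
  input = File.join(dir, 'in.pdf')
  output = File.join(dir, 'out.pdf')
  File.binwrite(input, 'x' * 1024)

  copy = [RbConfig.ruby, '-e', 'File.binwrite(ARGV[1], File.binread(ARGV[0]))', input, output]
  p measure(copy, output)
end
```

Peak memory usage is not captured by this sketch; measuring it would need an external tool or operating system facilities.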
Results
These benchmark results are from 2023-08-03.
Application | File | Time | Memory | File size |
---|---|---|---|---|
hexapdf | a.pdf | 199ms | 34.944KiB | 52.299 |
hexapdf C | a.pdf | 200ms | 35.200KiB | 52.277 |
hexapdf CS | a.pdf | 207ms | 35.328KiB | 49.224 |
hexapdf CSP | a.pdf | 214ms | 36.096KiB | 48.297 |
origami | a.pdf | 250ms | 44.684KiB | 52.111 |
combinepdf | a.pdf | 124ms | 30.848KiB | 53.263 |
pdftk C? | a.pdf | 152ms | 56.304KiB | 53.144 |
qpdf C | a.pdf | 12ms | 8.064KiB | 53.179 |
qpdf CS | a.pdf | 12ms | 8.320KiB | 49.287 |
smpdf CSP | a.pdf | 14ms | 7.936KiB | 48.329 |
hexapdf | b.pdf | 497ms | 51.616KiB | 11.222.968 |
hexapdf C | b.pdf | 515ms | 54.392KiB | 11.350.258 |
hexapdf CS | b.pdf | 564ms | 58.112KiB | 11.045.214 |
hexapdf CSP | b.pdf | 2.366ms | 68.428KiB | 11.027.080 |
ERR origami | b.pdf | 0ms | 0KiB | 0 |
combinepdf | b.pdf | 959ms | 152.336KiB | 11.526.172 |
pdftk C? | b.pdf | 497ms | 105.976KiB | 11.564.056 |
qpdf C | b.pdf | 215ms | 22.912KiB | 11.273.690 |
qpdf CS | b.pdf | 229ms | 23.040KiB | 11.126.861 |
smpdf CSP | b.pdf | 1.785ms | 49.408KiB | 11.092.465 |
hexapdf | c.pdf | 780ms | 55.168KiB | 14.382.696 |
hexapdf C | c.pdf | 744ms | 57.216KiB | 14.345.269 |
hexapdf CS | c.pdf | 826ms | 60.032KiB | 13.180.713 |
hexapdf CSP | c.pdf | 2.549ms | 76.416KiB | 13.102.033 |
origami | c.pdf | 3.490ms | 136.808KiB | 14.338.126 |
combinepdf | c.pdf | 1.008ms | 154.400KiB | 14.329.457 |
pdftk C? | c.pdf | 1.544ms | 158.212KiB | 14.439.011 |
qpdf C | c.pdf | 707ms | 95.744KiB | 14.432.647 |
qpdf CS | c.pdf | 817ms | 95.488KiB | 13.228.102 |
smpdf CSP | c.pdf | 1.668ms | 74.240KiB | 13.076.598 |
hexapdf | d.pdf | 1.658ms | 80.648KiB | 7.662.939 |
hexapdf C | d.pdf | 1.596ms | 84.776KiB | 6.924.700 |
hexapdf CS | d.pdf | 1.822ms | 84.520KiB | 6.418.482 |
hexapdf CSP | d.pdf | 1.737ms | 99.468KiB | 5.391.919 |
origami | d.pdf | 4.053ms | 144.000KiB | 7.498.876 |
combinepdf | d.pdf | 2.156ms | 144.552KiB | 7.243.107 |
pdftk C? | d.pdf | 2.140ms | 211.976KiB | 7.279.035 |
qpdf C | d.pdf | 1.045ms | 69.248KiB | 7.209.305 |
qpdf CS | d.pdf | 1.206ms | 69.248KiB | 6.703.374 |
smpdf CSP | d.pdf | 1.734ms | 70.604KiB | 5.528.352 |
hexapdf | e.pdf | 512ms | 59.720KiB | 21.766.847 |
hexapdf C | e.pdf | 584ms | 113.592KiB | 21.832.869 |
hexapdf CS | e.pdf | 620ms | 112.812KiB | 21.751.196 |
hexapdf CSP | e.pdf | 11.397ms | 156.636KiB | 21.186.414 |
ERR origami | e.pdf | 0ms | 0KiB | 0 |
ERR combinepdf | e.pdf | 0ms | 0KiB | 0 |
pdftk C? | e.pdf | 687ms | 198.568KiB | 21.874.883 |
qpdf C | e.pdf | 268ms | 31.420KiB | 21.802.439 |
qpdf CS | e.pdf | 267ms | 31.660KiB | 21.787.558 |
smpdf CSP | e.pdf | 20.680ms | 608.428KiB | 21.188.516 |
hexapdf | f.pdf | 20.748ms | 435.072KiB | 153.972.520 |
hexapdf C | f.pdf | 23.003ms | 474.052KiB | 153.844.796 |
hexapdf CS | f.pdf | 26.267ms | 552.272KiB | 117.545.255 |
ERR hexapdf CSP | f.pdf | 0ms | 0KiB | 0 |
origami | f.pdf | 63.483ms | 1.567.124KiB | 152.614.156 |
ERR combinepdf | f.pdf | 0ms | 0KiB | 0 |
pdftk C? | f.pdf | 22.076ms | 792.892KiB | 157.850.353 |
qpdf C | f.pdf | 14.567ms | 959.820KiB | 157.723.936 |
qpdf CS | f.pdf | 19.236ms | 975.392KiB | 118.114.521 |
ERR smpdf CSP | f.pdf | 0ms | 0KiB | 0 |
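To compare the applications, it can help to express each result as a relative reduction of the original file size. The sketch below does this for d.pdf using the sizes from the tables above, assuming that they are given in bytes and that the dot is a thousands separator. The helper name `saving_percent` is made up for illustration.

```ruby
# Hypothetical helper: percentage by which the optimized file is smaller
# than the original, rounded to one decimal place.
def saving_percent(original, optimized)
  ((original - optimized) * 100.0 / original).round(1)
end

# Size of d.pdf and three of its results, copied from the tables above.
original = 8_107_348
results = {
  'hexapdf CSP' => 5_391_919,
  'qpdf CS'     => 6_703_374,
  'smpdf CSP'   => 5_528_352
}

results.each do |name, size|
  puts format('%-12s %5.1f%% smaller', name, saving_percent(original, size))
end
```

The same calculation can be applied to any other row, as long as the result is compared against the size of the matching input file.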