Text Extraction

This example shows how to extract layouted text from a page.

It uses the provided input PDF or creates a small sample PDF as input. Then it extracts the text for each page and creates new pages with the extracted text in a fixed-width font.

Usage:: ruby text_extraction.rb [INPUT.PDF]

Code

require 'hexapdf'

# Use the input PDF or create a sample PDF.
if ARGV.length > 0
  doc = HexaPDF::Document.open(ARGV[0])
else
  composer = HexaPDF::Composer.new do |pdf|
    pdf.lorem_ipsum(count: 3, padding: [0, 0, 20])
    pdf.lorem_ipsum(padding: [0, 50, 20], text_indent: 40)
    pdf.lorem_ipsum(count: 2)
  end
  doc = composer.document
end

# Extract the existing pages and add new ones with the extracted text
doc.pages.count.times do |index|
  text = doc.pages[index].extract_text
  doc.pages.add.canvas.font('/usr/share/fonts/truetype/freefont/FreeMono.ttf', size: 6).
    text(text, at: [10, 820])
end

doc.write('text_extraction.pdf', optimize: true)