module HexaPDF::Content::SmartTextExtractor

This module converts the glyphs on a page to a single text string while preserving the layout.

The general algorithm is:

  1. Collect all individual glyphs with their user space coordinates in TextRunCollector::TextRun objects.

  2. Sort text runs top to bottom and then left to right.

  3. Group those text runs into lines based on a “baseline” while also combining neighboring text runs into larger runs.

  4. Render each line into a string, using the page size and the median glyph width to map each text run to a column.

  5. Add blank lines between text lines based on the page’s normal line spacing.
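Steps 2 to 4 can be illustrated with a minimal sketch. This is not the module's implementation: the Run struct, group_into_lines and render_line are hypothetical names introduced here, the tolerance and glyph width are passed in directly instead of being derived from medians, and the column formula (x coordinate divided by glyph width, rounded) is an assumption.

```ruby
# Hypothetical stand-in for a text run; this is NOT HexaPDF's TextRun class.
Run = Struct.new(:text, :x, :baseline)

# Steps 2 and 3: sort top to bottom (in PDF user space a larger y value is
# higher on the page), then left to right, and start a new line whenever the
# baseline differs from that of the line's first run by more than +tolerance+.
def group_into_lines(runs, tolerance)
  lines = []
  runs.sort_by {|run| [-run.baseline, run.x] }.each do |run|
    line = lines.last
    if line && (line.first.baseline - run.baseline).abs <= tolerance
      line << run
    else
      lines << [run]
    end
  end
  # Runs on one line may have slightly different baselines, so restore the
  # left-to-right order inside each line.
  lines.map {|line| line.sort_by(&:x) }
end

# Step 4: map each run's x coordinate to a character column, assuming that
# one column is +glyph_width+ units wide (an assumption of this sketch).
def render_line(line, glyph_width)
  line.reduce("") do |text, run|
    column = run.x.fdiv(glyph_width).round
    text.ljust(column) + run.text
  end
end

runs = [
  Run.new("total", 60, 700), Run.new("Item", 0, 700),
  Run.new("Apples", 0, 688), Run.new("3", 60, 688.5)
]
rendered = group_into_lines(runs, 4).map {|line| render_line(line, 6) }
puts rendered
```

With these inputs the two runs at baseline 700 form the first line and the runs at 688 and 688.5 form the second, with "total" and "3" both starting in column 10.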

Constants

Line

Holds an array of TextRun objects and their median baseline.

Public Class Methods

layout_text_runs(text_runs, page_width, page_height, line_tolerance_factor: 0.4, paragraph_distance_threshold: 1.35, large_distance_threshold: 3.0)

Converts an array of TextRun objects into a single string representation, preserving the visual layout.

The page_width and page_height arguments specify the width and height of the page from which the text runs were extracted.

The remaining keyword arguments can be used to fine-tune the algorithm for one’s needs:

line_tolerance_factor

The tolerance factor is applied to the median text run height to determine the range within which two text runs are considered to be on the same line. This ensures that small differences in the baseline due to, for example, subscript or superscript parts don’t result in multiple lines.

The factor should be neither so large that separate visual lines are forced into one line nor so small that subscripts and superscripts end up on separate lines. The default seems to work quite well.
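The effect of the factor can be shown with a small sketch. The helpers median and same_line? are hypothetical and not part of the module, and comparing the absolute baseline difference against the tolerance is an assumption about how the range is applied.

```ruby
# Median of a non-empty array of numbers.
def median(values)
  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Hypothetical helper: two baselines count as the same line if they differ
# by at most the median text run height times the tolerance factor.
def same_line?(baseline_a, baseline_b, heights, line_tolerance_factor: 0.4)
  tolerance = median(heights) * line_tolerance_factor
  (baseline_a - baseline_b).abs <= tolerance
end

heights = [10, 10, 6]  # two body text runs and one smaller superscript run

same_line?(700, 703, heights)                              # superscript raised by 3 units
same_line?(700, 688, heights)                              # next visual line, 12 units below
same_line?(700, 703, heights, line_tolerance_factor: 0.2)  # stricter factor
```

With the default factor the tolerance here is 4 units, so the raised superscript stays on the line while the run 12 units below does not. Halving the factor to 0.2 shrinks the tolerance to 2 units and pushes the superscript onto its own line.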

paragraph_distance_threshold

If the number of normal line spacings between two adjacent baselines is at least this large (but smaller than large_distance_threshold), the gap is interpreted as a paragraph break and a single blank line is inserted.

large_distance_threshold

Works like paragraph_distance_threshold: if the number of normal line spacings between two adjacent baselines is at least this large, the gap is too large to be a paragraph break and a proportional number of blank lines is inserted instead.

This represents large regions of non-text content, such as images.
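How the two thresholds interact can be sketched as follows. The method blank_lines_for is hypothetical, and the exact "proportional" formula (rounded number of line spacings minus one) is an assumption. Only the ordering of the two thresholds is taken from the descriptions above.

```ruby
# Hypothetical helper: number of blank lines to insert between two adjacent
# text lines whose baselines are +gap+ units apart, given the page's normal
# line spacing.
def blank_lines_for(gap, normal_spacing,
                    paragraph_distance_threshold: 1.35,
                    large_distance_threshold: 3.0)
  spacings = gap.fdiv(normal_spacing)
  if spacings >= large_distance_threshold
    # Assumed formula: one blank line per skipped line position.
    spacings.round - 1
  elsif spacings >= paragraph_distance_threshold
    1
  else
    0
  end
end

blank_lines_for(12, 12)  # regular line spacing
blank_lines_for(18, 12)  # 1.5 spacings, treated as a paragraph break
blank_lines_for(60, 12)  # 5 spacings, for example around an image
```

Under these assumptions the three calls yield 0, 1 and 4 blank lines respectively.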