class HexaPDF::Tokenizer


Tokenizes the content of an IO object following the PDF rules.

See: PDF1.7 s7.2



Characters defined as delimiters.

See: PDF1.7 s7.2.2


This object is returned when there are no more tokens to read.


Characters defined as whitespace.

See: PDF1.7 s7.2.2
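
The two character classes can be written out explicitly. The following is a minimal sketch based on the character lists in PDF1.7 s7.2.2; the constant and method names are hypothetical and are not the ones HexaPDF defines.

```ruby
# Character classes from PDF1.7 s7.2.2, written out for illustration.
# PDF_WHITESPACE, PDF_DELIMITERS and char_class are hypothetical names.
PDF_WHITESPACE = "\x00\t\n\f\r ".freeze # NUL, HT, LF, FF, CR, SP
PDF_DELIMITERS = "()<>[]{}/%".freeze

# Classifies a single character as :whitespace, :delimiter or :regular.
def char_class(char)
  if PDF_WHITESPACE.include?(char)
    :whitespace
  elsif PDF_DELIMITERS.include?(char)
    :delimiter
  else
    :regular
  end
end

char_class("/") # => :delimiter
char_class("a") # => :regular
```

Every character that is neither whitespace nor a delimiter is a regular character, which is why a token such as a number or keyword ends at the first whitespace or delimiter character.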



The IO object from which the tokens are read.

Public Class Methods


Creates a new tokenizer.

Public Instance Methods


Reads the byte (an integer) at the current position and advances the scan pointer.

next_object(allow_end_array_token: false, allow_keyword: false)

Returns the PDF object at the current position. This is different from next_token because references, arrays and dictionaries consist of multiple tokens.

If the allow_end_array_token argument is true, the ']' token is permitted to facilitate the use of this method during array parsing.

See: PDF1.7 s7.3
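
To illustrate why an object can span several tokens and why the ']' token needs to be permitted during array parsing, here is a minimal sketch that works on an already tokenized list of strings. The method name, the END_ARRAY marker and the token representation are assumptions made for the example, not HexaPDF's internals.

```ruby
# Hypothetical marker returned for the ']' token.
END_ARRAY = :end_array

# Builds one object from a list of string tokens, consuming them from the front.
def sketch_next_object(tokens, allow_end_array_token: false)
  token = tokens.shift
  case token
  when '['
    array = []
    # Inside an array, ']' is a legal result and terminates the loop.
    while (item = sketch_next_object(tokens, allow_end_array_token: true)) != END_ARRAY
      array << item
    end
    array
  when ']'
    raise ArgumentError, "unexpected ']'" unless allow_end_array_token
    END_ARRAY
  when nil
    raise ArgumentError, 'unexpected end of input'
  when /\A[+-]?\d+\z/
    token.to_i
  else
    token
  end
end

sketch_next_object(%w([ 1 [ 2 3 ] x ])) # => [1, [2, 3], "x"]
```

A ']' outside of array parsing raises an error in this sketch, which is the situation the default of `allow_end_array_token: false` guards against.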


Returns a single token read from the current position and advances the scan pointer.

Comments and runs of whitespace characters are ignored. The value NO_MORE_TOKENS is returned if there are no more tokens available.
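
The skipping behaviour and the end-of-input sentinel can be sketched with Ruby's StringScanner. This is a simplified illustration, not HexaPDF's implementation: all names are hypothetical, and delimiters are returned as single characters instead of being assembled into names, strings or dictionaries.

```ruby
require 'strscan'

# Hypothetical sentinel standing in for NO_MORE_TOKENS.
SKETCH_NO_MORE_TOKENS = Object.new.freeze

# Whitespace runs and %-comments (up to the end of the line) separate tokens.
SKETCH_SKIP = /(?:[\x00\t\n\f\r ]+|%[^\r\n]*)+/
# A run of regular characters: neither whitespace nor delimiter.
SKETCH_REGULAR = /[^\x00\t\n\f\r ()<>\[\]{}\/%]+/

# Returns the next raw token or the sentinel when the input is exhausted.
def sketch_next_token(scanner)
  scanner.skip(SKETCH_SKIP)
  return SKETCH_NO_MORE_TOKENS if scanner.eos?
  scanner.scan(SKETCH_REGULAR) || scanner.getch
end

# Collects all raw tokens of a string.
def sketch_tokens(text)
  scanner = StringScanner.new(text)
  tokens = []
  while (token = sketch_next_token(scanner)) != SKETCH_NO_MORE_TOKENS
    tokens << token
  end
  tokens
end

sketch_tokens("% header\n  /Name 42 [true]")
# => ["/", "Name", "42", "[", "true", "]"]
```

Using a dedicated sentinel object rather than nil keeps "no more tokens" distinguishable from a token that parses to the null object.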

next_xref_entry() { |matched_size| ... }

Reads the cross-reference subsection entry at the current position and advances the scan pointer.

If a possible problem is detected, the method yields to the caller.

See: PDF1.7 s7.5.4
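
For orientation, a cross-reference entry is specified as exactly 20 bytes: a 10-digit offset, a 5-digit generation number, the keyword n or f, and a 2-byte end-of-line sequence. The sketch below parses one entry leniently and yields the matched size when it is not 20 bytes. Treating a wrong size as the "possible problem" is an assumption for this example, as are the method name and the return format.

```ruby
# Hypothetical pattern: strict 2-byte line endings first, then lenient 1-byte ones.
SKETCH_XREF_ENTRY = /\A(\d{10}) (\d{5}) ([nf])(?: \r| \n|\r\n|\r|\n)/

# Returns [offset, generation, type] for the entry at the start of +bytes+.
def sketch_parse_xref_entry(bytes)
  match = SKETCH_XREF_ENTRY.match(bytes)
  raise ArgumentError, 'invalid cross-reference entry' unless match
  size = match[0].bytesize
  # Assumption: an entry that is not exactly 20 bytes long is worth reporting.
  yield(size) if block_given? && size != 20
  [match[1].to_i, match[2].to_i, match[3]]
end

sketch_parse_xref_entry("0000000017 00000 n \n") # => [17, 0, "n"]
sketch_parse_xref_entry("0000000000 65535 f\n") { |size| warn "entry has #{size} bytes" }
# => [0, 65535, "f"]
```

Yielding instead of raising lets the caller decide whether a slightly malformed entry should be tolerated or rejected.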


Returns the next token but does not advance the scan pointer.
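
One common way to implement such a peek is to remember the scan position, read a token and restore the position afterwards. The sketch below shows the idea with StringScanner and whitespace-separated tokens; it is an illustration under those assumptions and not necessarily how HexaPDF does it.

```ruby
require 'strscan'

# Hypothetical helper: reads one whitespace-delimited token without consuming it.
def sketch_peek(scanner)
  saved = scanner.pos
  scanner.skip(/\s+/)
  scanner.scan(/\S+/)
ensure
  # Restore the scan pointer so the next read sees the same token again.
  scanner.pos = saved
end

scanner = StringScanner.new("  obj endobj")
sketch_peek(scanner) # => "obj"
scanner.pos          # => 0
```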


Returns the current position of the tokenizer inside the IO object.

Note that this position might be different from io.pos since the latter could have been changed somewhere else.


Sets the position at which the next token should be read.

Note that this does *not* set io.pos directly (at the moment of invocation)!
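
The difference between the tokenizer position and io.pos typically comes from buffering. The following sketch of a hypothetical buffered reader (not HexaPDF's class; the chunk size and all names are made up) reads the IO in chunks, tracks its own position, and only seeks the IO when the next read actually happens.

```ruby
require 'stringio'

# Hypothetical buffered reader illustrating a position separate from io.pos.
class SketchReader
  CHUNK = 8 # deliberately tiny so the effect is visible

  def initialize(io)
    @io = io
    @buffer = ''.b
    @buffer_start = 0 # offset in the IO of the first buffered byte
    @index = 0        # offset inside the buffer
  end

  # Logical position, independent of @io.pos.
  def pos
    @buffer_start + @index
  end

  # Only records the new position; the IO is untouched until the next read.
  def pos=(new_pos)
    if new_pos >= @buffer_start && new_pos <= @buffer_start + @buffer.bytesize
      @index = new_pos - @buffer_start
    else
      @buffer = ''.b
      @buffer_start = new_pos
      @index = 0
    end
  end

  # Returns the byte at the current position (or nil at the end) and advances.
  def next_byte
    fill if @index >= @buffer.bytesize
    byte = @buffer.getbyte(@index)
    @index += 1 if byte
    byte
  end

  private

  # Seeks the IO to the logical position and reads the next chunk.
  def fill
    @buffer_start = pos
    @io.seek(@buffer_start)
    @buffer = @io.read(CHUNK) || ''.b
    @index = 0
  end
end

io = StringIO.new("abcdefghijklmnop")
reader = SketchReader.new(io)
reader.next_byte # => 97 ("a")
[reader.pos, io.pos] # => [1, 8]
reader.pos = 12
[reader.pos, io.pos] # => [12, 8]
```

After the assignment the reader reports position 12 while the IO still sits at 8; the seek happens lazily inside the next `next_byte` call.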


Utility method for scanning until the given regular expression matches.

If the end of the file is reached in the process, nil is returned. Otherwise the matched string is returned.
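
Ruby's own StringScanner#scan_until follows a similar string-or-nil contract, which makes it a convenient way to see the behaviour in isolation. Note that StringScanner returns the text up to and including the match; whether the tokenizer delegates to it or returns exactly the same substring is an assumption not confirmed here.

```ruby
require 'strscan'

scanner = StringScanner.new("1 0 obj\n<< >>\nendobj\n")

scanner.scan_until(/obj/)    # => "1 0 obj" (scan pointer moves past the match)
scanner.pos                  # => 7
scanner.scan_until(/stream/) # => nil (no match before the end of the input)
scanner.pos                  # => 7 (a failed scan leaves the position unchanged)
```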


Skips all whitespace at the current position.

See: PDF1.7 s7.2.2