class HexaPDF::Tokenizer

Tokenizes the content of an IO object following the PDF rules.

See: PDF2.0 s7.2

Constants

DELIMITER

Characters defined as delimiters.

See: PDF2.0 s7.2.2

NO_MORE_TOKENS

This object is returned when there are no more tokens to read.

WHITESPACE

Characters defined as whitespace.

See: PDF2.0 s7.2.2
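
As a rough illustration of the two character classes, the following sketch classifies single characters using the sets listed in PDF 2.0 s7.2.2. The names PDF_WHITESPACE, PDF_DELIMITER and classify are hypothetical and only mirror the specification; they are not HexaPDF's own definitions, which may be expressed differently.

```ruby
# Hypothetical constants mirroring PDF 2.0 s7.2.2; not HexaPDF's definitions.
PDF_WHITESPACE = "\x00\t\n\f\r ".freeze # NUL, HT, LF, FF, CR, SP
PDF_DELIMITER = "()<>[]{}/%".freeze

# Returns :whitespace, :delimiter or :regular for a one-character string.
def classify(char)
  if PDF_WHITESPACE.include?(char)
    :whitespace
  elsif PDF_DELIMITER.include?(char)
    :delimiter
  else
    :regular
  end
end
```

Every character that is neither whitespace nor a delimiter counts as a regular character, which is why the final branch needs no set of its own.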

Attributes

io[R]

The IO object from which the tokens are read.

Public Class Methods

new(io, on_correctable_error: nil)

Creates a new tokenizer for the given IO stream.

If on_correctable_error is set to an object responding to call(msg, pos), errors for correctable situations are only raised if the return value of calling the object is true.
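
The callback contract can be sketched without the library itself. In the example below, handle_correctable is a hypothetical stand-in for the tokenizer's internal check, and the messages and positions are made up; only the rule "raise if the callable returns true" comes from the description above.

```ruby
# Hypothetical stand-in for the check described above: the handler decides
# whether a correctable problem should still be raised as an error.
def handle_correctable(handler, msg, pos)
  raise "#{msg} (at position #{pos})" if handler.call(msg, pos)
  :continued
end

warnings = []
# Lenient handler: records the problem and returns false, so nothing is raised.
lenient = lambda do |msg, pos|
  warnings << "#{pos}: #{msg}"
  false
end
# Strict handler: returns true, so every correctable problem is raised.
strict = ->(_msg, _pos) { true }

handle_correctable(lenient, "Invalid whitespace", 42) # => :continued
```

Any object responding to call(msg, pos), such as the lambdas above, could then be supplied as HexaPDF::Tokenizer.new(io, on_correctable_error: lenient).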

Public Instance Methods

next_byte()

Reads the byte (an integer) at the current position and advances the scan pointer.

next_integer_or_keyword()

Returns a single integer or keyword token read from the current position and advances the scan pointer. If the current position doesn’t contain such a token, nil is returned without advancing the scan pointer. The value NO_MORE_TOKENS is returned if there are no more tokens available.

Initial runs of whitespace characters are ignored.

Note: This is a special method meant for use when reconstructing the cross-reference table!

next_object(allow_end_array_token: false, allow_keyword: false)

Returns the PDF object at the current position. This is different from next_token because references, arrays and dictionaries consist of multiple tokens.

If the allow_end_array_token argument is true, the ‘]’ token is permitted to facilitate the use of this method during array parsing.

See: PDF2.0 s7.3
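
To see why an end-of-array token is useful during array parsing, consider this minimal sketch. It works on an already split list of token strings and handles only integers, arrays and bare words. The names parse_object and END_ARRAY are hypothetical, and the code is an illustration of the idea rather than HexaPDF's implementation.

```ruby
# Hypothetical sentinel standing in for the ']' token.
END_ARRAY = Object.new.freeze

# Consumes tokens from the front of +tokens+ and returns one parsed object.
def parse_object(tokens, allow_end_array_token: false)
  token = tokens.shift
  raise "unexpected end of input" if token.nil?

  case token
  when "["
    result = []
    # Inside an array, ']' is a legal result and terminates the loop.
    until (obj = parse_object(tokens, allow_end_array_token: true)).equal?(END_ARRAY)
      result << obj
    end
    result
  when "]"
    raise "unexpected ']'" unless allow_end_array_token
    END_ARRAY
  when /\A[+-]?\d+\z/
    token.to_i
  else
    token
  end
end

parse_object(%w([ 1 [ 2 3 ] foo ])) # => [1, [2, 3], "foo"]
```

Outside of array parsing the flag stays false, so a stray ‘]’ is reported as an error instead of being returned to the caller.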

next_token()

Returns a single token read from the current position and advances the scan pointer.

Comments and runs of whitespace characters are ignored. The value NO_MORE_TOKENS is returned if there are no more tokens available.

next_xref_entry() { |recoverable| ... }

Reads the cross-reference subsection entry at the current position and advances the scan pointer.

If a problem is detected, the method yields to the caller; the block argument recoverable is truthy if the problem is recoverable.

See: PDF2.0 s7.5.4
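
The block protocol can be illustrated with a small stand-alone sketch. A cross-reference entry is 20 bytes long: a 10-digit offset, a space, a 5-digit generation number, a space, the type n or f, and a two-byte end-of-line sequence. The helper parse_xref_entry, its lenient fallback pattern and its return values are assumptions made for this example and do not reproduce HexaPDF's actual parsing or return format.

```ruby
# Hypothetical helper; the strict layout follows PDF 2.0 s7.5.4.
# Yields true for a malformed but readable entry, false otherwise.
def parse_xref_entry(entry)
  if (m = /\A(\d{10}) (\d{5}) ([nf])(?: \r| \n|\r\n)\z/.match(entry))
    [m[1].to_i, m[2].to_i, m[3]]
  elsif (m = /\A\s*(\d+)\s+(\d+)\s+([nf])\s*\z/.match(entry))
    yield(true) # recoverable: fields are present but the layout is off
    [m[1].to_i, m[2].to_i, m[3]]
  else
    yield(false) # not recoverable: no usable fields found
    nil
  end
end

parse_xref_entry("0000000017 00000 n \n") { |_| raise "not reached" }
# => [17, 0, "n"]
```

A caller would typically raise from inside the block when recoverable is false and merely log the problem when it is true.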

peek_token()

Returns the next token but does not advance the scan pointer.
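
The relationship between next_token and peek_token can be shown with a heavily simplified stand-in built on StringScanner from the standard library. The class MiniTokenizer is hypothetical: it returns raw token strings, does not convert them to Ruby objects, and is not how HexaPDF is implemented.

```ruby
require 'strscan'

# Hypothetical, heavily simplified stand-in; not HexaPDF's implementation.
class MiniTokenizer
  NO_MORE_TOKENS = Object.new.freeze

  def initialize(string)
    @ss = StringScanner.new(string)
  end

  # Returns the next raw token string and advances the scan pointer.
  def next_token
    # Skip runs of whitespace and comments (from '%' to the end of the line).
    @ss.skip(/(?:[\x00\t\n\f\r ]+|%[^\r\n]*)+/)
    return NO_MORE_TOKENS if @ss.eos?

    # A two-character delimiter, a single delimiter, or a run of regular characters.
    @ss.scan(/<<|>>|[()<>\[\]{}\/]|[^\x00\t\n\f\r ()<>\[\]{}\/%]+/)
  end

  # Returns the next token, then restores the scan pointer.
  def peek_token
    saved = @ss.pos
    next_token
  ensure
    @ss.pos = saved
  end
end

tokenizer = MiniTokenizer.new("1 0 obj % comment\n<< >>")
tokenizer.next_token # => "1"
tokenizer.peek_token # => "0"
tokenizer.next_token # => "0"
```

Because peeking restores the saved position, calling peek_token repeatedly keeps returning the same token until next_token consumes it.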

pos()

Returns the current position of the tokenizer inside the IO object.

Note that this position might be different from io.pos since the latter could have been changed somewhere else.

pos=(pos)

Sets the position at which the next token should be read.

Note that this does not set io.pos directly (at the moment of invocation)!
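
The following sketch shows one way a reader can keep its own position separate from io.pos and synchronize only when it actually reads. The class PositionedReader is hypothetical and unbuffered; it only illustrates the behavior described for pos and pos= and makes no claim about HexaPDF's internal buffering.

```ruby
require 'stringio'

# Hypothetical reader that tracks its own position independently of the IO.
class PositionedReader
  attr_reader :pos

  def initialize(io)
    @io = io
    @pos = 0
  end

  # Only remembers the new position; io.pos is left untouched here.
  def pos=(pos)
    @pos = pos
  end

  # Seeks the IO to the remembered position and reads one byte from there.
  def next_byte
    @io.pos = @pos
    byte = @io.getbyte
    @pos += 1 if byte
    byte
  end
end

io = StringIO.new("abc")
reader = PositionedReader.new(io)
reader.pos = 2
io.pos           # => 0
reader.next_byte # => 99
```

Code that moves io.pos elsewhere therefore does not disturb the reader, since the next read seeks back to the remembered position first.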

scan_until(re)

Utility method for scanning until the given regular expression matches.

If the end of the file is reached in the process, nil is returned. Otherwise the matched string is returned.

skip_whitespace()

Skips all whitespace at the current position.

See: PDF2.0 s7.2.2