Tokenizing and padding - keras-text Documentation

Table of Contents
  • pad_sequences
  • unicodify
  • Tokenizer: __init__, apply_encoding_options, build_vocab, create_token_indices, decode_texts, encode_texts, get_counts, get_stats, save, token_generator
  • WordTokenizer: __init__, apply_encoding_options, build_vocab, create_token_indices, decode_texts, encode_texts, get_counts, get_stats, save, token_generator
  • SentenceWordTokenizer: __init__, apply_encoding_options, build_vocab, create_token_indices, decode_texts, encode_texts, get_counts, get_stats, save, token_generator
  • CharTokenizer: __init__, apply_encoding_options, build_vocab, create_token_indices, decode_texts, encode_texts, get_counts, get_stats, save, token_generator
  • SentenceCharTokenizer: __init__, apply_encoding_options, build_vocab, create_token_indices, decode_texts, encode_texts, get_counts, get_stats, save, token_generator

Source: keras_text/processing.py#L0

pad_sequences

pad_sequences(sequences, max_sentences=None, max_tokens=None, padding="pre", truncating="post", value=0.0)

Pads each sequence to the same length (the length of the longest sequence, or the provided override).

Args:

  • sequences: list of list (samples, words) or list of list of list (samples, sentences, words)
  • max_sentences: The maximum number of sentences to keep per sample. If None, the largest sentence count is used.
  • max_tokens: The maximum number of tokens (words) to keep per sentence or sample. If None, the largest token count is used.
  • padding: 'pre' or 'post', pad either before or after each sequence.
  • truncating: 'pre' or 'post', remove values from sequences larger than max_sentences or max_tokens either in the beginning or in the end of the sentence or word sequence respectively.
  • value: The padding value.

Returns:

Numpy array of (samples, max_sentences, max_tokens) or (samples, max_tokens) depending on the sequence input.

Raises:

  • ValueError: in case of invalid values for truncating or padding.
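
For instance, here is a minimal sketch of padding a nested (samples, sentences, words) encoding to a fixed shape; it assumes pad_sequences is importable from keras_text.processing (the module in the source link above), and the encoded data is made up for illustration.

```python
from keras_text.processing import pad_sequences

# Two documents, already encoded as (samples, sentences, words).
encoded = [
    [[2, 5, 9], [4, 1]],   # doc 1: two sentences
    [[7, 3, 3, 8]],        # doc 2: one sentence
]

# Pad (or truncate) every document to 2 sentences of 4 tokens each.
padded = pad_sequences(encoded, max_sentences=2, max_tokens=4,
                       padding='pre', truncating='post', value=0.)
print(padded.shape)  # expected: (2, 2, 4)
```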

unicodify

unicodify(texts)

Encodes all text sequences as unicode. This is a Python 2 hassle.

Args:

  • texts: The sequence of texts.

Returns:

Unicode encoded sequences.

Tokenizer

Tokenizer.has_vocab

Tokenizer.num_texts

The number of texts used to build the vocabulary.

Tokenizer.num_tokens

Number of unique tokens for use in encoding/decoding. This can change with calls to apply_encoding_options.

Tokenizer.token_counts

Dictionary of token -> count values for the text corpus used to build_vocab.

Tokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.

Tokenizer.__init__

__init__(self, lang="en", lower=True)

Encodes text into (samples, aux_indices..., token) where each token is mapped to a unique index starting from 1. Note that 0 is reserved for unknown tokens.

Args:

  • lang: The spacy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)

Tokenizer.apply_encoding_options

apply_encoding_options(self, min_token_count=1, max_tokens=None)

Applies the given settings for subsequent calls to encode_texts and decode_texts. This allows you to play with different settings without having to re-run tokenization on the entire corpus.

Args:

  • min_token_count: The minimum token count (frequency) required for a token to be included during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value = 1)
  • max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common max_tokens tokens will be kept. Set to None to keep everything. (Default value: None)
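
As a rough sketch of the intended workflow (the corpus and thresholds are illustrative, and WordTokenizer is documented below), the vocabulary is built once and the encoding options are then tweaked without re-tokenizing:

```python
from keras_text.processing import WordTokenizer

tokenizer = WordTokenizer()
tokenizer.build_vocab(['The cat sat on the mat.', 'The dog sat.'])

# Keep only tokens seen at least twice, capped at the 10000 most frequent;
# everything else will encode to 0 (the unknown token).
tokenizer.apply_encoding_options(min_token_count=2, max_tokens=10000)
encoded = tokenizer.encode_texts(['The cat sat.'], verbose=0)
```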

Tokenizer.build_vocab

build_vocab(self, texts, verbose=1, **kwargs)

Builds the internal vocabulary and computes various statistics.

Args:

  • texts: The list of text items to encode.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

Tokenizer.create_token_indices

create_token_indices(self, tokens)

If apply_encoding_options is inadequate, one can retrieve tokens from self.token_counts, filter with a desired strategy, and regenerate token_index using this method. The token index is subsequently used when the encode_texts or decode_texts methods are called.

Tokenizer.decode_texts

decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)

Decodes the texts using internal vocabulary. The list structure is maintained.

Args:

  • encoded_texts: The list of texts to decode.
  • unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
  • inplace: True to make changes inplace. (Default value: True)

Returns:

The decoded texts.

Tokenizer.encode_texts

encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)

Encodes the given texts using the internal vocabulary with optionally applied encoding options. See apply_encoding_options to set various options.

Args:

  • texts: The list of text items to encode.
  • include_oov: True to map unknown (out of vocab) tokens to 0. False to exclude the token.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

Returns:

The encoded texts.
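
A small encode/decode round-trip sketch using WordTokenizer (documented below); the texts are illustrative and the exact tokens depend on the tokenizer settings:

```python
from keras_text.processing import WordTokenizer

tokenizer = WordTokenizer(lower=True)
texts = ['A quick brown fox.', 'A lazy dog.']
tokenizer.build_vocab(texts)

encoded = tokenizer.encode_texts(texts, include_oov=False, verbose=0)
# Map the integer ids back to tokens; out-of-vocab ids become "<UNK>".
decoded = tokenizer.decode_texts(encoded, unknown_token='<UNK>', inplace=False)
```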

Tokenizer.get_counts

get_counts(self, i)

Numpy array of count values for aux_indices. For example, if token_generator generates (text_idx, sentence_idx, word), then get_counts(0) returns the numpy array of sentence lengths across texts. Similarly, get_counts(1) will return the numpy array of token lengths across sentences.

This is useful for plotting histograms or eyeballing the distributions. For standard statistics, you can use the get_stats method.

Tokenizer.get_stats

get_stats(self, i)

Gets the standard statistics for aux_index i. For example, if token_generator generates (text_idx, sentence_idx, word), then get_stats(0) will return various statistics about sentence lengths across texts. Similarly, get_stats(1) will return statistics of token lengths across sentences.

This information can be used to pad or truncate inputs.
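
For example, with a sentence-level tokenizer the count arrays can guide the lengths passed to pad_sequences. The sketch below assumes a SentenceWordTokenizer (documented later) and a placeholder corpus_texts list; since the exact structure returned by get_stats is not spelled out here, the cutoff is computed from get_counts with a percentile instead.

```python
import numpy as np
from keras_text.processing import SentenceWordTokenizer, pad_sequences

tokenizer = SentenceWordTokenizer()
tokenizer.build_vocab(corpus_texts)            # corpus_texts: your list of documents
encoded = tokenizer.encode_texts(corpus_texts, verbose=0)

# Aux index 0: sentences per text; aux index 1: words per sentence.
max_sents = int(np.percentile(tokenizer.get_counts(0), 95))
max_words = int(np.percentile(tokenizer.get_counts(1), 95))

padded = pad_sequences(encoded, max_sentences=max_sents, max_tokens=max_words)
```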

Tokenizer.save

save(self, file_path)

Serializes this tokenizer to a file.

Args:

  • file_path: The file path to use.

Tokenizer.token_generator

token_generator(self, texts, **kwargs)

Generator for yielding tokens. You need to implement this method.

Args:

  • texts: The list of text items to tokenize.
  • **kwargs: The kwargs propagated from the build_vocab_and_encode or encode_texts call.

Returns:

(text_idx, aux_indices..., token) where aux_indices are optional. For example, if you want to vectorize texts as (text_idx, sentences, words), you should return (text_idx, sentence_idx, word_token). Similarly, you can include paragraph- or page-level information if needed.
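
As a rough sketch of what a subclass might look like, the toy generator below splits each text on whitespace and yields no aux indices, so texts vectorize as (samples, words). A real implementation would honor the lang/lower options handled by Tokenizer.__init__; they are ignored here for brevity.

```python
from keras_text.processing import Tokenizer

class WhitespaceTokenizer(Tokenizer):
    """Toy tokenizer that splits each text on whitespace."""

    def token_generator(self, texts, **kwargs):
        for text_idx, text in enumerate(texts):
            for token in text.split():
                # No aux indices: each yield is (text_idx, token).
                yield text_idx, token
```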

WordTokenizer

WordTokenizer.has_vocab

WordTokenizer.num_texts

The number of texts used to build the vocabulary.

WordTokenizer.num_tokens

Number of unique tokens for use in encoding/decoding. This can change with calls to apply_encoding_options.

WordTokenizer.token_counts

Dictionary of token -> count values for the text corpus used to build_vocab.

WordTokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.

WordTokenizer.__init__

__init__(self, lang="en", lower=True, lemmatize=False, remove_punct=True, remove_digits=True, \ remove_stop_words=False, exclude_oov=False, exclude_pos_tags=None, \ exclude_entities=['PERSON'])

Encodes text into (samples, words)

Args:

  • lang: The spacy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)
  • lemmatize: Lemmatizes words when set to True. This also makes the words lower case irrespective of the lower setting. (Default value: False)
  • remove_punct: Removes punctuation tokens if True. (Default value: True)
  • remove_digits: Removes digit tokens if True. (Default value: True)
  • remove_stop_words: Removes stop words if True. (Default value: False)
  • exclude_oov: Excludes words that are out of the spacy embedding's vocabulary. By default, the GloVe 1-million-word, 300-dimensional embeddings are used. You can override the spacy vocabulary with a custom embedding to change this. (Default value: False)
  • exclude_pos_tags: A list of part-of-speech tags to exclude. Can be any of spacy.parts_of_speech.IDS. (Default value: None)
  • exclude_entities: A list of entity types to be excluded. Supported entity types can be found here: https://spacy.io/docs/usage/entity-recognition#entity-types (Default value: ['PERSON'])
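
Putting the pieces together, a minimal end-to-end sketch (the texts and lengths are illustrative, and the relevant spacy model needs to be installed for WordTokenizer to work):

```python
from keras_text.processing import WordTokenizer, pad_sequences

texts = ['I absolutely loved it!', 'It was terrible...']

tokenizer = WordTokenizer(lang='en', lower=True, remove_punct=True)
tokenizer.build_vocab(texts)

encoded = tokenizer.encode_texts(texts, verbose=0)
X = pad_sequences(encoded, max_tokens=50)   # shape: (samples, 50)
```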

WordTokenizer.apply_encoding_options

apply_encoding_options(self, min_token_count=1, max_tokens=None)

Applies the given settings for subsequent calls to encode_texts and decode_texts. This allows you to play with different settings without having to re-run tokenization on the entire corpus.

Args:

  • min_token_count: The minimum token count (frequency) required for a token to be included during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value = 1)
  • max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common max_tokens tokens will be kept. Set to None to keep everything. (Default value: None)

WordTokenizer.build_vocab

build_vocab(self, texts, verbose=1, **kwargs)

Builds the internal vocabulary and computes various statistics.

Args:

  • texts: The list of text items to encode.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

WordTokenizer.create_token_indices

create_token_indices(self, tokens)

If apply_encoding_options is inadequate, one can retrieve tokens from self.token_counts, filter with a desired strategy, and regenerate token_index using this method. The token index is subsequently used when the encode_texts or decode_texts methods are called.

WordTokenizer.decode_texts

decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)

Decodes the texts using internal vocabulary. The list structure is maintained.

Args:

  • encoded_texts: The list of texts to decode.
  • unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
  • inplace: True to make changes inplace. (Default value: True)

Returns:

The decoded texts.

WordTokenizer.encode_texts

encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)

Encodes the given texts using the internal vocabulary with optionally applied encoding options. See apply_encoding_options to set various options.

Args:

  • texts: The list of text items to encode.
  • include_oov: True to map unknown (out of vocab) tokens to 0. False to exclude the token.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

Returns:

The encoded texts.

WordTokenizer.get_counts

get_counts(self, i)

Numpy array of count values for aux_indices. For example, if token_generator generates (text_idx, sentence_idx, word), then get_counts(0) returns the numpy array of sentence lengths across texts. Similarly, get_counts(1) will return the numpy array of token lengths across sentences.

This is useful for plotting histograms or eyeballing the distributions. For standard statistics, you can use the get_stats method.

WordTokenizer.get_stats

get_stats(self, i)

Gets the standard statistics for aux_index i. For example, if token_generator generates (text_idx, sentence_idx, word), then get_stats(0) will return various statistics about sentence lengths across texts. Similarly, get_stats(1) will return statistics of token lengths across sentences.

This information can be used to pad or truncate inputs.

WordTokenizer.save

save(self, file_path)

Serializes this tokenizer to a file.

Args:

  • file_path: The file path to use.

WordTokenizer.token_generator

token_generator(self, texts, **kwargs)

Yields tokens from texts as (text_idx, word)

Args:

  • texts: The list of texts.
  • **kwargs: Supported args include:
      • n_threads/num_threads: The number of threads to use. Uses num_cpus - 1 by default.
      • batch_size: The number of texts to accumulate into a common working set before processing. (Default value: 1000)
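
Since build_vocab and encode_texts forward their **kwargs to token_generator, these options can be passed at either call site. A sketch under that assumption, with large_corpus standing in for your list of texts:

```python
from keras_text.processing import WordTokenizer

tokenizer = WordTokenizer()

# Forwarded to token_generator: use 4 threads and hand spacy 500 texts at a time.
tokenizer.build_vocab(large_corpus, verbose=1, n_threads=4, batch_size=500)
encoded = tokenizer.encode_texts(large_corpus, verbose=1, n_threads=4, batch_size=500)
```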

SentenceWordTokenizer

SentenceWordTokenizer.has_vocab

SentenceWordTokenizer.num_texts

The number of texts used to build the vocabulary.

SentenceWordTokenizer.num_tokens

Number of unique tokens for use in encoding/decoding. This can change with calls to apply_encoding_options.

SentenceWordTokenizer.token_counts

Dictionary of token -> count values for the text corpus used to build_vocab.

SentenceWordTokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.

SentenceWordTokenizer.__init__

__init__(self, lang="en", lower=True, lemmatize=False, remove_punct=True, remove_digits=True, \ remove_stop_words=False, exclude_oov=False, exclude_pos_tags=None, \ exclude_entities=['PERSON'])

Encodes text into (samples, sentences, words)

Args:

  • lang: The spacy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)
  • lemmatize: Lemmatizes words when set to True. This also makes the words lower case irrespective of the lower setting. (Default value: False)
  • remove_punct: Removes punctuation tokens if True. (Default value: True)
  • remove_digits: Removes digit tokens if True. (Default value: True)
  • remove_stop_words: Removes stop words if True. (Default value: False)
  • exclude_oov: Excludes words that are out of the spacy embedding's vocabulary. By default, the GloVe 1-million-word, 300-dimensional embeddings are used. You can override the spacy vocabulary with a custom embedding to change this. (Default value: False)
  • exclude_pos_tags: A list of part-of-speech tags to exclude. Can be any of spacy.parts_of_speech.IDS. (Default value: None)
  • exclude_entities: A list of entity types to be excluded. Supported entity types can be found here: https://spacy.io/docs/usage/entity-recognition#entity-types (Default value: ['PERSON'])
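
A sketch of the exclusion options in use. The documentation above says exclude_pos_tags can be any of spacy.parts_of_speech.IDS; whether integer ids or their string names are expected is not spelled out here, so passing the looked-up ids is an assumption, as is the placeholder documents list.

```python
from spacy import parts_of_speech
from keras_text.processing import SentenceWordTokenizer

tokenizer = SentenceWordTokenizer(
    lemmatize=True,
    remove_stop_words=True,
    # Assumed: look up integer POS ids for punctuation and numerals.
    exclude_pos_tags=[parts_of_speech.IDS['PUNCT'], parts_of_speech.IDS['NUM']],
    exclude_entities=['PERSON', 'ORG'],
)

tokenizer.build_vocab(documents)             # documents: your list of raw texts
encoded = tokenizer.encode_texts(documents)  # nested as (samples, sentences, words)
```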

SentenceWordTokenizer.apply_encoding_options

apply_encoding_options(self, min_token_count=1, max_tokens=None)

Applies the given settings for subsequent calls to encode_texts and decode_texts. This allows you to play with different settings without having to re-run tokenization on the entire corpus.

Args:

  • min_token_count: The minimum token count (frequency) required for a token to be included during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value = 1)
  • max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common max_tokens tokens will be kept. Set to None to keep everything. (Default value: None)

SentenceWordTokenizer.build_vocab

build_vocab(self, texts, verbose=1, **kwargs)

Builds the internal vocabulary and computes various statistics.

Args:

  • texts: The list of text items to encode.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

SentenceWordTokenizer.create_token_indices

create_token_indices(self, tokens)

If apply_encoding_options is inadequate, one can retrieve tokens from self.token_counts, filter with a desired strategy, and regenerate token_index using this method. The token index is subsequently used when the encode_texts or decode_texts methods are called.

SentenceWordTokenizer.decode_texts

decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)

Decodes the texts using internal vocabulary. The list structure is maintained.

Args:

  • encoded_texts: The list of texts to decode.
  • unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
  • inplace: True to make changes inplace. (Default value: True)

Returns:

The decoded texts.

SentenceWordTokenizer.encode_texts

encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)

Encodes the given texts using the internal vocabulary with optionally applied encoding options. See apply_encoding_options to set various options.

Args:

  • texts: The list of text items to encode.
  • include_oov: True to map unknown (out of vocab) tokens to 0. False to exclude the token.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

Returns:

The encoded texts.

SentenceWordTokenizer.get_counts

get_counts(self, i)

Numpy array of count values for aux_indices. For example, if token_generator generates (text_idx, sentence_idx, word), then get_counts(0) returns the numpy array of sentence lengths across texts. Similarly, get_counts(1) will return the numpy array of token lengths across sentences.

This is useful for plotting histograms or eyeballing the distributions. For standard statistics, you can use the get_stats method.

SentenceWordTokenizer.get_stats

get_stats(self, i)

Gets the standard statistics for aux_index i. For example, if token_generator generates (text_idx, sentence_idx, word), then get_stats(0) will return various statistics about sentence lengths across texts. Similarly, get_stats(1) will return statistics of token lengths across sentences.

This information can be used to pad or truncate inputs.

SentenceWordTokenizer.save

save(self, file_path)

Serializes this tokenizer to a file.

Args:

  • file_path: The file path to use.

SentenceWordTokenizer.token_generator

token_generator(self, texts, **kwargs)

Yields tokens from texts as (text_idx, sent_idx, word)

Args:

  • texts: The list of texts.
  • **kwargs: Supported args include:
      • n_threads/num_threads: The number of threads to use. Uses num_cpus - 1 by default.
      • batch_size: The number of texts to accumulate into a common working set before processing. (Default value: 1000)

CharTokenizer

CharTokenizer.has_vocab

CharTokenizer.num_texts

The number of texts used to build the vocabulary.

CharTokenizer.num_tokens

Number of unique tokens for use in encoding/decoding. This can change with calls to apply_encoding_options.

CharTokenizer.token_counts

Dictionary of token -> count values for the text corpus used to build_vocab.

CharTokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.

CharTokenizer.__init__

__init__(self, lang="en", lower=True, charset=None)

Encodes text into (samples, characters)

Args:

  • lang: The spacy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)
  • charset: The character set to use. For example charset = 'abc123'. If None, all characters will be used. (Default value: None)
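
A brief sketch of restricting the character set (the assumption that characters outside charset map to the unknown index 0 is mine, not stated above):

```python
from keras_text.processing import CharTokenizer, pad_sequences

# Keep only lowercase letters, digits and space as distinct tokens.
tokenizer = CharTokenizer(lower=True, charset='abcdefghijklmnopqrstuvwxyz0123456789 ')
tokenizer.build_vocab(['Hello world 42', 'Another text'])

encoded = tokenizer.encode_texts(['Hello world 42'], verbose=0)
X = pad_sequences(encoded, max_tokens=128)   # shape: (samples, 128)
```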

CharTokenizer.apply_encoding_options

apply_encoding_options(self, min_token_count=1, max_tokens=None)

Applies the given settings for subsequent calls to encode_texts and decode_texts. This allows you to play with different settings without having to re-run tokenization on the entire corpus.

Args:

  • min_token_count: The minimum token count (frequency) required for a token to be included during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value = 1)
  • max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common max_tokens tokens will be kept. Set to None to keep everything. (Default value: None)

CharTokenizer.build_vocab

build_vocab(self, texts, verbose=1, **kwargs)

Builds the internal vocabulary and computes various statistics.

Args:

  • texts: The list of text items to encode.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

CharTokenizer.create_token_indices

create_token_indices(self, tokens)

If apply_encoding_options is inadequate, one can retrieve tokens from self.token_counts, filter with a desired strategy, and regenerate token_index using this method. The token index is subsequently used when the encode_texts or decode_texts methods are called.

CharTokenizer.decode_texts

decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)

Decodes the texts using internal vocabulary. The list structure is maintained.

Args:

  • encoded_texts: The list of texts to decode.
  • unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
  • inplace: True to make changes inplace. (Default value: True)

Returns:

The decoded texts.

CharTokenizer.encode_texts

encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)

Encodes the given texts using the internal vocabulary with optionally applied encoding options. See apply_encoding_options to set various options.

Args:

  • texts: The list of text items to encode.
  • include_oov: True to map unknown (out of vocab) tokens to 0. False to exclude the token.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

Returns:

The encoded texts.

CharTokenizer.get_counts

get_counts(self, i)

Numpy array of count values for aux_indices. For example, if token_generator generates (text_idx, sentence_idx, word), then get_counts(0) returns the numpy array of sentence lengths across texts. Similarly, get_counts(1) will return the numpy array of token lengths across sentences.

This is useful for plotting histograms or eyeballing the distributions. For standard statistics, you can use the get_stats method.

CharTokenizer.get_stats

get_stats(self, i)

Gets the standard statistics for aux_index i. For example, if token_generator generates (text_idx, sentence_idx, word), then get_stats(0) will return various statistics about sentence lengths across texts. Similarly, get_stats(1) will return statistics of token lengths across sentences.

This information can be used to pad or truncate inputs.

CharTokenizer.save

save(self, file_path)

Serializes this tokenizer to a file.

Args:

  • file_path: The file path to use.

CharTokenizer.token_generator

token_generator(self, texts, **kwargs)

Yields tokens from texts as (text_idx, character)

SentenceCharTokenizer

SentenceCharTokenizer.has_vocab

SentenceCharTokenizer.num_texts

The number of texts used to build the vocabulary.

SentenceCharTokenizer.num_tokens

Number of unique tokens for use in encoding/decoding. This can change with calls to apply_encoding_options.

SentenceCharTokenizer.token_counts

Dictionary of token -> count values for the text corpus used to build_vocab.

SentenceCharTokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.

SentenceCharTokenizer.__init__

__init__(self, lang="en", lower=True, charset=None)

Encodes text into (samples, sentences, characters)

Args:

  • lang: The spacy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)
  • charset: The character set to use. For example charset = 'abc123'. If None, all characters will be used. (Default value: None)

SentenceCharTokenizer.apply_encoding_options

apply_encoding_options(self, min_token_count=1, max_tokens=None)

Applies the given settings for subsequent calls to encode_texts and decode_texts. This allows you to play with different settings without having to re-run tokenization on the entire corpus.

Args:

  • min_token_count: The minimum token count (frequency) required for a token to be included during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value = 1)
  • max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common max_tokens tokens will be kept. Set to None to keep everything. (Default value: None)

SentenceCharTokenizer.build_vocab

build_vocab(self, texts, verbose=1, **kwargs)

Builds the internal vocabulary and computes various statistics.

Args:

  • texts: The list of text items to encode.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

SentenceCharTokenizer.create_token_indices

create_token_indices(self, tokens)

If apply_encoding_options is inadequate, one can retrieve tokens from self.token_counts, filter with a desired strategy, and regenerate token_index using this method. The token index is subsequently used when the encode_texts or decode_texts methods are called.

SentenceCharTokenizer.decode_texts

decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)

Decodes the texts using internal vocabulary. The list structure is maintained.

Args:

  • encoded_texts: The list of texts to decode.
  • unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
  • inplace: True to make changes inplace. (Default value: True)

Returns:

The decoded texts.

SentenceCharTokenizer.encode_texts

encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)

Encodes the given texts using the internal vocabulary with optionally applied encoding options. See apply_encoding_options to set various options.

Args:

  • texts: The list of text items to encode.
  • include_oov: True to map unknown (out of vocab) tokens to 0. False to exclude the token.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

Returns:

The encoded texts.

SentenceCharTokenizer.get_counts

get_counts(self, i)

Numpy array of count values for aux_indices. For example, if token_generator generates (text_idx, sentence_idx, word), then get_counts(0) returns the numpy array of sentence lengths across texts. Similarly, get_counts(1) will return the numpy array of token lengths across sentences.

This is useful for plotting histograms or eyeballing the distributions. For standard statistics, you can use the get_stats method.

SentenceCharTokenizer.get_stats

get_stats(self, i)

Gets the standard statistics for aux_index i. For example, if token_generator generates (text_idx, sentence_idx, word), then get_stats(0) will return various statistics about sentence lengths across texts. Similarly, get_stats(1) will return statistics of token lengths across sentences.

This information can be used to pad or truncate inputs.

SentenceCharTokenizer.save

save(self, file_path)

Serializes this tokenizer to a file.

Args:

  • file_path: The file path to use.

SentenceCharTokenizer.token_generator

token_generator(self, texts, **kwargs)

Yields tokens from texts as (text_idx, sent_idx, character)

Args:

  • texts: The list of texts.
  • **kwargs: Supported args include:
      • n_threads/num_threads: The number of threads to use. Uses num_cpus - 1 by default.
      • batch_size: The number of texts to accumulate into a common working set before processing. (Default value: 1000)