Tokenizing and padding - keras-text Documentation

Table of Contents
  • pad_sequences
  • unicodify
  • Tokenizer: __init__, apply_encoding_options, build_vocab, create_token_indices, decode_texts, encode_texts, get_counts, get_stats, save, token_generator
  • WordTokenizer: __init__, apply_encoding_options, build_vocab, create_token_indices, decode_texts, encode_texts, get_counts, get_stats, save, token_generator
  • SentenceWordTokenizer: __init__, apply_encoding_options, build_vocab, create_token_indices, decode_texts, encode_texts, get_counts, get_stats, save, token_generator
  • CharTokenizer: __init__, apply_encoding_options, build_vocab, create_token_indices, decode_texts, encode_texts, get_counts, get_stats, save, token_generator
  • SentenceCharTokenizer: __init__, apply_encoding_options, build_vocab, create_token_indices, decode_texts, encode_texts, get_counts, get_stats, save, token_generator

Source: keras_text/processing.py#L0

pad_sequences

pad_sequences(sequences, max_sentences=None, max_tokens=None, padding="pre", truncating="post", value=0.0)

Pads each sequence to the same length (the length of the longest sequence, or the provided override).

Args:

  • sequences: list of list (samples, words) or list of list of list (samples, sentences, words)
  • max_sentences: The maximum number of sentences to keep per sample. If None, the largest sentence count is used.
  • max_tokens: The maximum number of tokens (words) to keep per sentence or sample. If None, the largest token count is used.
  • padding: 'pre' or 'post', pad either before or after each sequence.
  • truncating: 'pre' or 'post', remove values from sequences larger than max_sentences or max_tokens either in the beginning or in the end of the sentence or word sequence respectively.
  • value: The padding value.

Returns:

Numpy array of (samples, max_sentences, max_tokens) or (samples, max_tokens) depending on the sequence input.

Raises:

  • ValueError: in case of invalid values for truncating or padding.
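
For instance, here is a minimal sketch of padding a nested (samples, sentences, words) encoding to a fixed shape; it assumes pad_sequences is importable from keras_text.processing (the module in the source link above), and the encoded data is made up for illustration.

```python
from keras_text.processing import pad_sequences

# Two documents, already encoded as (samples, sentences, words).
encoded = [
    [[2, 5, 9], [4, 1]],   # doc 1: two sentences
    [[7, 3, 3, 8]],        # doc 2: one sentence
]

# Pad (or truncate) every document to 2 sentences of 4 tokens each.
padded = pad_sequences(encoded, max_sentences=2, max_tokens=4,
                       padding='pre', truncating='post', value=0.)
print(padded.shape)  # expected: (2, 2, 4)
```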

unicodify

unicodify(texts)

Encodes all text sequences as unicode. This is a Python 2 hassle.

Args:

  • texts: The sequence of texts.

Returns:

Unicode encoded sequences.

Tokenizer

Tokenizer.has_vocab

Tokenizer.num_texts

The number of texts used to build the vocabulary.

Tokenizer.num_tokens

Number of unique tokens for use in encoding/decoding. This can change with calls to apply_encoding_options.

Tokenizer.token_counts

Dictionary of token -> count values for the text corpus used to build_vocab.

Tokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.

Tokenizer.__init__

__init__(self, lang="en", lower=True)

Encodes text into (samples, aux_indices..., token) where each token is mapped to a unique index starting from 1. Note that 0 is reserved for unknown tokens.

Args:

  • lang: The spacy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)

Tokenizer.apply_encoding_options

apply_encoding_options(self, min_token_count=1, max_tokens=None)

Applies the given settings for subsequent calls to encode_texts and decode_texts. This allows you to play with different settings without having to re-run tokenization on the entire corpus.

Args:

  • min_token_count: The minimum token count (frequency) required for a token to be included during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value = 1)
  • max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common max_tokens tokens will be kept. Set to None to keep everything. (Default value: None)
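
As a rough sketch of the intended workflow (the corpus and thresholds are illustrative, and WordTokenizer is documented below), the vocabulary is built once and the encoding options are then tweaked without re-tokenizing:

```python
from keras_text.processing import WordTokenizer

tokenizer = WordTokenizer()
tokenizer.build_vocab(['The cat sat on the mat.', 'The dog sat.'])

# Keep only tokens seen at least twice, capped at the 10000 most frequent;
# everything else will encode to 0 (the unknown token).
tokenizer.apply_encoding_options(min_token_count=2, max_tokens=10000)
encoded = tokenizer.encode_texts(['The cat sat.'], verbose=0)
```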

Tokenizer.build_vocab

build_vocab(self, texts, verbose=1, **kwargs)

Builds the internal vocabulary and computes various statistics.

Args:

  • texts: The list of text items to encode.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

Tokenizer.create_token_indices

create_token_indices(self, tokens)

If apply_encoding_options is inadequate, one can retrieve tokens from self.token_counts, filter with a desired strategy, and regenerate token_index using this method. The token index is subsequently used when the encode_texts or decode_texts methods are called.

Tokenizer.decode_texts

decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)

Decodes the texts using internal vocabulary. The list structure is maintained.

Args:

  • encoded_texts: The list of texts to decode.
  • unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
  • inplace: True to make changes inplace. (Default value: True)

Returns:

The decoded texts.

Tokenizer.encode_texts

encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)

Encodes the given texts using the internal vocabulary with optionally applied encoding options. See apply_encoding_options to set various options.

Args:

  • texts: The list of text items to encode.
  • include_oov: True to map unknown (out of vocab) tokens to 0. False to exclude the token.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

Returns:

The encoded texts.
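
A small encode/decode round-trip sketch using WordTokenizer (documented below); the texts are illustrative and the exact tokens depend on the tokenizer settings:

```python
from keras_text.processing import WordTokenizer

tokenizer = WordTokenizer(lower=True)
texts = ['A quick brown fox.', 'A lazy dog.']
tokenizer.build_vocab(texts)

encoded = tokenizer.encode_texts(texts, include_oov=False, verbose=0)
# Map the integer ids back to tokens; out-of-vocab ids become "<UNK>".
decoded = tokenizer.decode_texts(encoded, unknown_token='<UNK>', inplace=False)
```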

Tokenizer.get_counts

get_counts(self, i)

Numpy array of count values for aux_indices. For example, if token_generator generates (text_idx, sentence_idx, word), then get_counts(0) returns the numpy array of sentence lengths across texts. Similarly, get_counts(1) will return the numpy array of token lengths across sentences.

This is useful for plotting histograms or eyeballing the distributions. For standard statistics, you can use the get_stats method.

Tokenizer.get_stats

get_stats(self, i)

Gets the standard statistics for aux_index i. For example, if token_generator generates (text_idx, sentence_idx, word), then get_stats(0) will return various statistics about sentence lengths across texts. Similarly, get_stats(1) will return statistics of token lengths across sentences.

This information can be used to pad or truncate inputs.
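
For example, with a sentence-level tokenizer the count arrays can guide the lengths passed to pad_sequences. The sketch below assumes a SentenceWordTokenizer (documented later) and a placeholder corpus_texts list; since the exact structure returned by get_stats is not spelled out here, the cutoff is computed from get_counts with a percentile instead.

```python
import numpy as np
from keras_text.processing import SentenceWordTokenizer, pad_sequences

tokenizer = SentenceWordTokenizer()
tokenizer.build_vocab(corpus_texts)            # corpus_texts: your list of documents
encoded = tokenizer.encode_texts(corpus_texts, verbose=0)

# Aux index 0: sentences per text; aux index 1: words per sentence.
max_sents = int(np.percentile(tokenizer.get_counts(0), 95))
max_words = int(np.percentile(tokenizer.get_counts(1), 95))

padded = pad_sequences(encoded, max_sentences=max_sents, max_tokens=max_words)
```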

Tokenizer.save

save(self, file_path)

Serializes this tokenizer to a file.

Args:

  • file_path: The file path to use.

Tokenizer.token_generator

token_generator(self, texts, **kwargs)

Generator for yielding tokens. You need to implement this method.

Args:

  • texts: The list of text items to tokenize.
  • **kwargs: The kwargs propagated from the build_vocab_and_encode or encode_texts call.

Returns:

(text_idx, aux_indices..., token) where aux_indices are optional. For example, if you want to vectorize texts as (text_idx, sentences, words), you should return (text_idx, sentence_idx, word_token). Similarly, you can include paragraph- or page-level information if needed.
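
As a rough sketch of what a subclass might look like, the toy generator below splits each text on whitespace and yields no aux indices, so texts vectorize as (samples, words). A real implementation would honor the lang/lower options handled by Tokenizer.__init__; they are ignored here for brevity.

```python
from keras_text.processing import Tokenizer

class WhitespaceTokenizer(Tokenizer):
    """Toy tokenizer that splits each text on whitespace."""

    def token_generator(self, texts, **kwargs):
        for text_idx, text in enumerate(texts):
            for token in text.split():
                # No aux indices: each yield is (text_idx, token).
                yield text_idx, token
```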

WordTokenizer

WordTokenizer.has_vocab

WordTokenizer.num_texts

The number of texts used to build the vocabulary.

WordTokenizer.num_tokens

Number of unique tokens for use in encoding/decoding. This can change with calls to apply_encoding_options.

WordTokenizer.token_counts

Dictionary of token -> count values for the text corpus used to build_vocab.

WordTokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.

WordTokenizer.__init__

__init__(self, lang="en", lower=True, lemmatize=False, remove_punct=True, remove_digits=True, \ remove_stop_words=False, exclude_oov=False, exclude_pos_tags=None, \ exclude_entities=['PERSON'])

Encodes text into (samples, words)

Args:

  • lang: The spacy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)
  • lemmatize: Lemmatizes words when set to True. This also makes the words lower case irrespective of the lower setting. (Default value: False)
  • remove_punct: Removes punctuation tokens if True. (Default value: True)
  • remove_digits: Removes digit tokens if True. (Default value: True)
  • remove_stop_words: Removes stop words if True. (Default value: False)
  • exclude_oov: Excludes words that are out of the spacy embedding's vocabulary. By default, the GloVe 1-million-word, 300-dimensional embeddings are used. You can override the spacy vocabulary with a custom embedding to change this. (Default value: False)
  • exclude_pos_tags: A list of part-of-speech tags to exclude. Can be any of spacy.parts_of_speech.IDS. (Default value: None)
  • exclude_entities: A list of entity types to be excluded. Supported entity types can be found here: https://spacy.io/docs/usage/entity-recognition#entity-types (Default value: ['PERSON'])
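
Putting the pieces together, a minimal end-to-end sketch (the texts and lengths are illustrative, and the relevant spacy model needs to be installed for WordTokenizer to work):

```python
from keras_text.processing import WordTokenizer, pad_sequences

texts = ['I absolutely loved it!', 'It was terrible...']

tokenizer = WordTokenizer(lang='en', lower=True, remove_punct=True)
tokenizer.build_vocab(texts)

encoded = tokenizer.encode_texts(texts, verbose=0)
X = pad_sequences(encoded, max_tokens=50)   # shape: (samples, 50)
```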

WordTokenizer.apply_encoding_options

apply_encoding_options(self, min_token_count=1, max_tokens=None)

Applies the given settings for subsequent calls to encode_texts and decode_texts. This allows you to play with different settings without having to re-run tokenization on the entire corpus.

Args:

  • min_token_count: The minimum token count (frequency) required for a token to be included during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value = 1)
  • max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common max_tokens tokens will be kept. Set to None to keep everything. (Default value: None)

WordTokenizer.build_vocab

build_vocab(self, texts, verbose=1, **kwargs)

Builds the internal vocabulary and computes various statistics.

Args:

  • texts: The list of text items to encode.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

WordTokenizer.create_token_indices

create_token_indices(self, tokens)

If apply_encoding_options is inadequate, one can retrieve tokens from self.token_counts, filter with a desired strategy, and regenerate token_index using this method. The token index is subsequently used when the encode_texts or decode_texts methods are called.

WordTokenizer.decode_texts

decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)

Decodes the texts using internal vocabulary. The list structure is maintained.

Args:

  • encoded_texts: The list of texts to decode.
  • unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
  • inplace: True to make changes inplace. (Default value: True)

Returns:

The decoded texts.

WordTokenizer.encode_texts

encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)

Encodes the given texts using the internal vocabulary with optionally applied encoding options. See apply_encoding_options to set various options.

Args:

  • texts: The list of text items to encode.
  • include_oov: True to map unknown (out of vocab) tokens to 0. False to exclude the token.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

Returns:

The encoded texts.

WordTokenizer.get_counts

get_counts(self, i)

Numpy array of count values for aux_indices. For example, if token_generator generates (text_idx, sentence_idx, word), then get_counts(0) returns the numpy array of sentence lengths across texts. Similarly, get_counts(1) will return the numpy array of token lengths across sentences.

This is useful for plotting histograms or eyeballing the distributions. For standard statistics, you can use the get_stats method.

WordTokenizer.get_stats

get_stats(self, i)

Gets the standard statistics for aux_index i. For example, if token_generator generates (text_idx, sentence_idx, word), then get_stats(0) will return various statistics about sentence lengths across texts. Similarly, get_stats(1) will return statistics of token lengths across sentences.

This information can be used to pad or truncate inputs.

WordTokenizer.save

save(self, file_path)

Serializes this tokenizer to a file.

Args:

  • file_path: The file path to use.

WordTokenizer.token_generator

token_generator(self, texts, **kwargs)

Yields tokens from texts as (text_idx, word)

Args:

  • texts: The list of texts.
  • **kwargs: Supported args include:
      • n_threads/num_threads: The number of threads to use. Uses num_cpus - 1 by default.
      • batch_size: The number of texts to accumulate into a common working set before processing. (Default value: 1000)
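
Since build_vocab and encode_texts forward their **kwargs to token_generator, these options can be passed at either call site. A sketch under that assumption, with large_corpus standing in for your list of texts:

```python
from keras_text.processing import WordTokenizer

tokenizer = WordTokenizer()

# Forwarded to token_generator: use 4 threads and hand spacy 500 texts at a time.
tokenizer.build_vocab(large_corpus, verbose=1, n_threads=4, batch_size=500)
encoded = tokenizer.encode_texts(large_corpus, verbose=1, n_threads=4, batch_size=500)
```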

SentenceWordTokenizer

SentenceWordTokenizer.has_vocab

SentenceWordTokenizer.num_texts

The number of texts used to build the vocabulary.

SentenceWordTokenizer.num_tokens

Number of unique tokens for use in encoding/decoding. This can change with calls to apply_encoding_options.

SentenceWordTokenizer.token_counts

Dictionary of token -> count values for the text corpus used to build_vocab.

SentenceWordTokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.

SentenceWordTokenizer.__init__

__init__(self, lang="en", lower=True, lemmatize=False, remove_punct=True, remove_digits=True, \ remove_stop_words=False, exclude_oov=False, exclude_pos_tags=None, \ exclude_entities=['PERSON'])

Encodes text into (samples, sentences, words)

Args:

  • lang: The spacy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)
  • lemmatize: Lemmatizes words when set to True. This also makes the words lower case irrespective of the lower setting. (Default value: False)
  • remove_punct: Removes punctuation tokens if True. (Default value: True)
  • remove_digits: Removes digit tokens if True. (Default value: True)
  • remove_stop_words: Removes stop words if True. (Default value: False)
  • exclude_oov: Excludes words that are out of the spacy embedding's vocabulary. By default, the GloVe 1-million-word, 300-dimensional embeddings are used. You can override the spacy vocabulary with a custom embedding to change this. (Default value: False)
  • exclude_pos_tags: A list of part-of-speech tags to exclude. Can be any of spacy.parts_of_speech.IDS. (Default value: None)
  • exclude_entities: A list of entity types to be excluded. Supported entity types can be found here: https://spacy.io/docs/usage/entity-recognition#entity-types (Default value: ['PERSON'])
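
A sketch of the exclusion options in use. The documentation above says exclude_pos_tags can be any of spacy.parts_of_speech.IDS; whether integer ids or their string names are expected is not spelled out here, so passing the looked-up ids is an assumption, as is the placeholder documents list.

```python
from spacy import parts_of_speech
from keras_text.processing import SentenceWordTokenizer

tokenizer = SentenceWordTokenizer(
    lemmatize=True,
    remove_stop_words=True,
    # Assumed: look up integer POS ids for punctuation and numerals.
    exclude_pos_tags=[parts_of_speech.IDS['PUNCT'], parts_of_speech.IDS['NUM']],
    exclude_entities=['PERSON', 'ORG'],
)

tokenizer.build_vocab(documents)             # documents: your list of raw texts
encoded = tokenizer.encode_texts(documents)  # nested as (samples, sentences, words)
```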

SentenceWordTokenizer.apply_encoding_options

apply_encoding_options(self, min_token_count=1, max_tokens=None)

Applies the given settings for subsequent calls to encode_texts and decode_texts. This allows you to play with different settings without having to re-run tokenization on the entire corpus.

Args:

  • min_token_count: The minimum token count (frequency) required for a token to be included during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value = 1)
  • max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common max_tokens tokens will be kept. Set to None to keep everything. (Default value: None)

SentenceWordTokenizer.build_vocab

build_vocab(self, texts, verbose=1, **kwargs)

Builds the internal vocabulary and computes various statistics.

Args:

  • texts: The list of text items to encode.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

SentenceWordTokenizer.create_token_indices

create_token_indices(self, tokens)

If apply_encoding_options is inadequate, one can retrieve tokens from self.token_counts, filter with a desired strategy, and regenerate token_index using this method. The token index is subsequently used when the encode_texts or decode_texts methods are called.

SentenceWordTokenizer.decode_texts

decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)

Decodes the texts using internal vocabulary. The list structure is maintained.

Args:

  • encoded_texts: The list of texts to decode.
  • unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
  • inplace: True to make changes inplace. (Default value: True)

Returns:

The decoded texts.

SentenceWordTokenizer.encode_texts

encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)

Encodes the given texts using the internal vocabulary with optionally applied encoding options. See apply_encoding_options to set various options.

Args:

  • texts: The list of text items to encode.
  • include_oov: True to map unknown (out of vocab) tokens to 0. False to exclude the token.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

Returns:

The encoded texts.

SentenceWordTokenizer.get_counts

get_counts(self, i)

Numpy array of count values for aux_indices. For example, if token_generator generates (text_idx, sentence_idx, word), then get_counts(0) returns the numpy array of sentence lengths across texts. Similarly, get_counts(1) will return the numpy array of token lengths across sentences.

This is useful for plotting histograms or eyeballing the distributions. For standard statistics, you can use the get_stats method.

SentenceWordTokenizer.get_stats

get_stats(self, i)

Gets the standard statistics for aux_index i. For example, if token_generator generates (text_idx, sentence_idx, word), then get_stats(0) will return various statistics about sentence lengths across texts. Similarly, get_stats(1) will return statistics of token lengths across sentences.

This information can be used to pad or truncate inputs.

SentenceWordTokenizer.save

save(self, file_path)

Serializes this tokenizer to a file.

Args:

  • file_path: The file path to use.

SentenceWordTokenizer.token_generator

token_generator(self, texts, **kwargs)

Yields tokens from texts as (text_idx, sent_idx, word)

Args:

  • texts: The list of texts.
  • **kwargs: Supported args include:
      • n_threads/num_threads: The number of threads to use. Uses num_cpus - 1 by default.
      • batch_size: The number of texts to accumulate into a common working set before processing. (Default value: 1000)

CharTokenizer

CharTokenizer.has_vocab

CharTokenizer.num_texts

The number of texts used to build the vocabulary.

CharTokenizer.num_tokens

Number of unique tokens for use in encoding/decoding. This can change with calls to apply_encoding_options.

CharTokenizer.token_counts

Dictionary of token -> count values for the text corpus used to build_vocab.

CharTokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.

CharTokenizer.__init__

__init__(self, lang="en", lower=True, charset=None)

Encodes text into (samples, characters)

Args:

  • lang: The spacy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)
  • charset: The character set to use. For example charset = 'abc123'. If None, all characters will be used. (Default value: None)
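
A brief sketch of restricting the character set (the assumption that characters outside charset map to the unknown index 0 is mine, not stated above):

```python
from keras_text.processing import CharTokenizer, pad_sequences

# Keep only lowercase letters, digits and space as distinct tokens.
tokenizer = CharTokenizer(lower=True, charset='abcdefghijklmnopqrstuvwxyz0123456789 ')
tokenizer.build_vocab(['Hello world 42', 'Another text'])

encoded = tokenizer.encode_texts(['Hello world 42'], verbose=0)
X = pad_sequences(encoded, max_tokens=128)   # shape: (samples, 128)
```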

CharTokenizer.apply_encoding_options

apply_encoding_options(self, min_token_count=1, max_tokens=None)

Applies the given settings for subsequent calls to encode_texts and decode_texts. This allows you to play with different settings without having to re-run tokenization on the entire corpus.

Args:

  • min_token_count: The minimum token count (frequency) required for a token to be included during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value = 1)
  • max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common max_tokens tokens will be kept. Set to None to keep everything. (Default value: None)

CharTokenizer.build_vocab

build_vocab(self, texts, verbose=1, **kwargs)

Builds the internal vocabulary and computes various statistics.

Args:

  • texts: The list of text items to encode.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

CharTokenizer.create_token_indices

create_token_indices(self, tokens)

If apply_encoding_options is inadequate, one can retrieve tokens from self.token_counts, filter with a desired strategy, and regenerate token_index using this method. The token index is subsequently used when the encode_texts or decode_texts methods are called.

CharTokenizer.decode_texts

decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)

Decodes the texts using internal vocabulary. The list structure is maintained.

Args:

  • encoded_texts: The list of texts to decode.
  • unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
  • inplace: True to make changes inplace. (Default value: True)

Returns:

The decoded texts.

CharTokenizer.encode_texts

encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)

Encodes the given texts using the internal vocabulary with optionally applied encoding options. See apply_encoding_options to set various options.

Args:

  • texts: The list of text items to encode.
  • include_oov: True to map unknown (out of vocab) tokens to 0. False to exclude the token.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

Returns:

The encoded texts.

CharTokenizer.get_counts

get_counts(self, i)

Numpy array of count values for aux_indices. For example, if token_generator generates (text_idx, sentence_idx, word), then get_counts(0) returns the numpy array of sentence lengths across texts. Similarly, get_counts(1) will return the numpy array of token lengths across sentences.

This is useful for plotting histograms or eyeballing the distributions. For standard statistics, you can use the get_stats method.

CharTokenizer.get_stats

get_stats(self, i)

Gets the standard statistics for aux_index i. For example, if token_generator generates (text_idx, sentence_idx, word), then get_stats(0) will return various statistics about sentence lengths across texts. Similarly, get_stats(1) will return statistics of token lengths across sentences.

This information can be used to pad or truncate inputs.

CharTokenizer.save

save(self, file_path)

Serializes this tokenizer to a file.

Args:

  • file_path: The file path to use.

CharTokenizer.token_generator

token_generator(self, texts, **kwargs)

Yields tokens from texts as (text_idx, character)

SentenceCharTokenizer

SentenceCharTokenizer.has_vocab

SentenceCharTokenizer.num_texts

The number of texts used to build the vocabulary.

SentenceCharTokenizer.num_tokens

Number of unique tokens for use in encoding/decoding. This can change with calls to apply_encoding_options.

SentenceCharTokenizer.token_counts

Dictionary of token -> count values for the text corpus used to build_vocab.

SentenceCharTokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.

SentenceCharTokenizer.__init__

__init__(self, lang="en", lower=True, charset=None)

Encodes text into (samples, sentences, characters)

Args:

  • lang: The spacy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)
  • charset: The character set to use. For example charset = 'abc123'. If None, all characters will be used. (Default value: None)

SentenceCharTokenizer.apply_encoding_options

apply_encoding_options(self, min_token_count=1, max_tokens=None)

Applies the given settings for subsequent calls to encode_texts and decode_texts. This allows you to play with different settings without having to re-run tokenization on the entire corpus.

Args:

  • min_token_count: The minimum token count (frequency) required for a token to be included during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value = 1)
  • max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common max_tokens tokens will be kept. Set to None to keep everything. (Default value: None)

SentenceCharTokenizer.build_vocab

build_vocab(self, texts, verbose=1, **kwargs)

Builds the internal vocabulary and computes various statistics.

Args:

  • texts: The list of text items to encode.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

SentenceCharTokenizer.create_token_indices

create_token_indices(self, tokens)

If apply_encoding_options is inadequate, one can retrieve tokens from self.token_counts, filter with a desired strategy, and regenerate token_index using this method. The token index is subsequently used when the encode_texts or decode_texts methods are called.

SentenceCharTokenizer.decode_texts

decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)

Decodes the texts using internal vocabulary. The list structure is maintained.

Args:

  • encoded_texts: The list of texts to decode.
  • unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
  • inplace: True to make changes inplace. (Default value: True)

Returns:

The decoded texts.

SentenceCharTokenizer.encode_texts

encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)

Encodes the given texts using the internal vocabulary with optionally applied encoding options. See apply_encoding_options to set various options.

Args:

  • texts: The list of text items to encode.
  • include_oov: True to map unknown (out of vocab) tokens to 0. False to exclude the token.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
  • **kwargs: The kwargs for token_generator.

Returns:

The encoded texts.

SentenceCharTokenizer.get_counts

get_counts(self, i)

Numpy array of count values for aux_indices. For example, if token_generator generates (text_idx, sentence_idx, word), then get_counts(0) returns the numpy array of sentence lengths across texts. Similarly, get_counts(1) will return the numpy array of token lengths across sentences.

This is useful for plotting histograms or eyeballing the distributions. For standard statistics, you can use the get_stats method.

SentenceCharTokenizer.get_stats

get_stats(self, i)

Gets the standard statistics for aux_index i. For example, if token_generator generates (text_idx, sentence_idx, word), then get_stats(0) will return various statistics about sentence lengths across texts. Similarly, get_stats(1) will return statistics of token lengths across sentences.

This information can be used to pad or truncate inputs.

SentenceCharTokenizer.save

save(self, file_path)

Serializes this tokenizer to a file.

Args:

  • file_path: The file path to use.

SentenceCharTokenizer.token_generator

token_generator(self, texts, **kwargs)

Yields tokens from texts as (text_idx, sent_idx, character)

Args:

  • texts: The list of texts.
  • **kwargs: Supported args include:
      • n_threads/num_threads: The number of threads to use. Uses num_cpus - 1 by default.
      • batch_size: The number of texts to accumulate into a common working set before processing. (Default value: 1000)