Tokenization is a data-preprocessing step in which input data is split into a sequence of smaller, meaningful parts called “tokens”: sentences into words or phrases, images into components, or documents into sections. Tokenization is especially important for Natural Language Processing (NLP) applications, since these models learn from words, their order, and their context. Additionally, machine learning algorithms operate on numbers rather than raw text, so tokenization provides a way to represent text numerically.

A token is typically something that recurs across different inputs: common or important words, repeated image components, or document sections that appear across many inputs. For NLP applications, a dictionary of the language can help identify meaningful tokens, flag outliers that are not regular dictionary words, and map each word to a number based on its dictionary entry.

Popular tokenizers include the FastText tokenizer, SentencePiece, Byte-Pair Encoding (BPE), which accounts for word frequency, WordPiece, Unigram, and others. Care is needed when choosing a tokenizer, because some languages, especially East Asian languages, do not separate words with spaces or use characters in the same way. Text can be tokenized at the character, word, sentence, line, paragraph, or “n-gram” level.
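The idea of splitting text into tokens and mapping each token to a number can be sketched with a minimal word-level tokenizer. This is an illustrative toy, not one of the production tokenizers named above: the `build_vocab` and `tokenize` helpers are hypothetical, and real subword tokenizers such as BPE or WordPiece instead learn their vocabularies from data.

```python
import re

def build_vocab(corpus):
    """Assign each unique lowercase word an integer ID; 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for text in corpus:
        for word in re.findall(r"\w+", text.lower()):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def tokenize(text, vocab):
    """Split text into words and map each to its ID (0 if out of vocabulary)."""
    return [vocab.get(w, 0) for w in re.findall(r"\w+", text.lower())]

corpus = ["The cat sat on the mat.", "The dog sat too."]
vocab = build_vocab(corpus)
print(tokenize("The cat and the dog", vocab))  # → [1, 2, 0, 1, 6]
```

Note that “and” was never seen when building the vocabulary, so it maps to the unknown-token ID 0, the same kind of out-of-vocabulary handling that motivates subword tokenizers like BPE and WordPiece.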