Models can’t understand text data (they can only munch numbers). There are ways to convert text to numerical data, and this chapter predominantly discusses that. The sequence: text data → continuous-valued vectors → embedding. An embedding is a mapping from the data of interest (audio, video, or text documents) to a vector space. One of the most popular word embedding techniques → Word2Vec. Word2Vec assumes that words appearing in similar contexts have similar meanings.
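A minimal sketch of what an embedding layer does, assuming PyTorch; the vocabulary size and embedding dimension below are made up for illustration, not taken from the book:

```python
import torch

# Illustrative sizes (assumptions, not from the chapter).
vocab_size, embed_dim = 6, 3
torch.manual_seed(123)
embedding = torch.nn.Embedding(vocab_size, embed_dim)

# Integer token IDs are mapped to continuous-valued vectors.
token_ids = torch.tensor([2, 5, 1])
print(embedding(token_ids))  # shape (3, 3): one vector per token
```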
Byte Pair Encoders
A simple tokenizer is implemented using the vocabulary of the text data being used. If we encounter a word that’s not in the vocabulary, the tokenizer can’t provide a corresponding token ID. Byte Pair Encoding is a technique used to tokenize even unseen words, by breaking the unknown word into smaller known parts. (Todo: Have to look into the implementation in detail)
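A quick sketch of that behaviour using the tiktoken library’s GPT-2 encoding (my choice of library here, not something stated in the notes above); a made-up word still gets encoded because it is split into known subword pieces:

```python
import tiktoken  # assumed available: pip install tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

ids = tokenizer.encode("Akwirw ier")          # a made-up, out-of-vocabulary string
print(ids)                                    # list of subword token IDs
print([tokenizer.decode([i]) for i in ids])   # the pieces the string was broken into
```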
The remaining part of Chapter 2 mainly concentrates on the implementation of tokenizers. Code repository for the same: https://github.com/saipragathi0912/LLM-from-scratch-python/tree/text_tokenization
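For intuition, a rough sketch of the vocabulary-based tokenizer idea described above; the class name, regex, and toy vocabulary are my own assumptions, not necessarily what the linked repo uses:

```python
import re

class SimpleTokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab                              # token string -> ID
        self.int_to_str = {i: s for s, i in vocab.items()}   # ID -> token string

    def encode(self, text):
        tokens = re.findall(r"\w+|[^\w\s]", text)
        # Raises KeyError on words missing from the vocabulary -- the gap BPE closes.
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):
        return " ".join(self.int_to_str[i] for i in ids)

# Build a toy vocabulary from some sample text and round-trip a sentence.
sample = "the quick brown fox jumps over the lazy dog ."
vocab = {tok: i for i, tok in enumerate(sorted(set(sample.split())))}
tok = SimpleTokenizer(vocab)
print(tok.decode(tok.encode("the lazy fox .")))
```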
Miscellaneous
PyTorch’s Dataset and DataLoader classes:
Dataset → lets you create custom datasets to train models efficiently; you subclass it and define how an individual data sample is retrieved
DataLoader → handles shuffling, batching into different batch sizes, and more (minimal sketch below)
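A minimal sketch of the Dataset/DataLoader pattern for token data; the class name, sliding-window setup, and sizes are illustrative assumptions, not the book’s exact code:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TokenDataset(Dataset):
    def __init__(self, token_ids, context_length):
        self.inputs, self.targets = [], []
        # Slide a window over the token IDs: input chunk and next-token targets.
        for i in range(len(token_ids) - context_length):
            self.inputs.append(torch.tensor(token_ids[i:i + context_length]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + 1 + context_length]))

    def __len__(self):              # how many samples there are
        return len(self.inputs)

    def __getitem__(self, idx):     # how one sample is retrieved
        return self.inputs[idx], self.targets[idx]

token_ids = list(range(50))         # stand-in for real tokenized text
dataset = TokenDataset(token_ids, context_length=4)
loader = DataLoader(dataset, batch_size=8, shuffle=True, drop_last=True)

for x, y in loader:                 # DataLoader handles batching and shuffling
    print(x.shape, y.shape)         # torch.Size([8, 4]) torch.Size([8, 4])
    break
```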