What I Learned during my Second and Third Internships

You Learn by Doing

Creating Models
LLMs
Papers
Programming
Transformers
Brief summaries of various things I learned.
Author

Salman Naqvi

Published

Tuesday, 24 September 2024


I recently completed my second and third internships. These were quite hands-on, and involved tasks such as finetuning LLMs to generate Cantonese lyrics on a specified topic with the specified tones. Along the way, I learned a lot. Three months ago, I would not have been able to do what I did. You learn by doing.

Let’s list out some of the things I learned (in no particular order).

Training Strategies

I know it sounds really obvious, but one should never start the finetuning process on the full dataset with the largest model variant right out of the box. There will be issues. One may be under the impression that they’ve implemented all the tweaks the model needed on the first attempt, or that the data is correctly processed and formatted. Chances are, they haven’t. Save time, money, and resources by finetuning on a small subset of the data with a smaller model variant. Once it is known that the whole process works correctly and as expected, go ahead with the full finetune.
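Here’s a minimal sketch of what that dry run might look like with the Hugging Face datasets and transformers libraries. The dataset name is a hypothetical placeholder, and the small Qwen variant just stands in for whatever model family you actually plan to finetune.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical dataset name; substitute your own.
dataset = load_dataset("my-org/cantonese-lyrics", split="train")

# Dry run: a few hundred samples and a small variant of the target model family.
subset = dataset.shuffle(seed=42).select(range(500))
model_name = "Qwen/Qwen2-0.5B"  # stand-in for the smallest variant of your chosen family
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# ...run preprocessing and a short training loop on `subset`...
# Only once this finishes without surprises, scale up to the full dataset
# and the largest variant.
```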

BPE Tokenizers

Byte Pair Encoding (BPE) tokenizers are “dynamic” tokenizers in the sense that their vocabulary is not hand-crafted from a preset word list or set of rules. Rather, the vocabulary is learned by an algorithm: frequent pairs of characters (or bytes) are repeatedly merged together to form tokens. Byte-level BPE tokenizers also store their vocabulary as byte strings.
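As a small illustration, here is a BPE tokenizer trained from scratch on a toy corpus with the Hugging Face tokenizers library; the corpus and vocabulary size are made up for the example.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a tiny BPE tokenizer from scratch on a toy corpus.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
corpus = ["low lower lowest", "new newer newest", "wide wider widest"]
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Frequent character sequences ("low", "est", ...) get merged into single tokens.
print(tokenizer.encode("lowest newest").tokens)
```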

GPUs

There are various types of GPUs. I got to try out everything from the mighty A100 all the way down to the reliable T4. I found the L4 and the RTX 4090 to be the sweet spot, though I’d lean towards the latter. Also, if you’re simply after a GPU, don’t go for the major cloud providers such as Google Cloud. The only reason our company was using Google Cloud was because it had obtained free credits. Otherwise, it’s, well, a rip-off compared to other providers (in the context of research).

Forward Hooks

A hook is a piece of code that can be inserted at predefined points in an algorithm to add extra functionality. In the context of deep learning, a forward hook is a piece of code that runs right after a module’s forward pass. By registering a forward hook on the right module, one can directly play around with and tweak the computed logits before the loss is calculated.
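Here is a minimal PyTorch sketch of the idea: a toy “LM head” gets a forward hook that nudges one token’s logit right after the layer’s forward pass. The layer sizes and the boosted token id are made up for illustration.

```python
import torch
from torch import nn

# Toy "LM head": hidden size 16, vocabulary size 100.
lm_head = nn.Linear(16, 100)

def boost_token_zero(module, inputs, output):
    # Nudge the logit of token id 0 before any loss is computed.
    boost = torch.zeros_like(output)
    boost[..., 0] = 5.0
    return output + boost  # returning a tensor replaces the module's output

handle = lm_head.register_forward_hook(boost_token_zero)

hidden = torch.randn(2, 8, 16)   # (batch, sequence, hidden)
logits = lm_head(hidden)         # the hook fires here
handle.remove()                  # detach the hook when done
```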

Token Generation

With Hugging Face, at least, the way models generate tokens during training and inference is different. During training, the model predicts all output tokens simultaneously in a single forward pass over the whole sequence (teacher forcing), whereas during inference, generation occurs token by token.

However, as I dug behind the scenes of the built-in generate method, I found that during inference, all positions are still predicted simultaneously, in a way. Say 10 tokens are to be generated: at each step, the model outputs a prediction for every position in the sequence, but only the prediction at the last position is kept; the rest are effectively “filler” tokens.
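The following is not the actual generate source (which adds KV caching, padding, sampling, and so on), just a stripped-down greedy decoding loop that shows the point: logits come out for every position at every step, and only the last position’s logits choose the next token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

for _ in range(10):
    logits = model(input_ids).logits  # (batch, seq_len, vocab_size): one prediction per position
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # keep only the last position
    input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```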

Logit Processors

Logit processors are nifty objects that Hugging Face provides for conveniently tweaking a model’s logits during inference with the built-in generate method.
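For instance, a custom processor can set a token’s logit to negative infinity so it is never sampled. The processor below is my own toy example, not something shipped with the library.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class BanTokenProcessor(LogitsProcessor):
    """Sets one token's logit to -inf so it can never be generated."""
    def __init__(self, token_id: int):
        self.token_id = token_id

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        scores[:, self.token_id] = float("-inf")
        return scores

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
banned_id = tokenizer(" Paris", add_special_tokens=False).input_ids[0]

out = model.generate(
    **inputs,
    max_new_tokens=10,
    logits_processor=LogitsProcessorList([BanTokenProcessor(banned_id)]),
)
print(tokenizer.decode(out[0]))
```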

Handling Mistakes

Mistakes happen. Especially for challenging tasks. That’s normal. Mistakes should happen. What’s important is that one learns from their mistakes.

FSDP

Fully Sharded Data Parallel (FSDP) is a technique for splitting a model across multiple GPUs. Imagine literally splitting a model’s parameters into a bunch of different groups (shards), and then putting those shards onto multiple GPUs that run simultaneously. When a layer is about to be computed on a GPU, its full parameters are temporarily gathered onto that GPU from the other shards, then freed again afterwards.
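A minimal PyTorch sketch, assuming a multi-GPU machine and a torchrun launch; the toy model and its sizes are placeholders.

```python
import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Launch with e.g. `torchrun --nproc_per_node=2 this_script.py`.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
model = FSDP(model.cuda())  # parameters get sharded across the GPUs;
                            # each layer's full weights are all-gathered
                            # just before that layer runs

x = torch.randn(8, 1024, device="cuda")
loss = model(x).sum()
loss.backward()
```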

LoRA

Low Rank Adaptation (LoRA) is an LLM finetuning technique in which the original weights are frozen and small, trainable low-rank matrices are injected alongside existing layers (typically the attention projections). These new matrices are then trained to achieve a specific task, and together form, well, an adapter of sorts. Think of a screwdriver handle onto which you can attach various heads, each head achieving a specific task. In the context of LLMs, these specific tasks could be improving a model’s performance on some domain, reducing the resources needed to finetune it, or changing what sort of output it generates.
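A minimal sketch with the peft library; the base model name and the LoRA hyperparameters are just illustrative choices.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

config = LoraConfig(
    r=8,                                  # rank of the adapter matrices
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # layers to inject adapters into
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```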

Quantization

Quantization is an LLM inference technique whereby a model’s weights are stored in 4 or fewer bits, as opposed to the more typical 16 or 32 bits. This allows more weights to be stored in the same amount of memory, and larger models to fit on smaller hardware, all whilst retaining roughly the same level of performance.
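With Hugging Face this is usually a loading-time option. Below is a minimal sketch using bitsandbytes 4-bit quantization, assuming bitsandbytes is installed and a CUDA GPU is available; the model name is just an example.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Store the weights in 4-bit NF4 instead of 16/32-bit floats.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B",                 # example model
    quantization_config=bnb_config,
    device_map="auto",
)
```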

Reading Papers

When browsing papers, one typically wants to read the abstract, headings, figures, and results. From this, one can determine whether a given paper is relevant to their goal or not.

When reading a paper, one wants to do it iteratively. In the first iteration, read only the abstract, figures, and results. In the next iteration, read the paper, but not with too much focus. Don’t understand something? That’s fine; move on. In the third iteration, start reading more closely and try to make sense of what is going on. Get bogged down? Spend some more time on it, but not too much more; save it for the next iteration. And so on. As you do more iterations, it may be helpful to rephrase or summarize what each section discusses. This lets you find gaps in your knowledge and check whether you understand what you just wrote.

Looping and Slicing

Loops can be immensely slow. I managed to cut my training time from 10+ hours to around 2 hours by replacing two nested for loops with a single slicing operation.
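The snippet below is not my actual training code, just an illustration of the pattern: masking out prompt tokens in a label tensor with two nested loops versus one slicing assignment.

```python
import torch

batch, seq_len, prompt_len = 32, 512, 128
labels = torch.randint(0, 50_000, (batch, seq_len))

# Slow: two nested Python loops.
masked_slow = labels.clone()
for i in range(batch):
    for j in range(prompt_len):
        masked_slow[i, j] = -100

# Fast: the same operation as a single slicing assignment.
masked_fast = labels.clone()
masked_fast[:, :prompt_len] = -100

assert torch.equal(masked_slow, masked_fast)
```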

LLM Training Data Formats

There are different ways LLM training data can be formatted. The Alpaca format uses '###' headers as delimiters, while ChatML wraps each turn in the tokens '<|im_start|>' and '<|im_end|>'.
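Roughly, the two look like this (both samples are abridged, made-up examples):

```python
# Alpaca-style sample: sections separated by '###' headers.
alpaca_sample = """Below is an instruction that describes a task.

### Instruction:
Write a short Cantonese lyric about rain.

### Response:
落雨大，水浸街"""

# ChatML-style sample: each turn wrapped in <|im_start|> / <|im_end|>.
chatml_sample = """<|im_start|>user
Write a short Cantonese lyric about rain.<|im_end|>
<|im_start|>assistant
落雨大，水浸街<|im_end|>"""
```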

Best Performing Open Source Models

The best performing open source models come from the other side of the Pacific Ocean: from China. The AI scene there is quite vibrant, with most of the actual advances happening there. If you look at the latest papers, most of the authors will be from there too. During my internships, I came across so many models I had never heard of that were strong performers or had interesting perks or quirks. Qwen is the best performing open source model. DeepSeek is the most cost-effective endpoint that exists. Then there are so many other models such as InternLM, Yi, PhotoMaker, and more.

Refactoring

Refactored code is abstracted code. Abstracted code is maintainable code. Abstracted code is understandable code. Abstracted code is good code.

Data

It’s stating the obvious, but data matters. It actually does influence the quality or manner of the model output.

Conclusion

If you have any comments, questions, suggestions, feedback, criticisms, or corrections, please do post them down in the comment section below!
