Minitron is an interesting pruned-and-distilled derivative of Llama 3.1 from NVIDIA Research.

The group investigates whether pruning an existing LLM and then re-training it with a fraction (<3%) of the original training data can be an effective way to create smaller models, instead of retraining from scratch. They hypothesized that this approach could significantly reduce training cost while maintaining good performance. Their method creates smaller, accurate language models through structured weight pruning and knowledge distillation, offering several advantages:

  • 16% improvement in MMLU scores
  • Up to 40x fewer training tokens per model
  • Compute cost savings of 1.8x for training the full model family

They demonstrated the approach by deriving the Llama-3.1-Minitron 4B model from the already pretrained Llama 3.1 8B model.

Knowledge Distillation: A Powerful Technique for Language Model Research

Knowledge distillation is a powerful technique in language model research that enables the transfer of knowledge from a large, complex model to a smaller, more efficient one. By distilling the essence of the larger model, researchers can create more practical and deployable AI systems.

Here’s how the knowledge distillation process works in NVIDIA’s research:

In this setup, the capabilities of a large “teacher” model are transferred to a smaller “student” model. The student is not just taught to give correct answers; it learns to replicate the teacher’s subtle patterns and the relationships between its outputs.

When we input data into both models, they produce probability distributions over possible answers. The student’s goal is to match its probability distribution to the teacher’s, rather than simply selecting the most likely answer. This captures the teacher’s uncertainty and relative confidence across all options.
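To make this concrete, here is a minimal PyTorch sketch of what “a probability distribution over possible answers” means. All the values below are stand-ins, not outputs from the actual models:

```python
import torch
import torch.nn.functional as F

vocab_size = 8
teacher_logits = torch.randn(vocab_size)  # stand-in for the teacher's raw output scores
student_logits = torch.randn(vocab_size)  # stand-in for the student's raw output scores

# Softmax turns raw scores into a probability distribution over the vocabulary.
teacher_probs = F.softmax(teacher_logits, dim=-1)
student_probs = F.softmax(student_logits, dim=-1)

print(teacher_probs.argmax().item())  # just "the most likely answer"
print(teacher_probs)                  # the full distribution the student must match
```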

To compare these distributions, we use Kullback-Leibler (KL) divergence, a measure of the difference between two probability distributions. Think of the two distributions as two different ways of predicting the weather: the KL divergence gives a single number representing how different the two predictions are, and the student tries to minimize it.
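In code, this comparison might look like the sketch below, using PyTorch’s `F.kl_div` (which expects log-probabilities for the student and probabilities for the teacher). This is illustrative; the paper’s exact loss formulation may differ:

```python
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL divergence between teacher and student token distributions.

    Both inputs have shape (batch, seq_len, vocab_size); the result is
    averaged per token.
    """
    vocab = student_logits.size(-1)
    s = F.log_softmax(student_logits, dim=-1).reshape(-1, vocab)
    t = F.softmax(teacher_logits, dim=-1).reshape(-1, vocab)
    # F.kl_div takes log-probabilities as input and probabilities as target.
    return F.kl_div(s, t, reduction="batchmean")
```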

In addition to comparing final outputs, we also examine the intermediate calculations (hidden states) inside both models. This helps the student learn to process information similarly to the teacher at various stages. However, since the student is smaller, its hidden states have fewer dimensions than the teacher’s. To address this, we use a learned linear transformation - essentially a matrix of adjustable parameters - to “scale up” the student’s hidden states before comparison. Think of it like adjusting the zoom level on a map to match the scale of a larger map. Because the transformation is learned during training, the student finds the best way to map its smaller representation onto the teacher’s larger one.
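A sketch of that idea, with hypothetical hidden-state widths (the projection layer and the MSE penalty below are illustrative choices, not the paper’s exact formulation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

student_dim, teacher_dim = 3072, 4096  # hypothetical widths

# The learned linear transformation: trained jointly with the student,
# it "scales up" the student's hidden states to the teacher's width.
project = nn.Linear(student_dim, teacher_dim)

def hidden_state_loss(student_hidden: torch.Tensor,
                      teacher_hidden: torch.Tensor) -> torch.Tensor:
    # Map (batch, seq_len, student_dim) -> (batch, seq_len, teacher_dim),
    # then penalize the distance between the two representations.
    return F.mse_loss(project(student_hidden), teacher_hidden)
```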

The student model must balance three competing objectives: predicting the training data accurately, matching the teacher’s output probabilities, and replicating the teacher’s internal processing patterns. We combine these objectives into a single loss function that the student tries to minimize, with the relative importance of each component adjusted dynamically during training.
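One plausible way to combine the three terms is sketched below. The weighting schedule here is hypothetical and only illustrates the idea of adjusting the balance over training; the paper’s actual schedule differs:

```python
import torch

def combined_loss(ce_loss: torch.Tensor,      # accuracy on the training data
                  logit_kl: torch.Tensor,     # matching the teacher's outputs
                  hidden_loss: torch.Tensor,  # matching the teacher's hidden states
                  step: int, total_steps: int) -> torch.Tensor:
    # Hypothetical schedule: shift weight between the two distillation
    # terms as training progresses.
    alpha = step / total_steps
    return ce_loss + alpha * logit_kl + (1.0 - alpha) * hidden_loss
```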

The training process involves showing both models many examples, far fewer than were used to train the original teacher. For each example, we run it through both models, calculate how well the student is doing on our combined objective, and make small adjustments to the student model to improve its performance. This is repeated many times, gradually refining the student model.
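Put together, a schematic training step might look like this. The models below are toy stand-ins so the loop actually runs; real code would load the frozen Llama 3.1 8B teacher and the pruned student, and use the full combined objective rather than just the logit-matching term:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, s_dim, t_dim = 100, 32, 64  # toy sizes, not the real models
teacher = nn.Sequential(nn.Embedding(vocab, t_dim), nn.Linear(t_dim, vocab))
student = nn.Sequential(nn.Embedding(vocab, s_dim), nn.Linear(s_dim, vocab))
teacher.eval()  # the teacher is frozen; only the student is updated

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

for step in range(100):
    tokens = torch.randint(0, vocab, (8, 16))  # stand-in batch of token ids
    with torch.no_grad():
        t_logits = teacher(tokens)             # run the example through the teacher
    s_logits = student(tokens)                 # ...and through the student

    # How well is the student doing? (Only the logit-matching term shown.)
    loss = F.kl_div(F.log_softmax(s_logits, -1).reshape(-1, vocab),
                    F.softmax(t_logits, -1).reshape(-1, vocab),
                    reduction="batchmean")

    loss.backward()        # small adjustment to the student...
    optimizer.step()
    optimizer.zero_grad()  # ...repeated many times
```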

To optimize the learning process, we adjust the learning rate – the speed at which the student model updates its parameters – using a schedule that starts slow, accelerates, and then slows down again towards the end. This helps the model learn effectively without overshooting optimal settings.
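A common recipe for such a schedule is linear warmup followed by cosine decay. Here is a minimal PyTorch version; the warmup and total step counts are placeholders:

```python
import math
import torch

def warmup_cosine(step: int, warmup: int, total: int) -> float:
    """LR multiplier: linear warmup (starts slow, accelerates),
    then cosine decay (slows down towards the end)."""
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

params = [torch.zeros(1, requires_grad=True)]  # placeholder parameters
optimizer = torch.optim.AdamW(params, lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda s: warmup_cosine(s, warmup=200, total=10_000))
```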

By following this process, we can create a smaller model that captures much of the sophisticated behavior of the larger model. This makes it more practical to use in real-world applications while maintaining strong performance, effectively distilling the essence of the larger model into a more compact form.

The researchers also had to choose how to trim the teacher model’s layers and neurons to produce a student of the target size - that choice is what structured weight pruning is about.

Research paper: Compact Language Models via Pruning and Knowledge Distillation

NVIDIA’s blog post: How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model

The model weights are available on Hugging Face.

That’s all I have for now. I hope you learned something. I will explore and explain structured weight pruning next time. Stay tuned.