Fine-Tuning the Tiny-Llama Model on a Custom Dataset

Anay Dongre
5 min read · Apr 3, 2024

Fine-tuning a pre-trained model on a custom dataset produces a model that is tailored to a specific domain, which often improves its performance and relevance there. This article walks through fine-tuning the Tiny-Llama model on a custom dataset, covering the entire process from data preparation to model training and highlighting advanced training techniques like PEFT and QLoRA.

THE POWER OF FINE-TUNING

Fine-tuning is a process where a pre-trained model is further trained (i.e., “fine-tuned”) on a smaller, domain-specific dataset. This approach leverages the knowledge gained during the initial training phase and adapts it to the specific needs of the new task. The Tiny-Llama model, a compact model built on the Llama architecture, is particularly well-suited for fine-tuning due to its balance between performance and resource efficiency.

DATA PREPARATION: THE FOUNDATION OF SUCCESS

The success of fine-tuning hinges on the quality and relevance of the training data. For this tutorial, we’ll explore how to prepare a dataset from PDF documents, converting their text into a structured format that can be used efficiently for training.

EXTRACTING TEXT FROM PDFS

The first step involves extracting meaningful text from PDF documents. This is crucial for generating questions and answers that can be used to fine-tune the model. We use the PyPDF2 library, which provides a straightforward way to read PDF files and extract text.

import PyPDF2

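def extract_text_from_pdf(file_path):
    """Return the concatenated text of every page in the PDF at file_path."""
    text = ''
    # The context manager closes the file even if extraction raises an error.
    with open(file_path, 'rb') as pdf_file_obj:
        pdf_reader = PyPDF2.PdfReader(pdf_file_obj)
        for page_obj in pdf_reader.pages:
            # Pages without a text layer (e.g. scanned images) may yield nothing.
            text += page_obj.extract_text() or ''
    return text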
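
Text extracted from PDFs often carries layout residue such as hard line breaks and words hyphenated across lines. A minimal cleanup sketch, assuming a hypothetical helper named clean_extracted_text (not part of PyPDF2), might look like this. The hyphen rule is a heuristic: it also joins genuinely hyphenated words that happen to break at a line end.

```python
import re

def clean_extracted_text(text: str) -> str:
    """Tidy raw PDF text before chunking (hypothetical helper, heuristic rules)."""
    # Re-join words split by a hyphen at a line break: "implemen-\ntation" -> "implementation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space.
    return re.sub(r"\s+", " ", text).strip()

print(clean_extracted_text("The implemen-\ntation  is\n\nshort. "))
```

The cleaned string can then be passed on to the chunking step in place of the raw output.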

GENERATING QUESTIONS AND ANSWERS

Once we have the text, we need to generate questions and answers. This is achieved by sending the text to an API (in this case, using the GPT-4 model from OpenAI) and processing the response to extract the questions and answers.

import openai
import json
from typing import List
from tqdm import tqdm

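def generate_questions_answers(text_chunk):
    """Ask the model for a question/answer pair about text_chunk; return a list of dicts."""
    messages = [
        {'role': 'system', 'content': 'You are an API that converts bodies of text into a single question and answer '
                                      'in JSON format. Each JSON object contains a single question with a single answer, '
                                      'under the keys "question" and "answer". '
                                      'Only respond with the JSON and no additional text.'},
        {'role': 'user', 'content': 'Text: ' + text_chunk}
    ]

    # Uses the openai>=1.0 interface; the API key is read from the OPENAI_API_KEY environment variable.
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=messages,
        max_tokens=2048,
        n=1,
        stop=None,
        temperature=0.7,
    )

    response_text = response.choices[0].message.content.strip()
    try:
        json_data = json.loads(response_text)
    except json.JSONDecodeError:
        print("Error: Response is not valid JSON.")
        return []
    # Always return a list so that callers can extend() their results with it.
    return json_data if isinstance(json_data, list) else [json_data]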
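
Even when told to return only JSON, chat models sometimes wrap the reply in a Markdown code fence, which makes a plain json.loads call fail. The sketch below shows one tolerant way to parse such replies. The name parse_json_reply is hypothetical, and the stripping rule is an assumption about the most common wrapper rather than a guarantee about what the API returns.

```python
import json

FENCE = "`" * 3  # three backticks, the Markdown code-fence marker

def parse_json_reply(reply: str):
    """Return the parsed JSON from a model reply, or None if it cannot be parsed."""
    cleaned = reply.strip()
    if cleaned.startswith(FENCE) and cleaned.endswith(FENCE):
        # Drop the opening and closing fence markers.
        cleaned = cleaned[len(FENCE):-len(FENCE)].strip()
        # Drop an optional language label such as "json" after the opening fence.
        if cleaned.lower().startswith("json"):
            cleaned = cleaned[4:]
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None

wrapped = FENCE + 'json\n{"question": "Q?", "answer": "A."}\n' + FENCE
print(parse_json_reply(wrapped))
```

Returning None for unparseable replies lets the caller decide whether to skip the chunk or retry the request.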

PROCESSING TEXT AND WRITING TO JSON

The final step in data preparation involves processing the text in chunks, generating questions and answers for each chunk, and writing the results to a JSON file. This structured format is essential for the subsequent steps of training.

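def process_text(text: str, chunk_size: int = 4000) -> List[dict]:
    """Split text into fixed-size chunks and collect the Q&A pairs generated for each."""
    text_chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
    all_responses = []
    for chunk in tqdm(text_chunks, desc="Processing chunks", unit="chunk"):
        responses = generate_questions_answers(chunk)
        # A single JSON object arrives as a dict; wrap it so extend() adds the pair, not its keys.
        if isinstance(responses, dict):
            responses = [responses]
        all_responses.extend(responses)
    return all_responses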

text = extract_text_from_pdf('Provide Path to your file')
responses = {"responses": process_text(text)}

with open('responses.json', 'w') as f:
    json.dump(responses, f, indent=2)
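
Slicing the text every 4000 characters can cut a word or sentence in half at a chunk boundary. As a hedged alternative, the sketch below splits between words and repeats a few words at the start of the next chunk so that neighbouring chunks share some context. The function name chunk_words and the overlap_words parameter are illustrative assumptions, not part of the original pipeline.

```python
from typing import List

def chunk_words(text: str, max_chars: int = 4000, overlap_words: int = 20) -> List[str]:
    """Split text between words into chunks of at most max_chars characters.

    A single word longer than max_chars still becomes its own (oversized) chunk.
    """
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start
        length = 0
        # Grow the chunk word by word until the next word would not fit.
        while end < len(words):
            extra = len(words[end]) + (1 if end > start else 0)  # +1 for the joining space
            if length + extra > max_chars and end > start:
                break
            length += extra
            end += 1
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        # Step back to create the overlap, but always advance by at least one word.
        start = max(end - overlap_words, start + 1)
    return chunks

print(chunk_words("one two three four five six", max_chars=13, overlap_words=1))
```

Inside process_text, a call to chunk_words(text, chunk_size) could replace the list comprehension that builds text_chunks.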

CONVERTING JSON TO CSV FOR TRAINING

With the data prepared, we convert it into a CSV format that is suitable for training. This involves extracting the questions and answers from the JSON and writing them to a CSV file with columns for the prompt, question, and answer.

import json
import csv

instruction = "You will answer questions about Machine Learning"

with open('responses.json', 'r', encoding='utf-8') as f:
    responses = json.load(f)

with open('responses.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['prompt', 'question', 'answer']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for response in responses['responses']:
        # Skip entries that are missing either field.
        if 'question' in response and 'answer' in response:
            writer.writerow({'prompt': instruction, 'question': response['question'], 'answer': response['answer']})
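
Generated data usually benefits from a quick cleanup pass before it reaches the CSV. The following sketch drops malformed entries, empty fields, and repeated questions. The helper name clean_pairs and the rule of treating questions as duplicates when they match case-insensitively are assumptions made for illustration.

```python
def clean_pairs(responses):
    """Keep well-formed, non-empty, non-duplicate question/answer pairs."""
    seen = set()
    cleaned = []
    for item in responses:
        if not isinstance(item, dict):
            continue  # e.g. a stray string from a malformed reply
        question = str(item.get("question") or "").strip()
        answer = str(item.get("answer") or "").strip()
        if not question or not answer:
            continue
        key = question.lower()
        if key in seen:
            continue  # the same question was already kept
        seen.add(key)
        cleaned.append({"question": question, "answer": answer})
    return cleaned

sample = [
    {"question": "What is LoRA?", "answer": "A low-rank adapter method."},
    {"question": "what is lora?", "answer": "Duplicate."},
    {"question": "", "answer": "No question."},
    "question",
]
print(clean_pairs(sample))
```

Running clean_pairs(responses['responses']) before the writer loop would keep the CSV free of unusable rows.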

PREPARING THE DATA FOR TRAINING

Before training, we need to format the data correctly. This involves reading the CSV file, combining the prompt, question, and answer into a single text column, and then saving the formatted data to a new CSV file.

import pandas as pd

df = pd.read_csv("responses.csv")
df = df.fillna("")  # fillna returns a new DataFrame, so keep the result

text_col = []

# The preamble is identical for every row, so define it once.
prompt = "Below is an instruction which describes a task paired with a question that provides further context. Write a response that completes the request.\n\n"

for _, row in df.iterrows():
    instruction = str(row["prompt"])
    input_query = str(row["question"])
    response = str(row["answer"])

    if len(input_query.strip()) == 0:
        text = prompt + "### Instruction:\n " + instruction + "\n### Response:\n " + response
    else:
        text = prompt + "### Instruction:\n " + instruction + "\n### Input:\n " + input_query + "\n### Response:\n " + response

    text_col.append(text)

# Store the formatted examples in a "text" column, the column name AutoTrain reads by default.
df.loc[:, "text"] = text_col
df.to_csv("train.csv", index=False)
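
At inference time the fine-tuned model should be prompted with the same layout it saw during training, ending right after the response header. One way to keep the two in sync is to factor the template into a function. The sketch below uses the same instruction/input/response layout. The name build_prompt is hypothetical, and the exact whitespace is an assumption that should match whatever was written to train.csv.

```python
PREAMBLE = (
    "Below is an instruction which describes a task paired with a question "
    "that provides further context. Write a response that completes the request.\n\n"
)

def build_prompt(instruction: str, question: str = "", answer: str = "") -> str:
    """Format one example; leave answer empty to get an inference-time prompt."""
    parts = [PREAMBLE, "### Instruction:\n ", instruction]
    if question.strip():
        # The Input section is only included when there is a question.
        parts += ["\n### Input:\n ", question]
    parts += ["\n### Response:\n ", answer]
    return "".join(parts)

print(build_prompt("You will answer questions about Machine Learning", "What is overfitting?"))
```

With an answer supplied, the function produces a training example. Without one, the prompt stops at the response header so the model generates the answer itself.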

TRAINING WITH AUTO TRAIN ADVANCED

For training, we use AutoTrain Advanced, a Hugging Face tool that simplifies the training process. It supports advanced training techniques like PEFT (Parameter-Efficient Fine-Tuning) and QLoRA (Quantized Low-Rank Adaptation), which can significantly reduce training time and memory usage.

TINY-LLAMA

The Tiny-Llama model used here is a compact model built on the Llama architecture. It is a proof of concept that recreates the TinyStories-1M model with Llama, and its small size makes it quick to fine-tune on custom datasets. Hosted on Hugging Face as Maykeye/TinyLLama-v0, it is a convenient way to try out fine-tuning and domain-specific modeling.

! autotrain llm --train --project-name josh-ops --model Maykeye/TinyLLama-v0 --data-path /kaggle/input/llamav2 --use-peft --quantization int4 --lr 1e-4 --train-batch-size 15 --epochs 60 --trainer sft
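
To get a feel for how much work this command requests, it helps to estimate the number of optimizer steps from the batch size and epoch count. The sketch below assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept. The dataset size of 300 examples is purely illustrative.

```python
import math

def total_optimizer_steps(num_examples: int, batch_size: int = 15, epochs: int = 60) -> int:
    """Estimate optimizer steps for a run (single device, no gradient accumulation)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

# Illustrative dataset size; replace with the number of rows in train.csv.
print(total_optimizer_steps(300))
```

The defaults mirror the --train-batch-size 15 and --epochs 60 values in the command above.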

UNDERSTANDING PEFT AND QLORA

  • PEFT (Parameter-Efficient Fine-Tuning): This family of techniques reduces the number of parameters that are updated during training. Most of the pre-trained weights stay frozen, and only a small set of added or selected parameters is trained. It’s particularly useful for fine-tuning large models on custom datasets.
  • QLoRA (Quantized Low-Rank Adaptation): QLoRA is a method for reducing the memory footprint of fine-tuning. The pre-trained weights are quantized (to 4 bits, as with --quantization int4 in the command above) and kept frozen, while small low-rank adapter matrices are trained on top of them. This can significantly reduce the amount of memory required for training, making it possible to train larger models on hardware with limited resources.
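
A little arithmetic shows where the savings come from. For a weight matrix of shape d_out × d_in, a rank-r low-rank adapter trains two factors with r × (d_out + d_in) parameters in total instead of all d_out × d_in weights, and storing the frozen weights in 4 bits instead of 16 cuts their memory to a quarter. The sketch below works through one example. The 4096 × 4096 layer and rank 8 are illustrative numbers rather than Tiny-Llama's actual dimensions, and the memory estimate ignores the small overhead of quantization constants.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

def weight_memory_bytes(num_params: int, bits_per_param: int) -> float:
    """Memory needed to store num_params weights at the given precision."""
    return num_params * bits_per_param / 8

d_out, d_in, rank = 4096, 4096, 8   # illustrative layer size and adapter rank
full = d_out * d_in
adapter = lora_trainable_params(d_out, d_in, rank)

print(f"full matrix: {full:,} weights, adapter: {adapter:,} trainable weights")
print(f"trainable fraction: {adapter / full:.2%}")
print(f"frozen weights at 16-bit: {weight_memory_bytes(full, 16) / 2**20:.0f} MiB")
print(f"frozen weights at 4-bit:  {weight_memory_bytes(full, 4) / 2**20:.0f} MiB")
```

The same calculation applies to every adapted matrix in the model, so the savings scale with model size.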

CONCLUSION

Fine-tuning the Tiny-Llama model on a custom dataset is a powerful way to achieve domain-specific performance. By leveraging advanced training techniques like PEFT and QLoRA, and using tools like AutoTrain Advanced, we can efficiently train models on custom datasets. This process not only enhances the model’s performance on specific tasks but also opens up new possibilities for applying machine learning in various domains. As the field continues to evolve, the ability to fine-tune models on custom datasets will remain a critical skill.
