You have hit the limit of what prompt engineering can do. You have written a 3-page system prompt. You have implemented few-shot examples. Yet, when you ask Llama 3 8B to summarize medical documents with specific billing codes, it still hallucinates roughly 10% of the time, and its tone sounds distinctly like an AI.
You cannot solve this inside the context window. You have to alter the model's actual weights. It is time to fine-tune.
In this tutorial, we will use Unsloth and PEFT (Parameter-Efficient Fine-Tuning) via LoRA (Low-Rank Adaptation) to train Llama 3 8B on a custom dataset. By the end, you will have a custom model that inherently understands your domain, fast enough to run locally or on a cheap cloud GPU.
Step 1: The Dataset is Everything
The biggest mistake developers make is trying to fix bad logic with fine-tuning. Fine-tuning is not for teaching the model new facts. RAG is for facts. Fine-tuning is for teaching the model a specific format, structure, or tone.
We need "Instruct"-style JSONL data. The standard format looks like this. You need at least 150-200 of these examples; around 500 is a good target.
{
  "instruction": "Convert this raw patient note into a structured billing report.",
  "input": "Patient complains of chest pain, gave him aspirin, sent to cardiology.",
  "output": "BILLING CODE: Z99. \nDEPARTMENT: Cardiology. \nACTION: Administered [Aspirin]. \nSUMMARY: Evaluated for chest pain."
}
Do not use an LLM to generate your dataset unless you manually verify every single output. The model will hyper-fixate on your dataset's flaws. If you have typos in your output, the model will learn to write typos.
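A quick mechanical sanity check catches the worst dataset flaws before you pay for GPU time. Here is a minimal sketch (the filename `my_dataset.jsonl` and the three required keys follow the format above):

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_jsonl(path):
    """Return a list of problems: unparseable lines, missing keys, empty outputs."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # allow blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"line {lineno}: invalid JSON ({e})")
                continue
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                errors.append(f"line {lineno}: missing keys {sorted(missing)}")
            elif not str(record["output"]).strip():
                errors.append(f"line {lineno}: empty output")
    return errors

# errors = validate_jsonl("my_dataset.jsonl")
```

An empty list means the file is at least structurally sound; it does not replace reading the outputs yourself.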
Step 2: The Environment (Google Colab / RunPod)
You cannot run this on your MacBook Air. You need an NVIDIA GPU. We recommend a free Google Colab T4 instance for prototyping, or an A100 instance on RunPod for production training.
Open a new Jupyter Notebook on a GPU instance and install the dependencies. We use Unsloth because it makes fine-tuning Llama models roughly 2x faster while using substantially less VRAM.
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
Step 3: Loading the Base Model
We will load the base Llama-3-8B-Instruct model in 4-bit quantization. This shrinks the model from 16GB of VRAM to under 6GB, allowing it to fit on a cheap GPU.
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # Max context length to train on

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)
Step 4: Applying LoRA Adapters
Training an 8-billion parameter model from scratch costs millions. LoRA freezes the original 8 billion weights and injects small trainable low-rank matrices alongside them, typically 1-2% of the total parameter count. We only train those adapters. For tasks like tone and formatting, the results come close to what full fine-tuning achieves.
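To make that percentage concrete, the arithmetic for a single projection matrix is easy to check. For a 4096-wide hidden layer (Llama 3 8B's hidden size) and rank r = 16, LoRA trains two thin matrices, A (r x 4096) and B (4096 x r), instead of the full square matrix:

```python
# LoRA parameter count for one 4096 x 4096 projection at rank r = 16.
d, r = 4096, 16

full_params = d * d              # weights in the frozen base matrix
lora_params = r * d + d * r      # A (r x d) plus B (d x r)

print(full_params)               # -> 16777216
print(lora_params)               # -> 131072
print(lora_params / full_params) # -> 0.0078125, i.e. under 1% per matrix
```

Summed across all the target modules listed below, the trainable fraction lands in the low single digits, which is why the training run fits on a cheap GPU.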
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # Rank. Higher = larger adapter. 16 works well for tone/formatting.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,  # 0 is the fast path in Unsloth
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # Trades compute for VRAM
    random_state = 3407,
)
Step 5: Formatting the Dataset for Llama 3
Llama 3 uses a very specific chat template (with headers like <|start_header_id|>). We must format our JSONL data through the tokenizer to match the exact template the model expects.
from datasets import load_dataset

dataset = load_dataset("json", data_files={"train": "my_dataset.jsonl"}, split="train")

# Define the Llama 3 prompt format
prompt_template = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>
{instruction}
{input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{output}<|eot_id|>"""

EOS_TOKEN = tokenizer.eos_token  # Tells the model where generation should stop

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        text = prompt_template.format(instruction=instruction, input=input_text, output=output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
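Before training, print one formatted example (`print(dataset[0]["text"])` in the notebook) and eyeball it. As a self-contained sketch of what a correctly rendered example should look like, using the same template string (the sample record is illustrative):

```python
# Same Llama 3 template as above; render one toy record to inspect the result.
prompt_template = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>
{instruction}
{input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{output}<|eot_id|>"""

sample = prompt_template.format(
    instruction="Convert this raw patient note into a structured billing report.",
    input="Patient has a mild fever, prescribed rest.",
    output="BILLING CODE: Z00.\nSUMMARY: Mild fever, rest prescribed.",
)
print(sample)

# Both role headers must survive formatting, or the model
# will learn the wrong turn boundaries.
assert "<|start_header_id|>user<|end_header_id|>" in sample
assert "<|start_header_id|>assistant<|end_header_id|>" in sample
```

If the special tokens are missing or doubled in your real data, fix the template before wasting a training run.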
Step 6: Executing the Training Run
This is where your GPU catches fire (metaphorically). We pass our model and formatted dataset into TRL's SFTTrainer, a thin wrapper around the Hugging Face Trainer.
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,  # Set to True to pack many short examples per sequence for speed
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,  # Effective batch size = 2 x 4 = 8
        warmup_steps = 5,
        max_steps = 60,  # Tune this to dataset size. Too high = overfitting.
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",  # 8-bit optimizer states save VRAM
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
trainer_stats = trainer.train()
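A reasonable way to choose max_steps is to work backwards from an epoch target: the effective batch size is per_device_train_batch_size times gradient_accumulation_steps, and each step consumes one effective batch. A sketch of the arithmetic (3 epochs over 200 examples is an illustrative target, not a rule):

```python
import math

def steps_for_epochs(num_examples, epochs, batch_size, grad_accum):
    """Translate an epoch target into a max_steps value for TrainingArguments."""
    effective_batch = batch_size * grad_accum
    return math.ceil(num_examples * epochs / effective_batch)

# 200 examples, 3 epochs, batch 2, accumulation 4 -> effective batch of 8
print(steps_for_epochs(200, 3, batch_size=2, grad_accum=4))  # -> 75
```

With the default max_steps = 60 and a 200-example dataset, you get a bit over two epochs, which is a sensible starting point; watch the training loss and stop early if it flattens near zero.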
Step 7: Testing & Exporting
Once training is complete, the weights are stored in the adapter. You can instantly test it in your notebook:
# Enable native inference speed
FastLanguageModel.for_inference(model)

# Reuse the training template, but strip the trailing <|eot_id|> --
# otherwise the prompt already signals "turn over" and generation stops early.
prompt = prompt_template.format(
    instruction="Convert this raw patient note into a structured billing report.",
    input="Patient has a mild fever, prescribed rest.",
    output="",
).removesuffix("<|eot_id|>")

inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
print(tokenizer.batch_decode(outputs))
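Note that batch_decode returns the full prompt plus the completion. To recover just the model's answer, you can split on the assistant header; a minimal parsing sketch (the decoded string below is a stand-in for real model output):

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"

def extract_reply(decoded: str) -> str:
    """Return only the assistant's completion from a decoded Llama 3 sequence."""
    reply = decoded.rsplit(ASSISTANT_HEADER, 1)[-1]       # text after the last assistant turn
    return reply.split("<|eot_id|>", 1)[0].strip()        # cut at end-of-turn, trim whitespace

decoded = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
    "Summarize.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
    "BILLING CODE: Z00.<|eot_id|>"
)
print(extract_reply(decoded))  # -> BILLING CODE: Z00.
```

The same function works on the real decoded output, since the fine-tuned model emits the identical header and end-of-turn tokens it was trained on.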
If it outputs your exact billing structure seamlessly, you have succeeded. Your final step is to save the adapter on its own, or merge it into the base weights and export a GGUF file so you can run it locally on your MacBook using Ollama or LM Studio.
# Save the LoRA adapter only (small, but needs the base model to run)
model.save_pretrained("my_custom_lora_model")

# Or merge and export to GGUF for Ollama / LM Studio
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
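Once the GGUF export finishes, Ollama can load it through a Modelfile. A minimal sketch (the model name is an arbitrary choice, and the exact .gguf filename Unsloth writes may differ, so check the output folder):

```
# Modelfile -- point Ollama at the exported GGUF
FROM ./model/unsloth.Q4_K_M.gguf
```

Then register and run it locally with `ollama create my-billing-model -f Modelfile` followed by `ollama run my-billing-model`.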
Conclusion
Fine-tuning is a paradigm shift. Instead of pleading with the model in a prompt to "please respond only in JSON", you bake the behavior into the weights so that your format becomes the model's default. When you master LoRA fine-tuning, you cross the gap from Prompt Engineer to Machine Learning Practitioner.