I was impressed by how fine-tuned large language models can outperform retrieval-augmented systems, especially at inference time, so I set out to fine-tune an open-source model: Meta's Llama 3.1 (8B parameters). But most sources insisted you needed giant, expensive GPUs and far more storage than I had.
Determined to find another way, I tried it on a budget and discovered that you don't need a fancy setup to customize a state-of-the-art model.
The Hurdles: Time, Memory, and Disk
- GPU RAM required: 16 GB
- GPU time limit: 4 hours (hard constraint)
- Disk storage: limited, 16 GB quota
My Secret Weapon: LoRA (Low-Rank Adaptation)
Instead of retraining all 8 billion weights, LoRA freezes the base model and tacks on small "adapter" matrices. Think of it as fine-tuning a car's suspension rather than redesigning the entire engine.
Parameter comparison:
- Full model: 8B parameters
- LoRA adapter: 65K parameters
That's 250× fewer parameters!
Adapter sizes:
- Mini (rank 8): ~81 MB
- Medium (rank 16): ~161 MB
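In code, attaching the adapters takes only a few lines with Hugging Face's peft library. The snippet below is a minimal sketch of the idea rather than my exact training setup; the target modules and hyperparameters are illustrative assumptions:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the frozen base model (model name and settings here are illustrative)
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# Rank-8 LoRA adapter on the attention projections (the "mini" configuration)
lora_config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # which weight matrices get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # confirms how few weights actually train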
My Setup on Northeastern's Cluster
I tapped into the Discovery cluster, which has:
- H200 GPUs: 141 GB VRAM
- A100 GPUs: 40 GB VRAM
- V100 GPUs: 32 GB VRAM
Jobs are managed by SLURM, with a hard 4-hour limit. To avoid eating disk quota, I redirected all caches to scratch space:
# Point all Hugging Face caches at scratch space instead of the home-directory quota
export CACHE_DIR="${SCRATCH}/llama3_finetune/cache/huggingface"
export HF_HOME="$CACHE_DIR"
export TRANSFORMERS_CACHE="$CACHE_DIR"
Organizing the Project
I called it LlamaVox, with a simple structure:
LlamaVox/
├── config/ # LoRA settings
├── data/ # Training files
├── models/ # Saved adapters
├── slurm/ # Job scripts
└── src/ # Training code
My Tech Stack
Python training code built on Hugging Face Transformers with LoRA adapters, plus SLURM batch scripts for scheduling and a bash launcher (run_model.sh) for environment setup.
Prepping the Data
I built three datasets:
- Mini dataset: 5K examples (2.2 MB)
- Medium dataset: 50K examples (larger scale)
- Synthetic dataset: 1K generated examples
Each example looked like this:
{
  "conversations": [
    {"role": "user", "content": "What are the main challenges of urban planning?"},
    {"role": "assistant", "content": "Urban planning faces several key challenges..."}
  ]
}
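Turning those JSON conversations into training text is straightforward with the datasets library and Llama 3.1's chat template. This is a hedged sketch, not my actual preprocessing script; the instruct model name is an assumption:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# File path matches the project layout; the tokenizer name is illustrative
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
dataset = load_dataset("json", data_files="data/mini.json", split="train")

def to_text(example):
    # Render each conversation with the model's chat template
    return {"text": tokenizer.apply_chat_template(example["conversations"], tokenize=False)}

dataset = dataset.map(to_text)
print(dataset[0]["text"][:200])  # sanity-check the formatting
```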
Running the Training Jobs
Here's a snippet of my SLURM script for an H200 GPU:
#!/bin/bash
#SBATCH --job-name=llama3_h200
#SBATCH --time=4:00:00
#SBATCH --gres=gpu:h200:1
#SBATCH --mem=96G
#SBATCH --cpus-per-task=16
python src/train.py \
    --model llama-3.1 \
    --dataset data/mini.json \
    --lora_rank 8 \
    --output_dir models/mini_h200
Mini Dataset Results on H200
- Start: July 7, 2025, 8:06 PM EDT
- Runtime: 18 minutes
- Loss: 0.1164 → 0.0229
- Token accuracy: 98.98%
GPU Performance Comparison
| GPU | VRAM | Mini (5K) runtime | Medium (50K) runtime | Adapter size |
|---|---|---|---|---|
| H200 | 141 GB | 18 min | ~2.5 hours | 81–161 MB |
| A100 | 40 GB | 25 min | ~3 hours | 81–161 MB |
| V100 | 32 GB | 40 min | Failed | 81 MB only |
- H200: blistering speed, but scarce
- A100: great balance of power and availability
- V100: OK for tiny jobs, but it hits the 4-hour wall
Performance Optimization Tricks
- Gradient accumulation: fake bigger batches with less memory.
- Mixed-precision training: use 16-bit floats to halve memory use.
- 8-bit quantization: load the base model in 8-bit, freeing up VRAM.
- Smart checkpointing: save every 10 minutes so you can resume if you hit the time limit.
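Most of these tricks reduce to a few flags in the Hugging Face stack. The snippet below is a hedged sketch of how they fit together, not my exact configuration; the batch sizes, save interval, and model name are assumptions:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

# 8-bit quantization: load the frozen base model in 8-bit to free up VRAM
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    attn_implementation="eager",  # avoids the flash_attn issue (see Troubleshooting below)
)

args = TrainingArguments(
    output_dir="models/mini_h200",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch of 16 without the extra memory
    bf16=True,                      # mixed precision (use fp16=True on older GPUs like the V100)
    save_steps=100,                 # checkpoint often enough to resume before the 4-hour limit
    save_total_limit=2,             # keep saved checkpoints inside the 16 GB disk quota
)
```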
Troubleshooting: Issues & Fixes
| Issue | Status | Fix Summary |
|---|---|---|
| Flash Attention Compatibility | Fixed | Uninstalled flash_attn; set attn_implementation="eager" |
| Disk Quota Exceeded | Fixed | Redirected HF cache to scratch space |
| Hugging Face Authentication Errors | Fixed | Added explicit huggingface_hub.login() and token management |
| Environment Setup Complexity | Fixed | Created run_model.sh for automated setup and GPU checks |
| Out-of-Memory (OOM) Errors | Ongoing | 8-bit quantization, smaller batches, request more VRAM, monitor |
| Model Access & Permissions | Fixed | Verified permissions; added access checks before download attempts |
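For the authentication and access issues in particular, the fix boiled down to logging in explicitly and checking repo access before any download. A minimal sketch, assuming the token is exported as HF_TOKEN (the real checks live in run_model.sh and the training code):

```python
import os
from huggingface_hub import login, model_info

# Token assumed to be exported as HF_TOKEN before the job starts
login(token=os.environ["HF_TOKEN"])

# Verify access to the gated repo before spending GPU time on a download
model_info("meta-llama/Llama-3.1-8B")  # raises if the token lacks access
print("Model access confirmed")
```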
Quick Start Guide
# Grab a GPU node
srun --gres=gpu:1 --mem=32G --time=1:00:00 --pty bash
# Run the launcher
./llama3_finetune/run_model.sh
What I Learned
The key insight is that you don't need expensive hardware to fine-tune large language models. With LoRA, smart optimization, and university cluster resources, you can achieve impressive results on a budget. The 4-hour time limit actually forced me to be more efficient and creative with my approach.
This project showed me that democratizing AI isn't just about open-source models—it's about making the fine-tuning process accessible to students and researchers with limited resources. By sharing these techniques, I hope to inspire others to experiment with large language models without breaking the bank.
Ready to Try It Yourself?
Check out the full code and detailed setup instructions on my GitHub repository.