I was impressed by how fine-tuned large language models can outperform retrieval-augmented systems, especially at inference time, so I set out to fine-tune an open-source model: Meta's Llama 3.1 (8B parameters). But most sources insisted you needed giant, expensive GPUs and far more storage than I had.
Determined to find another way, I tried it on a budget and discovered that you don't need a fancy setup to customize a state-of-the-art model.
The Hurdles: Time, Memory, and Disk
- GPU RAM required: 16 GB
- GPU time limit: 4 hours (hard constraint)
- Disk storage: limited, 16 GB quota
My Secret Weapon: LoRA (Low-Rank Adaptation)
Instead of retraining all 8 billion weights, LoRA freezes the base model and tacks on small "adapter" matrices. Think of it as fine-tuning a car's suspension rather than redesigning the entire engine.
Parameter comparison:
- Full model: 8B parameters
- LoRA adapter: 65K parameters
That's 250× fewer parameters!
Adapter sizes:
- Mini (rank 8): ~81 MB
- Medium (rank 16): ~161 MB
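In code, attaching the adapters takes only a few lines with Hugging Face's peft library. The snippet below is a minimal sketch of the idea rather than my exact training setup; the target modules and hyperparameters are illustrative assumptions:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the frozen base model (model name and settings here are illustrative)
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# Rank-8 LoRA adapter on the attention projections (the "mini" configuration)
lora_config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # which weight matrices get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # confirms how few weights actually train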
My Setup on Northeastern's Cluster
I tapped into the Discovery cluster, which has:
- H200 GPUs: 141 GB VRAM
- A100 GPUs: 40 GB VRAM
- V100 GPUs: 32 GB VRAM
Jobs are managed by SLURM, with a hard 4-hour limit. To avoid eating disk quota, I redirected all caches to scratch space:
# Point all Hugging Face caches at scratch space instead of the home-directory quota
export CACHE_DIR="${SCRATCH}/llama3_finetune/cache/huggingface"
export HF_HOME="$CACHE_DIR"
export TRANSFORMERS_CACHE="$CACHE_DIR"
Organizing the Project
I called it LlamaVox, with a simple structure:
LlamaVox/
├── config/ # LoRA settings
├── data/ # Training files
├── models/ # Saved adapters
├── slurm/ # Job scripts
└── src/ # Training code
My Tech Stack
Python training code built on Hugging Face Transformers with LoRA adapters, plus SLURM batch scripts for scheduling and a bash launcher (run_model.sh) for environment setup.
Prepping the Data
I built three datasets:
- Mini dataset: 5K examples (2.2 MB)
- Medium dataset: 50K examples (larger scale)
- Synthetic dataset: 1K generated examples
Each example looked like this:
{
  "conversations": [
    {"role": "user", "content": "What are the main challenges of urban planning?"},
    {"role": "assistant", "content": "Urban planning faces several key challenges..."}
  ]
}
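Turning those JSON conversations into training text is straightforward with the datasets library and Llama 3.1's chat template. This is a hedged sketch, not my actual preprocessing script; the instruct model name is an assumption:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# File path matches the project layout; the tokenizer name is illustrative
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
dataset = load_dataset("json", data_files="data/mini.json", split="train")

def to_text(example):
    # Render each conversation with the model's chat template
    return {"text": tokenizer.apply_chat_template(example["conversations"], tokenize=False)}

dataset = dataset.map(to_text)
print(dataset[0]["text"][:200])  # sanity-check the formatting
```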
Running the Training Jobs
Here's a snippet of my SLURM script for an H200 GPU:
#!/bin/bash
#SBATCH --job-name=llama3_h200
#SBATCH --time=4:00:00
#SBATCH --gres=gpu:h200:1
#SBATCH --mem=96G
#SBATCH --cpus-per-task=16
python src/train.py \
    --model llama-3.1 \
    --dataset data/mini.json \
    --lora_rank 8 \
    --output_dir models/mini_h200
Mini Dataset Results on H200
- Start: July 7, 2025, 8:06 PM EDT
- Runtime: 18 minutes
- Loss: 0.1164 → 0.0229
- Token accuracy: 98.98%
GPU Performance Comparison
| GPU | VRAM | Mini (5K) runtime | Medium (50K) runtime | Adapter size |
|---|---|---|---|---|
| H200 | 141 GB | 18 min | ~2.5 hours | 81–161 MB |
| A100 | 40 GB | 25 min | ~3 hours | 81–161 MB |
| V100 | 32 GB | 40 min | Failed | 81 MB only |
- H200: blistering speed, but scarce
- A100: great balance of power and availability
- V100: OK for tiny jobs, but it hits the 4-hour wall
Performance Optimization Tricks
- Gradient accumulation: fake bigger batches with less memory.
- Mixed-precision training: use 16-bit floats to halve memory use.
- 8-bit quantization: load the base model in 8-bit, freeing up VRAM.
- Smart checkpointing: save every 10 minutes so you can resume if you hit the time limit.
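Most of these tricks reduce to a few flags in the Hugging Face stack. The snippet below is a hedged sketch of how they fit together, not my exact configuration; the batch sizes, save interval, and model name are assumptions:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

# 8-bit quantization: load the frozen base model in 8-bit to free up VRAM
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    attn_implementation="eager",  # avoids the flash_attn issue (see Troubleshooting below)
)

args = TrainingArguments(
    output_dir="models/mini_h200",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch of 16 without the extra memory
    bf16=True,                      # mixed precision (use fp16=True on older GPUs like the V100)
    save_steps=100,                 # checkpoint often enough to resume before the 4-hour limit
    save_total_limit=2,             # keep saved checkpoints inside the 16 GB disk quota
)
```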
Troubleshooting: Issues & Fixes
| Issue | Status | Fix Summary |
|---|---|---|
| Flash Attention Compatibility | Fixed | Uninstalled flash_attn; set attn_implementation="eager" |
| Disk Quota Exceeded | Fixed | Redirected HF cache to scratch space |
| Hugging Face Authentication Errors | Fixed | Added explicit huggingface_hub.login() and token management |
| Environment Setup Complexity | Fixed | Created run_model.sh for automated setup and GPU checks |
| Out-of-Memory (OOM) Errors | Ongoing | 8-bit quantization, smaller batches, request more VRAM, monitor |
| Model Access & Permissions | Fixed | Verified permissions; added access checks before download attempts |
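For the authentication and access issues in particular, the fix boiled down to logging in explicitly and checking repo access before any download. A minimal sketch, assuming the token is exported as HF_TOKEN (the real checks live in run_model.sh and the training code):

```python
import os
from huggingface_hub import login, model_info

# Token assumed to be exported as HF_TOKEN before the job starts
login(token=os.environ["HF_TOKEN"])

# Verify access to the gated repo before spending GPU time on a download
model_info("meta-llama/Llama-3.1-8B")  # raises if the token lacks access
print("Model access confirmed")
```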
Quick Start Guide
# Grab a GPU node
srun --gres=gpu:1 --mem=32G --time=1:00:00 --pty bash
# Run the launcher
./llama3_finetune/run_model.sh
What I Learned
The key insight is that you don't need expensive hardware to fine-tune large language models. With LoRA, smart optimization, and university cluster resources, you can achieve impressive results on a budget. The 4-hour time limit actually forced me to be more efficient and creative with my approach.
This project showed me that democratizing AI isn't just about open-source models—it's about making the fine-tuning process accessible to students and researchers with limited resources. By sharing these techniques, I hope to inspire others to experiment with large language models without breaking the bank.
Ready to Try It Yourself?
Check out the full code and detailed setup instructions on my GitHub repository.