AI/ML Research · Dec 15, 2024

How I Customized Llama 3.1 8B on a Budget

Democratizing AI: Fine-tuning Large Language Models with Limited Resources

Graduate Student Research · Northeastern University · 15 min read

I was impressed by how fine-tuned large language models can outperform retrieval-augmented systems, especially at inference time, where they skip the retrieval step entirely. So I set out to fine-tune an open-source model like Meta's Llama 3.1 (8B parameters). But most sources insisted you needed giant, expensive GPUs and tons of storage, resources I simply didn't have.

Determined to find another way, I tried it on a budget and discovered you don't need a fancy setup to customize a state-of-the-art model.

The Hurdles: Time, Memory, and Disk

GPU RAM required: 16 GB
GPU time limit: 4 hours (a hard constraint)
Disk storage: 16 GB quota

My Secret Weapon: LoRA (Low-Rank Adaptation)

Instead of retraining all 8 billion weights, LoRA freezes the base model and trains small "adapter" matrices tacked on top. Think of it as fine-tuning a car's suspension rather than redesigning the entire engine.

Parameter Comparison

Full model: 8B parameters in total; a single 4096 × 4096 projection matrix alone holds ~16.8M weights.
LoRA (rank 8): two thin 4096 × 8 factors per adapted matrix, about 65K parameters.

That's roughly 250× fewer parameters per adapted weight matrix!
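Where do those numbers come from? A quick back-of-the-envelope check in Python makes it concrete; the only assumption here is Llama 3.1 8B's hidden size of 4096:

# LoRA parameter arithmetic for one 4096 x 4096 weight matrix
d = 4096                        # Llama 3.1 8B hidden size
r = 8                           # LoRA rank ("Mini" config)

full_matrix = d * d             # 16,777,216 weights in the original matrix
lora_factors = d * r + r * d    # 65,536 weights in the two low-rank factors

print(full_matrix // lora_factors)  # 256 -> roughly 250x fewer per matrix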

Adapter Sizes

Mini (rank 8): 81 MB
Medium (rank 16): 161 MB
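In code, setting this up takes only a few lines with the PEFT library from my stack. This is a minimal sketch; the alpha, dropout, and target modules shown are illustrative choices, not necessarily the exact ones I used:

# Minimal LoRA setup sketch with PEFT (hyperparameters are illustrative)
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

config = LoraConfig(
    r=8,                     # "Mini" rank; r=16 gives the "Medium" adapter
    lora_alpha=16,           # scaling factor for the adapter updates
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # confirms only the adapters are trainable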

My Setup on Northeastern's Cluster

I tapped into the Discovery cluster, which has:

H200 GPU: 141 GB VRAM
A100 GPU: 40 GB VRAM
V100 GPU: 32 GB VRAM

Jobs are managed by SLURM, with a hard 4-hour limit. To avoid eating disk quota, I redirected all caches to scratch space:

# Keep Hugging Face downloads out of the 16 GB home quota
export CACHE_DIR="${SCRATCH}/llama3_finetune/cache/huggingface"
export HF_HOME="$CACHE_DIR"
export TRANSFORMERS_CACHE="$CACHE_DIR"  # legacy variable, kept for compatibility

Organizing the Project

I called it LlamaVox, with a simple structure:

LlamaVox/
├── config/          # LoRA settings
├── data/            # Training files
├── models/          # Saved adapters
├── slurm/           # Job scripts
└── src/             # Training code

My Tech Stack

PyTorch 2.6.0
Transformers 4.53.0
PEFT 0.16.0
TRL 0.19.0
Accelerate
bitsandbytes
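To keep the cluster environment reproducible, these pins can live in a requirements.txt (the last two entries were unversioned, so pip resolves them):

torch==2.6.0
transformers==4.53.0
peft==0.16.0
trl==0.19.0
accelerate
bitsandbytes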

Prepping the Data

I built three datasets:

Mini: 5K examples (2.2 MB)
Medium: 50K examples (larger scale)
Synthetic: 1K generated examples

Each example looked like this:

{
  "conversations": [
    {"role": "user", "content": "What are the main challenges of urban planning?"},
    {"role": "assistant", "content": "Urban planning faces several key challenges..."}
  ]
}
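Before training, each record gets rendered with the model's chat template. Here's a sketch of what the tokenizer actually sees (model id assumed, and gated behind Hugging Face access):

# Render one conversation with Llama 3.1's chat template
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

conversation = [
    {"role": "user", "content": "What are the main challenges of urban planning?"},
    {"role": "assistant", "content": "Urban planning faces several key challenges..."},
]

# tokenize=False returns the formatted training string instead of token ids
print(tokenizer.apply_chat_template(conversation, tokenize=False))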

Running the Training Jobs

Here's a snippet of my SLURM script for an H200 GPU:

#!/bin/bash
#SBATCH --job-name=llama3_h200
#SBATCH --time=4:00:00
#SBATCH --gres=gpu:h200:1
#SBATCH --mem=96G
#SBATCH --cpus-per-task=16

python src/train.py \
  --model llama-3.1 \
  --dataset data/mini.json \
  --lora_rank 8 \
  --output_dir models/mini_h200
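The src/train.py behind that command isn't reproduced here, but its core is a standard TRL supervised fine-tuning loop. A condensed sketch, assuming the dataset's "conversations" key is renamed to the "messages" key TRL expects:

# Hypothetical core of src/train.py: LoRA fine-tuning with TRL's SFTTrainer
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="data/mini.json", split="train")
dataset = dataset.rename_column("conversations", "messages")

args = SFTConfig(
    output_dir="models/mini_h200",
    per_device_train_batch_size=4,   # small per-step batch to fit in VRAM
    gradient_accumulation_steps=4,   # effective batch size of 16
    bf16=True,                       # mixed-precision training
    save_steps=100,                  # frequent checkpoints for the 4-hour limit
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    args=args,
    train_dataset=dataset,
    peft_config=LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"),
)
trainer.train()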

Mini Dataset Results on H200

Start: July 7, 2025, 8:06 PM EDT
Runtime: 18 minutes
Loss: 0.1164 → 0.0229
Token accuracy: 98.98%

GPU Performance Comparison

GPU     VRAM      Mini (5K)   Medium (50K)   Adapter Size
H200    141 GB    18 min      ~2.5 hours     81–161 MB
A100    40 GB     25 min      ~3 hours       81–161 MB
V100    32 GB     40 min      Failed         81 MB only

H200: blistering speed, but scarce
A100: great balance of power and availability
V100: OK for tiny jobs, but hits the 4-hour wall

Performance Optimization Tricks

Gradient accumulation: fake bigger batches with less memory.
Mixed-precision training: use 16-bit floats to roughly halve memory use.
8-bit quantization: load the base model in 8-bit, freeing up VRAM (see the sketch after this list).
Smart checkpointing: save every 10 minutes so you can resume if you hit the time limit.
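Here's how the 8-bit trick looks when loading the base model with bitsandbytes. This is a minimal sketch, not my exact loading code, and it also applies the eager-attention fix from the troubleshooting table below:

# Load the base model quantized to 8-bit to free up VRAM
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_8bit=True)  # bitsandbytes 8-bit weights

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",      # assumed model id
    quantization_config=bnb,
    device_map="auto",                       # let Accelerate place the layers
    attn_implementation="eager",             # sidesteps the flash_attn issue
)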

Troubleshooting: Issues & Fixes

Issue                                Status    Fix Summary
Flash Attention compatibility        Fixed     Uninstalled flash_attn; set attn_implementation="eager"
Disk quota exceeded                  Fixed     Redirected the HF cache to scratch space
Hugging Face authentication errors   Fixed     Added explicit huggingface_hub.login() and token management
Environment setup complexity         Fixed     Created run_model.sh for automated setup and GPU checks
Out-of-memory (OOM) errors           Ongoing   8-bit quantization, smaller batches, more VRAM, monitoring
Model access & permissions           Fixed     Verified permissions; added access checks before downloads

Quick Start Guide

# Grab a GPU node
srun --gres=gpu:1 --mem=32G --time=1:00:00 --pty bash

# Run the launcher
./llama3_finetune/run_model.sh

What I Learned

The key insight is that you don't need expensive hardware to fine-tune large language models. With LoRA, smart optimization, and university cluster resources, you can achieve impressive results on a budget. The 4-hour time limit actually forced me to be more efficient and creative with my approach.

This project showed me that democratizing AI isn't just about open-source models—it's about making the fine-tuning process accessible to students and researchers with limited resources. By sharing these techniques, I hope to inspire others to experiment with large language models without breaking the bank.

Ready to Try It Yourself?

Check out the full code and detailed setup instructions on my GitHub repository.
