Fine-Tuning LLMs

Fine-tuning large language models (LLMs) has become one of the most important steps in adapting foundation models to domain-specific tasks such as customer support, code generation, legal analysis, healthcare assistants, and enterprise copilots. While full-model training remains expensive, open-source libraries now make it possible to fine-tune models efficiently on modest hardware using techniques like LoRA, QLoRA, quantization, and distributed training.

Fine-tuning a 70B model the standard way needs upwards of 280GB of VRAM. The weights alone take 140GB in FP16, optimizer states add at least another 140GB, and gradients and activations pile on top of that. You're looking at hardware most teams can't access.
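The arithmetic behind those numbers, spelled out (the 2-bytes-per-parameter optimizer figure matches the text's estimate; full FP32 Adam state would be larger still):

```python
# Back-of-envelope VRAM for standard fine-tuning of a 70B model.
# Assumption (matching the figures above): FP16 weights at 2 bytes per
# parameter, optimizer state estimated at another 2 bytes per
# parameter -- full FP32 Adam state would be 8+ bytes per parameter.
params = 70 * 10**9

weights_gb = params * 2 // 10**9     # FP16 weights: 140 GB
optimizer_gb = params * 2 // 10**9   # optimizer state: 140 GB more

print(weights_gb + optimizer_gb)     # 280 GB before gradients/activations
```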

The standard approach doesn't scale. By the same arithmetic, training Llama 4 Maverick (400B parameters) or Qwen 3.5 397B would require multi-node GPU clusters costing hundreds of thousands of dollars.

Ten open-source libraries changed this by rewriting how training happens. Custom kernels, smarter memory management, and more efficient algorithms make it possible to fine-tune frontier models on consumer GPUs.

Here’s what each library does and when to use it:

1. Unsloth

Unsloth cuts VRAM usage by up to 70% and roughly doubles training speed through hand-written Triton GPU kernels.

Standard PyTorch attention runs the query, key, and value projections as three separate operations. Each one launches its own kernel, allocates intermediate tensors, and stores them in VRAM. Unsloth fuses all three into a single kernel that never materializes those intermediates.
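The effect of fusion can be illustrated in plain Python (a toy stand-in for what happens on the GPU, not Unsloth's Triton code): the unfused version makes three passes and keeps three intermediate buffers alive; the fused version makes one pass and materializes nothing extra.

```python
# Toy illustration of kernel fusion (plain Python, not real GPU code).
# "Unfused": three passes over the input, three intermediate buffers.
def qkv_unfused(x, wq, wk, wv):
    q = [xi * wq for xi in x]   # pass 1, full-size intermediate
    k = [xi * wk for xi in x]   # pass 2, full-size intermediate
    v = [xi * wv for xi in x]   # pass 3, full-size intermediate
    return list(zip(q, k, v))

# "Fused": one pass, each (q, k, v) triple produced together,
# no full-size intermediates ever kept around.
def qkv_fused(x, wq, wk, wv):
    return [(xi * wq, xi * wk, xi * wv) for xi in x]

x = [1.0, 2.0, 3.0]
assert qkv_unfused(x, 0.5, 2.0, -1.0) == qkv_fused(x, 0.5, 2.0, -1.0)
```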

Gradient checkpointing is selective, too. During backpropagation you need activations from the forward pass. Standard checkpointing throws everything away and recomputes it all; Unsloth recomputes only attention and layer normalization (the memory bottlenecks) and caches everything else.

What you can train:

  • Qwen 3.5 27B on a single 24GB RTX 4090 using QLoRA
  • Llama 4 Scout (109B total, 17B active per token) on an 80GB GPU
  • Gemma 3 27B with full fine-tuning on consumer hardware
  • MoE models like Qwen 3.5 35B-A3B (12x faster than standard frameworks)
  • Vision-language models with multimodal inputs
  • 500K context length training on 80GB GPUs

Training methods:

  • LoRA and QLoRA (4-bit and 8-bit quantization)
  • Full parameter fine-tuning
  • GRPO for reinforcement learning (80% less VRAM than PPO)
  • Pretraining from scratch

For reinforcement learning, GRPO removes the critic model that PPO requires. This is what DeepSeek R1 used for its reasoning training. You get the same training quality with a fraction of the memory.

The library integrates directly with Hugging Face Transformers. Your existing training scripts work with minimal changes. Unsloth also offers Unsloth Studio, a desktop app with a WebUI if you prefer no-code training.

Unsloth GitHub Repo β†’

2. LLaMA-Factory

LLaMA-Factory provides a Gradio interface where non-technical team members can fine-tune models without writing code.

Launch the WebUI and you get a browser-based dashboard. Select your base model from a dropdown (supports Llama 4, Qwen 3.5, Gemma 3, Phi-4, DeepSeek R1, and 100+ others). Upload your dataset or choose from built-in ones. Pick your training method and configure hyperparameters using form fields. Click start.

What it handles:

  • Supervised fine-tuning (SFT)
  • Preference optimization (DPO, KTO, ORPO)
  • Reinforcement learning (PPO, GRPO)
  • Reward modeling
  • Real-time loss curve monitoring
  • In-browser chat interface for testing outputs mid-training
  • Export to Hugging Face or local saves

Memory efficiency:

  • LoRA and QLoRA with 2-bit through 8-bit quantization
  • Freeze-tuning (train only a subset of layers)
  • GaLore, DoRA, and LoRA+ for improved efficiency

This matters for teams where domain experts need to run experiments independently. Your legal team can test whether a different contract dataset improves clause extraction. Your support team can fine-tune on recent tickets without waiting for ML engineers to write training code.

Built-in integrations with LlamaBoard, Weights & Biases, MLflow, and SwanLab handle experiment tracking. If you prefer command-line work, it also supports YAML configuration files.
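For the command-line path, a config might look like this (field names follow LLaMA-Factory's published example configs; the model, dataset, and paths are placeholders, not recommendations):

```yaml
### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: alpaca_en_demo
template: llama3

### train
output_dir: saves/llama3-8b/lora/sft
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
```

A file like this is typically run with `llamafactory-cli train <config>.yaml`; check the repo's examples directory for current field names before relying on this sketch.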

LLaMA-Factory GitHub Repo β†’

3. Axolotl

Axolotl uses YAML configuration files for reproducible training pipelines. Your entire setup lives in version control.

Write one config file that specifies your base model (Qwen 3.5 397B, Llama 4 Maverick, Gemma 3 27B), dataset path and format, training method, and hyperparameters. Run it on your laptop for testing. Run the exact same file on an 8-GPU cluster for production.
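A sketch of such a config (key names follow Axolotl's documented QLoRA examples; the model, dataset path, and hyperparameters are placeholder choices):

```yaml
base_model: meta-llama/Llama-3.1-8B
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true

datasets:
  - path: data/train.jsonl   # placeholder path
    type: alpaca

sequence_len: 4096
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 2.0e-4
output_dir: ./outputs/qlora-run
```

Configs like this are typically launched with `axolotl train config.yml` (or `accelerate launch -m axolotl.cli.train config.yml` on older versions).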

Training methods:

  • LoRA and QLoRA with 4-bit and 8-bit quantization
  • Full parameter fine-tuning
  • DPO, KTO, ORPO for preference optimization
  • GRPO for reinforcement learning

The library scales from single GPU to multi-node clusters with built-in FSDP2 and DeepSpeed support. Multimodal support covers vision-language models like Qwen 3.5’s vision variants and Llama 4’s multimodal capabilities.

Six months after training, you have an exact record of what hyperparameters and datasets produced your checkpoint. Share configs across teams. A researcher’s laptop experiments use identical settings to production runs.

The tradeoff is a steeper learning curve than WebUI tools. You’re writing YAML, not clicking through forms.

Axolotl GitHub Repo →

4. Torchtune

Torchtune gives you the raw PyTorch training loop with no abstraction layers.

When you need to modify gradient accumulation, implement a custom loss function, add specific logging, or change how batches are constructed, you edit PyTorch code directly. You’re working with the actual training loop, not configuring a framework that wraps it.

Built and maintained by Meta’s PyTorch team. The codebase provides modular components (attention mechanisms, normalization layers, optimizers) that you mix and match as needed.

This matters when you’re implementing research that requires training loop modifications. Testing a new optimization algorithm. Debugging unexpected loss curves. Building custom distributed training strategies that existing frameworks don’t support.

The tradeoff is control versus convenience. You write more code than using a high-level framework, but you control exactly what happens at every step.
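What "owning the loop" means can be sketched dependency-free (plain Python standing in for the PyTorch version; in torchtune the analogous loop lives in an editable recipe file):

```python
# A dependency-free sketch of an explicit training loop: fit y = w*x
# by gradient descent on squared error. In torchtune you edit the
# analogous PyTorch loop directly -- batching, loss, logging, all of it.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy dataset, true w = 2
w, lr = 0.0, 0.05

for epoch in range(200):
    grad = 0.0
    for x, y in data:                 # batch construction: yours to change
        pred = w * x
        grad += 2 * (pred - y) * x    # custom loss goes here
    w -= lr * grad / len(data)        # optimizer step: yours to change

assert abs(w - 2.0) < 1e-3            # converged to the true weight
```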

Torchtune GitHub Repo β†’

5. TRL

TRL handles alignment after fine-tuning. You've trained your model on domain data; now you need it to follow instructions reliably.

The library takes preference pairs (output A is better than output B for this input) or reward signals and optimizes the model’s policy.

Methods supported:

  • RLHF (Reinforcement Learning from Human Feedback)
  • DPO (Direct Preference Optimization)
  • PPO (Proximal Policy Optimization)
  • GRPO (Group Relative Policy Optimization)
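The preference-pair idea reduces to a short loss function. Here is a from-scratch sketch of DPO's objective (not TRL's implementation; the log-probabilities below are made-up numbers):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: push the policy to favor the
    chosen output over the rejected one, relative to a frozen reference."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)

# When the policy already prefers the chosen output more than the
# reference does, the margin is positive and the loss is small.
low = dpo_loss(-10.0, -20.0, -12.0, -18.0)   # policy prefers chosen
high = dpo_loss(-20.0, -10.0, -18.0, -12.0)  # policy prefers rejected
assert low < high
```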

GRPO drops the critic model that PPO requires, cutting VRAM by 80% while maintaining training quality. This is what DeepSeek R1 used for reasoning training.
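The critic-free trick is visible in how GRPO computes advantages: each sampled completion is scored against the mean and spread of its own group, with no value network involved (a from-scratch sketch, not TRL's code):

```python
# Group-relative advantages, the core of GRPO: no critic network,
# just normalize each completion's reward against its sampling group.
def group_relative_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four completions sampled for the same prompt, scored by a reward model:
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Above-average completions get positive advantage, below-average negative.
assert advs[0] > 0 > advs[1]
```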

Full integration with Hugging Face Transformers, Datasets, and Accelerate means you can take any Hugging Face model, load preference data, and run alignment training with a few function calls.

This matters when supervised fine-tuning isn’t enough. Your model generates factually correct outputs but in the wrong tone. It refuses valid requests inconsistently. It follows instructions unreliably. Alignment training fixes these by directly optimizing for human preferences rather than just predicting next tokens.

TRL GitHub Repo β†’

6. DeepSpeed

DeepSpeed, Microsoft's distributed training library, is built for fine-tuning large language models that don't fit in GPU memory.

Its ZeRO optimizer partitions parameters, gradients, and optimizer states across devices instead of replicating them, and it combines this with model parallelism and gradient checkpointing to make better use of GPU memory. Training scales across multiple GPUs or machines.

It's most useful when you're working with larger models in a high-compute setup.

Key Features:

  • Distributed training across GPUs or compute nodes
  • ZeRO optimizer for massive memory savings
  • Optimized for fast inference and large-scale training
  • Works well with Hugging Face and PyTorch-based models
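ZeRO's savings come from partitioning rather than replicating model state. A back-of-envelope sketch, using the common estimate of 16 bytes per parameter for mixed-precision Adam training (2 for FP16 weights, 2 for FP16 gradients, 12 for FP32 optimizer state):

```python
# Per-GPU memory for mixed-precision Adam training, with and without
# ZeRO stage 3 partitioning all model state across N GPUs.
def per_gpu_gb(params, n_gpus, zero_stage3=True):
    bytes_per_param = 16          # 2 FP16 weights + 2 FP16 grads + 12 FP32 state
    total = params * bytes_per_param
    if zero_stage3:
        total //= n_gpus          # everything sharded, nothing replicated
    return total / 10**9

# A 13B model: ~208 GB replicated on every GPU without ZeRO...
assert per_gpu_gb(13 * 10**9, 8, zero_stage3=False) == 208.0
# ...versus ~26 GB per GPU when sharded across 8 GPUs with ZeRO-3.
assert per_gpu_gb(13 * 10**9, 8) == 26.0
```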

7. Colossal-AI: Distributed Fine-Tuning for Large Models

Colossal-AI is built for large-scale model training where memory optimization and distributed execution are essential.

Core Strengths

  • Tensor parallelism
  • Pipeline parallelism
  • Zero-redundancy optimization
  • Hybrid parallel training
  • Support for very large transformer models

It is especially useful when training models beyond single-GPU limits.

Why Colossal-AI Matters

When models reach tens of billions of parameters, ordinary PyTorch training becomes inefficient. Colossal-AI reduces GPU memory overhead and improves scaling across clusters. Its architecture is designed for production-grade AI labs and enterprise research teams.

Best Use Cases

  • Fine-tuning 13B+ models
  • Multi-node GPU clusters
  • Enterprise LLM training pipelines
  • Custom transformer research

Example Advantage

A team training a legal-domain 34B model can split model layers across GPUs while maintaining stable throughput.
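The layer-splitting step can be illustrated with a simple partitioning function (a hypothetical sketch of the idea only; Colossal-AI's own APIs handle this, plus tensor parallelism, automatically):

```python
# Sketch of pipeline-parallel layer assignment: divide a model's
# layers into contiguous stages, one stage per GPU.
def partition_layers(n_layers, n_gpus):
    base, extra = divmod(n_layers, n_gpus)
    stages, start = [], 0
    for gpu in range(n_gpus):
        size = base + (1 if gpu < extra else 0)  # spread any remainder
        stages.append(list(range(start, start + size)))
        start += size
    return stages

# A 48-layer transformer split across 4 GPUs: 12 contiguous layers each.
stages = partition_layers(48, 4)
assert [len(s) for s in stages] == [12, 12, 12, 12]
assert stages[0][0] == 0 and stages[-1][-1] == 47
```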


8. PEFT: Parameter-Efficient Fine-Tuning Made Practical

PEFT has become one of the most widely used LLM fine-tuning libraries because it dramatically reduces memory usage.

Supported Methods

  • LoRA
  • QLoRA
  • Prefix Tuning
  • Prompt Tuning
  • AdaLoRA

Why PEFT Is Popular

Instead of updating all model weights, PEFT trains only lightweight adapters. This reduces compute cost while preserving strong performance.

Major Benefits

  • Lower VRAM requirements
  • Faster experimentation
  • Easy integration with Hugging Face Transformers
  • Adapter reuse across tasks

Example Workflow

A 7B model can often be fine-tuned on a single GPU using LoRA adapters instead of full parameter updates.
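The savings are easy to quantify. A rank-r LoRA adapter on a d_out × d_in projection trains r·(d_in + d_out) parameters instead of d_out·d_in (plain arithmetic, independent of the PEFT library itself):

```python
# Trainable parameters: full fine-tuning vs. a rank-r LoRA adapter on a
# single d_out x d_in projection. LoRA trains two low-rank factors,
# A (r x d_in) and B (d_out x r), and freezes the original weight.
def full_params(d_out, d_in):
    return d_out * d_in

def lora_params(d_out, d_in, r):
    return r * d_in + d_out * r

d = 4096  # hidden size typical of a 7B-class model
full = full_params(d, d)        # 16,777,216 weights in the projection
lora = lora_params(d, d, r=8)   # 65,536 trainable adapter weights
assert round(100 * lora / full, 2) == 0.39  # ~0.39% of the weights
```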

Ideal For

  • Startups
  • Researchers
  • Custom chatbots
  • Domain adaptation projects

9. H2O LLM Studio: No-Code Fine-Tuning with GUI

H2O LLM Studio brings visual simplicity to LLM fine-tuning.

What Makes It Different

Unlike code-heavy libraries, H2O LLM Studio offers:

  • Graphical interface
  • Dataset upload tools
  • Experiment tracking
  • Hyperparameter controls
  • Side-by-side model evaluation

Why Teams Like It

Many organizations want fine-tuning without deep ML engineering overhead.

Key Features

  • LoRA support
  • 8-bit training
  • Model comparison charts
  • Hugging Face export
  • Evaluation dashboards

Best For

  • Enterprise teams
  • Analysts
  • Applied NLP practitioners
  • Rapid experimentation

It lowers the entry barrier for fine-tuning large models while still supporting modern methods.

Community Insight

Reddit users frequently recommend H2O LLM Studio for teams wanting a GUI instead of building pipelines manually.


10. bitsandbytes: The Memory Optimizer Behind Modern Fine-Tuning

bitsandbytes is one of the most important libraries behind low-memory LLM training.

Core Function

It enables:

  • 8-bit quantization
  • 4-bit quantization
  • Memory-efficient optimizers

Why It Is Critical

Without bitsandbytes, many fine-tuning tasks would exceed GPU memory limits.

Main Advantages

  • Train large models on smaller GPUs
  • Dramatically lower VRAM usage
  • Combine with PEFT for QLoRA

Example

A 13B model whose FP16 weights alone take 26GB becomes feasible on smaller hardware when quantized to 4-bit (roughly 6.5GB for the weights).
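The idea can be shown with a from-scratch absmax example (a toy sketch of the principle; bitsandbytes' real kernels use block-wise variants of this in fused GPU code):

```python
# Toy absmax quantization: map floats to int8 by scaling with the
# tensor's maximum absolute value, then round.
def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127
    return [round(v / scale) for v in values], scale

def dequantize(qvalues, scale):
    return [q * scale for q in qvalues]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

assert all(-127 <= qi <= 127 for qi in q)
# Round-trip error is bounded by half a quantization step.
assert all(abs(w - r) <= scale / 2 for w, r in zip(weights, restored))
```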

Common Pairing

bitsandbytes + PEFT is now one of the most common fine-tuning stacks.

Comparison

Here is a practical comparison of the most important open-source libraries for fine-tuning LLMs in 2026, organized by speed, ease of use, scalability, hardware efficiency, and ideal use case.

Modern LLM fine-tuning tools generally fall into four layers:

  • ⚑ Speed optimization frameworks
  • 🧠 Training orchestration frameworks
  • πŸ”§ Parameter-efficient tuning libraries
  • πŸ—οΈ Distributed infrastructure systems

The best choice depends on whether you want:

  • single-GPU speed
  • enterprise-scale distributed training
  • RLHF / DPO alignment
  • no-code UI workflows
  • low VRAM fine-tuning

Quick Comparison Table

| Library | Best For | Main Strength | Weakness |
|---|---|---|---|
| Unsloth | Fast single-GPU fine-tuning | Extremely fast + low VRAM | Limited large-scale distributed support |
| LLaMA-Factory | Beginner-friendly universal trainer | Huge model support + UI | Slightly less optimized than Unsloth |
| Axolotl | Production pipelines | Flexible YAML configs | More engineering overhead |
| Torchtune | PyTorch-native research | Clean modular recipes | Smaller ecosystem |
| TRL | Alignment / RLHF | DPO, PPO, SFT, reward training | Not speed-focused |
| DeepSpeed | Massive distributed training | Multi-node scaling | Complex setup |
| Colossal-AI | Ultra-large model training | Advanced parallelism | Steeper learning curve |
| PEFT | Low-cost fine-tuning | LoRA / QLoRA adapters | Depends on other frameworks |
| H2O LLM Studio | GUI fine-tuning | No-code workflow | Less flexible for deep customization |
| bitsandbytes | Quantization | 4-bit / 8-bit memory savings | Works as support library |

Best Stack by Use Case

For beginners:

βœ… LLaMA-Factory + PEFT + bitsandbytes

For fastest local fine-tuning:

βœ… Unsloth + PEFT + bitsandbytes

For RLHF:

βœ… TRL + PEFT

For enterprise:

βœ… Axolotl + DeepSpeed

For frontier-scale:

βœ… Colossal-AI + DeepSpeed

For no-code teams:

βœ… H2O LLM Studio


Current 2026 Community Trend

Reddit and practitioner communities increasingly use:

  • Unsloth for speed
  • LLaMA-Factory for versatility
  • Axolotl for production
  • TRL for alignment