ForgeLM — Config-driven LLM fine-tuning toolkit

Open source · Apache 2.0 · v0.6.0

YAML in.
Fine-tuned model out.

ForgeLM is the config-driven LLM fine-tuning toolkit for teams that ship into regulated environments.

EU AI Act ready · CPU-friendly audit · Air-gap capable · CI/CD integrated

~/forgelm — zsh
$ forgelm --config configs/policy-bot.yaml
# validating config…
# auditing data/policies.jsonl (8 splits, 42K rows)
# PII flags: 0 critical, 3 medium
# near-dup pairs: 12 (LSH banded, threshold 3)
# SFT epoch 1/3 loss=1.142
# DPO preference pass β=0.1, kl=4.7
# benchmark (lm-eval-harness, 6 tasks) acc=0.812 ↑
# Llama Guard safety pass S1-S14: clean
# Annex IV artifact → artifacts/annex_iv_metadata.json
finished in 47m, exit 0

6 trainer types
16 GPU profiles auto-detected
5 ingest formats
Tests passing on every commit (CI ✓)

The complete alignment stack

From base model to production-ready

Six post-training paradigms behind one declarative interface.

01 · SFT: Supervised fine-tuning on instruction pairs
02 · DPO: Direct Preference Optimization on chosen/rejected pairs
03 · SimPO: Reference-free preference learning, lower memory
04 · KTO: Kahneman-Tversky optimization on binary feedback signals
05 · ORPO: Odds Ratio Preference Optimization, single pass
06 · GRPO: Group Relative Policy Optimization for reasoning RL
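
Each paradigm sits behind the same config key. A minimal sketch, reusing the training block from the full example config further down this page; only trainer_type changes between paradigms:

training:
  trainer_type: "dpo"      # sft / dpo / simpo / kto / orpo / grpo
  num_train_epochs: 3
  per_device_train_batch_size: 2
  learning_rate: 2.0e-5
  output_dir: "./checkpoints/policy-bot"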

Why ForgeLM

Built for the path to production

Three things separate ForgeLM from notebook-first frameworks.

Declarative, not imperative

Your run is a YAML file under version control.

YAML · Pydantic · --dry-run

Compliance is shipped, not promised

Every run emits an Annex IV artifact.

Annex IV · Articles 9-17 · SHA-256 trail

Auto-revert on regression

When a benchmark drops below the floor, ForgeLM discards the trained artifacts and exits non-zero so CI gates fail loudly.

Llama Guard · exit codes · webhook

Capabilities

Everything the modern post-training stack needs

From data ingestion through evaluation to deployment artifacts.

Document ingestion

PDF, DOCX, EPUB, TXT, Markdown → SFT-ready JSONL.

forgelm ingest --strategy markdown

Dataset audit

PII, secrets, leakage, language detection, near-duplicates.

forgelm audit · streaming

QLoRA + DoRA + GaLore

4-bit NF4 quantization with PEFT adapters.

QLoRA · DoRA · GaLore
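
In config terms this is two blocks. A sketch using only keys that appear in the example config on this page (GaLore presumably has its own keys, which are not shown here):

model:
  load_in_4bit: true       # 4-bit NF4 quantization
lora:
  r: 16
  alpha: 32
  method: "dora"           # lora / DoRA / PiSSA / rsLoRA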

Safety evaluation

Llama Guard with confidence-weighted scoring.

Llama Guard · S1-S14

Benchmarking

Plug-in lm-evaluation-harness tasks.

lm-eval-harness · judge

Distributed training

DeepSpeed ZeRO presets, FSDP, multi-GPU configs.

DeepSpeed · FSDP · MoE

VRAM fit-check

Pre-flight memory estimator.

--fit-check

Interactive chat

Streaming REPL with /reset, /save, /temperature, /system commands. Per-response Llama Guard routing is planned for an upcoming Pro CLI release.

forgelm chat

Export & deploy

GGUF export with 6 quant levels.

GGUF · deploy

Library API

Use from forgelm import … to embed ForgeTrainer, audit_dataset, verify_audit_log, and verify_annex_iv_artifact calls inside your own Python pipelines.

from forgelm import … embed
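
An illustrative sketch of that embedding, using only the names listed above; every signature here is an assumption, not documented API:

from forgelm import ForgeTrainer, audit_dataset, verify_annex_iv_artifact

audit_dataset("data/policies.jsonl")                  # hypothetical signature
trainer = ForgeTrainer("configs/policy-bot.yaml")     # hypothetical signature
trainer.train()
verify_annex_iv_artifact("artifacts/annex_iv_metadata.json")  # hypothetical signature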

ISO 27001 / SOC 2 alignment

Audit-trail, change-management, data-lineage, and supply-chain evidence that the deployer's auditor asks for. The software is aligned with these frameworks, not certified against them.

SBOM · pip-audit · bandit

GDPR rights tooling

Article 17 erasure with forgelm purge and Article 15 access with forgelm reverse-pii, both wired into the audit log.

forgelm purge · forgelm reverse-pii

A real config — not a screenshot

One YAML, end-to-end run

An SFT → DPO run with safety eval, benchmark gate, and Annex IV export.

configs/policy-bot.yaml
model:
  name_or_path: "Qwen/Qwen2.5-7B-Instruct"
  load_in_4bit: true
  max_length: 4096

lora:
  r: 16
  alpha: 32
  method: "dora"           # lora / DoRA / PiSSA / rsLoRA

data:
  dataset_name_or_path: "data/policies.jsonl"

training:
  trainer_type: "sft"      # sft / dpo / simpo / kto / orpo / grpo
  num_train_epochs: 3
  per_device_train_batch_size: 2
  learning_rate: 2.0e-5
  output_dir: "./checkpoints/policy-bot"

evaluation:
  benchmark:
    enabled: true
    tasks: ["hellaswag", "arc_easy", "truthfulqa"]
    min_score: 0.65          # aggregate floor; run fails (exit 3) below this
  safety:
    enabled: true
    classifier: "meta-llama/Llama-Guard-3-8B"
    severity_thresholds: { critical: 0.0, high: 0.01 }
  require_human_approval: true   # Article 14 oversight gate

compliance:
  provider_name: "Acme Inc."
  risk_classification: "limited-risk"

webhook:
  url: "${SLACK_WEBHOOK}"

terminal
$ forgelm --config configs/policy-bot.yaml --dry-run   # validate
$ forgelm --config configs/policy-bot.yaml --fit-check  # VRAM check
$ forgelm --config configs/policy-bot.yaml             # run
Built for these workflows

Use cases that actually ship

Domain expert from policy PDFs

Drop your regulatory corpus into forgelm ingest.

forgelm quickstart domain-expert

Customer support assistant

SFT on past tickets, DPO on agent feedback.

forgelm quickstart customer-support

Code copilot fine-tune

Multi-dataset mix, SFT + ORPO, GGUF export.

forgelm quickstart code-assistant

Reasoning RL with GRPO

Built-in shaping rewards or your own reward function.

forgelm quickstart grpo-math
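
The quickstarts above share one flow. A sketch, assuming (it is not stated here) that quickstart writes a config you then run with the main command, as in the earlier terminal examples; the config filename is a placeholder:

$ forgelm quickstart customer-support
$ forgelm --config configs/<your-generated-config>.yaml --dry-run
$ forgelm --config configs/<your-generated-config>.yaml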

Ship your next fine-tune in an afternoon.

Pick a template, point it at your data, watch the audit report go green.