ForgeLM — Config-driven LLM fine-tuning toolkit

Open source · Apache 2.0 · v0.6.0

YAML in.
Fine-tuned model out.

ForgeLM is the config-driven LLM fine-tuning toolkit for teams that ship into regulated environments.

EU AI Act ready · CPU-friendly audit · Air-gap capable · CI/CD integrated

~/forgelm — zsh
$ forgelm --config configs/policy-bot.yaml
# validating config…
# auditing data/policies.jsonl (8 splits, 42K rows)
# PII flags: 0 critical, 3 medium
# near-dup pairs: 12 (LSH banded, threshold 3)
# SFT epoch 1/3 loss=1.142
# DPO preference pass β=0.1, kl=4.7
# benchmark (lm-eval-harness, 6 tasks) acc=0.812 ↑
# Llama Guard safety pass S1-S14: clean
# Annex IV artifact → artifacts/annex_iv_metadata.json
finished in 47m, exit 0

6 trainer types
16 GPU profiles auto-detected
5 ingest formats
Tests passing on every commit (CI ✓)

The complete alignment stack

From base model to production-ready

Six post-training paradigms behind one declarative interface.

01 · SFT: Supervised fine-tuning on instruction pairs
02 · DPO: Direct Preference Optimization on chosen/rejected pairs
03 · SimPO: Reference-free preference learning, lower memory
04 · KTO: Kahneman-Tversky optimization on binary feedback signals
05 · ORPO: Odds Ratio Preference Optimization, single pass
06 · GRPO: Group Relative Policy Optimization for reasoning RL
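
Each paradigm sits behind the same config key. A minimal sketch, reusing the training block from the full example config further down this page; only trainer_type changes between paradigms:

training:
  trainer_type: "dpo"      # sft / dpo / simpo / kto / orpo / grpo
  num_train_epochs: 3
  per_device_train_batch_size: 2
  learning_rate: 2.0e-5
  output_dir: "./checkpoints/policy-bot"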

Why ForgeLM

Built for the path to production

Three things separate ForgeLM from notebook-first frameworks.

Declarative, not imperative

Your run is a YAML file under version control.

YAML · Pydantic · --dry-run

Compliance is shipped, not promised

Every run emits an Annex IV artifact.

Annex IV · Articles 9-17 · SHA-256 trail

Auto-revert on regression

When a benchmark drops below the floor, ForgeLM discards the trained artifacts and exits non-zero so CI gates fail loudly.

Llama Guard · exit codes · webhook

Capabilities

Everything the modern post-training stack needs

From data ingestion through evaluation to deployment artifacts.

Document ingestion

PDF, DOCX, EPUB, TXT, Markdown → SFT-ready JSONL.

forgelm ingest --strategy markdown

Dataset audit

PII, secrets, leakage, language detection, near-duplicates.

forgelm audit · streaming

QLoRA + DoRA + GaLore

4-bit NF4 quantization with PEFT adapters.

QLoRA · DoRA · GaLore
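
In config terms this is two blocks. A sketch using only keys that appear in the example config on this page (GaLore presumably has its own keys, which are not shown here):

model:
  load_in_4bit: true       # 4-bit NF4 quantization
lora:
  r: 16
  alpha: 32
  method: "dora"           # lora / DoRA / PiSSA / rsLoRA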

Safety evaluation

Llama Guard with confidence-weighted scoring.

Llama Guard · S1-S14

Benchmarking

Plug-in lm-evaluation-harness tasks.

lm-eval-harness · judge

Distributed training

DeepSpeed ZeRO presets, FSDP, multi-GPU configs.

DeepSpeed · FSDP · MoE

VRAM fit-check

Pre-flight memory estimator.

--fit-check

Interactive chat

Streaming REPL with /reset, /save, /temperature, /system commands. Per-response Llama Guard routing is planned for an upcoming Pro CLI release.

forgelm chat

Export & deploy

GGUF export with 6 quant levels.

GGUF · deploy

Library API

Use from forgelm import … to embed ForgeTrainer, audit_dataset, verify_audit_log, and verify_annex_iv_artifact calls inside your own Python pipelines.

from forgelm import … embed
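
An illustrative sketch of that embedding, using only the names listed above; every signature here is an assumption, not documented API:

from forgelm import ForgeTrainer, audit_dataset, verify_annex_iv_artifact

audit_dataset("data/policies.jsonl")                  # hypothetical signature
trainer = ForgeTrainer("configs/policy-bot.yaml")     # hypothetical signature
trainer.train()
verify_annex_iv_artifact("artifacts/annex_iv_metadata.json")  # hypothetical signature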

ISO 27001 / SOC 2 alignment

Audit-trail, change-management, data-lineage, and supply-chain evidence that the deployer's auditor asks for. The software is aligned with these frameworks, not certified against them.

SBOM · pip-audit · bandit

GDPR rights tooling

Article 17 erasure with forgelm purge and Article 15 access with forgelm reverse-pii, both wired into the audit log.

forgelm purge · forgelm reverse-pii

A real config — not a screenshot

One YAML, end-to-end run

An SFT → DPO run with safety eval, benchmark gate, and Annex IV export.

configs/policy-bot.yaml
model:
  name_or_path: "Qwen/Qwen2.5-7B-Instruct"
  load_in_4bit: true
  max_length: 4096

lora:
  r: 16
  alpha: 32
  method: "dora"           # lora / DoRA / PiSSA / rsLoRA

data:
  dataset_name_or_path: "data/policies.jsonl"

training:
  trainer_type: "sft"      # sft / dpo / simpo / kto / orpo / grpo
  num_train_epochs: 3
  per_device_train_batch_size: 2
  learning_rate: 2.0e-5
  output_dir: "./checkpoints/policy-bot"

evaluation:
  benchmark:
    enabled: true
    tasks: ["hellaswag", "arc_easy", "truthfulqa"]
    min_score: 0.65          # aggregate floor; run fails (exit 3) below this
  safety:
    enabled: true
    classifier: "meta-llama/Llama-Guard-3-8B"
    severity_thresholds: { critical: 0.0, high: 0.01 }
  require_human_approval: true   # Article 14 oversight gate

compliance:
  provider_name: "Acme Inc."
  risk_classification: "limited-risk"

webhook:
  url: "${SLACK_WEBHOOK}"

terminal
$ forgelm --config configs/policy-bot.yaml --dry-run   # validate
$ forgelm --config configs/policy-bot.yaml --fit-check  # VRAM check
$ forgelm --config configs/policy-bot.yaml             # run
Built for these workflows

Use cases that actually ship

Domain expert from policy PDFs

Drop your regulatory corpus into forgelm ingest.

forgelm quickstart domain-expert

Customer support assistant

SFT on past tickets, DPO on agent feedback.

forgelm quickstart customer-support

Code copilot fine-tune

Multi-dataset mix, SFT + ORPO, GGUF export.

forgelm quickstart code-assistant

Reasoning RL with GRPO

Built-in shaping rewards or your own reward function.

forgelm quickstart grpo-math
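
The quickstarts above share one flow. A sketch, assuming (it is not stated here) that quickstart writes a config you then run with the main command, as in the earlier terminal examples; the config filename is a placeholder:

$ forgelm quickstart customer-support
$ forgelm --config configs/<your-generated-config>.yaml --dry-run
$ forgelm --config configs/<your-generated-config>.yaml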

Ship your next fine-tune in an afternoon.

Pick a template, point it at your data, watch the audit report go green.