🏥 Production ML · ★★★ FEATURED

Alchemist — An AutoML That Reasons

An AutoML framework where LLM agents profile the data, propose moves from a constrained action registry, and run a self-healing, paper-grounded improvement loop — across classical ML and fine-tuned small language models.

Overview

Most AutoML sweeps a grid and reports the winner. Alchemist instead puts an LLM reasoning layer in the loop: point it at a dataset (or a folder of documents) and a target, and a set of typed agents profile the data, detect the task, propose the next move, train, evaluate, and, when training crashes, diagnose the failure and patch it before retrying. The loop continues until the target metric is hit or the score plateaus.

It is an experiment in making the “AI assistant sitting next to you while you train models” repeatable and bounded. The interesting work is not the model zoo (50+ classical estimators plus SLM fine-tuning); it is the contract around the LLM: how its suggestions are typed, constrained to things the system can actually execute, and degraded gracefully when a provider misbehaves. This is a working framework with a 298-test suite, not a finished product. Its own SLM evaluation flags the fine-tuned runs as overfit, and that finding is surfaced rather than hidden (see Honest Evaluation below).

STAGE MAP — The Reasoning Loop

flowchart LR
  D["Data or documents<br/>plus target metric"] --> PROF["Profiler agent<br/>quality · types · task"]
  PROF --> PROP["Proposer<br/>pick move from ACTION_SPACE"]
  PROP --> TRAIN["Trainer<br/>classical ML · MLX · Unsloth"]
  TRAIN --> EVAL["Evaluator<br/>score vs target"]
  EVAL --> GATE{"Target met<br/>or plateau?"}
  GATE -->|"yes"| OUT["Export<br/>joblib · GGUF · ONNX"]
  GATE -->|"no"| ORCH["Orchestrator<br/>Optuna HPO · next action"]
  ORCH --> PROP
  TRAIN -.->|"crash"| REC["ErrorRecoveryAgent<br/>diagnose · patch · retry"]
  REC -.-> TRAIN

The solid path is the optimization loop; the dotted path is self-healing. Every box that talks to an LLM goes through the same typed contract, and every LLM proposal is checked against the action registry before anything executes.

A Typed Agent Contract With a Three-Tier Fallback Ladder

Every reasoning step is a BaseAgent[T] — a generic abstract class parameterized by a Pydantic response model T. Subclasses fill in three hooks: a system prompt, a prompt builder, and a heuristic fallback. The base class owns the part that actually matters in the field: getting a valid, typed object back from an unreliable model.

That happens through one method, _analyze_with_fallback, which walks a deliberate ladder:

Tier 1 — structured output. If the gateway supports it, call complete_structured and let the provider’s function-calling return the Pydantic model directly.
Tier 2 — JSON to Pydantic. If structured output throws, fall back to complete_json and run the raw dict through response_model.model_validate(...), so the schema is still enforced.
Tier 3 — heuristic. If both LLM paths fail, call the agent’s own _create_fallback() and return a hand-written default that keeps the run alive.

The design choice is that no agent can return an untyped or unvalidated result — the contract is the Pydantic model, and the only question is which tier produced it. There is a full async mirror of the ladder for concurrent runs. The result is a system that bends instead of breaking when a provider rate-limits, returns malformed JSON, or drops function-calling support — a flaky model degrades the quality of a decision, never the integrity of the run.
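The ladder can be sketched in a few dozen lines. This is a minimal illustration, not the framework's code: a dataclass with a hand-written model_validate stands in for a Pydantic model so the example has no dependencies, and the gateway call signatures, TaskGuess, TaskDetector, FlakyGateway, and the last_tier attribute are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Generic, TypeVar


@dataclass
class TaskGuess:
    """Stand-in for a Pydantic response model (hypothetical fields)."""
    task: str
    confidence: float

    @classmethod
    def model_validate(cls, raw: dict) -> "TaskGuess":
        # Mimics Pydantic's model_validate: reject missing or badly typed fields.
        task, confidence = raw["task"], raw["confidence"]
        if not isinstance(task, str) or not isinstance(confidence, (int, float)):
            raise ValueError(f"invalid payload: {raw!r}")
        return cls(task=task, confidence=float(confidence))


T = TypeVar("T")


class BaseAgent(ABC, Generic[T]):
    response_model: Any  # the response model class for this agent

    def __init__(self, gateway: Any) -> None:
        self.gateway = gateway
        self.last_tier = 0  # which tier produced the most recent result

    @abstractmethod
    def _system_prompt(self) -> str: ...

    @abstractmethod
    def _build_prompt(self, payload: dict) -> str: ...

    @abstractmethod
    def _create_fallback(self) -> T: ...

    def _analyze_with_fallback(self, payload: dict) -> T:
        system, prompt = self._system_prompt(), self._build_prompt(payload)
        # Tier 1: provider-side structured output, if the gateway offers it.
        if hasattr(self.gateway, "complete_structured"):
            try:
                result = self.gateway.complete_structured(system, prompt, self.response_model)
                self.last_tier = 1
                return result
            except Exception:
                pass  # fall through to the next tier
        # Tier 2: raw JSON, validated against the same schema.
        try:
            raw = self.gateway.complete_json(system, prompt)
            result = self.response_model.model_validate(raw)
            self.last_tier = 2
            return result
        except Exception:
            pass
        # Tier 3: hand-written heuristic default.
        self.last_tier = 3
        return self._create_fallback()


class TaskDetector(BaseAgent[TaskGuess]):
    response_model = TaskGuess

    def _system_prompt(self) -> str:
        return "Decide whether the target implies classification or regression."

    def _build_prompt(self, payload: dict) -> str:
        return f"Columns: {payload['columns']}"

    def _create_fallback(self) -> TaskGuess:
        return TaskGuess(task="classification", confidence=0.0)


class FlakyGateway:
    """Fake gateway: structured output always fails, JSON returns a canned dict."""

    def __init__(self, json_payload: dict) -> None:
        self.json_payload = json_payload

    def complete_structured(self, system: str, prompt: str, model: Any) -> Any:
        raise RuntimeError("function calling unavailable")

    def complete_json(self, system: str, prompt: str) -> dict:
        return self.json_payload
```

With FlakyGateway({"task": "regression", "confidence": 0.8}) the detector returns a validated TaskGuess from tier 2; with a malformed payload such as {"oops": 1} validation fails and the tier 3 default is returned instead. Either way the caller receives a TaskGuess.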

Proposals Constrained to an ACTION_SPACE Registry

An LLM left to “suggest improvements” will happily recommend a model you do not ship or a hyperparameter you cannot set. Alchemist closes that gap with ACTION_SPACE: an explicit registry of ActionDefinition entries, each carrying a name, a description, a category (model · config · preprocessing · data), and the set of valid values it accepts.

The registry is the single source of truth in three directions:

Prompt construction — get_action_space_prompt() renders the registry into the system prompt, grouped by category, telling the model in plain terms: you can ONLY suggest the following actions. Model-change actions draw their valid values straight from the live MODEL_REGISTRY, so the menu never drifts from what is installed.
Validation — every returned action passes through validate_action(), which normalizes the name, checks it exists (with fuzzy “did you mean” suggestions for near-misses), and rejects any value outside the declared set before it touches a config.
Execution — only execute_action() mutates the run, translating a validated (action, value) pair into a concrete config or DataFrame change.

The effect is a tight leash: the model reasons freely about which lever to pull, but the levers themselves are a closed, typed set the system is guaranteed to be able to execute. Hallucinated knobs are caught at the boundary, not three stack frames into a trainer.
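A minimal sketch of the registry pattern follows, assuming a much smaller action set than the real one. The function and class names mirror those mentioned above, but the field names, the two example actions, the contents of MODEL_REGISTRY, and the return shape of validate_action are illustrative assumptions; the fuzzy suggestion uses the standard library's difflib, which may differ from what the framework does.

```python
import difflib
from dataclasses import dataclass

# Stand-in for the live model registry (hypothetical contents).
MODEL_REGISTRY = ("xgboost", "lightgbm", "random_forest")


@dataclass(frozen=True)
class ActionDefinition:
    name: str
    description: str
    category: str  # model | config | preprocessing | data
    valid_values: frozenset


ACTION_SPACE = {
    a.name: a
    for a in [
        # Model-change values come from the registry, so the menu cannot drift.
        ActionDefinition("change_model", "Swap the estimator", "model",
                         frozenset(MODEL_REGISTRY)),
        ActionDefinition("set_scaler", "Choose feature scaling", "preprocessing",
                         frozenset({"standard", "minmax", "none"})),
    ]
}


def get_action_space_prompt() -> str:
    """Render the registry, grouped by category, for the system prompt."""
    by_category: dict = {}
    for action in ACTION_SPACE.values():
        by_category.setdefault(action.category, []).append(action)
    lines = ["You can ONLY suggest the following actions:"]
    for category in sorted(by_category):
        lines.append(f"[{category}]")
        for action in by_category[category]:
            values = ", ".join(sorted(action.valid_values))
            lines.append(f"- {action.name}: {action.description} (values: {values})")
    return "\n".join(lines)


def validate_action(name: str, value: str) -> tuple:
    """Return (True, normalized_name) or (False, reason)."""
    key = name.strip().lower().replace(" ", "_").replace("-", "_")
    action = ACTION_SPACE.get(key)
    if action is None:
        close = difflib.get_close_matches(key, list(ACTION_SPACE), n=1, cutoff=0.6)
        hint = f"; did you mean '{close[0]}'?" if close else ""
        return False, f"unknown action '{name}'{hint}"
    if value not in action.valid_values:
        return False, f"'{value}' is not a valid value for {key}"
    return True, key
```

In this sketch validate_action("Change-Model", "lightgbm") normalizes to change_model and passes, a misspelled change_modle is rejected with a suggestion, and change_model with a value outside MODEL_REGISTRY is rejected before any config is touched.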

A Self-Healing ErrorRecoveryAgent

Training on messy real-world data fails constantly — string columns where floats are expected, NaNs in the target, empty splits after preprocessing. Rather than surfacing a stack trace, the ErrorRecoveryAgent treats a crash as input to the loop.

When a trainer throws, the agent captures an ErrorContext (message, type, trimmed traceback, current config, data shape and dtypes, a fresh data profile, and the history of previous errors), then runs a two-layer diagnosis:

A pattern matcher scans the message against a library of known failure signatures — dtype mismatches, NaN and infinity in features or targets, encoding faults, sample-count mismatches, class-imbalance traps — each mapped to a category and a suggested fix.
An LLM analysis built on the same constrained machinery: the recovery system prompt embeds the ACTION_SPACE and the available-model list, so the model’s proposed fixes are forced to be executable. It returns a structured ErrorRecoveryModel with a root cause, a recoverability flag, and a list of typed fixes; if that call fails, it falls back to the pattern-derived fix.

There is also a deterministic safety net: per-task MODEL_FALLBACK_CHAINS (for example XGBoost → LightGBM → RandomForest → … → DecisionTree) so a model that simply cannot fit the data gets swapped for the next viable one without any LLM at all. The recovery design is grounded in published work on reflective agent runtimes and tool-failure recovery, and it is the piece that lets an unattended run survive contact with dirty data.
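The deterministic parts of this design (pattern matching, fallback to the pattern-derived fix, and the model chain) can be sketched as below. The signature list, category names, fix names, and the diagnose and next_model helpers are assumptions for illustration, the regexes target messages in the style of NumPy and scikit-learn errors, and the chain shown is abbreviated: the elided middle of the example chain above is simply left out.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Signature:
    pattern: re.Pattern
    category: str
    suggested_fix: str


# Hypothetical signature library; a real one would be much longer.
SIGNATURES = [
    Signature(re.compile(r"could not convert string to float", re.I),
              "dtype_mismatch", "encode_categoricals"),
    Signature(re.compile(r"input .*contains nan|contains infinity", re.I),
              "nan_or_inf", "impute_missing"),
    Signature(re.compile(r"inconsistent numbers of samples", re.I),
              "sample_mismatch", "realign_features_and_target"),
]

# Abbreviated chain; intermediate entries are omitted, not guessed.
MODEL_FALLBACK_CHAINS = {
    "classification": ["xgboost", "lightgbm", "random_forest", "decision_tree"],
}


def match_signature(message: str) -> Optional[Signature]:
    """Layer 1: first known failure signature found in the error message."""
    for sig in SIGNATURES:
        if sig.pattern.search(message):
            return sig
    return None


def diagnose(message: str,
             llm_analyze: Optional[Callable[[str], str]] = None) -> Optional[str]:
    """Layer 2: prefer the LLM's fix, fall back to the pattern-derived one."""
    sig = match_signature(message)
    if llm_analyze is not None:
        try:
            return llm_analyze(message)
        except Exception:
            pass  # provider failed; use the deterministic answer
    return sig.suggested_fix if sig else None


def next_model(task: str, current: str) -> Optional[str]:
    """Safety net: next model in the chain, or None when the chain is exhausted."""
    chain = MODEL_FALLBACK_CHAINS.get(task, [])
    if current not in chain:
        return chain[0] if chain else None
    idx = chain.index(current)
    return chain[idx + 1] if idx + 1 < len(chain) else None
```

Keeping the pattern layer independent of the LLM call means a recovery attempt still has a concrete fix to try when the provider is down, and next_model gives the loop somewhere to go when no fix applies.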

Paper-Grounded Loop, Provider-Agnostic Gateway

The orchestrator is not an ad-hoc heuristic. Its exploration-vs-exploitation phases follow the IMPROVE methodology, and the broader design leans on AutoML-Agent and AgentHPO for the claim that LLM-guided iteration can match or beat fixed grids. Inside the loop, Optuna drives hyperparameter search with plateau, target, and budget stopping; a metalearning layer fingerprints datasets and warm-starts from a pipeline memory of past runs, so similar data does not restart from zero.
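The three stopping conditions can be expressed as one small decision function. This is a library-free sketch under stated assumptions, not the framework's implementation and not Optuna's API: StopPolicy, its default values, and stop_reason are hypothetical names, and scores are assumed to be higher-is-better.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class StopPolicy:
    target: float            # stop as soon as any trial reaches this score
    patience: int = 3        # trials allowed without meaningful improvement
    min_delta: float = 1e-3  # smallest gain that counts as improvement
    max_trials: int = 50     # hard budget


def stop_reason(scores: Sequence[float], policy: StopPolicy) -> Optional[str]:
    """Return 'target', 'budget', 'plateau', or None to keep going."""
    if not scores:
        return None
    if max(scores) >= policy.target:
        return "target"
    if len(scores) >= policy.max_trials:
        return "budget"
    if len(scores) > policy.patience:
        # Compare the best of the last `patience` trials with the best before them.
        best_before = max(scores[:-policy.patience])
        best_recent = max(scores[-policy.patience:])
        if best_recent - best_before < policy.min_delta:
            return "plateau"
    return None
```

In an Optuna-driven loop, a rule like this would typically be checked from a callback passed to study.optimize, which can call study.stop() when a reason is returned.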

Everything LLM-facing routes through one gateway built on LiteLLM — 100+ providers behind a single interface (OpenAI, Anthropic, Google, local Ollama) — with rate limiting, per-purpose cost tracking, and the structured-output methods the agent ladder depends on. The “brain” is never welded to one vendor; the same run executes against a frontier model or a fully local one. Trainers span classical ML, time series, clustering, and anomaly detection, plus SLM fine-tuning via Unsloth (CUDA LoRA and QLoRA) and native Apple-Silicon MLX, with DPO preference alignment. A document pipeline ingests 36+ formats, generates synthetic Q&A, fine-tunes, and exports a quantized GGUF with an Ollama Modelfile — or ONNX and joblib for the classical side.

Honest Evaluation — The SLM Runs Are Overfit

The framework’s own internal evaluation does not flatter it, and that is by design. The SLM fine-tuning path was exercised end to end on Qwen2.5-0.5B over educational content and produced coherent, grammatically correct answers, but the diagnostic flags the model plainly as undertrained and overfit:

only 26 Q&A pairs were generated (the target was 200–500+), so the model overfits to specific phrasing and generalizes poorly to new questions;
96% of answers carry no citation back to the source document, so facts cannot be verified; hallucination was observed on at least one probe;
only 1 of 4 available documents was actually processed, with no deduplication or quality filtering on the training pairs.

These are recorded as a root-cause analysis with a concrete hardening plan, not hidden. The classical-ML loop and the agent contract are the solid, tested core; the SLM fine-tuning path is genuinely experimental, and saying so is the point. An AutoML that reasons should also reason honestly about its own results.
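The dataset-level findings above are the kind of thing a small pre-training audit can catch. The sketch below is an assumption about how such a check might look, not part of the framework: QAPair, its source field, audit_pairs, and the report keys are hypothetical, duplicates are detected only by normalized question text, and the default minimum of 200 pairs comes from the lower end of the stated target.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    source: Optional[str] = None  # citation back to the source document, if any


def audit_pairs(pairs: Sequence[QAPair], min_pairs: int = 200) -> dict:
    """Deduplicate by normalized question and report size and citation coverage."""
    seen = set()
    unique = []
    for pair in pairs:
        key = " ".join(pair.question.lower().split())  # case/whitespace-insensitive
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    uncited = sum(1 for pair in unique if not pair.source)
    return {
        "total": len(pairs),
        "unique": len(unique),
        "duplicates": len(pairs) - len(unique),
        "uncited_ratio": uncited / len(unique) if unique else 0.0,
        "enough_pairs": len(unique) >= min_pairs,
    }
```

Running a check like this before fine-tuning would let the loop refuse to train, or at least warn, when the pair count or citation coverage falls short.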

Engineering Discipline

Despite being a research-grade framework, it is held to product standards: a 298-test suite spanning unit and integration tests; ruff, mypy, and pre-commit gating every change; optional structlog · Prometheus · MLflow observability; and a 9-page Streamlit control surface for driving runs and watching the loop reason in real time.

Stack

Python · LiteLLM · Pydantic · Optuna · scikit-learn · XGBoost · LightGBM · CatBoost · Unsloth · Apple MLX · DPO · GGUF · ONNX · Streamlit · pytest · ruff · mypy · Prometheus · MLflow