Act II·10 of 11·60 min

AutoResearch on a propensity model

Simple AutoResearch example — your first learning loop

Exercise 4 looped the same task N times. This one is different: the loop measures, changes a variable, and runs again — trying to beat its own previous score. This is the minimum viable AutoResearch.

We'll use a small, CPU-friendly machine learning task so the loop actually converges in the room. No GPU, no cloud, no waiting.

The baseline: StandardScaler → LogisticRegression(C=1.0). It scored:

val_auc = 0.8896
test_auc = 0.8690

AUC is a score that tells you how well a model separates the "yes" cases from the "no" cases: 0.5 is random guessing and 1.0 is perfect. (Values below 0.5 are possible and mean the ranking is worse than chance.) Higher is better. The loop's job is to push these numbers up.
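To make the AUC definition concrete, here is a minimal sketch that computes it by brute force: the fraction of (yes, no) pairs in which the "yes" case gets the higher score. The function name `pairwise_auc` and the toy data are made up for illustration; the repo itself may well use a library scorer instead.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This is O(P * N), which is fine
    for small examples but not how a library would compute it.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, made up for illustration: 1 = "yes", 0 = "no".
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(labels, scores)  # 3 of the 4 pairs are ranked correctly
```

A model that scores every case identically gets 0.5 under this definition, which is why 0.5 is read as "random guessing".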

The task

Prompt:

Clone this repo: https://github.com/fjfok/autoresearch-edu
Place it in ~/Documents/github/

→ This copies the code from GitHub to your laptop, into ~/Documents/github/.

Go to the newly created folder.

THIS NEEDS TO HAPPEN MANUALLY

Open the cloned folder in a new Claude Code session

Start a new Claude Code session. When it asks which folder to open, double-click the autoresearch-edu folder to open it as the project.

Verify you're in the right place.

Prompt:

What files are in this folder

You should see roughly:

.claude/
.gitignore
LICENSE
README.md
prepare.py
program.md
pyproject.toml
train.py
uv.lock

The exact list may shift over time as the repo evolves — the key signal is that train.py, program.md, and .claude/ are present. If you instead see your other GitHub folders (REST-bench, etc.), Claude Code is rooted one level too high — quit and reopen, double-clicking autoresearch-edu this time.

Check if everything is installed correctly:

Prompt:

Check if you have everything installed for running this repo's code

This repository includes a custom Claude slash command located in .claude/commands/autoresearch.md.

If you use a Claude client that supports slash commands, you can trigger the ratchet loop directly from chat:

Prompt:

/autoresearch

Stuck? Don't see the /autoresearch command?

Two fallbacks:

Fully restart the Claude desktop application — quit it, don't just close the window. Project-scoped slash commands only refresh on a full restart.
If it still doesn't appear, paste this prompt instead:
Read .claude/commands/autoresearch.md and follow those instructions

The command autonomously loads the dataset, establishes a baseline, runs FLAML, executes N ratchet iterations, and generates the final visualizations.

Go deeper (optional)

Ask to change the dataset:

Prompt:

Change dataset to .....

Run more versions:

Prompt:

/autoresearch

Debrief — what should have happened

A new test_auc score that beats 0.8690. A visualisation showing the ratchet — score climbing in steps, occasionally reverting when a change made things worse, climbing again. A learnings log of what worked.

This is the smallest production-grade AutoResearch loop. Internalise the shape:

Fixed: the data interface, the scorer, the eval split.
Free: what the agent edits (model, scaling, hyperparameters, feature subset).
Ratchet: keep the change if it beats the best-so-far, revert otherwise.
Stop: N iterations or no improvement for K rounds.
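The four parts above can be sketched as a small loop. This is not the repo's implementation: `ratchet`, `toy_score`, and `toy_propose` are hypothetical names, the scorer is a stand-in formula rather than a trained model, and the proposal step is random where the real loop would have the agent reason about what to try next.

```python
import random


def ratchet(evaluate, propose, baseline, n_iters=20, patience=5, seed=0):
    """Keep a candidate only if it beats the best score so far."""
    rng = random.Random(seed)
    best, best_score = baseline, evaluate(baseline)
    history = [(best_score, "baseline")]
    stale = 0
    for _ in range(n_iters):             # Stop: at most N iterations...
        candidate = propose(best, rng)   # Free: what the loop may change
        score = evaluate(candidate)      # Fixed: the scorer never changes
        if score > best_score:           # Ratchet: keep the improvement
            best, best_score = candidate, score
            history.append((score, "keep"))
            stale = 0
        else:                            # Ratchet: revert to best-so-far
            history.append((score, "revert"))
            stale += 1
            if stale >= patience:        # ...or K rounds with no improvement
                break
    return best, best_score, history


# Stand-in scorer (hypothetical): peaks at C = 3.0. A real loop would
# train a model here and return its validation AUC.
def toy_score(config):
    return 1.0 - (config["C"] - 3.0) ** 2 / 100.0


# Stand-in proposal step (hypothetical): halve or double C at random.
def toy_propose(config, rng):
    return {"C": config["C"] * rng.choice([0.5, 2.0])}


best, best_score, history = ratchet(toy_score, toy_propose, {"C": 1.0})
```

The `history` list is what the ratchet visualisation would be drawn from: the kept scores only ever go up, and the reverted ones show which changes were tried and dropped.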

FAQ for this exercise

Q: I got a bunch of errors — uv not installed, Python 3.9 vs 3.10, no venv-type file. What do I do? A: If your environment allows it and you've followed internal IT/security guidance, you can ask Claude Code to walk through the installation steps with you. For company machines, prefer approved installation flows or check with developer support.

Q: You said that when you're not with us we shouldn't allow things we don't understand, but I'm allowing 99% of things. What do you recommend? A: For this course the goal is to get you set up, so allowing the listed packages is fine. Beyond the course, check with IT or a developer before allowing anything new. One safety habit: at any time, you can ask Claude "look at everything I've done so far and tell me if I created any new vulnerabilities."

Q: I ran the code but the slash command never appeared. How do I get /autoresearch to work? A: Two things to check:

You must be inside the project folder (~/Documents/github/autoresearch-edu). Project-scoped commands only show up there.
Restart Claude Code fully (not just close the window). Commands only refresh when Claude restarts.

Q: Do you plan to show the program.md file? A: Yes — it's worth understanding the structure. The magic is roughly nine steps: check state, edit train.py, commit, run, extract metrics (accuracy, wall time, status), if crashed fix it, save results with hypothesis, ratchet decision (keep if better), never stop. Out-of-scope items and a fixed budget (e.g. 180 seconds) prevent gaming.
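As a rough illustration of the "run, extract metrics, handle crashes" steps under a fixed budget, here is a minimal sketch. It assumes the training script prints a line such as `val_auc = 0.8896` to stdout; that output format, the function name `run_with_budget`, and the status labels are assumptions for this example, not taken from program.md.

```python
import re
import subprocess
import sys
import time


def run_with_budget(cmd, budget_s=180):
    """Run one experiment under a wall-clock budget and report the outcome."""
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=budget_s)
    except subprocess.TimeoutExpired:
        # The child process is killed when the budget runs out.
        return {"status": "timeout", "val_auc": None, "wall_s": time.monotonic() - start}
    wall_s = time.monotonic() - start
    if proc.returncode != 0:
        return {"status": "crashed", "val_auc": None, "wall_s": wall_s}
    # Assumed output format: a line like "val_auc = 0.8896".
    match = re.search(r"val_auc\s*=\s*([0-9.]+)", proc.stdout)
    if match is None:
        return {"status": "no_metric", "val_auc": None, "wall_s": wall_s}
    return {"status": "ok", "val_auc": float(match.group(1)), "wall_s": wall_s}


# Stand-in for running the real training script: a one-liner that prints
# a metric line in the assumed format.
result = run_with_budget([sys.executable, "-c", "print('val_auc = 0.8896')"], budget_s=30)
crashed = run_with_budget([sys.executable, "-c", "raise SystemExit(1)"], budget_s=30)
```

Treating a timeout as a failed run, rather than waiting for it, is what keeps a slow experiment from stalling the whole loop.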

Q: How is this better than a sweep or Monte Carlo or Bayesian optimisation? A: A reasoning agent uses Opus's world knowledge to generate hypotheses about what to try next, instead of sampling the search space without that prior knowledge. That can mean fewer steps to reach a good configuration. Crucially, it makes optimising real business processes feasible, because you can't run a million tests on real customers.

Q: How do you handle holdouts? Lifts degrade over time and there are second-order effects. A: Same train/validation/test split as classical ML — agent optimises against validation, you reserve the test set for the final winner. Run small experiments in bulk → pick most promising → larger experiment → real environment. AutoResearch does the heavy lifting before you ever touch real users.
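The holdout discipline described in this answer can be sketched in a few lines: split once with a fixed seed, choose among candidates on validation only, and read the test score a single time for the winner. The names `three_way_split` and `pick_winner`, the split fractions, and the candidate scores are all made up for illustration.

```python
import random


def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle once with a fixed seed, then cut into train / val / test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_frac)
    n_val = round(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


def pick_winner(candidates, val_score, test_score):
    """Select on validation only; look at the test score once, for the winner."""
    winner = max(candidates, key=val_score)
    return winner, val_score(winner), test_score(winner)


train, val, test = three_way_split(range(100))

# Made-up scores for two hypothetical candidates, for illustration only.
toy_val = {"a": 0.70, "b": 0.80}
toy_test = {"a": 0.69, "b": 0.78}
winner, winner_val, winner_test = pick_winner(["a", "b"], toy_val.get, toy_test.get)
```

Because the seed is fixed, every iteration of the loop sees the same split, so scores from different iterations stay comparable.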

Q: What's the scope of changes AutoResearch can make? It can rewrite anything, right? A: No — program.md defines what's in scope (e.g. only train.py) and what's out of scope. There's also a fixed time budget per iteration to keep things efficient.
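One way to picture the scope rule is a guard that rejects any iteration touching files outside an allow-list. This is a hypothetical sketch, not how program.md enforces scope: the function name `out_of_scope` and the default allow-list are assumptions, and the list of changed paths would come from whatever change tracking the loop uses.

```python
def out_of_scope(changed_files, allowed=frozenset({"train.py"})):
    """Return the changed paths that fall outside the allowed set."""
    return sorted(path for path in changed_files if path not in allowed)


# Hypothetical change lists for two iterations.
ok_change = out_of_scope(["train.py"])
bad_change = out_of_scope(["train.py", "prepare.py"])
```

An empty result means the iteration stayed in scope; a non-empty one names the files that should be reverted before the run is scored.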

Go deeper

Point the same loop at a different target on the same dataset.

Re-run /autoresearch with the target column changed from conversion to LTV
(or churn, or revenue per session — pick one).
Keep everything else identical: same train/val/test split, same scorer shape,
same budget. Log to a separate results.tsv so we can compare.

When it finishes, diff the two learnings.md files. Same data, two questions — do the discovered features overlap, or does each target want a different model entirely? That's the conversation worth having with your team on Monday.
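If you would rather compare the two logs programmatically than by eye, a minimal sketch with Python's standard `difflib` looks like this. The helper names and the toy log contents are made up; the actual format of learnings.md is not specified here, so the sketch treats it as plain lines of text.

```python
import difflib


def diff_text(text_a, text_b, name_a="a", name_b="b"):
    """Unified diff between two texts, returned as a list of lines."""
    return list(difflib.unified_diff(
        text_a.splitlines(), text_b.splitlines(),
        fromfile=name_a, tofile=name_b, lineterm="",
    ))


def shared_lines(text_a, text_b):
    """Non-empty lines that appear in both texts (the overlap)."""
    a = {line.strip() for line in text_a.splitlines() if line.strip()}
    b = {line.strip() for line in text_b.splitlines() if line.strip()}
    return sorted(a & b)


# Placeholder contents, for illustration only. With real files you would
# read each one with pathlib.Path(...).read_text() instead.
conversion_log = "- scaling helped\n- feature subset A helped\n"
ltv_log = "- scaling helped\n- feature subset B helped\n"

overlap = shared_lines(conversion_log, ltv_log)
diff = diff_text(conversion_log, ltv_log, "conversion", "ltv")
```

The overlap answers the "do the discovered features overlap" question directly; the diff shows what each target wanted that the other did not.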
