All exercises
Act II·10 of 11·60 min
5

AutoResearch on a propensity model

Simple AutoResearch example — your first learning loop

Exercise 4 loopedLoopWhat makes Claude an agent and not a chatbot. Instead of one ask-and-answer turn, a loop runs Claude over and over: act, observe what changed, decide the next step, repeat. the same task N times. This one is different: the loop measures, changes a variable, and runs again — trying to beat its own previous score. This is the minimum viable AutoResearchAutoResearchA loop that turns Claude into a tireless ML researcher. Give it a dataset and a metric to beat; it tries an approach, scores it, journals what it learned, and keeps going overnight..

We'll use a small, CPUGPU vs CPUA CPU is your computer's general-purpose brain — a handful of fast, flexible cores. A GPU has thousands of slower cores that do simple math in parallel — perfect for ML training.-friendly machine learning task so the loop actually converges in the room. No GPU, no cloud, no waiting.

The baselineBaselineThe score a dumb-but-honest approach gets on your problem. The bar you need to clear to claim your fancy approach actually does anything.: StandardScaler → LogisticRegression(C=1.0). It scored:

  • val_auc = 0.8896
  • test_auc = 0.8690

AUCAUCArea Under the (ROC) Curve — a number between 0.5 (random guessing) and 1.0 (perfect) that measures how well a classifier separates yes-examples from no-examples. Higher is better. is a score between 0.5 (random guessing) and 1.0 (perfect) that tells you how well a model separates the "yes" cases from the "no" cases. Higher is better. The loop's job is to push these numbers up.

The task

Prompt:

Clone this repo: https://github.com/fjfok/autoresearch-edu
Place it in ~/Documents/github/

→ This means we are copying the code from GitHubGitHubThe website where most people host git repositories. Git is the tool, GitHub is one popular place that stores the result. Owned by Microsoft, free for public projects. to your laptop. Place it in ~/Documents/github/.

Go to the newly created folder.

THIS NEEDS TO HAPPEN MANUALLY

Open the cloned folder in a new Claude Code session

Start a new Claude Code sessionClaude Code — start a new session. When it asks which folder to open, double-click the autoresearch-edu folder to open it as the project.

Verify you're in the right place.

Prompt:

What files are in this folder

You should see roughly:

  • .claude/
  • .gitignore
  • LICENSE
  • README.md
  • prepare.py
  • program.md
  • pyproject.toml
  • train.py
  • uv.lock

The exact list may shift over time as the repoRepoA folder that git is tracking — looks normal, but inside is a hidden .git/ directory holding the entire history: every file, every change, every commit, every branch. evolves — the key signal is that train.py, program.md, and .claude/ are present. If you instead see your other GitHub folders (REST-bench, etc.), Claude Code is rooted one level too high — quit and reopen, double-clicking autoresearch-edu this time.

Check if everything is installed correctly:

Prompt:

Check if you have everything installed for running this repo's code

This repository includes a custom Claude slash commandSlash commandA shortcut you type as /something to launch a workflow Claude already knows. Type /check-mail and Claude does the whole routine without re-prompting. located in .claude/commands/autoresearch.md.

If you use a Claude client that supports slash commands, you can trigger the ratchet loop directly from chat:

Prompt:

/autoresearch
Stuck? Don't see the /autoresearch command?

Two fallbacks:

  1. Fully restart the Claude desktop application — quit it, don't just close the window. Project-scoped slash commands only refresh on a full restart.
  2. If it still doesn't appear, paste this prompt instead:
    Read .claude/commands/autoresearch.md and follow those instructions

The command autonomously loads the dataset, establishes a baseline, runs FLAMLFLAMLFast and Lightweight AutoML — an open-source Python library from Microsoft Research. Give it a dataset and a metric; it tries model types and hyperparameters and returns the best in a fixed time budget., executes N ratchet iterations, and generates the final visualizations.

OptionalGo Deeper optional

Ask to change the dataset:

Prompt:

Change dataset to .....

Run more versions:

Prompt:

/autoresearch

Debrief — what should have happened

A new test_auc score that beats 0.8690. A visualisation showing the ratchet — score climbing in steps, occasionally reverting when a change made things worse, climbing again. A learnings log of what worked.

This is the smallest production-grade AutoResearch loop. Internalise the shape:

  • Fixed: the data interface, the scorer, the eval split.
  • Free: what the agent edits (model, scaling, hyperparameters, feature subset).
  • Ratchet: keep the change if it beats the best-so-far, revert otherwise.
  • Stop: N iterations or no improvement for K rounds.

FAQ for this exercise

Q: I got a bunch of errors — uv not installed, PythonPythonA programming language — the lingua franca of data science, machine learning, and scripting. A separate ecosystem from Node and npm, with its own runtime and package installer (pip). 3.9 vs 3.10, no venv-type file. What do I do? A: If your environment allows it and you've followed internal IT/security guidance, you can ask Claude Code to walk through the installation steps with you. For company machines, prefer approved installation flows or check with developer support.

Q: When you're not with us, you said don't allow things you don't understand — but I'm allowing 99% of things. What do you recommend? A: For this course the goal is to set you up — so allowing the listed packages is fine. Beyond the course, check with IT or a developer for anything new. One safety habit: any time, you can ask Claude "look at everything I've done so far and tell me if I created any new vulnerabilities."

Q: I ran the code but the slash command never appeared. How do I get /autoresearch to work? A: Two things to check:

  1. You must be inside the project folder (Documents/GitHub/autoresearch-edu). Project-scoped commands only show up there.
  2. Restart Claude Code fully (not just close the window). Commands only refresh when Claude restarts.

Q: Do you plan to show the program.md file? A: Yes — it's worth understanding the structure. The magic is roughly nine steps: check state, edit train.py, commitCommitA snapshot of your project at one moment, with a message describing what changed. Each commit knows which one came before it — that's how git rebuilds history., run, extract metrics (accuracy, wall time, status), if crashed fix it, save results with hypothesis, ratchet decision (keep if better), never stop. Out-of-scope items and a fixed budget (e.g. 180 seconds) prevent gaming.

Q: How is this better than a sweep or Monte Carlo or Bayesian optimisation? A: Reasoning agents leverage Opus's world knowledge to generate hypotheses about what to try next, instead of randomly sampling. Fewer steps to find the optimum — and crucially, it makes optimising real business processes feasible because you can't do a million tests with real customers.

Q: How do you handle holdouts? Lifts degrade over time and there are second-order effects. A: Same train/validation/test split as classical ML — agent optimises against validation, you reserve the test set for the final winner. Run small experiments in bulk → pick most promising → larger experiment → real environment. AutoResearch does the heavy lifting before you ever touch real users.

Q: What's the scope of changes AutoResearch can make? It can rewrite anything, right? A: No — program.md defines what's in scope (e.g. only train.py) and what's out of scope. There's also a fixed time budget per iteration to keep things efficient.

Go deeper

Point the same loop at a different target on the same dataset.

Re-run /autoresearch with the target column changed from conversion to LTV
(or churn, or revenue per session — pick one).
Keep everything else identical: same train/val/test split, same scorer shape,
same budget. Log to a separate results.tsv so we can compare.

When it finishes, diff the two learnings.md files. Same data, two questions — do the discovered features overlap, or does each target want a different model entirely? That's the conversation worth having with your team on Monday.


Ask for help