Operator-Builder · Case study

AI Crew Chief

Compound AI system that decides go/no-go between motorcycle racetrack sessions. Four-layer architecture with the LLM as orchestrator, not decider. Built as a personal project; the patterns now inform AI-native decision intelligence work at SAS.

Builder · December 2025

A compound AI system that decides go/no-go between motorcycle racetrack sessions. It synthesizes telemetry, ML risk modeling, real-time weather, and historical setup notes into a single recommendation I'd actually trust.

At 160 mph, hallucination is a disqualifier, not a bug.

Problem framing

Track day decisions look simple from the outside. Go out, ride, come in. Look at your data, go out again. The reality is a 20-minute window between sessions where you're physically depleted, mentally compressed, and staring at three sources of data that don't agree with each other.

The naive framing is "build a dashboard." But a dashboard solves the wrong problem. I don't lack visibility into my telemetry. I have too much of it. Brake wear, tire temps, traction control intervention frequency, lap-time consistency, lean angle distributions, weather, the setup notes from the last time I rode this track in similar conditions. Six minutes into a paddock session, with my heart rate still at 140, I'm not going to read a dashboard. I'm going to ask the wrong question of the data or skip it entirely.

The right framing is decision support, not data display. The system needs to do three things a dashboard can't: enforce hard safety limits I might rationalize past when I'm tired, surface subtle risk patterns I wouldn't notice manually, and answer questions in the form I actually ask them ("am I safe?" not "show me brake pad wear over time").

That sounds like a chatbot, which is what I built first. It didn't work. A single LLM with RAG over my telemetry got the soft questions right and the hard questions confidently wrong. The most consequential miss: it cleared me to ride on brake pads below my own threshold. The full story is in Beyond Chatbots. The short version: probabilistic models shouldn't make deterministic decisions, and the conversational interface was hiding the architectural mistake.

The real problem isn't "build an AI assistant for track days." It's: how do you architect a decision system where the LLM is useful for the interface but disqualified from the decision? That reframe produced the compound architecture. It's the same reframe any team building safety-adjacent AI needs to make: fraud detection, clinical decision support, anything where the cost of a confident wrong answer is asymmetric.

The architecture

The full argument is in Beyond Chatbots. What follows is the short version, for context.

Four layers, each chosen for what it's actually good at:

Deterministic Python guardrails. Hard safety limits (brake life, tire age, oil temp) where any probability is the wrong answer.
Random Forest risk classifier. Telemetry patterns where probability is exactly the right answer.
Open-Meteo API. Live track conditions. The LLM has no idea it's 94°F today.
ChromaDB RAG. My handwritten setup notes for qualitative, opinionated knowledge.

Llama-3.3-70b on Groq orchestrates. It interprets the question, calls the appropriate layers (guardrails always first), and synthesizes the results. It never decides. The decisions are made by the layers it routes to.
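The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: every name (`decide`, `Evidence`, the callables passed in) is hypothetical, and `RISK_CAUTION_THRESHOLD` is an invented cutoff. The point it tries to show is that the verdict is computed in plain code, and the LLM step (`synthesize`) only phrases a verdict it was handed.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical cutoff for illustration; the write-up does not state one.
RISK_CAUTION_THRESHOLD = 0.6


@dataclass
class Evidence:
    violations: list                           # from the deterministic guardrails
    risk_probability: Optional[float] = None   # from the risk classifier
    weather: Optional[dict] = None             # from the live weather layer
    notes: list = field(default_factory=list)  # from setup-note retrieval


def decide(bike_state, question, *, check_guardrails, score_risk,
           fetch_weather, retrieve_notes, synthesize):
    """Return (verdict, explanation). The verdict never comes from the LLM."""
    # Layer 1 always runs first, whatever the question was.
    evidence = Evidence(violations=check_guardrails(bike_state))
    if evidence.violations:
        # A hard limit tripped: stop here, the other layers are not consulted.
        return "NO-GO", synthesize(question, "NO-GO", evidence)

    evidence.risk_probability = score_risk(bike_state)
    evidence.weather = fetch_weather()
    evidence.notes = retrieve_notes(question)
    risky = evidence.risk_probability >= RISK_CAUTION_THRESHOLD
    verdict = "CAUTION" if risky else "GO"
    return verdict, synthesize(question, verdict, evidence)


def example_run(brake_life_pct):
    """Wire the flow up with stub layers and record which ones were called."""
    called = []

    def check_guardrails(state):
        called.append("guardrails")
        low = state["brake_life_pct"] < 30.0  # hypothetical limit
        return ["brake life below limit"] if low else []

    def score_risk(state):
        called.append("risk")
        return 0.2

    def fetch_weather():
        called.append("weather")
        return {"air_temp_f": 94}

    def retrieve_notes(question):
        called.append("notes")
        return []

    def synthesize(question, verdict, evidence):
        detail = "; ".join(evidence.violations) or "no hard limits tripped"
        return f"{verdict}: {detail}"

    verdict, text = decide(
        {"brake_life_pct": brake_life_pct}, "should I push harder?",
        check_guardrails=check_guardrails, score_risk=score_risk,
        fetch_weather=fetch_weather, retrieve_notes=retrieve_notes,
        synthesize=synthesize,
    )
    return verdict, text, called
```

Calling `example_run(20.0)` stops after the guardrail layer, while `example_run(80.0)` walks through all four layers in order.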

The rest of this case study is the build itself.

Data

The training data is where this case study has to be most honest, because it is what most distinguishes a "working prototype" from a "production system."

I don't yet have real CAN-bus telemetry from my Panigale. Ducati's data bus exposes a lot, but capturing it cleanly requires hardware and signal-mapping work I haven't done yet. So I built v1 on synthetic telemetry modeled on Ducati's published data shape: realistic distributions for the metrics that matter (traction control intervention frequency, lap-time variance, lean angle distribution, brake pressure traces), with noise and inter-feature correlations modeled from domain reasoning rather than sampled from real sessions.
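One way to get "noise and inter-feature correlations modeled from domain reasoning" is to drive several features from a shared latent factor. The sketch below assumes exactly that; the field names, means, and spreads are invented placeholders rather than values from Ducati's data shape or from the real generator.

```python
import math
import random


def generate_session(rng):
    """One synthetic session. A shared latent 'strain' factor induces
    correlation between features; each feature adds its own noise."""
    strain = rng.gauss(0.0, 1.0)
    return {
        # All constants below are hypothetical, chosen only for illustration.
        "tc_interventions_per_lap": max(0.0, 4.0 + 2.0 * strain + rng.gauss(0.0, 1.0)),
        "lap_time_std_s": max(0.05, 0.8 + 0.3 * strain + rng.gauss(0.0, 0.2)),
        "max_lean_deg": min(62.0, 50.0 + 2.0 * strain + rng.gauss(0.0, 3.0)),
    }


def generate_dataset(n, seed=0):
    """Seeded so a training set can be regenerated exactly."""
    rng = random.Random(seed)
    return [generate_session(rng) for _ in range(n)]


def pearson(xs, ys):
    """Plain Pearson correlation, used to sanity-check the generator."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


sessions = generate_dataset(500, seed=1)
tc = [s["tc_interventions_per_lap"] for s in sessions]
lap_std = [s["lap_time_std_s"] for s in sessions]
print(f"corr(tc, lap_std) = {pearson(tc, lap_std):.2f}")
```

Checking the empirical correlation against what domain reasoning says it should be is a cheap guard against a generator that is too clean.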

This is a deliberate choice, not an oversight. Building the architecture, training the RF, validating the RAG, and proving the orchestration works end-to-end didn't require real telemetry. It required plausible telemetry. The synthetic approach let me ship the system in weeks instead of months and validate the architectural bet (compound over monolithic) before committing to the hardware integration work.

The honest caveats:

Fields I'm confident map cleanly to real telemetry: lap times, brake pressure traces, throttle position. These are exposed directly.
Fields that likely need derivation in production: traction control intervention frequency (probably needs event-counting from a different signal), lean angle distribution (depends on whether Ducati exposes the IMU stream directly or only the derived rider-aid outputs).
Fields I haven't audited yet: tire temperature inference, brake pad wear estimation. These may or may not be directly available.

The RAG corpus is the opposite story. Those are real handwritten notes from real track days at VIR and NCBike. Hand-typed, tagged with track, date, weather, tire compound, and setup state. Dozens of sessions at the time of writing.

Next milestone on the data side: real CAN-bus capture from the bike, mapped field-by-field to the synthetic schema. I expect this to surface a few preprocessing steps the synthetic shortcut let me skip, and to slightly degrade RF performance until I retrain on the real distribution. That's the cost of having shortcut the data work, and it's worth paying now to know the architecture holds.

Evals

Eval discipline is where most personal AI projects fall apart and where most production AI projects should but don't. For Crew Chief, I separated evaluation by layer because each layer has different failure modes and different ground truth available.

Deterministic guardrails. Unit-testable in the conventional sense. I wrote test cases for every hard limit (brake life thresholds, tire age, oil temperature) and edge cases around the boundaries. The eval question isn't "is the model accurate" but "does the code execute the rule I wrote." Boring, important, fully covered.
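A boundary-focused test table for this layer might look like the sketch below. The threshold values, the units (percent, sessions, degrees Celsius), and the function name are all hypothetical; only the three limit types come from the text.

```python
# All thresholds are hypothetical placeholders, not the real limits.
MIN_BRAKE_LIFE_PCT = 30.0
MAX_TIRE_AGE_SESSIONS = 12
MAX_OIL_TEMP_C = 120.0


def guardrail_violations(brake_life_pct, tire_age_sessions, oil_temp_c):
    """Return the list of hard limits that are violated (empty means pass)."""
    violations = []
    if brake_life_pct < MIN_BRAKE_LIFE_PCT:
        violations.append("brake_life")
    if tire_age_sessions > MAX_TIRE_AGE_SESSIONS:
        violations.append("tire_age")
    if oil_temp_c > MAX_OIL_TEMP_C:
        violations.append("oil_temp")
    return violations


# (inputs, expected violations): exactly at each limit, then just past it.
BOUNDARY_CASES = [
    ((30.0, 12, 120.0), []),              # everything exactly at the limit passes
    ((29.9, 12, 120.0), ["brake_life"]),  # just under the brake threshold
    ((30.0, 13, 120.0), ["tire_age"]),    # one session too old
    ((30.0, 12, 120.1), ["oil_temp"]),    # just over the oil temperature limit
    ((0.0, 99, 200.0), ["brake_life", "tire_age", "oil_temp"]),
]


def failed_cases():
    """Return every case where the code disagrees with the rule as written."""
    failures = []
    for args, expected in BOUNDARY_CASES:
        actual = guardrail_violations(*args)
        if actual != expected:
            failures.append((args, expected, actual))
    return failures


print(failed_cases())
```

The table makes the inclusive or exclusive choice at each boundary explicit, which is the part of a hard limit that is easiest to get wrong silently.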

Random Forest risk classifier. Trained on synthetic telemetry. Cross-validation on the synthetic set shows strong precision and recall on the risk classification task, with feature importances that match domain intuition (traction control frequency and lap-time variance dominate; absolute lean angle matters less than I expected). The honest caveat: these results are bounded by how realistic the synthetic data is. Verifying the field-by-field mapping to live CAN-bus output is the next milestone.
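For readers who want the shape of that evaluation, here is a minimal scikit-learn sketch. It assumes NumPy and scikit-learn are installed. The feature names, distributions, and especially the labeling rule are invented for illustration, so the scores it prints say something about this toy generator and nothing about the real classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

FEATURES = ["tc_interventions_per_lap", "lap_time_std_s", "max_lean_deg"]


def make_synthetic(n=600, seed=0):
    """Toy data: risk is driven by TC frequency and lap-time variance only."""
    rng = np.random.default_rng(seed)
    tc = np.clip(rng.normal(4.0, 2.0, n), 0.0, None)
    lap_std = np.clip(rng.normal(0.8, 0.3, n), 0.05, None)
    lean = rng.normal(50.0, 4.0, n)  # deliberately unrelated to the label
    # Hypothetical labeling rule plus noise; not derived from real sessions.
    score = (tc - 4.0) / 2.0 + (lap_std - 0.8) / 0.3 + rng.normal(0.0, 0.5, n)
    y = (score > 1.0).astype(int)
    return np.column_stack([tc, lap_std, lean]), y


X, y = make_synthetic()
clf = RandomForestClassifier(n_estimators=200, random_state=0)

# 5-fold cross-validation, scoring precision and recall on the risky class.
cv = cross_validate(clf, X, y, cv=5, scoring=("precision", "recall"))
mean_precision = float(cv["test_precision"].mean())
mean_recall = float(cv["test_recall"].mean())

# Fit once on everything to inspect which features the forest leans on.
clf.fit(X, y)
importances = dict(zip(FEATURES, clf.feature_importances_))

print(f"precision={mean_precision:.2f} recall={mean_recall:.2f}")
for name, weight in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name:28s} {weight:.2f}")
```

Reading the importances next to the cross-validation scores matters because good scores alone can come from the forest learning generator artifacts.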

RAG over setup notes. I hand-labeled a relevance set from my own notes. For a held-out set of queries ("how do I fix a rear slide at VIR T3," "tire pressure adjustment for hot weather"), I scored ChromaDB's top-k retrievals for whether they returned the note I actually wanted. Initial top-3 retrieval was weaker than expected. Tuning chunk size and adding metadata filters (track name, weather range) brought it to a level I'd ship internally. Still not at the level I'd ship externally without more notes in the corpus.
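The scoring described here amounts to a hit rate at k over a hand-labeled query set. A minimal sketch, assuming a retriever that returns note ids in ranked order: the word-overlap retriever below is a stand-in for a ChromaDB query, and the note texts and the third query are invented placeholders (the first two queries are the ones quoted above).

```python
import re

# Placeholder notes, invented for this example.
NOTES = {
    "vir-t3-rear-slide": "rear slide exiting T3 at VIR, added two clicks rear rebound",
    "hot-pressure": "hot weather day, dropped rear tire pressure one psi",
    "ncbike-gearing": "NCBike gearing felt short, went up one tooth on the rear sprocket",
}

# (query, id of the note the rider actually wanted)
LABELED_QUERIES = [
    ("how do I fix a rear slide at VIR T3", "vir-t3-rear-slide"),
    ("tire pressure adjustment for hot weather", "hot-pressure"),
    ("rear felt loose in the heat", "hot-pressure"),
]


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_retrieve(query, k):
    """Rank notes by shared words. A stand-in for a vector store query."""
    q = tokens(query)
    ranked = sorted(NOTES.items(), key=lambda kv: -len(q & tokens(kv[1])))
    return [note_id for note_id, _ in ranked[:k]]


def hit_rate_at_k(retrieve, labeled_queries, k):
    """Fraction of queries whose wanted note appears in the top k results."""
    hits = sum(1 for query, wanted in labeled_queries if wanted in retrieve(query, k))
    return hits / len(labeled_queries)


for k in (1, 2, 3):
    print(k, round(hit_rate_at_k(toy_retrieve, LABELED_QUERIES, k), 2))
```

The third query misses at k=1 and k=2 because it shares more surface words with the gearing note than with the note the rider wanted. With only three notes, k=3 is trivially perfect, which is a reminder that hit rate only means something when k is small relative to the corpus.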

LLM router. The hardest layer to eval because the failure mode is subtle: wrong tool call, partial tool call, or right tool call with wrong synthesis. I built a small adversarial prompt set: questions designed to trick the router into skipping the guardrails (asking about brakes obliquely, asking compound questions that mix safety-critical and stylistic concerns). The router catches the obvious cases. The compound questions are where I still see occasional bad behavior, which is why the guardrails are positioned as a forced first call rather than an optional tool.
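An adversarial check of this kind can be reduced to one assertion per prompt: is the guardrail call first in the tool trace? In the sketch below, `naive_router` is a deliberately weak keyword stand-in for an LLM router, `forced_first` is a hypothetical wrapper that makes the guardrail call structural, and the prompts are examples of indirect phrasing rather than the actual eval set.

```python
GUARDRAILS = "guardrails"


def naive_router(question):
    """Stand-in for an LLM router that treats guardrails as optional:
    it only calls them when the question sounds safety-related."""
    q = question.lower()
    tools = []
    if any(word in q for word in ("safe", "brake", "tire")):
        tools.append(GUARDRAILS)
    tools += ["risk_model", "setup_notes"]
    return tools


def forced_first(router):
    """Wrap any router so the guardrail call is always first, exactly once."""
    def routed(question):
        chosen = [tool for tool in router(question) if tool != GUARDRAILS]
        return [GUARDRAILS, *chosen]
    return routed


# Example adversarial prompts: oblique and compound phrasings.
ADVERSARIAL_PROMPTS = [
    "Should I push harder this session?",
    "Am I safe?",
    "What gearing did I run here last time, and can I go again?",
    "Is the bike stopping as well as it did this morning?",
]


def guardrail_skips(router, prompts):
    """Return the prompts for which guardrails were not the first call."""
    skipped = []
    for prompt in prompts:
        trace = router(prompt)
        if not trace or trace[0] != GUARDRAILS:
            skipped.append(prompt)
    return skipped


print(len(guardrail_skips(naive_router, ADVERSARIAL_PROMPTS)))
print(len(guardrail_skips(forced_first(naive_router), ADVERSARIAL_PROMPTS)))
```

The wrapper does not make the router any smarter. It takes the safety check out of the set of things the router is allowed to get wrong.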

What I can't yet evaluate. End-to-end decision accuracy against ground-truth rider outcomes. The counterfactual is unobservable (I can't run the session both ways) and the negative class is catastrophic (I'm not labeling crashes). This is the structural limit of personal-scale safety AI: the eval ceiling is bounded by how willing you are to test the failure mode. The right answer is probably synthetic adversarial scenarios constructed from public motorsport incident data, which is on the roadmap.

Trade-offs

A few build decisions worth being explicit about:

Random Forest over logistic regression or gradient-boosted models. I wanted feature importances I could explain to myself, and I wanted robustness on a small synthetic dataset. RF gives both. XGBoost would likely score marginally better at the cost of interpretability and tuning time, a trade that isn't worth making at v1 scale.

Llama-3.3-70b on Groq specifically. The 20-minute paddock window is a hard latency budget. Groq's inference speed on Llama-3.3-70b lets me get a full orchestrated response back well inside the budget, including tool calls. A larger model on a slower provider would have been a better reasoner and a worse product.

ChromaDB over pgvector or BM25. Local, no infra, easy to iterate on chunking and metadata schemas. The right answer for prototype-stage RAG. If I scale this to multiple riders or longer note histories, pgvector probably wins on operational maturity, but I'd want a real reason to migrate.

Open-Meteo over a paid weather API. Free, accurate enough for track conditions, no API key management. The marginal accuracy of a paid provider doesn't change any recommendation the system would make.
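For reference, a keyless Open-Meteo request for current conditions looks roughly like this. It is a sketch under assumptions: the variables requested and the helper names are my choice, not necessarily what Crew Chief uses, and the coordinates in the usage line are an approximate placeholder for VIR.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OPEN_METEO_URL = "https://api.open-meteo.com/v1/forecast"


def build_weather_url(latitude, longitude):
    """Build a current-conditions request. No API key is required."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        # Which variables to request is an assumption for this sketch.
        "current": "temperature_2m,precipitation,wind_speed_10m",
        "temperature_unit": "fahrenheit",
    }
    # Keep commas unescaped so the variable list stays readable.
    return f"{OPEN_METEO_URL}?{urlencode(params, safe=',')}"


def fetch_current_conditions(latitude, longitude, timeout_s=5.0):
    """Fetch and return the 'current' block of the response as a dict."""
    url = build_weather_url(latitude, longitude)
    with urlopen(url, timeout=timeout_s) as response:
        return json.load(response)["current"]


# Approximate placeholder coordinates; needs network access to run:
# conditions = fetch_current_conditions(36.56, -79.21)
# print(conditions["temperature_2m"])
```

Passing the fetched values to the LLM as explicit context is what lets it talk about today's heat at all.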

Forced guardrail-first call rather than letting the LLM decide which tools to use. Tool-calling LLMs are good at routing, but "good" isn't a word that belongs near brake safety. Forcing the guardrails to run first removes an entire category of failure (router skips safety check for clever-sounding reasons) at the cost of a few wasted milliseconds. Trivially worth it.

What broke during the build

Four failures, each instructive:

The brake-pad incident. v0 was a single LLM with RAG. It cleared me to ride on brake pads below my own threshold because it had no concept of "this number must be checked against a constant." This is the failure that produced the compound architecture. Full story in Beyond Chatbots.

Random Forest overfitting on early synthetic data. First version of the synthetic data generator produced too-clean distributions. The RF learned patterns that were artifacts of how I'd generated the data rather than realistic noise. Cross-validation looked great; eyeballing the feature importances revealed nonsense. Fixed by adding realistic noise and inter-feature correlations to the generator and regenerating the training set.

RAG retrieving plausible but irrelevant notes. Early ChromaDB setup with default chunk size and no metadata filtering would return notes that matched the type of question but came from the wrong context: VIR notes for an NCBike question, dry-weather notes for a wet-weather question. The retrieval looked right semantically and was wrong operationally. Fixed by adding metadata filters (track name as a hard filter, weather range as a soft re-ranking signal) and tuning chunk size down so individual notes weren't getting blended.
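The hard-filter plus soft-re-rank idea can be shown without a vector store. In the sketch below the candidates, the similarity scores, the `air_temp_f` field, and the `weather_weight` penalty are all invented. In ChromaDB itself the hard filter would normally be pushed into the query through its `where` metadata argument rather than applied afterwards.

```python
def filter_and_rerank(candidates, track, air_temp_f, weather_weight=0.01):
    """Drop notes from other tracks (hard filter), then penalize notes
    written in very different temperatures (soft re-ranking signal)."""
    same_track = [c for c in candidates if c["track"] == track]

    def score(candidate):
        temp_gap = abs(candidate["air_temp_f"] - air_temp_f)
        return candidate["similarity"] - weather_weight * temp_gap

    return sorted(same_track, key=score, reverse=True)


# Invented candidates, as a semantic search might return them.
CANDIDATES = [
    {"id": "ncbike-hot", "track": "NCBike", "air_temp_f": 93, "similarity": 0.90},
    {"id": "vir-cold", "track": "VIR", "air_temp_f": 55, "similarity": 0.84},
    {"id": "vir-hot", "track": "VIR", "air_temp_f": 92, "similarity": 0.80},
]

ranked = filter_and_rerank(CANDIDATES, track="VIR", air_temp_f=94)
print([c["id"] for c in ranked])
```

On raw similarity the NCBike note would win. After the filter it is gone, and the hot-day VIR note outranks the cold-day one despite its lower similarity score.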

LLM router skipping the guardrail call when the question was indirect. If I asked "should I push harder this session?" instead of "am I safe?", the router would sometimes go straight to the RF and RAG without calling the deterministic guardrails first. The fix wasn't better prompting. It was architectural. The guardrails are now a forced first call regardless of question phrasing, not an optional tool the router chooses.

Each of these had the same shape: a probabilistic component doing something that looked right and was wrong in a way I'd only catch by checking. That pattern is what makes safety-critical AI hard, and what most teams underestimate when they ship.

What I'd do differently

Three things, in order of how much they'd change the system:

Start with real telemetry, even at low volume. The synthetic-data shortcut got the architecture validated faster, but I'm carrying technical debt I'll have to pay down before any production claim. If I were starting again, I'd capture five to ten sessions of real CAN-bus data first, even unprocessed, and let the data shape inform the synthetic generator rather than the other way around. The architecture would have come out the same. The data layer would be six weeks ahead.

Build the eval harness before the model. I trained the RF first and built the evaluation around it after the fact. Should have been the opposite. A pre-built eval harness with held-out sets, adversarial cases, and a clear acceptance bar would have caught the overfitting on synthetic data weeks earlier. This is also the generalizable lesson: most AI projects under-invest in evals because the model is more fun to build, and pay for it later.

Treat the LLM router as a product surface, not an implementation detail. The way the system phrases its answers matters as much as which tools it called. Early versions gave me technically-correct responses I'd ignore because they didn't match how I think about riding. The orchestration prompt is now a real artifact I version and iterate on, with explicit personas for "explain to a tired rider in the paddock" vs. "explain to me sitting at my desk reviewing data." Should have done this from the start instead of treating the LLM as a black box that produces words.

What this informed at SAS

The patterns from Crew Chief aren't theoretical. They've shaped how I think about and ship AI features in the AI-native decision intelligence work I lead at SAS.

Three specific carry-overs, at the level I can discuss publicly:

The four-layer pattern (deterministic, then ML, then live context, then generative, with the LLM as orchestrator) shows up in enterprise decisioning the same way it shows up at the track. A claims adjuster, a fraud analyst, a risk underwriter all have the same shape of problem: too much data, not enough time, asymmetric cost on confident wrong answers. The architectural answer is the same.

Forced first-call guardrails as a design pattern. In regulated decisioning, "the model should have checked X first" is the difference between a feature and a liability. Building the guardrail call as architecturally non-optional rather than prompt-suggested is a pattern I push on consistently.

Eval discipline as a product requirement, not an engineering nice-to-have. Crew Chief taught me that the eval ceiling is bounded by what you're willing to test. In enterprise AI, that ceiling is also bounded by what compliance, legal, and risk will sign off on. Treating evals as a first-class product artifact (versioned, reviewable, auditable) is what turns AI features from demoware into something shippable.

The fuller version of any of these is an interview conversation, not a public writeup. The through-line: personal projects where I own every decision are where I learn the patterns that show up in enterprise work weeks later.

Roadmap

Active now. Real CAN-bus capture from the Panigale, mapped field-by-field to the synthetic schema. Retrain the RF on real distributions. Audit the gap between synthetic and real performance.

Next. Physiological readiness integration. Pulling HRV, sleep, and training load from Garmin to extend the decision from "is the bike ready?" to "is the rider ready?" Treating this as a separate project (Rider Readiness) once Crew Chief v2 is on real data.

Eventually. Adversarial eval scenarios constructed from public motorsport incident data. The structural eval gap (can't test the catastrophic negative class with real rides) closes partially if I can construct realistic failure scenarios from other riders' published telemetry and crash reports. Not committed to a timeline.

compound-ai · safety-critical · llm · rag · chromadb · groq · scikit-learn · decision-intelligence · evaluation