The bot told me my brakes were fine. They weren't.
I'd been riding my Ducati Panigale at VIR all morning. Between sessions, I had twenty minutes in the paddock to decide whether to go back out. My pads were at 14%, below the safety threshold I'd set for myself. I asked the RAG bot I'd just built: "Am I safe for the next session?"
It said: "You're good to go! Have a fun session!"
I caught the mistake because I checked the raw data manually. But that moment (a probabilistic model casually clearing me to ride a 200-horsepower motorcycle on worn brakes) is what made me throw out the chatbot architecture and rebuild from scratch.
This is the story of that rebuild. It's also the reason I think most enterprise AI products are being built wrong.
The 20-minute gap
Every track day has the same rhythm. You ride a 20-minute session. You come back hot, adrenalized, exhausted. You have 20 to 30 minutes before you go out again, and in that window you have to make real decisions:
- Are my tires overheating, or did I just push them harder than usual?
- Is my brake pad wear actually critical, or am I being paranoid?
- Why did the rear end slide in Turn 3: pressure, temperature, or something I did wrong?
I have gigabytes of motorcycle telemetry, years of handwritten setup notes, and live weather conditions that shift by the hour. What I don't have is the cognitive bandwidth to be a data analyst between sessions.
So I built one. Or tried to.
Why "chatting" with data doesn't work
My first version was the obvious thing. Dump all my telemetry and notes into a vector store, wrap a Llama 3.3 70B model around it with a retrieval layer, and ask questions in natural language. The kind of architecture you can build in a weekend, and the kind every "AI for X" startup is shipping right now.
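In code, that first version was barely more than this sketch. I'm assuming a ChromaDB store here, and `llm_complete` stands in for whatever client wraps the Llama endpoint; the paths and collection names are illustrative:

```python
import chromadb

client = chromadb.PersistentClient(path="./telemetry_db")  # illustrative path
docs = client.get_or_create_collection("sessions")

def ask(question: str, llm_complete) -> str:
    # Retrieve the chunks most similar to the question...
    hits = docs.query(query_texts=[question], n_results=5)
    context = "\n".join(hits["documents"][0])
    # ...then stuff them into a prompt and let the model free-generate.
    return llm_complete(f"Context:\n{context}\n\nQuestion: {question}")
```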
It worked beautifully for soft questions. "How did my lap times compare to last month at VIR?" It would summarize, contextualize, even draw inferences I'd missed.
Then I asked it about brakes.
The pads were at 14%, below my 15% threshold. The data was right there in the context window. The LLM, trained to be helpful and conversational, hallucinated a "safe" status anyway. Not because the data was missing. Because the prompt wasn't explicit enough about the hard constraint, and the model defaulted to its disposition: be agreeable, be encouraging, be confident.
This is the core RAG limitation no one talks about enough: when you use a probabilistic engine for a deterministic decision, the model doesn't have a concept of "this is a number that must be checked against a threshold." It has a concept of "complete the sentence in a way that feels right." On every other question, those two things produced the same answer. On the brake question, they didn't. AI hallucination, in a safety-critical context, isn't a quirk; it's a disqualifier.
I'd designed a system that was right 95% of the time. On a motorcycle, 95% is the percentage that gets you hurt.
The shift: from model to system
The realization that changed everything: the LLM shouldn't be the decision-maker. It should be the router.
This is the core idea behind compound AI systems and modern AI agent architecture. Instead of asking a single model to do everything, you orchestrate specialized components (some probabilistic, some deterministic, some pulled from live data) and use the LLM only for what it's actually good at: interpreting natural language, calling the right tools, and explaining results back to a human. LLM orchestration, not LLM omniscience.
Here's the architecture I deployed:
Layer 1: Deterministic guardrails
For anything safety-critical, code beats AI. I wrote Python functions that check hard limits: brake life, tire age, oil temperature. If brake_life < 20% (a more conservative line than the 15% I'd been using), return "CRITICAL STOP." No probability, no interpretation, no room to disagree. The agent is forced to call these tools first, and their output overrides everything else.
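A minimal sketch of such a guardrail; the function and field names are mine, the 20% limit is the one above:

```python
def check_brake_life(brake_life_pct: float) -> dict:
    """Hard safety limit. The agent must call this first, and a
    CRITICAL result overrides whatever the other layers say."""
    if brake_life_pct < 20.0:
        return {"status": "CRITICAL STOP",
                "detail": f"Brake life at {brake_life_pct:.0f}%, below the 20% hard limit."}
    return {"status": "OK",
            "detail": f"Brake life at {brake_life_pct:.0f}%."}
```

The point is that the threshold lives in code, where it can't be paraphrased away by a model trying to be encouraging.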
Layer 2: Predictive analyst
For subtle issues that hard rules miss (a tire that's technically legal but performing badly), I trained a Random Forest classifier on my historical session data. It looks at telemetry patterns (traction control intervention frequency, lap-time consistency, lean angle distribution) and outputs a risk probability score. This is where probability is actually appropriate.
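Roughly what that looks like with scikit-learn; the feature and label column names are placeholders for my actual telemetry exports:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Per-session features aggregated from raw telemetry (names are placeholders).
FEATURES = ["tc_interventions_per_lap", "lap_time_stddev_s", "lean_angle_p95_deg"]

history = pd.read_csv("session_history.csv")           # hypothetical export
X, y = history[FEATURES], history["had_instability"]   # label: session went wrong

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X, y)

def risk_score(latest_session: pd.DataFrame) -> float:
    # Probability of the positive ("instability") class for the newest session.
    return float(model.predict_proba(latest_session[FEATURES])[0, 1])
```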
Layer 3: Real-time context
Tire pressure depends heavily on ambient temperature, and the LLM has no idea it's 94°F at the track today. I wired in the Open-Meteo API so the agent automatically pulls current conditions at the track's geolocation before answering anything setup-related.
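The weather tool is a plain HTTP call; Open-Meteo's forecast endpoint needs no API key. The coordinates below are approximately VIR's, and the choice of variables is illustrative:

```python
import requests

def track_conditions(lat: float = 36.56, lon: float = -79.21) -> dict:
    # Coordinates are roughly VIR's; swap in whatever track you're at.
    resp = requests.get(
        "https://api.open-meteo.com/v1/forecast",
        params={
            "latitude": lat,
            "longitude": lon,
            "current": "temperature_2m,wind_speed_10m,precipitation",
            "temperature_unit": "fahrenheit",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["current"]  # e.g. {"temperature_2m": 94.1, ...}
```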
Layer 4: Generative retrieval
For subjective questions like "how do I fix a rear slide at VIR turn 3?", the system queries a local ChromaDB vector store of my handwritten notes from previous track days. This is where RAG actually earns its place: opinionated, qualitative, historical knowledge.
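The retrieval tool itself is deliberately boring, something like this (path and collection names are illustrative):

```python
import chromadb

client = chromadb.PersistentClient(path="./track_notes")  # illustrative path
notes = client.get_or_create_collection("setup_notes")

def recall_notes(question: str, k: int = 3) -> list[str]:
    # Nearest-neighbor search over embedded note chunks; purely qualitative.
    hits = notes.query(query_texts=[question], n_results=k)
    return hits["documents"][0]
```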
The LLM sits above all four layers, deciding which to call based on the question, then synthesizing the results into one human-readable answer. This is the difference between deterministic and probabilistic AI working together vs. asking one model to be both.
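Stitched together, the routing looks roughly like this; it reuses the helpers sketched above, and `call_llm` is a stand-in for any chat client with tool calling:

```python
def answer(question: str, telemetry: dict, call_llm) -> str:
    # Layer 1 always runs first and can veto the entire conversation.
    guard = check_brake_life(telemetry["brake_life_pct"])
    if guard["status"] == "CRITICAL STOP":
        return f"{guard['status']}: {guard['detail']}"

    # For brevity this sketch queries every layer; in practice the LLM's
    # tool-calling decides which layers a given question actually needs.
    evidence = {
        "guardrails": guard,
        "risk": risk_score(telemetry["latest_session"]),
        "weather": track_conditions(),
        "notes": recall_notes(question),
    }
    # The LLM only synthesizes: every claim must trace back to a layer.
    return call_llm(
        f"Question: {question}\nEvidence: {evidence}\n"
        "Answer using only this evidence and say which layer each claim came from."
    )
```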
When I ask "am I safe?" now, the response looks like this:
Brakes are healthy (85%). ML model detects 72% risk of instability based on traction control patterns from the last session. Track temp is 94°F and rising. Recommend dropping rear tire pressure by 1 psi and watching for slides in T3. Proceed with caution.
That's actionable. It's traceable. Every claim in that response came from a specific layer, and I can audit which one if I want to. The LLM didn't decide anything; it organized decisions that other components made.
Why this matters outside the racetrack
The 20-minute gap isn't a motorcycle problem. It's everywhere in enterprise AI architecture.
A claims adjuster has thirty minutes to decide on a fraud case. A trader has five seconds to act on a signal. A doctor has fifteen minutes to evaluate a patient with three concurrent conditions. In each case, the person has too much data, not enough time, and consequences that are real.
And in each case, the wrong architectural answer is the same one I started with: dump everything into a vector store, ask the LLM to figure it out, hope the answer is right.
The right answer is what I ended up with: separate the deterministic from the probabilistic, route deliberately, use the LLM as an orchestrator rather than an oracle.
The production AI systems that survive the next two years won't be the ones with the biggest models. They'll be the ones where the team understood that the model is one component among several, and built the system around it accordingly.
The brake pads taught me that.