Reducing costs without losing the thread
Mark Dixon · April 13, 2026

A coaching session is often long. Forty-five minutes, seventy exchanges, sometimes more. And with an AI coach, whose patience is infinite, a session can stretch even further as you return to it day after day to keep working on a problem - the equivalent of hours of coaching.
This creates a problem you can’t dodge: if you keep the entire conversation, every word exchanged, in the LLM context for every single conversational turn, then each turn re-sends everything before it. The cumulative cost grows quadratically with session length, and the user ends up paying for an assistant that is technically remembering everything but practically drowning in it.
So you “compress” the older parts of the conversation somehow, either by dropping messages, summarising them, or some mix of the two. This is what everyone working with generative AI does. The interesting question is not whether to compress but what to compress and how, and that question turns out to have a philosophical answer before it has a technical one.
There is a phrase buried in the summarisation prompt for Friday (the AI creativity coach at the heart of Revontale) that gives the whole game away: the coaching-relevant state. Three words, slightly cryptic, but they contain everything I want to say in this post. So I will come back to them.
Coding is different
AI coding assistants need near-perfect recall. If Claude Code forgets the signature of a function defined two thousand lines up, the next suggestion breaks. You cannot summarise handleUserAuthCallback as “the auth thing” and hope the generated code still compiles. Code is a formal system, which means a variable name is that variable name and nothing else, and paraphrasing is corruption. Long-context fidelity is not a nice-to-have for coding tools. It is load-bearing.
Coaching is not like that at all. And that was our exciting challenge - working out how to solve this problem from a coaching perspective, not a coding-agent or chatbot perspective.
A good human coach does not remember every word you said forty minutes ago. They remember where you were, where you have been heading, what you have already ruled out, what lit you up. The conversational transcript is a means. The trajectory is the thing. And when a coach reflects something back to you, they are almost always paraphrasing... and the paraphrase is often an improvement on what you said, because it distills. Natural language is not corrupted by good paraphrasing. It gets sharper.
That is the opening we designed around. Coding cannot tolerate lossy memory because code is formal. Coaching can tolerate lossy memory (in fact it requires it) because coaching is about extracting meaning, and meaning survives paraphrasing.
What Friday actually does
Here is the mechanism, as concretely as I can put it without turning this into a (boring) technical README.
Friday always keeps the most recent forty messages in context verbatim. Nothing clever happens until your conversation crosses sixty messages back and forth between you and your AI coach. At that point, the oldest twenty messages get handed to a small, cheap model (not the main coaching model but one that is optimised for summarisation) in the background. This cheaper LLM produces a short narrative summary of that chunk. Twenty messages later, the next oldest twenty get their own summary. And so on, all the way through the session no matter how long it gets.
The key design choice (and the one I am most pleased with) is that old summaries never get rewritten or replaced. They are append-only. When chunk three (messages 41-60) needs summarising, it is generated fresh from its own twenty messages, and chunks one and two stay untouched.
This matters more than it sounds. The alternative (a single running summary that gets rewritten every so often) would mean the beginning of a long session gets compressed, then re-compressed, then re-re-compressed, losing fidelity at every pass. By the end of a two-hour session, the first twenty minutes would be a blurry rumour of themselves. With append-only chunks, the first twenty minutes stay exactly as sharp as the middle twenty.
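The whole scheme fits in a few lines. Here is a minimal sketch of the append-only chunking described above - the function and variable names are mine, not Friday's actual code, and `summarise` stands in for the call to the cheap summarisation model:

```python
CHUNK = 20    # messages per summary chunk
TRIGGER = 60  # summarise once the verbatim tail grows past this

def build_context(messages, summaries, summarise):
    """Append-only compression: each old chunk is summarised exactly once
    and never rewritten, so early summaries keep their original fidelity."""
    start = len(summaries) * CHUNK  # first message not yet covered by a summary
    while len(messages) - start > TRIGGER:
        chunk = messages[start:start + CHUNK]
        summaries.append(summarise(chunk))  # cheap model, run in the background
        start += CHUNK
    # stable prefix of summaries + verbatim tail of recent messages
    return summaries, messages[start:]
```

Note the invariant: a summary, once written, is never touched again. The verbatim tail shrinks to forty-one messages immediately after a chunk is summarised, then grows back toward sixty before the next chunk is cut.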
That is closer to how good human memory actually works, or rather, how good coaching memory works. A human coach who has been with a client for years does not remember everything, but the things they do remember from the first session are not more degraded than the things they remember from last week. Salience is not a function of age.
What the summariser is told to keep
This is where the philosophy stops being an idea and becomes an instruction.
The summariser prompt does not say “compress this conversation.” It states the goal as preserving “essential context for future conversation continuity, especially the user’s perspective and the coaching-relevant state they are in.” And there it is, the phrase from the top of this post. It is not “what was said,” it is not “what was decided,” it is the state the user is in, as a person being coached, at this point in the arc.
The prompt then lists what to include: key topics and the user’s substantive answers, any central tension or dilemma the user is grappling with, decisions and clarity gained, user-stated pressures and sources of uncertainty. And what to exclude: verbatim dialogue, assistant-led explanations, speculative psychological interpretation. The instruction here is explicit: “prioritise the user’s perspective, intent, and language over the assistant’s questions or framing.”
This is NOT a description of what a database should remember about a conversation. It is a description of what a coach should remember about a client. The summariser is being asked to take coaching notes, not to compress text.
And there is one more line in the prompt worth quoting, because it could almost be the tagline for this whole article: “capture the essence and continuity value, not exhaustive detail.”
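Stitched together, the shape of that instruction looks something like this. To be clear, this is a paraphrased sketch assembled from the fragments quoted above, not the production prompt:

```python
# A paraphrased sketch of the summariser instruction, assembled from the
# fragments quoted in this post -- not Friday's production prompt.
SUMMARISER_PROMPT = """\
Summarise the following chunk of a coaching conversation. Preserve the
essential context for future conversation continuity, especially the
user's perspective and the coaching-relevant state they are in.

Include: key topics and the user's substantive answers; any central
tension or dilemma the user is grappling with; decisions made and
clarity gained; user-stated pressures and sources of uncertainty.

Exclude: verbatim dialogue; assistant-led explanations; speculative
psychological interpretation.

Prioritise the user's perspective, intent, and language over the
assistant's questions or framing. Capture the essence and continuity
value, not exhaustive detail.
"""
```

The point of seeing it laid out is that every line is a coaching judgement, not a compression heuristic: keep the client's own words and dilemmas, drop the coach's scaffolding.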
A real example
To make this concrete and delightfully meta: my co-creator of Revontale, Frida, had a single session with Friday about the Revontale project that spanned about ten days and over a hundred messages. It started as a copywriting brainstorm, wandered through choosing a typeface for a logo, drafted a blog post for an external publication, and ended with a LinkedIn post in a completely different voice.
By the time she was on message eighty, the original copywriting brainstorm had been compressed into something like this: “She sought to move away from literal depictions of LLM chat interfaces, which she felt failed to capture the magic of the creative expansion the tool facilitates. She defined the core value as movement and flow, positioning the assistant as a work friend and coach that challenges rather than merely answers.”
All the specific slogan draft iterations are gone. The debates over individual word choices are gone. But the direction of her thinking is preserved, and when she came back later asking about fox symbolism for branding, Friday could connect the dots without needing to re-read the original messages. It was continuity of trajectory, not continuity of transcript.
That is the whole argument in one session.
What this saves
I have deliberately not put a percentage on any of this, because the honest answer depends enormously on session length and the exact pattern of usage. What I can say is that without compression, a two-hour coaching session would be priced like a deep-research assistant, not a coaching tool. With compression, it stays in a range where we can offer it as a daily practice rather than a premium occasional thing. That matters for Revontale, because Friday is meant to be used often, not rationed sparingly.
(There is also a second layer of savings from Anthropic’s prompt cache, which can still kick in because the compressed prefix stays stable for about twenty turns at a time.)
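The cache works because of how the request is assembled: the system prompt and the append-only summaries form a prefix that only changes once every twenty messages, so a cache marker on the final summary block covers the whole prefix. A sketch of what that request shape might look like with Anthropic's Messages API follows - the `cache_control` field is from their prompt-caching documentation, while the function, model choice, and text framing are my own illustration:

```python
def build_request(system_prompt, summaries, recent_messages):
    """Assemble a request whose prefix (system prompt + summaries) stays
    byte-stable between chunk boundaries, so the prompt cache can reuse it."""
    summary_text = "\n\n".join(summaries)  # append-only: never rewritten
    return {
        "model": "claude-sonnet-4-5",  # hypothetical model choice
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": system_prompt},
            {
                "type": "text",
                "text": f"Earlier in this session:\n{summary_text}",
                # cache everything up to and including this block; the
                # marker only moves when a new chunk summary is appended
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": recent_messages,  # the verbatim tail, changes every turn
    }
```

Between chunk boundaries, only `messages` changes from turn to turn; everything under `system` is identical bytes, which is exactly what a prefix cache wants.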
Architecture is philosophy made concrete
Every engineering choice is an implicit claim about what matters. Claude Code’s long-context fidelity is a claim that every token of your codebase could matter at any moment, and it is right for them to make that claim because code actually works that way.
Friday’s compression is a different claim. It says coaching is about where you are going, not what you literally said. And the summariser’s instructions (preserve the user’s language, drop the assistant’s explanations, keep the arc, let the transcript go) are where that claim stops being abstract and starts being something you can run on a server.
Slow AI, the philosophy behind Revontale, is not just a tone. It is a set of choices about what to remember and what to let go. This is one of them.
Ready to meet your creative coach?
Friday helps you find your voice, develop your ideas, and bring your creative work to life — through conversation.
