Tool Limitations
In Part 1, we used LLMs to convert natural language requests into commands. That’s a nice improvement over having to enter slash commands, but the current tool still has two major limitations:
- No flexible editing: There’s no way to update or delete an existing item using natural language. When referring to a todo, you have to use its list position (e.g., “the second item”, “1”), instead of describing it naturally (“Delete the milk reminder”, “Make filing taxes high priority”).
- No memory of the conversation: The AI treats every request in isolation – every command is a clean slate. It doesn’t remember what was said earlier, so you can’t build on prior context (e.g., “Move that task to tomorrow”, “Put this in the same category as the other one”, “I no longer need to work on that. Delete it.”).
Why Does This Matter?
When you present a chat interface, users naturally expect a fluid, ongoing conversation. The moment they realize that isn’t possible, they’ll likely exit the tool in frustration.
What We’re Building
To overcome these frustrations and make the experience feel conversational, we’ll apply a healthy dose of “prompt engineering”. This means adding more information and examples to the system prompt. Let’s get started – you can follow along with the code on GitHub.
Flexible Editing: Including The Todo List
The first step in making our app conversational is giving the AI access to the actual todos. Up until now, the only way to reference an item was by its number in the list, which is clunky. Instead, we want to let users reference todos naturally—using part of the title, the whole title, or even a close variation.
To do this, we include the formatted list of todos in the system prompt. That way, when you say something like “delete the Amazon package task”, “delete the last item”, or “I’m done enrolling in FSA”, the AI can match your request against the todo list it has in context.
On the implementation side, this looks like formatting the todo list for prompt injection (code here):
```typescript
private formatTodosForContext(context: any): string {
  if (!context.currentList || !context.currentList.todos || context.currentList.todos.length === 0) {
    return '- No todos in current list';
  }

  const todos = context.currentList.todos.slice(0, 20); // Limit for context size

  return todos.map((todo: any, index: number) => {
    const status = todo.completed ? '✓' : ' ';
    const priority = todo.priority ? ` [${todo.priority}]` : '';
    const dueDate = todo.dueDate ? ` (due: ${new Date(todo.dueDate).toISOString().split('T')[0]})` : '';
    const categories = todo.categories && todo.categories.length > 0 ? ` {${todo.categories.join(', ')}}` : '';
    return `  ${index + 1}. [${status}] ${todo.title}${priority}${dueDate}${categories}`;
  }).join('\n');
}
```
And here’s how it gets embedded into the system prompt:
```
Current Todos in "${context.currentList?.name || 'No List'}":
${this.formatTodosForContext(context)}
```
Since the last post, we’ve also hooked up all the remaining commands and added examples directly to the system prompt. For completion requests, Claude generated examples of how a user might phrase them. The model is explicitly instructed to check the todo list above and match the statement to the right item number.
```
IMPORTANT: For completion requests, look at the current todos above and match the user's statement to the specific todo number. Consider:
- Exact title matches: "I returned the Amazon package" matches "return an Amazon package"
- Partial matches: "went to dentist" matches "go to the dentist"
- Past tense variations: "bought groceries" matches "buy groceries"
- Action completion: "finished the report" matches "finish the report"
```
Traditionally, we format data for display, storage, or passing into another function or process. Now, formatting for LLM context has become just as common. String interpolation is a simple and convenient method for prompt construction. (These days, even Java has string interpolation – actually wait, never mind.)
One detail Claude Code added on its own was a cap of 20 todos in the context. That’s a trade-off: keeping prompts small vs. ensuring the model has access to the full list. In practice, this limit should be configurable depending on how long your lists usually get.
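To give a feel for what “configurable” could look like, here’s a minimal sketch where the cap comes from a config object rather than being hard-coded (the `maxTodosInContext` option is hypothetical, not part of the project’s code):

```typescript
// Hypothetical config knob for how many todos we expose to the LLM.
interface AIContextConfig {
  maxTodosInContext: number;
}

const defaultAIContextConfig: AIContextConfig = { maxTodosInContext: 20 };

function sliceTodosForContext(
  todos: unknown[],
  config: AIContextConfig = defaultAIContextConfig
): unknown[] {
  // A larger cap gives the model full visibility into long lists;
  // a smaller cap keeps the prompt (and token cost) down.
  return todos.slice(0, config.maxTodosInContext);
}
```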
Chat Mode: Adding Conversation History
Now that the model can match todos by title, the next step is making the interaction feel conversational. A single prompt isn’t enough for natural conversation. If the model can’t remember what was said a moment ago, you end up either repeating yourself or losing context. To make chat mode useful, the LLM needs a short “memory” of prior messages so it can resolve things like “move that to tomorrow”, “mark those as done”, or “delete the one we discussed”. The idea is simple: store a short window of recent messages and feed it into the system prompt.
Recording Conversation History
In chat mode, our tool stores both user and assistant messages so we can reconstruct the back-and-forth. (See the full code here)
- What we store: role + message (e.g., `User: …`, `Assistant: …`).
- Why both roles: assistant replies often confirm which items were acted on (useful when the user later says “undo that” or “do the same for the other one”).
- How much: we cap the array to `maxConversationHistory` and trim from the front (oldest first).
- What we pass to the prompt: a short slice (e.g., the last 3 messages) to balance token length against recency.
This is a simple sliding window. If you find that users rely on earlier context, increase the window and/or how much of it goes into the prompt; if costs creep up, reduce it, or try passing only the user messages to the model. For task management, long discussions are probably rare.
```typescript
const response = await openAIService.chatWithAI(message, context, conversationHistory);

// DEBUG: Show request type detection and raw response
console.log('🔍 DEBUG INFO:');
console.log(`   Request Type Detected: ${response.action || response.type || 'unknown'}`);
console.log(`   Raw LLM Response: ${JSON.stringify(response, null, 2)}`);
console.log('');

// Add to conversation history
conversationHistory.push(`User: ${message}`);

if (response.action === 'conversational') {
  console.log(`🤖 ${response.message}\n`);
  conversationHistory.push(`Assistant: ${response.message}`);
} else {
  // Handle structured command response
  await handleParsedCommand(response);
  conversationHistory.push(`Assistant: Executed ${response.action} command`);
}

// Keep conversation history manageable
while (conversationHistory.length > maxConversationHistory) {
  conversationHistory.shift();
}
```
Include Recent Messages
Now we can take the three most recent messages and append them under a Recent Conversation section in a new “chat mode” prompt, alongside the current list, today’s date, and the formatted todos. (The prompt is only partially shown here.)
```typescript
getChatSystemPrompt(context: any, conversationHistory: string[] = []): string {
  if (!this.isConfigured()) {
    return 'AI is not configured. Set OPENAI_API_KEY to use chat mode.';
  }

  const historyContext = conversationHistory.length > 0
    ? `\n\nRecent Conversation:\n${conversationHistory.slice(-3).join('\n')}\n`
    : '';

  return `You are a helpful AI assistant for a todo list application. Help users manage tasks, provide insights, and suggest command sequences for complex operations.

Current Context:
- Current list: ${context.currentList?.name || 'None'}
- Available lists: ${context.availableLists?.map((l: any) => `${l.name} (${l.todos.length} items)`).join(', ') || 'None'}
- Today's date: ${new Date().toISOString().split('T')[0]}

Current Todos in "${context.currentList?.name || 'No List'}":
${this.formatTodosForContext(context)}${historyContext}
```
Action Type: Conversational vs Command
To take advantage of this, we’ve introduced a new “chat mode”. Unlike the earlier “command-only” approach, chat mode allows the model to return either a structured JSON command (for actions like adding or completing todos) or a free-form conversational reply. When the response can’t be parsed as JSON, the app interprets it as a new “conversational” action. This is how the AI can answer user questions, provide summaries, or simply acknowledge requests in natural language—without trying to force everything into a command (code here).
```typescript
async chatWithAI(message: string, context: any, conversationHistory: string[] = []): Promise<any> {
  if (!this.client) {
    throw new Error('OpenAI client not configured. Please set OPENAI_API_KEY environment variable.');
  }

  const systemPrompt = this.getChatSystemPrompt(context, conversationHistory);

  try {
    const response = await this.client.chat.completions.create({
      model: 'gpt-4.1',
      messages: [
        { role: 'system', content: systemPrompt },
        { role: 'user', content: message }
      ],
      max_tokens: 600,
      temperature: 0.3
    });

    const content = response.choices[0]?.message?.content;
    if (!content) {
      throw new Error('No response from OpenAI');
    }

    // Try to parse as JSON first; if that fails, return as conversational text
    try {
      return JSON.parse(content);
    } catch {
      // Return as conversational response
      return { action: 'conversational', message: content };
    }
  } catch (error) {
    console.error('OpenAI chat error:', error);
    throw new Error(`AI chat failed: ${error instanceof Error ? error.message : 'Unknown error'}`);
  }
}
```
Below are the relevant parts of the prompt that decide between conversational or command, and examples to guide the LLM:
```
SIMPLE DECISION RULE: Does the user's message indicate ANY change to todo status or data?

**YES → Return JSON command to execute the change**
This includes:
- Explicit commands: "complete task 1", "mark as high priority", "add new task"
- Implicit statements: "I bought X", "finished Y", "already did Z", "got the groceries"
- Reference-based: "mark those done", "complete what we discussed"
- Status updates: "I've done the report", "actually completed that yesterday"

**NO → Return conversational response**
This includes:
- Information requests: "which todos mention X?", "what's due today?"
- Analysis questions: "how many tasks left?", "show me completed items"
- Strategy questions: "what should I focus on?", "how to organize?"
```
Here are examples of information requests (conversational), and some additional instructions:
```
**Information Requests (→ Conversational Response):**
- "Which todos mention Riya?" → "2 todos mention Riya: 'buy Riya a new lunch bag' and 'Register Riya for belt test'"
- "What's due today?" → "3 tasks are due today: [list specific tasks]"
- "How many tasks left?" → "You have 4 incomplete tasks: [list them]"
- "What should I focus on?" → "Based on priorities and due dates, I recommend..."

**CRITICAL INSTRUCTIONS:**
1. **Confidently resolve references** using conversation history - if context is clear, execute directly
2. Apply the simple decision rule: Change request = JSON, Information request = Conversational
3. For implicit completions like "I bought X", find the matching todo and mark it complete
4. NEVER return JSON as text in conversational responses - execute it or don't mention it

**Reference Resolution Confidence:**
- If previous message mentioned specific todos and user says "those", "them" → directly act on those todos
- Only ask for clarification if the reference is genuinely ambiguous
- "Change due date for those" after listing high priority todos → directly edit those high priority todos

Use both todo data and conversation history for accurate responses.`;
}
```
Without these instructions, the model will either hallucinate JSON that doesn’t map to any existing command (i.e., it makes up new commands), reply that it cannot fulfill the request, or respond with free-form text when we need structure.
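One cheap way to contain hallucinated commands is to validate the parsed action before executing it. Here’s a minimal sketch (the `KNOWN_ACTIONS` set and fallback wording are illustrative, not taken from the repo):

```typescript
// Illustrative guard: downgrade unknown actions to a conversational reply
// instead of letting a made-up command reach the command handler.
const KNOWN_ACTIONS = new Set([
  'add_todo', 'complete_todo', 'edit_todo', 'delete_todo', 'conversational'
]);

function validateParsedCommand(response: any): any {
  if (!response?.action || !KNOWN_ACTIONS.has(response.action)) {
    return {
      action: 'conversational',
      message: "I couldn't map that request to a known command. Could you rephrase it?"
    };
  }
  return response;
}
```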
Chat Mode: Resolving Pronouns and Indirect References
One of the biggest steps toward natural interaction is teaching the model how to resolve pronouns like it, that, those, or them. Without this ability, users are forced to restate full todo titles in every request—an unnatural way to talk.
In the **Reference Resolution Confidence** section shown above, we added explicit guidance for reference resolution. The idea is to push the model to confidently act when the context is clear, instead of asking the user to clarify. This section of the prompt specifies the following:
- If the previous message mentioned specific todos and the user says those or them → directly act on those todos.
- Only ask for clarification if the reference is genuinely ambiguous.
- For example: “Change due date for those” after listing high-priority todos should trigger an edit of those high-priority items immediately.
We also tell the model to use both the current todo data and the conversation history to disambiguate references. This way, the assistant can correctly interpret requests that hinge on what was just said.
Model Behavior in Practice
When I first tried this with gpt-3.5-turbo, it sometimes hesitated or asked for clarification even after I made the prompt instructions more explicit. I then switched to gpt-4.1, which just worked, often resolving pronouns correctly without elaborate prompt scaffolding. I’m now experimenting with gpt-4o-mini, curious to see if it balances accuracy with lower cost. If it holds up, it may become the default for chat mode as well. There’s nothing preventing us from using different models for different use cases.
The “Aha” Moment
The breakthrough here was realizing that pronoun resolution makes the app feel conversational in a very human way. Once you can say things like “Move those to next week” or “Delete that one” without spelling out the full title, the tool stops feeling like a command line and starts feeling like a real chat partner.
Chat Mode: Unintended Side Effect
When we added “chat mode,” we expected it to make the app more conversational. What we didn’t anticipate was a new side effect: the ability to ask the AI general questions that have nothing to do with todos—and then turn those answers into todos.
For example, you might type:
- “What do I need to pack for a beach vacation?”
The assistant can respond with a list of suggestions, which you could then promote into actual todo items. In other words, chat mode doubles as a true assistant or idea generator. Below is a recording of this behavior:
This opens the door to something bigger. Instead of only managing what you explicitly type in, the LLM could become a proactive todo builder—drawing from additional data sources. Imagine:
- Calendar integration – The model sees an event on Friday and suggests: “Prepare slides for Friday’s meeting”.
- Tool calls / MCP – The assistant queries external services, then translates results into todos or deadlines.
- Contextual due dates – If it knows a bill is due on the 15th, it can propose a reminder automatically.
What started as a way to make the assistant remember conversation history has turned into a foundation for a richer workflow: one where todos don’t just come from you, but can also come from your apps, your schedule, and even the AI’s own reasoning.
Prompt Inflation
As we’ve expanded features, the system prompt has ballooned in size. Every new capability, whether pronoun and reference resolution or additional command examples, adds another block of instructions. This is natural, but it introduces a few challenges:
- Maintainability – At some point, a giant monolithic prompt becomes unmanageable: it’s hard to see which parts are still relevant, which instructions are redundant, and where new guidance should go. A better approach is to break the prompt into smaller, reusable components. Think of it as a “prompt builder” pattern that assembles the right pieces depending on the context (see the sketch after this list).
- Version Control – Prompts should be versioned just like code. Track changes, commit them, and roll back if needed. This gives you a clear history of what modifications were made and why. More importantly, you can branch and experiment with different prompt variations without losing your baseline.
- Testing and Evaluation – The only way to know if prompt changes help (or hurt) is to test them. Ideally, you’d have a suite of representative test cases (sample inputs and expected outputs) to run each new prompt against. Even a small test suite is better than nothing. Over time, you can grow coverage and track a key metric: what percentage of test cases still pass after a change.
- Knowing When to Stop – Sometimes adding more instructions is not the answer. Bloated prompts can overwhelm the model, introduce contradictions, or push you past token limits. Instead of inflating further, you may need to simplify, or even restructure how context is delivered (for example, by shifting some logic into tool calls).
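To make the “prompt builder” idea concrete, here’s a minimal sketch of assembling a prompt from named, reusable sections (the names and structure are hypothetical, not from this project):

```typescript
// Hypothetical prompt-builder: each section is a small, testable piece.
type PromptSection = { name: string; render: (ctx: any) => string };

const decisionRules: PromptSection = {
  name: 'decision-rules',
  render: () => 'SIMPLE DECISION RULE: Does the message change any todo data? ...'
};

const todoContext: PromptSection = {
  name: 'todo-context',
  render: (ctx) => `Current Todos:\n${ctx.formattedTodos}`
};

function buildPrompt(sections: PromptSection[], ctx: any): string {
  // Assemble only the sections relevant to this request type.
  return sections.map((s) => s.render(ctx)).join('\n\n');
}

// Chat mode gets rules + todos; a parse-only mode might use a different mix.
const chatPrompt = buildPrompt([decisionRules, todoContext], {
  formattedTodos: '1. [ ] buy milk'
});
```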
Prompt evaluation and tooling are big topics on their own and beyond the scope of this post. But if you’re planning to grow your app beyond a prototype, it’s worth thinking early about how you’ll manage, version, and test your prompts, because “prompt inflation” will happen sooner than you expect.
Prompt Caching
As prompts grow larger (>= 1024 tokens), they become eligible for OpenAI’s automatic prompt caching. This feature reduces latency and cost by caching the static portion of a prompt on the server, so repeated requests with the same prefix do not have to be reprocessed in full.
The key detail is that cache hits depend on prefix matching: the cached portion must appear at the very start of the prompt. Our current prompts unfortunately start with dynamic content (today’s date, the current todo list, recent messages), which means they miss the cache every time. To fix this, we can reorganize our prompts so that the static content comes first: instructions, decision rules, and examples stay fixed at the top, while the dynamic context is appended at the end. You can see the changes on a separate branch, here. With this structure, the long shared prefix remains stable across requests, making it far more likely to hit the cache.
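As a rough illustration of the reordering (a sketch, not the actual diff on that branch), the change is simply to interpolate dynamic values at the bottom of the template instead of the top:

```typescript
// Sketch: static instructions first (a stable, cacheable prefix),
// dynamic context appended last so it never breaks the prefix match.
function buildCacheFriendlyPrompt(context: any, historyContext: string): string {
  return `You are a helpful AI assistant for a todo list application.

SIMPLE DECISION RULE: ... (static rules and examples, identical across requests)

--- DYNAMIC CONTEXT ---
Today's date: ${new Date().toISOString().split('T')[0]}
Current list: ${context.currentList?.name || 'None'}
${historyContext}`;
}
```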
prompt_cache_key
As you can see in the commit changes, OpenAI also supports a prompt_cache_key parameter, which helps group related prompts under the same cache prefix. We added values like “todo-parse-v1” for natural-language parsing and “todo-chat-v1” for chat mode. This ensures multiple requests that share the same static prefix can route to the same cached result. Cache hits are not guaranteed, but they may provide a latency improvement for repeated queries.
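For reference, passing the parameter on a chat completion call looks roughly like this (a sketch; the surrounding values are placeholders):

```typescript
// Group requests that share the same static prefix under one cache key.
const response = await client.chat.completions.create({
  model: 'gpt-4.1',
  messages: [
    { role: 'system', content: systemPrompt }, // static prefix first
    { role: 'user', content: message }
  ],
  prompt_cache_key: 'todo-chat-v1' // bump the version when the prompt changes
});
```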
JSON Schema
In the previous post, I mentioned that we would switch to using structured outputs, which requires a JSON Schema definition. This schema tells the LLM to adhere to a well-defined format, giving us back a type-safe object. You can follow along with the code here.
To enable this, you need to:
- Define a JSON Schema document.
- Set `response_format` to `{ type: "json_schema" }`.
- Pass the schema along with the request.
We can use Zod for this end-to-end:
- It provides TypeScript validation at runtime.
- It can generate the JSON Schema that OpenAI requires.
- The OpenAI library provides a helper function to produce the OpenAI-specific response_format structure directly.
As a sanity check, I had Claude Code generate a raw JSON Schema first. This step wasn’t strictly required, but it let me visually validate the shape before having Claude generate the Zod code (todoCommandSchema.ts). From that Zod schema, we can output both the OpenAI format and use it locally for validation.
One wrinkle: OpenAI’s API does not currently support `anyOf` (discriminated unions at the root) or `optional` properties. To work around this, Claude had to flatten the Zod object structure and convert `optional` fields into `nullable` ones, which resulted in a simpler schema (todoCommandSimple.ts).
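To illustrate the pattern end-to-end (a simplified sketch, not the project’s actual todoCommandSchema.ts), here’s a flattened Zod schema with nullable-instead-of-optional fields, fed through the SDK’s `zodResponseFormat` helper:

```typescript
import OpenAI from 'openai';
import { zodResponseFormat } from 'openai/helpers/zod';
import { z } from 'zod';

const client = new OpenAI();

// Flattened schema: one object with an action discriminator and
// nullable (not optional) fields, per the API's current constraints.
const TodoCommand = z.object({
  action: z.enum(['add_todo', 'complete_todo', 'delete_todo', 'conversational']),
  title: z.string().nullable(),      // .nullable() instead of .optional()
  todoNumber: z.number().nullable(),
  message: z.string().nullable()     // used by the 'conversational' action
});

async function parseCommand(message: string) {
  // zodResponseFormat() produces the json_schema response_format payload.
  const response = await client.chat.completions.create({
    model: 'gpt-4.1',
    messages: [{ role: 'user', content: message }],
    response_format: zodResponseFormat(TodoCommand, 'todo_command')
  });
  // Validate locally with the same schema we sent to the API.
  return TodoCommand.parse(JSON.parse(response.choices[0].message.content ?? '{}'));
}
```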
Time for Some Excedrin
When I switched to structured outputs with JSON Schema, things started breaking in new ways:
- Conversational detection stopped working. Because the API always forces a valid JSON response in this mode, the old fallback of “if it’s not JSON, treat it as conversational” never triggered.
- Calls to the LLM became noticeably slower, though I haven’t measured the latency precisely yet.
- Workaround: I added a new `conversational` command type. That restored detection, but the model’s conversational replies became less fluid and more error-prone. I wonder if that’s because it’s trying to form a JSON response, forcing it to lose some richness. Admittedly, I didn’t spend much time researching this.
Other issues cropped up as well:
- `add_multiple_todos` vs. `add_todo`: The LLM often outputs a sequence of individual `add_todo` commands instead of a single `add_multiple_todos`. Worse, when it does attempt `add_multiple_todos`, the JSON is frequently malformed. Since a sequence of `add_todo` commands is simpler and already supported, I removed `add_multiple_todos` entirely. You can see that code here. (I believe Claude Code initially created this command type, but when we specified command sequences, it created a new command type and did not remove `add_multiple_todos`. Coding agents can very quickly create vestigial or dead code!)
- Conversation history depth too short: With only the last three messages in history, context quickly falls out of scope when errors occur or the conversation bounces back and forth. That leads to the model losing track of references it should remember.
At this point, I’m not ready to merge structured outputs into main. One alternative to try: first call without structured outputs to decide whether a request is conversational, and if not, make a second call with structured outputs enabled. For now, though, I’m leaning toward avoiding structured outputs in this app.
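Sketched out, that two-call approach might look like the following (hypothetical helper names; this is the untested alternative, not code from the repo):

```typescript
// Hypothetical two-pass flow: a plain call decides conversational vs. command,
// and only command requests pay for a second, structured-output call.
async function handleMessage(message: string, context: any): Promise<any> {
  // Pass 1: no structured outputs, so free-form text still falls through
  // to the 'conversational' action.
  const first = await openAIService.chatWithAI(message, context);
  if (first.action === 'conversational') {
    return first;
  }
  // Pass 2: re-ask with structured outputs for a type-safe command object.
  // parseCommandStructured() is assumed, not an existing function in the repo.
  return openAIService.parseCommandStructured(message, context);
}
```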
Ummm… Where’s That Excedrin?
The second headache: model behavior. Results vary widely depending on which GPT variant I use:
- gpt-4.1: Works the most consistently, especially in chat mode. But it’s expensive — about 13x the cost of gpt-4o-mini.
- gpt-4.1-mini: Performs better than 4o-mini, but still nearly 3x the cost.
- gpt-4o-mini: Initially great (when chat mode was simpler), but as complexity increased I now have to nudge it to generate commands and remind it to use conversation history. It struggles with both fluidity and reliability.
- gpt-4.1-nano: Not usable for our use case. It refuses to generate commands at all, responding with chatty filler like: “That sounds exciting! While I can’t create a packing list specifically for the Bahamas, I can help you prepare a packing list based on typical items for a beach vacation. Would you like me to generate a packing list for your Bahamas trip?” When I said “Yes, go ahead,” it treated that as another conversational reply — and did nothing.
- gpt-3.5-turbo: We started out with this (in the last post), but upgraded due to cost. Also, it was not great at resolving pronouns / implicit references.
The State of Play
I’m not happy with where things stand. We started with gpt-3.5-turbo, moved to gpt-4o-mini (which initially worked well), but as features grew more complex, only gpt-4.1 has delivered consistent results. The downside is cost.
The reality is this will require much more testing of prompts and models to find a balance between consistency and cost-effectiveness. Right now, structured outputs add complexity without enough payoff, and cheaper models aren’t keeping up with the demands of chat mode.
What’s Still Missing?
There’s plenty we haven’t tackled yet! A few of the bigger gaps:
- Logging: capturing all requests and responses, along with debug logging, so we can actually trace and troubleshoot behavior.
- Testing and Evaluations: adding a robust regression test suite for features and prompt (and model) evaluations.
- Guardrails: sanitizing input to handle sensitive information (PII), profanity, or irrelevant queries before they reach the LLM.
- User feedback: collecting signals on which responses are useful and which aren’t is key if we want to improve quality over time.
These are important building blocks for a production-ready app, but outside the scope of this post.
Lessons from Building This Feature
- Conversation history adds a lot of value – Even a small slice of context makes the assistant feel like it is listening, not just executing. Suddenly the assistant can resolve pronouns and follow up on past requests.
- Chat mode can be very powerful – Once we added “chat mode”, the assistant can do more than manage todos. It can brainstorm, suggest new items, even answer questions that indirectly turn into todos. It also opens the door to future integrations, like calendars, reminders, and external data sources.
- Formatting data for LLMs is now a core pattern – Just as we format for UI, storage, or APIs, we now format specifically for prompt context. String interpolation turns out to be one of the simplest and most useful tools here.
- Models can behave very differently – Even after reading the descriptions of each model, it’s not clear to me how a model will behave. Deviations from expected behavior feel unpredictable.
- Structured outputs come with tradeoffs – JSON Schema gives us type safety, but quirks in OpenAI’s support mean flattening schemas and converting optional fields to nullable ones. It also appears to slow down calls to the LLM, and it doesn’t play well when you sometimes want purely conversational responses.
- Prompt inflation is real – As prompts grow, creating reusable prompt templates and builders will become essential to keep things manageable. Here is a recommendation by Claude Code.
- Prompt caching depends on structure – Reordering prompts so static content comes first enables OpenAI’s prompt caching.
In the next installment, we’ll shift gears and examine the UX trade-offs between chat-based and graphical interfaces.