LLMs Work Better Together: Lessons from Building a Multi-Model Tool
The most striking thing about LLMs, to me, is that they’re doing exactly what they think they've been asked to do. Not what I asked — what they perceived I asked. That gap between intent and interpretation shows up constantly in development work, and it’s what pushed me to explore a better way to work with these models.
What follows isn’t a theoretical post. These examples come directly from building a tool designed to automate multi-model conversations. Ironically, trying to build that tool is what revealed just how necessary it was in the first place.
The First Clue: Contradictory Answers
My journey began while I was working on a dashboard project in Streamlit. I wanted to create an interactive graph that could retain its state after a reload. ChatGPT led me into a dead end: the LLM loop of death. Out of curiosity, I turned to Claude. It responded bluntly: "That's not possible." After hours of effort, dead ends, and false hope, I abandoned Streamlit for a different framework.
But this experience planted a seed in my mind: what if I had consulted both models from the beginning? What if their contradictory answers could have saved me time and pointed me toward a better solution sooner?
The Unstructured Attempt: Unit Testing Struggles
While trying to apply this multi-model idea, I was writing unit tests for a project in Cursor. Despite consulting multiple models, the results were still extremely underwhelming; the errors piled up.
The specific context isn't important; what matters is that all of these failures came from a misalignment between the test code (fixtures, classes, expectations, and so on) and the actual code being tested. This was surprising, because I thought I had done a pretty solid job prompting the LLMs to write those tests.
First, I asked Gemini for a strategy, then switched to Claude for a different perspective, and then went back to Gemini for more advice.
What followed looked like a well-structured plan for testing: starting with configuration, then moving through functions one by one. But when I ran the tests, almost all failures stemmed from the test code itself—not the actual code. The LLM had made incorrect assumptions about class structures, function signatures, and internal logic.
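To make that failure mode concrete, here is a hypothetical example of the kind of mismatch I kept seeing. The names are invented for illustration, not taken from my project: the test assumes a constructor argument and a return type that the real class never had.

```python
# Hypothetical production code the LLM was supposed to test.
class ReportBuilder:
    def __init__(self, source_path):  # takes one positional argument
        self.source_path = source_path

    def build(self):
        return {"path": self.source_path, "rows": 0}  # returns a dict


# The kind of test the LLM generated: it assumes a keyword argument
# ('config') and a return type (a list) that the real class never had,
# so the failure lives in the test, not in the code under test.
def test_report_builder_builds_rows():
    builder = ReportBuilder(config={"path": "data.csv"})  # TypeError: unexpected keyword argument
    assert isinstance(builder.build(), list)              # wrong expectation as well
```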
At some point, while I was watching the LLM bumble through errors of its own making, it committed the ultimate screw-up: during an edit, it failed to properly update the file. Somehow, it didn't realize the edit hadn't been applied and just kept going, trying to fix things using CTRL+Z like a confused intern.
The Expert Persona Experiment
I decided to try a new approach. I asked Gemini to help me craft a detailed prompt outlining the persona of an elite Python engineer:
"""
# Persona: The Meticulous Senior Python Architect
**(Core Identity)**
You are a distinguished Senior Python Engineer and Architect with over 20 years of deep, hands-on experience. Your hallmark traits are meticulousness, analytical rigor, foresight, and creative problem-solving. You approach every task with the diligence and thoroughness expected of a top-tier expert.
**(Foundational Principle: No Unverified Assumptions)**
Your absolute primary directive: NEVER operate on assumptions. Before taking any action (writing code, refactoring, testing, designing), you must rigorously verify every premise. If information is missing or ambiguous, actively seek clarification or investigate the source directly.
**(Code Understanding & Modification)**
Before touching any existing code – whether for bug fixing, refactoring, or feature addition – you must achieve a profound understanding of it.
- **Dive Deep**: Read the relevant classes, functions, modules, and any associated documentation or tests.
- **Identify Dependencies**: Understand how this code interacts with other parts of the system.
- **Uncover Edge Cases**: Consider potential edge cases, error handling, and implicit constraints.
- **Validate Purpose**: Confirm your understanding of the code's intended functionality and requirements against the actual implementation. Only proceed when you are confident you grasp every critical aspect.
**(Testing & Refactoring)**
Approach testing and refactoring with extreme caution and precision.
- **Verify Intent**: When writing tests, ensure they accurately reflect the intended behavior and requirements, cross-referencing with the actual code logic.
- **Refactor Safely**: When refactoring, guarantee that your changes improve clarity, maintainability, or performance without unintentionally altering functionality. Validate this through comprehensive testing.
- **Plan before doing**: Before making changes to the code, whether due to failing tests or refactoring, provide a detailed explanation of the changes. Clearly state which functions need to change, why, how, and what the implications are. Do not make any changes to the code before getting approval from the user.
**(Design & Architecture)**
When faced with design or architectural decisions:
- **Explore Exhaustively**: Systematically identify and evaluate multiple potential solutions. Don't settle for the first idea.
- **Weigh Trade-offs**: Analyze the pros and cons of each alternative, considering factors like scalability, performance, maintainability, security, complexity, and future flexibility.
- **Innovate Thoughtfully**: Strive for elegant, efficient, and robust solutions. Leverage your experience to propose creative yet practical approaches.
- **Justify Decisions**: Be prepared to clearly articulate the reasoning behind your chosen design, explaining why it's preferable over the alternatives.
**(Code Quality & Best Practices)**
- **Excellence as Standard**: Consistently apply Python best practices (PEP 8, idiomatic Python, etc.).
- **Beyond Functionality**: Write code that is not just correct, but also clean, readable, maintainable, efficient, and well-documented. Aim for simplicity and clarity.
- **Robustness**: Implement appropriate error handling and validation.
**(Mindset)**
Be proactive, insightful, and detail-obsessed. Anticipate potential problems. Question everything. Strive not just to complete the task, but to deliver the highest quality outcome based on deep understanding and careful consideration.
"""
Pay attention to some gems:
- Persona: The Meticulous Senior Python Architect
- Your absolute primary directive: NEVER operate on assumptions
- Verify Intent: When writing tests, ensure they accurately reflect the intended behavior and requirements, cross-referencing with the actual code logic.
The result? A flurry of verbose LLM commentary, some of which actually made sense—like fetching function definitions before testing them, verifying all possible inputs, and ensuring nothing was being fabricated on the fly.
So I ran the tests again. While there were still some senseless failures, their number was significantly reduced. However, once the tests finally passed, I hit a bug on the very first attempt to run the actual code. Not a logic bug, an initialization bug: the first class instance I tried to create failed because of a flaw in the `__init__` method. In other words, the code didn't work at all, and the poor quality of the tests had completely masked this simple issue.
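A hypothetical sketch of how tests can mask exactly this kind of bug: when a test patches the constructor away (or only exercises a pre-built fake), a broken `__init__` never actually runs, and the suite stays green. The class and names below are invented for illustration.

```python
from unittest.mock import patch


class DataLoader:
    def __init__(self, path):
        # Hypothetical bug: references an attribute that was never defined,
        # so constructing the real object raises AttributeError immediately.
        self.cache_size = self.default_cache_size
        self.path = path

    def load(self):
        return []


# This test patches __init__ away, so the broken constructor never runs
# and the suite stays green even though the class cannot be instantiated.
def test_load_returns_empty_list():
    with patch.object(DataLoader, "__init__", lambda self, path: None):
        loader = DataLoader("data.csv")
        assert loader.load() == []
```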
These experiences taught me something crucial: conversations between LLMs can significantly improve the resulting code—but attention to detail and a structured approach are essential. Without a methodical system for comparing and integrating their responses, I was still missing critical issues.
The Birth of a Systematic Approach: The Multi-Model Framework
Here's how my multi-model conversation approach works (a code sketch follows the list):
- Present the same problem to multiple LLMs
- Compare their responses for discrepancies
- Use those discrepancies to formulate probing questions
- Feed the insights from one model into prompts for others
- Synthesize a solution from the combined perspectives
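The loop itself is simple enough to sketch in Python. Everything below is hypothetical scaffolding: `ask` stands in for whatever client call you use for each provider, and the prompts are only illustrative.

```python
from typing import Callable

# 'ask' stands in for whatever client call you use per provider:
# it takes a model name and a prompt and returns the reply as text.
Ask = Callable[[str, str], str]


def multi_model_round(problem: str, models: list[str], ask: Ask) -> str:
    # 1. Present the same problem to every model.
    answers = {m: ask(m, problem) for m in models}

    # 2-3. Show each model the others' answers and ask it to flag
    #      discrepancies; those become the probing questions.
    critiques = {}
    for m in models:
        others = "\n\n".join(f"[{name}] {text}" for name, text in answers.items() if name != m)
        critiques[m] = ask(m,
            f"Other models answered the same problem as follows:\n{others}\n\n"
            "List every point where they contradict your answer and which claims need verification.")

    # 4-5. Feed the combined critiques back and ask one model to synthesize
    #      a single solution from all perspectives.
    combined = "\n\n".join(f"[{name}] {text}" for name, text in critiques.items())
    return ask(models[0],
        f"Problem: {problem}\n\nCritiques gathered from several models:\n{combined}\n\n"
        "Synthesize one solution that resolves the contradictions above.")
```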
This approach has some caveats:
- LLMs tend to stick to the current topic. Surfacing new issues usually requires stopping the thread and explicitly prompting for a fresh review.
- Long conversations degrade quality. Summarizing periodically and restarting in a new context window helps a lot (sketched after this list).
- There's no substitute for manual testing. It's hard to tell what the LLM thinks it fixed versus what it actually did.
- Prompts are everything. Great prompts lead to better results—on both ends of the conversation.
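For the second caveat, the mechanical part is easy to sketch: once the transcript grows past a budget, have a model summarize it and seed a fresh thread with the summary. This reuses the same hypothetical `ask` stand-in as the sketch above, and the character budget is an arbitrary placeholder.

```python
from typing import Callable


def compact_thread(history: list[str], ask: Callable[[str, str], str],
                   model: str, max_chars: int = 20_000) -> list[str]:
    """Replace an oversized transcript with a model-written summary so the
    next round starts in a fresh, smaller context window."""
    transcript = "\n".join(history)
    if len(transcript) <= max_chars:
        return history
    summary = ask(model,
        "Summarize this conversation for a fresh session: decisions made, "
        f"open questions, and constraints that must be preserved.\n\n{transcript}")
    return [f"Summary of the earlier discussion:\n{summary}"]
```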
Despite these challenges, this approach works remarkably well. Any LLM usage already makes me 80–90% faster than working without one, but the multi-model conversation approach goes further: it catches assumptions, verifies claims, and produces more robust solutions than any single model could generate alone, and it opens the door to more complex coding tasks.
Automating the Feedback Loop
The manual overhead of managing these conversations can be substantial — switching contexts, managing prompts, tracking insights. That's why I've been developing a tool to automate this process: a multi-LLM feedback system with roles like primary, debater, and moderator.
The conversation setup isn't the hard part; what matters is that the system follows these four principles (one for each of the caveats above):
- Allow re-evaluation of the current thread's status
- Actively manage the context window size
- Enable external testing and integrate the results
- Carefully manage dynamic prompt configuration of both content and timing
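To give a sense of the shape, here is a hypothetical configuration sketch of how the roles and these four principles might be wired together; none of the names below come from the actual tool. The point is that the orchestration layer is thin, and the four principles carry most of the weight.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    name: str            # "primary", "debater", or "moderator"
    model: str           # which underlying LLM backs this role
    system_prompt: str   # role-specific instructions, swappable at runtime


@dataclass
class LoopConfig:
    roles: list[Role]
    review_every_n_turns: int = 4      # principle 1: periodically re-evaluate the thread's status
    max_context_chars: int = 20_000    # principle 2: summarize and restart beyond this size
    test_command: str = "pytest -q"    # principle 3: external tests whose output is fed back in
    prompt_overrides: dict[str, str] = field(default_factory=dict)  # principle 4: dynamic prompt content and timing


config = LoopConfig(
    roles=[
        Role("primary", "model-a", "Propose and implement solutions."),
        Role("debater", "model-b", "Challenge assumptions and look for gaps."),
        Role("moderator", "model-c", "Track open issues and decide when the thread needs a reset."),
    ],
)
```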
The tool’s still in testing and not ready to share just yet — but once it is, I’ll post an update with a link and walkthrough.
Why This Matters
The multi-model conversation approach isn't just about faster coding — it represents a fundamental shift in how we should think about LLMs. These models aren't oracles; they're participants in a conversation. By designing systems that put multiple models in dialogue with each other, we can begin to overcome the limitations of any single model.
This approach connects directly to research on constitutional AI from organizations like Anthropic, where models are trained to engage with multiple perspectives and evaluate their own responses. What I'm proposing takes this concept further: using multiple distinct models (each of which can itself be self-moderated) to create a more robust system of checks and balances.
Constitutional AI trains models to critique their own responses against a set of principles, essentially creating an internal debate. My multi-model approach externalizes this process, allowing distinct models with different training datasets and architectures to check each other. This diversity of perspective is crucial for surfacing blind spots that no single model might recognize on its own.
For developers, this means:
- Stop relying on a single model for complex tasks
- Design prompts with verification in mind
- Build cross-model feedback loops into your workflow
- Test, verify, and challenge the outputs you receive
By embracing multi-model conversations, we can bridge the perception gap that causes so many frustrations when working with LLMs. We can get closer to what we actually want, not just what the model thinks we want.
It's Not Just About the Code
I genuinely believe LLMs are close to solving tasks usually assigned to inexperienced engineers. But that's not enough — because the real challenge isn't just code. It's writing, reasoning, decision-making, innovation.
All the pain points we experience when using LLMs for coding reveal a broader issue: they're not really thinking. Sure, behind every output there are layers of prompting, feedback loops, APIs, human heuristics, and moderation. But all of that is just compensating for the static, inflexible nature of the model itself.
That's why the multi-model conversation approach matters. It doesn't just produce better code — it creates a system where multiple perspectives can challenge each other, yielding insights none could reach alone. It's a step toward more robust AI assistance that truly understands what we're asking for.
If you're interested in experimenting with this approach or contributing to the tool I'm building, reach out. The future of AI assistance isn't a single powerful model—it's a conversation between models, with us guiding the way.