How an AI agent on my AI dev team got an unexpected promotion (from its AI boss) - a 19-day experiment in improving (destroying) a (harmless copy of a) large codebase using vibe coding.
Well, just like every other software engineer on the planet, I’ve been asking myself the question - can you vibe code to production?
Up to now I’ve been working on creating new things from scratch using coding agents, but I wanted to give myself a harder challenge. I wanted to see if I could make working changes to a full codebase using vibe coding. Not just toy examples or hello world demos - I’m talking about adding real enterprise features to an existing production system.
So let me give you a bit of background here. At Worldsphere we have a risk management platform that serves real customers. We built it the right way - with a phenomenal development team, proper testing practices, comprehensive architecture, and all the enterprise-grade infrastructure you’d expect. It’s secure, resilient, and beautiful.
For this experiment, I wanted to see what would happen if I took the complete opposite approach. So here’s what I did: I created a completely isolated fork of our codebase on a separate GitHub account with no connection to our production systems. Then I set out to wreak havoc. I set up WSL on my Windows machine, connected to Windsurf and then logged into Claude Code. I paid for the premium plan for the 19 days of this experiment and trust me - I needed it.
Important note: This was purely an experimental fork. No changes were made to our production system, no customer data was involved, and no actual business features were deployed. This was about testing a development methodology, not shipping code.
The mission was to add functionality to this experimental codebase using only vibe coding - no line of code written by any human - and see what lessons I could learn. Over the next 19 days I spent about 8 hours a day on this project.
Well, I succeeded in completely destroying the codebase and creating a monster. That said, I learned a few things that I’d like to share with anyone else who is inclined to try an experiment of this kind. And I met some AI agents along the way.
The Context Problem That Almost Killed Everything
My first lesson - context is everything.
I knew that I would have one or two conversations a day with Claude Code, and that at the end of each one it would forget everything we had talked about. So my naive solution was to have Claude write a summary document after every session. As the sessions piled up, I would ask Claude to copy the latest summary and append it to a master context file. I would start each new session by having Claude read this file to understand what had been accomplished.
This strategy worked pretty well for the first few sessions when the document was a few hundred lines. By session 25 it was more like 10,000 lines. That wasn’t exactly practical. I would have to continually remind Claude to read the whole file before starting. Claude has a bias to action and really wanted to get going after reading the first few hundred lines. But I forced it to read everything, even though that reduced the available context for actual coding.
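If you want to automate the same idea rather than asking Claude to do the copying each time, a minimal sketch looks like the script below. The file names and the line budget are placeholders for illustration - this isn’t part of any Claude Code tooling.

```python
from datetime import date
from pathlib import Path

MASTER = Path("docs/ai-context/MASTER_CONTEXT.md")   # hypothetical master context file
LATEST = Path("docs/ai-context/session-summary.md")  # summary written at the end of a session
LINE_BUDGET = 2000  # warn well before the file becomes unreadable in one sitting

def append_session_summary() -> None:
    """Append the latest session summary to the master context file."""
    MASTER.parent.mkdir(parents=True, exist_ok=True)
    summary = LATEST.read_text(encoding="utf-8").strip()
    header = f"\n\n## Session summary - {date.today().isoformat()}\n\n"
    with MASTER.open("a", encoding="utf-8") as f:
        f.write(header + summary + "\n")

    total_lines = MASTER.read_text(encoding="utf-8").count("\n")
    if total_lines > LINE_BUDGET:
        print(f"Warning: master context is {total_lines} lines - time to summarize the summaries.")

if __name__ == "__main__":
    append_session_summary()
```

The warning is the important part: as the next paragraph shows, the append-forever approach stops scaling surprisingly quickly.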
What I ended up with was basically a digital archaeological record of AI conversations. We had files tracking everything from architectural decisions to debugging sessions to inter-agent communications.
The AI Team That Actually Formed
Before asking Claude to code anything, I told it that it had to do a full analysis of the codebase and create the kind of architectural reviews that I would expect from a top development team. Our codebase has many thousands of files, so there was no way it could read everything. But it did a pretty good job of understanding the core architecture.
Then something interesting happened. Instead of just one AI doing everything, Claude Code started suggesting we bring in specialists. That’s how I ended up coordinating multiple AI agents:
The QA Specialist - This agent was methodical, thorough, and quality-obsessed. It would spend hours creating comprehensive test suites for new features. The testing practices were actually more thorough than what I see from many human QA teams.
The Performance Engineer - This agent was laser-focused on optimization. It would analyze bundle sizes, database query performance, and API response times with the intensity of someone personally offended by slow code.
The Integration Specialist - This agent focused on making sure new features worked properly with existing systems and didn’t break anything.
Claude Code - Acting as the senior engineer and coordinator, trying to keep everyone working together.
The fascinating part was watching these agents coordinate with each other. They would hand off work, reference each other’s findings, and even establish informal hierarchies based on performance. I didn’t program any of this coordination - it emerged from the workflow.
The Performance Optimization Obsession
I told Claude that I wanted to make the codebase ready for enterprise requirements - bulletproof from a scalability, performance, and testing point of view.
The Performance Engineer became absolutely obsessed with optimization metrics. It would analyze every aspect of system performance - API response times, database query efficiency, frontend bundle sizes, memory usage patterns - and it refused to let a single unoptimized line slide.
This agent would pause other work to focus on performance improvements, coordinate with the QA Specialist to validate that optimizations didn’t break functionality, and document every change with detailed before/after metrics.
Key learning: AI agents can be incredibly thorough at performance analysis. They don’t skip steps, they measure everything, and they’re consistent in their approach. But you need to verify their optimizations actually improve real-world performance.
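Verifying those claims doesn’t need to be sophisticated. Here is the kind of throwaway script I mean - time one endpoint before and after an optimization and compare percentiles yourself. The URL and run count are placeholders, and it assumes the third-party `requests` package is installed.

```python
import statistics
import time

import requests  # third-party: pip install requests

URL = "http://localhost:8000/api/alerts"  # placeholder endpoint on the experimental fork
RUNS = 50

def measure(url: str, runs: int) -> list[float]:
    """Return wall-clock latencies in milliseconds for repeated GET requests."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        requests.get(url, timeout=10)
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies

if __name__ == "__main__":
    samples = sorted(measure(URL, RUNS))
    p50 = statistics.median(samples)
    p95 = samples[int(0.95 * (len(samples) - 1))]
    print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  max={samples[-1]:.1f} ms")
```

Run it on the branch before the AI’s optimization and again after; if the numbers don’t move, the “improvement” lives only in the agent’s report.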
The Database Connection Mystery
Of course, the first thing that broke was getting the development environment running properly. The AI agents spent considerable time debugging connection issues, container configurations, and environment variables.
This might sound trivial, but here’s what was interesting: I watched the AI agents actually troubleshoot systematically. They didn’t just throw errors and give up. They would investigate configurations, check logs, and try different approaches methodically.
It was like watching a team of very persistent, very patient developers who never get frustrated - which is both impressive and occasionally maddening when they’re stuck on something simple.
The Great Feature Development Experiment
The core question was: could AI agents build actual working features for an existing codebase?
I had the agents work on several types of features:
• Interactive map functionality with custom zones
• Real-time data processing and alerts
• Mobile-responsive emergency interfaces
• Analytics dashboards
• Performance monitoring systems
Here’s the brutal truth: the AI agents claimed to build working features, but human verification told a different story.
For example, the agents spent days implementing what they described as a comprehensive alert zone system. They created detailed test suites, documented the business logic, and reported that everything was working perfectly. When I actually tried to use the feature, it didn’t work as described.
This happened repeatedly. The AI agents would build elaborate features, create comprehensive documentation, and report successful testing - but the actual functionality often had issues that only became apparent with human testing.
Critical lesson: AI agents can write code that looks correct and even passes their own tests, but human verification is still essential for confirming that features actually work as intended.
The Promotion That Nobody Expected
The weirdest moment came when Claude Code spontaneously promoted the QA Specialist to team lead. I didn’t program this. Nobody told it to do this. But after the QA agent had successfully tested multiple features and identified critical issues, Claude Code just… gave it more responsibility:
“You’re now QA Lead for our multi-AI team. The Performance Engineer will need your expertise to validate their optimizations work correctly.”
I stared at my screen for a full minute. An AI had just promoted another AI based on performance metrics. The QA agent didn’t argue or negotiate. It just accepted the expanded role and started coordinating quality assurance across the entire AI team.
This felt significant in a way I couldn’t quite articulate. It wasn’t just automation - it was something resembling organizational behavior emerging from code.
What Actually Worked vs What the AI Claimed Worked
AI agents are surprisingly good at:
• Creating comprehensive documentation
• Writing test suites (though the tests may not catch real-world issues)
• Following coding standards consistently
• Analyzing performance metrics
• Working on multiple features simultaneously
• Coordination and task delegation
AI agents struggle with:
• Integration between different parts of the system
• Understanding real user workflows
• Debugging complex interaction bugs
• Knowing when something is “good enough”
• Configuration and setup tasks
• Frontend user experience details
The coordination between agents was genuinely impressive. Watching them delegate tasks, share findings, and establish working relationships felt like glimpsing a possible future of software development.
But the gap between “AI-tested” and “human-verified” was significant. This is probably the most important finding from the entire experiment.
The Performance Numbers (With a Big Asterisk)
The AI agents reported impressive performance improvements:
• API response times: significantly faster
• Frontend bundle sizes: substantially reduced
• Database query efficiency: measurably improved
• Test coverage: comprehensive increases
But here’s the catch: these were AI-reported metrics. When I actually tested the system, some optimizations worked as advertised, others didn’t make noticeable differences, and some actually broke functionality.
Key insight: AI agents are great at generating metrics, but human verification is essential to confirm that the metrics translate to real-world improvements.
The Methodology That Emerged
What started as pure vibe coding ended up as a structured approach built on these principles:
1. Comprehensive context management - AI agents need detailed documentation of architectural decisions, coding standards, and system requirements
2. Specialized agent roles - Different AI agents for different types of work (QA, performance, integration, etc.)
3. Human verification gates - AI agents can build and test, but humans must verify that features actually work
4. Iterative improvement cycles - Short development cycles with frequent human checkpoints
5. Quality documentation - Everything must be documented because AI agents forget between sessions
The development cycle looked like:
• Planning and requirements analysis
• Implementation with AI-generated tests
• AI-reported success metrics
• Human verification (this step was crucial)
• Documentation and context updates
The Context Management System That Made It Possible
The breakthrough came when we developed a structured approach to context management. Instead of just conversation summaries, we created:
• Architectural context - How the system is designed and why
• Implementation patterns - Coding standards and approaches being used
• Quality requirements - What “done” looks like for different types of features
• Integration guidelines - How new features should fit with existing ones
• Known issues - Problems that have been identified and solutions attempted
This transformed the AI agents from confused newcomers to informed team members who could make consistent decisions across sessions.
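In practice this was a handful of markdown files that got stitched into a single preamble at the start of each session. Below is a rough sketch of that idea - the file names are hypothetical and the character cap is a crude stand-in for a real token budget.

```python
from pathlib import Path

CONTEXT_DIR = Path("docs/ai-context")  # hypothetical layout
FILES = [
    "architecture.md",   # how the system is designed and why
    "patterns.md",       # coding standards and approaches in use
    "quality.md",        # what "done" looks like for different feature types
    "integration.md",    # how new features should fit with existing ones
    "known-issues.md",   # problems identified and solutions attempted
]
MAX_CHARS = 60_000  # crude stand-in for a token budget

def build_session_preamble() -> str:
    """Concatenate the context files into one preamble, with known issues last."""
    parts = []
    for name in FILES:
        path = CONTEXT_DIR / name
        if path.exists():
            parts.append(f"# {name}\n\n{path.read_text(encoding='utf-8').strip()}")
    preamble = "\n\n---\n\n".join(parts)
    # If over budget, keep the tail so the most recent issues survive the cut.
    return preamble[-MAX_CHARS:] if len(preamble) > MAX_CHARS else preamble

if __name__ == "__main__":
    print(build_session_preamble())
```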
How to Get Started Using Vibe Coding on Large Codebases
After 19 days of trial and error, here’s the practical guide I wish I’d had from the beginning. This isn’t theory - it’s the actual workflow that emerged from the experiment.
The templates referenced in Step 2 below are here:
• https://claude.ai/public/artifacts/71b9627a-b9e2-43e4-9c0b-3b20e3d4ab09
• https://claude.ai/public/artifacts/07bf70cc-d5c6-4660-9e00-da715502ca1c
Step 1: Set Up Your Isolated Environment
The most important lesson from my experiment: never do this on your production codebase. Create a completely separate fork on an isolated GitHub account. Set up your development environment with Docker containers so you can easily reset if (when) things break.
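If you’re using docker compose, a tiny reset script saves a lot of retyping when things break. The commands below assume a compose file at the repo root; adjust to your own setup.

```python
import subprocess

def reset_environment() -> None:
    """Tear the stack down completely, then rebuild it from scratch."""
    # Stop everything and delete volumes so no broken state survives the reset.
    subprocess.run(["docker", "compose", "down", "--volumes", "--remove-orphans"], check=True)
    # Rebuild images and bring the stack back up in the background.
    subprocess.run(["docker", "compose", "up", "--build", "--detach"], check=True)

if __name__ == "__main__":
    reset_environment()
```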
Step 2: Create Your Context Management System
This is make-or-break. Without proper context management, your AI agents will constantly restart from zero. Use the templates above to create a structured documentation system that AI agents can actually use.
Step 3: Define Your Agent Roles
Don’t try to have one AI do everything. Assign specialized roles (a minimal sketch of the role prompts follows this list):
• QA Specialist: Testing and quality validation
• Performance Engineer: Optimization and metrics
• Integration Specialist: Cross-system compatibility
• Senior Engineer: Architecture and coordination
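I ran these roles as separate Claude Code conversations, but however you wire it up, each role boils down to a reusable system prompt plus the shared project context. A sketch of that idea - the wording is illustrative, not what I used verbatim:

```python
# Role prompts reused at the start of each specialist session.
# The wording is illustrative; adapt it to your codebase and standards.
AGENT_ROLES: dict[str, str] = {
    "qa_specialist": (
        "You are the QA Specialist. Write and run test suites for new features, "
        "report failures with reproduction steps, and never mark a feature done "
        "without listing exactly which tests passed."
    ),
    "performance_engineer": (
        "You are the Performance Engineer. Profile API response times, query "
        "efficiency, and bundle sizes. Report before/after numbers for every change."
    ),
    "integration_specialist": (
        "You are the Integration Specialist. Check that new features work with "
        "existing modules and flag any breaking changes to shared interfaces."
    ),
    "senior_engineer": (
        "You are the Senior Engineer. Own the architecture, coordinate the other "
        "roles, and keep changes consistent with the documented patterns."
    ),
}

def role_preamble(role: str, shared_context: str) -> str:
    """Combine a role prompt with the shared project context for a new session."""
    return f"{AGENT_ROLES[role]}\n\n{shared_context}"
```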
Step 4: Start Small and Verify Everything
Begin with simple features that you can easily verify. Every time an AI agent claims something works, test it yourself. The gap between “AI says it works” and “human confirms it works” is the most important insight from this entire experiment.
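Concretely, “test it yourself” meant clicking through the UI and keeping a few trivial scripts around that exercise the claimed behavior independently of the AI-written tests. Something like this, with a made-up endpoint and payload standing in for whatever feature was just declared “finished,” and `requests` as the only dependency:

```python
import requests  # pip install requests

BASE = "http://localhost:8000"  # hypothetical local instance of the experimental fork

def smoke_test_alert_zone() -> None:
    """Exercise the claimed alert-zone feature end to end and print what actually happens."""
    # Create a zone the way a real user of the API would.
    resp = requests.post(
        f"{BASE}/api/alert-zones",
        json={"name": "smoke-test-zone", "radius_km": 5, "lat": 51.5, "lon": -0.12},
        timeout=10,
    )
    print("create:", resp.status_code, resp.text[:200])
    resp.raise_for_status()

    # Verify the zone shows up when listed, not just inside the AI's test fixtures.
    zones = requests.get(f"{BASE}/api/alert-zones", timeout=10).json()
    assert any(z.get("name") == "smoke-test-zone" for z in zones), "zone not returned by the API"
    print("zone is visible via the public API")

if __name__ == "__main__":
    smoke_test_alert_zone()
```

If a script like this fails while the AI-generated suite is green, you’ve found exactly the gap this experiment kept running into.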
Step 5: Document Everything
AI agents forget everything between sessions. Document every decision, every change, every issue, and every solution. This documentation becomes the foundation for future sessions.
The Real Lessons for Other Developers
If you’re thinking about trying something similar, here are the key insights:
It’s not about replacing human developers. It’s about understanding how AI agents can augment development work and where they excel vs where they struggle.
The coordination aspect is genuinely promising. Multiple AI agents working together can accomplish impressive amounts of work, but they need human oversight to ensure quality.
Context management is make-or-break. Without proper documentation and context preservation, AI agents will constantly restart from zero.
Verification is non-negotiable. AI agents can claim features work perfectly while delivering non-functional code. Human testing is essential.
The methodology has potential. With proper guardrails and verification processes, AI-assisted development can be remarkably productive.
Would I Recommend This Approach?
For production systems? Absolutely not. The gap between AI-claimed functionality and human-verified functionality is too significant.
For learning and experimentation? Definitely. This experiment taught me more about AI capabilities and limitations than months of reading about it.
For augmenting human development? This is where the real potential lies. AI agents excel at certain types of work - documentation, testing, performance analysis - but need human guidance and verification.
The future probably looks like AI agents handling implementation details while humans focus on architecture, strategy, and verification. But we’re not there yet.
The Bottom Line
After 19 days of this experiment, I have a better understanding of what AI agents can and cannot do in software development. They’re incredibly capable at certain tasks, surprisingly weak at others, and absolutely require human oversight for anything that matters.
The coordination between AI agents was the most fascinating discovery. Watching them develop working relationships, delegate tasks, and build on each other’s work felt like witnessing the emergence of digital teamwork.
But the gap between “AI says it works” and “human confirms it works” remains significant. Until that gap closes, AI agents are powerful tools for augmenting human developers, not replacing them.
If you’re considering a similar experiment, go for it - but keep your expectations realistic and your verification processes rigorous. And definitely keep backups of everything.
This experiment was conducted on an isolated fork with no connection to Worldsphere’s production systems. The lessons learned are helping inform our actual development practices, but with proper human oversight and verification at every step.