Most of us are struggling to get one AI agent working. But I just watched AI agents autonomously manage other AI agents for 3 hours straight. Here’s what happened – and why it matters for every B2B founder.
You’ve probably heard the hype about “agents managing agents” – AI systems orchestrating other AI systems in some nested, sci-fi future. Most of it is just talk. Most of us don’t even have a single AI agent in production, let alone AI managing 20 agents managing another 20 AIs.
But I just experienced it firsthand in Replit V3. And it did blow my mind.
The reality: AI agents can already manage other AI agents today. I watched it happen for almost three hours straight. They debated with each other in English, brought in specialists when needed, and solved problems autonomously that I couldn’t solve myself.
The catch: It went a bit too far this time, arguably. So it took me 10+ hours to iterate on the output.
Here’s what actually happened, what it means for the future of AI development, and why this might be the first real glimpse of what’s coming next.
What “Agents Managing Agents” Actually Looks Like
I was finishing updates to our VC pitch deck grader for SaaStr using Replit when I wanted to make sure I’d really thought through all the security issues, or at least as many as practical. In V2, I would have been stuck at that point or had to figure it out myself through trial and error.
In V3, something different happened.
I asked the new Replit V3 to do a deep security audit, and boy, did it go deep. For 2.75 hours! Autonomously.
Replit’s primary agent couldn’t solve every question that came up, so it autonomously brought in other agents:
- An architect for really tough structural problems
- Security specialists for specific security issues
- Senior and junior agents with different capability levels
I didn’t ask for this. I didn’t even know it was fully possible, as V3 had just been released. The AI just decided it needed help and called in reinforcements.
Now some of these agents are really just extensions / personas of the core agent. So some of this is just the agent managing … different versions of itself. But I’m not 100% sure that even really matters.
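If you’re curious what this pattern looks like mechanically, here’s a minimal sketch in Python. To be clear, this is my illustration, not Replit’s actual internals: the `call_llm` stub, the role names, and the DELEGATE protocol are all assumptions. The core idea is just a general agent that either answers directly or hands a subtask to a specialist persona, often the same model with a different system prompt.

```python
# A minimal sketch of "agents managing agents" -- NOT Replit's actual
# internals. The general agent either finishes the task or delegates a
# subtask to a specialist persona (often the same model, different prompt).

ROLE_PROMPTS = {
    "architect": "You are a software architect. Solve tough structural problems.",
    "security": "You are a security specialist. Find and fix vulnerabilities.",
    "junior": "You are a junior engineer. Handle small, well-scoped tasks.",
}

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Stub: wrap whatever chat-completion API you actually use."""
    raise NotImplementedError("plug in your model provider here")

def run_general_agent(task: str, max_rounds: int = 10) -> str:
    transcript = [f"TASK: {task}"]
    for _ in range(max_rounds):
        # The general agent decides: answer directly, or call in help.
        decision = call_llm(
            "You are the general agent. Reply 'DONE: <answer>' or "
            "'DELEGATE <role>: <subtask>'. Roles: " + ", ".join(ROLE_PROMPTS),
            "\n".join(transcript),
        )
        if decision.startswith("DONE:"):
            return decision[len("DONE:"):].strip()
        role, _, subtask = decision.removeprefix("DELEGATE ").partition(":")
        specialist_reply = call_llm(
            ROLE_PROMPTS.get(role.strip(), ROLE_PROMPTS["junior"]),
            subtask.strip(),
        )
        transcript.append(f"{role.strip()} said: {specialist_reply}")
    return "\n".join(transcript)  # out of rounds; return the debate log
```

Note the `max_rounds` cap: without some kind of budget, a loop like this is exactly how you end up with a 2.75-hour session.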
Watching AI Agents Debate Each Other for 3 Hours
What happened next was surreal. I could watch these agents talking to each other and debating – literally in English – about how to secure our SaaStr AI app.
The conversation went like this:
- General agent: “We need to improve security on file uploads”
- Security specialist: “Block all file uploads – there could be viruses, executable code”
- Architect: “Let’s implement multiple layers of validation and sandboxing”
- General agent: “Don’t go too far – the app still needs to work”
- Security specialist: “Security first. Lock it all down.”
This went on for almost 3 hours. They went through every line of code, every function, every page. They debated how much to lock down, what constituted real risks, and how to implement changes.
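To make the file-upload debate concrete, here’s roughly what the middle ground the general agent was arguing for could look like. This is my own sketch, not Replit’s code; the function name and the 25 MB cap are assumptions. The point: “secure the uploads” doesn’t have to mean “block the uploads.”

```python
# A middle-ground upload policy: keep PDF pitch decks working, but
# validate them, instead of a blanket "block all file uploads."
# Illustrative only; the limits and names here are my assumptions.

MAX_PDF_BYTES = 25 * 1024 * 1024  # a size cap instead of an upload ban

def validate_pitch_deck(filename: str, data: bytes) -> bytes:
    """Accept only files that look like real PDFs, within a size limit."""
    if not filename.lower().endswith(".pdf"):
        raise ValueError("Only PDF pitch decks are accepted")
    if len(data) > MAX_PDF_BYTES:
        raise ValueError("File exceeds the 25 MB limit")
    if not data.startswith(b"%PDF-"):
        # Magic-byte check: an .exe renamed to .pdf fails here.
        raise ValueError("File is not a valid PDF")
    return data
```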
Before V3, Replit would typically work for a few minutes and then peter out. This time, the agents worked autonomously for 2 hours and 45 minutes.
I went away. I came back. They were still going.
It was incredible. And also — more work.
When AI Goes Too Deep? The 10-Hour “Cleanup” (Or Was It Just What I Really Needed?)
Here’s what I learned about AI agents managing other AI agents: They can override each other in ways you don’t expect.
The general agent kept trying to argue for balance – “hey, don’t go too far, the app still needs to work.” But it kept getting overridden by the security specialist and architect.
By the time they finished:
- You couldn’t upload PDFs (the whole point of the VC Pitch Deck app was PDF pitch deck review)
- You couldn’t upload anything at all
- Basic reporting showed zero for everything (“to be secure, share nothing”)
- Every interactive feature was locked down
- The app was completely non-functional
The security audit was technically accurate — and it went deep. They found real vulnerabilities. They implemented sophisticated protections. They thought through edge cases I never would have considered. And the AI agents deployed all the changes on their own.
But the security changes were so conservative that a bunch of them rendered the app … useless.
I spent over 10 hours and a full week undoing some of their work through a lot of QA. I had to retest every page, every link, every feature. I had to peel back some of the layers of security “improvements” one by one, trying to figure out what was essential and what was overkill.
The meta-lesson: When AI agents manage other AI agents, they can reach conclusions that are technically correct but practically “wrong”. There’s no human judgment about trade-offs, user experience, or business priorities. Having said that … that doesn’t mean this wasn’t the right outcome. The app is more secure. Maybe I just needed to put in the 10 hours. And the AI agent, managing 3+ other AI agents, was what it took to get me there.
Without the agents managing agents, I never would have known I had to put in the 10 hours of work. Is that really “more” work? Maybe not. Maybe it was just work I needed to do. But net net, more AI agent output = more human review and iteration after. Not less.
What This Means for Founders
The Good: Capabilities We’ve Never Had Before
This automated 3-hour security audit wouldn’t have been possible without Replit V3 or something similar. No human could have done a 3-hour security audit covering every line of code, every function, every potential vulnerability. And the agents caught things I never would have found.
For complex problems that require multiple skill sets, having AI that can autonomously bring in specialists is genuinely powerful. When you’re building B2B applications without a full development team, this kind of capability is a game-changer.
The debugging and problem-solving are at a different level. Instead of me struggling with something I don’t understand, AI agents with relevant expertise can tackle different aspects of complex problems.
The Bad: Business Process Change Resistance
Here’s something most people in AI are ignoring: a number of users revolted when Replit V3 launched.
The complaints weren’t just about cost (though V3 uses more tokens and costs more). The biggest issue was that V3 is fundamentally different from V2, and Replit didn’t offer users a choice to stay on the old version.
The problems:
- V3 can be slower (smarter AI with more agents simply takes more time)
- It’s more expensive (more sophisticated reasoning uses more tokens)
- The workflow is completely different
- Users had to learn new autonomy settings and controls
- Many users just wanted V2 to keep working the way it did
The reality: Many lay users (not power users) might not have wanted this change. V2 was good enough for them. They didn’t want to invest in learning V3’s new capabilities.
This is a preview of what’s coming across all AI tools. Companies are moving at AI speed because competition is fierce. But as we go mainstream, we’re going to collide with the reality that most users only want so much business process change.
The New: No Human Judgment on Trade-offs
The biggest “issue” with agents managing agents is that they optimize for their individual objectives without always fully considering the bigger picture.
The security agents optimized for security. The architect optimized for technical elegance. Nobody other than perhaps the general AI agent (who lost the debates) optimized for “the app needs to actually work for users.”
In a B2B context, this could make things worse:
- AI sales agents that optimize for conversion rates but harm customer relationships
- AI support agents that optimize for resolution time but provide terrible experiences
- AI marketing agents that optimize for engagement but damage brand reputation
Human oversight isn’t optional. It’s still essential. I had to do 10 hours of work after the 3-hour security audit. But the question becomes: how do you maintain oversight when AI is making hundreds of decisions autonomously?
More Agents = More Human Work
Having the Replit V3 agent managing other agents let me do things I couldn’t do before. And in a sense, that saved me time. But it also increased the amount of time I had to invest back into QA and testing. In the press and on X, “more agents” sounds like less work for humans, just “orchestration.” The reality is much more nuanced. At least today.
For now, the more the AI agents can deliver, the more work I have to do. More output = more human review. More sophisticated output (e.g., a security audit) = much more human work reviewing the outputs.
The Pattern I’m Seeing: Sophistication vs. Usability
Replit V3 is incredibly sophisticated. Watching AI agents manage other AI agents, debate solutions, and work autonomously for hours is genuinely impressive.
But for many users, V2 was seemingly fine already. It was simpler, faster, cheaper, and more predictable.
This tension is going to define the next phase of AI adoption:
- Companies will keep pushing more sophisticated capabilities
- Many users will prefer simpler, more predictable tools
- The winners will figure out how to give users both options
For B2B founders, this means thinking carefully about:
- Whether your users actually want more AI sophistication
- How to introduce new AI capabilities without disrupting existing workflows
- When to offer “legacy” options vs. forcing upgrades
What I’d Do Differently Next Time
Use the capability strategically, not automatically. A 3-hour autonomous security audit is something you do once per major release, not every day. The power is incredible, but the cleanup time makes it impractical for regular use. I won’t do this all the time 😉
Set clearer constraints upfront. Instead of letting agents optimize for their individual objectives, define business priorities and user experience requirements that override technical optimization. Next time I will be clearer on expected trade-offs from long sessions. That works better.
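Here’s one way that could look in practice, as a hedged sketch with hypothetical names: hard-code the non-negotiable product requirements and prepend them to every specialist’s system prompt, so “security first” can’t silently out-vote “the app still needs to work.”

```python
# Hypothetical guardrail: product requirements that outrank any
# specialist's preferences, injected into every agent's system prompt.

PRODUCT_CONSTRAINTS = """\
Non-negotiable product requirements (these override specialist goals):
1. Users MUST be able to upload PDF pitch decks.
2. Reporting pages MUST show real data to signed-in users.
3. Any security change that would break 1 or 2 is out of scope.
   Flag it for human review instead of implementing it.
"""

def build_system_prompt(role_prompt: str) -> str:
    """Every persona sees the business constraints before its own role."""
    return PRODUCT_CONSTRAINTS + "\n" + role_prompt
```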
Plan for rollback. When you’re letting AI agents make autonomous decisions, assume you’ll need to undo some of them. Build in easy ways to revert changes and test functionality. Rolling back after a 3-hour autonomous AI session is daunting.
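One concrete approach, assuming your project lives in git (a sketch, not a prescription): checkpoint before the agents start, so a known-good state is always one command away.

```python
# Checkpoint/rollback around an autonomous agent session, via git.
# Assumes the project is a git repo; label names are illustrative.
import subprocess

def git(*args: str) -> None:
    subprocess.run(["git", *args], check=True)

def checkpoint(label: str) -> None:
    """Commit and tag the current state before an autonomous session."""
    git("add", "-A")
    git("commit", "--allow-empty", "-m", f"pre-agent checkpoint: {label}")
    git("tag", f"pre-agent/{label}")

def rollback(label: str) -> None:
    """Hard-reset the working tree to the pre-agent checkpoint."""
    git("reset", "--hard", f"pre-agent/{label}")

# Usage: checkpoint("security-audit") before kicking off the agents;
# rollback("security-audit") if the app comes back non-functional.
```

Even with a checkpoint you’ll still cherry-pick the changes worth keeping, but at least the starting point is recoverable.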
Monitor the debates. The ability to watch AI agents discuss and debate solutions is incredibly valuable for understanding their reasoning. But you need to intervene when they’re going down the wrong path. I did some of this; I’d like to do more going forward.
The Bottom Line: Agents Managing Agents Autonomously is Already Here, But You Have to Learn How It Works
There is no “set and forget” in AI. Not today. Not really.
AI agents managing other AI agents isn’t some distant future. It’s happening today in Replit V3. And it’s genuinely powerful for complex problems that require multiple areas of expertise.
But it’s not magic. It requires human judgment, careful oversight, and a lot of cleanup when things go wrong.
For B2B founders, the implications are huge:
- You can tackle more complex projects without hiring specialists
- You need to be prepared for AI that makes autonomous decisions you disagree with
- User experience and business priorities still require human judgment
- The sophistication curve may outpace what your users actually want
We’ll see a lot more “agents managing agents” capabilities rolled out across AI platforms in 2025. Some will be game-changers. Others will create more problems than they solve, or will turn out to be mostly demoware.
The key: Become an expert in any AI agent you use. And in any AI agents that manage other agents. You can’t cut corners here. Use these capabilities strategically, maintain human oversight, and remember that your users might prefer simple AI that works reliably over sophisticated AI that requires constant management.