Research: The Future of Software Development Teams with AI

Compiled: February 2026 Purpose: Complete research synthesis from top commentators and evidence-based sources on how AI is reshaping software teams, workflows, and engineering practice.

Sources Reviewed

Source	Type	Date
Thoughtworks Future of Software Engineering Retreat	Multi-day retreat synthesis (Chatham House Rule)	Feb 2026
Dan Shapiro	Five Levels of AI Coding (blog)	Jan 2026
StrongDM “Software Factory”	Simon Willison write-up of production deployment	Jan 2026
Mitchell Hashimoto	AI Adoption Journey - 6 steps (blog)	Jan 2026
Addy Osmani	“Agents Need a Manager” / Agentic Engineering (blog)	Jan 2026
HBR / Berkeley Haas	“AI Doesn’t Reduce Work — It Intensifies It” (study, 200 employees)	Late 2025
GitClear	AI Code Quality Research (211M lines analysed)	2024–2025
DORA	AI Capabilities Model, 2025 Accelerate State of DevOps Report	2025
Tom Dale (Ember.js)	Mental health impact commentary	2025

Theme 1: The “Junior Dev” Framing is Obsolete

Almost every early take on AI coding settled on the same metaphor: treat it like a junior developer. This framing is now limiting. It undersells what AI agents can do (parallel execution, zero onboarding, instant duplication) and ignores genuine new risks (epistemic debt, drift, decision bottlenecks).

Old framing vs New framing

Old Framing	New Framing
AI is a junior dev	AI is an entire class of agent worker with different physics
Review its code	Invest in specs, tests, and constraints so code review becomes secondary
Pair with it	Orchestrate parallel streams and calibrate trust per task
It makes mistakes	Non-determinism requires verification infrastructure, not just eyeballs
Use it for boilerplate	Delegate entire work packages with acceptance criteria
Measure lines of code	Measure coherence, comprehension, and system stability

Theme 2: Where Does the Rigor Go? (Thoughtworks Retreat)

The single most important question from the retreat. If AI takes over code production, engineering discipline doesn’t disappear — it migrates. The retreat identified five destinations:

1. Upstream to specification review

Teams shifting review from code to the plan preceding it: “pre-reviewing plans and post-reviewing engineering”
Specs need new formats — traditional user stories too vague for AI agents
Teams rediscovering EARS (Easy Approach to Requirements Syntax), state machines, decision tables
Implication: Bad specs produce bad code at scale

2. Into test suites as first-class artifacts

TDD produces dramatically better results from AI agents — the retreat’s most shareable insight
Mechanism: TDD prevents agents writing tests that verify broken behaviour
Tests become “deterministic validation for non-deterministic generation”
Some practitioners treating generated code as expendable — if tests pass, code is acceptable regardless of how it looks
Reframe: TDD is a form of prompt engineering

“I’ve gotten better results from TDD and agent coding than I’ve ever gotten anywhere else, because it stops a particular mental error where the agent writes a test that verifies the broken behaviour.”

3. Into type systems and constraints

Make incorrect code unrepresentable rather than reviewing code after generation
Separate specifications (what should change) from constraints (what must not be touched)
Constraints limit blast radius — when a constraint must be broken, it signals a new system boundary

4. Into risk mapping

Tier code by business blast radius: internal tools vs external-facing vs safety-critical
New core engineering discipline: “What is the blast radius if this code is wrong, and is our verification proportional to that risk?”
Shift from craft model (every line hand-reviewed) to risk management model (verification investment matches exposure)

5. Into continuous comprehension

Code review historically served as a learning mechanism — mentorship, shared understanding, codebase familiarity
Alternatives: weekly architecture retrospectives, ensemble programming, AI-assisted code comprehension tools
Losing review-as-learning without replacing it creates a comprehension gap that compounds

“Paired programming solves all of this. If it’s important to understand the system, then do it all the time.”

Theme 3: The Middle Loop — A New Category of Work (Thoughtworks Retreat)

Nobody in the industry has named this yet.

Software development has two recognised loops:

Inner loop: Developer’s personal cycle (write, test, debug)
Outer loop: Delivery cycle (CI/CD, deployment, operations)

The retreat identified a third: the middle loop — supervisory engineering work sitting between them.

What middle loop work involves:

Directing, evaluating, and fixing AI agent output
Decomposing problems into agent-sized work packages
Calibrating trust in agent output
Recognising plausible-looking but incorrect results
Maintaining architectural coherence across parallel agent streams

Who excels at it:

Think in delegation/orchestration rather than direct implementation
Strong mental models of system architecture
Can rapidly assess output quality without reading every line
These are skills experienced engineers already have — but rarely explicitly developed or recognised in career ladders

Career identity crisis:

Genuine crisis for developers who fell in love with programming
Many hired specifically to translate tickets into code — that work is disappearing
Historical parallel: computer graphics engineers in 1992 hand-coded polygon rendering → 1994 pushed into hardware → job became animation/lighting → now custom physics. Each time abstraction rose, those who insisted they were hired to render polygons were left behind.

PM convergence:

Developers now thinking about what to build and why — work that belonged to PMs
One large tech company actively researching whether PM role needs a new name
Another training all PMs to work in Markdown inside developer tools

Theme 4: Maturity Models and Adoption Journeys

Dan Shapiro’s Five Levels of AI Coding

Level	Name	Description
1	Spicy Autocomplete	Tab-complete on steroids
2	Chat Pair Programmer	Conversational back-and-forth
3	The Trap	Agents write lots of code, humans lose comprehension — “the uncanny valley of AI coding”
4	AI-Native Development	Rearchitected workflows where AI writes and humans specify/verify
5	Dark Factory	Full autonomous operation (theoretical)

Key insight: Level 3 is where most teams stall. Productivity numbers look great but system understanding erodes. Teams that don’t deliberately move to Level 4 practices accumulate epistemic debt.

Mitchell Hashimoto’s 6-Step Adoption Journey

Drop the chatbot. Stop using ChatGPT in a browser tab. Use AI inside the IDE.
Reproduce your existing work. Use AI to redo tasks you already know how to do. You can verify quality because you know the answer.
End-of-day agents. Queue up agent tasks at end of day. Review results next morning. Builds trust calibration.
Outsource slam dunks. Give agents the straightforward, well-defined work. Free human time for hard problems.
Engineer the harness. Build AGENTS.md, custom rules, project context files. The harness is more valuable than any single prompt.
Always have an agent running. Continuous background agent work on lower-priority tasks. You review and redirect.

Key insight: “Invest in the harness, not the prompts.”

Addy Osmani’s “Agents Need a Manager”

Two modes: high-touch (complex, ambiguous work) and async (well-defined, parallelisable)
Three-part delegation split: 70% well-scoped agent tasks / 20% ambiguous requiring human judgment / 10% fully manual
Two-agent pattern: One agent generates, another reviews — mimics human pair programming
WIP limits for agents: Treat agent tasks like kanban cards. Too many parallel streams = review bottleneck
Operating loop: Plan → Delegate → Monitor → Review → Integrate

Theme 5: The Evidence — What’s Actually Happening

HBR / Berkeley Haas Study (200 employees, real workplace)

AI doesn’t reduce work — it intensifies it
Workers reported handling more tasks, not fewer
Cognitive load increased as workers became supervisors of AI output
Organisations assumed AI would free capacity; instead it raised the bar for what was expected

GitClear Research (211M lines of code analysed)

Code churn doubled in AI-assisted codebases
Refactoring dropped from ~25% to ~10% of changes
Copy/paste patterns rose from 8.3% to 12.3%
“Moved” code (proxy for refactoring) declining
Signal: AI produces append-only code by default. Without explicit refactoring guidance, codebases grow faster than they improve.

Tom Dale (Ember.js creator) on Mental Health

Described witnessing a “mental health crisis” among experienced developers
The identity shift from “person who writes code” to “person who supervises code” is genuinely difficult
Some senior engineers are leaving the profession rather than adapting

DORA AI Capabilities Model (2025 Report)

Identified 7 AI capabilities for software delivery
Key finding: “AI amplifies existing strengths and weaknesses”
Teams with good practices get better; teams with poor practices get worse faster
AI does not fix broken processes — it accelerates them

Theme 6: Agent Topologies and Enterprise Architecture (Thoughtworks Retreat)

Conway’s Law didn’t retire. It got more complicated.

Speed mismatch

Agents clear backlogs in days then hit cross-team dependencies, architecture reviews, human-speed decisions
Result: same delivery speed, more frustration — bottleneck shifts from engineering capacity to everything else

Agent drift

Agents learning from context diverge over time (e.g., database agent on e-commerce vs ERP)
Debate: manage drift (standardise) or embrace it (local optimisation)?

Decision fatigue as new bottleneck

Agents produce faster than leaders can review/approve
Middle managers previously serving as coordination become approval bottlenecks
Open question: fewer managers, differently-skilled managers, or fundamentally different coordination?

The StrongDM “Software Factory” (via Simon Willison)

No human code review of AI-generated code whatsoever
Instead: “scenario holdout testing” — tests held back from agents, used only for verification
“Digital Twin Universe” — full parallel environment for agent testing
Spending ~$1,000/day/engineer on AI tokens
10% of engineers (3 people) doing “AI engineering” — maintaining agent harnesses, not writing product code
Represents Dan Shapiro’s Level 4/5 in practice

Theme 7: Self-Healing Systems (Thoughtworks Retreat)

Prerequisites that don’t exist yet:

Clear ledger of every change
Operating system for agents with identity controls
Strong generic mitigation (rollback, feature flags) working without code changes
Fitness functions defining “healthy” in agent-evaluable terms

The latent knowledge problem

Senior engineers carry decades of pattern-matching for incidents (never documented)
Need “agent subconscious”: knowledge graph from post-mortems and incident data
Human nuance step still essential

Incident commander problem

LLMs tend toward positive reinforcement and agreement
Need “angry agents” designed to challenge dominant hypotheses

Agent coordination risks

Real example: agent told to keep files under 500 lines → made individual lines longer
Multiple agents can create oscillating feedback loops

Theme 8: Security, Governance, and Agile (Thoughtworks Retreat)

Security is dangerously behind

Security session had low attendance — reflects industry pattern
Email access enables password resets and account takeovers
Full machine access for dev tools = full machine access for anything
Recommendation: Platform engineering drives secure defaults. Don’t rely on individual devs.

Agile is evolving, not dying

Some teams compressing sprints to one week with AI automating ceremonies
Others rediscovering XP practices (pair programming, ensemble dev, CI) for tight feedback loops
Active regression: AI-generated large changesets pushing teams toward waterfall-like patterns
Direct reversal of DORA findings on batch size and stability

Batch size regression

Ease of producing large changesets with AI = larger, less frequent releases
Software stability declining as batch size increases
Flagged as needing “industry attention”

Theme 9: The Human Side — Roles, Skills, Experience (Thoughtworks Retreat)

Productivity/experience paradox

Developer productivity and developer experience are decoupling
Orgs can get more output even where devs report lower satisfaction, more cognitive load, reduced flow
Sharp reframe: “Call it agent experience instead — wallets open faster to invest in agent performance, and the overlap with conditions that help humans is nearly complete”

Staff engineers under pressure

More important and more stressed than ever
Use AI tools less than juniors but save more time per session (broader context, deeper architecture knowledge)
Should become “friction killers” — removing impediments for both humans and agents
Many have learned helplessness after years of hearing “no budget for improvements”

Juniors are more valuable, not less

More profitable than ever — AI gets them past net-negative phase faster
Call option on future productivity
Better at AI tools than seniors (no pre-existing habits to unlearn)

Mid-levels are the real concern

Came up during decade-long hiring boom
May not have developed fundamentals needed to thrive
Represent bulk of industry by volume
Retraining genuinely difficult — “no organization has solved it yet”

University of Waterloo co-op model highlighted

Deep theoretical foundations + 2.5 years industry internships (six 4-month rotations)
Intern-to-hire pipelines outperforming traditional graduate recruiting

Theme 10: Agent Swarms (Thoughtworks Retreat)

First barrier is mental, not technical

Engineers trained in sequential decomposition struggle with parallel agent work
Asking agents to parallelise work explicitly and observing results teaches more than theory

Collective convergence > individual accuracy

Swarm of individually imperfect agents can produce valuable outcomes
System architecture must guide convergence
Design principle borrowed from distributed systems and biological swarm intelligence

“Patrol workers on loops” — the more common pattern

Agents running ETL transforms, data quality checks, business process monitors on continuous cycles
“The unsexy work of data reliability running always-on”
Organisations with strong, well-designed APIs significantly better positioned

Theme 11: Technical Foundations (Thoughtworks Retreat)

Programming languages for agents

Every existing language designed for humans
Principle: “What is good for AI is good for humans”
Languages making incorrect code unrepresentable help both agents and humans
Radical possibility: source code becomes transient artifact, generated on demand, never stored
Counter-argument: deterministic validation requires stable artifact to test against

Semantic layers and knowledge graphs

Decades-old technologies suddenly relevant as grounding layer for domain-aware agents
Large telecom’s entire domain ontology captured in ~286 concepts
Practical value in legacy modernisation: auto-generate event storming artifacts from code, humans validate

The agentic operating system

Agent identity and permission management
Memory and context-window management
Work ledger (future, current, past work) with skills, acceptance criteria, SLOs, cost constraints
Agent = more than persona/goals/context — includes history of work performed
Work ledger as core primitive — analogous to financial blockchain: searchable, auditable

Key Open Questions (Retreat)

On work and identity

How do we help engineers who love writing code find meaning in supervisory work?
What professional development pathways lead to the middle loop?
If PM and developer roles converge, what is the resulting role?

On organizational design

If agents make middle management bottlenecks more visible, what’s the response?
How redesign enterprise architecture when agents cross team boundaries but governance can’t?

On trust and verification

What would need to be true to stop reviewing AI-generated code entirely?
Can test suites and constraints provide sufficient verification without human inspection?
How build trust in fundamentally non-deterministic systems?

On speed and stability

Are productivity gains being offset by stability losses from larger batch sizes?
Will development need to slow down because decision volume overwhelms human capacity?
How measure the real cost of cognitive debt?

Synthesis: What to Act On Now

Invest in the harness, not the prompts. AGENTS.md, test infrastructure, scenario holdouts, code quality metrics, WIP limits.
Rigor migrates — track where yours is going. Specs, tests, constraints, risk mapping, comprehension practices.
Name the middle loop. Recognise supervisory engineering as real work. Update career ladders.
Watch batch size. AI makes large changesets easy — this is a stability regression. Keep batches small.
TDD is the strongest form of prompt engineering. Tests before code is the single highest-leverage practice for AI-assisted development.
Staff engineers are your leverage. Reposition them as friction killers, not just architects.
Mid-levels need a plan. The retraining problem is real and unsolved. Don’t ignore it.
Security can’t wait. Agent access = full access. Platform engineering must drive secure defaults.
Measure comprehension, not just output. Epistemic debt is invisible until it’s catastrophic.
Start now, start small. Hashimoto’s step 1: stop using a browser chatbot. Move AI into the IDE.