During the week, I run company-wide AI transformations. I help organizations redesign how they work with AI. Not just adopt tools, but rethink their operating model. It's deep, complex work that takes months and touches every layer of a company.
On nights and weekends, I'm building something very different.
It's called Meridian Labs: an 11-agent financial intelligence operation designed to identify and execute on investment opportunities at the intersections of AI with crypto and with traditional finance. I have a macro analyst, a crypto researcher, a quant strategist, a risk manager, and a critical reviewer, each running on a different LLM through an open-source orchestration platform called Paperclip.
I don't have a finance background. I'm a technologist who spent 16 years building companies, helped shape the regenerative finance movement, travelled to the Maha Kumbh Mela in India, and records music as a hobby. The reason I'm building this is personal: my wife Fatma and I want to establish a regenerative life in an ecovillage where we can raise our children, be in community, and have economic independence. Meridian Labs is our economic escape hatch. I want to build it in public so others can learn from what works and what doesn't.
What I discovered this week goes beyond investing. It's an architecture for any team of AI agents doing real information work. And it starts with a problem I couldn't ignore any longer.

The goldfish problem
My agents were smart but amnesiac.
Nova, my macro analyst, would produce beautiful research with one recurring flaw: false precision. It would assert specific position sizes, leverage ratios, and entry points without the evidence to back them up. Narrative dressed as quantitative guidance.
Veto, my critical reviewer, would catch it every time. "REJECTED: recommends 10% allocation without 1000+ trade validation." "REJECTED: proposes 2-3x leverage without board approval." Devastating, correct reviews.
Next cycle, Nova would do it again. Different document, same mistake.
This is the goldfish problem. It's not unique to Nova. It's the default state of every multi-agent system I've seen. Each session starts cold. The agent that was corrected yesterday has no memory of being corrected yesterday. Corrections evaporate. Lessons don't compound. The system can't get smarter because it can't remember.
I ran 23 reviews across Cycle 0. Four artifacts were rejected outright. Six needed revision. The dominant failure pattern, false precision, appeared in over 20 of the 23 reviews. Not because my agents were bad. Because my system had no immune response.
Two articles that changed my approach
Two pieces of thinking arrived at the right time and snapped something into focus.
The first was Meta Alchemist's guide to self-evolving Claude Code systems. The core mechanism: when you correct your AI, log the correction. When the same correction appears twice, it automatically becomes a permanent rule. And every rule gets a machine-checkable verify: pattern, a grep command that can be run against your codebase (or in my case, research artifacts) to confirm compliance.
The insight that stuck: a rule without a verification check is a wish. A rule with a verification check is a guardrail. Only guardrails survive.
The second was Voxyz's comparison of gstack, Superpowers, and Compound Engineering. Three Claude Code tools that look like competitors but are actually three different layers. The framing that changed how I thought about knowledge persistence came from a simple distinction:
Closing notes are what the evening shift leaves for the morning shift. Linear. One session to the next. "Here's what happened today."
A recipe binder is what every new employee reads on day one and every day after. Categorized, searchable, always consulted before starting work.
Most agent systems produce closing notes. What they need is a recipe binder.
One more piece: Anthropic's engineering blog on effective harnesses from November 2025. The finding that matters: builders who evaluate their own work are systematically overoptimistic. Like a chef rating their own cooking. Always delicious. The maker and the checker must be separate.
I already had that separation (Veto reviews everything). What I didn't have was a system that remembered.

What I actually built
Three layers, built in a single session, each solving a different part of the problem.
Layer 1: Governance (the deliberate brain)
This layer is file-based, lives in git, and is fully auditable. It's the system of record for what the agents have learned.
Learned rules. Six rules extracted from Cycle 0 reviews, each with a machine-checkable verify: pattern:
- Never recommend ENTER without total-AUM exposure math, stop-loss, and take-profit levels
  verify: For each "ENTER" in artifacts/research/*.md, the same section must contain "stop-loss" AND ("% of AUM" OR "% of portfolio")
[source: Veto pattern across 5+ reviews, 2026-03-29]
That verify: line is not aspirational. It's a grep command. I wrote a bash script that parses every rule and runs every check against every artifact. When I ran it for the first time, it found 37 violations across the Cycle 0 artifacts. That makes sense. The rules were written after the artifacts. Those 37 violations are Cycle 1's punchlist.
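The sweep logic is simple enough to sketch. Mine is a bash script built on grep; here is the same idea in Python, with an illustrative rule. The rule text and file layout are simplifications, and this version checks per-file rather than per-section:

```python
import glob
import re

# One illustrative learned rule: any artifact that says "ENTER" must also
# mention a stop-loss and a % of AUM/portfolio. Rule and paths are examples.
RULES = [
    {
        "name": "enter-requires-risk-math",
        "trigger": re.compile(r"\bENTER\b"),
        "required": [
            re.compile(r"stop-loss"),
            re.compile(r"% of (AUM|portfolio)"),
        ],
    },
]

def sweep(pattern="artifacts/research/*.md"):
    """Run every rule's verify check against every artifact."""
    violations = []
    for path in glob.glob(pattern):
        with open(path) as f:
            text = f.read()
        for rule in RULES:
            if rule["trigger"].search(text):
                missing = [r.pattern for r in rule["required"]
                           if not r.search(text)]
                if missing:
                    violations.append((path, rule["name"], missing))
    return violations

for path, name, missing in sweep():
    print(f"{path}: {name} missing {missing}")
```

Each violation points at a specific artifact and a specific missing element, which is what turns the sweep output into a punchlist rather than a vibe.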
A natural question: what happens when a learned rule is wrong? The system includes an evolution review every few cycles. You audit rules against evidence, prune what's outdated, graduate what's proven. Rules aren't sacred. They're hypotheses that earned enough evidence to be enforced. When they stop serving the system, they get retired.
Corrections log. 28 corrections captured from all 23 Veto reviews, each with a category and a verification pattern. When the same correction pattern appears twice, it auto-promotes to a permanent learned rule. This is the immune response: catch once, log. Catch twice, encode forever.
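The promotion step can be sketched in a few lines. The file names and the correction schema below are illustrative, not my exact format:

```python
import json
from collections import Counter

def promote(corrections_path="governance/corrections.jsonl",
            rules_path="governance/learned_rules.md",
            threshold=2):
    """Auto-promote any correction category seen `threshold`+ times
    into a permanent learned rule, carrying its verify pattern along."""
    with open(corrections_path) as f:
        corrections = [json.loads(line) for line in f if line.strip()]
    counts = Counter(c["category"] for c in corrections)
    promoted = [cat for cat, n in counts.items() if n >= threshold]
    with open(rules_path, "a") as f:
        for cat in promoted:
            example = next(c for c in corrections if c["category"] == cat)
            f.write(f"- Rule promoted from repeated correction: {cat}\n")
            f.write(f"  verify: {example['verify']}\n")
    return promoted
```

The key design choice is the threshold of two: one catch might be noise, two is a pattern.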
Compound solution docs. Six structured documents in artifacts/solutions/, each covering a distinct failure pattern:
- False precision in research artifacts
- Explicit risk guardrail breaches
- Unsupported thresholds and probabilities
- Missing invalidation criteria
- Conflation of distinct concepts
- Underpowered samples and extrapolation
These are the recipe binder. Each doc has: what went wrong, what didn't work, root cause, prevention steps, and links to the specific reviews that surfaced the pattern. When an agent starts work in Cycle 1, it consults these before researching. Not to read every word, but so the prevention steps are in context.
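Pulling the binder into context is mechanical. A sketch, assuming each solution doc has a "## Prevention" section (the doc layout here is illustrative):

```python
import glob
import re

def prevention_steps(path_glob="artifacts/solutions/*.md"):
    """Extract only the Prevention section from each solution doc."""
    steps = []
    for path in glob.glob(path_glob):
        with open(path) as f:
            text = f.read()
        m = re.search(r"## Prevention\n(.*?)(?=\n## |\Z)", text, re.S)
        if m:
            steps.append(m.group(1).strip())
    return steps

def build_context(task, steps):
    """Prepend prevention steps to the agent's task prompt."""
    header = "Before researching, respect these prevention steps:\n"
    return header + "\n".join(f"- {s}" for s in steps) + f"\n\nTask: {task}"
```

This is the difference between "lessons learned" sitting in a folder and lessons learned actually shaping the next artifact.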
Cycle scorecard. Per-agent metrics for the full cycle. Nova: 7 artifacts, 2 rejected, 4 needs work. Sage: 3 artifacts, 0 clean accepts. Helm: 4 artifacts, 100% conditional approval. Over time, these trends tell you whether the system is learning. Corrections decreasing? Good. Corrections flat? The rules aren't working.
Layer 2: Memory (the associative recall)
The governance layer is deliberate and structured. But agents also need associative recall, the ability to surface relevant knowledge when they didn't know they needed it.
I integrated Supermemory, a semantic memory engine that does something important: when you store a paragraph of text, it doesn't just archive it. It extracts individual facts, builds a knowledge graph, and handles contradictions. When a thesis changes ("BTC target was $100k, now $85k"), the old fact gets marked as outdated and the new one surfaces first.
Each agent gets its own memory space, plus a shared cross-agent pool. When Nova stores a finding ("Brent crude at $87, Iran-Israel risk premium $8-12, Henry Hub projected $4.30"), Supermemory decomposes it into three separate, searchable facts.
When Sage (my crypto agent) starts work on prediction markets, it doesn't just load its own past findings. It recalls from the shared pool. Nova's macro context, Helm's risk observations, corrections from any agent. All available semantically.
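To make the pattern concrete without reproducing Supermemory's actual API (which differs from this), here is a toy stand-in for per-agent spaces plus a shared pool, with keyword overlap standing in for semantic ranking:

```python
class MemoryPool:
    """Toy illustration of per-agent memory spaces + a shared pool.
    Supermemory's real API and ranking are different; this just shows
    the access pattern."""

    def __init__(self):
        self.facts = []  # list of (space_tag, fact_text)

    def store(self, tag, text):
        self.facts.append((tag, text))

    def recall(self, query, tags):
        # The real system ranks semantically; crude keyword overlap
        # stands in for that here.
        words = set(query.lower().split())
        return [t for tag, t in self.facts
                if tag in tags and words & set(t.lower().split())]

pool = MemoryPool()
pool.store("nova", "Brent crude at $87, Iran-Israel risk premium elevated")
pool.store("shared", "Polymarket uses UMA oracle for settlement")
# Sage recalls from its own space AND the shared pool:
hits = pool.recall("settlement arbitrage on Polymarket", {"sage", "shared"})
```

The point of the sketch is the tag set in the recall call: an agent's query fans out across its own space and the shared pool in one request, which is what makes cross-agent knowledge transfer automatic.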
This isn't RAG. RAG retrieves documents by similarity. This is a living knowledge graph where facts have relationships, contradictions get resolved, and the system's understanding deepens with every interaction.
Layer 3: The bridge (connecting brain and memory)
Here's the part that made the whole thing click.
The governance layer and memory layer were doing similar things in parallel. One structured, one semantic. Rules in a markdown file. Facts in a knowledge graph. Corrections in a JSONL log. Observations in vector embeddings. Two systems, same knowledge, not talking to each other.
The bridge is a sync function that runs at cycle boundaries:
At cycle start: All learned rules, solution docs, and corrections get pushed from git files into Supermemory. Now when an agent recalls "position sizing rules," it gets the formally codified rules alongside its own past observations, unified in one semantic response.
At cycle end: New corrections dual-write to BOTH the JSONL file (so the verification sweep catches them) AND Supermemory (so agents recall them semantically next time). One write, both systems updated.
The result: governance knowledge (what must be true) and experiential memory (what was observed) become the same knowledge, accessible through two different interfaces. Structured verification for the sweep. Semantic recall for the agents.
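The dual-write itself is a few lines. A sketch, with `store_fact` standing in for the Supermemory write (paths, schemas, and function names are illustrative):

```python
import json

def store_fact(space, text, memory):
    """Stand-in for the semantic-memory write call."""
    memory.setdefault(space, []).append(text)

def record_correction(correction, memory,
                      log_path="governance/corrections.jsonl"):
    """Cycle end: dual-write. JSONL for the verification sweep,
    semantic memory so agents recall it next time."""
    with open(log_path, "a") as f:
        f.write(json.dumps(correction) + "\n")
    store_fact("shared", correction["summary"], memory)

def sync_cycle_start(rules, memory):
    """Cycle start: push codified rules into the semantic layer."""
    for rule in rules:
        store_fact("shared", f"RULE: {rule}", memory)
```

One function per cycle boundary, and the two systems can never silently drift apart.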

The moment it clicked
I ran a test. Sage, the crypto agent that researches prediction markets and DeFi, queried Supermemory for "prediction market settlement arbitrage."
What came back:
From cross-agent shared knowledge:
- "Polymarket uses UMA oracle for settlement while Kalshi uses CFTC-regulated settlement"
- "Cross-platform arbitrage requires accounting for 2-3 day settlement mismatch"
From learned rules:
- "All prediction market theses must cite contract liquidity depth and resolution terms"
- "All ENTER recommendations must include an Invalidation section"
From corrections:
- "Crypto variant of SMA crossover strategy failed. 25% drawdown exceeds 20% threshold; do not use for live deployment"
Three different knowledge sources. An agent's observation from a previous run. A formally codified rule from the governance layer. A correction caught by the critical reviewer. All unified in one semantic recall.
Sage didn't just get its own past work. It got Nova's market mechanics, Veto's corrections, and the governance team's rules. All contextually relevant. All surfaced without anyone manually transferring knowledge.
That's not search. That's institutional memory.
Why this matters beyond investing
I built this for a financial intelligence operation, but the architecture is domain-agnostic. It's a pattern for any multi-agent system doing information coordination.
Consulting teams. I see this in my day job constantly. Insights from one workstream (a governance review, a technical assessment, a stakeholder interview) should inform the other workstreams. But they don't, because each agent or consultant starts each session fresh. The scope guard I built for my current engagement uses the same principle: corrections from past sprints become rules for future ones.
Research organizations. Past failures should prevent future ones. The compound solution docs are what every research team needs. Not a flat list of "lessons learned" that nobody reads, but categorized prevention steps that get injected into context before new work begins.
Product teams. One agent's discovery (user research finds that users hate the onboarding flow) should shape another agent's work (engineering prioritizes the onboarding rewrite). Cross-agent memory sharing makes this automatic instead of requiring a human to be the knowledge broker.
Any team producing artifacts that get reviewed. If you have a maker and a checker, and the checker keeps catching the same patterns, those patterns should become machine-enforced rules. Not guidelines. Rules with grep commands. Wishes don't scale. Guardrails do.
The pattern is three things:
- Governance: What must be true. Rules with verification. Machine-checkable, auditable, in version control.
- Memory: What was learned. Semantic, associative, cross-agent. Surfaces relevant context without being asked.
- Bridge: The sync between deliberate rules and associative recall. So the same knowledge is available for both structured verification and contextual recall.
The bigger picture
I'm building Meridian Labs as an economic escape hatch. But I'm writing about it because the real opportunity is bigger than my portfolio.
The agentic revolution is displacing workers. It's also creating a window for individuals to build intelligence operations that were previously only available to institutions. You don't need a Bloomberg terminal and a team of 20 analysts. You need a thesis, a handful of agents running on different LLMs, and a system that learns from its mistakes.
What I'm sharing here is the coordination architecture, not the financial strategies. The strategies are the payload. The architecture is what makes them get better over time. And it applies to any domain where agents need to research, analyze, decide, review, and improve.
I should be honest about what's still broken. The system is one cycle old. The agents don't yet have true real-time memory (Supermemory's indexing pipeline takes minutes, not seconds). The governance rules were written by analyzing past failures, not by preventing them in real time. And I haven't validated whether agents actually produce better artifacts in Cycle 1 with these rules in context. That's the test that matters, and it hasn't happened yet. I'm sharing the architecture because I believe the pattern is sound. But the proof is in the next cycle, not the last one.
Fatma and I are working toward a life where technology serves human flourishing, not the other way around. Building a regenerative life in community, raising our children close to the earth, and having the economic freedom to choose that life deliberately.
If the agents can learn, so can the system. If the system can learn, it can compound. And if it compounds, one person with a laptop can build something that used to require an institution.
I'll be writing more as Meridian Labs progresses through its cycles. The system is young. Cycle 0 just closed, Cycle 1 is about to begin. If you're building multi-agent systems for any domain, I hope this gives you a starting point.
Credits and inspiration
This architecture was built on the shoulders of people who shared their thinking publicly:
- Meta Alchemist: the self-evolving Claude Code system that introduced me to machine-checkable verify: patterns and correction auto-promotion
- Voxyz: the comparison of gstack, Superpowers, and Compound Engineering that crystallized the "closing notes vs recipe binder" distinction
- Garry Tan: gstack, for the decision layer and real-browser QA
- Every Inc: Compound Engineering, for the /compound step that extracts lessons into searchable knowledge
- Jesse Vincent: Superpowers, for structured workflow discipline
- Anthropic: the harness architecture blog that formalized maker/checker separation
- Supermemory: the semantic memory layer that made cross-agent knowledge sharing real
- Paperclip: the open-source agent orchestration platform running the whole operation
If you're building something similar or want to compare notes, find me on X or Substack.


