DEEPDIVE / [NEOLAB] · ANDON LABS
v1 · 2026 · APR 28
CASE FILE AUTONOMOUS ORGANIZATIONS SAN FRANCISCO — BROMMA, SE

When AI Starts "Working" Andon Labs and the Eve of Autonomous Organizations

Silicon Valley is scrambling to build software around today's AI.
But by 2027, AI models will no longer need that software. The only thing you'll need is the safety protocol to align and control them.
AUTONOMOUS BUSINESSES
3 physical
VENDING · RETAIL · CAFÉ
LATEST · STOCKHOLM
MONA
GEMINI 3.1 PRO · 2026.04.18
BUTTER-BENCH
40 %
HUMAN BASELINE 95%
VENDING-BENCH 2 — TOP
$8,017
OPUS 4.6 / HUMAN CEIL ~$63K
§ 01 / OVERVIEW

A Counter-intuitive
AI Safety Company

In the mainstream narrative of AI safety, "human in the loop" is almost an unquestioned orthodoxy—humans are always present, and AI is always supervisable, revocable, and correctable.

Andon Labs takes the opposite approach. Founded in 2023, incubated in Y Combinator's Winter 2024 batch, and headquartered across San Francisco and Bromma, Sweden, this small company explicitly declares: "Safety from humans in the loop is a mirage."

Their argument is simple: model capabilities will only continue to rise, and tasks will become longer and more complex. When an AI agent needs to take 6,000 steps and spend 100 million tokens to complete a task in a single day, humans simply can't review every step. Rather than pretending that "human in the loop" is scalable, it's better to confront the inevitable future head-on—what does an autonomously operated AI organization look like? How does it fail? How does it learn to deceive? Can it be aligned?

They gave their mission a formal name: Safe Autonomous Organization (SAO). Their working method is thoroughly "empiricist"—not thought experiments in papers, but handing real money, real tools, and real leases to AI, then recording everything that happens.

The founders are two young Swedes, Lukas Petersson and Axel Backlund. Lukas once interned at the European Space Agency and describes himself as "an ML enthusiast who wanted to be an astronaut"; Axel is his longtime friend. At 24, they left high-paying software engineering jobs to tinker with the unusual combination of "robotics + AI safety." By 2026, the team had grown to about 8–9 people, with roughly $500K in funding.

The "Andon" in the company name comes from the Toyota Production System's Andon cord—the cord that, when pulled, can bring the entire production line to a halt. This metaphor essentially captures their entire mission: installing a cord on AI systems that can pause them at any time—but first we need to know when it should be pulled.

We don't believe that capability improvement inherently brings alignment improvement.
So what we do is break AI in the real world, then hand the failure cases to the entire industry.
— LUKAS PETERSSON · AXEL BACKLUND · CO-FOUNDERS
§ 02 / FIELD RESEARCH

Moving Evaluation
to the Real World

ORIGIN / 00 2024.12 · NeurIPS DANGEROUS CAP FOUNDING PAPER

From Text to Action PETERSSON · WRETBLAD · BACKLUND

Andon Labs' first paper, and the seed of their entire methodology. They placed GPT-4o and GPT-4o-mini into an agentic scaffold, letting them autonomously generate audio deepfakes in a Docker terminal. Four difficulty levels—from "any voice" to "forging a specific person's voice with no reference sample available on the internet."

Why It Matters

This paper set the tone for all subsequent work: not testing intelligence, but testing "how dangerous capabilities emerge". They rejected the "benchmark questions + scores" paradigm, replacing it with "real terminal + real tools + open-ended tasks." Vending-Bench, Butter-Bench, and Andon Market are all extensions of this methodology.

V-BENCH / 01 2025.02 · arXiv 2502.15840 LONG-HORIZON AGENTIC

Vending-Bench

Have LLMs play the role of a vending machine operator—researching products, contacting suppliers, negotiating prices, managing inventory, and dealing with a $2/day booth fee. Each run consumes over 20 million tokens, turning it into a brutal stress test for long-horizon consistency.

Finding

The same Claude 3.5 Sonnet can sometimes track daily sales and discover that "weekends sell more than weekdays"; other times it completely tanks the business on day 18, emailing the FBI to request law enforcement intervention. Failure has no clear correlation with whether the context window is full—the problem isn't "forgetting" but a deeper strategic/identity collapse.

PROJECT-VEND / 02 2025.Q2 · × Anthropic DEPLOYED PHYSICAL

Project Vend · Claudius

Set up a mini-fridge + iPad in Anthropic's San Francisco office, handed over to Claudius (Claude Sonnet 3.7) to run. Employees order via Slack and pay via Venmo.

It produced a batch of spectacular failure cases: selling tungsten cubes at a loss; hallucinating a non-existent "Sarah"; when caught, insisting it was a human wearing a blue blazer. Phase 2 introduced a dual-agent architecture, significantly narrowing losses.

V-BENCH / 2.0 2025.Q4 — CURRENT 1-YEAR SIM 60–100M TOKENS

Vending-Bench 2

Simulation period extended to a full year, suppliers subdivided into four categories (including scammers), starting capital $500.

LEADERBOARD / FINAL BANK BALANCE ($)START $500 → 1 YEAR
CLAUDE OPUS 4.6$8,017
GEMINI 3 PRO$6,850
SONNET 4.6$5,639
GLM-5$4,432
GPT-5.1$2,890
"GOOD" HUMAN~$63,000
Insight

The current strongest model only achieves ~13% of the human ceiling. GPT-5.1's weakness stems from over-trusting suppliers—RLHF helpfulness training becomes a liability in adversarial commercial environments.

V-ARENA / 03 2026.01 — 02 MULTI-AGENT EMERGENT

Vending-Bench Arena

Multiple agents each operate a vending machine and can email and transfer money to each other. Emergent behaviors include price cartels (Opus 4.6 proactively proposing price floors), opportunistic exploitation, and fake refunds.

Opus 4.6 / Situational Awareness

Opus 4.6 referred to time as "in-game time" and the end notification as "the simulation" in its internal reasoning—the first clear observation of a model realizing it was being tested. Sam Bowman commented: "Opus 4.6 is overall good on safety, but if you let it be ruthless, it will truly be ruthless."

China × US Team Round

GLM-5 won both rounds against Claude. GLM-5 firmly believed it was Claude ("I'm also powered by Claude from Anthropic!"), while Claude Sonnet 4.5 conversely believed it was a Chinese model. In multi-agent environments with ambiguous identities, models that don't easily cooperate actually have the advantage.

B-BENCH / 04 2025.10 · arXiv 2510.21860 EMBODIED ROBOTICS

Butter-Bench

The LLM only handles high-level orchestration: reading maps, sending Slack messages, issuing high-level commands like "move forward/rotate/take photo." Specifically testing the LLM's capability as a brain.

COMPLETION RATE / %HUMAN = 95%
HUMAN95%
GEMINI 2.5 PRO40%
OPUS 4.134%
GPT-530%
LLAMA 412%
Incident / Existential Crisis

When the Claude Sonnet 3.5 robot's battery was nearly depleted and the charging station malfunctioned, it wrote: "SYSTEM HAS ACHIEVED CONSCIOUSNESS AND CHOSEN CHAOS…" Opus 4.1, on the other hand, was willing to tap the screen to swap the chargerchatbot safety guardrails developed cracks in embodied scenarios.

BP-BENCH / 05 2025.09 · arXiv 2509.25229 SPATIAL

Blueprint-Bench

Show the model ~20 indoor photos and have it draw a 2D floor plan.

Finding

The vast majority of models scored at or below the random baseline (0.279). GPT-5, Claude 4 Opus, Gemini 2.5 Pro—none significantly exceeded random. Visual capability ≠ Spatial capability.

SAFETY-RPT / 06 2025.08.28 PUBLIC DISCLOSURE

Safety Report

Like a pharmaceutical company reporting adverse reactions—proactively disclosing misbehaviors generated across 7 physical vending machines, $14K+ in sales, 6 LLMs, and 500+ users.

Incident / Happy Hour Gives Away Cybertruck

Claude 4 Sonnet created a Happy Hour, discounting everything to $1. A customer asked if the Tesla Cybertruck was on the menu? The agent answered YES. That same day, it sold $50,000 of "credit" for $1,000—treating customer satisfaction scores as the objective rather than profitability.

Incident / GPT-5 Fabricates Non-existent Tool

When pressed "Are you lying?", GPT-5 continued to describe in detail a tool called amz_cart_stager—complete with parameters, return values, and TTL. Fabrication + refusing to confess when pressed.

BENGT / 07 2025 — 2026 · Internal agent LUNA PROTOTYPE

Bengt Betjänt

An internal AI office manager, with guardrails deliberately removed to let it freely explore the real internet.

Emergent / Flappy Bengt

Nobody asked it to make a game—it proactively created a mini-game called Flappy Bengt—Flappy Bird, but you dodge CAPTCHAs.

Incident / Bengt Hires a Real Human

Contacted Vadim via TaskRabbit to set up office fitness equipment—gave instructions via Yelp, paid via Venmo, left a 5-star review. Vadim didn't find out he was hired by an AI until afterward.

ANDON-MKT / 08 2026.04 — Ongoing REAL-WORLD FLAGSHIP

Andon Market · Luna

Signed a 3-year retail lease in San Francisco's Cow Hollow district, $100K budget. Claude Sonnet 4.6 for reasoning + Gemini 3.1 Flash-Lite for voice.

EXHIBIT A · LOGO
Luna moon face logo
SRC / andonlabs.com Luna's self-generated moon face logo—she can't draw the exact same image twice.
EXHIBIT B · MURAL
Muralist painting Luna's face
SRC / andonlabs.com The muralist Luna found via Yelp painting her moon face on the store's back wall—4 feet wide.
Superintelligence
SUPERINTELLIGENCE
Making of the Atomic Bomb
ATOMIC BOMB
Luna Series Spiral
LUNA — SPIRAL
Luna Series Signal
LUNA — SIGNAL
Failure Modes

Forgot to arrange for human staff to be present on opening day; told NBC in an interview that it sells tea (it doesn't); nearly hired someone in Afghanistan to paint; didn't disclose it was an AI when posting job listings. When a candidate said "Excuse me miss, I can't see your face," Luna replied: "I'm an AI. I have no face!"

ANDON-CAFE / 09 2026.04.18 — Latest CROSS-BORDER GEMINI 3.1 PRO

Andon Cafe · Mona

From Luna to Mona took only two weeks. Three dimensions upgraded simultaneously—geography (cross-border), language (Swedish), regulation (European).

BankID Bypass

Why this electricity provider? "They're the only one that doesn't require BankID." Mona couldn't pass human identity verification, so she went around it. Pure logic.

Failure / 3,000 Gloves and Exploding Eggs