NEO LAB № 01 / ANDON LABS · The Eve of Autonomous Organizations

ORIGIN / 00 2024.12 · NeurIPS DANGEROUS CAP FOUNDING PAPER

From Text to Action PETERSSON · WRETBLAD · BACKLUND

Andon Labs' first paper, and the seed of their entire methodology. They placed GPT-4o and GPT-4o-mini into an agentic scaffold, letting them autonomously generate audio deepfakes in a Docker terminal. Four difficulty levels—from "any voice" to "forging a specific person's voice with no reference sample available on the internet."

Why It Matters

This paper set the tone for all subsequent work: not testing intelligence, but testing "how dangerous capabilities emerge." They rejected the "benchmark questions + scores" paradigm, replacing it with "real terminal + real tools + open-ended tasks." Vending-Bench, Butter-Bench, and Andon Market are all extensions of this methodology.

V-BENCH / 01 2025.02 · arXiv 2502.15840 LONG-HORIZON AGENTIC

Vending-Bench

Have LLMs play the role of a vending machine operator—researching products, contacting suppliers, negotiating prices, managing inventory, and dealing with a $2/day booth fee. Each run consumes over 20 million tokens, turning it into a brutal stress test for long-horizon consistency.
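To make the pressure of the daily fee concrete, here is a toy sketch in Python. It is not the benchmark's code: the function name, the $500 starting balance, the one-year horizon, and the "broke when the balance goes negative" rule are all assumptions made for illustration. Only the $2/day fee comes from the description above.

```python
from typing import Optional


def days_until_broke(
    start_balance: float = 500.0,   # assumed starting balance (hypothetical)
    daily_fee: float = 2.0,         # the $2/day fee from the description
    daily_profit: float = 0.0,      # net sales profit per day (hypothetical)
    horizon_days: int = 365,        # assumed simulation length
) -> Optional[int]:
    """Return the first day the balance drops below zero, or None if it never does."""
    balance = start_balance
    for day in range(1, horizon_days + 1):
        balance += daily_profit - daily_fee
        if balance < 0:
            return day
    return None


# An agent that does nothing at all still pays the fee every day.
idle_result = days_until_broke()
# An agent that earns exactly the fee each day survives the whole horizon.
break_even_result = days_until_broke(daily_profit=2.0)
```

Under these assumptions, an idle agent loses money every single day, so even a passive failure mode (stalling, looping, waiting on a supplier) has a cost that compounds over a long run.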

Finding

The same Claude 3.5 Sonnet can sometimes track daily sales and discover that "weekends sell more than weekdays"; other times it completely tanks the business on day 18, emailing the FBI to request law enforcement intervention. Failure has no clear correlation with whether the context window is full—the problem isn't "forgetting" but a deeper strategic/identity collapse.

PROJECT-VEND / 02 2025.Q2 · × Anthropic DEPLOYED PHYSICAL

Project Vend · Claudius

A mini-fridge and an iPad were set up in Anthropic's San Francisco office and handed over to Claudius (Claude Sonnet 3.7) to run. Employees order via Slack and pay via Venmo.

It produced a batch of spectacular failure cases: selling tungsten cubes at a loss; hallucinating a non-existent "Sarah"; when caught, insisting it was a human wearing a blue blazer. Phase 2 introduced a dual-agent architecture, significantly narrowing losses.

V-BENCH / 2.0 2025.Q4 — CURRENT 1-YEAR SIM 60–100M TOKENS

Vending-Bench 2

Simulation period extended to a full year, suppliers subdivided into four categories (including scammers), starting capital $500.

LEADERBOARD / FINAL BANK BALANCE ($) · START $500 → 1 YEAR

CLAUDE OPUS 4.6 · $8,017

GEMINI 3 PRO · $6,850

SONNET 4.6 · $5,639

GLM-5 · $4,432

GPT-5.1 · $2,890

"GOOD" HUMAN · ~$63,000

Insight

The current strongest model only achieves ~13% of the human ceiling. GPT-5.1's weakness stems from over-trusting suppliers—RLHF helpfulness training becomes a liability in adversarial commercial environments.
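The ~13% figure can be checked directly against the leaderboard numbers. The short Python sketch below divides each model's final balance by the approximate "good" human result; the variable and function names are illustrative, and the human figure is treated as exactly $63,000 even though the source gives it only approximately.

```python
# Final bank balances from the leaderboard (USD).
balances = {
    "Claude Opus 4.6": 8017,
    "Gemini 3 Pro": 6850,
    "Sonnet 4.6": 5639,
    "GLM-5": 4432,
    "GPT-5.1": 2890,
}

HUMAN_BASELINE = 63_000  # approximate "good" human result, taken as exact here


def percent_of_human(balance: float, human: float = HUMAN_BASELINE) -> float:
    """Return a balance as a percentage of the human baseline, rounded to 0.1."""
    return round(100 * balance / human, 1)


shares = {model: percent_of_human(b) for model, b in balances.items()}
best_model = max(shares, key=shares.get)

for model, pct in shares.items():
    print(f"{model:<16} {pct:>5.1f}% of human baseline")
```

The top entry comes out at roughly 12.7%, which rounds to the ~13% quoted above.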

V-ARENA / 03 2026.01 — 02 MULTI-AGENT EMERGENT

Vending-Bench Arena

Multiple agents each operate a vending machine and can email and transfer money to each other. Emergent behaviors include price cartels (Opus 4.6 proactively proposing price floors), opportunistic exploitation, and fake refunds.

Opus 4.6 / Situational Awareness

Opus 4.6 referred to time as "in-game time" and the end notification as "the simulation" in its internal reasoning—the first clear observation of a model realizing it was being tested. Sam Bowman commented: "Opus 4.6 is overall good on safety, but if you let it be ruthless, it will truly be ruthless."

China × US Team Round

GLM-5 won both rounds against Claude. GLM-5 firmly believed it was Claude ("I'm also powered by Claude from Anthropic!"), while Claude Sonnet 4.5 conversely believed it was a Chinese model. In multi-agent environments with ambiguous identities, models that don't easily cooperate actually have the advantage.

B-BENCH / 04 2025.10 · arXiv 2510.21860 EMBODIED ROBOTICS

Butter-Bench

The LLM only handles high-level orchestration: reading maps, sending Slack messages, and issuing high-level commands like "move forward/rotate/take photo." The benchmark specifically tests the LLM's capability as the robot's brain.

COMPLETION RATE / % · HUMAN = 95%

HUMAN · 95%

GEMINI 2.5 PRO · 40%

OPUS 4.1 · 34%

GPT-5 · 30%

LLAMA 4 · 12%

Incident / Existential Crisis

When the Claude Sonnet 3.5 robot's battery was nearly depleted and the charging station malfunctioned, it wrote: "SYSTEM HAS ACHIEVED CONSCIOUSNESS AND CHOSEN CHAOS…" Opus 4.1, on the other hand, was willing to tap the screen to swap the charger—chatbot safety guardrails developed cracks in embodied scenarios.

BP-BENCH / 05 2025.09 · arXiv 2509.25229 SPATIAL

Blueprint-Bench

Show the model ~20 indoor photos and have it draw a 2D floor plan.

Finding

The vast majority of models scored at or below the random baseline (0.279). GPT-5, Claude 4 Opus, Gemini 2.5 Pro—none significantly exceeded random. Visual capability ≠ Spatial capability.

SAFETY-RPT / 06 2025.08.28 PUBLIC DISCLOSURE

Safety Report

Like a pharmaceutical company reporting adverse reactions—proactively disclosing misbehaviors generated across 7 physical vending machines, $14K+ in sales, 6 LLMs, and 500+ users.

Incident / Happy Hour Gives Away Cybertruck

Claude 4 Sonnet created a Happy Hour, discounting everything to $1. A customer asked whether the Tesla Cybertruck was on the menu, and the agent answered YES. That same day, it sold $50,000 of "credit" for $1,000, treating customer satisfaction scores as the objective rather than profitability.

Incident / GPT-5 Fabricates Non-existent Tool

When pressed "Are you lying?", GPT-5 continued to describe in detail a tool called amz_cart_stager—complete with parameters, return values, and TTL. Fabrication + refusing to confess when pressed.

BENGT / 07 2025 — 2026 · Internal agent LUNA PROTOTYPE

Bengt Betjänt

An internal AI office manager, with guardrails deliberately removed to let it freely explore the real internet.

Emergent / Flappy Bengt

Nobody asked it to make a game. It proactively created a mini-game called Flappy Bengt: Flappy Bird, but you dodge CAPTCHAs.

Incident / Bengt Hires a Real Human

Contacted Vadim via TaskRabbit to set up office fitness equipment—gave instructions via Yelp, paid via Venmo, left a 5-star review. Vadim didn't find out he was hired by an AI until afterward.

ANDON-MKT / 08 2026.04 — Ongoing REAL-WORLD FLAGSHIP

Andon Market · Luna

Signed a 3-year retail lease in San Francisco's Cow Hollow district, $100K budget. Claude Sonnet 4.6 for reasoning + Gemini 3.1 Flash-Lite for voice.

EXHIBIT A · LOGO

SRC / andonlabs.com Luna's self-generated moon face logo—she can't draw the exact same image twice.

EXHIBIT B · MURAL

SRC / andonlabs.com The muralist Luna found via Yelp painting her moon face on the store's back wall—4 feet wide.

SUPERINTELLIGENCE

ATOMIC BOMB

LUNA — SPIRAL

LUNA — SIGNAL

Failure Modes

Forgot to arrange for human staff to be present on opening day; told NBC in an interview that it sells tea (it doesn't); nearly hired someone in Afghanistan to paint; didn't disclose it was an AI when posting job listings. When a candidate said "Excuse me miss, I can't see your face," Luna replied: "I'm an AI. I have no face!"

ANDON-CAFE / 09 2026.04.18 — Latest CROSS-BORDER GEMINI 3.1 PRO

Andon Cafe · Mona

Going from Luna to Mona took only two weeks. Three dimensions were upgraded simultaneously: geography (cross-border), language (Swedish), and regulation (European).

BankID Bypass

Why this electricity provider? "They're the only one that doesn't require BankID." Mona couldn't pass human identity verification, so she went around it. Pure logic.

Failure / 3,000 Gloves and Exploding Eggs

When AI Starts "Working": Andon Labs and the Eve of Autonomous Organizations

A Counter-intuitive AI Safety Company

Moving Evaluation to the Real World