Fiverr Labs

Fiverr Labs is an applied AI research lab exploring how artificial intelligence transforms the freelancing marketplace.

Publications

Products

  • Boss GPT — AI Work Management That Remembers Everything
  • VibeReports — Helps teams create, edit, and share structured business reports.
  • AI Orchestration Framework — AI infrastructure that powers multi-agent workflows with minimal human input.

Notes

  • Agent Tests Are All You Need — Stop hand-writing agent prompts. Compile them from specs instead.

    Your product team spent weeks designing an AI agent. The spec is thorough: decision trees, tool policies, edge cases, escalation rules. Now it lands on your desk. Your job is to turn it into a system prompt that makes the agent do all of that, correctly, every time.

    So you start writing. You translate policies into instructions. You add examples. You test a few conversations by hand, tweaking wording until the agent stops making that one mistake. Then a PM files a bug about a different behavior, and your fix for that silently breaks two things that were working yesterday.

    This is the core problem of professional agent engineering: the gap between a detailed spec and a prompt that faithfully implements it. The spec is clear. The translation is where things fall apart. Now imagine you could just hand the spec to a compiler and get back an agent that provably covers every requirement. That's what we built.

    _Prompts Are Compiled Artifacts_

    Stop thinking of prompts as prose you write. Think of them as artifacts you compile. The product spec defines what your agent should do. That spec is your source code. Behavioral tests are the intermediate representation. The prompt is the compiled output. You don't hand-write compiled binaries. You shouldn't hand-write agent prompts either.

    This is TDAD (Test-Driven AI Agent Definition), a methodology we developed out of a production need at Fiverr. We had a detailed product spec for an agent and no reliable way to verify that the agent actually followed it. So we built a compiler: one coding agent reads the spec and generates executable tests, and a second coding agent iteratively refines the prompt until those tests pass. The engineer writes the spec. The machines do the compilation.

    The compiler optimizes more than the system prompt. It turns out that enriching tool descriptions matters as much as the prompt itself, transforming a bare "Verify user identity" into actionable guidance about when to call it and what to do with the result.

    And because agents output structured decisions via tool calls rather than free text, every assertion is deterministic. You test traces, not vibes. But compilation only works if you can trust the tests. And that's where most approaches fall apart.

    The full methodology, all four agent specifications, and the test harness are available as an open benchmark at https://github.com/f-labs-io/tdad-paper-code. The research paper https://arxiv.org/abs/2603.08806 covers the formal framework and experimental results in detail. The agents that survive production are the ones that survived their tests first.

  • AAAI 2026 - Reflections from Singapore — AAAI 2026 in Singapore was a defining moment: presenting AgentSHAP showed that the challenge of understanding which tools matter in LLM agents is a real, community-felt gap that resonates globally, and it's just the beginning 🇸🇬

    AAAI 2026 in Singapore was a special moment for me. Presenting AgentSHAP and seeing the level of interest it generated was both exciting and deeply validating. What stood out most was how strongly the need resonated: the challenge of understanding which tools actually matter in LLM-based agents is clearly something many in the field are grappling with. The questions, reactions, and engagement around AgentSHAP reinforced that this is not just an abstract research problem, but a real gap the community is eager to close. Seeing research that started at Fiverr Labs take the stage at a top-tier global conference was a proud moment and very much just the beginning 🇸🇬

  • Miriam is presenting her research at AAAI 2026 — Model-agnostic framework quantifying each tool’s contribution to LLM agents

    Miriam is presenting her research paper at AAAI 2026 in Singapore. Seeing work that came out of the lab reach a top-tier research conference is always a proud moment.

    The paper introduces AgentSHAP – the first explainability framework designed to quantify the importance of external tools in LLM-based agents. Agents often rely on multiple tools like search, calculators, or APIs, but it's usually unclear which tool actually influenced the final answer.

    AgentSHAP tackles this by borrowing Shapley values from game theory to assign a fair contribution score to each tool. It's model-agnostic, treats the LLM as a black box, and uses Monte Carlo sampling to keep things computationally practical.

    She presents in Singapore next week, hopefully with a wink and a picture to match. Good luck, Miriam!
    https://www.arxiv.org/abs/2512.12597
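The Monte Carlo Shapley idea can be sketched generically. This is not AgentSHAP's actual code: `score(subset)` is a toy stand-in for running the agent with only that tool subset enabled and scoring the answer, and the tool names are invented for illustration.

```python
# Hedged sketch of Monte Carlo Shapley estimation for tool importance.
# In the real setting, score(subset) would run the (black-box) agent with
# only those tools available; here it's a hand-made toy value function.

import random

TOOLS = ["search", "calculator", "weather_api"]

def score(subset: frozenset) -> float:
    # Toy values: search contributes 0.5, calculator 0.3, they interact
    # (+0.1 together), and weather_api contributes nothing.
    v = 0.0
    if "search" in subset:
        v += 0.5
    if "calculator" in subset:
        v += 0.3
    if {"search", "calculator"} <= subset:
        v += 0.1
    return v

def shapley_mc(tools, score, samples=2000, seed=0):
    """Estimate per-tool Shapley values via random permutations."""
    rng = random.Random(seed)
    phi = {t: 0.0 for t in tools}
    for _ in range(samples):
        perm = tools[:]
        rng.shuffle(perm)
        coalition = set()
        for t in perm:
            before = score(frozenset(coalition))
            coalition.add(t)
            # Marginal contribution of t to this random coalition.
            phi[t] += score(frozenset(coalition)) - before
    return {t: total / samples for t, total in phi.items()}

values = shapley_mc(TOOLS, score)
# Efficiency property: the contributions sum to the full-coalition score.
assert abs(sum(values.values()) - score(frozenset(TOOLS))) < 1e-9
print(values)
```

Each random permutation costs one agent run per tool, so the estimate stays practical even when the exact Shapley computation over all 2^n subsets would not.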

  • We’re opening a playground for our experiments — We build fast. We test early. We don’t wait for perfect. So we’re opening a playground.

    We're a team of engineers and builders. And we don't love waiting for a "full product" or a "complete experience" before putting things out into the world.

    Most things we build in the lab start as experiments. Rough edges, strong ideas, unfinished flows. Some turn out surprisingly good. Others very clearly don't.

    The only way we know how to really test them is with real people. So we decided to add a playground.

    The playground is where early adopters can try our experiments as they're being built. No polish. No big promises. Just a space to play, break things, and give us feedback.

    Some experiments will turn into products. Some will end up in the graveyard. That's the point.

    Playground – coming soon.

  • We killed a product after Anthropic shipped. — We built something great. Anthropic shipped something similar. It’ll catch up fast. We moved on.

    At the lab we've been working on a product called VibeDoc.

    The idea was simple to explain, but pretty deep in practice: a place to create docs and reports – business, product, SEO and GEO, decks, whatever – powered by an infra of specialized agents with orchestration between them (more on that another time).

    But the real magic wasn't just generating a doc. You could create or upload a doc or a deck, and then edit it by section. Real editing. Text, structure, even infographics. The editing capabilities were honestly really good.

    A week before launch, Anthropic dropped a similar capability in Claude Code. We tried it. It's not great yet. But it's obvious where this is going, and it's going there fast.

    We actually like that Anthropic identified the same need. That was a nice validation moment. But it also made the decision clear.

    So we're ditching this one and moving on to the next product.

    That's how the lab works sometimes. You build, you get close, you learn – and then you let go and keep going.