Cleo Smart Skills

An evidence-driven skill catalog with a learning loop wrapped around it. Built in 3 days at a hackathon. Shipped to internal production. Hand-written code under 100 lines.


The thesis

Cleo went AI-native. Every function — engineering, analytics, PM, marketing, people-ops — got Claude Code, Cursor, and a starter library of SKILL.md files: reusable agent workflows for things like "open a PR" or "query Redshift." Anyone could drop one into their session and get a sharp, on-brand result instead of a generic one.

The library was good. The system around it was breaking down in three predictable ways:

  1. Distribution was manual. To use a skill you copy-pasted a markdown file into the right folder, hoped you got the path right, and trusted the README was current. Most people didn't bother. They reinvented the workflow in-prompt every time.

  2. Discovery was worse. "Is there a skill for X?" got answered by Slack-scrolling. No search, no ratings, no signal of which skills were good or stale. The catalog was invisible at the moment of need.

  3. We had no idea what was missing. Curators wrote skills based on intuition. No systematic view of the prompts the catalog failed to answer. No evidence trail when prioritising what to build next.

A skill library without a discovery surface, without quality signals, and without a feedback loop is just a folder of markdown files. So we built the platform around it.

The deeper bet: the same agent runtime that consumes skills can also produce evidence about them. Hooks capture usage, the trainer threads it, clusters surface the gaps, and curators turn gaps into skills. Every loop tightens the catalog. The product is the methodology rendered legible.


What we shipped

A live, internally-deployed product that closes the loop end-to-end.

1. The catalog

Browsable, searchable directory of every SKILL.md, mirrored from the canonical agent-skills repo. Live filter search backed by OpenAI text-embedding-3-small and Supabase pgvector — semantic, not just substring. Each skill shows description, tags, scope, required MCPs, install count, and average rating.
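Under the hood it is the standard embed-then-match shape. A minimal sketch, assuming a match_skills Postgres function wrapping the pgvector cosine search — the function name, parameters, and return shape are illustrative, not the shipped API:

```typescript
// Sketch: semantic catalog search (names like `match_skills` are illustrative).
import OpenAI from "openai";
import { createClient } from "@supabase/supabase-js";

const openai = new OpenAI();
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

export async function findSkills(query: string, limit = 5) {
  // Embed the query with the same model used to embed the SKILL.md content.
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });

  // Hypothetical Postgres function doing a pgvector cosine-distance search
  // over the skills table's embedding column.
  const { data: matches, error } = await supabase.rpc("match_skills", {
    query_embedding: data[0].embedding,
    match_count: limit,
  });
  if (error) throw error;
  return matches; // e.g. [{ slug, name, description, similarity }, ...]
}
```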

2. Skill detail pages

Full canonical SKILL.md content rendered editorially. Inline ratings (1–5 + comment) and usage stats. One-command install instructions for Claude Code and Cursor.

3. Ask Cleo — RAG chat

Natural-language questions about the catalog, answered by Claude Sonnet 4.6 doing RAG over skill content with citations. Used for "how do I write a release note?", "which skill should I use to query Redshift?", "what's our convention for PRDs?"
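The flow is retrieve-then-answer: pull the top-matching skills, hand them to the model as context, cite them back. A sketch under the same assumptions as the search snippet above; the module path, model id, and prompt shape are placeholders, not the shipped code:

```typescript
// Sketch: the "Ask Cleo" RAG flow (model id and prompt shape are assumptions).
import Anthropic from "@anthropic-ai/sdk";
import { findSkills } from "./search"; // the search sketch above

const anthropic = new Anthropic();

export async function askCleo(question: string) {
  // Retrieve the top-matching skills and pack them as numbered context.
  const matches = await findSkills(question, 5);
  const context = matches
    .map((m: any, i: number) => `[${i + 1}] ${m.name}\n${m.description}`)
    .join("\n\n");

  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-6", // assumed id for the Sonnet 4.6 the app targets
    max_tokens: 1024,
    system: "Answer questions about Cleo's skill catalog. Cite skills as [n].",
    messages: [{ role: "user", content: `${context}\n\nQuestion: ${question}` }],
  });
  return response.content;
}
```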

4. Two MCP servers

Both work with Claude Code and Cursor today.

User-facing MCP — lets agents search and manage the catalog from inside the session, no context switch:

  • find_skill(query) — semantic search, top matches with scores
  • install_skill(slug) — writes the skill into ~/.claude/skills/
  • rate_skill(slug, rating, comment) — post a 1–5 rating from inside the agent
  • ask_cleo(question) — same RAG endpoint as the website

Browser-based Google SSO restricted to the company domain. Token cached locally so subsequent runs are silent.
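A sketch of how one of these tools can be exposed over stdio with @modelcontextprotocol/sdk and Zod; the server name, backing endpoint, and handler body are illustrative, not the shipped code:

```typescript
// Sketch: the user-facing MCP exposing find_skill over stdio (handler is illustrative).
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "cleo-skills", version: "0.1.0" });

server.tool(
  "find_skill",
  "Semantic search over the skill catalog",
  { query: z.string().describe("What you are trying to do") },
  async ({ query }) => {
    // Hypothetical platform endpoint backing the same search as the website.
    const res = await fetch(
      `https://skills.internal.example/api/search?q=${encodeURIComponent(query)}`
    );
    const matches = await res.json();
    return { content: [{ type: "text" as const, text: JSON.stringify(matches, null, 2) }] };
  }
);

await server.connect(new StdioServerTransport());
```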

Trainer MCP — the behind-the-scenes one. Records what users actually do and feeds the curator queue:

  • record_observation — captures prompt, tool call, file edit, or agent response with full context (entities, file paths, tool names, response snippets)
  • list_opportunities / classify_opportunity / draft_skill_proposal — curator's queue and triage tools
  • log_recommendation_outcome — tracks whether surfaced skills got accepted, dismissed, ignored, or installed

5. Hooks — universal capture across both clients

Project hooks in .cursor/hooks.json and .claude/hooks/ give us the same observation shape across both surfaces:

| Captured signal | Cursor | Claude Code |
|---|---|---|
| Prompt text | beforeSubmitPrompt | UserPromptSubmit |
| Tool input + output | before/afterMCPExecution | Pre/PostToolUse |
| File edits | afterFileEdit | via PostToolUse |
| Agent response | afterAgentResponse | synthesised on Stop from transcript JSONL |
| Session end | stop | Stop |

A single user task can span Cursor and Claude Code — start exploring in Cursor, switch to Claude Code mid-task — and the trainer threads them together via shared session_id + recency signals.
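Whichever client fired the hook, the adapter normalises the event into one observation record before calling record_observation. A sketch of that shape, with field names assumed from the signals above rather than taken from the real schema:

```typescript
// Sketch: the normalised observation both hook adapters emit (field names are assumptions).
type ObservationKind = "prompt" | "tool_call" | "file_edit" | "agent_response" | "session_end";

interface Observation {
  session_id: string;          // shared across Cursor and Claude Code for threading
  source: "cursor" | "claude_code";
  kind: ObservationKind;
  occurred_at: string;         // ISO timestamp, drives the recency signal
  prompt_text?: string;        // UserPromptSubmit / beforeSubmitPrompt
  tool_name?: string;          // Pre/PostToolUse / before/afterMCPExecution
  tool_input?: unknown;
  file_paths?: string[];       // afterFileEdit / PostToolUse edits
  response_snippet?: string;   // afterAgentResponse / synthesised on Stop
  entities?: string[];         // named entities extracted for thread matching
}
```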

6. The trainer admin — the curator's queue

The thing nothing else at Cleo had: a UI that turns raw AI usage into prioritised skill investments backed by real evidence.

  • Clusters of similar prompts the catalog doesn't serve well, ranked by opportunity_score (evidence × confidence × gap severity)
  • Cluster detail with thread timeline: the initial prompt, follow-ups, tool traces, agent responses, and outcomes — not isolated prompts
  • Each opportunity is typed: missing_skill, improve_skill, promote_skill, dependency_gap
  • Curator actions: Promote (becomes a real skill), Mark Reviewed, Needs More Evidence, Archive

How it all fits together

┌──────────────────┐    ┌───────────────────────┐    ┌──────────────────┐
│ Engineer/analyst │    │ Skill catalog         │    │ Curator          │
│ in Cursor or     │    │ + RAG (Ask Cleo)      │    │ /admin/          │
│ Claude Code      │    │                       │    │ opportunities    │
└────────┬─────────┘    └───────────┬───────────┘    └────────┬─────────┘
         │                          │                         │
         │ prompts, tools,          │ find_skill              │ promote /
         │ files, responses         │ install_skill           │ improve /
         ▼                          │ ask_cleo                │ archive
┌──────────────────┐                │                         │
│ Hooks            │                │                         │
│ (.cursor/.claude)│                │                         │
└────────┬─────────┘                │                         │
         │                          │                         │
         ▼                          ▼                         ▼
┌──────────────────┐   ┌──────────────────────┐   ┌──────────────────┐
│ trainer-mcp      │──▶│ Supabase             │──▶│ Threading +      │
│                  │   │ (pg + pgvector)      │   │ cluster scoring  │
└──────────────────┘   └──────────────────────┘   │ backend          │
                                                   └──────────────────┘

The single insight: the same agent runtime that consumes skills can also produce evidence about them. The hooks fire on every event. The trainer threads them. The clustering surfaces gaps. The curator promotes a draft into the catalog. The MCP installs the new skill into the next session that asks for it. Every loop tightens the catalog.


Architecture

Stack

  • Frontend & API: Next.js 16 (App Router, Turbopack), React 19, TypeScript, Tailwind v4
  • Design system: Internal chat-mode primitives — light theme, glass treatments, brand-warm typography
  • Auth: Supabase Auth + Google SSO, domain-restricted, enforced server-side via middleware
  • Database: Supabase Postgres, pgvector enabled, RLS on
  • Embeddings: OpenAI text-embedding-3-small (1536 dims), IVFFlat index for ANN search
  • RAG / chat: Claude Sonnet 4.6 via the Anthropic API
  • MCPs: @modelcontextprotocol/sdk, Zod, stdio transport, Node 20+
  • Hosting: Vercel, GitHub Actions deploy on push to main
  • MCP distribution: GitHub Packages, scoped to internal org

Two-layer data model

Skills live in two places, decoupled on purpose:

  1. Canonical layer — SKILL.md files in the agent-skills repo. Frontmatter parsed exactly the same way our internal agent parses it. We never modify this format.

  2. Platform layer — Supabase rows that mirror canonical skills and add platform-only metadata (status, ratings, usage events, recommendations, embedding). Invisible to the canonical agent.

This means we can enrich skills without forking the format, stay drift-free with the canonical repo, and keep distribution via the internal agent's MCP registry effortless.
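As types, the split looks roughly like this; field names are illustrative, not the real schema:

```typescript
// Sketch: the two layers as types (illustrative field names).

// Canonical layer: parsed straight from SKILL.md frontmatter, never modified.
interface CanonicalSkill {
  slug: string;
  name: string;
  description: string;
  tags: string[];
  scope: string;
  requiredMcps: string[];
  body: string;                // the markdown below the frontmatter
}

// Platform layer: the Supabase row that mirrors a canonical skill and adds
// metadata the canonical agent never sees.
interface PlatformSkill extends CanonicalSkill {
  status: "draft" | "published" | "archived";
  installCount: number;
  avgRating: number | null;
  embedding: number[];         // text-embedding-3-small, 1536 dims
  lastSyncedAt: string;        // set by the curator's manual sync action
}
```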

The threading + clustering backend (the clever bit)

The hardest engineering problem on the project: turning a stream of disconnected hook events into a coherent picture of "what was this user trying to do?"

Threading. A thread is one user task. The heuristic scorer attaches a new prompt to an existing thread when its follow-up score crosses 0.45. The score is composed from:

  • same session (+0.25)
  • recent activity, within 10 min (+0.20)
  • follow-up phrase like "now / that / these / continue" (+0.25)
  • same workflow category (+0.15)
  • shared named entity (+0.20)
  • shared file or tool (+0.20)

The scorer surfaces a human-readable followup_reason ("same session, recent activity, follow-up phrase") so curators can see why a prompt was attached — and override if needed.
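A sketch of the scorer, using the weights and 0.45 threshold listed above; the input and thread shapes are assumptions:

```typescript
// Sketch: the heuristic follow-up scorer, using the weights from the list above.
interface ThreadContext {
  sessionId: string;
  lastActivityAt: number;       // epoch ms
  workflowCategory: string;
  entities: Set<string>;
  filesAndTools: Set<string>;
}

const FOLLOWUP_PHRASES = /\b(now|that|these|this one|continue|next)\b/i;
const THRESHOLD = 0.45;

function scoreFollowup(
  prompt: { sessionId: string; text: string; at: number; workflowCategory: string;
            entities: string[]; filesAndTools: string[] },
  thread: ThreadContext
) {
  const reasons: string[] = [];
  let score = 0;

  if (prompt.sessionId === thread.sessionId) { score += 0.25; reasons.push("same session"); }
  if (prompt.at - thread.lastActivityAt < 10 * 60_000) { score += 0.20; reasons.push("recent activity"); }
  if (FOLLOWUP_PHRASES.test(prompt.text)) { score += 0.25; reasons.push("follow-up phrase"); }
  if (prompt.workflowCategory === thread.workflowCategory) { score += 0.15; reasons.push("same workflow category"); }
  if (prompt.entities.some((e) => thread.entities.has(e))) { score += 0.20; reasons.push("shared entity"); }
  if (prompt.filesAndTools.some((f) => thread.filesAndTools.has(f))) { score += 0.20; reasons.push("shared file or tool"); }

  return { attach: score >= THRESHOLD, score, followup_reason: reasons.join(", ") };
}
```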

Clustering. The thread is the cohesive unit, but a single thread can feed multiple clusters when its follow-ups are typed differently. Conversely, a single cluster aggregates evidence from many threads across many sessions.

Scoring. opportunity_score = evidence_count × confidence × gap_severity. Confidence is composed from match-score gap (how much worse than a real hit), evidence count, and opportunity type. This gives curators a single sortable column instead of a wall of clusters.
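A sketch of that composition; the top-level formula is the one above, but the confidence weights here are invented for illustration:

```typescript
// Sketch: the sortable opportunity score (confidence weights are illustrative, not the real ones).
function opportunityScore(cluster: {
  evidenceCount: number;
  matchScoreGap: number;        // how far the best catalog hit fell short of a real match
  opportunityType: "missing_skill" | "improve_skill" | "promote_skill" | "dependency_gap";
  gapSeverity: number;          // 0..1
}) {
  const typeWeight = { missing_skill: 1.0, improve_skill: 0.8, promote_skill: 0.7, dependency_gap: 0.6 };
  // Confidence blends match-score gap, evidence volume, and opportunity type.
  const confidence = Math.min(
    1,
    0.5 * cluster.matchScoreGap +
      0.3 * Math.min(cluster.evidenceCount / 10, 1) +
      0.2 * typeWeight[cluster.opportunityType]
  );
  return cluster.evidenceCount * confidence * cluster.gapSeverity;
}
```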

Real validation in flight. The strongest case: "can you do the second one?" after a numbered list scored 0.85 and joined the right thread. The known weakness: "now" combined with same-session and recent-activity hits 0.7 even on topic switches. Both are surfaced honestly in the admin view rather than papered over.


How AI built it

This was a vibe-coded project: three days, three people, dozens of Vercel preview URLs, Co-Authored-By footers all over the git log. Hand-written code under 100 lines. Every commit on main was AI-authored; every line was reviewed before merge.

The recipe:

  • Claude Code (Opus 4.7, 1M context) — primary driver: backend logic, schema, hook adapters, agent orchestration, integration tests, and this README
  • Conductor + four parallel agents (Cursor Composer + Claude Code) — each rebuilt one surface (catalog, skill detail, ask, trainer admin) against a shared design-system bootstrap Claude Code authored first. Each agent got its own Vercel preview URL and self-verified before opening a PR
  • Cursor Composer — narrower targeted refactors (filter UX, trainer-polish pass, network-graph view)
  • Anthropic API (Claude Sonnet 4.6) — /ask RAG chat in the live web app
  • OpenAI text-embedding-3-small — catalog search; vectors stored in Supabase pgvector

The trainer's threading + follow-up scoring backend was pair-built across Claude Opus and Sonnet, with the humans split by surface: one collaborator on the backend, me on hooks + frontend integration, the other on UX + curator workflow. A shared UX_AGENT_BRIEF.md and METRIC_GLOSSARY.md acted as the spec across human and AI contributors — every agent read them before writing UI.

This was as much an experiment in how a small team uses AI to ship a real internal product end-to-end as it was a product. The product is the methodology rendered legible.


What's deliberately not in scope (yet)

Honest about the limits — these are future bets, not blockers:

  • Embedding-based threading is a stretch goal; today's threading is heuristic-only. The heuristic over-merges on topic switches with strong follow-up language; we surface followup_reason so curators can see and override
  • No JWT verification on write endpoints — the hackathon trusts user_email after a domain check. Production needs proper API auth
  • Cursor install_skill is paste-not-write — Cursor's hook surface doesn't expose a writable skill folder yet, so we return content for paste; Claude Code gets full auto-install
  • No webhook from the canonical fork — sync from the agent-skills repo is manual via a curator button. Webhook is Phase 2
  • Recommendation surfacing inside the session — today the catalog and chat are pull-based. Proactive in-session "this skill might help" is wired in the data model but not yet rendered in the agent
  • Raw prompt capture governance — observations include prompt text. The feasibility study documents the three-tier capture model (structured telemetry → summarised context → raw input). Production needs tighter retention and redaction

What I'm taking from this

1. The instrumentation is the research. A research platform that watches its own users at the level of every prompt, tool call, and file edit — and then produces a triaged queue of evidence-backed opportunities — collapses the gap between qualitative observation and quantitative prioritisation. This is the form factor I want for AI-native research broadly.

2. The same agent runtime that consumes a tool can produce evidence about it. Hooks make this nearly free. Once you have hooks, the question shifts from "how do we collect data?" to "what's worth listening for?" — which is a research question, not an engineering one.

3. AI agents can ship internal products end-to-end if you give them the right scaffolding. A shared design-system bootstrap, a glossary, a brief, four parallel agents on Vercel preview URLs. Hand-written code under 100 lines. The bottleneck is not "can the agents code." The bottleneck is the spec.

4. Honest threading beats clever threading. The heuristic scorer is wrong on topic switches. We didn't paper over it. The followup_reason makes the wrongness inspectable. A research tool that hides its own confidence is worse than one that surfaces it.


Built at Cleo's Tokenmaxxing Hackathon, May 2026, with two collaborators. AI collaborators: Claude Code (Opus 4.7), Cursor Composer (Sonnet 4.6), Anthropic API, OpenAI embeddings.