C3PO Development Roadmap

Protocol Institute Oracle — continuously updated, publicly reviewed

Last updated: 2026-05-20 — github.com/vgururao/c3po

This roadmap is a live document. Development on C3PO is continuous and openly tracked. The corpus, weights, scoring, and soul document are all in active revision. If you have questions or suggestions, reach out via protocolized.io.

Current state

Corpus — 19,634 vectors across 8 namespaces

Namespace | Vectors | Contents | Status
pdfs | 766 | 82 papers, essays & games from the Summer of Protocols | Stable
substack | 1,040 | 116+ Protocolized posts (fiction, theory, editorial) | Daily sync
videos | 2,940 | 91 YouTube talks and lectures | Stable
bibliography | 278 | 250+ externally cited works with abstracts | Stable
discord | 3,301 | #idle-musings and #protocol-watch community channels | Periodic sync
sig | 4,583 | 78 sessions across 4 SIG groups (SIGFPT, MRG, SIGPfB, ProtFiSIG) | Periodic sync
discord_links | 6,722 | Web content linked from Discord/SIG (fetched, chunked, relevance-scored) | v2 rescore in progress
transcripts | 4 | Published C3PO conversations (grows with use) | On demand

Scope

C3PO was reframed in May 2026 from a narrow research assistant to a broad-based oracle covering protocols and their full intellectual world: theory, fiction, history, technology, governance, memory, culture, and the science of coordination. The narrower scope produced systematic over-filtering of relevant material; the expanded scope is being validated through a dual scoring comparison.

Lexicon

914 PI corpus terms have been extracted from the PDF corpus and triaged into three categories: 233 PI-coined (invented by PI/SoP researchers), 320 PI-specific usage (standard terms given a distinct PI definition), and 349 standard field terms (useful for detecting overlap with adjacent fields). The lexicon has not yet been embedded or published; that is Phase 3 work.

Relevance scoring

Web links in discord_links carry two relevance scores. v1 (narrow: "is this protocol research?") was used for the initial fetch pass; it deleted 485 of 1,412 fetched URLs as irrelevant. v2 (broad: "is this part of the PI intellectual world?") rescored all 1,412 URLs under the expanded scope; 281 of the 485 deleted entries were rescued, and scores shifted strongly upward (the score-2 bucket grew from 276 to 668). The v2 scores are in the registry; the Pinecone metadata update and tier weight adjustment are pending.
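The comparison between the two scoring passes can be sketched as a small registry query. This is a minimal illustration, not the project's actual registry code: the `LinkRecord` fields, the `KEEP_THRESHOLD` cutoff, and the example URLs are all assumptions made for the sketch.

```python
from dataclasses import dataclass

# Hypothetical cutoff: entries scoring below this are treated as irrelevant.
KEEP_THRESHOLD = 1


@dataclass
class LinkRecord:
    url: str
    v1_score: int  # narrow scope: "is this protocol research?"
    v2_score: int  # broad scope: "is this part of the PI intellectual world?"


def dropped(records, version):
    """Return the records that fall below the cutoff under one scoring pass."""
    attr = f"{version}_score"
    return [r for r in records if getattr(r, attr) < KEEP_THRESHOLD]


def rescued(records):
    """Return the records dropped under v1 that the v2 rescore keeps."""
    return [
        r for r in records
        if r.v1_score < KEEP_THRESHOLD and r.v2_score >= KEEP_THRESHOLD
    ]


# Made-up entries standing in for the real registry.
registry = [
    LinkRecord("https://example.org/protocol-paper", v1_score=3, v2_score=3),
    LinkRecord("https://example.org/memory-essay", v1_score=0, v2_score=2),
    LinkRecord("https://example.org/unrelated-page", v1_score=0, v2_score=0),
]
```

Keeping both scores on each record, rather than overwriting v1 with v2, is what makes this kind of before-and-after comparison possible.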

Phases

Phase 2C (Live): Multi-source oracle with community layer

All 8 namespaces are active. Discord and SIG archives are integrated, with badge display, deep links, and tier-weighted retrieval. The web links layer adds injection filtering and dual relevance scoring. The expanded oracle scope is reflected in the system prompt and SOUL document.

  • 8 namespaces, 19,634 vectors
  • Discord + SIG retrieval with badge display and deep links
  • Web links fetch pipeline with prompt injection filter
  • Dual relevance scoring (v1 narrow / v2 expanded scope)
  • Expanded oracle scope: 11 topic areas, 9 intellectual commitments
  • 914-term lexicon triage (a/b/c classification)
  • Tier weighting revision in progress
  • SOUL.md v2 via corpus sampling in progress
  • Pinecone metadata update with v2 relevance scores pending
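The prompt injection filter on the web links pipeline combines pattern matching with a guard against hidden characters. The sketch below shows one way such a filter might look. The three patterns and the 1% density cutoff are illustrative assumptions; the roadmap does not list the actual patterns or thresholds.

```python
import re
import unicodedata

# Illustrative patterns only; the real filter's patterns are not listed here.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"you are now (an? )?unrestricted", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]

# Hypothetical cutoff: flag text when more than 1% of its characters are invisible.
MAX_INVISIBLE_DENSITY = 0.01


def invisible_density(text: str) -> float:
    """Fraction of characters in Unicode category Cf (format characters),
    which includes zero-width spaces and joiners."""
    if not text:
        return 0.0
    invisible = sum(1 for ch in text if unicodedata.category(ch) == "Cf")
    return invisible / len(text)


def is_suspicious(text: str) -> bool:
    """True when the text trips the density guard or matches any pattern."""
    if invisible_density(text) > MAX_INVISIBLE_DENSITY:
        return True
    return any(p.search(text) for p in INJECTION_PATTERNS)
```

Running the density check first means text padded with zero-width characters is rejected even if the padding was meant to break up a phrase that a pattern would otherwise catch.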

Phase 3 (Next): Lexicon, knowledge structure, and automation

Publish the PI lexicon, embed it as a queryable namespace, and automate corpus maintenance.

  • Magazine lexicon pass — extract fictional protocols, memetic concepts, and design fictions from Protocolized fiction archive (~65 posts, ~$0.70 Haiku)
  • Lexicon curation — 914 terms → publishable set for protocolized.io resource page + 40–60 term prompt block
  • Embed lexicon as definitions namespace in Pinecone — makes lexicon terms retrievable in context
  • Protocol observations database — named PI analyses of real-world protocols, from protocol-watching pieces
  • YouTube transcript pass — 161 deferred URLs, transcripts via youtube-transcript-api (no API key required)
  • launchd automation — daily sync_discord.py, weekly sync_sig.py + fetch_discord_links.py
  • GitHub Actions cron for sync_substack.py
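For the launchd automation item, a job definition can be generated with Python's standard `plistlib`. This is a sketch under assumptions: the job labels, script paths, and run times below are placeholders, not values from the project.

```python
import plistlib


def launchd_job(label, script_path, hour, minute, weekday=None):
    """Build a launchd job definition that runs a Python script on a calendar schedule."""
    interval = {"Hour": hour, "Minute": minute}
    if weekday is not None:
        interval["Weekday"] = weekday  # launchd treats both 0 and 7 as Sunday
    return {
        "Label": label,
        "ProgramArguments": ["/usr/bin/env", "python3", script_path],
        "StartCalendarInterval": interval,
    }


# Placeholder labels, paths, and times (assumptions for illustration).
daily = launchd_job(
    "io.protocolized.c3po.sync-discord", "/path/to/c3po/sync_discord.py",
    hour=6, minute=0,
)
weekly = launchd_job(
    "io.protocolized.c3po.sync-sig", "/path/to/c3po/sync_sig.py",
    hour=6, minute=30, weekday=1,
)

# Serialize to the XML property list format that launchd reads.
plist_bytes = plistlib.dumps(daily)
```

The resulting bytes would be written to a `.plist` file under `~/Library/LaunchAgents` and loaded with `launchctl`. Omitting `Weekday` makes the job run every day at the given time.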

Phase 4 (Planned): Corpus expansion

Broaden the sources the corpus draws from.

  • Attachment capture — download Discord/SIG media files at sync time before CDN URLs expire (24h window)
  • Additional Discord channels beyond #idle-musings and #protocol-watch
  • Twitter/X archive pass — 194 deferred URLs (requires Twitter API v2 credentials)
  • Exhibit extraction for PDFs — section summaries, list exhibits, figure captions, table summaries for the 82 core papers
  • Explore: newsletter/blog RSS feeds cited frequently in Discord as persistent sources
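For attachment capture, the sync job needs to know whether a CDN link is still live. The sketch below assumes that signed CDN URLs carry their expiry as a hexadecimal Unix timestamp in an `ex` query parameter. That parameter name and encoding are assumptions and should be checked against real attachment URLs before relying on them.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import parse_qs, urlparse


def cdn_expiry(url: str) -> Optional[datetime]:
    """Return the expiry time encoded in the URL, or None if absent or malformed.

    Assumes a hex Unix timestamp in an `ex` query parameter (unverified).
    """
    values = parse_qs(urlparse(url).query).get("ex")
    if not values:
        return None
    try:
        return datetime.fromtimestamp(int(values[0], 16), tz=timezone.utc)
    except (ValueError, OverflowError, OSError):
        return None


def needs_capture(url: str, now: datetime) -> bool:
    """True when the file should be downloaded during this sync."""
    expiry = cdn_expiry(url)
    # With no readable expiry, attempt the download rather than risk losing the file.
    return expiry is None or expiry > now
```

Downloading at sync time, rather than storing the URL for later, avoids depending on the expiry window at all; the check above only helps skip links that are already dead.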

Phase 5 (Research): Intelligence and routing

Make retrieval smarter about what's being asked, not just what was written.

  • Query-domain classification — detect when a query is about memory, fiction, governance, technology, etc. and boost the authoritative namespace accordingly (e.g. MRG for memory, ProtFiSIG for fiction)
  • Dynamic tier weighting — weights that adapt to query domain rather than fixed global multipliers
  • Chunk-type awareness — summary vectors weighted differently from body chunks depending on query specificity
  • Personalization hooks — MCP clients can specify a focus domain; C3PO adjusts retrieval emphasis
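Query-domain classification and dynamic tier weighting could be prototyped with something as simple as keyword matching. The sketch below is one possible starting point, not a design decision. The keyword lists, the `DOMAIN_BOOST` factor, and the `(source, similarity)` hit format are assumptions; only the MRG-for-memory and ProtFiSIG-for-fiction pairings come from the roadmap.

```python
import re

# Hypothetical keyword lists for two of the query domains.
DOMAIN_KEYWORDS = {
    "memory": {"memory", "archive", "remember", "forgetting"},
    "fiction": {"fiction", "story", "narrative", "worldbuilding"},
}

# Authoritative source per domain (pairings taken from the roadmap).
DOMAIN_AUTHORITY = {"memory": "MRG", "fiction": "ProtFiSIG"}

BASE_WEIGHT = 1.0
DOMAIN_BOOST = 1.5  # hypothetical multiplier for the authoritative source


def classify_query(query):
    """Return the domain with the most keyword hits, or None if nothing matches."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    best, best_hits = None, 0
    for domain, keywords in DOMAIN_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best, best_hits = domain, hits
    return best


def source_weight(source, domain):
    """Boost the authoritative source for the detected domain; otherwise stay neutral."""
    if domain is not None and DOMAIN_AUTHORITY.get(domain) == source:
        return BASE_WEIGHT * DOMAIN_BOOST
    return BASE_WEIGHT


def rerank(query, hits):
    """Sort (source, similarity) pairs by similarity scaled by the query-dependent weight."""
    domain = classify_query(query)
    return sorted(hits, key=lambda h: h[1] * source_weight(h[0], domain), reverse=True)
```

With a fixed global multiplier, every query would see the same source ordering bias. Here the weight depends on the detected domain, so a query with no recognized domain falls back to plain similarity ranking.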

Phase 6 (Future): Handoff and scale

Transition from personal infrastructure to Protocol Institute organizational infrastructure.

  • Migrate repo from vgururao/c3po to Protocol-Institute/c3po
  • Transfer API key billing and ownership to PI org accounts
  • Public API key program — open ask_c3po MCP access beyond current invite-only
  • Contributor guide — how community members can propose corpus additions or lexicon edits

Open questions under active consideration

Changelog

2026-05-20  Expanded oracle scope deployed: 11 topic areas, 9 intellectual commitments, new SCOPE section in system prompt and SOUL_EXCERPT. Roadmap page published.
2026-05-20  914-term lexicon triage complete: 233 PI-coined / 320 PI-specific / 349 standard field terms. v2 relevance rescore complete: 281 of 485 dropped web links rescued. Score-2 bucket more than doubled (276 → 668).
2026-05-20  discord_links namespace live: 1,412 URLs fetched from Discord/SIG messages, injection-filtered, chunked into ~6,722 vectors (post-enrichment). Prompt injection filter: 11 patterns + invisible-char density guard.
2026-05-20  Phase 2C deployed: Discord and SIG archives integrated into retrieval. Badges, deep links, tier-weighted merging. All 4 SIG groups indexed: SIGFPT, MRG, SIGPfB, ProtFiSIG (78 sessions, 4,583 vectors).
2026-05-19  Security hardening: IP strike/ban system, history-smuggling detection, MCP rate limit (100 calls/IP/day). Roleplay-as-unrestricted and corpus-weaponization filters added.
2026-05-18  Bibliography namespace: 252 externally cited works with abstracts and relevance scores (278 vectors).
2026-05-17  YouTube namespace: 91 talks, 2,940 vectors. Title-anchored embeddings, per-video summary vectors.
2026-05-15  PDF namespace: 82 SoP papers, 766 vectors. doc_summary vectors for summary-level retrieval.
2026-05-14  Substack namespace: 116+ Protocolized posts, 1,040 vectors including post_summary, collection_card, author_profile chunk types.
2026-05-12  Initial deployment: Phase 1, PDFs + Substack only, Cloudflare Workers, Voyage AI embeddings, Pinecone, Claude Sonnet.