C3PO Development Roadmap
Protocol Institute Oracle — continuously updated, publicly reviewed
Last updated: 2026-05-20 — github.com/vgururao/c3po
This roadmap is a live document. Development on C3PO is continuous and openly tracked. The corpus, weights, scoring, and soul document are all in active revision. If you have questions or suggestions, reach out via protocolized.io.
Current state
Corpus — 19,634 vectors across 8 namespaces
| Namespace | Vectors | Contents | Status |
|---|---|---|---|
| pdfs | 766 | 82 papers, essays & games from the Summer of Protocols | Stable |
| substack | 1,040 | 116+ Protocolized posts — fiction, theory, editorial | Daily sync |
| videos | 2,940 | 91 YouTube talks and lectures | Stable |
| bibliography | 278 | 250+ externally cited works with abstracts | Stable |
| discord | 3,301 | #idle-musings and #protocol-watch community channels | Periodic sync |
| sig | 4,583 | 78 sessions across 4 SIG groups (SIGFPT, MRG, SIGPfB, ProtFiSIG) | Periodic sync |
| discord_links | 6,722 | Web content linked from Discord/SIG — fetched, chunked, relevance-scored | v2 rescore in progress |
| transcripts | 4 | Published C3PO conversations (grows with use) | On demand |
Scope
C3PO was reframed in May 2026 from a narrow research assistant to a broad-based oracle covering protocols and their full intellectual world: theory, fiction, history, technology, governance, memory, culture, and the science of coordination. The narrower scope produced systematic over-filtering of relevant material; the expanded scope is being validated through a dual scoring comparison.
Lexicon
914 PI corpus terms have been extracted from the PDF corpus and triaged into three categories: 233 PI-coined (invented by PI/SoP researchers), 320 PI-specific usage (standard terms given a distinct PI definition), and 349 standard field terms (useful for detecting overlap with adjacent fields). The lexicon has not yet been embedded or published; that is Phase 3 work.
Relevance scoring
Web links in discord_links carry two relevance scores. v1 (narrow: "is this protocol research?") was used for the initial fetch pass; it marked 485 of 1,412 fetched URLs as irrelevant and deleted them. v2 (broad: "is this part of the PI intellectual world?") rescored all 1,412 URLs under the expanded scope; 281 of the 485 deleted entries were rescued, and scores shifted upward overall, with the score-2 bucket growing from 276 to 668. The v2 scores are in the registry; the Pinecone metadata update and tier weight adjustment are pending.
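The counts in the paragraph above can be turned into rates with a few lines of arithmetic. This is a minimal sketch that uses only the three stated figures; the function and key names are illustrative, and it does not assume anything about how v2 treated URLs that v1 had kept.

```python
# Counts taken from the relevance-scoring paragraph above.
FETCHED = 1412      # URLs fetched from Discord/SIG messages
V1_DROPPED = 485    # marked irrelevant under the narrow v1 rubric
V2_RESCUED = 281    # v1 drops that the broad v2 rubric brought back


def summarize(fetched: int, v1_dropped: int, v2_rescued: int) -> dict:
    """Derive simple rates from the three reported counts."""
    return {
        "v1_kept": fetched - v1_dropped,
        "still_dropped": v1_dropped - v2_rescued,
        "v1_drop_rate": round(v1_dropped / fetched, 3),
        "rescue_rate": round(v2_rescued / v1_dropped, 3),
    }


summary = summarize(FETCHED, V1_DROPPED, V2_RESCUED)
print(summary)
```

Under these figures, v1 removed roughly a third of the fetched URLs, and v2 brought back somewhat more than half of those removals.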
Phases
Live
Phase 2C — Multi-source oracle with community layer
All 8 namespaces active. Discord and SIG archives integrated with badge display, deep links, and tier-weighted retrieval. Web links layer with injection filtering and dual relevance scoring. Expanded oracle scope in system prompt and SOUL document.
- 8 namespaces, 19,634 vectors
- Discord + SIG retrieval with badge display and deep links
- Web links fetch pipeline with prompt injection filter
- Dual relevance scoring (v1 narrow / v2 expanded scope)
- Expanded oracle scope: 11 topic areas, 9 intellectual commitments
- 914-term lexicon triage (a/b/c classification)
- Tier weighting revision in progress
- SOUL.md v2 via corpus sampling in progress
- Pinecone metadata update with v2 relevance scores pending
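The prompt injection filter on fetched web content might look something like the sketch below: a list of regex patterns plus a guard on the density of invisible characters. The two patterns and the 1% threshold are assumptions made for illustration; the roadmap does not list the actual patterns or thresholds, and the function names are hypothetical.

```python
import re
import unicodedata

# Illustrative patterns only; the real filter's patterns are not given here.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"you are now (in )?(developer|unrestricted) mode", re.IGNORECASE),
]

# Hypothetical threshold: flag text if more than 1% of characters are invisible.
MAX_INVISIBLE_DENSITY = 0.01


def invisible_density(text: str) -> float:
    """Fraction of characters in Unicode category Cf (format characters),
    which includes zero-width spaces and joiners."""
    if not text:
        return 0.0
    invisible = sum(1 for ch in text if unicodedata.category(ch) == "Cf")
    return invisible / len(text)


def is_suspicious(text: str) -> bool:
    """True if any pattern matches or the invisible-character density is high."""
    if any(pattern.search(text) for pattern in INJECTION_PATTERNS):
        return True
    return invisible_density(text) > MAX_INVISIBLE_DENSITY
```

A page would be checked before chunking, so a flagged document never reaches the embedding step. The density guard catches text that hides instructions between zero-width characters, which plain pattern matching can miss.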
Next
Phase 3 — Lexicon, knowledge structure, and automation
Publish the PI lexicon, embed it as a queryable namespace, and automate corpus maintenance.
- Magazine lexicon pass — extract fictional protocols, memetic concepts, and design fictions from Protocolized fiction archive (~65 posts, ~$0.70 estimated Haiku cost)
- Lexicon curation — 914 terms → publishable set for protocolized.io resource page + 40–60 term prompt block
- Embed lexicon as definitions namespace in Pinecone — makes lexicon terms retrievable in context
- Protocol observations database — named PI analyses of real-world protocols, from protocol-watching pieces
- YouTube transcript pass — 161 deferred URLs, transcripts via youtube-transcript-api (no API key required)
- launchd automation — daily sync_discord.py, weekly sync_sig.py + fetch_discord_links.py
- GitHub Actions cron for sync_substack.py
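For the launchd item above, one way to keep job definitions in the repo is to generate the plists from Python with the standard-library plistlib module. This is a sketch under assumptions: the labels, script paths, interpreter path, and run times are all hypothetical, and only the daily/weekly split comes from the roadmap. Label, ProgramArguments, and StartCalendarInterval are standard launchd keys.

```python
import plistlib
from typing import Optional


def launchd_job(label: str, script: str, hour: int,
                weekday: Optional[int] = None) -> bytes:
    """Build an XML launchd plist that runs `script` at the given hour.

    With `weekday` unset the job runs daily; with it set, weekly.
    launchd treats both 0 and 7 as Sunday.
    """
    schedule = {"Hour": hour, "Minute": 0}
    if weekday is not None:
        schedule["Weekday"] = weekday
    job = {
        "Label": label,
        # Hypothetical interpreter path; adjust to the local environment.
        "ProgramArguments": ["/usr/bin/python3", script],
        "StartCalendarInterval": schedule,
    }
    return plistlib.dumps(job)


# Hypothetical labels, paths, and times, for illustration only.
daily = launchd_job("io.protocolized.c3po.sync-discord",
                    "/path/to/c3po/sync_discord.py", hour=3)
weekly = launchd_job("io.protocolized.c3po.sync-sig",
                     "/path/to/c3po/sync_sig.py", hour=4, weekday=0)
```

Each returned byte string can be written to a file under ~/Library/LaunchAgents and loaded with launchctl. The second weekly script, fetch_discord_links.py, would get its own job in the same way.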
Planned
Phase 4 — Corpus expansion
Broaden the sources the corpus draws from.
- Attachment capture — download Discord/SIG media files at sync time before CDN URLs expire (24h window)
- Additional Discord channels beyond #idle-musings and #protocol-watch
- Twitter/X archive pass — 194 deferred URLs (requires Twitter API v2 credentials)
- Exhibit extraction for PDFs — section summaries, list exhibits, figure captions, table summaries for the 82 core papers
- Explore: newsletter/blog RSS feeds cited frequently in Discord as persistent sources
Research
Phase 5 — Intelligence and routing
Make retrieval smarter about what's being asked, not just what was written.
- Query-domain classification — detect when a query is about memory, fiction, governance, technology, etc. and boost the authoritative namespace accordingly (e.g. MRG for memory, ProtFiSIG for fiction)
- Dynamic tier weighting — weights that adapt to query domain rather than fixed global multipliers
- Chunk-type awareness — summary vectors weighted differently from body chunks depending on query specificity
- Personalization hooks — MCP clients can specify a focus domain; C3PO adjusts retrieval emphasis
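Query-domain classification and dynamic weighting, as listed above, could start from something as simple as keyword matching. The sketch below is an assumption-heavy illustration: the keyword sets, the 1.25 boost, and the function names are hypothetical, and only the domain-to-group pairing (MRG for memory, ProtFiSIG for fiction) comes from the roadmap. A real classifier would likely use embeddings or a model call instead.

```python
import re
from typing import Dict, Optional

# Hypothetical keyword sets for two of the domains named in the roadmap.
DOMAIN_KEYWORDS = {
    "memory": {"memory", "archive", "remember", "forgetting"},
    "fiction": {"fiction", "story", "narrative", "worldbuilding"},
}

# Pairing taken from the roadmap's own examples.
DOMAIN_SOURCE = {"memory": "MRG", "fiction": "ProtFiSIG"}

BOOST = 1.25  # hypothetical multiplier


def classify(query: str) -> Optional[str]:
    """Return the domain with the most keyword hits, or None if none match."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    best, best_hits = None, 0
    for domain, keywords in DOMAIN_KEYWORDS.items():
        hits = len(tokens & keywords)
        if hits > best_hits:
            best, best_hits = domain, hits
    return best


def source_weights(query: str, base: Dict[str, float]) -> Dict[str, float]:
    """Copy the base weights and boost the source matching the query domain."""
    weights = dict(base)
    domain = classify(query)
    if domain is not None:
        source = DOMAIN_SOURCE[domain]
        weights[source] = weights.get(source, 1.0) * BOOST
    return weights
```

For example, `source_weights("Tell me a story about protocol fiction", {"MRG": 1.0, "ProtFiSIG": 1.0})` leaves MRG unchanged and raises ProtFiSIG to 1.25. Queries that match no domain fall back to the fixed global weights.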
Future
Phase 6 — Handoff and scale
Transition from personal infrastructure to Protocol Institute organizational infrastructure.
- Migrate repo from vgururao/c3po to Protocol-Institute/c3po
- Transfer API key billing and ownership to PI org accounts
- Public API key program — open ask_c3po MCP access beyond current invite-only
- Contributor guide — how community members can propose corpus additions or lexicon edits
Open questions under active consideration
- Tier weighting after scope expansion — PDFs and Substack at 1.0× made sense as the primary corpus. Now that SIG meeting summaries (4,583 vectors) and web links (6,722 vectors) are large and well-scored, the hierarchy needs revision. Currently being designed.
- How to sample the corpus for SOUL.md v2 — the soul document should emerge from what the corpus actually contains, not from what we think it should contain. Need a systematic sampling approach across all 8 namespaces before writing v2.
- Fiction lexicon integration — fictional protocols from Protocolized fiction should be surfaced by C3PO but clearly marked as fictional. The right format (a separate system prompt block? a metadata flag?) is under consideration.
- Scoring for general-interest web content — the v2 rubric rescues much more web content as "adjacent" (score 1). Score-1 content is retained in Pinecone but weighted at 0.55×. Is that the right floor, or does it add noise?
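To make the tier-weighting questions above concrete, here is a minimal sketch of tier-weighted merging. The 1.0× weight for pdfs and substack and the 0.55× weight for score-1 web content come from this roadmap; the 0.8 default for other namespaces, the `Hit` type, and the function names are assumptions, not the deployed scheme.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    namespace: str
    similarity: float                # raw vector similarity
    relevance: Optional[int] = None  # v2 relevance score, if the hit has one


TIER_WEIGHTS = {"pdfs": 1.0, "substack": 1.0}  # stated in the roadmap
SCORE_1_WEIGHT = 0.55                          # stated in the roadmap
DEFAULT_WEIGHT = 0.8                           # hypothetical placeholder


def weight_for(hit: Hit) -> float:
    """Pick the multiplier for a hit based on namespace and relevance score."""
    if hit.namespace == "discord_links" and hit.relevance == 1:
        return SCORE_1_WEIGHT
    return TIER_WEIGHTS.get(hit.namespace, DEFAULT_WEIGHT)


def merge(hits: List[Hit], top_k: int = 5) -> List[Hit]:
    """Rank hits from all namespaces by similarity times tier weight."""
    ranked = sorted(hits, key=lambda h: h.similarity * weight_for(h), reverse=True)
    return ranked[:top_k]


hits = [
    Hit("discord_links", 0.90, relevance=1),
    Hit("pdfs", 0.60),
    Hit("sig", 0.70),
]
ranked = merge(hits)
```

In this example the score-1 web link has the highest raw similarity but ranks last after weighting, below both the paper and the SIG session. That is the trade-off behind the 0.55× question: a lower floor suppresses noise but can also bury a strongly matching adjacent source.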
Changelog
2026-05-20: Expanded oracle scope deployed — 11 topic areas, 9 intellectual commitments, new SCOPE section in system prompt and SOUL_EXCERPT. Roadmap page published.
2026-05-20: 914-term lexicon triage complete: 233 PI-coined / 320 PI-specific / 349 standard field terms. v2 relevance rescore complete: 281 of 485 dropped web links rescued. Score-2 bucket more than doubled (276 → 668).
2026-05-20: discord_links namespace live — 1,412 URLs fetched from Discord/SIG messages, injection-filtered, chunked into ~6,722 vectors (post-enrichment). Prompt injection filter: 11 patterns + invisible-char density guard.
2026-05-20: Phase 2C deployed — Discord and SIG archives integrated into retrieval. Badges, deep links, tier-weighted merging. All 4 SIG groups indexed: SIGFPT, MRG, SIGPfB, ProtFiSIG (78 sessions, 4,583 vectors).
2026-05-19: Security hardening — IP strike/ban system, history-smuggling detection, MCP rate limit (100 calls/IP/day). Roleplay-as-unrestricted and corpus-weaponization filters added.
2026-05-18: Bibliography namespace — 252 externally cited works with abstracts and relevance scores (278 vectors).
2026-05-17: YouTube namespace — 91 talks, 2,940 vectors. Title-anchored embeddings, per-video summary vectors.
2026-05-15: PDF namespace — 82 SoP papers, 766 vectors. doc_summary vectors for summary-level retrieval.
2026-05-14: Substack namespace — 116+ Protocolized posts, 1,040 vectors including post_summary, collection_card, author_profile chunk types.
2026-05-12: Initial deployment — Phase 1, PDFs + Substack only, Cloudflare Workers, Voyage AI embeddings, Pinecone, Claude Sonnet.