C3PO Development Roadmap

Protocol Institute Oracle — continuously updated, publicly reviewed

Last updated: 2026-05-20 — github.com/vgururao/c3po

This roadmap is a live document. Development on C3PO is continuous and openly tracked. The corpus, weights, scoring, and soul document are all in active revision. If you have questions or suggestions, reach out via protocolized.io.

Current state

Corpus — 19,634 vectors across 8 namespaces

Namespace	Vectors	Contents	Status
`pdfs`	766	82 papers, essays & games from the Summer of Protocols	Stable
`substack`	1,040	116+ Protocolized posts — fiction, theory, editorial	Daily sync
`videos`	2,940	91 YouTube talks and lectures	Stable
`bibliography`	278	250+ externally cited works with abstracts	Stable
`discord`	3,301	#idle-musings and #protocol-watch community channels	Periodic sync
`sig`	4,583	78 sessions across 4 SIG groups (SIGFPT, MRG, SIGPfB, ProtFiSIG)	Periodic sync
`discord_links`	6,722	Web content linked from Discord/SIG — fetched, chunked, relevance-scored	v2 rescore in progress
`transcripts`	4	Published C3PO conversations (grows with use)	On demand

Scope

C3PO was reframed in May 2026 from a narrow research assistant to a broad-based oracle covering protocols and their full intellectual world: theory, fiction, history, technology, governance, memory, culture, and the science of coordination. The narrower scope produced systematic over-filtering of relevant material; the expanded scope is being validated through a dual scoring comparison.

Lexicon

914 PI corpus terms have been extracted from the PDF corpus and triaged into three categories: 233 PI-coined (invented by PI/SoP researchers), 320 PI-specific usage (standard terms given a distinct PI definition), and 349 standard field terms (useful for detecting overlaps with adjacent fields). The lexicon has not yet been embedded or published; that is Phase 3 work.

Relevance scoring

Web links in discord_links carry two relevance scores. v1 (narrow: "is this protocol research?") was used for the initial fetch pass; it deleted 485 of 1,412 fetched URLs as irrelevant. v2 (broad: "is this part of the PI intellectual world?") rescored all 1,412 URLs under the expanded scope; 281 of the 485 deleted entries were rescued, and scores shifted upward overall, with the score-2 bucket growing from 276 to 668. The v2 scores are in the registry; the Pinecone metadata update and tier weight adjustment are pending.
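As a quick check on the figures quoted above, the proportions can be worked out directly. This is only arithmetic on the numbers in the paragraph; the variable names are illustrative, and it assumes v2 did not drop any URL that v1 had kept, which the roadmap does not state either way.

```python
# Figures quoted in the relevance scoring paragraph.
fetched = 1412      # URLs fetched from Discord/SIG messages
v1_deleted = 485    # judged irrelevant under the narrow v1 rubric
v2_rescued = 281    # of those, judged relevant under the broad v2 rubric

v1_drop_rate = v1_deleted / fetched       # share of fetched URLs v1 removed
rescue_rate = v2_rescued / v1_deleted     # share of v1 deletions v2 brought back
still_dropped = v1_deleted - v2_rescued   # v1 deletions that v2 did not rescue

print(f"v1 drop rate: {v1_drop_rate:.1%}")
print(f"v2 rescue rate: {rescue_rate:.1%}")
print(f"v1 deletions not rescued: {still_dropped}")
```

In other words, v1 removed roughly a third of what was fetched, and v2 restored a little over half of those removals.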

Phases

Live Phase 2C — Multi-source oracle with community layer

All 8 namespaces are active. Discord and SIG archives are integrated with badge display, deep links, and tier-weighted retrieval. The web links layer adds injection filtering and dual relevance scoring, and the expanded oracle scope is reflected in the system prompt and SOUL document.

8 namespaces, 19,634 vectors
Discord + SIG retrieval with badge display and deep links
Web links fetch pipeline with prompt injection filter
Dual relevance scoring (v1 narrow / v2 expanded scope)
Expanded oracle scope: 11 topic areas, 9 intellectual commitments
914-term lexicon triage (a/b/c classification)
Tier weighting revision in progress
SOUL.md v2 via corpus sampling in progress
Pinecone metadata update with v2 relevance scores pending
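
The list above includes a prompt injection filter on the web links pipeline, which the changelog describes as 11 patterns plus an invisible-character density guard. A minimal sketch of that shape might look like the following. The two patterns, the 1% threshold, and all function names are hypothetical placeholders; the actual patterns and threshold are not given in this roadmap.

```python
import re
import unicodedata

# Hypothetical patterns for illustration only; the real list of 11 is not published here.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"disregard (the )?above", re.IGNORECASE),
]

# Assumed threshold: flag text when more than 1% of its characters are invisible.
MAX_INVISIBLE_DENSITY = 0.01


def invisible_density(text: str) -> float:
    """Fraction of characters in Unicode category Cf (format characters,
    which include zero-width spaces and joiners)."""
    if not text:
        return 0.0
    invisible = sum(1 for ch in text if unicodedata.category(ch) == "Cf")
    return invisible / len(text)


def is_suspicious(text: str) -> bool:
    """True if the text matches a pattern or is dense with invisible characters."""
    if any(pattern.search(text) for pattern in INJECTION_PATTERNS):
        return True
    return invisible_density(text) > MAX_INVISIBLE_DENSITY
```

A density guard of this kind targets payloads hidden with zero-width characters, which plain pattern matching would miss because the hidden characters break up the matched phrases.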

Next Phase 3 — Lexicon, knowledge structure, and automation

Publish the PI lexicon, embed it as a queryable namespace, and automate corpus maintenance.

Magazine lexicon pass — extract fictional protocols, memetic concepts, and design fictions from Protocolized fiction archive (~65 posts, ~$0.70 Haiku)
Lexicon curation — 914 terms → publishable set for protocolized.io resource page + 40–60 term prompt block
Embed lexicon as definitions namespace in Pinecone — makes lexicon terms retrievable in context
Protocol observations database — named PI analyses of real-world protocols, from protocol-watching pieces
YouTube transcript pass — 161 deferred URLs, transcripts via youtube-transcript-api (no API key required)
launchd automation — daily sync_discord.py, weekly sync_sig.py + fetch_discord_links.py
GitHub Actions cron for sync_substack.py

Planned Phase 4 — Corpus expansion

Broaden the sources the corpus draws from.

Attachment capture — download Discord/SIG media files at sync time before CDN URLs expire (24h window)
Additional Discord channels beyond #idle-musings and #protocol-watch
Twitter/X archive pass — 194 deferred URLs (requires Twitter API v2 credentials)
Exhibit extraction for PDFs — section summaries, list exhibits, figure captions, table summaries for the 82 core papers
Explore: newsletter/blog RSS feeds cited frequently in Discord as persistent sources

Research Phase 5 — Intelligence and routing

Make retrieval smarter about what's being asked, not just what was written.

Query-domain classification — detect when a query is about memory, fiction, governance, technology, etc. and boost the authoritative namespace accordingly (e.g. MRG for memory, ProtFiSIG for fiction)
Dynamic tier weighting — weights that adapt to query domain rather than fixed global multipliers
Chunk-type awareness — summary vectors weighted differently from body chunks depending on query specificity
Personalization hooks — MCP clients can specify a focus domain; C3PO adjusts retrieval emphasis
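
To make the query-domain idea above concrete, here is a rough sketch of classification followed by a source boost. Everything in it is an assumption made for illustration: the keyword lists, the 1.25 boost, and the `Hit`, `classify`, and `rerank` names are hypothetical, and a real classifier would more likely be a model than a keyword match. Only the memory-to-MRG and fiction-to-ProtFiSIG pairings come from the list above.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword lists; placeholders for whatever classifier is eventually used.
DOMAIN_KEYWORDS = {
    "memory": {"memory", "archive", "remember", "forgetting"},
    "fiction": {"fiction", "story", "narrative", "worldbuilding"},
}
# Domain -> source to boost, following the examples in the roadmap.
DOMAIN_SOURCE = {"memory": "MRG", "fiction": "ProtFiSIG"}
BOOST = 1.25  # assumed multiplier, not a value from the roadmap


@dataclass
class Hit:
    source: str   # e.g. "MRG", "ProtFiSIG", "pdfs"
    score: float  # similarity score returned by the vector search


def classify(query: str) -> Optional[str]:
    """Return the domain with the most keyword overlap, or None if nothing matches."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    best, best_overlap = None, 0
    for domain, keywords in DOMAIN_KEYWORDS.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best, best_overlap = domain, overlap
    return best


def rerank(query: str, hits: list[Hit]) -> list[Hit]:
    """Sort hits by score, boosting the source that matches the query's domain."""
    boosted_source = DOMAIN_SOURCE.get(classify(query))

    def adjusted(hit: Hit) -> float:
        return hit.score * (BOOST if hit.source == boosted_source else 1.0)

    return sorted(hits, key=adjusted, reverse=True)
```

With this shape, a query that mentions memory would lift an MRG hit above a slightly higher-scoring hit from another source, while a query with no detected domain leaves the original ordering unchanged.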

Future Phase 6 — Handoff and scale

Transition from personal infrastructure to Protocol Institute organizational infrastructure.

Migrate repo from vgururao/c3po to Protocol-Institute/c3po
Transfer API key billing and ownership to PI org accounts
Public API key program — open ask_c3po MCP access beyond current invite-only
Contributor guide — how community members can propose corpus additions or lexicon edits

Open questions under active consideration

Tier weighting after scope expansion — PDFs and Substack at 1.0× made sense as the primary corpus. Now that SIG meeting summaries (4,583 vectors) and web links (6,722 vectors) are large and well-scored, the hierarchy needs revision. Currently being designed.
How to sample the corpus for SOUL.md v2 — the soul document should emerge from what the corpus actually contains, not from what we think it should contain. Need a systematic sampling approach across all 8 namespaces before writing v2.
Fiction lexicon integration — fictional protocols from Protocolized fiction should be surfaced by C3PO but clearly marked as fictional. The right format (a separate system prompt block? a metadata flag?) is under consideration.
Scoring for general-interest web content — the v2 rubric rescues much more web content as "adjacent" (score 1). Score-1 content is retained in Pinecone but weighted at 0.55×. Is that the right floor, or does it add noise?
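
One way to reason about the 0.55× question above is to ask how similar a score-1 link must be before it outranks a primary-corpus hit. The sketch below assumes the final ranking score is simply raw similarity multiplied by the weight, which this roadmap does not confirm; the function names are illustrative.

```python
PRIMARY_WEIGHT = 1.0   # pdfs and substack, per the text
SCORE1_WEIGHT = 0.55   # score-1 ("adjacent") web content, per the text


def weighted(similarity: float, weight: float) -> float:
    """Assumed ranking score: raw similarity scaled by the tier weight."""
    return similarity * weight


def break_even_similarity(primary_similarity: float,
                          weight: float = SCORE1_WEIGHT) -> float:
    """Raw similarity a down-weighted hit needs to tie a primary-corpus hit."""
    return primary_similarity * PRIMARY_WEIGHT / weight


# A score-1 link at similarity 0.90 still ranks below a PDF chunk at 0.50.
print(weighted(0.90, SCORE1_WEIGHT) < weighted(0.50, PRIMARY_WEIGHT))
```

Under this assumption, and if similarities are capped at 1.0, a primary-corpus hit at 0.55 or above can never be outranked by score-1 content, so the added noise would be limited to queries where the primary corpus has only weak matches.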

Changelog

2026-05-20 Expanded oracle scope deployed: 11 topic areas, 9 intellectual commitments, new SCOPE section in system prompt and SOUL_EXCERPT. Roadmap page published.

2026-05-20 914-term lexicon triage complete: 233 PI-coined / 320 PI-specific / 349 standard field terms. v2 relevance rescore complete: 281 of 485 dropped web links rescued. Score-2 bucket more than doubled (276 → 668).

2026-05-20 discord_links namespace live: 1,412 URLs fetched from Discord/SIG messages, injection-filtered, chunked into ~6,722 vectors (post-enrichment). Prompt injection filter: 11 patterns + invisible-char density guard.

2026-05-20 Phase 2C deployed: Discord and SIG archives integrated into retrieval. Badges, deep links, tier-weighted merging. All 4 SIG groups indexed: SIGFPT, MRG, SIGPfB, ProtFiSIG (78 sessions, 4,583 vectors).

2026-05-19 Security hardening: IP strike/ban system, history-smuggling detection, MCP rate limit (100 calls/IP/day). Roleplay-as-unrestricted and corpus-weaponization filters added.

2026-05-18 Bibliography namespace: 252 externally cited works with abstracts and relevance scores (278 vectors).

2026-05-17 YouTube namespace: 91 talks, 2,940 vectors. Title-anchored embeddings, per-video summary vectors.

2026-05-15 PDF namespace: 82 SoP papers, 766 vectors. doc_summary vectors for summary-level retrieval.

2026-05-14 Substack namespace: 116+ Protocolized posts, 1,040 vectors including post_summary, collection_card, author_profile chunk types.

2026-05-12 Initial deployment: Phase 1, PDFs + Substack only, Cloudflare Workers, Voyage AI embeddings, Pinecone, Claude Sonnet.