Public · Verifiable
Methods & Data Transparency
What is actually in the knowledge graph, which models are live, and what is on the roadmap. Read this before any procurement, partnership, or due-diligence conversation. Counts are pulled live from /api/stats — no marketing numbers anywhere on this page.
Live graph state · auto-refreshed
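Readers who want to check the live counts themselves could query the stats endpoint directly. The sketch below is illustrative only: it assumes `/api/stats` returns a JSON object, the base URL must be supplied by the caller (the host is not stated on this page), and the field names in the usage example (`nodes`, `edges`) are hypothetical, since the response schema is not documented here.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def stats_url(base_url: str) -> str:
    """Build the stats URL from a caller-supplied base URL."""
    return urljoin(base_url.rstrip("/") + "/", "api/stats")


def parse_stats(payload: str) -> dict:
    """Parse a JSON stats payload and keep only the integer counts."""
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    # bool is a subclass of int in Python, so exclude it explicitly.
    return {k: v for k, v in data.items()
            if isinstance(v, int) and not isinstance(v, bool)}


def fetch_stats(base_url: str, timeout: float = 10.0) -> dict:
    """Fetch and parse the live counts (performs a network request)."""
    with urlopen(stats_url(base_url), timeout=timeout) as resp:
        return parse_stats(resp.read().decode("utf-8"))
```

With a hypothetical payload, `parse_stats('{"nodes": 10, "edges": 25, "updated": "today"}')` keeps only the two numeric fields, which can then be compared against the figures in the tables on this page.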
Data sources · what is loaded vs planned
| Source | What it contains | Currently in graph | Full size | Status | License |
|---|---|---|---|---|---|
| PrimeKG (Harvard MIMS) | Multimodal biomedical KG — drugs, genes, diseases, pathways, phenotypes | ~89,000 (genes, diseases, drugs, pathways, etc.) | Same — full PrimeKG ingested | loaded | Open (Apache 2.0) |
| IMPPAT 2.0 curated subset | Indian Medicinal Plants Phytochemistry & Therapeutics — Ayurvedic compounds | 100 high-export compounds with literature-cited target genes, Sanskrit names, family, plant part | Full IMPPAT 2.0 = 17,967 compounds — academic data agreement with ACTREC Mumbai in progress | partial | Academic agreement required for full set |
| IndiGen (CSIR-IGIB) | Indian population PGx variant frequencies | 14 PGx variants currently loaded; expanding to 50+ via PharmGKB-IndiGen overlay | IndiGen 1,029 genomes covers ~3,500 clinically actionable variants | partial | Open (IGIB data release) |
| CPIC + PharmGKB | Pharmacogenomic dosing guidelines (NUDT15, TPMT, MTHFR, CYP3A5, CYP2C19) | Encoded in /pedonco rule engine, not as graph edges | Same — rule engine matches latest CPIC versions | loaded | Open (CPIC) |
| CTRI (Clinical Trials Registry India) | Active and completed Indian clinical trials | 180 trials linked to drugs + diseases (HAS_INDIAN_TRIAL, INVESTIGATES_DISEASE edges) | CTRI has ~50,000 trials; we ingest only those with structured drug + disease linkage | partial | Open |
| MAMMAL 458M DTI (IBM Research) | Drug-target interaction binding predictions (pKd) | 192 PREDICTED_BINDING edges (24 phytochemicals × 8 CYP enzymes), generated locally on an RTX 3060 | Scaling to all 100 curated compounds in the next refresh | partial | Apache 2.0 |
| RCSB Protein Data Bank | Experimental 3D crystal structures of CYP enzymes (for /herbcheck viewer) | Fetched on-demand via /structure endpoint; cached | 8 CYP enzymes (CYP1A2, 2B6, 2C8, 2C9, 2C19, 2D6, 2E1, 3A4) | loaded | Open (RCSB) |
| GenomeIndia (DBT) | 10,000-genome Indian reference dataset | Not yet ingested as graph edges | 10,000 genomes; access via DBT data-sharing | planned | Government access agreement |
| ACTREC retrospective cohort | Pediatric ALL clinical outcomes (for BlastProfiler v1 validation) | Not ingested — partnership initiation phase | Largest Indian pediatric oncology cohort | planned | MoU required (research collaboration) |
| EpiOnco v0 — Curated epifactors | Epigenetic regulators (writers/erasers/readers/remodelers) for Tumour Ability Score | 50 high-priority epifactors with cancer role + drug links + Hanahan-Weinberg hallmark mapping | Full EpifactorDB ≈ 800 — ingestion via epifactordb.com on roadmap | partial | Open (EpifactorDB CC) |
| EpiOnco v0 — Indian cancer epigenomes | Documented Indian-specific epigenetic signatures distinct from TCGA | 3 signatures (Indian OSCC PMID:17139279 · NE India HNC PMID:33033692 · Indian PDAC PMID:36059159) | Indian Cancer Genome Atlas (ICGA) — partnership in progress | partial | Published (open access via PubMed) |
| EpiOnco v0 — Epigenetic drugs | Approved + clinical epigenetic therapeutics linked to epifactor targets | 14 drugs (Tazemetostat, Azacitidine, Decitabine, Vorinostat, Romidepsin, etc.) with INHIBITS_EPIFACTOR edges | Same — covers all FDA-approved epigenetic therapies as of 2024 | loaded | Open |
| TCGA methylation + RNA-seq | 33-cancer-type methylation (450K array) + RNA-seq for global reference | Not ingested — used as conceptual baseline only | 10,000+ samples; ~500GB methylation + 200GB RNA-seq via GDC API | planned | Open (NIH GDC) |
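The PREDICTED_BINDING count in the MAMMAL row is a full compound-by-enzyme grid, which the following sketch reproduces. The compound identifiers are placeholders, because the 24 phytochemicals are not named on this page, and the 800 figure assumes the planned refresh keeps the same eight CYP enzymes, which the table does not state explicitly.

```python
from itertools import product

# The eight CYP enzymes listed in the RCSB Protein Data Bank row.
CYP_ENZYMES = [
    "CYP1A2", "CYP2B6", "CYP2C8", "CYP2C9",
    "CYP2C19", "CYP2D6", "CYP2E1", "CYP3A4",
]


def binding_pairs(compounds, enzymes=CYP_ENZYMES):
    """Return every (compound, enzyme) pair, one per PREDICTED_BINDING edge."""
    return list(product(compounds, enzymes))


# Placeholder IDs only; these are not real IMPPAT identifiers.
current = [f"compound_{i:03d}" for i in range(24)]
planned = [f"compound_{i:03d}" for i in range(100)]

print(len(binding_pairs(current)))  # 24 x 8 = 192
print(len(binding_pairs(planned)))  # 100 x 8 = 800
```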
Foundation models + classifiers · what is live
| Model | Role | Live | Provenance |
|---|---|---|---|
| Llama 3.3 70B (Groq) | Natural language reasoning + grounded synthesis (entities restricted to result rows) | ✓ live | Groq Cloud API |
| MAMMAL 458M DTI | Drug-target binding affinity predictions (pKd, rank-based) | ✓ live | IBM Research, Apache 2.0; 192 predictions pre-computed locally |
| BlastProfiler classifier v0 | Pediatric leukemia subtype from marker panel + driver mutations | ✓ live | Peer-reviewed cell-marker rules (WHO 2016, COG/BFM protocols) |
| CPIC PGx rule engine | Dose recommendation for thiopurines, MTX, vincristine | ✓ live | CPIC 2018-2022 published guidelines, PharmGKB Level 2A evidence |
| scGPT fine-tuned on PedSCAtlas | BlastProfiler v1 — direct scRNA-seq classification | roadmap | Roadmap: fine-tune on Mumme 2025 (540K cells) — pending ACTREC validation cohort |
| ESM3 (protein variant impact) | Indian-specific variant pathogenicity prediction | roadmap | Roadmap: EvolutionaryScale Forge API integration |
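The MAMMAL row describes its outputs as pKd values used in a rank-based way. By definition pKd = -log10(Kd in mol/L), so a higher pKd means tighter predicted binding. The sketch below shows one plausible reading of "rank-based": order the targets for a compound by predicted pKd rather than interpreting the absolute values. The function names are hypothetical, and the numbers in the usage note are invented for illustration and are not model output.

```python
def pkd_to_kd_molar(pkd: float) -> float:
    """Convert pKd to a dissociation constant in mol/L (Kd = 10 ** -pKd)."""
    return 10.0 ** (-pkd)


def rank_by_pkd(predictions: dict) -> list:
    """Return (target, pKd, rank) tuples, strongest predicted binding first."""
    ordered = sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)
    return [(target, pkd, rank)
            for rank, (target, pkd) in enumerate(ordered, start=1)]
```

For example, `rank_by_pkd({"CYP3A4": 6.2, "CYP2D6": 5.1, "CYP1A2": 5.8})` places CYP3A4 first and CYP2D6 last, and a pKd of 6.0 corresponds to a Kd of about 1 micromolar.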
Hallucination guardrails
- LLM synthesis prompts include an explicit grounding contract: the model may not name a drug, gene, pathway, disease, or variant that does not appear in the Cypher result rows.
- If a query returns fewer than 2 rows, the system short-circuits to an honest "no evidence + closest matches" response instead of generating prose.
- Entity resolution maps abbreviations and Sanskrit names to canonical graph nodes before any LLM call, and surfaces "not in graph" suggestions when the user's term has little or no coverage in the graph.
- Every clinical recommendation surfaces its CPIC guideline version, evidence grade, confidence, and triggering variants, so each output is auditable.
- Every protected action is logged to Firestore with a server timestamp under audit/{uid}/events.
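The first two guardrails can be sketched as a post-hoc check plus a row-count gate. This is a minimal illustration and not the production implementation: the function names, the `vocabulary` set of known entity names, and the shape of the returned dictionaries are all assumptions. Only the "fewer than 2 rows" threshold and the rule that entities must appear in the result rows come from the list above.

```python
import re

MIN_ROWS = 2  # the short-circuit described above triggers below 2 rows


def allowed_entities(rows):
    """Collect every string value from the result rows, lower-cased."""
    return {v.lower() for row in rows for v in row.values()
            if isinstance(v, str)}


def ungrounded_mentions(answer, rows, vocabulary):
    """Return vocabulary terms the answer names but the rows do not contain.

    `vocabulary` is a hypothetical set of known entity names used to
    detect mentions in the generated text.
    """
    allowed = allowed_entities(rows)
    found = set()
    for term in vocabulary:
        pattern = rf"\b{re.escape(term)}\b"
        mentioned = re.search(pattern, answer, flags=re.IGNORECASE)
        if mentioned and term.lower() not in allowed:
            found.add(term)
    return found


def answer_or_short_circuit(rows, synthesize, closest_matches=()):
    """Skip synthesis entirely when there is too little evidence."""
    if len(rows) < MIN_ROWS:
        return {"status": "no_evidence",
                "closest_matches": list(closest_matches)}
    return {"status": "ok", "text": synthesize(rows)}
```

Given two illustrative rows that mention only Azacitidine, Decitabine, and DNMT1, a draft answer that also names Vorinostat would be flagged by `ungrounded_mentions`, and a query returning a single row would never reach the `synthesize` callable.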
Known gaps · openly disclosed
- ⚠ IMPPAT: 100 curated · full set (17,967) requires ACTREC academic agreement (in progress)
- ⚠ IndiGen PGx: 14 variants loaded · 50+ expansion via PharmGKB overlay planned
- ⚠ GenomeIndia: 10,000-genome dataset not yet ingested · DBT access agreement required
- ⚠ BlastProfiler v1 (scGPT/PedSCAtlas): classifier currently uses peer-reviewed marker rules · scGPT fine-tune pending ACTREC validation cohort
- ⚠ Outcomes data: no clinical validation paper published yet · seeking research MoU with one academic hospital
- ⚠ SaMD certification: positioning for CDSCO Software-as-Medical-Device pathway · not yet submitted
This page is the source of truth. If anything elsewhere on PetriDish conflicts with what is shown here, this page wins.
Decision support only — does not replace clinician judgment.