Public · Verifiable

Methods & Data Transparency

What is actually in the knowledge graph, which models are live, and what is on the roadmap. Read this before any procurement, partnership, or due-diligence conversation. Counts are pulled live from /api/stats — no marketing numbers anywhere on this page.

| Source | What it contains | Currently in graph | Full size | Status | License |
| --- | --- | --- | --- | --- | --- |
| PrimeKG (Harvard MIMS) | Multimodal biomedical KG: drugs, genes, diseases, pathways, phenotypes | ~89,000 (genes, diseases, drugs, pathways, etc.) | Same; full PrimeKG ingested | loaded | Open (Apache 2.0) |
| IMPPAT 2.0 curated subset | Indian Medicinal Plants Phytochemistry & Therapeutics: Ayurvedic compounds | 100 high-export compounds with literature-cited target genes, Sanskrit names, family, plant part | Full IMPPAT 2.0 = 17,967 compounds; academic data agreement with ACTREC Mumbai in progress | partial | Academic agreement required for full set |
| IndiGen (CSIR-IGIB) | Indian population PGx variant frequencies | 14 PGx variants currently loaded; expanding to 50+ via PharmGKB-IndiGen overlay | IndiGen (1,029 genomes) covers ~3,500 clinically actionable variants | partial | Open (IGIB data release) |
| CPIC + PharmGKB | Pharmacogenomic dosing guidelines (NUDT15, TPMT, MTHFR, CYP3A5, CYP2C19) | Encoded in /pedonco rule engine, not as graph edges | Same; rule engine matches latest CPIC versions | loaded | Open (CPIC) |
| CTRI (Clinical Trials Registry India) | Active and completed Indian clinical trials | 180 trials linked to drugs + diseases (HAS_INDIAN_TRIAL, INVESTIGATES_DISEASE edges) | CTRI has ~50,000 trials; we ingest only those with structured drug + disease linkage | partial | Open |
| MAMMAL 458M DTI (IBM Research) | Drug-target interaction binding predictions (pKd) | 192 PREDICTED_BINDING edges (24 phytochemicals × 8 CYP enzymes), generated locally on RTX 3060 | Same; scaling to all 100 curated compounds in next refresh | partial | Apache 2.0 |
| RCSB Protein Data Bank | Experimental 3D crystal structures of CYP enzymes (for /herbcheck viewer) | Fetched on-demand via /structure endpoint; cached | 8 CYP enzymes (CYP1A2, 2B6, 2C8, 2C9, 2C19, 2D6, 2E1, 3A4) | loaded | Open (RCSB) |
| GenomeIndia (DBT) | 10,000-genome Indian reference dataset | Not yet ingested as graph edges | 10,000 genomes; access via DBT data-sharing | planned | Government access agreement |
| ACTREC retrospective cohort | Pediatric ALL clinical outcomes (for BlastProfiler v1 validation) | Not ingested; partnership initiation phase | Largest Indian pediatric oncology cohort | planned | MoU required (research collaboration) |
| EpiOnco v0: Curated epifactors | Epigenetic regulators (writers/erasers/readers/remodelers) for Tumour Ability Score | 50 high-priority epifactors with cancer role + drug links + Hanahan-Weinberg hallmark mapping | Full EpifactorDB ≈ 800; ingestion via epifactordb.com on roadmap | partial | Open (EpifactorDB CC) |
| EpiOnco v0: Indian cancer epigenomes | Documented Indian-specific epigenetic signatures distinct from TCGA | 3 signatures (Indian OSCC PMID:17139279 · NE India HNC PMID:33033692 · Indian PDAC PMID:36059159) | Indian Cancer Genome Atlas (ICGA); partnership in progress | partial | Published (open access via PubMed) |
| EpiOnco v0: Epigenetic drugs | Approved + clinical epigenetic therapeutics linked to epifactor targets | 14 drugs (Tazemetostat, Azacitidine, Decitabine, Vorinostat, Romidepsin, etc.) with INHIBITS_EPIFACTOR edges | Same; covers all FDA-approved epigenetic therapies as of 2024 | loaded | Open |
| TCGA methylation + RNA-seq | 33-cancer-type methylation (450K array) + RNA-seq for global reference | Not ingested; used as conceptual baseline only | 10,000+ samples; ~500GB methylation + 200GB RNA-seq via GDC API | planned | Open (NIH GDC) |

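
The edge count in the MAMMAL row is consistent with a full cross product of compounds and enzymes. A minimal sketch of that arithmetic, assuming one PREDICTED_BINDING edge per (compound, enzyme) pair and assuming the eight enzymes are the ones listed in the RCSB row; the compound identifiers are placeholders, not names from the graph:

```python
from itertools import product

# Assumption: the 8 CYP enzymes are the ones named in the RCSB row.
CYP_ENZYMES = ["CYP1A2", "CYP2B6", "CYP2C8", "CYP2C9",
               "CYP2C19", "CYP2D6", "CYP2E1", "CYP3A4"]

def prediction_pairs(compounds, enzymes=CYP_ENZYMES):
    """Return every (compound, enzyme) pair; assumes one edge per pair."""
    return list(product(compounds, enzymes))

# Placeholder identifiers (hypothetical), standing in for the real compounds.
current = [f"compound_{i:03d}" for i in range(24)]
curated = [f"compound_{i:03d}" for i in range(100)]

print(len(prediction_pairs(current)))  # 24 x 8 = 192
print(len(prediction_pairs(curated)))  # 100 x 8 = 800
```

Under the same assumptions, scaling to all 100 curated compounds would bring the edge count to 800.
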

| Model | Role | Live | Provenance |
| --- | --- | --- | --- |
| Llama 3.3 70B (Groq) | Natural language reasoning + grounded synthesis (entities restricted to result rows) | ✓ live | Groq Cloud API |
| MAMMAL 458M DTI | Drug-target binding affinity predictions (pKd, rank-based) | ✓ live | IBM Research, Apache 2.0; 192 predictions pre-computed locally |
| BlastProfiler classifier v0 | Pediatric leukemia subtype from marker panel + driver mutations | ✓ live | Peer-reviewed cell-marker rules (WHO 2016, COG/BFM protocols) |
| CPIC PGx rule engine | Dose recommendation for thiopurines, MTX, vincristine | ✓ live | CPIC 2018-2022 published guidelines, PharmGKB Level 2A evidence |
| scGPT fine-tuned on PedSCAtlas | BlastProfiler v1: direct scRNA-seq classification | roadmap | Roadmap: fine-tune on Mumme 2025 (540K cells), pending ACTREC validation cohort |
| ESM3 (protein variant impact) | Indian-specific variant pathogenicity prediction | roadmap | Roadmap: EvolutionaryScale Forge API integration |

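
For readers unfamiliar with the units: pKd is the negative base-10 logarithm of the dissociation constant in mol/L, so a higher pKd means tighter predicted binding. The sketch below shows that conversion and one way to read predictions as "rank-based", that is, by ordering rather than by absolute value. The compound names and scores are invented for illustration and are not MAMMAL outputs; the function names are hypothetical.

```python
def pkd_to_kd_nanomolar(pkd: float) -> float:
    """Convert pKd (-log10 of Kd in mol/L) to Kd in nanomolar."""
    return 10 ** (9 - pkd)

def rank_by_pkd(predictions: dict[str, float]) -> list[tuple[int, str, float]]:
    """Order compounds from strongest to weakest predicted binding."""
    ordered = sorted(predictions.items(), key=lambda item: item[1], reverse=True)
    return [(rank, name, pkd) for rank, (name, pkd) in enumerate(ordered, start=1)]

# Illustrative values only (hypothetical compounds, made-up scores).
example = {"compound_a": 6.0, "compound_b": 7.5, "compound_c": 5.2}
for rank, name, pkd in rank_by_pkd(example):
    print(rank, name, pkd, f"{pkd_to_kd_nanomolar(pkd):.1f} nM")
```

Reading by rank sidesteps the question of how well the absolute pKd values are calibrated, at the cost of saying nothing about how far apart two neighbouring compounds are.
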
  • LLM synthesis prompts include an explicit grounding contract: the model may not name a drug, gene, pathway, disease, or variant that does not appear in the Cypher result rows.
  • If a query returns fewer than 2 rows, the system short-circuits to an honest "no evidence + closest matches" response instead of generating prose.
  • Entity resolution maps abbreviations and Sanskrit names to canonical graph nodes before any LLM call, and surfaces "not in graph" suggestions when the user term is sparse.
  • Every clinical recommendation surfaces its CPIC guideline version, evidence grade, confidence, and triggering variants, so each output is auditable.
  • Every protected action is logged to Firestore with a server timestamp under audit/{uid}/events.

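
The first two guardrails can be pictured as a pre-check and a post-check around the LLM call. The following is an illustration under stated assumptions, not the production code: the row format, the function names (`allowed_entities`, `ungrounded_mentions`, `answer`), and the idea that candidate entity mentions are extracted from the draft upstream are all hypothetical. Only the two-row threshold and the rule that entities must appear in the result rows come from the list above.

```python
MIN_ROWS = 2  # from the text: fewer than 2 rows short-circuits

def allowed_entities(rows):
    """Collect every string value in the query result rows, lowercased."""
    return {v.lower() for row in rows for v in row.values() if isinstance(v, str)}

def ungrounded_mentions(mentions, rows):
    """Return the mentions that do not appear anywhere in the result rows."""
    allowed = allowed_entities(rows)
    return [m for m in mentions if m.lower() not in allowed]

def answer(rows, draft_text, draft_mentions, closest_matches):
    """Apply the short-circuit first, then the grounding check (hypothetical flow)."""
    if len(rows) < MIN_ROWS:
        return {"status": "no_evidence", "closest_matches": closest_matches}
    violations = ungrounded_mentions(draft_mentions, rows)
    if violations:
        return {"status": "rejected", "ungrounded": violations}
    return {"status": "ok", "text": draft_text}

# Illustrative rows only; the real result schema is not documented here.
rows = [
    {"drug": "Mercaptopurine", "gene": "TPMT"},
    {"drug": "Mercaptopurine", "gene": "NUDT15"},
]
print(answer(rows, "draft", ["TPMT", "NUDT15"], []))
print(answer(rows, "draft", ["TPMT", "CYP2D6"], []))
print(answer(rows[:1], "draft", ["TPMT"], ["Thioguanine"]))
```

Checking the draft after generation, in addition to stating the contract in the prompt, means a violation is caught even when the model ignores the instruction.
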
  • IMPPAT: 100 curated · full set (17,967) requires ACTREC academic agreement (in progress)
  • IndiGen PGx: 14 variants loaded · 50+ expansion via PharmGKB overlay planned
  • GenomeIndia: 10,000-genome dataset not yet ingested · DBT access agreement required
  • BlastProfiler v1 (scGPT/PedSCAtlas): classifier currently uses peer-reviewed marker rules · scGPT fine-tune pending ACTREC validation cohort
  • Outcomes data: no clinical validation paper published yet · seeking research MoU with one academic hospital
  • SaMD certification: positioning for CDSCO Software-as-Medical-Device pathway · not yet submitted

This page is the source of truth. If anything elsewhere on PetriDish conflicts with what is shown here, this page wins.
Decision support only — does not replace clinician judgment.