The largest knowledge system ever assembled by humanity has never been
computationally formalised. Not meaningfully. Thirty million manuscripts.
Four thousand years of continuous intellectual output. Texts that contain
complete formal grammars, decision systems, cosmological models, and
acoustic performance traditions. All of it sits outside the training
distribution of every frontier model in existence.
This is not a content gap. It is a structural one.
Current AI architectures are not built to reason over sutras, propagate
commentarial lineage, or reconstruct oral knowledge from fragmentary
transmission. We are building the systems that do.
30M+
Sanskrit manuscripts, most never digitised
196
Endangered Indian languages with active Vedic oral traditions
<50
Living Parashara-tradition Jyotishis fluent in all three major systems
The Problem
Current AI fails on Vedic knowledge in predictable, structural ways.
It is not that models haven't seen Vedic texts. Parts of Sanskrit corpora
appear in Common Crawl. The Gita is in every training set. The failure is
more fundamental: general-purpose language models lack the architectural
commitments needed to reason reliably over symbolic systems that operate
through strict rule hierarchies, lineage-dependent interpretation, and
oral transmission with no written form.
Sutra systems are not retrievable by embedding similarity.
Panini's Ashtadhyayi encodes grammar in 3,959 sutras that interact
through a strict precedence hierarchy: metarules, context-sensitive overrides,
and zero-context aphorisms that only resolve when the reader holds the
entire system simultaneously. RAG over sutras produces fluent confabulation.
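To make the failure mode concrete, here is a minimal sketch of precedence-based resolution. The rule names and contexts are placeholders, not real Paninian content; the point is that the right answer depends on the whole candidate set, which similarity retrieval never sees.

```python
from dataclasses import dataclass
from typing import Callable

# Precedence classes: an apavada (exception) overrides an utsarga
# (general rule) whenever both apply to the same context.
PRECEDENCE = {"utsarga": 0, "apavada": 1}

@dataclass
class Sutra:
    name: str
    kind: str                        # "utsarga" or "apavada"
    applies: Callable[[dict], bool]  # does this rule fire in a context?
    effect: str

def resolve(context: dict, rules: list[Sutra]) -> Sutra:
    """Select the applicable rule of highest precedence class.

    Unlike similarity retrieval, which scores rules independently,
    correct resolution depends on the whole candidate set: a general
    rule is right only when no exception to it also applies.
    """
    candidates = [r for r in rules if r.applies(context)]
    if not candidates:
        raise LookupError("no sutra applies")
    return max(candidates, key=lambda r: PRECEDENCE[r.kind])

# Placeholder rules, not real Paninian content:
RULES = [
    Sutra("general-guna", "utsarga",
          lambda c: c["op"] == "sandhi", "apply guna"),
    Sutra("block-guna", "apavada",
          lambda c: c["op"] == "sandhi" and c["marker"] == "k", "block guna"),
]
```

A retriever scoring each rule in isolation would happily return the general rule in a context where the exception fires; the resolver cannot make that mistake.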
Commentarial knowledge is not additive. It is interpretive.
Shankara, Ramanuja, and Madhva read identical Brahmasutra verses and reach
opposite conclusions. The commentary is not a supplement to the root text.
It is a competing world model. Current LLM architectures flatten this into
averaging, which is wrong in every case.
Oral performative knowledge has no written proxy.
The Dhrupad and Darbari Kanada of the Agra gharana exist only in performance
lineage. A raga is not a scale. It is a formal constraint system specifying
allowed ascent/descent phrases (arohana/avarohana), characteristic gamakas,
time-of-day applicability (prahar), and emotional flavour (rasa). None of
this is recoverable from text. It requires acoustic modelling of a vanishing
corpus before the last performers die.
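What "a raga is a constraint system" means can be sketched in a few lines. The svara sets below are illustrative, not any real raga, and a real model would also have to cover gamakas and phrase-level idiom; this only shows the shape of the representation.

```python
from dataclasses import dataclass

@dataclass
class Raga:
    name: str
    arohana: list[str]     # permitted ascent, low to high
    avarohana: list[str]   # permitted descent, high to low
    varjit: set[str]       # forbidden (omitted) svaras
    prahar: int            # time-of-day slot
    rasa: str              # associated emotional flavour

def admissible(raga: Raga, phrase: list[str]) -> bool:
    """Minimal check: no forbidden svara, and every melodic step must
    follow either the arohana or the avarohana ordering."""
    up = {s: i for i, s in enumerate(raga.arohana)}
    down = {s: i for i, s in enumerate(raga.avarohana)}
    for s in phrase:
        if s in raga.varjit or (s not in up and s not in down):
            return False
    for a, b in zip(phrase, phrase[1:]):
        if a == b:
            continue  # a repeated svara is always allowed here
        ascending = a in up and b in up and up[b] > up[a]
        descending = a in down and b in down and down[b] > down[a]
        if not (ascending or descending):
            return False
    return True

# Toy pentatonic example; the svara sets are illustrative only.
toy = Raga("toy", ["S", "R", "G", "P", "D"], ["D", "P", "G", "R", "S"],
           varjit={"M", "N"}, prahar=1, rasa="shanta")
```

A scale is a set; this is a grammar over phrases. That distinction is exactly what text corpora fail to capture.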
Endangered languages are disappearing without a machine-readable record.
Tulu, Gondi, Nihali, Kodava Takk, and dozens of other languages carry
Vedic oral traditions: ritual poetry, folk astronomy, medicinal plant knowledge.
None of it exists in digitised form. Standard language model training requires
tens of millions of tokens per language. These languages have hundreds of
hours of recorded speech, if that.
Domain evaluation does not exist.
There is no benchmark for Jyotish accuracy, no eval suite for Gita
interpretation quality, no automated way to tell whether a Vedic AI
system is reasoning or hallucinating. We cannot improve what we cannot
measure. We are building the evals simultaneously with the models.
Corpus construction is unsolved.
Digitising Sanskrit manuscripts requires Devanagari OCR that handles
scribal variation, regional scripts (Grantha, Sharada, Nandinagari),
manuscript deterioration, and the absence of modern punctuation. Existing
tools produce outputs that require expert correction at every line.
Architecture
Symbolic extraction plus neural synthesis, not fine-tuning or naive retrieval.
Wrong approach
Fine-tune a large model on Vedic text. Hope it learns the structure. Ship when it sounds confident enough.
Also wrong
Chunk texts into 512-token segments, embed them, retrieve by cosine similarity, and call it a Vedic AI.
Also wrong
Build a single general Vedic model. The domains are structurally different. Jyotish is a decision system. The Gita is hermeneutics. They require different architectures.
Our approach
Extract symbolic rules from sutras. Build domain-specific reasoning layers. Use neural synthesis only where formal structure ends. Keep commentary lineages separate and navigable.
The core architectural bet: Vedic knowledge systems are
formal enough to be represented symbolically, but experiential enough
to require neural completion at the edges. A Jyotishi does not
retrieve. They reason from first principles, applying precedence rules
to the specific configuration in front of them. The AI should do the same.
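The dispatch logic of that bet can be sketched as follows. The function names and the neural interface are hypothetical stand-ins, not our implementation; the sketch shows only the symbolic-first control flow.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Verdict:
    answer: str
    trace: list[str]   # audit trail of rule applications
    source: str        # "symbolic" or "neural"

def answer_query(query: dict,
                 rule_engine: Callable[[dict], Optional[tuple]],
                 neural_fallback: Callable[[dict], str]) -> Verdict:
    """Symbolic-first dispatch: answer from the rule engine wherever
    the formal system covers the query; call the neural layer only
    where formal structure runs out."""
    result = rule_engine(query)
    if result is not None:
        answer, trace = result
        return Verdict(answer, trace, "symbolic")
    return Verdict(neural_fallback(query), ["formal coverage ended"], "neural")

# Toy stand-ins for both layers:
def toy_rules(q):
    if q.get("domain") == "grammar":
        return ("derivation valid", ["applied general rule", "no exception fired"])
    return None

symbolic = answer_query({"domain": "grammar"}, toy_rules, lambda q: "neural guess")
neural = answer_query({"domain": "rasa"}, toy_rules, lambda q: "neural guess")
```

Every symbolic answer arrives with a trace; every neural answer arrives labelled as such. That separation is the whole point.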
Case Study: Raga
Raga is not music. It is a constraint satisfaction system transmitted orally for three thousand years.
The Samaveda specified melodic contours for Vedic recitation before
Greece had a musical notation system. What developed from that seed over
three millennia is one of the most sophisticated formal constraint systems
in any artistic tradition: 72 parent scales (melakarta), hundreds of derived
ragas, each with mandatory ascent/descent phrases, permitted ornaments
(gamaka), forbidden note combinations, time-of-day constraints (prahar),
and associated emotional states (rasa) grounded in the Natyashastra's
taxonomy of human experience.
The problem is not that this system is underdocumented. It is that the
primary documentation is in performance. The Agra, Kirana, and Gwalior
gharanas each hold subtly different interpretations of the same raga.
These variations were passed guru-to-shishya in thousands of hours of oral
instruction that never became text. When the last exponents of a rare raga
perform their last concert, that version of the raga is gone.
We are building acoustic models that can extract the formal grammar of a
raga from recordings, disambiguate gharana-specific interpretations, and
represent the raga as a queryable constraint system. This is one of the
hardest problems in ethnomusicology. We believe it is now computationally
tractable.

"यद्भावं तद्भवति" — As the feeling, so becomes the reality.
The tradition understood that knowledge is inseparable from the mind
that holds it. We are trying to encode not just the words, but the
cognitive structure underneath them.
— Bhagavad Gita 17.3
Open Research Problems
What we have not solved yet.
We are publishing these because they are genuinely hard, because we want
collaborators who have thought about them, and because the field benefits
from naming problems precisely.
Sutra-to-rule extraction.
Formalising Panini's Ashtadhyayi as a computable rule system. Previous
computational work (Akshar, SLP1) covers subsets. We need the full
precedence hierarchy — utsarga (general), apavada (exception), and
paribhasha (metarule) — encoded in a form that supports automated
derivation verification.
Commentary-aware embeddings.
Standard sentence embeddings treat Shankara's Advaita commentary and
Madhva's Dvaita commentary on the same Brahmasutra verse as semantically
similar, because the surface text is. They are philosophically opposite.
We need embedding spaces that preserve doctrinal lineage as a first-class
dimension.
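One naive baseline — an assumption for illustration, not our method — is to append a scaled lineage one-hot to the text embedding and renormalise. Even this crude move separates doctrinally opposed commentaries whose surface text embeds identically; the open problem is learning that separation rather than hand-coding it.

```python
from math import sqrt

LINEAGES = ["advaita", "vishishtadvaita", "dvaita"]

def lineage_aware(text_vec, lineage, weight=1.0):
    """Concatenate a scaled lineage one-hot onto the text embedding,
    then renormalise, so doctrinally opposed commentaries separate
    even when their surface text embeds almost identically."""
    onehot = [0.0] * len(LINEAGES)
    onehot[LINEAGES.index(lineage)] = weight
    v = list(text_vec) + onehot
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

# Same surface embedding, opposite doctrinal lineages:
surface = [0.6, 0.8]                      # unit-length toy text vector
shankara = lineage_aware(surface, "advaita")
madhva = lineage_aware(surface, "dvaita")
# cosine(shankara, madhva) drops to 0.5 from a surface cosine of 1.0
```

The `weight` parameter controls how much doctrine dominates surface semantics; calibrating it per task is itself non-trivial.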
Low-resource acoustic modelling for oral traditions.
Training phoneme-level ASR on endangered language oral traditions
with under 100 hours of clean audio, no text transcripts, and multiple
speakers and dialects. Standard pretraining approaches (self-supervised
wav2vec 2.0; weakly supervised Whisper) require orders of magnitude more
data per language.
Domain-specific evaluation.
Building the first publicly available benchmarks for Jyotish reasoning
accuracy, Vedic Sanskrit translation quality, and Ayurvedic diagnostic
consistency. Without evals, progress is invisible and hallucination
is indistinguishable from competence.
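A sketch of the harness shape we have in mind. The item schema, prompt, and scoring rule are illustrative assumptions; real grading would need expert-calibrated rubrics. (The one domain fact used below is standard: Mangala, i.e. Mars, rules Mesha.)

```python
from dataclasses import dataclass

@dataclass
class EvalItem:
    prompt: str
    gold: str              # expert-annotated reference answer
    must_cite: list[str]   # names a valid answer has to invoke

def score(item: EvalItem, answer: str) -> float:
    """Toy scorer: credit for each mandated citation present in the
    answer. Real grading needs expert-calibrated rubrics; this only
    shows the harness shape."""
    if not item.must_cite:
        return 1.0 if item.gold in answer else 0.0
    hits = sum(1 for c in item.must_cite if c in answer)
    return hits / len(item.must_cite)

# Illustrative item:
item = EvalItem(
    prompt="Which graha rules the lagna for Mesha lagna?",
    gold="Mangala",
    must_cite=["Mangala"],
)
```

Even a scorer this crude makes regressions visible, which is more than the field has today.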
Multi-script manuscript digitisation.
Devanagari OCR that handles Grantha, Sharada, Nandinagari, and
Siddham scripts — with scribal variation, ligature ambiguity, and
physical manuscript degradation — at accuracy levels that do not
require expert correction on every line.
Jyotish chart reasoning.
Formalising Parashari, KP, and Jaimini interpretation as a symbolic
reasoning system with navigable precedence, planetary dignity tables,
and dasha sub-period calculation that can be audited at each inference
step. Not black-box prediction — explainable reasoning from first
principles.
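What "audited at each inference step" means, in miniature: every conclusion carries the rule that produced it. The dignity tables below are partial but standard placements (Surya exalted in Mesha, Chandra in Vrishabha, Mangala in Makara); everything else is a sketch, not our engine.

```python
from dataclasses import dataclass, field

# Standard exaltation and ownership placements (partial tables).
EXALTATION = {"Surya": "Mesha", "Chandra": "Vrishabha", "Mangala": "Makara"}
OWN_SIGN = {"Surya": ["Simha"], "Chandra": ["Karka"],
            "Mangala": ["Mesha", "Vrishchika"]}

@dataclass
class Inference:
    conclusion: str
    audit: list[str] = field(default_factory=list)

def dignity(graha: str, rashi: str) -> Inference:
    """Every conclusion carries the rule that produced it, so each
    step can be audited rather than trusted as a black box."""
    inf = Inference("neutral")
    if EXALTATION.get(graha) == rashi:
        inf.conclusion = "exalted"
        inf.audit.append(f"{graha} in {rashi}: exaltation table matched")
    elif rashi in OWN_SIGN.get(graha, []):
        inf.conclusion = "own sign"
        inf.audit.append(f"{graha} in {rashi}: ownership table matched")
    else:
        inf.audit.append(f"{graha} in {rashi}: no dignity rule fired")
    return inf
```

A full engine layers dasha periods, aspects, and school-specific precedence on top, but the invariant stays the same: no conclusion without its audit line.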
Join the Research.
We are looking for people who find these problems compelling — not because
of the application layer, but because the problems themselves are interesting.
You do not need a Vedic background. You need to be technically rigorous
and willing to work at the intersection of traditions that have never been
formalised computationally.
Write to us. Tell us what you're working on and why this matters to you.
Pranava — the seed from which all knowledge unfolds.
The tradition has survived four thousand years of invasion, colonisation,
and modernity. It does not need rescue. What is new is the possibility
of preserving its formal structure at machine scale — not as a museum
exhibit, but as a living reasoning system that continues to grow.