Sarvam Vision in a League of Its Own for Indic OCR Across 22 Languages, Outpacing Gemini, GPT

Key Highlights:

  • Sarvam AI launches Sarvam Vision, a 3B-parameter vision-language model.
  • The system focuses on multilingual document intelligence across 22 Indian languages.
  • It supports OCR, chart interpretation, table parsing, and visual reasoning.
  • A new Indic OCR benchmark with 20,000+ samples is also released.

Sarvam AI has introduced Sarvam Vision, a new 3B-parameter vision-language model designed for multilingual document intelligence and visual reasoning. The launch expands the company’s sovereign model series into vision capabilities. The release targets a specific problem: extracting knowledge from India’s vast archive of scanned and physical documents.

The new model can perform image captioning, scene text recognition, chart analysis, and complex table parsing. As a result, enterprises, research institutions, and government agencies can process multilingual documents at scale.

Why is Sarvam AI focusing on document intelligence?

A large portion of India’s knowledge still exists in printed archives, historical manuscripts, and scanned records. However, most global AI models are optimized primarily for English documents. Consequently, their accuracy on regional-language documents often lags.

Sarvam AI is attempting to close this gap by training the model on curated datasets that include scientific papers, financial records, textbooks, newspapers, and government bulletins across Indian languages. The company also used both synthetic and real-world samples to improve accuracy across domains.

How does Sarvam Vision work?

The architecture combines the sovereign vision-language model with two supporting modules: a semantic layout parser and a reading-order network. Together, these components enable the system to interpret document structure rather than extracting text alone.
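To make the idea of reading order concrete, here is a minimal sketch in Python. It is not Sarvam's implementation: the article does not describe how the reading-order network works, so a simple geometric heuristic (group regions into rows, then read left to right) stands in for the learned component. The `Region` class, its fields, and the `row_tolerance` parameter are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A layout region as a hypothetical layout parser might emit it."""
    kind: str   # e.g. "title", "paragraph", "table"
    x: float    # left edge of the region, in pixels
    y: float    # top edge of the region, in pixels
    text: str   # recognized text for the region


def reading_order(regions, row_tolerance=10.0):
    """Order regions top to bottom, then left to right within a row.

    Regions whose top edges lie within `row_tolerance` pixels of the
    first region in the current row are treated as the same row.
    This heuristic is an assumption; a learned network would replace it.
    """
    ordered = sorted(regions, key=lambda r: (r.y, r.x))
    rows = []
    for region in ordered:
        if rows and abs(region.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(region)
        else:
            rows.append([region])
    return [r for row in rows for r in sorted(row, key=lambda r: r.x)]


page = [
    Region("table", 50, 400, "Table 1"),
    Region("paragraph", 300, 98, "Right column"),
    Region("title", 50, 0, "Annual Report"),
    Region("paragraph", 50, 100, "Left column"),
]
ordered_texts = [r.text for r in reading_order(page)]
print(ordered_texts)
```

In this toy page the right-hand column sits two pixels higher than the left-hand one, so a plain sort by vertical position would read it first. Grouping by rows restores the left-to-right order, which is the kind of structural decision the article attributes to the reading-order module.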

This approach shifts document intelligence from simple OCR to knowledge extraction. For example, the model can understand nested tables, extract values from charts, and analyze relationships between visual elements and text.

New benchmark aims to measure Indic OCR performance

To evaluate multilingual performance, the company introduced the Sarvam Indic OCR Bench, featuring more than 20,000 document samples spanning 22 official Indian languages. The benchmark measures word-level accuracy across historical and modern documents, including lower-quality scans.
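The article does not say how the benchmark defines word-level accuracy. One common convention, assumed here, is one minus the word-level edit distance divided by the number of reference words, floored at zero. The Python sketch below follows that assumption; the function names are illustrative and are not taken from the benchmark.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance computed over whole words, not characters."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from output
                curr[j - 1] + 1,     # extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Assumed metric: 1 - (word edit distance / reference word count)."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        return 1.0 if not hyp_words else 0.0
    distance = word_edit_distance(ref_words, hyp_words)
    return max(0.0, 1.0 - distance / len(ref_words))


# Five reference words, one misrecognized word in the OCR output.
reference = "भारत एक विशाल देश है"
ocr_output = "भारत एक विशल देश है"
score = word_accuracy(reference, ocr_output)
print(round(score, 2))
```

Because the comparison is made on whole words, a single wrong character costs the entire word. That makes the measure stricter than character-level accuracy, particularly on degraded scans.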

In addition, the model was evaluated on global benchmarks such as olmOCR-Bench and OmniDocBench so that its results can be compared with those of existing systems.

What this means for multimodal AI adoption

Sarvam Vision signals a broader push toward inference-efficient, regionally optimized AI systems. By targeting multilingual OCR and document reasoning, Sarvam AI is positioning the platform to support research digitization, enterprise workflows, and government data modernization across India.
