// Projects

Built & Maintained

Research infrastructure, evaluation frameworks, multilingual language models, and applied language systems, from papers to production deployments.

// Featured

Flagship Project

01 Active

ML Agents Reasoning Benchmark

A benchmarking effort for LLM agent reasoning across 50 task categories, spanning coding, instruction following, mathematical reasoning, and tool use. The work centers on evaluation design, inference strategy comparison, and LLM-as-judge pipelines.

Agents · Benchmarking · Python · Evaluation Design
> View on GitHub
> HuggingFace

Context

Cohere Labs research coordination and benchmark design.

Scope

50 task categories spanning reasoning, coding, instruction following, tool use, and recovery.

Team

Research collaborators contributing taxonomy, datasets, and analysis workflows.

// All Projects

Research & Engineering

// [02]

Mantra-14B

A Hindi-English multilingual model developed by fine-tuning Phi and Qwen on a self-curated and empirically balanced instruction dataset. The goal was to push multilingual capability in a way that remains useful for real user-facing tasks rather than only benchmark reporting.

Multilingual LLM · Instruction Tuning · Hindi-English
> View model

AgentPro + DSBC

At Traversaal.ai, I architected AgentPro, a ReAct-based framework for complex data-science workflows, and contributed to DSBC, a benchmark that evaluates agent performance across eight task categories, with explicit attention to context engineering and architectural sensitivity.

03 Active
Agents · Data Science

South Asian LLMs

An ongoing multilingual language modeling effort focused on South Asian languages, with work spanning instruction tuning, capability evaluation, and community-grounded model development for languages that are still poorly served by mainstream LLMs.

04 Ongoing
South Asian LLMs · Multilingual NLP

Hate Speech Detection in Devanagari Languages

An ongoing project on hate speech detection, hate-target detection, and cross-lingual generalization in closely related Devanagari-script languages (Nepali and Hindi), centered on dataset curation with socio-cultural annotations and multilingual baseline analysis.

05 Ongoing
Hate Speech · Low-Resource NLP

Global PIQA

An extension of the PIQA physical reasoning benchmark to 100+ languages and cultures. The project exposes how commonsense reasoning shifts once evaluation leaves English, helping separate genuine reasoning from English familiarity and benchmark overfitting.

06 Preprint
100+ Languages · Commonsense