// Research

Areas of Investigation

Studying language models through evaluation and behavior analysis, with a focus on multilingual and multimodal settings where capability claims are most likely to be overstated.

// Research Statement

I am broadly interested in understanding language models. My work asks what capabilities they genuinely have, which limitations standard benchmarks hide, and how behavior changes across language, culture, modality, and deployment conditions. Evaluation is a central method in that effort, especially when benchmark success risks overstating real-world performance.

// Primary Areas

Core Research Threads

Global view representing multilingual evaluation coverage
INCLUDE / Global PIQA / Kaleidoscope
evaluation as model analysis

[01] Evaluation

Using evaluation to study capabilities, limits, and transfer.

A large share of my work asks what evaluation is measuring. Through INCLUDE, Global PIQA, Kaleidoscope, and related efforts, I study where models lose regional knowledge, physical commonsense, and visual understanding once we move beyond English-default settings. Much of that work is about where reported strength does not carry over cleanly to the settings people actually use.
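
As a concrete illustration of the kind of analysis this thread depends on, here is a minimal sketch that compares per-language accuracy against an English reference to surface transfer gaps. The results and language codes are hypothetical placeholders, not numbers from INCLUDE or Global PIQA.

```python
# Minimal sketch: surface cross-lingual transfer gaps by comparing
# per-language accuracy against an English reference. The data below is
# hypothetical; per-language benchmark splits can be analyzed the same way.

from collections import defaultdict

# (language, correct?) pairs, e.g. collected from a benchmark run
results = [
    ("en", True), ("en", True), ("en", False),
    ("ne", True), ("ne", False), ("ne", False),
    ("hi", True), ("hi", True), ("hi", False),
]

totals, correct = defaultdict(int), defaultdict(int)
for lang, ok in results:
    totals[lang] += 1
    correct[lang] += ok

accuracy = {lang: correct[lang] / totals[lang] for lang in totals}
reference = accuracy["en"]

# A large positive gap flags reported strength that does not transfer.
for lang, acc in sorted(accuracy.items()):
    print(f"{lang}: acc={acc:.2f}, gap vs en={reference - acc:+.2f}")
```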

evaluation methodology / real-world transfer / vision evaluation

Focus

Evaluation methodology, multilingual transfer, regional knowledge, physical reasoning, and massively multilingual vision evaluation.

Evaluation / 45+ Languages / Real-World Performance

Evaluation program

INCLUDE

Regional knowledge evaluation for multilingual language understanding, aimed at separating genuine capability from English-only proxy performance.

Behavior under transfer

Global PIQA

Physical commonsense reasoning across broad language and cultural coverage, designed to expose hidden transfer failures.

Multimodal evaluation

Kaleidoscope

In-language exams for massively multilingual vision evaluation, so that models are tested in the languages and settings where people actually use them.

// Additional Threads

[02] Behavior

Model behavior across language, culture, and modality.

I study how model behavior changes once we vary linguistic, cultural, or multimodal context. This includes work on cultural representation disparities in vision-language models, multilingual AI-generated text detection, and broader efforts to understand whether current systems actually reflect the settings they claim to serve.
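
A minimal sketch of the coverage-gap measurement behind this kind of study: count which countries a set of model outputs mention or depict, then compare each share against a baseline. The mention list here is invented; the real studies rely on annotated VLM generations and more careful baselines.

```python
# Minimal sketch: quantify country-level representation gaps in a sample
# of model outputs. Counts and country names are hypothetical.

from collections import Counter

# Countries mentioned or depicted across a sample of model outputs
mentions = ["US", "US", "US", "UK", "France", "US", "Nepal", "UK"]

counts = Counter(mentions)
total = sum(counts.values())
expected = 1 / len(counts)  # uniform baseline over observed countries

# A large positive gap marks over-representation relative to the baseline.
for country, n in counts.most_common():
    share = n / total
    print(f"{country}: share={share:.2f}, gap vs uniform={share - expected:+.2f}")
```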

Study lens

  • Country-level coverage and representation gaps in VLM outputs.
  • Cultural framing failures hidden by aggregate benchmark scores.
  • Behavior shifts that only become visible when evaluation leaves the default English setting.

Language Model Behavior / Cultural Bias / Multimodal AI

[03] Reasoning

Reasoning strategies, agents, and evaluation of the evaluators.

Alongside language evaluation, I study how LLM-based agents reason across multi-step tasks and how our evaluation procedures shape the conclusions we draw. That work spans chain-of-thought style prompting, structured reasoning methods, LLM-as-judge pipelines, and how agent systems generalize when environments become noisy, mixed-motive, or strategically complex.
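
Because LLM-as-judge pipelines are themselves an object of study here, a minimal sketch helps fix ideas. `call_judge` is a hypothetical stand-in for whatever model API is used; the structure is the point (rubric in, parsed score out, unparseable output treated as signal), not any specific implementation.

```python
# Minimal sketch of an LLM-as-judge pipeline. The judge is treated as an
# object of study rather than ground truth.

RUBRIC = (
    "Score the answer from 1 (wrong) to 5 (fully correct and well reasoned).\n"
    "Question: {question}\nAnswer: {answer}\nScore:"
)

def call_judge(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real model call.
    return "4"

def judge(question: str, answer: str) -> int:
    raw = call_judge(RUBRIC.format(question=question, answer=answer))
    try:
        return int(raw.strip()[0])
    except (ValueError, IndexError):
        return 0  # unparseable judge output is itself a finding

print(judge("What is 2 + 2?", "4, because 2 + 2 = 4."))
```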

What gets measured

  • Multi-step reasoning quality across diverse task categories.
  • Error recovery and robustness when plans break mid-trajectory.
  • How evaluation setups and judges influence the picture we get of model competence.

Agents / Reasoning / Evaluation Methods

[04] Systems

Language systems built with actual use in mind.

A complementary thread connects research questions to systems people can actually use. That includes multilingual AI-generated text detection, multilingual language modeling through Mantra-14B, and regulatory NLP work such as LeSeR, where retrieval and lexical reranking help make policy and compliance text more searchable, comparable, and actionable.
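
To make the retrieve-then-rerank idea concrete, here is a minimal lexical-reranking sketch using BM25 via the rank_bm25 package. It illustrates the general pattern rather than the LeSeR implementation, and the candidate passages and query are invented.

```python
# Minimal retrieve-then-rerank sketch: candidate passages from a
# first-stage retriever are re-scored with BM25 so exact regulatory
# terms weigh heavily. Assumes rank_bm25 (pip install rank-bm25).

from rank_bm25 import BM25Okapi

candidates = [
    "Data controllers must report breaches within 72 hours.",
    "Processors shall assist controllers with compliance obligations.",
    "Breach notification to the supervisory authority is mandatory.",
]

tokenized = [doc.lower().split() for doc in candidates]
bm25 = BM25Okapi(tokenized)

query = "breach notification deadline".lower().split()
scores = bm25.get_scores(query)

# Rerank candidates by lexical score, highest first.
for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.2f}  {doc}")
```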

Current applications

AI governance analysis, regulatory question answering, multilingual detection, and practical language systems built for South Asian contexts.

Regulatory NLP / AIGC Detection / Deployment
Legal books and documents representing regulatory and deployment-ready language systems

LeSeR / Mantra-14B / multilingual detection

// Collaborations

Research Partners

Sep 2023 - Present

Cohere Labs

Community Researcher -> ML Agents Lead

Long-term collaboration spanning INCLUDE, Kaleidoscope, cultural representation studies, and ongoing work on reasoning behavior and evaluation methodology.

Sep 2024 - Present

ZeroGrad.ai

Researcher

Grant-funded multilingual and multimodal NLP research covering multilingual language modeling, AI-generated text detection, regulatory QA, and behavior shifts across languages.

Apr 2021 - Apr 2025

Pulchowk Campus, IoE

Research Assistant and Undergraduate Researcher

Research across hate speech detection, Nepali summarization, cultural heritage detection, and low-resource NLP systems.