
[01] Evaluation
Using evaluation to study capabilities, limits, and transfer.
A large share of my work asks what a given evaluation is really measuring. Through INCLUDE, Global PIQA, Kaleidoscope, and related efforts, I study where models lose regional knowledge, physical commonsense, and visual understanding once we move beyond English-default settings. Much of that work identifies where reported strength does not carry over cleanly to the settings people actually use.
Focus
Evaluation methodology, multilingual transfer, regional knowledge, physical reasoning, and massively multilingual vision evaluation.
