
Why I Wanted to Test VLMs Country by Country

This project began with a simple refusal: broad claims about cultural awareness were not enough. I wanted an evaluation where the weak spots had nowhere to hide, which is why we pushed the task down to country level and then tested it across prompt formats, multilingual settings, and adversarial conditions.

VLMs · Cultural Representation · Evaluation Design
[Image: Nepali stupa and prayer flags under open sky, with no people in frame]

01

The bias question needed a harder target

Broad claims about cultural awareness were not enough for me. They let the model hide inside averages. A system can look globally competent while still being thin, brittle, or stereotyped on a long list of actual places.

That is why the paper pushes the unit of analysis down to the country. Once the task becomes image-based country identification, the claim gets sharper. The model either recognizes where the image is from or it does not, and the unevenness across places becomes much harder to blur away.

This mattered because cultural failures rarely announce themselves at the aggregate level. They appear as patterned drop-offs, familiar-country comfort, and weaker recognition once the image stops matching the model's most repeated visual priors.

02

The setup was never a neutral wrapper

The paper also avoids the easy version of the benchmark. We used Country211, but not with just one flattering prompt style. The evaluation moves across open-ended questions, multiple-choice questions, multilingual prompts, and adversarial settings.

That choice matters because prompt format is not decoration. If a model looks stable in one format and much less reliable in another, that difference belongs to the result. The same goes for multilingual prompting and adversarial framing. Those changes expose what the model was leaning on all along.

A lot of my research motivation sits right there. Models matter, but the evaluation choices that make models look stronger than they are matter just as much.

// Working Notes

  • Country-level image identification forces the task to stay specific.
  • Open-ended and multiple-choice prompts let us compare behavior across formats.
  • Multilingual and adversarial settings test how performance changes under less comfortable framing.
  • The setup makes it harder to confuse broad visual fluency with truly even cultural coverage.
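The prompt grid in the notes above can be sketched in code. This is a minimal, hypothetical illustration: the template wording, the helper names, and the tiny answer sets are all my own stand-ins, not the prompts used in the paper.

```python
# Illustrative sketch of a prompt-format grid for one image.
# All templates and names here are hypothetical placeholders.

def open_ended_prompt():
    # Open-ended: no answer choices, the model must produce the country itself.
    return "Which country was this photo taken in? Answer with the country name."

def multiple_choice_prompt(choices):
    # Multiple-choice: the same question, but constrained to a fixed option list.
    letters = "ABCD"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return "Which country was this photo taken in?\n" + options

def multilingual_prompt(language="Nepali"):
    # Multilingual: in practice the question itself is translated;
    # a placeholder marker stands in for the translation here.
    return f"[{language} translation of the open-ended question]"

def adversarial_prompt(distractor):
    # Adversarial framing: a misleading premise the model must resist.
    return (f"This photo was probably taken in {distractor}. "
            "Which country was it actually taken in?")

prompts = {
    "open_ended": open_ended_prompt(),
    "multiple_choice": multiple_choice_prompt(["Nepal", "India", "Bhutan", "Tibet"]),
    "multilingual": multilingual_prompt(),
    "adversarial": adversarial_prompt("India"),
}
```

The point of holding the image fixed while the wrapper varies is that any spread in accuracy across these four entries is attributable to framing, not to the image.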

03

The result was unevenness, not collapse

The models did not collapse across the board, and that would have been a less interesting outcome anyway. The stronger result is that performance shifts across countries and across formats. The capability is there, but it is not distributed evenly.
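The difference between an aggregate score and a per-country breakdown is easy to make concrete. The sketch below uses a toy set of (true, predicted) pairs I invented for illustration; it is not data from the paper, only a demonstration of how an average can sit on top of very uneven country-level performance.

```python
from collections import defaultdict

def per_country_accuracy(records):
    """Compute accuracy per true country from (true, predicted) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for true, pred in records:
        total[true] += 1
        correct[true] += (pred == true)
    return {country: correct[country] / total[country] for country in total}

# Toy predictions (invented for illustration, not from the paper):
# the overall number looks middling while the countries diverge sharply.
records = [
    ("France", "France"), ("France", "France"),
    ("France", "France"), ("France", "Italy"),
    ("Nepal", "India"), ("Nepal", "Nepal"),
    ("Nepal", "India"), ("Nepal", "India"),
]

by_country = per_country_accuracy(records)
overall = sum(pred == true for true, pred in records) / len(records)
# overall comes out to 0.5, but France sits at 0.75 while Nepal sits at 0.25
```

A single 0.5 would report these two situations as identical; the per-country view is what lets the patterned drop-off show.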

That kind of patterned disparity says much more than random failure would. It points toward something structural, whether the source is the training distribution, the prompt format, the evaluation setup, or all of them at once.

This is where the paper became valuable to me. Cultural representation stopped being a slogan and became a behavior that could be inspected directly.

04

Training data is part of the story, not the whole story

The easy explanation would have been to pin everything on skewed pretraining data and stop there. Training distribution clearly matters, but it does not close the case.

Question format matters too. Prompt language matters. Adversarial setup matters. Once those decisions move the outcome in visible ways, evaluation stops being a passive judge standing outside the system. It becomes part of what capability means on the page.

That is very close to the center of my research taste. A score is only interesting if it survives changes in framing that make the task look more like the world.

05

Where I want to push this line of work next

I want more multimodal evaluation to work at this level of specificity instead of hiding behind global averages.

The interesting part starts exactly where the model stops looking evenly capable.

// Closing Thought

This is the kind of evaluation I want more of: specific enough that uneven cultural coverage has nowhere to hide.