01
I started from the near miss
I have a hard time trusting retrieval systems for regulatory work when they are judged only by whether they found something vaguely related. In regulatory text, vaguely related often means operationally wrong. A paragraph can mention the same institution, the same process, even the same policy term, and still fail to answer the user's question because the exact scope is different.
That was the real motivation behind this paper. The shared task gave us a formal benchmark, but the question was already familiar. If a system is going to support regulatory question answering, it cannot stop at topical relevance. It has to surface the passage that survives close reading.
What makes this difficult is that regulatory documents are full of near matches. The wrong passage is often not nonsense. It is often a neighboring clause, a similar provision, an earlier definition, or a general statement that becomes inaccurate once an exception is introduced later. That is exactly the kind of setting where dense retrieval can look better than it really is.
So I did not approach this paper as a search for a clever trick. I approached it as a way to test a simpler suspicion. Semantic similarity is doing useful work, but it should not finish the ranking by itself.
That distinction matters much more in regulation than in many consumer-facing retrieval settings. In a looser domain, topical proximity can still be useful. In regulation, a near miss can become an operational mistake. That raised the standard for what I wanted the retriever to do.
02
The frame I kept using while reading it
The task couples retrieval with answer generation, but the right way to read it is as a retrieval paper first. If the context window is wrong, the answer stage becomes mostly damage control. That means the burden sits heavily on the ranking stage.
I found it useful to think of the task in two layers. Layer one asks whether the system can gather a neighborhood of potentially relevant passages. Layer two asks whether it can distinguish the usable passage from the merely similar one. Dense retrieval is often strong at the first layer. Regulatory QA punishes weakness in the second.
Once I framed it like that, the design space looked narrower and also more practical. We did not need one scoring function that magically understands everything. We needed one stage that explores semantically and another that becomes stricter about lexical evidence.
That is also why this paper fits my broader taste in systems work. I usually prefer a pipeline where I can describe what each part is responsible for. Here the responsibilities are clean.
It also helps with failure analysis. If the final answer is wrong, I want to know whether retrieval failed broadly, whether reranking failed narrowly, or whether generation misused good evidence. A monolithic description of the system makes that harder.
// Working Notes
- Dense retrieval gives a broad candidate pool.
- BM25 reranking restores lexical discipline.
- The final top-ranked passages are the ones answer generation sees.
- The hybrid design is easy to inspect when something goes wrong.
03
The pipeline is simple on purpose
The pipeline itself is not complicated, and that is part of its strength. We fine-tune a dense embedding retriever, use FAISS to retrieve the top 20 passages by cosine similarity, then rerank those candidates with BM25 before selecting the top 10 for answer generation.
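The two-stage shape is easy to sketch. Below is a minimal Python version, with toy in-memory vectors standing in for the fine-tuned encoder and the FAISS index, and a small Okapi-style BM25 scored only over the candidate pool. The function names, the BM25 parameters, and the toy data are my assumptions for illustration, not details from the paper.

```python
import math
from collections import Counter

def dense_top_k(query_vec, doc_vecs, k=20):
    """Stage 1: broad semantic recall by cosine similarity.
    Stands in for the fine-tuned encoder plus FAISS index."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cos(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:k]

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Stage 2: Okapi-style BM25, computed only over the candidate pool."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores

def retrieve(query_vec, query_terms, doc_vecs, doc_tokens, k=20, n=10):
    """Dense recall of k candidates, then a BM25 rerank down to n."""
    pool = dense_top_k(query_vec, doc_vecs, k)
    scores = bm25_scores(query_terms, [doc_tokens[i] for i in pool])
    reranked = sorted(zip(pool, scores), key=lambda p: p[1], reverse=True)
    return [i for i, _ in reranked[:n]]
```

The point of the sketch is the division of labor: the dense stage never decides the final order, and BM25 never sees documents the dense stage did not already judge semantically plausible.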
The part I value here is the separation of roles. Dense retrieval does broad semantic recall. BM25 restores exact lexical agreement. In regulatory text, that second signal matters much more than people sometimes want to admit.
If I write the intuition mathematically, the dense-first version acts like score(q, d) = sim_dense(q, d), while the reranked version behaves more like score(q, d) = lambda * sim_dense(q, d) + (1 - lambda) * sim_lexical(q, d). The paper is not trying to romanticize the formula itself. The point is that lexical evidence gets restored as part of the final decision.
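That interpolation is easy to make concrete. The sketch below is mine, not the paper's scoring function: the lambda of 0.5 and the similarity numbers are illustrative assumptions, chosen only to show how a lexically exact passage can overtake a merely topical one.

```python
def hybrid_score(sim_dense: float, sim_lexical: float, lam: float = 0.5) -> float:
    """Linear mix of semantic and lexical evidence.
    lam = 1.0 recovers dense-only ranking; lam = 0.0 recovers
    lexical-only ranking. The 0.5 here is purely illustrative."""
    return lam * sim_dense + (1.0 - lam) * sim_lexical

# A topically close passage with weak lexical overlap...
near_miss = hybrid_score(sim_dense=0.90, sim_lexical=0.20)     # ~0.55
# ...loses to a passage that reuses the question's exact terms.
exact_clause = hybrid_score(sim_dense=0.75, sim_lexical=0.85)  # ~0.80
```

Under dense-only scoring the first passage wins; once lexical evidence re-enters the mix, the ordering flips.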
That changes the system's attitude. Instead of asking only whether a passage feels related, it starts asking whether the passage uses the language the question is anchored on.
The more I sat with the pipeline, the more I liked that modesty. It does not pretend that dense retrieval has solved the whole ranking problem. It accepts that semantic search and lexical matching are each seeing something incomplete, then makes them argue with each other a little.
04
The table I kept staring at
One thing I kept returning to while reading the results was how strong BM25 already was. It reached Recall@10 = 0.7611 and mAP@10 = 0.6237. That should immediately make us more careful. If a lexical baseline is already this competitive, then the task is telling us something about where the real signal lives.
Several dense retrievers produced a familiar pattern. Recall looked acceptable, sometimes even strong, but mean average precision stayed weak. Stella, for example, reached Recall@10 = 0.7756 and mAP@10 = 0.1036. That is not a small miss. It means the model could find the right region of the document space but was bad at ordering the evidence in the way the task needed.
That pattern matters because it matches the qualitative behavior I worry about in regulatory systems. A dense retriever can look intelligent because it gathers thematically related passages, but once you inspect the order, the passages near the top may still be the wrong basis for an answer.
That is one of the most useful things in the paper. It gives a numerical version of a practical feeling many people in applied retrieval already have.
The BM25 result also has a humbling effect. It forces us to ask whether we are underestimating lexical exactness in domains where wording carries force. Regulation is one of those domains. Exact wording is not a superficial feature there. It is often where the obligation or exception actually lives.
05
Where the result becomes convincing
The final LeSeR system reaches Recall@10 = 0.8201 and mAP@10 = 0.6655. I do not read that result as a generic victory for hybrid retrieval. I read it more specifically. The reranking stage corrected exactly the kind of ranking mistakes that dense retrieval kept making in this domain.
What matters to me is that the final system beats BM25 in recall while also beating it in mAP. That is the result I wanted to see. If the hybrid system had only increased recall while collapsing precision, it would still feel risky for downstream generation. If it had only improved mAP while narrowing retrieval too much, it would feel brittle. The balance is the achievement.
This is also where the paper starts to matter beyond one benchmark. In high-stakes domains, retrieval should be evaluated by whether it puts the operationally usable passage high enough for a human or generator to depend on it, not whether the correct passage appears somewhere in the long tail of candidates.
That feels like the hidden standard behind a lot of serious retrieval work. The system is not good because it can eventually recover the answer. It is good because it puts the usable evidence where a human or a downstream generator is most likely to rely on it.
06
Why generation cannot rescue bad retrieval
One reason I keep insisting that this is a retrieval paper first is that answer generation is often asked to rescue mistakes it should never have inherited. If the retriever brings in passages that are semantically nearby but operationally wrong, the generator is being asked to choose among bad premises.
In practice, that creates a dangerous illusion. The answer can still sound coherent because the retrieved material remains on-topic. The wording may even sound informed. But the confidence comes from proximity, not from the right clause.
That is why I keep thinking about retrieval quality as the first real bottleneck in regulatory QA. A generator cannot be more grounded than the evidence it was given. At best it can behave cautiously. At worst it can transform a near miss into a polished mistake.
// Closing Thought
In regulatory retrieval, a near miss is already too close for comfort. If the right evidence is not near the top, a reranking stage that restores lexical discipline can make the difference between a risky guess and a dependable answer.


