Efficiently teaching computers to know what they don't know

1 Jan 2024 | nlp, phd, question answering | by Vagrant Gautam

A non-technical overview of Gautam et al. 2023, our latest paper about making question answering systems more robust: A Lightweight Method for Unanswerable Question Generation in English.

I've hit a PhD milestone - my first first-author paper! This is a non-technical post explaining what it's about.

TL;DR: We use simple swaps to turn answerable questions into unanswerable ones, and we use this extra data to create better question answering systems. Our method for creating unanswerable questions is better than previous computer-based methods in almost every way, despite being simpler, so new methods should be compared to ours. Simpler is better; don't invent sledgehammers to crack nuts.

Answering questions with computers

Have you ever wondered what the capital of Saarland is?

If you wanted to answer that question, I imagine you would go to the Wikipedia page for Saarland and then perhaps skim the first few paragraphs until you found the answer in the following sentence:

Saarbrücken is the state capital and largest city.

A common way to build question answering systems using machine learning is to follow the same 2-step pipeline:

1. Picking the most relevant document given the question (the Wikipedia page for Saarland, for instance)
2. Then extracting the answer by "reading" the document
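
For readers who like to see things in code, here is a toy sketch of that 2-step pipeline. It is not how real systems work internally (they use trained models for both steps); it just counts shared words. The function names, the stopword list and the two tiny documents are all made up for illustration.

```python
import re

# Hypothetical stopword list, just enough for this toy example.
STOPWORDS = {"what", "is", "the", "of", "a", "are"}

def content_words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return set(re.findall(r"\w+", text.lower())) - STOPWORDS

def retrieve(question, documents):
    """Step 1: pick the title whose title + text shares the most words with the question."""
    q = content_words(question)
    return max(documents, key=lambda t: len(q & content_words(t + " " + documents[t])))

def read(question, document):
    """Step 2: return the sentence that shares the most words with the question."""
    q = content_words(question)
    sentences = re.split(r"(?<=[.!?])\s+", document)
    return max(sentences, key=lambda s: len(q & content_words(s)))

# Toy document store (assumed for illustration).
documents = {
    "Saarland": "It is the smallest and least populous German state apart from "
                "the city-states. Saarbrücken is the state capital and largest city.",
    "Bermuda": "The only indigenous mammals of Bermuda are five species of bats.",
}

question = "What is the capital of Saarland?"
title = retrieve(question, documents)
answer = read(question, documents[title])

# The reader always returns *something*, even when handed the wrong document.
wrong_answer = read(question, documents["Bermuda"])
```

Note that `read` has no way to say "I don't know": given the Bermuda document, it still hands back a sentence about bats.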

But what if your system picked the wrong first document? The Wikipedia page for Capital (Marxism), for instance, rather than Saarland.

Or what if your system only had access to documents from an ancient copy of Wikipedia, one with the 2002 entry for Saarland, which didn't contain any information about the capital?

Or what if you wanted to know the capital of Lewis Carroll's Wonderland, a fictional country that doesn't have a (known) capital at all?

All of these cases are failure modes for the typical question answering system: we would expect it to tell us that it can't answer the question, but most systems guess at an answer anyway. Sometimes this may be okay, as in a trivia game with no negative points, but in other cases, such as when people take health advice from such a system, it could have a serious impact on people's lives.

Teaching computers to know what they don't know

These systems always attempt to answer questions because they are built and trained on the assumption that every question can be answered. One way to fix this is to train them with negative samples as well, i.e., questions that cannot be answered with a given document.
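
As a rough sketch of what "negative samples" look like as data, each training example can pair a question with a document and an answer, where a missing answer marks the question as unanswerable for that document. The `Example` class and its field names below are hypothetical, not the format of any particular dataset.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Example:
    """One (question, document) pair; the field names are illustrative assumptions."""
    question: str
    document: str
    answer: Optional[str]  # None marks a negative sample (unanswerable with this document)

    @property
    def answerable(self) -> bool:
        return self.answer is not None

document = "Saarbrücken is the state capital and largest city."

training_data = [
    # Positive sample: the document contains the answer.
    Example("What is the capital of Saarland?", document, "Saarbrücken"),
    # Negative sample: this document cannot answer the question.
    Example("What is the capital of Wonderland?", document, None),
]

# Labels a system could learn from: answer, or abstain.
labels = [example.answerable for example in training_data]
```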

To train and evaluate question answering systems, researchers have curated lots of high-quality datasets of answerable questions, but very few datasets exist that contain both answerable and unanswerable questions. There are a few ways to augment only-answerable datasets with unanswerable questions, each with its own advantages and disadvantages:

Human annotators - high-quality, but slow and expensive
Machine learning models trained to generate unanswerable questions - fast, but low-quality and need expensive computers (GPUs)
Simple word swaps - medium-quality, fast and cheap!

Examples of methods to create unanswerable questions

Let's see how these methods compare when they're all given the same document about Bermuda and an answerable seed question.

Document:
The only indigenous mammals of Bermuda are five species of bats, all of which are also found in the eastern United States: ...

Answerable seed question:
What are the only native mammals found in Bermuda?

Other than the human annotators, all methods below are computer-based methods to create unanswerable questions. UNANSQ and CRQDA are two machine learning models specifically designed to generate unanswerable questions, and antonym and entity swaps are the lightweight swaps that we recommend using instead. When a method has more "training parameters," that means it requires more expensive computers to run.

Human annotators^[1]: What is one of five indigenous mammals of Bermuda?

UNANSQ^[2]: what is the only native mammals found in bermuda ?
(relevant and partially grammatical, but answerable and has 1 million training parameters)

CRQDA^[3]: What are the only native mammals are found in?What?
(relevant and unanswerable, but ungrammatical and has 593 million training parameters)

Antonym-swapped: What are the only foreign mammals found in Bermuda?
(relevant, grammatical, unanswerable and has 0 training parameters!)

Entity-swapped: What are the only native mammals found in United States?
(relevant, partially grammatical, unanswerable and has 0 training parameters!)

Creating unanswerable questions with simple swaps

As the examples in the previous section show, antonym-swapped questions are created by swapping a word for its antonym, and entity-swapped questions are created by swapping an entity (i.e., a proper noun for a person, place or organization) with another entity of the same type.

Document:
Saarland is a state of Germany in the southwest of the country. It is the smallest and least populous German state apart from the city-states. Saarbrücken is the state capital and largest city...

Answerable seed question:
What is the capital of Saarland?

Entity-swapped unanswerable question:
What is the capital of Germany?

Answerable seed question:
What is the least populous state of Germany?

Antonym-swapped unanswerable question:
What is the most populous state of Germany?

Reminder: "Unanswerability" here only means that a question is unanswerable given a certain document. The swapped questions are unanswerable only because they are paired with the above document about Saarland.

It's easy to see why swaps create questions that make sense. It also helps that finding antonyms and identifying entities are well-studied tasks in the field of natural language processing. Machines are quite good at both in English, which improves the quality of the swapped data.
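
To make the two swaps concrete, here is a minimal sketch. It is not the code from our paper: the antonym dictionary and entity list are hand-written stand-ins for real antonym lookup and entity recognition tools, and picking the replacement entity from the document is an assumption made for this example.

```python
import re

# Hypothetical hand-written resources, standing in for real NLP tools.
ANTONYMS = {"least": "most", "native": "foreign"}
ENTITIES = {"Saarland": "PLACE", "Germany": "PLACE", "Bermuda": "PLACE"}

def antonym_swap(question):
    """Replace the first word with a known antonym; return None if there is none.

    The replacement is lowercase, which is fine for words in mid-sentence.
    """
    words = question.split()
    for i, word in enumerate(words):
        if word.lower() in ANTONYMS:
            words[i] = ANTONYMS[word.lower()]
            return " ".join(words)
    return None

def entity_swap(question, document):
    """Replace a known entity in the question with a different entity of the
    same type that appears in the document; return None if no swap is possible."""
    for entity, entity_type in ENTITIES.items():
        pattern = rf"\b{re.escape(entity)}\b"
        if not re.search(pattern, question):
            continue
        for candidate, candidate_type in ENTITIES.items():
            if candidate != entity and candidate_type == entity_type and candidate in document:
                return re.sub(pattern, candidate, question)
    return None

document = ("Saarland is a state of Germany in the southwest of the country. "
            "Saarbrücken is the state capital and largest city.")

swapped_entity = entity_swap("What is the capital of Saarland?", document)
swapped_antonym = antonym_swap("What is the least populous state of Germany?")
```

Both functions return `None` when they find nothing to swap, so a seed question without a known antonym or entity simply produces no unanswerable question.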

How good are simple swaps?

Data quality

Since our method and the earlier computer-based methods all create unanswerable questions, we want to evaluate the quality of these questions and compare them to ones written by humans. Our method's unanswerable questions turned out to be high-quality along 3 axes that are important for question answering:

Less data noise: our questions are unanswerable more often than those from the other computer-based methods, and almost as often as human-written ones
Higher relatedness: our questions are more relevant to the document
Higher readability: our questions are more human-readable and grammatical than those from other computer-based methods, but not as good as human-written ones

These results are especially impressive when you consider that I can run our method on my 2017 MacBook Air, whereas the other methods require much more powerful GPUs.

The catch(es)

My post so far paints a rosy picture of swaps, but they definitely aren't perfect. One of the main problems is that swaps create heuristically unanswerable questions, i.e., sometimes we accidentally get questions that can actually be answered with the document, but we put them in the unanswerable box anyway.

Document:
On April 4, 2008, Beyoncé and Jay-Z married without publicity...

Answerable seed question:
When did Beyoncé get married?

Entity-swapped unanswerable question:
When did Jay-Z get married? (answerable because the marriage involves both Beyoncé and Jay-Z)

Human annotators make this mistake too, but just a little less often than our method does. Human-written questions are also more creative and diverse than those from a simple method like this. This is a noticeable qualitative difference which we mention in our paper but don't look at in detail, and we don't really know how question creativity impacts question answering systems anyway.

We're also fairly limited in the domains and languages we can work with, because swapping names and antonyms may not work quite so well when dealing with sources of information that are not Wikipedia. Swapping is also very reliant on the quality of tools you have for recognizing entities and finding the right antonym, and also on syntactic features of the English language. If we didn't have high-quality tools for this in English or if we were operating in German, for instance, a lot of things just wouldn't work.

The bigger picture

Recently I've been thinking about broader questions that didn't make it into the paper at all.

Our definition of unanswerability hinges on picking a wrong document in step 1, but what if you had a question answering system that didn't use the 2-step pipeline at all? ChatGPT doesn't rely on access to a document store or to the internet but simply has all its "knowledge" baked in. In this context, it's interesting to look at unanswerability as a property inherent to the question, e.g., a question that assumes something about the state of the world that isn't actually true.

When was the asteroid that destroyed Australia first spotted?

I've also been thinking about how the world is constantly changing and it is expensive, impractical and plain absurd to up and build a brand new ChatGPT every time a fact changes. There are different, more practical ways to handle this (with varying advantages, disadvantages and degrees of success). I'm curious about how these approaches stack up in the context of question answering, and I've been thinking about the "stickiness" of facts: country capitals change significantly less often than government leaders do. Does this make systems more adaptable to information updates in the latter case?

Quantifying the answers to these questions lets us better understand where these technologies are useful, where a different technical solution might be better, and where we should just keep computers out of the equation altogether. I feel that not enough people are considering this last possibility or thinking critically about the tradeoffs and (social, environmental, etc.) impact of getting a computer to do something, but I'm glad I know and work with many who do. I hope 2024 holds more papers and blog posts about these collaborations - stay tuned!