Imagine a situation where you have thousands of documents, need to find an answer buried somewhere in them, and at the same time want to know which document the answer comes from. You could open and search the documents one by one, but that would take forever.

Enter Extractive Question Answering with Sparse Transformers. With Extractive Question Answering, you input a query into the system and, in return, get the answer to your question along with the document containing it.
Extractive Question Answering enables you to search many records and find the answer. It works by:
- Retrieving documents that are relevant to answering the question.
- Returning the span of text that answers that question.
Language models make this possible. For example, the retriever can be a masked language model, while the reader can be a question-answering model.
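To make the retriever/reader split concrete, here is a minimal sketch of the two stages. It uses a simple TF-IDF retriever as a stand-in for a learned retriever and a Hugging Face question-answering pipeline as the reader; the toy documents and the distilbert-base-cased-distilled-squad checkpoint are illustrative placeholders, not the setup used later in this article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

# A toy document store; in practice this would be thousands of documents.
documents = [
    "DeepSparse is an inference runtime that accelerates sparse models on CPUs.",
    "The arXiv dataset contains abstracts of scientific papers.",
    "Pruning removes weight connections from an overparameterized model.",
]

# Retriever: rank documents by similarity to the question (TF-IDF stand-in).
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(question, top_k=1):
    query_vector = vectorizer.transform([question])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    best = scores.argsort()[::-1][:top_k]
    return [documents[i] for i in best]

# Reader: extract the answer span from the retrieved document.
reader = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

question = "What does pruning remove from a model?"
context = retrieve(question)[0]
result = reader(question=question, context=context)
print(result["answer"], "| found in:", context)
```

The retriever narrows thousands of documents down to a handful of candidates, and the reader only has to extract an answer span from those, which is what keeps the end-to-end query fast.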
The challenge with these language models is that they are quite large. Their size makes them hard to deploy for real-time inference; deploying big models on mobile devices, for example, is often not possible. Inference latency and throughput are also critical.
The solution is to reduce the model's size while maintaining its accuracy. Making the model small is easy; keeping it accurate is the challenging part. This can be achieved by pruning and quantizing the model. Pruning removes some of the weight connections from an otherwise over-precise and over-parameterized model, while quantization reduces the precision of the floating-point weights to make the model smaller.
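To illustrate, here is a minimal sketch of both ideas applied to toy layers with PyTorch's built-in utilities; it only demonstrates the mechanics (magnitude pruning and dynamic int8 quantization) and is not the sparsification recipe used for the transformer models discussed later.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy linear layer standing in for one transformer weight matrix.
layer = nn.Linear(768, 768)

# Pruning: zero out the 50% of weights with the smallest magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.5)
prune.remove(layer, "weight")  # bake the zeros into the weight tensor
sparsity = (layer.weight == 0).float().mean().item()
print(f"Weight sparsity after pruning: {sparsity:.0%}")

# Quantization: store Linear-layer weights as int8 instead of float32.
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)
```

Pruning turns many weights to zero so a sparsity-aware runtime can skip them, while quantization shrinks the remaining weights from 32-bit floats to 8-bit integers; together they cut model size and speed up inference with only a small accuracy cost when done carefully.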
In this article, I cover this in more detail, including:
- Document retrieval with DeepSparse and the arXiv dataset
- Document retrieval with dense and sparse models
- Comparing the performance of dense and sparse models