Imagine a situation where you have thousands of documents, need to find an answer buried somewhere in them, and at the same time want to know which document the answer comes from. You could open and search the documents one by one, but that would take forever.

Enter Extractive Question Answering with Sparse Transformers. With Extractive Question Answering, you input a query into the system and, in return, get the answer to your question along with the document containing it.
Extractive Question Answering enables you to search many records and find the answer. It works by:
- Retrieving documents that are relevant to answering the question.
- Returning the span of text that answers that question.
Language models make this possible. For example, the retriever can be a masked language model, while the reader can be a question-answering model.
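To make the retriever/reader split concrete, here is a minimal sketch of the two stages. It uses a simple TF-IDF retriever as a stand-in for a learned retriever and a Hugging Face question-answering pipeline as the reader; the toy documents and the distilbert-base-cased-distilled-squad checkpoint are illustrative placeholders, not the setup used later in this article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

# A toy document store; in practice this would be thousands of documents.
documents = [
    "DeepSparse is an inference runtime that accelerates sparse models on CPUs.",
    "The arXiv dataset contains abstracts of scientific papers.",
    "Pruning removes weight connections from an overparameterized model.",
]

# Retriever: rank documents by similarity to the question (TF-IDF stand-in).
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(question, top_k=1):
    query_vector = vectorizer.transform([question])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    best = scores.argsort()[::-1][:top_k]
    return [documents[i] for i in best]

# Reader: extract the answer span from the retrieved document.
reader = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

question = "What does pruning remove from a model?"
context = retrieve(question)[0]
result = reader(question=question, context=context)
print(result["answer"], "| found in:", context)
```

The retriever narrows thousands of documents down to a handful of candidates, and the reader only has to extract an answer span from those, which is what keeps the end-to-end query fast.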
The challenge with these language models is that they are quite large. Their size makes them hard to deploy for real-time inference; deploying big models on mobile devices, for example, is often not possible. Inference latency and throughput are also critical.
The solution is to reduce the model's size while maintaining its accuracy. Making the model small is easy; keeping it accurate is the challenging part. This can be achieved by pruning and quantizing the model. Pruning removes some of the weight connections from an otherwise over-precise and over-parameterized model, while quantization reduces the precision of the floating-point weights to make the model smaller.
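To illustrate, here is a minimal sketch of both ideas applied to toy layers with PyTorch's built-in utilities; it only demonstrates the mechanics (magnitude pruning and dynamic int8 quantization) and is not the sparsification recipe used for the transformer models discussed later.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy linear layer standing in for one transformer weight matrix.
layer = nn.Linear(768, 768)

# Pruning: zero out the 50% of weights with the smallest magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.5)
prune.remove(layer, "weight")  # bake the zeros into the weight tensor
sparsity = (layer.weight == 0).float().mean().item()
print(f"Weight sparsity after pruning: {sparsity:.0%}")

# Quantization: store Linear-layer weights as int8 instead of float32.
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)
```

Pruning turns many weights to zero so a sparsity-aware runtime can skip them, while quantization shrinks the remaining weights from 32-bit floats to 8-bit integers; together they cut model size and speed up inference with only a small accuracy cost when done carefully.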
In this article, I cover this in more detail, including:
- Document retrieval with DeepSparse and the arXiv dataset
- Document retrieval with dense and sparse models
- Comparing the performance of dense and sparse models