"What’s impressive is that the sparse fine-tuned LLM can achieve 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU."
Imagine a situation where you have thousands of documents
But need to find an answer buried in those documents
And, at the same time, identify which document the answer comes from
You could open and search the documents one by one
But that would take forever
Enter Extractive Question Answering with Sparse Transformers
With Extractive Question Answering, you input a query into the system
And in return, you get the answer to your question and the document containing the answer.
Extractive Question Answering enables you to search many records and find the answer.
It works by:
- Retrieving documents that are relevant to answering the question.
- Returning text that answers the question.
Language models make this possible.
For example, the retriever can be a masked language model.
The reader can be a question-answering model.
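To make this concrete, here is a minimal sketch of a retriever/reader pipeline. A few assumptions: a simple TF-IDF retriever stands in for a learned retriever, the reader is Hugging Face's default question-answering pipeline, and `docs` is a hypothetical two-document collection.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

# Hypothetical document collection; in practice this would be thousands of files.
docs = [
    "The Eiffel Tower was completed in 1889 and is 330 metres tall.",
    "The Great Wall of China is over 21,000 kilometres long.",
]
question = "When was the Eiffel Tower completed?"

# Retriever: rank documents by lexical similarity to the question.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
best_idx = scores.argmax()

# Reader: extract the answer span from the retrieved document.
reader = pipeline("question-answering")
result = reader(question=question, context=docs[best_idx])
print(result["answer"], f"(from document {best_idx})")
```

The retriever narrows thousands of documents down to a handful of candidates, so the expensive reader model only runs on the text most likely to contain the answer.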
The challenge with these language models is that they are quite large.
Their size makes them hard to deploy for real-time inference.
For example, deploying big models on mobile devices is often impossible.
Furthermore, latency and throughput at inference time are also critical.
The solution is to reduce the model's size while maintaining its accuracy.
Making a model small is easy, but maintaining accuracy is challenging.
Both can be achieved by pruning and quantizing the model.
Pruning involves removing some weight connections from an otherwise over-precise and over-parameterized model.
Quantization reduces the precision of the model's floating-point numbers to make it smaller still.
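As a rough illustration of both ideas, here is a minimal PyTorch sketch. Assumptions: a single linear layer stands in for one transformer weight matrix, and the pruning is done in one shot, whereas real sparsification recipes prune gradually during training.

```python
import torch
import torch.nn.utils.prune as prune

# A single linear layer standing in for one transformer weight matrix.
layer = torch.nn.Linear(768, 768)

# Pruning: zero out the 90% of weights with the smallest magnitudes.
prune.l1_unstructured(layer, name="weight", amount=0.9)
prune.remove(layer, "weight")  # bake the zeros into the weight tensor

# Quantization: store the remaining weights as int8 instead of float32.
model = torch.nn.Sequential(layer)
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
print(quantized)
```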
In today's article, I cover this in more detail, including:
- Document retrieval with DeepSparse and the arXiv dataset
- Document retrieval with dense and sparse models
- Comparing the performance of dense and sparse models
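As a preview, here is a minimal sketch of serving a sparse question-answering model with DeepSparse's `Pipeline` API. The SparseZoo model stub below is illustrative; check SparseZoo for the exact stub you want.

```python
from deepsparse import Pipeline

# Illustrative SparseZoo stub for a pruned + quantized BERT; look up the
# current stub on sparsezoo.neuralmagic.com before running.
model_stub = (
    "zoo:nlp/question_answering/bert-base/pytorch/huggingface/"
    "squad/pruned_quant-aggressive_95"
)

qa_pipeline = Pipeline.create(task="question-answering", model_path=model_stub)

result = qa_pipeline(
    question="What does sparsification remove from a model?",
    context="Sparsification removes redundant weight connections from a model.",
)
print(result.answer)
```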
Language models have a limit on the length of text they can process.
For instance, the limit is 512 tokens for BERT and 4,096 for models such as Longformer and Big Bird.
However, documents in the real world can be arbitrarily long, such as customer reviews.
Classifying these reviews quickly can help a business engage with its customers promptly, for instance, by flagging negative reviews to reduce churn and complaints.
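One common workaround for the length limit, sketched below, is to split a long review into chunks that fit the model and aggregate the chunk-level predictions. The word-based chunking and the "flag if any chunk is negative" rule are simplifying assumptions.

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

def classify_long_review(text: str, chunk_words: int = 300) -> str:
    # Naive word-based chunking to stay under the 512-token limit.
    words = text.split()
    chunks = [
        " ".join(words[i : i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]
    # truncation=True guards against any chunk still exceeding the limit.
    results = classifier(chunks, truncation=True)
    # Flag the whole review as negative if any chunk is negative.
    labels = {r["label"] for r in results}
    return "NEGATIVE" if "NEGATIVE" in labels else "POSITIVE"

print(classify_long_review("The delivery was late and the box was damaged. " * 100))
```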
The problem?
Language models are usually very large, making them slow at inference and difficult to deploy. These models are usually over-precise and over-parameterized. You can drop some of the weight connections in such networks to obtain a smaller model while maintaining accuracy; this is known as sparsification. Furthermore, you can reduce the precision of the network's floating-point numbers to shrink it further.
In my latest article, I explore how to perform text classification on long documents using sparse Hugging Face Transformers. I illustrate that it's possible to use a 90% sparse transformer model (meaning that 90 percent of the model's weights are removed and the remaining parameters are quantized) and still achieve accuracy similar to that of a dense model.
Using a sparse transformer achieves a 4.8X performance increase over the dense baseline. It also yields a smaller model that is easy to deploy on commodity CPUs, so no expensive accelerator hardware is required.
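For reference, here is a minimal sketch of running sparse text classification on a CPU with DeepSparse. As before, the model stub is illustrative rather than exact.

```python
from deepsparse import Pipeline

# Illustrative stub for a pruned + quantized sentiment model; check
# SparseZoo for the exact stub.
model_stub = (
    "zoo:nlp/sentiment_analysis/bert-base/pytorch/huggingface/"
    "sst2/pruned90_quant-none"
)

clf = Pipeline.create(task="text-classification", model_path=model_stub)
print(clf(["This product exceeded my expectations!"]))
```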
Tracking LightGBM projects with Layer. As the sketch after this list shows, you can use Layer to:
- Version datasets
- Version models
- Log model parameters
- Log test metrics
- Log charts
- Log sample predictions
- Log images
- Document your project
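Here is a minimal sketch of what this looks like, assuming the decorator-based API from the Layer SDK's docs; the project name, model name, and training data are hypothetical stand-ins.

```python
import layer
from layer.decorators import model

layer.login()                # authenticate with Layer
layer.init("lightgbm-demo")  # hypothetical project name

@model("churn-classifier")   # hypothetical model name; Layer versions the returned model
def train():
    import lightgbm as lgb
    from sklearn.datasets import load_breast_cancer
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    params = {"learning_rate": 0.1, "num_leaves": 31}
    layer.log(params)  # log model parameters

    clf = lgb.LGBMClassifier(**params).fit(X_train, y_train)
    layer.log({"accuracy": accuracy_score(y_test, clf.predict(X_test))})  # log test metric
    return clf

layer.run([train])
```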
Once a machine learning model has been built, the next step is to share it with the world so that other people can benefit from it. Creating a working model is the beginning of a more extensive process where machine learning is operationalized. But what exactly is operationalization?
In this class, you will learn how to use Power BI for business intelligence.
As part of the class, you will learn how to generate various reports as well as create a dashboard from them.
You'll also learn how you can publish your dashboard and share it with your colleagues.
Transforming your data applications into web applications can easily become a pain in the neck. If your focus is mainly data science, machine learning, and deep learning, you might not want to spend time learning web frameworks just to build and deploy data applications. Enter Streamlit.
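To give a taste of why, here is a minimal Streamlit sketch (save it as `app.py` and run `streamlit run app.py`); the CSV-of-reviews scenario is hypothetical.

```python
import pandas as pd
import streamlit as st

st.title("Review explorer")

# Let the user upload any CSV of reviews.
uploaded = st.file_uploader("Upload a CSV of reviews", type="csv")
if uploaded is not None:
    df = pd.read_csv(uploaded)
    st.write(f"{len(df)} reviews loaded")

    # Optional keyword filter across all columns.
    keyword = st.text_input("Filter rows by keyword")
    if keyword:
        mask = df.astype(str).apply(
            lambda row: keyword.lower() in " ".join(row).lower(), axis=1
        )
        df = df[mask]

    st.dataframe(df.head(50))
```

A handful of function calls gives you a working web app, with no HTML, routing, or JavaScript involved.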