mwitiderrick's comments | Hacker News

"What’s impressive is that the sparse fine-tuned LLM can achieve 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU."


Great, thanks for the add


Imagine a situation where you have thousands of documents

But need to find an answer from the documents

And at the same time, get the document where the answer is coming from

You could open and search the documents one by one

But that would take forever

Enter Extractive Question Answering with Sparse Transformers

With Extractive Question Answering, you input a query into the system

And in return, you get the answer to your question and the document containing the answer.

Extractive Question Answering enables you to search many records and find the answer.

It works in two steps:

- Retrieving documents that are relevant to answering the question.

- Returning the text that answers the question.

Language models make this possible.

For example, the retriever can be a masked language model.

The reader can be a question-answering model.

The challenge of these language models is that they are quite large.

The size makes it hard to deploy the models for real-time inference.

For example, deploying big models is not possible on mobile devices.

Furthermore, inference time, latency, and throughput are also critical.

The solution is to reduce the model's size while maintaining its accuracy.

Making the model small is easy but maintaining accuracy is challenging.

These can be achieved by pruning and quantizing the model.

Pruning involves removing some weight connections from an otherwise over-precise and over-parameterized model.

Furthermore, you can reduce the precision of the floating points to make the model smaller.

In today's article, I cover this in more detail, including:

- Document retrieval with DeepSparse and the arXiv dataset

- Document retrieval with a dense and a sparse model

- Comparing the performance of the dense and sparse models
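For readers who want to see what the reader step looks like in code, here is a minimal sketch using DeepSparse's question-answering pipeline. It is not the exact pipeline from the article: the SparseZoo stub, the document snippets, and the brute-force scoring loop are placeholders for illustration, and I assume the pipeline output exposes answer and score fields as in DeepSparse's documented QA task.

    # Minimal sketch: extractive QA over a handful of documents with DeepSparse.
    # The model stub is a placeholder; substitute any sparse QA model stub or
    # a local model path.
    from deepsparse import Pipeline

    qa = Pipeline.create(
        task="question-answering",
        model_path="zoo:placeholder/sparse-bert-squad",  # placeholder stub
    )

    documents = {
        "paper_1.txt": "Sparse transformers remove redundant weights to speed up inference.",
        "paper_2.txt": "BERT was pretrained on BookCorpus and English Wikipedia.",
    }

    question = "What do sparse transformers remove?"

    # Naive stand-in for retrieval: run the reader over every document and keep
    # the highest-scoring answer, remembering which document it came from.
    best_doc, best_answer = max(
        ((name, qa(question=question, context=text)) for name, text in documents.items()),
        key=lambda pair: pair[1].score,
    )
    print(best_doc, best_answer.answer)

In a real deployment you would put a proper retriever (for example, BM25 or an embedding index) in front of the reader instead of scoring every document.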


Language models have a limit on the length of text they can process. For instance, 512 tokens for BERT and 4096 for models such as Longformer and Big Bird.

However, documents in the real world can be arbitrarily long. For instance, reviews from customers.

Classifying these reviews quickly can help a business engage with its customers promptly. For instance, flagging negative reviews to reduce churn and customer complaints.

The problem?

Language models are usually very large, making them slow at inference and difficult to deploy. These models are often over-precise and over-parameterized. You can drop some of the weight connections in these networks to get a smaller model while maintaining accuracy. This is known as sparsification. Furthermore, you can reduce the precision of the floating-point weights in the network to reduce its size further.

In my latest article, I explore how to perform text classification on long documents using sparse Hugging Face Transformers. I show that it’s possible to use a 90% sparse transformer model (meaning that 90 percent of the model's weights are removed and the remaining parameters are quantized) and still achieve accuracy similar to that of a dense model.

Using the sparse transformer achieves a 4.8X speedup over the dense baseline. It also results in a smaller model that is easy to deploy on commodity CPUs, so no expensive accelerator hardware is required.

Try it yourself https://neuralmagic.com/blog/accelerate-customer-review-clas...
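For a rough idea of what invoking the sparse classifier looks like, here is a minimal sketch with DeepSparse's text-classification pipeline. The model stub and sample reviews are placeholders, and I assume the output exposes labels and scores fields as in DeepSparse's documented text-classification task; long documents would additionally need the chunking discussed in the article.

    # Minimal sketch: classifying reviews with a sparse model on a CPU.
    from deepsparse import Pipeline

    classifier = Pipeline.create(
        task="text-classification",
        model_path="zoo:placeholder/sparse-review-classifier",  # placeholder stub
    )

    reviews = [
        "The package arrived late and the item was damaged.",
        "Great product, exactly as described!",
    ]

    predictions = classifier(reviews)
    print(predictions.labels, predictions.scores)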


Tracking LightGBM projects with Layer. You can use Layer to:

- Version datasets

- Version models

- Log model parameters

- Log test metrics

- Log charts

- Log sample predictions

- Log images

- Document your project
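A rough sketch of what that tracking can look like, based on Layer's decorator-style Python SDK; the project name, model name, and the exact layer.log / @model signatures are assumptions to be checked against Layer's docs.

    # Sketch only: train a LightGBM model and log it to Layer.
    # layer.login(), layer.init(), @model and layer.log() follow Layer's
    # decorator-style SDK as I understand it; treat the calls as assumptions.
    import layer
    from layer.decorators import model
    import lightgbm as lgb
    from sklearn.datasets import load_breast_cancer
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    layer.login()
    layer.init("lightgbm-demo")  # hypothetical project name

    @model("lightgbm-classifier")  # versions the returned model under this name
    def train():
        X, y = load_breast_cancer(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )
        params = {"num_leaves": 31, "learning_rate": 0.05}
        clf = lgb.LGBMClassifier(**params).fit(X_train, y_train)
        layer.log(params)  # log model parameters
        layer.log({"accuracy": accuracy_score(y_test, clf.predict(X_test))})  # log test metrics
        return clf

    train()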


Once a machine learning model has been built, the next step is to share it with the world so that other people can benefit from it. Creating a working model is the beginning of a more extensive process where machine learning is operationalized. But what exactly is operationalization?


In this article, we will talk about:

What is Model Serving?

What is TensorFlow Serving?

TensorFlow Serving architecture

How to set things up with TensorFlow Serving?

Installing Docker

Installing TensorFlow Serving

Building an image classification model

Serving a model with TensorFlow Serving

Communication protocols

Creating gRPC and REST endpoints

Making a request to the model (see the sketch after this outline)

Challenges of working with TensorFlow Serving

Common errors you might face when working with TensorFlow Serving

Tips, best practices, and gotchas when working with TensorFlow Serving
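As a minimal illustration of the serving and request steps in the outline, the sketch below posts to TensorFlow Serving's REST predict endpoint. The model name, port mapping, and input shape are placeholders; it assumes a SavedModel is already being served locally (for example via the tensorflow/serving Docker image).

    # Assumes a model is already running locally, e.g.:
    #   docker run -p 8501:8501 \
    #     --mount type=bind,source=/path/to/saved_model,target=/models/my_model \
    #     -e MODEL_NAME=my_model -t tensorflow/serving
    # "my_model" and the input below are placeholders for illustration.
    import json
    import requests

    url = "http://localhost:8501/v1/models/my_model:predict"
    payload = {"instances": [[0.1, 0.2, 0.3, 0.4]]}  # shape must match the model's input

    response = requests.post(url, data=json.dumps(payload))
    print(response.json()["predictions"])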


In this class, you will learn how to use Power BI for business intelligence. You will generate various reports, create a dashboard from them, and then publish the dashboard and share it with your colleagues.


Turning your data applications into web applications can easily become a pain in the neck. If your focus is mainly data science, machine learning, and deep learning, you might not want to spend time learning web frameworks just to build and deploy data apps. Enter Streamlit.
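As a tiny, made-up illustration of why Streamlit lowers that barrier: an app is just a Python script (the file name and data below are placeholders), started with "streamlit run app.py".

    # app.py - minimal Streamlit sketch; run with: streamlit run app.py
    import pandas as pd
    import streamlit as st

    st.title("Review length explorer")  # hypothetical example app

    df = pd.DataFrame(
        {
            "review": ["great product", "arrived late", "works as described"],
            "length": [13, 12, 18],
        }
    )

    min_len = st.slider("Minimum review length", 0, 30, 10)
    st.dataframe(df[df["length"] >= min_len])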


Thanks for the heads up

