Running Large Language Models Locally: Llama.cpp

We’ll explore how to run Large Language Models (LLMs) on your local system, whether you don’t have a GPU or simply want to avoid API costs. We’ll do this with the Llama.cpp library (https://github.com/ggerganov/llama.cpp), which runs on both CPU and GPU and can take advantage of GPU acceleration when it’s available.

Prerequisites: This blog is for curious minds looking to tap into the potential of Large Language Models (LLMs) to enhance both their daily lives and business objectives. All you need is a basic understanding of Python and a pinch of curiosity – no deep learning wizardry required!

Let’s get started by installing the library; llama.cpp offers Python bindings:

Python version used: Python 3.10.12

pip install -q llama-cpp-python==0.1.78

Please ensure you install this specific version, as other versions might not support certain LLM models. Our focus here is on running models locally on the CPU, so we’ll be working with models in formats such as GGML and GGUF. You can download these models from the Hugging Face website here.
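If you prefer to fetch a model from a script instead of the website, the huggingface_hub package can download a single file from a model repository. The repository and file name below are just examples (a 2-bit quantized Llama 2 chat model in GGML format); substitute whichever model you chose.

# pip install huggingface_hub
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGML",       # example repo hosting GGML quantizations
    filename="llama-2-7b-chat.ggmlv3.q2_K.bin",    # 2-bit quantized 7B chat model
)
print(model_path)  # local path you can pass to Llama(model_path=...)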

GGML models are LLMs packaged in a format designed to run efficiently and quickly on CPUs. They achieve this through techniques such as:

  • Quantization: This reduces the precision of the weights and activations in the model, which can make the model smaller and faster to execute.
  • Model compression: This removes redundant information from the model, such as unused weights or connections.

GGML models are known for being CPU-friendly because they can be loaded and run on ordinary CPUs without the need for GPUs. This makes them a good choice for applications where GPUs are not available or not affordable.
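To make the quantization idea concrete, here is a toy sketch (not GGML’s actual k-quant scheme): weights are mapped onto a small integer grid plus a per-block scale, which is why a 2-bit or 4-bit file is so much smaller than the original 16-bit model.

# Toy illustration of block-wise weight quantization.
import numpy as np

weights = np.random.randn(8).astype(np.float32)        # pretend this is one block of fp32 weights
bits = 4
levels = 2 ** (bits - 1) - 1                           # symmetric signed range, e.g. -7..7 for 4-bit

scale = np.abs(weights).max() / levels                 # one scale stored per block
quantized = np.round(weights / scale).astype(np.int8)  # small integers instead of 32-bit floats
dequantized = quantized.astype(np.float32) * scale     # what inference actually computes with

print("original :", np.round(weights, 3))
print("recovered:", np.round(dequantized, 3))
print("max error:", np.abs(weights - dequantized).max())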

Since we’re primarily running on CPU, we need to look for models in GGML and GGUF formats. Popular models like Llama 2 and Falcon have GGML versions available. You can find them on the Hugging Face website’s dedicated search page. Choose a model that suits your CPU. Keep in mind that larger models result in slower inference, especially on CPUs with limited RAM. If you have, for example, 6GB of RAM, consider using 2-bit or 4-bit quantized GGML models for faster responses.

These models come in different sizes, such as 7 billion, 13 billion, 30 billion, and 70 billion parameters. Additionally, “2-bit” and “3-bit” refer to the quantization of the original model’s weights. Lower bit models provide faster responses but may sacrifice accuracy on complex tasks. It’s a trade-off between inference speed, memory usage, and accuracy.

For general purposes and dialogue generation, a 2–4 bit quantized model with 7 billion parameters is usually sufficient. However, if you require higher accuracy, you can opt for models with more parameters, like 13 billion. Assess your specific needs and choose a model accordingly.

In this blog, we will work with a relatively small model with 7 billion parameters and 2-bit quantization. However, always be mindful of your CPU specifications. Larger models can significantly impact performance, sometimes even causing crashes. Select a model that fits your CPU and available RAM.
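A quick back-of-the-envelope check helps when picking a size: the model file is roughly parameters × bits-per-weight ÷ 8 bytes, plus some overhead for the quantization scales, context window, and activations. A rough sketch:

# Rough rule of thumb for whether a quantized model will fit in RAM.
def approx_model_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for bits in (2, 4, 8, 16):
    print(f"7B model at {bits:>2}-bit is roughly {approx_model_size_gb(7, bits):.1f} GB")

# 2-bit (~1.8 GB) and 4-bit (~3.5 GB) are comfortable on a 6 GB machine,
# while the original 16-bit weights (~14 GB) would not fit.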

After installing the pip package and downloading the models from Hugging Face, let’s give it a try:

from llama_cpp import Llama

model = Llama(
    model_path="path_to_your_models/llama-2-7b-chat.ggmlv3.q2_K.bin"  # Replace with your model's path
)

# Calling the model directly runs completion on the prompt string
model_output = model("What do you think of the impact of LLM models in various industries?")

print(model_output["choices"][0]["text"])

The output looks like this:
I think it's a great way for individuals and organizations
to gain new skills and knowledge in specific areas without
having to go back to school or hire full-time employees.
The flexibility of the LLM models allows learners to choose
from a wide range of topics, making it easy to find something
that aligns with their interests and career goals.

If you run into a package installation or any other issue, double-check the steps and the environment you’re in and try again. If it worked, great!
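One tip before moving on: the Llama 2 chat checkpoints were fine-tuned on a specific prompt template, so wrapping your question in it usually gives noticeably better answers than a bare string. A minimal sketch (this template applies to the llama-2-*-chat models; other models expect different formats):

# Llama 2 chat prompt template: a system message inside <<SYS>> tags,
# followed by the user message, all wrapped in [INST] ... [/INST].
system_prompt = "You are a helpful, concise assistant."
user_question = "What do you think of the impact of LLM models in various industries?"

prompt = (
    "[INST] <<SYS>>\n"
    f"{system_prompt}\n"
    "<</SYS>>\n\n"
    f"{user_question} [/INST]"
)

model_output = model(prompt, max_tokens=256, temperature=0.7)
print(model_output["choices"][0]["text"])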

Let’s try a slightly more complex task, like parsing valuable information from a document (OCR text):

prompt_query = """
Extract key information like Member name, Member id, RXBIN, RXGRP, etc., from text in dictionary format.
This is the text to extract key information:
'
South Carolina
Member Name
Bruce Wayne
Member ID
ZCT012345678901
RxBIN 004336 PLAN PPO
RxGRP. RX4236
RxPCN MEDDADV
Issuer 80340
Part D/Plan Benefit
CMS-H4209-XXX
(ual PPO
'
"""
model_output = model(prompt_query, max_tokens=1000, temperature=0.1)
print(model_output["choices"][0]["text"])

The output looks like this:

Answer:
{
"Member Name": "Bruce Wayne",
"Member ID": "ZCT012345678901",
"RxBIN": "004336 PLAN PPO",
"RxGRP": "RX4236",
"RxPCN": "MEDDADV",
"Issuer": "80340",
"Part D/Plan Benefit": "CMS-H4209-XXX"
}
Note: The keys are case insensitive, so the values in the dictionary may be in any case.


As you can see, the outputs are quite satisfactory even with the small 7-billion-parameter model, which is 2-bit quantized. Feel free to experiment with larger models based on your specific needs.
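Since the model returns the dictionary as plain text, a little post-processing is needed before you can use it in code. A minimal, best-effort sketch that grabs the first {...} block from the response and parses it as JSON (real OCR pipelines will need more robust handling):

import json
import re

raw_text = model_output["choices"][0]["text"]
match = re.search(r"\{.*\}", raw_text, re.DOTALL)    # first {...} span in the response

if match:
    try:
        extracted = json.loads(match.group(0))
        print(extracted["Member Name"], extracted["Member ID"])
    except json.JSONDecodeError:
        print("Model returned a dictionary-like answer that is not valid JSON:")
        print(match.group(0))
else:
    print("No dictionary found in the model output.")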

Conclusion

In the ever-evolving landscape of programming languages and tools, it’s essential to stay informed about the alternatives available. While Llama Cpp provides a robust set of features and capabilities, it’s just one of many options at your disposal.

There are many other alternatives to Llama.cpp; check them out at your leisure: https://www.libhunt.com/r/llama.cpp

In our next blog chapter, we’ll delve into an exciting topic: harnessing the power of web-based user interfaces. We’ll explore how to leverage technologies similar to ChatGPT and its API to create interactive and dynamic applications. What’s even more intriguing is that we’ll achieve this without the need for dedicated GPUs or incurring API costs—making it accessible to developers at all levels. Stay tuned for a journey into the world of user interfaces that are both captivating and cost-effective.
