How to Get a ChatGPT-Like User Interface and API Up and Running Locally | Harnessing the Power of Open-Source LLMs

In this section, we’ll explore how to run LLM models conveniently through a web user interface and access them via an API, all without requiring a GPU, completely local and free. We’ll be using a popular open-source, free-to-use web UI available on GitHub. Let’s dive in:

Prerequisites: This blog is for curious minds looking to tap into the potential of Large Language Models (LLMs) to enhance both their daily lives and business objectives. All you need is a basic understanding of Python and a pinch of curiosity – no deep learning wizardry required!

Installation and Setup:

git clone -b v1.8 https://github.com/camenduru/text-generation-webui
cd text-generation-webui

If needed (recommended), activate your Conda environment or a virtual environment (venv) first.

pip install -r requirements.txt
python server.py

Please ensure you install this specific version, as other versions might not support certain LLM models. Our focus here is on running models locally on the CPU, so we’ll be working with models in formats suited to CPU inference, such as GGML and GGUF. You can download these models from the Hugging Face website here.

GGML models are LLMs packaged in a format designed to run efficiently and quickly on CPUs. They achieve this through techniques such as quantization, which stores the model weights at lower precision to cut memory use and speed up inference. (GGUF is the newer format that succeeds GGML.)

Web User Interface (UI):

  • Access the web UI in your browser at the address printed in the terminal (by default http://localhost:7860).
  • Place the LLM models you downloaded from Hugging Face in the models folder, as recommended by the documentation; for example (see the download sketch after this list):
    text-generation-webui/models/llama-2-7b-chat.ggmlv3.q2_K.bin
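
If you prefer to script the download, here is a minimal sketch using the huggingface_hub package (install it with pip install huggingface_hub). The repo ID and filename below are just one example of a quantized GGML build of Llama 2 Chat; swap in whichever model you want, and adjust the destination path to wherever your clone lives:

from huggingface_hub import hf_hub_download

# Example only: download a quantized GGML build of Llama-2-7B-Chat straight
# into the web UI's models folder (adjust repo_id/filename to your chosen model).
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGML",
    filename="llama-2-7b-chat.ggmlv3.q2_K.bin",
    local_dir="text-generation-webui/models",
)
print(model_path)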

The user interface looks like this:


You can load any language model from Hugging Face. This web UI has a lot of other features, such as character personalities to talk and interact with, fine-tuning, text-to-speech, and many more. Explore more: https://github.com/oobabooga/text-generation-webui

  • In the web UI, top left corner, navigate to the “Models” tab and load your downloaded model.
  • Once loaded, return to the “Text Generation” tab, and you’re ready to use the chat, similar to ChatGPT.

Or you can simply start the server with arguments like this:

python server.py --share --model path_to_your_model/llama-2-7b-chat.ggmlv3.q2_K.bin

API Access:

  • Start the server with the API enabled (for example, python server.py --api), and you can then make API calls from another file or terminal.
  • Define the API endpoint like this (example):

HOST = 'localhost:5000'
URI = f'http://{HOST}/api/v1/generate'
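
Before wiring up a full request, you can optionally check that the server is reachable. This is a minimal sketch; it assumes the legacy API also exposes a GET /api/v1/model endpoint that reports the currently loaded model (check your version’s documentation if it doesn’t):

import requests

# Optional sanity check: ask the server which model is currently loaded.
# Assumes the legacy API exposes GET /api/v1/model (returns e.g. {'result': '<model name>'}).
HOST = 'localhost:5000'
response = requests.get(f'http://{HOST}/api/v1/model')
print(response.status_code, response.json())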

Here’s sample code, adapted from the documentation, demonstrating how to make an API call to the web UI:

import requests

# For local streaming, the websockets are hosted without ssl - http://
HOST = 'localhost:5000'
URI = f'http://{HOST}/api/v1/generate'

# For reverse-proxied streaming, the remote will likely host with ssl - https://
# URI = 'https://your-uri-here.trycloudflare.com/api/v1/generate'

def run(prompt):
    request = {
    'prompt': prompt,
    'max_new_tokens': 250,
    'auto_max_new_tokens': True,
    'max_tokens_second': 0,

    # Generation params. If 'preset' is set to different than 'None', the values
    # in presets/preset-name.yaml are used instead of the individual numbers.
    'preset': 'None',
    'do_sample': True,
    'temperature': 0.1,
    'top_p': 0.1,
    'typical_p': 1,
    'epsilon_cutoff': 0, # In units of 1e-4
    'eta_cutoff': 0, # In units of 1e-4
    'tfs': 1,
    'top_a': 0,
    'repetition_penalty': 1.18,
    'repetition_penalty_range': 0,
    'top_k': 40,
    'min_length': 0,
    'no_repeat_ngram_size': 0,
    'num_beams': 1,
    'penalty_alpha': 0,
    'length_penalty': 1,
    'early_stopping': False,
    'mirostat_mode': 0,
    'mirostat_tau': 5,
    'mirostat_eta': 0.1,
    'guidance_scale': 1,
    'negative_prompt': '',

    'seed': -1,
    'add_bos_token': True,
    'truncation_length': 2048,
    'ban_eos_token': False,
    'skip_special_tokens': True,
    'stopping_strings': []
    }

    response = requests.post(URI, json=request)

    if response.status_code == 200:
        result = response.json()['results'][0]['text']
        print(prompt + result)
        return result
    else:
        print(f'Request failed with status code {response.status_code}')

# prompt = "In order to make homemade bread, follow these steps:\n1)"
prompt = f"""
What do you think of the impact of LLM models in various industries?
"""

run(prompt)

It will return output like this:

I think it's a great way for individuals and organizations
to gain new skills and knowledge in specific areas without
having to go back to school or hire full-time employees.
The flexibility of the LLM models allows learners to choose
from a wide range of topics, making it easy to find something
that aligns with their interests and career goals.

With this setup, you can harness the power of LLM models using a free, open-source solution with both a web interface and API access, all without the need for a GPU.

Using Google Colab for LLM Models with Limited CPU Memory

If your CPU has limited memory, Google Colab can be a lifesaver. By storing all necessary folders within your Google Drive, you can ensure that even when your Colab session ends, you won’t need to repeatedly download models and repositories. Here’s how to set up a seamless workflow:

  1. Organize Your Google Drive:
  • Create a dedicated folder in your Google Drive for this purpose.
  • Inside this folder, keep all your LLM models and the cloned repository from GitHub, such as the web UI, neatly organized.
  • This smart storage strategy will save you time and minimize distractions.

Organized Google Drive: here is how mine looks; I have stored all my GGML models in the GGML_Model folder:

To mount Google Drive and access your files from Google Colab, check out the sample Colab notebook here: https://colab.research.google.com/drive/1pX1lgTytuQRQpi_m78q94kOn2_hxqZSR?usp=sharing
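
As a quick reference, the mounting step looks roughly like this in a Colab cell (the folder names below are assumptions based on the layout described above; adjust them to your own Drive):

from google.colab import drive
import os

# Mount your Google Drive into the Colab filesystem
drive.mount('/content/drive')

# List the GGML models stored on Drive (folder path assumed; change to match your layout)
models_dir = '/content/drive/MyDrive/My_System_LLM/GGML_Model'
print(os.listdir(models_dir))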

  2. Leverage Google Colab GPU:

  • Google Colab offers a range of GPU resources that can significantly boost your LLM model performance.

  3. Transferring Local Setup to Google Colab:

  • Follow these step-by-step instructions to install and run LLM models within Google Colab, ensuring you have access to these tools anytime, anywhere.

 https://colab.research.google.com/drive/1pX1lgTytuQRQpi_m78q94kOn2_hxqZSR?usp=sharing
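
For reference, the core steps in that notebook boil down to something like the sketch below (in a real notebook these would typically be ! shell cells; the paths are assumptions based on the Drive layout described above):

import os
import subprocess

# Change into the repo you cloned onto your Drive (path assumed; adjust to your layout)
os.chdir('/content/drive/MyDrive/My_System_LLM/text-generation-webui')

# Install dependencies, then launch the server; --share prints a public Gradio link
subprocess.run(['pip', 'install', '-r', 'requirements.txt'], check=True)
subprocess.run(['python', 'server.py', '--share'], check=True)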

With this setup, you’ll have a streamlined workflow for running LLM models, whether on your local CPU or through the powerful resources of Google Colab.

Conclusion

That wraps up this blog on running LLM models without the need for a GPU or incurring API costs. If you run into any questions or errors, please don’t hesitate to check the documentation or reach out for assistance.
In future blogs, I will delve into various LLM models and LangChain integration, and share techniques for optimizing prompt results for better performance.
Stay tuned for more exciting insights into the world of Large Language Models!

 
