Step-by-Step Guide to Building a Chatbot Locally: Langchain, Ollama, and Llama3

Introduction

Imagine having a personal AI assistant that lives on your computer, ready to chat whenever you are. Forget the cloud and its privacy concerns: this is local AI, powered by Llama3, a cutting-edge language model, and the easy-to-use Langchain framework. With these tools, building your own chatbot is no longer science fiction; it's a weekend project! Get ready to chat with the future, all on your own terms.

Let's start by setting up Llama3 locally

  • Download the Ollama package from the official Ollama website and install it.

  • After installation, Ollama starts a server locally on your machine.

  • Now we need to download the model. Open your terminal and run 'ollama pull llama3'.

This will download the model and make it ready to use locally.
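If you want to confirm that the Ollama server is actually up before moving on, a quick check like the sketch below works. It is optional and assumes Ollama is listening on its default port 11434; the root endpoint simply reports that the server is running.

""" optional: verify the local Ollama server is reachable """

import urllib.request

try:
    with urllib.request.urlopen("http://localhost:11434", timeout=5) as resp:
        # a healthy server answers with a short "Ollama is running" message
        print(resp.read().decode())
except OSError as err:
    print(f"Ollama server not reachable: {err}")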

Time to set up Langchain

We need to set up our project with Python and Langchain. I will use Pipenv for this setup.

pipenv install langchain langchain-core langchain-community

Chatbots with Langchain are quite easy to create. We need a few components for this:

  1. We require a Chat-based LLM (Large Language Model). Here, I will utilize the Llama3 model.

  2. We need the Prompt to set up the environment and provide context for the chat model.

  3. Models are stateless, meaning they lack knowledge of previous conversations. Hence, we need a custom memory to store and pass previous conversations as context.

  4. Finally, I will incorporate a Chain. Chains are a fundamental concept in Langchain, enabling you to move beyond a single API call to a language model and instead link together multiple calls in a logical sequence.

Step 1

""" make and return the llm instace """

from langchain_community.chat_models.ollama import ChatOllama


def get_llm():
    """returns llama3 model instance"""
    return ChatOllama(model="llama3", temperature=0)

Here, I import the ChatOllama class. In the get_llm function, I return an instance of that class. Note that I have specified the model as llama3 with a temperature of 0. This means we will use the llama3 model, and with the temperature at 0 its output is as deterministic as possible: it won't produce highly creative answers and will stick to straightforward responses.
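As a quick sanity check (not part of the final app), you can call the model directly. The snippet below assumes the file from Step 1 is saved as llm.py (as the final step's imports suggest) and that Ollama is running with llama3 pulled; since ChatOllama is a chat model, invoke returns a message object whose content attribute holds the text.

""" optional sanity check for the llm instance """

from llm import get_llm

if __name__ == "__main__":
    llm = get_llm()
    # invoke sends a single message and returns an AI message
    reply = llm.invoke("Say hello in one sentence.")
    print(reply.content)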

Step 2

""" create and return propmt template """

from langchain.prompts import (
    HumanMessagePromptTemplate,
    ChatPromptTemplate,
    MessagesPlaceholder,
    SystemMessagePromptTemplate,
)


def get_chat_prompt_template():
    """generate and return the prompt
    template that will answer the users query
    """
    return ChatPromptTemplate(
        input_variables=["content", "messages"],
        messages=[
            SystemMessagePromptTemplate.from_template(
                """
                    You are a helpful AI assistant.
                    Try to answer the questions to the best of your knowledge.
                """
            ),
            MessagesPlaceholder(variable_name="messages"),
            HumanMessagePromptTemplate.from_template("{content}"),
        ],
    )

ChatPromptTemplate is the class responsible for creating prompt templates. These templates contain instructions for the language model. Whenever we query the model, this template is provided along with the question. It's important to note that models are stateless and do not retain instructions, so we need to pass the template each time.

Here, I have written a simple system message instructing the model to answer questions to the best of its knowledge. You'll also notice a MessagesPlaceholder, which holds the previous messages of the conversation. Lastly, the HumanMessagePromptTemplate section contains the question we ask. The input_variables are passed to the prompt during execution and hold the variables messages and content: content is supplied by the user, while messages come from memory.
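To see exactly what the model receives, you can render the template yourself. A minimal sketch, assuming the file from Step 2 is saved as prompt.py (as the final step's imports suggest) and an empty history:

""" optional: inspect the rendered prompt """

from prompt import get_chat_prompt_template

prompt = get_chat_prompt_template()
# format_messages fills in the placeholders and returns a list of messages
rendered = prompt.format_messages(content="What is Llama3?", messages=[])
for message in rendered:
    print(type(message).__name__, ":", message.content)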

Step 3

""" create the conversation buffer memory
that will store the current conversation in memory
"""

from langchain.memory.buffer import ConversationBufferMemory
from langchain.memory.chat_message_histories.file import FileChatMessageHistory


def get_memory():
    """create and return buffer memory to retain the conversation info"""
    return ConversationBufferMemory(
        memory_key="messages",
        chat_memory=FileChatMessageHistory(file_path="memory.json"),
        return_messages=True,
        input_key="content",
    )

To store the chat history, I will use ConversationBufferMemory together with FileChatMessageHistory. ConversationBufferMemory requires a memory_key, which here is set to "messages" to match the input_variables used in ChatPromptTemplate. We also need a storage location so the history survives a restart of the chatbot. Since this runs locally, we can use FileChatMessageHistory with a file_path to store the conversation in a JSON file for future reference. Alternatively, you can use a different backend; Langchain supports various databases, including vector stores. Additionally, we set return_messages=True because we are using a chat-based model, not a completion model. This ensures the previous messages are returned as message objects and passed to the MessagesPlaceholder.
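Here is a small, hedged illustration of how this memory round-trips one conversation turn, assuming the file above is saved as memory.py (as Step 4's import suggests): save_context appends a question/answer pair to memory.json, and load_memory_variables returns the history under the "messages" key that the MessagesPlaceholder expects.

""" optional: see how the memory stores and returns messages """

from memory import get_memory

memory = get_memory()
# store one question/answer pair; it is appended to memory.json
memory.save_context({"content": "Hi there"}, {"text": "Hello! How can I help?"})
# the history comes back under the "messages" key as message objects
print(memory.load_memory_variables({})["messages"])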

Step 4

""" create new chain """

from langchain.chains.llm import LLMChain
from memory import get_memory

def create_chain(llm, prompt):
    """returns a chain instance"""
    return LLMChain(llm=llm, prompt=prompt, memory=get_memory())

Here, I am creating an LLMChain instance with the llm, prompt, and memory we made in the previous steps. Now, we are prepared to execute the chat model.
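For reference, LLMChain.invoke returns a dictionary rather than a plain string: with this setup it should contain the original inputs plus the generated answer under the chain's default "text" output key, which is why the final script below prints the "text" entry. A quick, hedged sketch:

""" optional: inspect what the chain returns """

from llm import get_llm
from prompt import get_chat_prompt_template
from chain import create_chain

chain = create_chain(get_llm(), get_chat_prompt_template())
result = chain.invoke({"content": "What is Ollama?"})
# expected keys: "content" (input), "messages" (history), "text" (answer)
print(result.keys())
print(result["text"])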

Final Step

""" chat app with ollama llama3 model """

from llm import get_llm
from prompt import get_chat_prompt_template
from chain import create_chain

""" LARGE LANGUAGE MODEL """
llm = get_llm()

""" PROMPT """
prompt = get_chat_prompt_template()

""" CHAIN """
chain = create_chain(llm, prompt)

""" RUN THE APP """
if __name__ == "__main__":
    while True:
        question = input(">>> ")
        """ user types q break out of the loop """
        if question == "q":
            break
        """ print out the response from the model """
        print(chain.invoke({"content": question }))

So, we combine all the components and create a chain instance. We will then use the invoke function of the chain to begin chatting with the model.

How to get Streaming Responses

If you have followed the steps, you may have observed that the model's response is quite slow. It takes time for the model to create the response, and if the response is lengthy, the waiting time is also prolonged, which isn't ideal for user experience. To address this problem, we require the model to produce and send the response in smaller parts.

As we are utilizing the LLMChain, there isn't a straightforward way to achieve this. We must extend LLMChain, create our own class, and override the stream method.

""" create new chain """

from langchain.chains.llm import LLMChain
from threading import Thread
from queue import Queue
from stream import get_streaming_handler
from memory import get_memory


class StreamableChainMixin:
    """modify the streaming of the chain"""

    def stream(self, inputs):
        queue = Queue()
        streaming_handler = get_streaming_handler(queue)

        def task():
            # run the chain in a background thread so tokens can be read
            # from the queue while the model is still generating
            self.invoke(inputs, config={"callbacks": [streaming_handler]})

        Thread(target=task).start()

        while True:
            token = queue.get()
            if token is None:  # sentinel pushed by the handler on end/error
                yield "\n"
                break
            yield token


class StreamingChain(StreamableChainMixin, LLMChain):
    pass


def create_chain(llm, prompt):
    """returns a chain instance"""
    return StreamingChain(llm=llm, prompt=prompt, memory=get_memory())

We create a StreamableChainMixin class that contains the modified stream method; using the mixin approach, we can apply the same behavior to other chains as well. The stream method creates a streaming handler that we pass to the model as a callback, so it is triggered every time the model produces a new token. You may have noticed the yield at the end of the method; this is because stream is a generator. Each token that reaches the streaming handler is placed on the queue, and we retrieve and yield values from the queue.

Take note of the task function inside the stream method, which runs in a separate thread. This prevents the streaming response from being blocked: otherwise, the method would wait until self.invoke finished before yielding anything. The call to self.invoke is still necessary to execute the model; if we skipped it, the model would never receive our questions.
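If the thread-plus-queue pattern is new to you, here is a stripped-down sketch of the same idea with no LangChain involved: a producer thread pushes tokens onto a queue, a None sentinel marks the end, and a generator on the main thread yields tokens as they arrive.

""" minimal illustration of the thread + queue streaming pattern """

import time
from threading import Thread
from queue import Queue


def stream_words(sentence):
    queue = Queue()

    def producer():
        for word in sentence.split():
            time.sleep(0.2)  # pretend each token takes time to generate
            queue.put(word)
        queue.put(None)  # sentinel: no more tokens

    Thread(target=producer).start()

    while True:
        token = queue.get()
        if token is None:
            break
        yield token


for word in stream_words("tokens arrive one at a time"):
    print(word, flush=True)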

""" handle streaming """

from langchain.callbacks.base import BaseCallbackHandler


class StreamingHandler(BaseCallbackHandler):
    """handle streaming for chat models"""

    def __init__(self, queue):
        self.queue = queue

    def on_llm_new_token(self, token, **kwargs):
        self.queue.put(token)

    def on_llm_end(self, response, **kwargs):
        self.queue.put(None)

    def on_llm_error(self, error, **kwargs):
        self.queue.put(None)


def get_streaming_handler(queue):
    """returns a strwmaing handler"""
    return StreamingHandler(queue)

The instance of StreamingHandler is passed to the model in the StreamableChainMixin class. All the methods inside this class are callbacks triggered while the model is responding. As the model generates a response, it triggers on_llm_new_token for each new token. Once the response ends, it triggers on_llm_end, and in case of an error, it triggers on_llm_error. In both the end and error cases we put None in the queue so that the stream method knows to stop yielding values. The queue is needed to hand the tokens over to the generator one by one.
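To make the callback-to-queue flow concrete, you can drive the handler by hand, with no model involved; the calls below simply simulate what happens while a response is being generated.

""" optional: exercise the streaming handler by hand """

from queue import Queue
from stream import get_streaming_handler

queue = Queue()
handler = get_streaming_handler(queue)

# simulate the callbacks the model would fire while generating
handler.on_llm_new_token("Hello")
handler.on_llm_new_token(" world")
handler.on_llm_end(response=None)

while True:
    token = queue.get()
    if token is None:
        break
    print(token, end="")
print()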

Now we are ready to start receiving a streaming response from the model. We need to make a slight modification to the way we call the model to accomplish this.

""" chat app with ollama llama3 model """

from llm import get_llm
from prompt import get_chat_prompt_template
from chain import create_chain
from dotenv import load_dotenv

""" load the env variables """
load_dotenv()


""" LARGE LANGUAGE MODEL """
llm = get_llm()

""" PROMPT """
prompt = get_chat_prompt_template()

""" CHAIN """
chain = create_chain(llm, prompt)


""" RUN THE APP """
if __name__ == "__main__":
    while True:
        question = input(">>> ")
        """ user types q break out of the loop """
        if question == "q":
            break
        """ print out the response from the model """
        for res in chain.stream({"content": question}):
            print(res, end="", flush=True)

Instead of using the invoke method, we now use the stream method of the chain. Unlike invoke, it returns a generator that yields the response token by token, so we print each piece as soon as it arrives.

And that's it: we now have a local chatbot with streaming responses.

Conclusion

In summary, with the help of Llama3 and Langchain, it's now possible to create a personal AI assistant locally. This blog has explained the process of setting up the environment, putting the components together, and enabling streaming responses for a smooth chat experience. By addressing privacy concerns and keeping you in control, building and chatting with your own AI assistant is no longer a distant dream but a practical reality. So, jump in and discover the potential of local AI today.

All the code used in this article can be found on GitHub.
