In this blog post I will show how I used a local LLM to create a chatbot that can be used to conduct a first-person interview with any person based on their Wikipedia page. As an example, I will show how I used the chatbot to have multi-turn conversations with Luke Skywalker based only on the content from his Wikipedia page. I used Qwen3 8B as the local LLM for this exercise.

Indexing the Data

The key component in my solution is retrieval-augmented generation (RAG), so the first step is to pull in the content of the Wikipedia page and store it in a vector database (ChromaDB) for vector-based retrieval. The best approach for indexing the data depends on your particular use case, but a general guideline is to index data in chunks that convey a complete thought or idea. This helps with querying, since you are much more likely to find results that match the semantic meaning of the input query. I tried a few different chunking strategies but landed on indexing Wikipedia pages in units of document sections. In addition to the raw content, I also store a hierarchical section title as metadata that can be used to categorize the data.

I have shared the code for preparing the sections with recursive section titles below:

from typing import Dict, List

import wikipediaapi


def get_article(page_title: str, lang: str = 'en') -> List[Dict[str, str]]:
    # some_user_agent is defined elsewhere in the project.
    wiki = wikipediaapi.Wikipedia(language=lang, user_agent=some_user_agent)
    page = wiki.page(page_title)
    if not page.exists():
        raise ValueError(f"Page '{page_title}' not found.")

    results = []

    def recurse_sections(sections, prefix=""):
        for section in sections:
            current_title = f"{prefix} -> {section.title}" if prefix else section.title
            results.append({
                "title": current_title,
                "text": section.text.strip()
            })
            recurse_sections(section.sections, current_title)

    recurse_sections(page.sections)
    return results

Storing data as vector embeddings is key since it converts human-readable text to a numeric, computer-friendly format that still represents the semantic meaning of the text. By passing the input query through the same embedding model, we can search the vector DB and locate chunk(s) that are semantically relevant to the query. In practice this means that different queries with different wording but the same meaning will likely match the same chunks.
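To make this concrete, below is a simplified sketch of how the prepared sections could be added to a ChromaDB collection, with the hierarchical section title stored as metadata. The collection name and persistence path are placeholders, and ChromaDB computes the embeddings with its default embedding model here; the actual implementation may differ slightly.

import chromadb

# Placeholder path and collection name; ChromaDB uses its default embedding
# model to embed the documents unless another embedding function is configured.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("wiki_sections")

# Index each Wikipedia section as one chunk, keeping the hierarchical section
# title as the "source" metadata used for category filtering later.
sections = [s for s in get_article("Luke Skywalker") if s["text"]]
collection.add(
    ids=[str(i) for i in range(len(sections))],
    documents=[s["text"] for s in sections],
    metadatas=[{"source": s["title"]} for s in sections],
)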

Let’s look at the simple question: “What is your father’s name?”. Changing the wording to the less formal version “What is your dad’s name?” will still map the input query to a similar vector in embedding space. As a result, the chatbot will respond with the same answer based on retrieving the same data chunk. See screenshot below:

However, there are exceptions to this, and certain low-signal queries can yield very poor matches in embedding space. Let’s look at one such sample question: “What is Millenium Falcon?”. One of the challenges here is that embedding models are optimized for general semantic meaning but may underperform on short questions containing named entities (e.g. people, fictional objects). In these cases, it may yield better results to fall back on a keyword-based search like ngrams after first determining that the embedding space results are poor. I suspect there are multiple ways to implement this, but in this example I use an LLM to rank the vector search results on a scale from 0-10 to determine relevance.

In my experience working on this chatbot, I achieved the best results with a hybrid query approach. I will show the flow below, but my general experience is that most queries find good matches in embedding space, while a few benefit from the fallback to ngrams (keyword-based matching). Let’s look at my implementation in the sections below:

Vector Search

I experimented with a few different flows for doing vector searches, but landed on the following:

Instead of doing a straight kNN search to get the best matches, I try to augment the query with metadata filtering to further narrow down the search results when possible. Basically, I ask the LLM to categorize the question against a predetermined list of categories drawn from all indexed chunks. If a suitable category is found, I retrieve only chunks that also match the metadata before creating the final inference prompt. As I mentioned before, the metadata is created from the hierarchical section headers in the underlying Wikipedia document, so the success rate of this approach is highly dependent on how well these headers describe the section text. Unfortunately, with the Luke Skywalker page the results are mixed since the headers are often just names of various Star Wars movies with limited context. However, other documents that I tested (e.g. Barack Obama's Wikipedia page) had much more descriptive headers, which led to a much higher success rate in terms of matching on a single high-quality chunk.

I have included the prompt used for categorizing the input below:

def create_categorize_query_prompt(question, categories):
    category_list = "\n".join(categories)
    template = f"""
You are an assistant capable of categorizing questions by matching a single question to a single category from the following list:
Categories:
---------------------
{category_list}
---------------------
Question: {question}
Respond with just the selected category and nothing more.
"""
    return template
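The prompt above needs an actual list of categories and an LLM call around it, neither of which is shown here. The snippet below is only a sketch of how a categorize_query helper could be wired up, assuming the categories are collected from the stored chunk metadata and reusing the Settings.llm and extract_think_response helpers that appear elsewhere in the code:

def get_all_categories():
    # Assumption: the candidate categories are the distinct hierarchical
    # section titles stored as metadata when the chunks were indexed.
    collection = _get_collection()
    metadatas = collection.get(include=["metadatas"])["metadatas"]
    return sorted({m["source"] for m in metadatas if m and "source" in m})


def categorize_query(question):
    categories = get_all_categories()
    prompt = create_categorize_query_prompt(question, categories)
    res = Settings.llm.predict(PromptTemplate(prompt))
    # extract_think_response strips the model's reasoning and returns the answer.
    return extract_think_response(res)["content"].strip()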

I have also included the ChromaDB query that applies the metadata filter criteria:

def query_vector_db_by_category(query, category):
    collection = _get_collection()
    results = collection.query(query_texts=query, where={"source": category})
    return results

If the category-based query for some reason doesn’t provide an answer, I move on to asking the LLM to rate the results from the initial kNN search on a scale from 0 to 10.

See the ranking prompt below:

def create_ranked_result_prompt(question, doc):
    return f"""Score how well the following paragraph answers the question.
Question: {question}
Paragraph: {doc}
Return only a score from 0 to 10. Do not include the reasoning for the score.
Only the options listed below are valid answers:
0
1
2
3
4
5
6
7
8
9
10
"""


def get_ranked_results(question, query_results):
    msgs = []
    for doc in query_results["documents"][0]:
        prompt = create_ranked_result_prompt(question, doc)
        p = PromptTemplate(prompt)
        ranking = Settings.llm.predict(p)
        ranking_res = extract_think_response(ranking)
        ranking_num = int(ranking_res["content"])
        print(f"THE RANKING Number is {ranking_num}")
        print(doc)
        print()
        if ranking_num >= 5:
            msgs.append(doc)
    return "\n".join(msgs)

In this implementation I am collecting all chunks with a score of 5 or above. The remaining chunks are discarded. One clear benefit of this is that we don’t end up taking up unnecessary space in the LLM’s context window with irrelevant chunks. If none of the chunks are rated 5 or higher, the fallback to ngrams is triggered. I will discuss this approach more in the next section.

Ngram matching

In situations where none of the results from embedding space provide an acceptable answer to the question, the chatbot falls back to ngram matching. In my case I am building an inverted index consisting of all 1-gram, 2-gram and 3-gram entries from the chunks. You can think of the inverted index as a map from the individual grams in all chunks to the ids of the corresponding chunks in the vector DB. Before querying, the input query is tokenized into grams that are mapped to entries in the inverted index. Finally, I pick the chunk ids with the highest number of matching grams and use the corresponding chunks in the final prompt.
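The lookup code further down relies on a vectorizer and an inverted index that are built elsewhere. As a reference, here is a sketch of how such an index could be built with scikit-learn's CountVectorizer configured for 1- to 3-grams; the exact implementation may differ, but it matches the self.vectorizer and self.inverted_index attributes used in the lookup below.

from collections import defaultdict

from sklearn.feature_extraction.text import CountVectorizer


def build_inverted_index(chunks):
    # Assumption: chunks is a list of chunk texts whose list position matches
    # the chunk id stored in the vector DB.
    vectorizer = CountVectorizer(ngram_range=(1, 3))
    analyzer = vectorizer.build_analyzer()
    inverted_index = defaultdict(list)
    for idx, chunk in enumerate(chunks):
        # The analyzer lowercases and tokenizes the text and emits all 1-, 2-
        # and 3-grams; each unique gram maps to the chunk ids it occurs in.
        for ngram in set(analyzer(chunk)):
            inverted_index[ngram].append(idx)
    return vectorizer, inverted_index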

Below is an excerpt from the inverted index where the property names represent grams and the numbers in the arrays are ids of chunks.

"millenium falcon": [ 5 ], "falcon discover": [ 5 ],

The code for looking up matches from the tokenized query can be found below:

def match_query_ngrams(self, query: str):
    query_tokens = self.vectorizer.build_analyzer()(query)
    print(f"Query n-grams: {query_tokens}")
    matched_chunks = defaultdict(int)
    for ngram in query_tokens:
        for idx in self.inverted_index.get(ngram, []):
            print(f"{ngram} matched to section {idx}")
            matched_chunks[idx] += 1
    print(f"matched chunks: {matched_chunks}")
    return sorted(matched_chunks.items(), key=lambda x: -x[1])
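The final flow later calls a query_by_ngram helper that turns these matches into prompt context. That part is not shown above, so the snippet below is only a sketch of how the top-ranked chunk ids could be mapped back to their text (the self.chunks lookup and the top_n cutoff are assumptions):

def query_by_ngram(self, query: str, top_n: int = 3):
    # Keep the chunk ids with the highest number of matching n-grams and join
    # their text into a single context string for the final prompt.
    matches = self.match_query_ngrams(query)
    top_ids = [idx for idx, _count in matches[:top_n]]
    return "\n".join(self.chunks[idx] for idx in top_ids)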

I have also included a screenshot of an example where ngram matching produced the response to the question “What is Millenium Falcon?”.

Question Rewriting

One of the goals of the chatbot was to add support for multi-turn conversations where the user and the chatbot can exchange multiple connected messages or turns. I decided to implement this as a conversation thread in the UI, but this also requires some extra consideration on the backend.

One of the challenges with a multi-turn conversation is that individual messages may not stand on their own in terms of providing enough context for the LLM to carry on the conversation. For usability reasons I wanted to avoid asking the user to repeat the context for every subsequent entry in the thread. As a workaround, I decided to go with query rewriting on the server to make every message self-contained without manual intervention from the user. In my implementation, I feed all the messages in a particular thread to the LLM and ask it to rewrite the current question to add any necessary context from previous entries in the thread.
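On the backend, each request therefore carries the current question together with the full thread. The final flow further down references a ChatContext object holding the question, the conversation history and the Wikipedia page title; below is a minimal sketch of what that structure could look like (the exact definition is an assumption):

from dataclasses import dataclass, field


@dataclass
class ChatContext:
    # Fields referenced by the predict() flow shown later: the current question,
    # the previous messages in the thread, and the Wikipedia page title.
    question: str
    wikiPageTitle: str
    history: list[str] = field(default_factory=list)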

Let’s look at an example:

Deep in the conversation you can see the question: “Did your hand heal?”. The main challenge with this question is the lack of clarity and detail. As a result, you may get sub-par results if you submit it as-is to the LLM. There are probably multiple ways to address this, but in my implementation, I ask the LLM to rewrite the query to incorporate more context before sending it to the final prompt.

In this case, the vague question “Did your hand heal?” was rewritten by the LLM to the following: “Did your hand heal after being severed during the duel with Darth Vader on Cloud City, following your rescue by Leia?”

I have included the rewrite logic below:

def create_rewrite_query_prompt(history, original_query, title):
    conversation = "\n".join(history)
    template = (
        "Previous conversation:\n"
        "---------------------\n"
        f"{conversation}\n"
        "---------------------\n"
        "Given the previous conversation, "
        "rewrite the Question to be self-contained by incorporating necessary context from the conversation. "
        "Do not include any assumptions in the response, only a clean, more detailed question. "
        "Ensure that the rewritten question does not ask for more information than the original question. "
        f"In this context you are {title}, so any reference to 'you' must be rewritten to address {title} instead of 'you'. "
        "If you know the names of people mentioned in the previous conversation, use their names rather than forms like 'the person' or 'you'. "
        "Ensure that you only rely on information provided in the previous conversation. "
        f"Question: {original_query}\n"
    )
    return template


def rewrite_query(history: list[str], original_query, title):
    if len(history) < 2:
        return original_query
    template = create_rewrite_query_prompt(history, original_query, title)
    res = Settings.llm.predict(PromptTemplate(template))
    rewritten = extract_think_response(res)
    print(f"DONE REWRITING {original_query} to {res}")
    return rewritten["content"]
Query Logic

Finally, let’s look at the full logic that determines how to execute the hybrid search described in the sections above.

The code that shows my overall flow is included below:

def predict(ctx: ChatContext):
    question = rewrite_query(ctx.history, ctx.question, ctx.wikiPageTitle)
    category = categorize_query(question)
    query_results = query_vector_db_by_category(question, category)
    documents = query_results["documents"][0]
    metadatas = query_results["metadatas"][0]

    context = None
    for i, metadata in enumerate(metadatas):
        if metadata and metadata.get("source") == category:
            print("Found matching document:")
            context = documents[i]
            break

    response = llm_predict(question, context)
    if response["content"] == "NOT FOUND":
        print("Not a good match by category, asking the llm to rank the search documents.")
        query_results = query_vector_db_by_question(question)
        context = get_ranked_results(question, query_results)
        response = llm_predict(question, context)

    if response["content"] == "NOT FOUND":
        print("No good search result from vector search. Falling back to ngram search.")
        context = query_by_ngram(question)
        response = llm_predict(question, context)

    if response["content"] == "NOT FOUND":
        response["content"] = "I am unable to respond to that. Please try to rephrase your question or add more context."

    return response

The main prompt used for the chat responses can be found below. This is the prompt where the RAG content from the hybrid search is incorporated.

def create_system_prompt(context_str, query_str):
    template = (
        "Context information is below.\n"
        "---------------------\n"
        f"{context_str}\n"
        "---------------------\n"
        "You are the main character described in the context information. "
        "You will answer questions without providing more information than strictly necessary. "
        "Answer the question exactly as asked. Do not provide any information beyond what is asked in the question. "
        "Rely only on the provided context information. "
        "Write the answer in first person as the person described in the context.\n"
        "If you can't find answers to all questions based on the context, respond with the text: 'NOT FOUND'\n"
        f"Query: {query_str}\n"
    )
    return template
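The llm_predict helper used throughout the flow is not shown above; below is a minimal sketch of how it could combine this system prompt with the LLM call, assuming the same Settings.llm and extract_think_response helpers used earlier.

def llm_predict(question, context):
    # Fall back to an empty context string if no chunk was retrieved.
    prompt = create_system_prompt(context or "", question)
    res = Settings.llm.predict(PromptTemplate(prompt))
    # Assumption: extract_think_response separates the model's reasoning from
    # the final answer and returns a dict with a "content" key, as used earlier.
    return extract_think_response(res)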
Conclusion

Overall, I am pretty happy with how the chatbot turned out. Some areas can still be improved, particularly around performance, but some of that is also a limitation of my hardware (only 8 GB of VRAM!). In case you are interested in taking a detailed look at the code, I've put it up on GitHub.