Teach Me Cool Stuff

Fine Tuning a Local LLM to Categorize Questions

Published: 16 Jun, 2026

As a fun personal project, I have been working on a chatbot that answers general questions about my household, covering anything from maintenance to doctor’s appointments.

The general idea is that the chatbot will get its household knowledge through RAG from querying a vector database, but for better results I have made the vector searches metadata aware.

Basically, I run each question through a pre-processing step that maps it to one of a set of known metadata categories (e.g. pool, car, hvac, cooking). The main goal is to narrow the search space for vector ranking to only those indexed entries that match the category of the question. For example, the question “When did we replace our pool pump?” is mapped to the category “pool” before the index database is queried.
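As a rough illustration of the category filter, here is a minimal Python sketch. It is not the project's actual retrieval code: the `Entry` class, the `search` helper, the entry texts and the three-dimensional embeddings are all made up for the example, and a real setup would use a vector database and an embedding model instead of an in-memory list.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    text: str
    category: str           # metadata label stored alongside the vector
    embedding: list[float]  # toy vector; a real one would come from an embedding model


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(index, query_embedding, category, top_k=3):
    """Filter by category first, then rank only the survivors by similarity."""
    candidates = [e for e in index if e.category == category]
    candidates.sort(key=lambda e: cosine(e.embedding, query_embedding), reverse=True)
    return candidates[:top_k]


# Hypothetical index entries with made-up embeddings.
INDEX = [
    Entry("Pool pump notes", "pool", [0.9, 0.1, 0.0]),
    Entry("Pool filter notes", "pool", [0.6, 0.4, 0.0]),
    Entry("Car water pump notes", "car", [0.95, 0.05, 0.0]),
]

hits = search(INDEX, query_embedding=[1.0, 0.0, 0.0], category="pool")
print([h.text for h in hits])
```

In this toy index the car entry is the closest vector to the query, so without the category filter it would rank first. With the filter in place, only the two pool entries are ever considered.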

The hypothesis I want to test in this experiment is whether a very small local LLM can be fine-tuned to perform reliable question categorization when trained on a dataset of household-related questions.

LLMs

In this project I am using two different local LLMs: Qwen 3:4B and Qwen 3:0.6B. The 4B parameter version is used for general question answering, while the super tiny 0.6B version is used to categorize questions. The whole premise of this experiment is to see if a tiny LLM with only 600M parameters can be finetuned into a reliable classifier of household questions.

Finetuning

For finetuning I am using a popular open-source framework called Unsloth, which seems well suited for tuning local models like Qwen and Llama.

For training purposes, my initial dataset consists of about 850 entries, which I split 70/15/15 into training, eval and test data respectively. The training and eval data are used during training, while the test dataset is withheld and used to test the model after training. Sample data is shown below:

[
  { "question": "Who cleans our gutters at the house?", "category": "gutters" },
  { "question": "Who serviced the hot water heater for the home?", "category": "water heater" },
  { "question": "Who fixed the sprinkler system in the yard?", "category": "irrigation" },
  { "question": "Which store do we usually buy pinnekjott from?", "category": "cooking" },
  { "question": "What dimensions are the air filters for the home AC?", "category": "hvac" },
  { "question": "What year did we replace the downstairs AC unit?", "category": "hvac" }
]

The basic idea is to train the LLM on a sufficient set of household questions to teach it to become a reliable question classifier.
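The 70/15/15 split described above could be sketched as follows. This is only an assumption about how such a split might be done, not the project's actual script: the `split_dataset` name and the fixed seed are hypothetical, and the sketch shuffles the whole dataset without stratifying by category.

```python
import random


def split_dataset(records, train_pct=70, eval_pct=15, seed=42):
    """Shuffle once, then cut into train/eval/test; test gets the remainder."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    # Integer arithmetic avoids floating-point surprises at the boundaries.
    n_train = len(shuffled) * train_pct // 100
    n_eval = len(shuffled) * eval_pct // 100
    train = shuffled[:n_train]
    eval_set = shuffled[n_train:n_train + n_eval]
    test_set = shuffled[n_train + n_eval:]
    return train, eval_set, test_set


# Placeholder records standing in for the real question/category pairs.
records = [{"question": f"question {i}", "category": "pool"} for i in range(850)]
train, eval_set, test_set = split_dataset(records)
print(len(train), len(eval_set), len(test_set))
```

With small per-category counts, a stratified split (one that preserves the category proportions in each part) may be worth considering, so that no category ends up missing from the eval or test data.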

Baseline

Before doing any finetuning, it’s important to establish a baseline to measure against. In this experiment the baseline is to try to use the original Qwen 0.6B model “as is” through prompting alone. A sample prompt used for the baseline can be found below:

Classify the homeowner question into exactly one category from the list below.
Return only the category name from the list.
Never return a code, a number, a synonym, an explanation, or any other text.
The answer must be exactly one category name from the list.
Choose the best category based on the meaning of the question.
Valid categories:
- appliances
- brick work
- car
- cooking
- doorbell
- electric
- fence
- fountain
- garden lights
- gutters
- hvac
- irrigation
- mosquito
- painting
- pool
- tree service
- water heater
- window service
Question: Who installed the tankless hot water setup for the house?
Category:
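Since the category list will grow over time, it can help to generate the prompt from the list rather than maintain it by hand. The sketch below is one possible way to do that. The `build_prompt` helper and the exact line breaks are assumptions; only the wording and the category names come from the prompt above.

```python
CATEGORIES = [
    "appliances", "brick work", "car", "cooking", "doorbell", "electric",
    "fence", "fountain", "garden lights", "gutters", "hvac", "irrigation",
    "mosquito", "painting", "pool", "tree service", "water heater",
    "window service",
]

INSTRUCTIONS = (
    "Classify the homeowner question into exactly one category from the list below.\n"
    "Return only the category name from the list.\n"
    "Never return a code, a number, a synonym, an explanation, or any other text.\n"
    "The answer must be exactly one category name from the list.\n"
    "Choose the best category based on the meaning of the question."
)


def build_prompt(question, categories=CATEGORIES):
    """Assemble instructions, the allowed categories, and the question."""
    lines = [INSTRUCTIONS, "Valid categories:"]
    lines += [f"- {name}" for name in categories]
    lines += [f"Question: {question}", "Category:"]
    return "\n".join(lines)


prompt = build_prompt("Who installed the tankless hot water setup for the house?")
print(prompt)
```

The prompt ends on `Category:` so that the model's next tokens are expected to be the label itself.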

Accuracy of Baseline model:

As one of my offline eval methods I have created a battery of ~130 integration tests to test the model with scenarios from a second dataset. For the baseline model, the results are poor. Out of 131 tests the model only categorized 13 questions correctly (~10% correct responses). See summary below:

{
  "scenario": "baseline-category",
  "model_kind": "baseline",
  "model_name": "qwen3:0.6b",
  "label_mode": "category",
  "total": 131,
  "correct": 13,
  "incorrect": 118,
  "accuracy": 0.0992
}
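For reference, the accuracy figure is just the number of correct predictions divided by the total, rounded to four decimals (13 / 131 ≈ 0.0992). A minimal sketch of a function producing a summary in this shape is shown below. The `summarize` helper is hypothetical, and the per-case results are synthetic stand-ins for the real test output.

```python
def summarize(results, scenario, model_kind, model_name, label_mode):
    """Aggregate per-case results into a summary dict like the one above."""
    total = len(results)
    correct = sum(1 for r in results if r["correct"])
    return {
        "scenario": scenario,
        "model_kind": model_kind,
        "model_name": model_name,
        "label_mode": label_mode,
        "total": total,
        "correct": correct,
        "incorrect": total - correct,
        "accuracy": round(correct / total, 4) if total else 0.0,
    }


# Synthetic per-case results: 13 correct out of 131, as in the baseline run.
results = [{"case_id": i + 1, "correct": i < 13} for i in range(131)]
summary = summarize(results, "baseline-category", "baseline", "qwen3:0.6b", "category")
print(summary)
```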

When digging into the actual failures a few common patterns emerge:

- The model mostly overuses broad labels like electric and appliances, and misses most of the other categories (e.g. pool, cooking, hvac).
- The model invents new categories (e.g. apartments) and doesn’t stick to the provided list of allowed categories.

I have provided an excerpt from the test report below:

[
  {
    "case_id": 1,
    "question": "When was the lower air conditioning system swapped out?",
    "expected_category": "hvac",
    "scenario": "baseline-category",
    "model_kind": "baseline",
    "model_name": "qwen3:0.6b",
    "label_mode": "category",
    "predicted_category": "electric",
    "correct": false
  },
  {
    "case_id": 64,
    "question": "Which painter worked on Joe's room?",
    "expected_category": "painting",
    "scenario": "baseline-category",
    "model_kind": "baseline",
    "model_name": "qwen3:0.6b",
    "label_mode": "category",
    "predicted_category": null,
    "predicted_code": null,
    "correct": false,
    "status_code": 422,
    "error": "Ollama returned an unknown category name 'apartments' from response 'apartments'"
  }
]
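The second case in the excerpt shows the model's answer being rejected because it is not in the allowed list. A strict validation step along those lines might look like the sketch below. The `parse_category` helper and the `UnknownCategoryError` class are hypothetical names, and trimming whitespace and lower-casing the response before the check are assumptions rather than something the report confirms.

```python
ALLOWED = {
    "appliances", "brick work", "car", "cooking", "doorbell", "electric",
    "fence", "fountain", "garden lights", "gutters", "hvac", "irrigation",
    "mosquito", "painting", "pool", "tree service", "water heater",
    "window service",
}


class UnknownCategoryError(ValueError):
    """Raised when the model answers with something outside the allowed list."""


def parse_category(response, allowed=ALLOWED):
    """Return the category if the response is an allowed name, else raise."""
    cleaned = response.strip().lower()
    if cleaned not in allowed:
        raise UnknownCategoryError(
            f"unknown category name {cleaned!r} from response {response!r}"
        )
    return cleaned


print(parse_category("  HVAC\n"))
try:
    parse_category("apartments")
except UnknownCategoryError as err:
    print("rejected:", err)
```

Rejecting unknown labels outright, instead of guessing the nearest category, keeps invented answers visible in the test report.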

Finetuning – 1st attempt

The results from the baseline made it clear that a tiny model like Qwen 3 0.6B cannot provide reliable performance through prompting alone.

As for the next experiment, I am using the same prompt as before, but I am doing model finetuning to teach the model how to categorize with greater accuracy.

I have included the finetuning script here in case you are interested in checking it out. At a high level, I am using Unsloth with QLoRA as the finetuning strategy. One note: the default finetuning parameters that Unsloth ships with are a very good starting point. In my experience, it’s more important to come up with a good dataset than to worry about tweaking the Unsloth values, at least to start.

One common pitfall to avoid though is overfitting on the training data, which is why it’s important to test the model on data not found in the training data. In addition to the static training/test data I have also incorporated a way to provide user feedback to amend the training data as a second channel during future retraining.
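One cheap safeguard related to this point is to check that no test question also appears in the training data, since such leaks would inflate the measured accuracy. The sketch below is an assumption about how that check could be done, not part of the project's script. The `find_leaks` and `normalize` helpers are hypothetical, and the comparison only catches duplicates that match after lower-casing and stripping punctuation, not paraphrases.

```python
import re


def normalize(question):
    """Lower-case and collapse punctuation/whitespace so near-identical text matches."""
    return re.sub(r"[^a-z0-9]+", " ", question.lower()).strip()


def find_leaks(train, test):
    """Return test records whose normalized question also appears in the training data."""
    seen = {normalize(r["question"]) for r in train}
    return [r for r in test if normalize(r["question"]) in seen]


train = [
    {"question": "Who cleans our gutters at the house?", "category": "gutters"},
    {"question": "What year did we replace the downstairs AC unit?", "category": "hvac"},
]
test = [
    {"question": "who cleans  our gutters at the house", "category": "gutters"},
    {"question": "When was the lower air conditioning system swapped out?", "category": "hvac"},
]

leaks = find_leaks(train, test)
print(leaks)
```

Here the first test question is flagged because it differs from a training question only in casing, spacing and punctuation, while the reworded air conditioning question passes.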

Result:

After running the battery of integration tests, I observed a clear improvement in prediction accuracy as seen in the report below:

{
  "scenario": "finetuned-category",
  "model_kind": "finetuned",
  "model_name": "our-house-qwen3-0.6b-category-names",
  "label_mode": "category",
  "total": 131,
  "correct": 104,
  "incorrect": 27,
  "accuracy": 0.7939
}

The prediction accuracy is up from 10% to 79%, but I still see some clear patterns of incorrect results:

- The model is clearly heading in the right direction, but it sometimes emits only fragments or variants of a category from the allowed list, for example ac or air instead of hvac.
- The model gets confused by semantically overlapping categories, such as the water-related trio of fountain, water heater and pool.

Finetuning – 2nd attempt

An easy improvement on the first finetuning experiment would be to add a post-processing step. This would allow me to normalize results where the prediction is semantically correct, but syntactically incorrect (e.g. ac, air). Another tweak would be to build more reinforcement into the prompt itself by providing more examples that tell the model what to do and what not to do. I would say both ideas are reasonable, but they lead to more maintenance as more categories are added.
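For illustration, the post-processing idea could be as small as an alias table. The sketch below is hypothetical and was not part of the experiments: the `normalize_prediction` helper is a made-up name, and the only aliases included are the two misses mentioned above (ac and air).

```python
ALLOWED = {
    "appliances", "brick work", "car", "cooking", "doorbell", "electric",
    "fence", "fountain", "garden lights", "gutters", "hvac", "irrigation",
    "mosquito", "painting", "pool", "tree service", "water heater",
    "window service",
}

# Hand-maintained map from observed near-misses to the allowed category.
ALIASES = {"ac": "hvac", "air": "hvac"}


def normalize_prediction(raw, allowed=ALLOWED, aliases=ALIASES):
    """Return an allowed category, using the alias table as a fallback, or None."""
    cleaned = raw.strip().lower()
    if cleaned in allowed:
        return cleaned
    return aliases.get(cleaned)  # None when the answer is still unknown


for raw in ["ac", "Air ", "pool", "apartments"]:
    print(repr(raw), "->", normalize_prediction(raw))
```

The maintenance cost is visible here: every new category, and every new way the model gets one wrong, means another entry in `ALIASES`.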

Instead, I wanted to see if I could tweak the finetuning approach slightly by making some changes to how I teach the model to map categories.

It turns out that a minor change to the prompt improves accuracy even further compared to the 1st experiment. The tweak is simple: I map the categories to two-character opaque IDs with no semantic overlap, as seen in the sample below:

Classify the homeowner question into exactly one label from the list below.
Return only the short label code from the list.
Never return the category name, a number, a synonym, an explanation, or any other text.
The answer must be exactly one uppercase two-letter code.
Choose the best label based on the meaning of the question.
Valid labels:
AA = appliances
BB = brick work
CC = car
DD = cooking
EE = doorbell
FF = electric
GG = fence
HH = fountain
II = garden lights
JJ = gutters
KK = hvac
LL = irrigation
MM = mosquito
NN = painting
OO = pool
PP = tree service
QQ = water heater
RR = window service
Question: Who installed the tankless hot water setup for the house?
Code:

Now, I ask the model to output a fixed format code instead of a variable category string with potentially overlapping meaning (e.g. water-based categories).
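The code table in the prompt follows a simple pattern: the categories in alphabetical order, each paired with a doubled uppercase letter. A sketch that generates the table and decodes a model response is shown below. Whether the original codes were generated this way is an assumption, and `build_code_map`, `label_lines` and `decode` are hypothetical helper names.

```python
import string

CATEGORIES = [
    "appliances", "brick work", "car", "cooking", "doorbell", "electric",
    "fence", "fountain", "garden lights", "gutters", "hvac", "irrigation",
    "mosquito", "painting", "pool", "tree service", "water heater",
    "window service",
]


def build_code_map(categories):
    """Map doubled letters (AA, BB, ...) to categories in alphabetical order."""
    ordered = sorted(categories)
    if len(ordered) > len(string.ascii_uppercase):
        raise ValueError("this scheme only covers up to 26 categories")
    return {letter * 2: name for letter, name in zip(string.ascii_uppercase, ordered)}


def label_lines(code_map):
    """Render the 'Valid labels' section of the prompt."""
    return [f"{code} = {name}" for code, name in code_map.items()]


def decode(response, code_map):
    """Translate the model's code back to a category; None if it is not a known code."""
    return code_map.get(response.strip())


code_map = build_code_map(CATEGORIES)
print("\n".join(label_lines(code_map)))
print(decode(" QQ\n", code_map))
```

Because the codes are derived from the sorted list, adding a category in the middle of the alphabet shifts the codes after it, so a model finetuned on the old table would need retraining. Appending new codes at the end instead would avoid that.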

The interesting part is that I see a very nice boost in performance from this simple change as seen in the summary below:

{
  "scenario": "finetuned-code",
  "model_kind": "finetuned",
  "model_name": "our-house-qwen3-0.6b",
  "label_mode": "code",
  "total": 131,
  "correct": 120,
  "incorrect": 11,
  "accuracy": 0.916
}

As you can see, prediction accuracy is now at ~92% (120 of 131 correct), up from 79% in the first attempt. It appears that asking for fixed, non-overlapping output helps the tiny Qwen model when generating responses.

There are still a few misses though. I have included the specific failures below:

- Case 15: water heater -> pool | When was the home's tankless hot water system last checked?
- Case 53: gutters -> mosquito | What did CompanyA bill us for the gutter cleaning visit?
- Case 62: mosquito -> garden lights | Which section of the mosquito misting line needed repair?
- Case 73: water heater -> pool | Who put in the tankless hot water system?
- Case 74: water heater -> pool | What manufacturer made the home's tankless water heater?
- Case 99: fountain -> pool | Who serviced the pump for the front water feature?
- Case 106: gutters -> mosquito | Who do we use for gutter cleaning service?
- Case 114: mosquito -> garden lights | What fluid do we pour into the mosquito misting system?
- Case 126: water heater -> pool | Who installed the tankless hot water setup for the house?
- Case 127: water heater -> pool | When was the tankless heater maintenance done last?
- Case 128: water heater -> pool | What brand is the tankless water unit we use at home?

At this point the predictions are generally reliable, and the finetuned LLM serves as a usable predictor in my chatbot, but there are still issues to work on. The one that stands out is water heater -> pool, which accounts for 6 of the 11 misses and is likely still caused by the overlapping “watery” meaning of those two categories. To address this, I will likely have to revisit the training data and make it more nuanced.
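To see which confusions dominate, the failure list above can be tallied by (expected, predicted) pair. The sketch below hard-codes the eleven cases from the list, reading each line as expected -> predicted. A real script would read them from the test report instead.

```python
from collections import Counter

# (case_id, expected, predicted), transcribed from the failure list above.
FAILURES = [
    (15, "water heater", "pool"),
    (53, "gutters", "mosquito"),
    (62, "mosquito", "garden lights"),
    (73, "water heater", "pool"),
    (74, "water heater", "pool"),
    (99, "fountain", "pool"),
    (106, "gutters", "mosquito"),
    (114, "mosquito", "garden lights"),
    (126, "water heater", "pool"),
    (127, "water heater", "pool"),
    (128, "water heater", "pool"),
]

# Count how often each expected category was mistaken for each predicted one.
confusions = Counter((expected, predicted) for _, expected, predicted in FAILURES)

for (expected, predicted), count in confusions.most_common():
    print(f"{expected} -> {predicted}: {count}")
```

The tally shows four distinct confusion pairs. Seven of the eleven misses land on pool, which suggests the training data for pool and its neighboring categories is the first place to look.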

A sample chat interaction can be seen in the screenshot below. Pay special attention to the little category tag in the blue question bubbles (e.g. “pool”), since that is the part that is automatically classified by the tiny Qwen 3:0.6B LLM.

I have included the GitHub repo here in case you are interested in checking it out.

Update (6/22/2026)

While the main focus of this article was learning about finetuning tiny models, I did receive feedback from multiple readers that I should consider a simpler classification approach than LLM finetuning. This inspired me to run a new experiment using Logistic Regression as the classifier. If interested, you can find the new article here.