In a previous article I showed how to fine-tune a local LLM to categorize questions into a set of known categories. The use case is a chatbot I am working on, where the predicted categories are used as metadata for RAG queries. The original article was well received, but multiple readers suggested that it might be worth exploring other, simpler approaches to classification. Based on this feedback, I decided to explore whether logistic regression might be a worthwhile alternative for my chatbot question classifier.

Logistic Regression

Logistic regression is a classification algorithm used to predict the probability that a given input belongs to a particular class or category. This sounds perfect for question categorization since we are trying to map questions to a fixed list of categories (e.g. hvac, electric, cooking, etc.).
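For reference, one common multi-class formulation (multinomial, or softmax, logistic regression) is written out below. This is a general textbook form, not necessarily the exact variant used in this project. Here x is the input vector, K is the number of categories, and w_k and b_k are the learned weights and bias for category k.

```latex
P(y = k \mid x) = \frac{\exp\left(w_k^\top x + b_k\right)}{\sum_{j=1}^{K} \exp\left(w_j^\top x + b_j\right)}, \qquad k = 1, \dots, K
```

The probabilities over all K categories sum to 1, which is what makes the output usable as a distribution over categories.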

The first question that comes to mind is: How can we reliably run a classification prediction on a piece of text like a question?

Luckily, embeddings provide us with a great framework for converting text to a numeric representation that preserves much of the semantic meaning of the original text. The main benefit is that we can train the classifier on embedding vectors of known question/category pairs. In the chat application I can then feed questions through the same embedding model and have the classifier predict a category based on where the new embedding falls relative to the training data.
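To make the idea concrete, here is a minimal sketch in plain Python of training a multi-class logistic regression on vectors and then classifying a new one. It is not the code from this project. The three-dimensional vectors are invented stand-ins for real embeddings, and the function names, learning rate, and epoch count are all assumptions chosen for illustration. In practice you would get the vectors from an embedding model and most likely use a library implementation such as scikit-learn's LogisticRegression.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(vector, weights, biases):
    """Return one probability per class for a single embedding vector."""
    scores = [
        sum(w * x for w, x in zip(class_weights, vector)) + bias
        for class_weights, bias in zip(weights, biases)
    ]
    return softmax(scores)

def train(vectors, labels, classes, epochs=200, learning_rate=0.5):
    """Fit multinomial logistic regression with plain stochastic gradient descent."""
    dim = len(vectors[0])
    weights = [[0.0] * dim for _ in classes]
    biases = [0.0] * len(classes)
    for _ in range(epochs):
        for vector, label in zip(vectors, labels):
            probs = predict_proba(vector, weights, biases)
            for k, cls in enumerate(classes):
                # Gradient of the cross-entropy loss with respect to the class score.
                error = probs[k] - (1.0 if cls == label else 0.0)
                for j in range(dim):
                    weights[k][j] -= learning_rate * error * vector[j]
                biases[k] -= learning_rate * error
    return weights, biases

# Invented 3-dimensional "embeddings"; a real embedding model returns far longer vectors.
training_vectors = [
    [0.9, 0.1, 0.0], [0.8, 0.2, 0.1],   # hvac questions
    [0.1, 0.9, 0.0], [0.2, 0.8, 0.1],   # electric questions
    [0.0, 0.1, 0.9], [0.1, 0.2, 0.8],   # pool questions
]
training_labels = ["hvac", "hvac", "electric", "electric", "pool", "pool"]
classes = ["hvac", "electric", "pool"]

weights, biases = train(training_vectors, training_labels, classes)

# A new question is embedded the same way, then classified.
question_vector = [0.85, 0.15, 0.05]
probs = predict_proba(question_vector, weights, biases)
predicted = classes[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

The important design point is that the same embedding step must be used for training and for prediction, since the classifier only ever sees vectors, never the original text.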

Training

To keep the comparison consistent, I am training the LogisticRegressionClassifier on the same dataset as the fine-tuned LLM. For more details, you are welcome to look at the code on GitHub. Following training, I run the original set of integration tests, with side-by-side metrics for the original fine-tuned LLM and the new logistic regression classifier.

Test Results

As mentioned before, the fine-tuned LLM provides very good results, but I was happy to see that the simpler logistic regression approach outperforms it. If you look at the side-by-side metrics below, you will see that accuracy on this dataset increased from 92% to 98% when using logistic regression. Prediction is also noticeably faster: average inference time dropped from about 75 ms to about 12 ms. On top of that, the time to train the LogisticRegressionClassifier is much shorter.

[
  {
    "scenario": "finetuned-code",
    "model_kind": "finetuned",
    "model_name": "our-house-qwen3-0.6b",
    "label_mode": "code",
    "total": 131,
    "correct": 120,
    "incorrect": 11,
    "accuracy": 0.916,
    "average_inference_duration_ms": 74.71,
    "max_inference_duration_ms": 127.83,
    "average_logistic_embedding_duration_ms": null,
    "average_logistic_classifier_predict_duration_ms": null,
    "average_logistic_classifier_predict_proba_duration_ms": null
  },
  {
    "scenario": "logistic-regression",
    "model_kind": "logistic_regression",
    "model_name": "logistic_regression",
    "label_mode": "classifier",
    "total": 131,
    "correct": 129,
    "incorrect": 2,
    "accuracy": 0.9847,
    "average_inference_duration_ms": 11.91,
    "max_inference_duration_ms": 116.46,
    "average_logistic_embedding_duration_ms": 7.97,
    "average_logistic_classifier_predict_duration_ms": 2.89,
    "average_logistic_classifier_predict_proba_duration_ms": 0.98
  }
]
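The headline numbers can be re-derived from the raw counts and timings in the metrics above. The short sketch below does that arithmetic. The dictionary keys are shortened names of my own choosing, not the field names from the report.

```python
# Counts and average timings copied from the metrics above (key names shortened).
results = {
    "finetuned": {"total": 131, "correct": 120, "avg_ms": 74.71},
    "logistic_regression": {"total": 131, "correct": 129, "avg_ms": 11.91},
}

# Accuracy is simply correct predictions divided by total test cases.
accuracy = {name: r["correct"] / r["total"] for name, r in results.items()}

# Ratio of average inference times: how many times faster the classifier is.
speedup = results["finetuned"]["avg_ms"] / results["logistic_regression"]["avg_ms"]

for name, value in accuracy.items():
    print(f"{name}: {value:.1%} accuracy")
print(f"average inference speedup: {speedup:.1f}x")
```

On these numbers the logistic regression path is roughly six times faster on average, although the maximum durations (127.83 ms vs. 116.46 ms) are much closer.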

One difference when using logistic regression is that the classifier returns a probability distribution over all the possible categories instead of a single category. In my current implementation I just pick the category with the highest probability, which seems to work well for this project.
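Picking the highest-probability category amounts to an argmax over the returned distribution. A minimal sketch, using the four probabilities reported for case 99 in the table below (the table does not list every category, so these values do not sum to 1):

```python
# Probabilities for case 99 as reported in the results table (partial distribution).
probabilities = {"electric": 0.0231, "fountain": 0.1534, "hvac": 0.0342, "pool": 0.3662}

# Pick the category whose probability is highest (argmax).
predicted_category = max(probabilities, key=probabilities.get)
print(predicted_category)  # pool
```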

For the two incorrect predictions I have provided more details in the table below.

| case_id | question | expected_category | predicted_category | correct | probability_electric | probability_fountain | probability_hvac | probability_pool |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 95 | What electrician company do we call for work at the house? | electric | hvac | FALSE | 0.1969 | 0.0262 | 0.2266 | 0.0328 |
| 99 | Who serviced the pump for the front water feature? | fountain | pool | FALSE | 0.0231 | 0.1534 | 0.0342 | 0.3662 |

The first misprediction (electric -> hvac) shows fairly similar probabilities (0.20 vs. 0.23), but my simple algorithm currently favors the category with the highest probability regardless of how small the gap is. The second misprediction is off by a bigger margin, but it looks like the familiar water-based category confusion between fountain and pool. Adding more nuanced training data will likely help address these types of issues.
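One possible refinement, which I have not implemented, would be to flag predictions where the top two probabilities are close together. The sketch below illustrates the idea using only the four probabilities shown in the table. The function name and the 0.05 threshold are hypothetical choices for illustration, not tuned values.

```python
def classify_with_margin(probabilities, min_margin=0.05):
    """Return (label, margin, is_confident) for a category -> probability mapping.

    min_margin is a hypothetical threshold; it would need tuning on real data.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (top_label, top_p), (_, second_p) = ranked[0], ranked[1]
    margin = top_p - second_p
    return top_label, margin, margin >= min_margin

# Probabilities copied from the table (only four categories are listed there).
case_95 = {"electric": 0.1969, "fountain": 0.0262, "hvac": 0.2266, "pool": 0.0328}
case_99 = {"electric": 0.0231, "fountain": 0.1534, "hvac": 0.0342, "pool": 0.3662}

label_95, margin_95, confident_95 = classify_with_margin(case_95)
label_99, margin_99, confident_99 = classify_with_margin(case_99)

print(label_95, round(margin_95, 4), confident_95)
print(label_99, round(margin_99, 4), confident_99)
```

With this threshold, case 95 would be flagged as ambiguous (the gap between hvac and electric is about 0.03), while case 99 would still pass as a confident but wrong prediction. A margin check of this kind can only catch the first type of error, which is why better training data still matters for the second.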

Conclusion

I think this was a fun continuation of the original experiment. The main goal of the initial experiment was to get hands-on experience with LLMs and fine-tuning. However, I think these results reinforce the general advice that we should always explore simpler solutions first, before jumping to more complex and resource-intensive ones.