Large Language Models as Classification Engines: Overkill, or Awesome?
Why it's an odd choice. Why you should try it anyway. And how to go about it.
Have you ever wanted to build a trillion-parameter labelling machine? And by that I mean, would you use a Large Language Model to solve a classification problem?
It's an unlikely fit for an LLM. And yet, there are good reasons to use LLMs for such tasks, and emerging architectures and techniques to do so. Many real-world use cases need a classifier, and many product development teams will soon find themselves wondering, could GPT handle that?
In today's post, I'll dive into how we're tackling this issue while building a next-generation conversational assistant at Switzerland's largest telecommunications provider.
I'll cover:
Why LLM classifiers are an odd idea, but why you might want to try them anyway
Possible approaches and architectures, using customer intent detection as a real-world use case
Universal challenges, and our lessons learned so far
Sound good? Let's go!
LLMs as classifiers is a weird idea…
Let's start by comparing classification algorithms and LLMs. It turns out they're almost opposites of each other.
Classification is a classic, (theoretically) simple machine learning (ML) task, where the model takes a structured input in the format it's used to seeing, and outputs a single label: dog or cat (image classification), positive or negative (sentiment detection), spam or not spam, and so on. LLMs, by contrast, are complex algorithms, designed to solve many different kinds of tasks. They take unstructured inputs and generate potentially long, complex outputs.
Classification is usually just one step in a bigger process, so it must be fast, accurate, and consistent: the same sorts of input should get the same label out at the end. LLMs, however, are slow, random, and prone to confabulation. Moreover, classification models often need to be interpretable, for ethical and safety reasons. For example, if an ML system classifies a criminal as being at high risk of re-offending, we need to know why it made that decision. But sadly, LLMs are not at all interpretable.
… but you might want to do classification with LLMs anyway
So LLMs as classifiers doesn't necessarily seem like a natural fit. But it can still make sense and deliver certain benefits. For example:
Prototyping: LLMs make it easy and fast to try out ideas. For example, you could role-play how your software should behave with your users, and try out different user journeys. It's also pretty simple to call a model via its API from a few lines of code, enabling you to explore more complex implementations before you really start building.
LLMs don't require any training (to get started), thanks to their built-in world knowledge.
Scalability and flexibility when adding new classes: In an ML classification task, adding a new class label requires retraining the model with new labelled samples. But with LLMs and few-shot prompting, you "just"* need to add a new class description to your prompt. (* More on the actual challenges of this, later.)
If you can turn your problem into a classification task, you can potentially take advantage of existing data sets and benchmarks. Especially in the prototyping stage, this helps you assess feasibility, before you invest significant effort preparing your own labelled test data. You'll also be able to better judge whether an LLM is really worth the additional challenges it brings.
Common classification metrics like accuracy, precision/recall and F1 score are easy and fast to compute (a big advantage over LLM-judges, which I complained about here); see the quick sketch after this list.
LLMs can handle diverse, unstructured and even inconsistent data. Contrast that with many ML algorithms, where patching data gaps (using e.g. imputation or a placeholder value) can harm the model's performance. Plus, multi-modal models even allow new kinds of features from different modalities.
Finally, if you have a multilingual use case, it's easier (than other approaches!) to build in one language and port to another afterwards.
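To illustrate the point about metrics: here's a minimal sketch of how cheaply they can be computed, using scikit-learn and a handful of made-up intent labels:

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up ground-truth and predicted intent labels, purely for illustration
y_true = ["billing", "cancellation", "sales", "billing", "tech_support"]
y_pred = ["billing", "sales", "sales", "billing", "tech_support"]

print(accuracy_score(y_true, y_pred))  # overall accuracy
print(classification_report(y_true, y_pred, zero_division=0))  # per-class precision, recall, F1
```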
Worked Example: Intent detection for a conversational AI agent
Let's look at a real example from my work at a large telco. Currently, when a customer calls our voice bot, the bot's main task is to route the customer to the correct service team. This requires detecting which product or service the customer is calling about and what they need to get done. That's a classification task.
Why intent detection is hard…
Our current system uses a combination of ML models and rule-based logic to identify the correct routing target. Many factors make this a challenge:
Variability in how customers express themselves can be confusing, even for human experts who know our business inside and out.
Customers can have multiple intents, making it harder to identify the best one to solve (first), and raising questions like whether to output multiple predictions, and how to handle them.
Context may be even more informative than what the customer actually says. What's the status of the customer and their products? Have they been blocked because of an unpaid bill? Is the internet out in their area? We need to find clever ways to incorporate such information, since customers aren't great at explaining themselves.
A big challenge is that meaning is built up over multiple turns of dialogue. Imagine a conversation along these lines:

Bot: Hi! How can I help you today?
Customer: It's about the letter you sent me.
Bot: Is that about your mobile subscription, or your internet plan?
Customer: The second one.
On its own, that final utterance means nothing. Clearly we should only attempt to classify the whole conversation. But we can't predict the future: when the conversation starts, we don't know how long it's going to go, or how (un)helpful the customer's utterances will be. Thus, the question of when and how to trigger classification is not as trivial as it seems.
Customers might not even know what they want or need. Again, imagine a conversation like this one:

Customer: I want to cancel my subscription.
Bot: I'm sorry to hear that! I'll put you through to our cancellation team.
Now imagine that the cancellation specialist, after chatting with the customer, realises that they don't really want to cancel; they just need to downgrade their plan temporarily. So this should have been a "sales" intent, as it's the sales team who take care of switching plan levels. Such problems can arise if there's a misalignment between how our org chooses to distribute agent skills across teams, and how customers think about and describe their problems. This makes intent detection harder, but it's a practical reality: not every team member can be trained to resolve every process.
Finally, data is always noisy. Language data often contains spelling mistakes, hashtags, emojis, and, in our case, errors from the speech-to-text transcription service.
… But we do it anyway
Despite its challenges, there are benefits to intent-based logic:
It helps us simplify natural language to make it workable. We'll never be able to capture the variability of the way customers talk to us, but at least when our teams design customer experiences, we can use intent categories to align on the kinds of things customers call about.
Intent-based logic makes our system more deterministic, as we're limiting the set of paths it can take. Plus, it's more interpretable: seeing which intent a model predicted can help us investigate errors down the line. It's also more testable, because a classification task can have a concrete ground truth label we can compare against.
This approach can also help us gracefully handle LLM hallucinations, because we can verify whether a predicted value is valid or not.
Finally, the intent-based approach can be useful for the broader business. For example, insights on how often people call about different intents can help managers better staff our call-centres to meet demand.
Couldn't we use LLMs for the lot?
You might be wondering, why not just use an LLM end to end? Instead of trying to detect the customer's intent and then look up the correct routing, why not write a giant prompt that describes the sorts of tasks each team can do, and let the model figure out where to send each customer? Why not go all-in and build an agent, which is empowered to call tools and APIs to solve the customer's problem? Well, it might work. But that doesn't necessarily make it a good idea:
First, these are typically routine use cases, which don't need the creativity and spontaneity of an LLM.
Letting an agent call various APIs can lead to incorrect or unnecessary calls, adding cost and latency without any gain.
You might not trust an LLM to chat autonomously with your customers. After all, that's one of the most valuable interactions you're ever going to have; you don't want an LLM messing it up.
For many companies, the actual bot functionality is pretty good; it's just the routing that's a challenge. LLMs can be a better "front door" when the old systems are struggling with their classification accuracy.
Finally, converting an open problem to a closed one, and breaking it down into stages, are classic prompt engineering tips anyway. And that's what we get with LLM-powered intent detection.
Possible Techniques and Architectures
Letâs now explore some possible techniques and architectures. Remember that these will work with any kind of classification problem you might be faced with, not just intent detection.
"The Classic"
The status quo for many companies is a "Natural Language Understanding (NLU)" bot. Here, intent detection is done via machine learning, while hard-coded disambiguation questions and possibly other fallback business logic are used to select the final routing target. Companies wanting to cautiously introduce LLMs and conversational abilities to their stack could consider using an LLM here purely to rephrase the hard-coded system utterances.
Pros
ML models can be simple, fast, accurate, interpretable and trained for your specific task on your own data.
Evaluation (with a ground truth dataset) is concrete and can be automated.
Cons
Adding a new prediction class requires gathering labelled data and retraining the model.
Adding more use cases via business logic is not scalable, due to the complexity of all the different paths a conversation could follow. Thus, conversations remain rigid (e.g. a customer is not able to change their mind and escape the route the system has identified for them).
Real-world example: Many voice- and chatbots in production today.
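To make this concrete, here's a toy sketch of the ML side of such a bot: a TF-IDF + logistic regression intent classifier built with scikit-learn. The utterances and intent labels are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, invented training set: utterance -> intent label
utterances = [
    "my internet is not working",
    "the wifi keeps dropping",
    "I want to pay my bill",
    "how much does the upgrade cost",
]
intents = ["tech_support", "tech_support", "billing", "sales"]

# Classic NLU-style pipeline: vectorise the text, then fit a linear classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(utterances, intents)

print(model.predict(["the wifi is broken again"]))  # most likely ['tech_support'] on this toy data
```

In production you'd train on thousands of labelled utterances, but the shape of the solution stays the same.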
"The Hybrid"
This approach still uses an ML model for the initial classification, but falls back to an LLM when the prediction confidence is low.
Pros:
All the benefits of ML models mentioned above.
You're accessing the strengths of the LLM "for cheap", i.e. only when you need it.
Cons:
The system becomes more complex to deploy and maintain.
LLMs add latency and opacity, and are less predictable.
Real-world example: Lufthansa Group
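Here's a minimal sketch of the routing logic, assuming a scikit-learn-style classifier with predict_proba, and some function of your own that wraps the LLM prompt:

```python
def classify_intent(utterance, ml_model, llm_fallback, threshold=0.7):
    """Hybrid routing: trust the ML model when it's confident, otherwise ask the LLM.

    ml_model:     any scikit-learn-style classifier exposing predict_proba and classes_
    llm_fallback: a function (utterance -> intent label) wrapping your LLM prompt
    threshold:    confidence cut-off, to be tuned on your own labelled data
    """
    probabilities = ml_model.predict_proba([utterance])[0]
    best = probabilities.argmax()
    if probabilities[best] >= threshold:
        return ml_model.classes_[best]
    # Only pay the LLM's latency and cost when the ML model is unsure
    return llm_fallback(utterance)
```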
"The Filter"
Another hybrid approach: in this one, you retrieve the top n predictions from the ML model and then inject them into a prompt for the LLM to make the final decision.
Pros
Gives you a chance to recover from poor ML model accuracy (unlike the previous approach, where you'll never know if the model was wrong but confident).
Cons:
Same issues as the hybrid approach (though latency and cost might be slightly lower, as you're limiting the number of options you add to the prompt).
Real-world example: Voiceflow.
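A minimal sketch, again assuming a scikit-learn-style model; complete() is a stand-in for whichever LLM API you use:

```python
import numpy as np

def filtered_classify(utterance, ml_model, complete, top_n=3):
    """'The Filter': the ML model shortlists intents, the LLM makes the final call."""
    probabilities = ml_model.predict_proba([utterance])[0]
    top_indices = np.argsort(probabilities)[::-1][:top_n]
    candidates = [ml_model.classes_[i] for i in top_indices]

    prompt = (
        "Classify the customer message into exactly one of these intents: "
        f"{', '.join(candidates)}.\n"
        f"Customer message: {utterance}\n"
        "Answer with the intent name only."
    )
    answer = complete(prompt).strip()
    # Guard against hallucinated labels: fall back to the ML model's top guess
    return answer if answer in candidates else candidates[0]
```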
"The Fast Learner", aka few-shot prompting
In this approach, you simply describe all the intents, possibly with example utterances, and prompt an LLM to classify the customer input accordingly.
Pros:
Works remarkably well, thanks to LLMs' out-of-the-box understanding.
Adding new intents can be rather easy...
Cons:
… But adding intents can also prove challenging, as the addition of one new description to the prompt can affect classification accuracy across all classes. So you may need multiple rounds of experimentation to find the best class descriptions.
With all of your conversations now being classified by an LLM, issues like latency and cost become even more problematic.
As you add intents, the prompt explodes, which further increases the latency and costs.
Models can also forget the inner parts of a long prompt, leading to hallucination, confusion and lower accuracy. Or, they simply become less confident in predicting an intent, and trigger unnecessary disambiguation questions instead (which we know harms the customer experience).
Real-world Example: Rasa
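As a rough sketch, a few-shot classification prompt can be as simple as the following; the intent names, descriptions and example utterances are invented, and complete() again stands in for your LLM API of choice:

```python
INTENTS = {
    "billing": "Questions about invoices, payments or unexpected charges. Example: 'Why is my bill so high this month?'",
    "tech_support": "Problems with internet, TV or mobile service. Example: 'My internet keeps dropping.'",
    "cancellation": "The customer wants to end a subscription. Example: 'I want to cancel my contract.'",
}

def few_shot_classify(utterance, complete):
    intent_block = "\n".join(f"- {name}: {description}" for name, description in INTENTS.items())
    prompt = (
        "You are an intent classifier for a telco customer service bot.\n"
        f"Possible intents:\n{intent_block}\n\n"
        f"Customer message: {utterance}\n"
        "Respond with exactly one intent name from the list, and nothing else."
    )
    answer = complete(prompt).strip()
    # Reject anything that isn't a valid label, rather than trusting the LLM blindly
    return answer if answer in INTENTS else "unknown"
```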
"The Embedder"
This approach is a little like Retrieval-Augmented Generation, but for generating a prompt. First, you embed your intents and their descriptions. At inference time, you embed the customer's input and retrieve the n most similar intent descriptions. Then you inject these into the LLM prompt.
Pros:
Results in a smaller prompt, which helps tackle some of the above issues.
Adds a degree of interpretability, as you know which intents were retrieved.
You may be able to kill two birds with one stone: create a small labelled dataset that helps you assess and improve both your retrieval accuracy (e.g. Precision@K) and your classification accuracy.
Cons:
If intent descriptions are too different from the way customers express themselves, you may retrieve the wrong intent options, and the LLM will be doomed to fail.
Real-world example: Rasa.
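Here's a rough sketch of the retrieval step using the sentence-transformers library; the model name and intent descriptions are just examples, and any embedding API would do:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example model, swap in your own

intent_descriptions = {
    "billing": "Questions about invoices, payments or charges",
    "tech_support": "Problems with internet, TV or mobile service",
    "cancellation": "The customer wants to end a subscription",
}
intent_names = list(intent_descriptions)
intent_embeddings = embedder.encode(list(intent_descriptions.values()), convert_to_tensor=True)

def retrieve_candidate_intents(utterance, top_n=2):
    """Return the top_n intents whose descriptions are most similar to the utterance."""
    query_embedding = embedder.encode(utterance, convert_to_tensor=True)
    similarities = util.cos_sim(query_embedding, intent_embeddings)[0]
    best = similarities.argsort(descending=True)[:top_n]
    return [intent_names[int(i)] for i in best]

# The retrieved intents (and their descriptions) then go into the LLM prompt,
# exactly as in "the Filter" above.
```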
"The Tuner"
This approach is based on the idea that, since LLMs are built to predict tokens in a sequence, every next-token prediction is essentially a classification task. The "label set" is simply the model's vocabulary, and the model outputs a probability for each possible token in that vocabulary. The token that is actually predicted is just the one whose output probability was highest.
So in "the Tuner" approach, you attach a custom head to an LLM and fine-tune it to learn to map that token probability distribution to a much smaller label set.
Pros:
The additional control and accuracy of fine-tuning your own model.
Can add interpretability, since even a simple model, like a Gradient Boosted Tree, can work as a custom head.
Cons:
Adds the effort of fine-tuning, and the complexity of having a more sophisticated model stack.
Example: This Kaggle competition winner.
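One accessible way to get this pattern (not necessarily what the Kaggle winner did) is Hugging Face's AutoModelForSequenceClassification, which attaches a classification head to a pretrained transformer; the base model and labels below are just placeholders:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased"  # placeholder; the same pattern works for larger models
labels = ["billing", "tech_support", "cancellation", "sales"]

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
)

# From here, you'd fine-tune `model` on your labelled utterances (e.g. with the
# Trainer API), then classify with a single forward pass:
inputs = tokenizer("my internet keeps dropping", return_tensors="pt")
predicted_id = model(**inputs).logits.argmax(dim=-1).item()
print(labels[predicted_id])  # random until the head has been fine-tuned
```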
How can YOU benefit from these techniques?
I talked about intent detection, but these architectures can be applied to any kind of label classification problem you might be facing. So my recommendations are:
1. Grab some production data and label it: Do it as a team effort; it'll be quicker than you think, and will help you understand what your customers are asking, and how they're asking it. It'll also help ensure your team members are aligned on which kinds of customer inputs should be directed where. And that's vital, since if you can't agree on how to label a sample, you'll never be able to write a good prompt or debug a poorly performing LLM classifier. Finally, a labelled dataset will help you track performance as you build your LLM-powered system, and enable you to sanity-check before deploying changes to production.
2. Establish your classification baselines: How big is the largest class in your labelled dataset? That's the minimum accuracy you should aim for. Better yet, do you have existing ML classifiers you can try to beat?
3. Try out few-shot inference: You can do this with a simple REST API call for each sample, or even try a public LLM UI, like Gemini.google.com. If your system uses ML classifiers already, you can even try out "the Hybrid" approach: test the LLM only on the samples your existing classifiers got wrong, and then add up the accuracy (sketched below). It'll be an optimistic score, but it can be informative nonetheless (especially if you repeat such a test, in which case the score becomes a trend indicator, and the absolute value becomes less important).
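Here's a minimal sketch of steps 2 and 3; existing_classifier and llm_classify stand in for your own production classifier and LLM prompt:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy you'd get by always predicting the most common class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

def optimistic_hybrid_accuracy(samples, existing_classifier, llm_classify):
    """Count a sample as correct if either the existing classifier or the LLM gets it right.

    samples: iterable of (utterance, true_label) pairs from your labelled dataset
    """
    correct = 0
    for utterance, true_label in samples:
        # The 'or' short-circuits, so the LLM is only called on samples the
        # existing classifier got wrong
        if existing_classifier(utterance) == true_label or llm_classify(utterance) == true_label:
            correct += 1
    return correct / len(samples)
```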
If the results are promising, you can try some of the other architectures I presented today.
Some Final Tips:
Prompting best practices can be model-dependent and surprisingly counterintuitive: what works for one model might be terrible for another. So development teams should experiment frequently, and keep each other up to date on what's working best for their problem and domain.
Include domain experts, like conversation designers or call center agents, when designing prompts and exploring model outputs.
Remember to situate the problem in a business context: ask what the implications of different kinds of misclassification are. For example, failing to detect a customer complaint intent could be far worse than missing a potential sales call. So don't just measure accuracy; look at where the model is going wrong, and think about what that means for the business.
Get clever about how to measure your progress: You're probably working on a problem that has no baseline or benchmarks, so how will you know if you're developing in the right direction? I've actually written another blog post about that, here:
No baseline? No benchmarks? No biggie! An experimental approach to agile chatbot development
What happens when you take a working chatbot that's already serving thousands of customers a day in four different languages, and try to deliver an even better experience using Large Language Models? Good question.
And that's it! Happy prompting!