Large Language Models as Classification Engines: Overkill, or Awesome?
Why it's an odd choice. Why you should try it anyway. And how to go about it.
Have you ever wanted to build a trillion-parameter labelling machine? And by that I mean, would you use a Large Language Model to solve a classification problem?
It’s an unlikely fit for an LLM. And yet, there are good reasons to use LLMs for such tasks, and emerging architectures and techniques to do so. Many real-world use cases need a classifier, and many product development teams will soon find themselves wondering, could GPT handle that?
In today’s post, I’ll dive into how we’re tackling this issue while building a next generation conversational assistant at Switzerland’s largest telecommunications provider.
I’ll cover:
Why LLM-classifiers are an odd idea, but you might want to try it anyway
Possible approaches and architectures, using customer intent detection as a real-world use case
Universal challenges, and our lessons learned so far
Sound good? Let’s go!
LLMs as classifiers is a weird idea… 🤔
Let’s start by comparing classification algorithms and LLMs. It turns out they’re almost opposites of each other.
Classification is a classic, (theoretically) simple machine learning (ML) task, where the model takes a structured input in the format it’s used to seeing, and outputs a single label: dog or cat (image classification), positive or negative (sentiment detection), spam or not spam, and so on. LLMs, by contrast, are complex algorithms, designed to solve many different kinds of tasks. They take unstructured inputs and generate potentially long, complex outputs.
Classification is usually just one step in a bigger process, so it must be fast, accurate, and consistent: the same sorts of input should get the same label out at the end. LLMs, however, are slow, random, and prone to confabulation. Moreover, classification models often need to be interpretable, for ethical and safety reasons. For example, if an ML system classifies someone as being at high risk of re-offending, we need to know why it made that decision. But sadly, LLMs are not at all interpretable.
… but you might want to do classification with LLMs anyway 😉
So LLMs as classifiers doesn’t necessarily seem like a natural fit. But it can still make sense and deliver certain benefits. For example:
Prototyping: LLMs make it easy and fast to try out ideas. For example, you could role-play how your software should behave with your users, and try out different user journeys. It’s also straightforward to call a model via its API from a simple piece of code, enabling you to explore more complex implementations before you really start building.
LLMs don’t require any training (to get started), thanks to their built-in world knowledge.
Scalability and flexibility when adding new classes: In an ML classification task, adding a new class label requires retraining the model with new labelled samples. But with LLMs and few-shot prompting, you ‘just’* need to add a new class description to your prompt. (* More on the actual challenges of this, later).
If you can turn your problem into a classification task, you can potentially take advantage of existing data sets and benchmarks. Especially in the prototyping stage, this helps you assess feasibility, before you invest significant effort preparing your own labelled test data. You’ll also be able to better judge whether an LLM is really worth the additional challenges they bring.
Common classification metrics like accuracy, precision/recall and F1 score are easy and fast to compute (a big advantage over LLM judges, which I complained about here); see the quick sketch after this list.
LLMs can handle diverse, unstructured and even inconsistent data. Contrast that with many ML algorithms, where patching data gaps (using e.g. imputation or a placeholder value) can harm the model’s performance. Plus, multi-modal models even let you bring in features from other modalities, like images or audio.
Finally, if you have a multilingual use case, it’s easier (than other approaches!) to build in one language and port to another afterwards.
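To make the metrics point concrete: once you have ground-truth labels and model predictions, the standard metrics are a few lines of scikit-learn. A minimal sketch, with invented labels:

```python
from sklearn.metrics import accuracy_score, classification_report

# Invented ground-truth labels and predictions for an intent-routing task
y_true = ["billing", "sales", "tech_support", "billing", "sales"]
y_pred = ["billing", "sales", "billing", "billing", "tech_support"]

print(accuracy_score(y_true, y_pred))                          # overall accuracy
print(classification_report(y_true, y_pred, zero_division=0))  # per-class precision, recall, F1
```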
Worked Example: Intent detection for a conversational AI agent
Let’s look at a real example from my work at a large telco. Currently, when a customer calls our voice bot, the bot’s main task is to route the customer to the correct service team. This requires detecting which product or service the customer is calling about and what they need to get done. That’s a classification task.
Why intent detection is hard… 😟
Our current system uses a combination of ML models and rule based logic to identify the correct routing target. Many factors make this a challenge:
Variability in how customers express themselves can be confusing, even for human experts who know our business inside and out.
Customers can have multiple intents, making it harder to identify the best one to solve (first), and raising questions like whether to output multiple predictions, and how to handle them.
Context may be even more informative than what the customer actually says. What’s the status of the customer and their products? Have they been blocked because of an unpaid bill? Is the internet out in their area? We need to find clever ways to incorporate such information, since customers aren’t great at explaining themselves.
A big challenge is that meaning is built up over multiple turns of dialogue. Imagine the following conversation:

On its own, that final utterance means nothing. Clearly we should only attempt to classify the whole conversation. But we can’t predict the future, so when the conversation starts, we don’t know how long it’s going to run, or how (un)helpful the customer’s utterances will be. Thus, the question of when and how to trigger classification is not as trivial as it seems.
Customers might not even know what they want or need. Again, imagine the following conversation:

Now imagine that the cancellation specialist, after chatting with the customer, realises that they don’t really want to cancel, they just need to downgrade their plan temporarily. So this should have been a ‘sales’ intent, as it’s the sales team who take care of switching plan levels. Such problems can arise if there’s misalignment between how our org chooses to distribute agent skills across teams, and how customers think about and describe their problems. This makes intent detection harder, but it’s a practical reality: not every team member can be trained to resolve every process.
Finally, data is always noisy. Language data often contains spelling mistakes, hashtags, emojis, and — in our case — errors from the speech-to-text transcription service.
… But we do it anyway 😄
Despite its challenges, there are benefits to intent-based logic:
It helps us simplify natural language to make it workable. We’ll never be able to capture the variability of the way customers talk to us, but at least when our teams design customer experiences, we can use intent categories to align on the kinds of things customers call about.
Intent-based logic makes our system more deterministic, as we’re limiting the set of paths it can take. Plus, it’s more interpretable: seeing which intent a model predicted can help us investigate errors down the line. It’s also more testable, because a classification task can have a concrete ground truth label we can compare against.
This approach can also help us gracefully handle LLM hallucinations, because we can verify whether a predicted value is valid or not (see the small sketch after this list).
Finally, the intent-based approach can be useful for the broader business. For example, insights on how often people call about different intents can help managers better staff our call-centres to meet demand.
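To illustrate the hallucination point from the list above: because the label set is closed, an invalid prediction is trivial to catch and handle gracefully. A minimal sketch (the intent names and fallback below are invented):

```python
VALID_INTENTS = {"billing", "sales", "tech_support", "cancellation"}  # invented label set
FALLBACK_INTENT = "ask_clarifying_question"

def sanitise_prediction(raw_label: str) -> str:
    """Accept the model's output only if it's one of our known intents."""
    label = raw_label.strip().lower()
    return label if label in VALID_INTENTS else FALLBACK_INTENT

print(sanitise_prediction(" Billing "))       # -> "billing"
print(sanitise_prediction("free ice cream"))  # hallucinated label -> fallback
```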
Couldn’t we use LLMs for the lot? 🙋🏾‍♀️
You might be wondering, why not just use an LLM end to end? Instead of trying to detect the customer’s intent and then look up the correct routing, why not write a giant prompt that describes the sorts of tasks each team can do, and let the model figure out where to send each customer? Why not go all-in and build an agent, which is empowered to call tools and APIs to solve the customer’s problem? Well, it might work. But that doesn’t necessarily make it a good idea:
First, these are typically routine use cases, which don’t need the creativity and spontaneity of an LLM.
Letting an agent call various APIs can result in incorrect, unnecessary and wasteful calls, adding cost and latency without any gain.
You might not trust an LLM to chat autonomously with your customers. After all, that’s one of the most valuable interactions you’re ever going to have; you don’t want an LLM messing it up.
For many companies, the actual bot functionality is pretty good; it’s just the routing that’s a challenge. LLMs can be a better ‘front door’ when the old systems are struggling with their classification accuracy.
Finally, converting an open problem to a closed one, and breaking it down into stages, are classic prompt engineering tips anyway. And that’s what we get with LLM-powered intent detection.
Possible Techniques and Architectures 📐
Let’s now explore some possible techniques and architectures. Remember that these will work with any kind of classification problem you might be faced with, not just intent detection.
“The Classic”
The status quo for many companies is a ‘Natural Language Understanding (NLU)’ bot. Here, intent detection is done via machine learning, while hard-coded disambiguation questions and possibly other fallback business logic are used to select the final routing target. Companies wanting to cautiously introduce LLMs and conversational abilities to their stack could consider using an LLM here purely to rephrase the hardcoded system utterances.
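For reference, the ML side of “the Classic” can be as small as a TF-IDF plus logistic regression pipeline trained on your own labelled utterances. A toy sketch with invented data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented training set; a real NLU bot would use thousands of labelled utterances
utterances = [
    "my internet is not working",
    "I want to pay my bill",
    "how do I upgrade my plan",
    "the wifi keeps dropping",
]
intents = ["tech_support", "billing", "sales", "tech_support"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(utterances, intents)

print(clf.predict(["there is no connection at my place"]))
```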
Pros
ML models can be simple, fast, accurate, interpretable and trained for your specific task on your own data.
Evaluation (with a ground truth dataset) is concrete and can be automated.
Cons
Adding a new prediction class requires gathering labelled data and retraining the model.
Adding more use cases via business logic is not scalable, due to the complexity of all the different paths a conversation could follow. Thus, conversations remain rigid (e.g. a customer is not able to change their mind and escape the route the system has identified for them).
Real-world example: Many voice- and chatbots in production today.
“The hybrid”
This approach still uses an ML model for the initial classification, but falls back to an LLM when the prediction confidence is low.
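In code, the hybrid is essentially a confidence check in front of the LLM call. A sketch assuming a scikit-learn-style classifier and a placeholder `call_llm` function standing in for whichever LLM client you use:

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.7  # tune on a labelled validation set

def classify_hybrid(utterance: str, clf, call_llm) -> str:
    """Trust the ML classifier when it's confident; otherwise fall back to the LLM."""
    probs = clf.predict_proba([utterance])[0]
    best = int(np.argmax(probs))
    if probs[best] >= CONFIDENCE_THRESHOLD:
        return clf.classes_[best]
    prompt = (
        f"Classify this customer utterance into one of {list(clf.classes_)}: "
        f"'{utterance}'. Answer with the intent name only."
    )
    return call_llm(prompt)  # placeholder for your LLM API call
```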
Pros:
All the benefits of ML models mentioned above.
You’re accessing the strengths of the LLM ‘for cheap’, i.e. only when you need it.
Cons:
The system becomes more complex to deploy and maintain.
LLMs add latency and opacity, and are less predictable.
Real-world example: Lufthansa Group
“The filter”
Another hybrid approach: in this one, you retrieve the top n most likely predictions from the ML model and then inject them into a prompt for the LLM to make the final decision.
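A sketch of “the filter”, again assuming a scikit-learn-style classifier and a placeholder `call_llm` function:

```python
import numpy as np

def classify_filtered(utterance: str, clf, call_llm, n: int = 3) -> str:
    """Let the ML model shortlist n candidate intents; the LLM makes the final call."""
    probs = clf.predict_proba([utterance])[0]
    top_n = [clf.classes_[i] for i in np.argsort(probs)[::-1][:n]]
    prompt = (
        f"A customer said: '{utterance}'. "
        f"Which of these intents fits best: {top_n}? "
        "Answer with the intent name only."
    )
    return call_llm(prompt)  # placeholder for your LLM API call
```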
Pros
Gives you a chance to recover from poor ML model accuracy (unlike the previous approach, where you’ll never know if the model was wrong but confident).
Cons:
Same issues as the hybrid approach (though latency and cost might be slightly lower, as you’re limiting the number of options you add to the prompt).
Real-world example: Voiceflow.
“The fast learner” aka few-shot prompting
In this approach, you simply describe all the intents, possibly with example utterances, and prompt an LLM to classify the customer input accordingly.
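A minimal sketch of such a prompt (the intents, descriptions and examples are invented, and `call_llm` again stands in for your LLM client):

```python
INTENT_DESCRIPTIONS = """\
billing: questions about invoices, payments or charges.
  Example: "Why is my bill higher this month?"
sales: buying, upgrading or downgrading a product.
  Example: "I'd like a faster internet plan."
tech_support: something is broken or not working.
  Example: "My TV box keeps freezing."
"""

def build_few_shot_prompt(utterance: str) -> str:
    return (
        "You are an intent classifier for a telco voice bot.\n"
        f"Possible intents, with examples:\n{INTENT_DESCRIPTIONS}\n"
        f"Customer: \"{utterance}\"\n"
        "Reply with exactly one intent name from the list above."
    )

# predicted_intent = call_llm(build_few_shot_prompt("my invoice looks wrong"))
```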
Pros:
Works remarkably well, thanks to LLMs’ out of the box understanding.
Adding new intents can be rather easy...
Cons:
… But adding intents can also prove challenging, as the addition of one new description to the prompt can affect classification accuracy across all classes. So you may need multiple rounds of experimentation to find the best class descriptions.
With all of your conversations now being classified by an LLM, issues like latency and cost become even more problematic.
As you add intents, the prompt explodes, which further increases the latency and costs.
Models can also forget the inner parts of a long prompt, leading to hallucination, confusion and lower accuracy. Or, they simply become less confident in predicting an intent, and trigger unnecessary disambiguation questions instead (which we know harms the customer experience).
Real-world Example: Rasa
“The Embedder”
This approach is a little like Retrieval Augmented Generation, but for generating a prompt. First, you embed your intents and their descriptions. At inference time, you embed the customer’s input and retrieve the n most similar intent descriptions. Then you inject these into the LLM prompt.
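A sketch of the retrieval step, with `embed` as a placeholder for whatever embedding model or API you use (it should map a string to a 1-D numpy vector):

```python
import numpy as np

def retrieve_top_intents(utterance: str, intent_descriptions: dict, embed, n: int = 5) -> list:
    """Return the n intent names whose descriptions are most similar to the utterance."""
    query_vec = embed(utterance)
    scored = []
    for intent, description in intent_descriptions.items():
        doc_vec = embed(description)  # in practice, precompute and cache these
        similarity = np.dot(query_vec, doc_vec) / (
            np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)
        )
        scored.append((similarity, intent))
    return [intent for _, intent in sorted(scored, reverse=True)[:n]]

# The retrieved intents (not the whole catalogue) then go into the LLM prompt,
# e.g. reusing the few-shot prompt from "the fast learner" above.
```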
Pros:
Results in a smaller prompt, which helps tackle some of the above issues.
Adds a degree of interpretability, as you know which intents were retrieved.
You may be able to kill two birds with one stone: a single small labelled dataset can help you assess and improve both your retrieval accuracy (e.g. Precision@K) and your classification accuracy (see the sketch at the end of this section).
Cons:
If intent descriptions are too different from the way customers express themselves, you may retrieve the wrong intent options, and the LLM will be doomed to fail.
Real-world example: Rasa.
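On the ‘two birds with one stone’ point above: with a small labelled set, retrieval quality is easy to check before you even look at the LLM’s answers. A minimal sketch:

```python
def precision_at_k(retrieved_intents: list, relevant_intents: set, k: int) -> float:
    """Fraction of the top-k retrieved intents that are actually relevant."""
    top_k = retrieved_intents[:k]
    return sum(1 for intent in top_k if intent in relevant_intents) / k

# Invented example: the true intent was "billing"
print(precision_at_k(["sales", "billing", "tech_support"], {"billing"}, k=3))  # ~0.33
```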
“The Tuner”
This approach is based on the idea that, since LLMs are built to predict tokens in a sequence, predicting the next token is essentially a classification task. The ‘label set’ is simply the model’s vocabulary, and the model outputs a probability for each possible token in that vocabulary. The token that is actually predicted is just the one whose output probability is highest.
So in “the Tuner” approach, you attach a custom head to an LLM and fine-tune it to map that token probability distribution (or the model’s internal representations) to a much smaller label set.
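A minimal sketch using Hugging Face transformers, with a small encoder checkpoint standing in for whatever base model you actually fine-tune. Note that this attaches the head to the model’s hidden states rather than to its output token probabilities, which is the most common way to do it with this library; the label set is invented, and the head is untrained, so its outputs are random until you fine-tune on labelled data:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["billing", "sales", "tech_support", "cancellation"]  # invented label set

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(labels)
)

inputs = tokenizer("my internet is down again", return_tensors="pt")
logits = model(**inputs).logits              # shape: (1, num_labels)
print(labels[logits.argmax(dim=-1).item()])  # random until the head is fine-tuned
```

Alternatively, you could pool the base model’s hidden states and train a separate classifier on those features, which is where a simple model like the Gradient Boosted Tree mentioned below comes in.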
Pros:
The additional control and accuracy of fine-tuning your own model.
Can add interpretability, since even a simple model, like a Gradient Boosted Tree, can work as a custom head.
Cons:
Adds the effort of fine-tuning, and the complexity of having a more sophisticated model stack.
Example: This Kaggle competition winner.
How can YOU benefit from these techniques? 🫵🏽
I talked about intent detection, but these architectures can be applied to any kind of label classification problem you might be facing. So my recommendations are:
1. Grab some production data and label it: Do it as a team effort; it’ll be quicker than you think, and will help you understand what your customers are asking, and how they’re asking it. It’ll also help ensure your team members are aligned on which kinds of customer inputs should be directed where. And that’s vital, since if you can’t agree on how to label a sample, you’ll never be able to write a good prompt or debug a poorly performing LLM classifier. Finally, a labelled dataset will help you track performance as you build your LLM-powered system, and enable you to sanity check before deploying changes to production.
2. Establish your classification baselines: How big is the largest class in your labelled dataset? That’s the minimum accuracy you should aim for. Better yet, do you have existing ML classifiers you can try to beat?
3. Try out few-shot inference: You can do this with a simple REST API call for each sample, or even try a public LLM UI, like Gemini.google.com. If your system already uses ML classifiers, you can even try out “the Hybrid” approach: test the LLM only on the samples your existing classifiers got wrong, and then add up the accuracy. It’ll be an optimistic score, but it can be informative nonetheless (especially if you repeat such a test, in which case the score becomes a trend indicator and the absolute value matters less).
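Here’s roughly what that last test could look like, with `ml_predict` and `llm_predict` as placeholders for your existing classifier and your LLM call:

```python
def optimistic_hybrid_accuracy(samples, ml_predict, llm_predict) -> float:
    """Keep the ML model's correct answers; see how many of its misses the LLM recovers.

    `samples` is a list of (utterance, true_intent) pairs. The score is optimistic
    because, in production, you wouldn't know which ML predictions were wrong.
    """
    correct = 0
    for utterance, true_intent in samples:
        prediction = ml_predict(utterance)
        if prediction != true_intent:          # only the ML model's misses go to the LLM
            prediction = llm_predict(utterance)
        correct += int(prediction == true_intent)
    return correct / len(samples)
```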
If the results are promising, you can try some of the other architectures I presented today.
Some Final Tips:
Prompting best practices can be model-dependent and surprisingly counterintuitive: what works for one model might be terrible for another. So development teams should experiment frequently, and keep each other up to date on what’s working best for their problem and domain.
Include domain experts, like conversation designers or call center agents, when designing prompts and exploring model outputs.
Remember to situate the problem in a business context: ask what the implications of different kinds of misclassification are. For example, failing to detect a customer complaint intent could be far worse than missing a potential sales call. So don’t just measure accuracy; look at where the model is going wrong, and think about what that means for the business.
Get clever about how to measure your progress: You’re probably working on a problem that has no baseline or benchmarks, so how will you know if you’re developing in the right direction? I’ve actually written another blog post about that, here:
No baseline? No benchmarks? No biggie! An experimental approach to agile chatbot development
What happens when you take a working chatbot that's already serving thousands of customers a day in four different languages, and try to deliver an even better experience using Large Language Models? Good question.
And that’s it! Happy prompting!