Generative AI Use Case: Using LLMs to Score Customer Conversations
Despite all the talk about AI replacing humans, Skynet blowing up the sun, and deep-fake celebrities parenting our children, it’s difficult to point to a generative AI use case that’s demonstrably more interesting than your average run-of-the-mill chatbot.
But what if instead of replacing customer support teams with chatbots, we could leverage AI to improve the performance of real human CS teams?
That’s exactly the problem that AssuranceIQ attempted to solve with its first foray into enterprise generative AI.
We recently spoke with Killian Farrell, Principal Data Scientist at insurance startup AssuranceIQ, to learn how his team built an LLM-based product to structure unstructured data and score customer conversations for developing sales and customer support teams.
Read on to find out what they did, and what they learned!
Leveraging all customer conversations as data
For most fledgling AI teams, customer service is an approachable stepping stone into enterprise AI. Chatbots, copilots, and tools built to make the customer experience more seamless – or to give business teams critical insights, like sentiment or engagement – are all relatively well-understood focus areas.
And when used as a tool to understand or improve customer behavior, generative AI can actually be quite helpful—if not entirely the ground-breaking innovation that top third-party LLM creators would have you believe.
AssuranceIQ was a platform that leveraged technology to match consumers with insurance plans. And in the confusing, often overwhelming world of private insurance, that meant a lot of customer conversations. Tens of thousands per day, in fact.
Like any customer conversation, those transcripts contained a treasure trove of insights and clues into the customers’ experience and frame of mind during and after a call.
Did they get the information they needed? Was the rep friendly and agreeable? Did they walk away satisfied with the interaction?
“Every moment of every call is an important component of the overall customer experience,” said Killian. “Any moment could impact a customer’s next decision.”
The challenge was simply how to process that many unstructured data points at scale—data points that could serve as a corpus for an LLM to learn from.
Operationalizing customer data as context
When the GenAI hype eventually became unavoidable, Killian and his team didn’t jump right to the played-out chatbot storyline. Instead, he took a step back to understand what unique data they had, and how it might provide differentiated value to the broader Assurance team.
“We knew that, given the right context, an LLM is very powerful at extracting and summarizing information,” said Killian.
The key phrase was: “given the right context.”
That’s exactly what Assurance had: context. And lots of it.
All these customer conversations were data points that could be used to tune an LLM on what good, average, and poor customer service looks like. From there, they could use the LLM not only to score that sentiment, but also to leverage those scores for new machine learning and predictive analytics use cases.
So, they put GenAI to work.
Using the LLM to create a scoring model
To build the LLM-based product, the Assurance team leveraged the contextual conversation data in their S3 data lake, using a combination of proprietary and open source third-party models hosted in AWS Bedrock, Azure OpenAI, and more.
Then, they trained their model for a two-fold use case:
- Summarizing conversations
- Turning unstructured data (customer conversations) into structured data (scores)
Using their contextual call transcripts, the team instructed the LLM to take that unstructured data – the conversations saved as plain text files – and turn it into structured data – scores – based on level of customer satisfaction.
They provided the model with relevant (proprietary) context for the scores, and the LLM then gave a score and reasoning for that score. “We get a chain of thought so it’s not just hallucinating,” said Killian. “The model has to give sources for the reasoning.”
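Assurance hasn’t published its prompts or output schema, but a minimal sketch of that structured-scoring step might look something like the following, assuming a Bedrock-hosted model called via boto3. The model ID, prompt wording, and JSON fields are illustrative assumptions, not the team’s actual implementation.

```python
# Hypothetical sketch only: score one call transcript with a model hosted in
# AWS Bedrock. The model ID, prompt, and JSON fields are illustrative
# assumptions, not Assurance's actual prompts or schema.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

SCORING_PROMPT = """You are scoring a customer support call for an insurance platform.
Return JSON only, with these fields:
  satisfaction_score: integer from 0 to 100
  reasoning: a short explanation of the score
  sources: quoted transcript excerpts, with timestamps, that support the reasoning

Transcript:
{transcript}
"""

def score_transcript(transcript: str) -> dict:
    """Turn one unstructured transcript into a structured score with cited sources."""
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": [{"text": SCORING_PROMPT.format(transcript=transcript)}],
        }],
        inferenceConfig={"temperature": 0.0},  # keep scoring as repeatable as possible
    )
    raw = response["output"]["message"]["content"][0]["text"]
    return json.loads(raw)  # e.g. {"satisfaction_score": 82, "reasoning": "...", "sources": [...]}
```

Requiring the reasoning and cited sources alongside the score is what gives the team the chain of thought Killian describes – and, as we’ll see below, something concrete to monitor.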
Turning these customer conversations into scores for analysis eliminated what would have been hours spent listening to calls and performing NLP work—delivering an output that was both uniquely valuable and uniquely scalable.
Not only can the LLM turn unstructured data into structured data, but it can also give a summary of exactly what happened – and it can do so dynamically, so new context is always added and taken into account.
What’s more, the scoring output of the model was only the beginning. This new dataset opened the door for even more machine learning analysis on the newly structured data. The customer conversation scores captured knowledge that would previously have required immense resources to extract – and existing models could now be layered on top of this score data to surface new insights, like predictive value or churn rates.
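The interview doesn’t specify which downstream models Assurance layered on top of the scores, but as a rough illustration, an LLM-derived score can simply become another feature in a conventional predictive model. The file path, column names, and churn label below are hypothetical:

```python
# Hypothetical sketch: use LLM-derived conversation scores as features for a
# downstream churn model. The dataset, column names, and label are illustrative.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Scored calls written back to the data lake by the LLM pipeline (hypothetical path)
scored_calls = pd.read_parquet("s3://example-bucket/scored_calls.parquet")

features = scored_calls[["satisfaction_score", "call_duration_sec", "num_transfers"]]
labels = scored_calls["churned_within_90d"]

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

churn_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Holdout accuracy: {churn_model.score(X_test, y_test):.2f}")
```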
The efficiency and depth of these insights would make it easier to control, improve, and train real-life customer service teams for the optimal customer experience. No chatbots required.
GenAI is new, but the reliability concerns aren’t
Like any data product, it’s not enough to pump your pipelines full of data and hope for the best. Monitoring the inputs and outputs is essential to delivering value with a generative AI use case.
In fact, data quality is even more fundamental to the performance of GenAI pipelines.
So, how were Killian and his team monitoring their LLM?
“We can do a lot of structured output monitoring on things like numeric scores,” says Killian. “Did the LLM follow the instructions we gave it? If we expect the LLM to respond with sources and timestamps, is it doing that?”
Monitoring numeric scores meant monitoring the output distribution – in other words, looking at the structured output and assessing whether the distribution falls where they’d expect it to.
The team was also looking for anomalies, like word errors in the source transcription data or an output of a 120% customer satisfaction score. For these types of issues, data observability comes in handy – it can catch those errors and route the team to the root cause in the pipeline so they can triage and resolve them right away.
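Data observability tooling handles this kind of monitoring automatically, but to make the idea concrete, here is a rough sketch of the sorts of structured-output checks Killian describes. The thresholds, column names, and baseline dataset are assumptions:

```python
# Hypothetical sketch of structured-output monitoring for the LLM scores.
# Thresholds, column names, and the reference distribution are assumptions.
import pandas as pd
from scipy.stats import ks_2samp

scores = pd.read_parquet("s3://example-bucket/scored_calls.parquet")      # hypothetical path
baseline = pd.read_parquet("s3://example-bucket/score_baseline.parquet")  # hypothetical path

# 1. Did the LLM follow instructions? Scores must be in range and sources must be present.
out_of_range = scores[(scores["satisfaction_score"] < 0) | (scores["satisfaction_score"] > 100)]
missing_sources = scores[scores["sources"].isna() | (scores["sources"].str.len() == 0)]

# 2. Does today's score distribution look the way we'd expect it to?
_, p_value = ks_2samp(scores["satisfaction_score"], baseline["satisfaction_score"])
distribution_drifted = p_value < 0.01

if len(out_of_range) or len(missing_sources) or distribution_drifted:
    print("Alert: LLM output anomaly detected - route to the pipeline owners for triage")
```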
But even though every generative AI use case is new and exciting, the importance of data quality remains the same. “Our data lake principles didn’t fundamentally change,” said Killian. “[The LLM is] another tool that operates with our data lake – but it still relies on the fundamental data quality principles, the quality of the model, and the quality of the predictions.”
“It’s not different from any other predictive model, we just call the errors ‘hallucinations’ now,” says Killian. “You have to have the right data quality principles in place just like anything else you run through your data lake.”
When it comes to GenAI, data quality is essential for more than just RAG or fine-tuning models. It’s also essential for initiatives just like this: measuring LLM outputs and the outputs of other features that could end up in new, additional predictive models.
“The Venn diagram of where you need to care about data quality has expanded,” said Killian. “It’s even more important in even more places.”
For Killian, the potential for generative AI is massive – when you know how to use it correctly. “It’s a tool in the toolbox. It’s not the only tool, but it can help in certain situations.” And no matter what, ensuring a foundation of data quality is paramount.
To learn more about how data observability and data quality impact the value of your generative AI use case, give us a call.
Our promise: we will show you the product.