The use of Large Language Models (LLMs) to drive generative AI applications such as Google Bard and ChatGPT has been the biggest technology leap of the past year. But the darker side of these tools has also started to emerge: AI hallucinations and seemingly credible fabrications are common. The reason the models break down often comes down to the data they are trained on.
AI models require two major elements: software that performs complex sequences of probability and logic calculations, and the data used to train it. When a model lacks enough data to make an accurate prediction, it interpolates from the data it does have to produce a result. For enterprises using generative AI in business applications, the likelihood of producing the best possible results increases when the models are trained on your own high-quality data.
The data challenge for AI
Organisations seeking to leverage generative AI must train their models on data that best matches their business needs. For many organisations, this is a significant challenge. Research from around the world, including Australia, finds that about half of all companies can’t identify where their critical data is stored or how it is structured. And once they know where that data is, they need to address data quality issues.
“Organisations seeking to leverage generative AI must train their models on data that best matches their business needs.”
Vinay Samuel, Founder and CEO, Zetaris
The most valuable data was often collected historically for a different purpose. At a structural level, the field names assigned to values may not clearly articulate what a piece of data represents. For example, the database underpinning a business system may use a label like “XC3” for the field business users call “Customer Name”. Such data may need to be recoded into business language so that the LLM understands the business logic and treats the data correctly, and so it can be analysed to unlock the greatest insights and benefits.
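As a minimal sketch of that recoding step, the Python example below maps opaque system codes to the terms business users recognise before the data is handed to a model. The field codes, business terms and sample values are entirely hypothetical.

    import pandas as pd

    # Hypothetical extract from a business system whose schema uses opaque field codes.
    raw = pd.DataFrame({
        "XC3": ["Acme Pty Ltd", "Globex Corp"],
        "DT01": ["2023-06-01", "2023-06-14"],
        "AMT": [1250.00, 980.50],
    })

    # A business glossary mapping each system code to the term analysts actually use.
    business_glossary = {
        "XC3": "customer_name",
        "DT01": "order_date",
        "AMT": "order_value_aud",
    }

    # Recode the dataset into business language before it is used to train or prompt a model.
    prepared = raw.rename(columns=business_glossary)
    print(list(prepared.columns))  # ['customer_name', 'order_date', 'order_value_aud']

In practice the glossary itself is the valuable artefact: it captures the rules that describe the data in terms the business understands.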
When businesses don’t know where their data is or what it contains, it creates many risks. It also makes it extremely difficult to feed LLMs with the data they need to maximise the probability of an accurate result. To reap the benefits of generative AI, organisations must know the rules that describe their data, and that data must be recoded, where necessary, into business logic so AI models can be trained to deliver accurate insights that support better business outcomes.
It’s also imperative that data is prepared so it can be ingested into systems in a usable form. Data quality is a significant issue: inconsistent collection processes and data entry errors need to be identified and corrected, and older data needs to be checked. Some researchers suggest that data decays at around 2% per month, as changes to customer addresses, preferences and systems leave records incorrect or out of date.
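To put that figure in perspective, the short Python sketch below simply compounds a 2% monthly decay rate over time; the rate itself is the researchers’ estimate quoted above, not a measured value.

    # Compound a 2% monthly decay rate to estimate how much of a dataset goes stale.
    monthly_decay = 0.02

    for months in (6, 12, 24):
        still_accurate = (1 - monthly_decay) ** months
        print(f"After {months:2d} months: ~{1 - still_accurate:.0%} of records are likely stale")

    # Output:
    # After  6 months: ~11% of records are likely stale
    # After 12 months: ~22% of records are likely stale
    # After 24 months: ~38% of records are likely stale

At that rate, roughly a fifth of a dataset can become unreliable within a year, which is why older data needs to be checked before it is used for training.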
Traditional data management models, like those used in many business intelligence systems, rely on knowing where the data is, copying it to a central location and normalising different types of data so they can be stored in a single repository.
For LLMs, which need to be trained with large volumes of data, this approach is impractical. As well as training models with their own data, organisations may rely on external data sources that complement and augment it.
Solving the challenge
Finding and integrating multiple data sources to feed LLMs, so they can deliver the most accurate and useful responses, requires a new approach to how data is exposed and managed. Organisations need the best possible answers to their questions as quickly as possible, so they can make sound decisions, stay ahead of their competition and embrace new opportunities as they arise.
The first step in meeting this challenge is identifying all the potential data sources. This exercise is critical, not just for ensuring the success of your generative AI projects but also from a data governance perspective.
With the data locations identified, the next step is to identify what data is needed for your LLM. In the past, this meant extracting and copying the data to a central repository such as a data lake. But a modern data preparation studio can present the data to the LLM without copying it from the source. This is faster, more cost-effective and lowers cybersecurity risk because the data is never duplicated. It also means LLMs have access to data as it is updated, in real time.
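The sketch below illustrates the underlying idea in Python: the source system is queried at question time and the fresh results are passed to the model as context, rather than being copied into a central store first. The table, query and call_llm placeholder are hypothetical; a real data preparation studio would handle connectivity, security and recoding across many sources.

    import sqlite3

    def call_llm(prompt: str) -> str:
        # Stand-in for whichever model endpoint the organisation actually uses.
        return f"[model response to a prompt of {len(prompt)} characters]"

    def answer_with_live_data(question: str, source: sqlite3.Connection) -> str:
        # Query the operational source directly at question time instead of
        # relying on a copy made earlier into a central repository.
        rows = source.execute(
            "SELECT customer_name, order_value_aud "
            "FROM orders ORDER BY order_date DESC LIMIT 20"
        ).fetchall()

        # Present the fresh rows to the model as context alongside the question,
        # so the answer reflects the source system as it is right now.
        context = "\n".join(f"{name}: {value}" for name, value in rows)
        prompt = f"Context (live order data):\n{context}\n\nQuestion: {question}"
        return call_llm(prompt)

    if __name__ == "__main__":
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE orders (customer_name TEXT, order_value_aud REAL, order_date TEXT)"
        )
        conn.execute("INSERT INTO orders VALUES ('Acme Pty Ltd', 1250.0, '2024-05-01')")
        print(answer_with_live_data("Which customers ordered most recently?", conn))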
This also answers the criticism that generative AI tools give incorrect answers because they are trained on old data. Models can now be fed accurate, up-to-the-second information in real time. That means the answers the model delivers are superior and better able to help the organisation identify emerging customer and market trends, keep ahead of competitors and quickly turbocharge return on investment.
When organisational leaders ask questions, they don’t just want fast responses; they need those responses to be trustworthy. By using your own data, wherever it sits, through a data preparation studio that brings current data to AI models in real time, it’s possible to get answers that are not only fast but also more likely to be correct and free from AI hallucinations.