
Mining Unstructured Data for AI Gold

Prateek Kansal, Head of Engineering, India Operations, Komprise

AI needs unstructured data, but its sheer volume and its distribution across data silos make it hard to structure and leverage in AI tools. Most enterprises lack detailed insight into all their data, or into its value to the organization. There are also security and governance challenges in using corporate data with AI. Prateek Kansal, Head of Engineering, India Operations at Komprise, answers our questions on the challenges and the steps ahead for preparing unstructured data for AI initiatives.


AI is continuing to heat up across many sectors, yet we hear that its implementation at the enterprise level is still very nascent. What’s going on?

This is a classic case of the market being ready before the customer. There are plenty of options for AI infrastructure, from on-premises to the cloud, as well as an incredible array of SaaS and AI development tools. The problem is that organizations need to prepare their data. Unstructured data is massive, reaching as much as 50 to 100 petabytes in sectors like healthcare, life sciences and financial services. IT rarely has full visibility into it or easy ways to understand it. That is the first problem: it's big and unwieldy. Beyond that, file and object data, meaning anything not stored in a database, has varying degrees of quality. It needs structure and context so that it can be used, and it needs to be scrubbed of any erroneous or misleading information. There's so much to do! These are problems that cloud expert David Linthicum says have been ignored for too long, leading to the common issue of “messy data.”

Why is unstructured data so much harder to deal with for AI?

Structured data already lives in a database with a defined, understood schema, which makes it easy to import into an AI or big data analytics tool. Unstructured data is far larger and has no inherent structure. Most organizations have a lot of user and application data and don't even know how much they have. So first, you need a way to create an index or catalog of it so that the data is visible and searchable. Then you need a tool that can run analytics on all your data across storage: what is valuable, what is obsolete, and who is using it? That is the first step to managing unstructured data and making it usable for AI.
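To make that first step concrete, here is a minimal sketch in Python of such an index. It is an illustration rather than Komprise's implementation; the share paths and the one-year staleness threshold are assumptions for the example.

```python
import time
from pathlib import Path

# Hypothetical mount points for the file shares to index; adjust to your environment.
SHARES = ["/mnt/nas1", "/mnt/nas2"]
OBSOLETE_AFTER_DAYS = 365  # illustrative staleness threshold

def build_index(roots):
    """Walk each share and record the metadata needed to judge value and staleness."""
    cutoff = time.time() - OBSOLETE_AFTER_DAYS * 86400
    index = []
    for root in roots:
        for path in Path(root).rglob("*"):
            if not path.is_file():
                continue
            st = path.stat()
            index.append({
                "path": str(path),
                "size_bytes": st.st_size,
                "owner_uid": st.st_uid,            # who is using it
                "last_accessed": st.st_atime,
                "obsolete": st.st_atime < cutoff,  # untouched for over a year
            })
    return index

if __name__ == "__main__":
    idx = build_index(SHARES)
    cold = sum(f["size_bytes"] for f in idx if f["obsolete"])
    print(f"{len(idx)} files indexed; {cold / 1e12:.2f} TB not accessed in a year")
```

Even a flat catalog like this answers the three questions above: the size fields show what is big, the access times show what is obsolete, and the owner IDs show who is using it.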

Assuming that an organization can get this done, what’s next?

Once you have an index or catalog of all your data, there are steps you can take to add structure to it. One such step is segmentation and classification. First, you want to ensure that sensitive data is handled correctly. AI tools like Amazon Macie and Azure AI Search can help by routinely filtering data across your environment for PII and other data types that should be protected from use in external tools such as GenAI. Integrated with an unstructured data management solution, they let you create a policy that continually finds sensitive data, tags it as “PII” and moves it to a secure location.

For other types of data classification, tagging or metadata enrichment is invaluable for researchers because it lets them find the data sets they need much faster. You could, again, use AI tools to inspect the contents of files and then filter them on keywords such as project name, demographic data, file type, geographic region, or any other variable, such as diagnosis in healthcare. Now you have a smaller, discrete data set that is discoverable in minutes and can be used and reused across analytics projects. Since most AI solutions have a pay-per-use billing model, this also cuts costs. A global index that keeps track of the labels and tags from AI means users can find discrete data sets without having to run the AI process again.
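As an illustration of that tag-and-quarantine policy, here is a simplified Python sketch. The regular expressions stand in for a managed classifier such as Amazon Macie, and the patterns, share path and quarantine location are all assumptions for the example.

```python
import re
import shutil
from pathlib import Path

# Illustrative regexes standing in for a managed PII classifier such as Amazon Macie.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}
QUARANTINE = Path("/secure/quarantine")  # hypothetical secure location
tags = {}                                # stand-in for the global index of labels

def classify_and_move(path: Path) -> None:
    """Tag a file as PII if any pattern matches, then move it out of the general pool."""
    text = path.read_text(errors="ignore")
    hits = [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]
    if hits:
        tags[str(path)] = ["PII"] + hits
        QUARANTINE.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(QUARANTINE / path.name))

for f in Path("/mnt/nas1").rglob("*.txt"):  # scan one hypothetical share
    classify_and_move(f)
```

In practice the tags would be written back to a global index so that files already classified never need to be scanned again.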

What are some other considerations for data protection with AI?

There are a lot of security, privacy, accuracy and even ethical risks in using corporate data with AI. You need to be careful when using generative AI because it is getting trained on your data. IT needs automated ways to track what corporate data was fed to which AI process so there is an audit trail. It's equally important to create policies that are enforceable, such as not sharing sensitive data with GenAI tools. A combination of policies, training and automation is important, because many organizations are figuring this out on the fly while people are already using GenAI on the job.
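One way to implement the audit trail described here is an append-only log recording which file was fed to which AI process. Below is a minimal Python sketch; the log file name and the fields recorded are assumptions for illustration.

```python
import getpass
import hashlib
import json
from datetime import datetime, timezone

AUDIT_LOG = "ai_data_audit.jsonl"  # hypothetical append-only JSON-lines audit trail

def record_ai_use(file_path: str, ai_service: str, purpose: str) -> None:
    """Append one audit entry for each file shared with an AI process."""
    with open(file_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()  # fingerprint of what was sent
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": getpass.getuser(),
        "file": file_path,
        "sha256": digest,
        "ai_service": ai_service,  # e.g. an internal RAG chatbot
        "purpose": purpose,
    }
    with open(AUDIT_LOG, "a") as log:
        log.write(json.dumps(entry) + "\n")

# Example usage (the path and service name are hypothetical):
# record_ai_use("contracts/q3_agreement.pdf", "doc-summarizer", "contract review")
```

Hashing the file contents makes each entry verifiable later: you can prove exactly which version of a document was shared with which service.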

There is certainly a lot of hard work that goes into managing your data for AI. So what is the exciting part of all this for customers?

There really is so much potential, and you don't have to spend millions of dollars or create an entirely new business unit to benefit from AI. One example is a chatbot that answers customer questions during an order process, finding relevant information across the enterprise data estate to deliver accurate responses quickly. A clinical researcher could use an AI tool to rapidly filter large image data sets for mutations, saving hours or even weeks of research time. AI does not need to be complex. It does need a strategy, with the right people involved, to protect corporate data and to make sure there is a way to assess outcomes for accuracy.

How do you see AI evolving from the customer deployment standpoint?

AI technologies are complicated to develop, host and manage, and a notable barrier is expertise: many IT organizations lack specialized skills in coding and AI. The market is innovating around this with no-code and low-code development tools, and there is a growing need for streamlined, point-and-click solutions for AI data workflows. We'll also see open ecosystems of complementary technologies emerge that let IT and business users select tools and build projects end to end: gathering and classifying the right data sets, applying security and governance, monitoring the outcomes, and finally moving the data sets to an archive location once the project is done.

How does Komprise Intelligent Data Management fit into all of this?

Komprise is an unstructured data management solution that brings visibility and analytics to an enterprise's entire data estate so that IT can make the best decisions about where data is stored. We deliver a Global File Index for granular search and tagging of data so that users can find and move the right data to AI, ML or cloud services. A newer technology, Komprise Smart Data Workflows, lets users create custom workflows for any use case. The point is to automate finding, tagging and moving the exact files you want to a data lake or AI tool, so that data scientists can spend the bulk of their time on analysis rather than data preparation.

About the Author

Prateek Kansal is the Head of Engineering, India Operations, at Komprise. Over his career, he has built and led cross-functional teams delivering scalable products in Bangalore and in the Bay Area. He has hands-on experience in product development, project management, and building and managing teams with varied expertise to deliver products in a fast-paced environment. Prateek can be found on LinkedIn.
