Tonic.ai, the San Francisco-based company pioneering data synthesis solutions for software and AI developers, announced the launch of Tonic Textual, the world's first secure data lakehouse for LLMs, enabling AI developers to seamlessly and securely leverage unstructured data for retrieval-augmented generation (RAG) systems and large language model (LLM) fine-tuning. Tonic Textual is an all-in-one data platform designed to eliminate integration and privacy challenges ahead of RAG ingestion or LLM training, two of the biggest bottlenecks hindering enterprise AI adoption. Leveraging its expertise in data management and realistic synthesis, Tonic.ai has developed a solution that tames and protects siloed, messy, and complex unstructured data, transforming it into AI-ready formats ahead of embedding, fine-tuning, or vector database ingestion.
The Untapped Value of Unstructured Data
Enterprises are rapidly expanding investments in generative AI initiatives across their businesses, motivated by the technology's transformational potential. Optimal deployments must leverage enterprises' proprietary data, which is often stored in messy unstructured formats across various file types and contains sensitive information about customers, employees, and business secrets. IDC estimates that approximately 90% of data generated by enterprises is unstructured, and that in 2023 alone organizations were expected to generate upwards of 73,000 exabytes of unstructured data. To use unstructured data for AI initiatives, it must be extracted from siloed locations and standardized, a time-consuming process that monopolizes developer time. According to a 2023 IDC survey, 50% of companies have mostly or completely siloed unstructured data, and 40% are still manually extracting information from it.
“We’ve heard time and again from our enterprise customers that building scalable, secure unstructured data pipelines is a major blocker to releasing generative AI applications into production,” said Adam Kamor, Co-Founder and Head of Engineering, Tonic.ai. “Textual is specifically architected to meet the complexity, scale, and privacy demands of enterprise unstructured data and allows developers to spend more time on data science and less on data preparation, securely.”
The Importance of Privacy in AI
Data privacy is a paramount concern for enterprise decision makers, particularly when using third-party model services: the same IDC survey reported that 46% of companies cite data privacy compliance as a top challenge in leveraging proprietary unstructured data in AI systems. Organizations must protect sensitive information in that data from model memorization and accidental exfiltration, or risk costly compliance violations.
“AI data privacy is a challenge the Tonic.ai team is uniquely positioned to solve due to their deep experience building privacy-preserving synthetic data solutions,” said George Mathew, Managing Director at Insight Partners. “As enterprises make inroads implementing AI systems as the backbone of their operations, Tonic.ai has built an innovative product in Textual to supply secured data that protects customer information and enables organizations to leverage AI responsibly.”
Introducing the Secure Data Lakehouse for LLMs
Tonic Textual is a first-of-its-kind data lakehouse for generative AI that can be used to seamlessly extract, govern, enrich, and deploy unstructured data for AI development. With Tonic Textual, you can:
- Build, schedule, and automate unstructured data pipelines that extract and transform data into a standardized format convenient for embedding, ingesting into a vector database, or pre-training and fine-tuning LLMs. Textual supports the leading formats for unstructured free-text data out of the box, including TXT, PDF, CSV, TIFF, JPG, PNG, JSON, DOCX, and XLSX.
- Automatically detect, classify, and redact sensitive information in unstructured data, and optionally re-seed redactions with synthetic data to maintain the semantic meaning of your data (a minimal sketch of this workflow follows the list below). Textual leverages proprietary named entity recognition (NER) models trained on a diverse dataset spanning domains, formats, and contexts to ensure that sensitive data is identified and protected in any form it may take.
- Enrich your vector database with document metadata and contextual entity tags to improve retrieval speed and context relevance in RAG systems.
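To make the detect-redact-reseed step concrete, here is a minimal sketch of that pattern built from off-the-shelf open-source components (spaCy for NER and Faker for synthetic values) rather than Tonic Textual's proprietary models or SDK, which this announcement does not document. The entity types and the entity-to-generator mapping are illustrative assumptions.

```python
# Illustrative sketch only: detect entities with a generic NER model and
# replace them with synthetic stand-ins, as described for Tonic Textual above.
# spaCy and Faker are stand-ins for Textual's proprietary NER and synthesis;
# the label-to-generator mapping below is an assumption, not Tonic's API.
# Requires: pip install spacy faker && python -m spacy download en_core_web_sm
import spacy
from faker import Faker

nlp = spacy.load("en_core_web_sm")  # generic English NER model
fake = Faker()

# Map detected entity labels to synthetic generators so replacements keep
# the semantic shape of the original text.
SYNTHETIC = {
    "PERSON": fake.name,
    "ORG": fake.company,
    "GPE": fake.city,
    "DATE": fake.date,
}

def redact_and_reseed(text: str) -> str:
    """Replace detected sensitive entities with synthetic stand-ins."""
    doc = nlp(text)
    pieces, cursor = [], 0
    for ent in doc.ents:
        if ent.label_ in SYNTHETIC:
            pieces.append(text[cursor:ent.start_char])  # keep surrounding text
            pieces.append(SYNTHETIC[ent.label_]())       # swap in a synthetic value
            cursor = ent.end_char
    pieces.append(text[cursor:])
    return "".join(pieces)

print(redact_and_reseed("Jane Doe signed a contract with Acme Corp in Berlin on March 3, 2023."))
```

A production pipeline of the kind described above would add batch scheduling, file-format parsing, and downstream steps such as embedding and vector database ingestion; this sketch covers only the redaction and re-seeding idea.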
Looking ahead, Tonic.ai's roadmap includes plans to add capabilities that further simplify building generative AI systems on proprietary data without compromising privacy for utility, including:
- Native SDK integrations with popular embedding models, vector databases, and AI developer platforms to create fully automated, end-to-end data pipelines that fuel AI systems with high-quality, secure data.
- Additional capabilities for data cataloging, data classification, data quality management, data privacy and compliance reporting, and identity and access management to ensure organizations can utilize generative AI responsibly.
- An expanded library of data connectors, including native integrations with cloud data lakes, object stores, cloud storage and file-sharing platforms, and enterprise SaaS applications, enabling AI systems to connect to data across the entire organization.
“Companies have amassed a staggering amount of unstructured data in the cloud over the last two decades; unfortunately, its complexity and the nascency of analytical methods have prevented its use,” said Oren Yunger, Managing Partner at Notable Capital. “Generative AI has finally unlocked the use case for that data, and Tonic.ai has stepped in to solve the complexity problem in a way that reflects its core mission to transform how businesses handle and leverage sensitive data while still enabling developers to do their best work.”
SOURCE: BusinessWire