In the world of modern tech, data has become the gold of AI applications. Building a successful AI application requires more than selecting the tech stack; it demands a well-crafted approach to data. While full-stack applications rely on traditional data access and storage, AI applications require more consideration because they are probabilistic, covering wide spaces of possible inputs and outputs. In this article you’ll learn about the key aspects of an AI data strategy, addressing everything from the differences between AI and full-stack applications to privacy, unit economics, and the nuanced decision-making involved in developing your AI solution.
At the core of any AI application lies the model—a system designed to predict, classify, or generate output based on patterns in data. Unlike full-stack applications, which rely heavily on deterministic systems (i.e., if X happens, do Y), AI systems are probabilistic by nature. This means that their outputs are predictions or estimates, not guaranteed results.
A critical aspect of these probabilistic systems is the accuracy threshold: a model’s accuracy improves with the quality and quantity of its data. Industry wisdom puts it bluntly: “It’s easy to get an AI to 80% accuracy, but painfully difficult to capture the remaining 20%.”
As a result, to move beyond minimal accuracy thresholds and ensure that your AI application remains performant, you need a sizable quantity of high-quality data. This is not a one-time investment, however, but an ongoing process that fuels the iterative improvement of AI systems over time: “Build a base AI and grow it over time”.
When developing an AI application, there are several fundamental questions every company should ask about data and its strategy.
One of the first questions to consider is the complexity of the task you are solving. This can often be measured by the number of input and output variables and the strength of the relationship between them. More variables mean higher complexity.
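As a rough, hedged sketch of how you might quantify this for tabular data, the snippet below uses mutual information as one proxy for the strength of the input-output relationship (the data and variable names are synthetic and purely illustrative):

```python
# A minimal sketch of sizing up task complexity before committing to a build.
# Mutual information is one proxy for how strongly each input relates to the
# output. The data here is synthetic; plug in your own.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(42)
X = rng.normal(size=(1_000, 5))  # 5 hypothetical input variables
y = 2.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.5, size=1_000)

scores = mutual_info_regression(X, y, random_state=42)
for i, score in enumerate(scores):
    print(f"input_{i}: mutual information with output = {score:.3f}")

# Strong scores concentrated on a few inputs suggest a simpler task; weak,
# diffuse scores across many inputs suggest higher complexity (and a bigger
# data bill).
```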
There’s been a shift toward prioritizing the quality of data over its quantity (GIGO: garbage in, garbage out). It’s crucial to find the right balance between having enough data to train your model and ensuring that data is representative and reliable. Quality should never be sacrificed for sheer volume. In many projects, data cleaning ends up consuming a large share of the total effort.
AI is a data-hungry technology, and acquiring enough data can be expensive. Whether you're collecting data through APIs, scraping, purchasing datasets, or building integrations, you need to evaluate the data acquisition cost (DAC). This lets you calculate the return on investment of each AI feature and prioritize your feature builds in a way that minimizes overall risk, so you don’t overcommit and underdeliver.
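A hedged back-of-the-envelope might look like the following, where every number and feature name is hypothetical:

```python
# Back-of-the-envelope data acquisition cost (DAC) and feature ROI.
# All numbers and feature names are hypothetical; plug in your own estimates.
features = {
    # name:            (data points needed, cost per point $, expected annual value $)
    "smart_reply":     (50_000, 0.02, 40_000),
    "doc_summarizer":  (10_000, 0.50, 25_000),
    "risk_classifier": (200_000, 0.05, 60_000),
}

for name, (n_points, cost_per_point, value) in features.items():
    dac = n_points * cost_per_point
    roi = (value - dac) / dac
    print(f"{name}: DAC=${dac:,.0f}, ROI={roi:.1f}x")

# Ranking features by ROI (and by how confident you are in the estimates)
# helps you sequence the roadmap so you don't overcommit and underdeliver.
```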
Data governance is another consideration, especially for enterprises. With increasing regulation such as the GDPR, the EU AI Act, and others, companies must put compliant safeguards in place. AI product leaders need to develop strategies that emphasize trust and safety by ensuring privacy, fairness, and transparency in their models.
Large Language Models (LLMs) have become a focus of many AI builds today, but they are not a universal solution. The decision to use an LLM depends largely on the complexity of the task you need to solve and the accuracy you need to attain.
LLMs are really good at certain language tasks, like writing text or summarizing information. But as the task gets more complex, requiring specialized knowledge or nuanced reasoning, a generic LLM might not be enough. Much of the new work in generative AI is about creating a framework of smaller, more manageable tasks that the LLM can handle, then putting those smaller results together to tackle the bigger problem. So it's not just about using one or two LLMs and calling it a day: you need to break the problem down into smaller pieces the AI can handle, and then put those pieces back together in a smart way.
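As a minimal sketch of this decomposition pattern (the `call_llm` stub and the contract-review scenario below are hypothetical stand-ins, not any specific provider's API):

```python
# A minimal sketch of breaking one big problem into smaller LLM tasks.
# `call_llm` is a hypothetical stub: plug in whatever client you actually use.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client of choice")

def answer_contract_question(contract_text: str, question: str) -> str:
    # Step 1: narrow the context to the relevant clauses (small, checkable task)
    clauses = call_llm(
        f"List the clauses in this contract relevant to: {question}\n\n{contract_text}"
    )
    # Step 2: answer only from the extracted clauses (small, checkable task)
    draft = call_llm(f"Using only these clauses:\n{clauses}\n\nAnswer: {question}")
    # Step 3: verify the draft against the clauses before returning it
    return call_llm(
        f"Check this answer against the clauses and correct any errors:\n{draft}\n\n{clauses}"
    )
```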
Fine-tuning your own LLM can be incredibly powerful, but it requires careful consideration.
A good rule of thumb when considering fine-tuning is the 10x rule—for every feature or decision component in your model, aim to have at least 10 relevant data points. This ensures that your model has enough information to learn effectively. For more complex tasks, even more data may be required.
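As simple arithmetic (a floor, not a guarantee; the numbers here are illustrative):

```python
# The 10x rule as arithmetic: at least 10 data points per feature or
# decision component. Treat the result as a lower bound, not a target.
def minimum_examples(n_components: int, multiplier: int = 10) -> int:
    return n_components * multiplier

print(minimum_examples(40))       # 400 examples as a floor
print(minimum_examples(40, 50))   # complex tasks may warrant far more
```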
Saliency analysis is one way to determine whether your model is sensitive to the right features. This technique evaluates how much the model's predictions change in response to changes in the input data. If the output changes with respect to an input as expected, your model is learning from the right signals. If not, that may be a sign you need higher-quality or more representative data for fine-tuning.
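Here is a minimal sketch of perturbation-based saliency: nudge each input feature and watch how much the prediction moves. The linear model and synthetic data are purely illustrative.

```python
# Perturbation-based saliency: perturb each input feature slightly and
# measure how much the model's predictions move in response.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 2] + rng.normal(scale=0.1, size=500)  # only feature 2 matters
model = LinearRegression().fit(X, y)

eps = 1e-2
base = model.predict(X)
for j in range(X.shape[1]):
    X_perturbed = X.copy()
    X_perturbed[:, j] += eps
    saliency = np.abs(model.predict(X_perturbed) - base).mean() / eps
    print(f"feature {j}: saliency = {saliency:.2f}")

# Expect feature 2 to dominate. If an irrelevant feature scores high instead,
# that's a signal your training data may be teaching the model the wrong thing.
```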
Initially, you can experiment with improving prompts for better output. As demands grow, you might explore techniques like Retrieval-Augmented Generation (RAG), where the model pulls in relevant information dynamically from external sources. This can evolve further into Dynamic RAG, where retrieval becomes more flexible and real-time, often backed by Knowledge Graphs (KGs), and eventually, if needed, into full fine-tuning.
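A minimal RAG sketch might look like the following, with TF-IDF standing in for a proper embedding model to keep things self-contained, and `call_llm` again a hypothetical client stub:

```python
# A minimal RAG sketch: retrieve the most relevant snippets, then stuff them
# into the prompt. TF-IDF stands in for an embedding model here; a production
# system would typically use dense embeddings and a vector store.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client of choice")

documents = [
    "Refunds are processed within 14 days of a return request.",
    "Premium subscribers get priority support via chat.",
    "Shipping to the EU takes 3-5 business days.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    vec = TfidfVectorizer().fit(documents + [query])
    scores = cosine_similarity(vec.transform([query]), vec.transform(documents))[0]
    return [documents[i] for i in scores.argsort()[::-1][:k]]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    return call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```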
Historically, startups have focused on building a Minimum Viable Product (MVP), but in AI there’s a shift toward building a Minimum Viable AI (MVA): deploying a base model that is improved iteratively based on real-world usage data. Constant feedback loops help you refine the model based on actual human interaction rather than theoretical performance. Prioritize the features where user expectations are low and achievable accuracy is high.
Building an AI application is not just a question of technical feasibility but also of the cost and resourcing of data. To answer whether you can afford it, you need to evaluate several factors:
Every model is powered by data, and the economics of acquiring and managing that data are critical. What is the cost of each data point? How much will you need to invest in maintaining and growing your data pipeline? These costs need to be weighed against the potential returns from your AI product.
AI applications rarely function in isolation. They often need to be integrated with other tools, systems, or databases. The cost and complexity of these integrations can significantly impact the affordability of your project. Whether you're connecting to customer data, APIs, or internal data lakes, each integration adds both financial and operational overhead.
Calling LLM APIs can be costly depending on the model. It’s useful to set a ‘budget per query’: does it make sense to spend $0.01, $0.10, or $5.00 per AI answer or conversation? The answer may differ wildly depending on whether you’re doing high-volume customer support or aiding lawyers. Fortunately, the cost of AI is still dropping at a rapid clip; use cases that don’t seem economical today may be economical in a year.
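A back-of-the-envelope check might look like this; the token prices below are illustrative placeholders, not any provider's actual rates:

```python
# Budget-per-query check: estimated API cost versus the value of an answer.
# Prices are illustrative placeholders, not any provider's actual rates.
PRICE_PER_1K_INPUT = 0.0005   # $ per 1,000 input tokens (hypothetical)
PRICE_PER_1K_OUTPUT = 0.0015  # $ per 1,000 output tokens (hypothetical)

def cost_per_query(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000) * PRICE_PER_1K_INPUT + \
           (output_tokens / 1_000) * PRICE_PER_1K_OUTPUT

budget = 0.01  # high-volume support might cap each answer at a cent
c = cost_per_query(input_tokens=2_000, output_tokens=500)
print(f"${c:.4f} per query, {'within' if c <= budget else 'over'} budget")
```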
The future of AI applications depends on the data you acquire: your gold. As such, teams need to think through their data strategy, with two major points covered in this article:
identify your core task, measure the quality and quantity of the data you require, and calculate your unit economics (data acquisition cost, or DAC) to ensure you can reach the accuracy threshold of your use case. Consider the cost of integrations as well.
organize your AI product features in terms of user expectations, value created, and ease of creation based on the data you have access to. Once you have your base AI model, you can iteratively improve it based on user data.
In this rapidly advancing industry, a well-thought-out data strategy can create long-run defensibility for your AI product and company. It’s a core part of designing an AI engineering roadmap and deserves the full attention of executives.
Need help with your AI roadmap? Book a call now.