A month ago, NFX published this piece, “AI is like water”. The central idea of the post was that AI is going to be so ubiquitous that it is really no different from water - available to everyone with little to no differentiation. Opinions on this are going to diverge…
The scientists are going to argue that their methodology and training sets are unique and thus there is a real difference.
The business people will pretend to care so as not to be outsmarted by their competitors, but ultimately will buy in only once it is clear how the AI addresses their pain point.
Meanwhile, vendors are selling shovels left and right - and this is maybe a good place to start the conversation to better understand whether AI is really like water.
Our move away from Astra DB
One of the “hot” areas that emerged with the appearance of LLMs is that of Vector databases. Vector databases work by encoding each piece of content into a vector representation. Vectors can be compared to one another based on how “similar” they are: similar types of content, whether text or video, will have a similar magnitude and direction (the two attributes that define a vector as a mathematical construct). This has some advantages in the age of AI because LLM workflows are executed in a non-deterministic fashion - meaning we’re always dealing with “similarities” rather than “exacts”. Vector databases are today key to RAG (Retrieval-Augmented Generation) because they help applications identify context relevant to an LLM request based on the similarity between the user’s prompt and the data available to the application.
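To make “similar magnitude and direction” concrete, here is a toy sketch. The three-dimensional vectors and their values are invented purely for illustration - real embeddings have hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Compare two vectors by direction: near 1.0 means very similar,
    near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; the values are made up for illustration.
cat = [0.9, 0.1, 0.2]        # e.g. "a photo of a cat"
kitten = [0.85, 0.15, 0.25]  # similar content points in a similar direction
invoice = [0.1, 0.9, 0.8]    # unrelated content points in a different direction

print(cosine_similarity(cat, kitten))   # close to 1.0
print(cosine_similarity(cat, invoice))  # much lower
```

A vector database is, at its core, an index optimized to answer “which stored vectors are most similar to this one” at scale.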
In principle, Vector databases are not entirely new. Elasticsearch and most search-engine tech rely on similar technology. Having said that, Vector databases are somewhat different from the relational databases commonly used in analytics (e.g. Snowflake, BigQuery), because those databases typically work with exacts: values are compared on whether they are the same, as opposed to how similar they are.
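The exact-versus-similar distinction fits in a few lines. In this sketch, `difflib` stands in for real embedding similarity, and the row values are invented:

```python
import difflib

rows = ["annual report 2023", "quarterly report 2023", "employee handbook"]
query = "anual report 2023"  # note the typo

# Relational-style exact match: the comparison either holds or it doesn't,
# so the typo means zero rows come back.
exact_hits = [r for r in rows if r == query]
assert exact_hits == []

# Similarity-style search: score every row and take the best, instead of
# filtering on equality. difflib stands in here for embedding similarity.
def similarity(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

best_match = max(rows, key=lambda r: similarity(query, r))
assert best_match == "annual report 2023"
```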
As a company that is actively building AI applications ourselves, we rely heavily on Vector databases. We use a Vector database not only at prompt time, to retrieve relevant data, but also during training for our own internal LLM fine-tuning. So a Vector database is almost as critical to our workflow as the LLM itself.
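A minimal sketch of that retrieval step, with a bag-of-words `fake_embed` standing in for a real embedding model and an invented three-document corpus:

```python
import math

def fake_embed(text):
    """Stand-in for a real embedding model: a bag-of-words frequency map."""
    words = text.lower().replace("?", "").replace(".", "").split()
    vec = {}
    for w in words:
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Invented corpus, "indexed" by embedding each chunk once up front.
corpus = [
    "Vector databases rank content by similarity.",
    "Costco sells groceries in bulk.",
    "RAG retrieves context relevant to the user's prompt.",
]
index = [(doc, fake_embed(doc)) for doc in corpus]

def retrieve(prompt, k=1):
    """The RAG retrieval step: embed the prompt, return the k nearest chunks."""
    q = fake_embed(prompt)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# The retrieved chunks become the context that gets prepended to the LLM call.
context = retrieve("How does RAG find context relevant to a prompt?")
llm_input = "Context:\n" + "\n".join(context) + "\n\nQuestion: ..."
```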
Recently, we made a decision to move away from Datastax’s Astra DB to Azure’s Vector database equivalent, which Microsoft calls “AI Search”. Microsoft’s implementation is not in any way superior to Astra’s. To be honest, I would not even know where to start evaluating the difference. And frankly, I don’t much care.
Wait, What?
Yep, you heard me right: I don’t much care about the quality of the critical component in our AI technology stack. And this is what much of the current tech scene is getting completely wrong! Allow me to explain.
In technology, every API is a form of a contract. A contract on basic properties between the consumer of the thing and the producer of the thing. One way to think about this is via an analogy: grocery shopping.
When you go grocery shopping, you expect that there will be food at the store. You expect that you will be able to pay for that food. And you expect that the food costs roughly the same amount it cost last week. Your basic contract with the store is thus: availability, purchasability, and price range. You might still not be happy if the store is, for example, out of bananas the specific hour you visit, but it is likely not in your basic expectations that they always have bananas - unless this is a bigger chain like Whole Foods.
To build on this analogy, the contract we have today with Vector databases includes basic properties such as speed, storage capacity, and price. There are other properties in that contract too, but none of them says “quality of the retrieval” or “accuracy relative to the original input”. And this is the crux that is important to understand. End-user applications guarantee quality - but developer-facing applications do not. They can only guarantee consistency in quality. This is a hard idea to digest, so an analogy is perhaps warranted here again.
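One way to make this concrete is to write the contract down as an interface. The sketch below is hypothetical - the class and attribute names are mine, not any vendor’s API - and the point is what the contract covers versus what it conspicuously omits:

```python
from abc import ABC, abstractmethod
from typing import List

class VectorStoreContract(ABC):
    """A developer-facing contract: the properties a vendor actually promises."""

    # What the contract does cover: speed, storage capacity, price.
    max_query_latency_ms: int = 50
    max_vectors: int = 1_000_000
    price_per_month_usd: float = 99.0

    @abstractmethod
    def query(self, vector: List[float], top_k: int) -> List[str]:
        """Return *some* top_k nearest items.

        Note what is absent: nothing here promises the results are the
        *right* context - only that they come back consistently, on time,
        and at the agreed price.
        """
```

Seen this way, switching vendors means swapping one implementation of the same contract for another, not re-litigating quality.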
Imagine you’re running a restaurant. Every day you have to guarantee that the items on your menu are available and that the quality of the prepared food is the same as yesterday. Notice that the end consumer does not much care that you went grocery shopping at Store A, where some ingredient wasn’t available. The patron’s expectations of your restaurant are set independent of the grocery stores you visit and what happens in them. Restaurants are like end-user application developers; grocery stores are like AI vendors. Actually, food and restaurants are a great analogy for a lot of what happens in technology - I highly recommend reading my earlier piece:
Guaranteeing Quality despite the Uncertainty
Imagine you open your restaurant in some region of Latin America, where the supply of food at a local grocery store is highly volatile. You create a menu and set prices, but every other day half of your ingredients are not going to be available at a local store, so you have to go shopping at another.
Now, imagine Costco, the large food retail chain, moves in next door and opens a 100,000-square-foot store with a reliable and predictable supply of all food produce. You know their food quality is subpar, but both the quality and the selection are highly consistent. Regardless of your philosophical views on how Costco affects local economies, unless you are building a 3-Michelin-star fine-dining experience, you are probably going to buy your ingredients from Costco - not from the small grocery store.
This is why we moved to Azure…
AI is like a Database
Coming back to the NFX post, while I too agree that differentiation at the application level is what it is all about, the analogy of comparing AI to water is wrong. The better analogy is that AI is like the world of databases.
Companies have been building on top of databases for 40+ years. Do databases make companies? No! Did applications on top of databases require marketing, product differentiation, etc.? Yes, of course! Was there absolutely no technical barrier to all those applications? No, of course there was!
Having said that, the database comparison is important for another reason. On the one hand, databases are as commoditized as technology gets. Almost all underlying database tech is open-sourced and available for anyone to reverse engineer and sell as a new database. Amazon’s Redshift was a fork of open-source Postgres. More recently, an upstart relational database company, Neon, was also built on top of PostgreSQL. MariaDB was famously forked from open-source MySQL and repackaged as a separate database.
On the other hand, databases have these properties which act almost like those of contracts when it comes to developer expectations. Redshift’s contract was that you were getting a faster Postgres at a fraction of Oracle’s cost. Snowflake’s contract was that you were getting an on-demand managed Redshift. MariaDB’s contract was that you were getting MySQL without Oracle Inc.
As AI moves along its maturity curve, I think it’s prudent to think about AI in those same terms. Major AI providers today are like database vendors - they set expectations around throughput and model consistency, but they are ultimately not responsible for the quality of the results in their responses.
So not at all like Water!
While the NFX article is correct to conclude that differentiation and customer-facing applications will make all the difference, I think it confuses developer-facing AI providers (shovels) with end-user applications. End-user applications will continue to require differentiation, distribution, etc. - all the same things they have required since…forever. Raw developer-facing vendors will be evaluated on things like economies of scale - hence why Azure, for example, makes more sense than Datastax’s Astra.
All that said, it is not always easy to tell the difference between shovels and end-user applications. For example, what happens if an end-user application is aimed at developers? It used to be that marketing around developer-facing products was boring, so recognizing where you were in the stack was easy. But now every developer-facing product has case studies and solution guides attempting to communicate value to business users in addition to developers. As a result, no one now knows who is selling shovels and who is selling the gold itself.
Trying to figure out that difference is ultimately the confusing position every VC finds themselves in right now.