Databases are the unsung heroes of modern technology. They power everything from your favorite social media apps to the analytics dashboards that businesses rely on for decision-making. But with over 300+ databases available, each designed for specific use cases, choosing the right one can feel overwhelming. Whether you’re a seasoned developer or an aspiring PM, understanding databases is crucial for building efficient, scalable, and reliable systems.
In this guide, we’ll explore databases in depth, using relatable metaphors, real-world examples, and clear explanations to help you understand how they work, why they matter, and how to choose the right one for your needs.
Let’s start with a metaphor: baking bread.
When baking, the type of flour you use determines the texture, flavor, and structure of your bread. High-protein flour (like bread flour) creates a chewy, dense loaf, perfect for pizza or bagels. Low-protein flour (like cake flour) makes light, flaky pastries. All-purpose flour works for everything, but it’s not ideal for specialized recipes.
Databases are like flour. Each one is designed for specific use cases. You can use any database for anything, but choosing the right one ensures the best possible bread outcome. To understand which database to use, ask two key questions:
What does the data look like?
What do you need to do with it?
Let’s break these down.
Databases are optimized for specific types of data. While you can store any piece of data in any database, the way data is inserted and retrieved matters most. Postgres is relational, but supports big blobs of text. Nothing is stopping you from storing your user data in Elastic. Nevertheless, many databases are designed with specific types or shapes of data in mind.
Here’s a closer look at how data shape influences database choice:
What it is: Data with a clear, predefined schema (e.g., tables with rows and columns).
Example: User profiles, orders, or inventory.
Best databases: Relational databases like Postgres, MySQL, or MSSQL.
What it is: Data without a fixed schema (e.g., JSON documents, logs, or social media posts).
Example: Tweets, sensor data, or product reviews.
Best databases: NoSQL databases like MongoDB, Cassandra, or Elasticsearch.
What it is: Data with complex relationships (e.g., social networks, recommendation engines).
Example: Friend connections on Facebook or product recommendations on Amazon.
Best databases: Graph databases like Neo4J.
What it is: Data indexed by time (e.g., stock prices, sensor readings).
Example: Hourly temperature data or daily financial transactions.
Best databases: Time-series databases like InfluxDB or TimescaleDB.
Different types of data – and even similar types of data for different use cases – get used in different ways. How you use the data determines the database you need. Here are some common use cases and the databases that excel in them:
These are the databases that you use for the core data in your application, think your user data. Data is being read and inserted constantly. Queries are usually small and quick. Data integrity is key.
What it is: Real-time, transactional data processing (e.g., user actions, payments).
Example: Adding items to a shopping cart or updating a user profile.
Best databases: Relational databases like Postgres or MySQL.
These are the databases that you use for analytics and data science, think Snowflake etc. Data is being read and inserted sporadically. Queries are usually large and complex.
What it is: Complex queries on large datasets for analytics or business intelligence.
Example: Calculating monthly revenue or analyzing user behavior.
Best databases: Data warehouses like Snowflake, BigQuery, or Redshift.
What it is: Logs, monitoring, caching, or internal processes.
Example: Storing server logs or managing user sessions.
Best databases: Elasticsearch for logs, Redis for caching, or Prometheus for monitoring.
The fact that users of the database need to do different things with the data inside of it necessitates storing said data differently.
OLTP vs. OLAP is one example, but there are many different ways to break down those use cases. Some databases have special UIs on top for analysis (like Kibana and Elastic). Others are built for massive scale from the start. Some are made to handle many different situations decently well. No matter what database you’re dealing with, thinking about what kind of data it stores and how you need to use that data is the key to understanding each.
Crude as it may be, I find it useful to think of 3 major categories of databases. Some databases can overlap and cross categories. For example, there are NoSQL databases built for powering your app, but there are also NoSQL databases meant for in-memory stuff (e.g. Redis is technically NoSQL). Here are the major ones:
Databases that power a user-facing app – production databases that store the data you need for your app to run.
Databases that power analytics – databases for analysis, machine learning, and anything a data team does.
Databases that power operations – databases for monitoring, logs, security, and any internal processes that enable the above.
Let’s run through each category and cover a few examples for each.
Databases that power a user-facing app – often referred to as production databases or production data stores – are where developers store and query whatever data their app needs to run. Any “data” you’re seeing on your screen sits here.
If you’re Twitter: tweets, user profiles, trending topics, DMs
If you’re Gmail: emails, settings, spam filters
If you’re Amazon: orders, users, credit cards
Production databases are built to support small and quick queries. The most important thing is data integrity and reliability – you don’t want to lose an order or a credit card.
Every company’s data is different, and for some it makes the most sense to model things as a sort of graph of interconnected nodes. Facebook famously runs their social graph on an in house graph database called Tao. The use cases for graph DBs are mostly user-facing, but they’re sometimes useful for analytical purposes too.
What a data team needs from their database is very different from developers working on an app. For data science and analytics work, the data you store is often redundant, there’s tons of it, and your queries join data from multiple tables at once. Data usually gets queried by someone sitting at a computer doing research, or a system building a pipeline, and gets inserted at regular intervals (twice a day, something like that).
Data warehouses are (usually) relational databases for storing analytical data, like what your users have been doing, revenue by month, things like that. They’re optimized for big, long, multi-table queries. They’re usually relational databases in nature, although the implementation details can vary.
🤔 Why can’t I use Postgres as a data warehouse?
You can, people did for a long time, and some still do. But modern cloud data warehouses are just waaaaay faster for large queries on meaningful quantities of data. This is the thing about databases: you can use anything for anything, but there comes a point where you need a specialized tool for the job.
Though it’s unlikely you’ll see it in production today, HDFS was a highly popular way to build a data lake before the cloud data warehouse era. It’s powered by Hadoop, one of the original frameworks for performing distributed computations on large groups of data. It was/is notorious for being very, very hard to set up and run.
This is a bit of a niche one, but worth mentioning: there’s a class of databases (or in some cases, database extensions) that are purpose built for working with geographical data. GIS stands for Geographic Information System. In terms of the actual data being stored here, it can be anything from points and lines to complex 3D data; structures that don’t map (no pun intended) well to traditional database schemas.
This is the category you’ll probably encounter least if you’re not an engineer. The 3rd category of databases covers data stores that developers use to power internal operations: monitoring application performance, storing logs and security information, improving application speed, or even intermediate layers between other databases.
While pretty much all of the databases we’ve covered so far store permanent data – saved on a harddrive – there’s a class of databases that only keeps data in memory. They’re meant for ephemeral data that you need to store and use quickly, but you don’t mind if it disappears down the road. These databases are called key value stores (or KV stores) because the way they store data is like a dictionary, where each entry has a key.
Time series databases are DBs built specifically for storing data on some sort of time frame: daily financial data, second-by-second sensor readings, hour-by-hour health checks on your servers, anything like that.