The Basis for Databases

9 min readJun 7, 2021

Objectives:

To understand what a database is and how people use them
To learn about the most popular database models, along with their pros and cons
To feel equipped with tools/resources to learn more about databases

What is a database?

Databases are collections of logically stored information. They can be as complicated as AT&T’s user information data (323 terabytes in storage) and as simple as a personal diary. If you have ever written a to-do list, operated a TV channel guide, or referenced a dictionary, then you have interacted with a database! People use databases to enhance their day-to-day lives, while companies can use them to optimize their processes.

What do we do with a database?

Since the concept of a database is so broad, its applications are seemingly boundless. Every professional sports team analyzes video films from opponents’ previous games to make smarter decisions when it’s time to play. A college student studying for a final exam might (and should) look at previous exam results in the course to determine what to study. The list goes on and on — interpreting information is an essential part of how we interact with the world around us.

We generally interpret a database’s content through what are known as queries. A query is, at its simplest, an attempt to retrieve information from a database. As an example, consider the dummy data below:

One could write a query to return the year which yielded the highest earnings. Other potential queries from this mini-database might include the mean earnings over the four years, total earnings excluding 2020, and even projected earnings in 2021. If we add more columns of data for each year, it is easy to imagine how the number of potential queries can expand rapidly.

Why can’t we just use Excel for every database?

One of the most common visualizations of a database would be an Excel spreadsheet (or Google Sheets/any 2x2 matrix). This type of structure is referred to as a flat database. It could look something like the following:

In this spreadsheet, each row is a new observation: a unique new entry to the dataset. In turn, each column contains characteristics of the observations.

This database structure is relatively easy to read and draw conclusions from, but it has its limitations. To understand these limitations, we must once again consider our broad definition of a database: any collection of logically stored data. As an example, let’s consider a family tree:

This tree is certainly considered data, but trying to store it in an Excel spreadsheet would be incredibly awkward. To demonstrate, we tried to convert the family tree into a spreadsheet of our own:

This database has a lot of ambiguity. One may wonder whether Vicky is Emily’s child or only Craig’s, and might be unsure who Theresa and Alan’s children are. There need to be clearer indicators of a parent-child relationship. Consider our second attempt at storing the data in a spreadsheet:

The data now has a lot less ambiguity — There is no confusion as to whose children are whose, and the family is connected logically. However, this is still far from perfect. Consider the query of determining the relationship between Freddie and Charlotte. One may correctly determine that Charlotte is Freddie’s aunt, but doing so would be far easier if we had just visually analyzed the original family tree.

In essence, different data often requires different database structures. Our family tree could be stored as a spreadsheet, but to do so would make certain queries more complicated than necessary. Instead, it would be wise of us to find a different model, or structure, for our database.

What are the most popular database models?

In this article, we will take a look at three of the most popular database models: Relational, document, and graph. For further information, this article summarizes 10 popular models.

Relational databases:

This model can often look very similar to the flat model analyzed above. The main additional feature of a relational database is that it can contain multiple flat databases which relate to each other. To better understand this, consider these two flat databases:

Despite being different sizes and containing different rows/columns, these databases are related by ice cream flavor. By using both databases in conjunction with each other, we can find answers to more queries than if we only used one. For example, I know that Brandon’s ice cream flavor preference has more calories than Catherine’s.

In more formal examples of this model, we refer to the information in our common (related) columns as keys. Often, rather than using words as our keys, we will use an integer key. This is done for two main reasons:

To prevent typos from ruining our automated querying (i.e “Chcolate” won’t be considered the same as “Chocolate”).
To save disk space, especially in large databases. Integers take up less storage than words do.

Arranging our database in this fashion allows us to cluster our data in a way that makes logical sense. We have a “people” database and an “ice cream” database. Splitting the information between these databases allows for a clear interpretation of the data. Compare the above model with the one below, where we attempt to merge all the data:

Is Alex 145 calories? Or do the calories refer to his favorite ice cream flavor? Common sense allows us to answer this, but different data might lead to serious confusion when we attempt this merge.

While relational databases give us a lot of additional functionality, they certainly have their drawbacks. For example, suppose the owner of the above database receives new information in the table below:

This data is close, but not identical, to the format above. Unfortunately, this might lead to some difficulties in adding this data. In short, relational databases struggle when new observations don’t meet the rigid format of the old ones.

Document databases:

These databases are designed similarly to relational databases, with a few key differences. Document databases contain different documents of data (such as a person or ice cream flavor above) too, but the documents are not explicitly related to each other. Below is an ERD (Entity Relationship Diagram) of a document database:

Here we have four different types of documents: Group member, family member, pet, and course. The relationships between document types are shown by the connecting lines. Now let’s look at the breakdown of an individual document:

This group member has 14 different key-value pairs, where each “key” (defined the same as above) has an associated value. We use these pairs to answer queries — Figuring out the oldest group member, for instance, would require looking at the value for “age” of each group member.

The main benefit of using a document database is that it allows us to quickly locate information about any singular document, without having to traverse our entire database. If I want to quickly reference all the information about a particular group member, I can simply pull up that member’s document and would be able to look through any information I’m curious about. No referencing of an additional database is needed!

One large drawback of document databases is that documents don’t relate to each other as smoothly as relational or graph models. Queries that require spanning through different documents can end up being unnecessarily slow.

In short — For disparate, unrelated data about different observations, document databases are great! For connecting related databases with each other, they are not.

Graph database:

To best understand this database model, we must first understand the fundamentals of graphs in the technological world:

What is a graph?

A graph is a data structure consisting of “nodes” and “edges”. Two nodes can be connected by an edge, and nodes can be connected with as many other nodes as the structure requires:

In this diagram, our five nodes are labeled “a” through “e”. The edges are represented by the lines connecting them. To move from node “c” to node “e”, we would first need to move through nodes “d” and “a” (or b-a, alternatively).

One of the most widely-used applications of graphs is Google Maps (or any similar application):

Here, every destination is a node and every road is an edge. The (very simplified) way that Google determines the fastest route to a destination is to minimize the number of edges traversed during the commute. Adding in variables such as traffic and lights makes the process a lot more complicated, but it is still most certainly an application of graphs!

Back to our graph database:

This database model is essentially just like the graphs discussed above, except each node and edge contains data:

In most cases, each node/edge will contain associated properties. Here, the “Member1” node might contain characteristic properties such as hair color, age, etc. Furthermore, the information contained in the edges allows us to understand the relationship between the two nodes it connects.

One of the main advantages of using graph databases is that they can often make a lot of visual sense. Let’s revisit our database example from way earlier, the family tree:

This is actually a graph database of sorts! Each person is a node, and edges represent siblings/couples/children. With a few modifications (such as labeling the edges), this family’s ancestry could easily be stored in a graph model.

Another strength of the graph model is that it handles the addition of new data with ease. New observations, new nodes!

As for cons, queries can often be quite slow depending on how much of the graph needs to be traversed. In that case, it could potentially make more sense to use a relational/document model where the same queries could be answered in shorter runtime.

Summary:

Congratulations — you now have a strong foundational knowledge of how databases work, as well as how they’re most commonly modeled. While many would take any heap of data and throw it into Excel, you now understand how to evaluate your data and store it in an appropriate model. To continue exploring databases and how they’re best stored, Kaggle is a phenomenal resource. It contains around 50,000 various datasets, all of which are available for simple downloads. An instructive exercise may be to wonder if these datasets could be better stored in a different model…