Improving Storage and Read Performance for Free: Flat vs Structured Schemas

Artur Costa • 5 min read • Published Jan 26, 2024 • Updated Jan 26, 2024

When developers or administrators who had previously only been "followers of the word of relational data modeling" start to use MongoDB, it is common to see documents with flat schemas. This behavior happens because relational data modeling makes you think about data and schemas in a flat, two-dimensional structure called tables.

In MongoDB, data is stored as BSON documents, which are close to a binary representation of JSON documents, with slight differences. Because of this, we can create schemas with more dimensions/levels. More details about the BSON implementation can be found in its specification. You can also learn more about its differences from JSON.

MongoDB documents are composed of one or more key/value pairs, where the value of a field can be any of the BSON data types, including other documents, arrays, or arrays of documents.

Using documents, arrays, or arrays of documents as values for fields enables the creation of a structured schema, where one field can represent a group of related information. This structured schema is an alternative to a flat schema.

Let's see an example of how to write the same user document using the two schemas:

Both documents contain the same data. The first, flatUser, uses a flat schema where all the field-and-value pairs are on the same level. The second, structuredUser, employs a structured schema where fields and values are nested in levels according to how the information inside the document is related.
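The listing of the two documents is not reproduced in this text. As a stand-in, here is a minimal Python sketch of what they might look like. Only the name values (john, oliver, smith) and the address field names come from the article; the _id field, the field order, and every address value are assumptions made for illustration. The hypothetical flatten helper shows that the flat document is simply the structured one with nested keys joined by underscores.

```python
# Hypothetical reconstruction: only the name values and the address field
# names appear in the article; _id and all address values are made up.
structured_user = {
    "_id": 1,
    "name": {"first": "john", "middle": "oliver", "last": "smith"},
    "address": {
        "home": {
            "street": "First Street", "number": "10", "zip": "10001",
            "state": "NY", "country": "US",
        },
        "work": {
            "street": "Second Street", "number": "20", "zip": "10002",
            "state": "NY", "country": "US",
        },
    },
}


def flatten(doc, prefix=""):
    """Turn nested dicts into one level, joining nested keys with '_'."""
    flat = {}
    for key, value in doc.items():
        full_key = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into the embedded document, extending the key prefix.
            flat.update(flatten(value, prefix=f"{full_key}_"))
        else:
            flat[full_key] = value
    return flat


# The flat schema carries the same data, with every pair on one level.
flat_user = flatten(structured_user)
```

With these assumed values, flat_user["address_work_zip"] and structured_user["address"]["work"]["zip"] hold the same data; only the shape of the document differs.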

So, what are the advantages of using a structured schema rather than a flat one? The quick answer for those in a hurry is that a structured schema may require less storage and be faster to traverse than a flat schema. For those who want to know why, we need a better understanding of BSON.

For the purpose of this article, a BSON document can be seen as a list of items, where each item represents one field-and-value pair of the document. An item is composed of the field's type, name, length, and data, in serialized form:

Type: one byte long, indicating the data type stored in the data field.
Name: the field's name in string form.
Length: four bytes long, indicating the length of the data field for types whose size is not fixed.
Data: the actual data of the field-and-value pair.

Laid out in order, an item looks like this:

Type (1 byte)	Name	Length (4 bytes)	Data
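The item layout described here can be sketched as a small Python model. This follows the article's simplified accounting only; the BSON specification describes the exact byte layout, which includes a few framing bytes this model leaves out. The Item class and its attribute names are hypothetical and not part of any MongoDB driver.

```python
from dataclasses import dataclass

TYPE_BYTES = 1    # one byte for the field type
LENGTH_BYTES = 4  # four bytes for the length of the data field


@dataclass
class Item:
    """One field-and-value pair, in the article's simplified model."""

    name: str        # field name, stored as a string
    data_bytes: int  # size of the serialized value

    def size(self) -> int:
        # type + name + length + data
        name_bytes = len(self.name.encode("utf-8"))
        return TYPE_BYTES + name_bytes + LENGTH_BYTES + self.data_bytes


# name_first: "john" -> 1 + 10 + 4 + 4
item = Item(name="name_first", data_bytes=len("john".encode("utf-8")))
item_total = item.size()
```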

Let's see how a structured schema uses less storage than a flat schema by analyzing the field-and-value pair related to the user's name.

In the flatUser, we have the following table from a storage perspective:

field-and-value	Type	Field Name	Field Length	Field Data	Total Size
name_first: "john"	1 byte	10 bytes	4 bytes	4 bytes	19 bytes
name_last: "smith"	1 byte	9 bytes	4 bytes	5 bytes	19 bytes
name_middle: "oliver"	1 byte	11 bytes	4 bytes	6 bytes	22 bytes

Adding up the table's total sizes, the flat document uses 60 bytes to store the field-and-value pairs related to the user's name.

To analyze the storage of the structuredUser, let's divide it into two tables. The first table covers the storage used by the embedded document that is the value of the field name. The second table covers the storage used by the field-and-value pair name itself.

Let’s build the first table for the value/content of the field name:

field-and-value	Type	Field Name	Field Length	Field Data	Total Size
first: "john"	1 byte	5 bytes	4 bytes	4 bytes	14 bytes
last: "smith"	1 byte	4 bytes	4 bytes	5 bytes	14 bytes
middle: "oliver"	1 byte	6 bytes	4 bytes	6 bytes	17 bytes

Adding up the previous table's total sizes, the value/Field Data of the field name uses 45 bytes. Building the second table for the field-and-value name, we get:

field-and-value	Type	Field Name	Field Length	Field Data	Total Size
name: { … }	1 byte	4 bytes	4 bytes	45 bytes	54 bytes

The structured document uses 54 bytes to store the values related to the user's name.

Comparing the tables, we see the main difference is the "Field Name" storage size. The flat schema uses 30 bytes to store the names of its fields, while the structured schema uses 19 bytes to store the names of its fields. This is due to the repetition of the sub-string "name_" in the "Field Name" of the flat schema.
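The arithmetic in the tables above can be reproduced with a short Python sketch. It applies the article's simplified item model (1 byte of type, the field name, 4 bytes of length, then the data) recursively, so an embedded document's data size is the sum of its own items. The helper names are hypothetical.

```python
TYPE_BYTES = 1
LENGTH_BYTES = 4


def data_size(value):
    """Size of a value: nested items for a document, byte length for a string."""
    if isinstance(value, dict):
        return sum(item_size(k, v) for k, v in value.items())
    return len(value.encode("utf-8"))


def item_size(name, value):
    """type + field name + length + data, as in the tables."""
    return TYPE_BYTES + len(name.encode("utf-8")) + LENGTH_BYTES + data_size(value)


def name_bytes(doc):
    """Total bytes spent on field names, including nested ones."""
    total = 0
    for key, value in doc.items():
        total += len(key.encode("utf-8"))
        if isinstance(value, dict):
            total += name_bytes(value)
    return total


flat_name = {"name_first": "john", "name_last": "smith", "name_middle": "oliver"}
structured_name = {"name": {"first": "john", "last": "smith", "middle": "oliver"}}

flat_total = sum(item_size(k, v) for k, v in flat_name.items())              # 60
structured_total = sum(item_size(k, v) for k, v in structured_name.items())  # 54
flat_names = name_bytes(flat_name)              # 30
structured_names = name_bytes(structured_name)  # 19
```

The 6-byte gap between the totals is smaller than the 11-byte gap in field names because the structured schema pays for one extra item, the name field itself, which costs 5 bytes of type and length overhead on top of its 4-byte field name.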

Storing the two documents in a MongoDB instance, we get a size of 403 bytes for the flat schema and 307 bytes for the structured schema. Not bad: roughly a 24% reduction in storage just by changing the schema. A structured document is also easier to read and more pleasant to look at.

Now, let's see how a structured schema is faster to traverse than a flat schema by getting the zip code of the work address.

In the flatUser document, to get to the field address_work_zip starting at the beginning of the document, a cursor would need to perform 12 field-name comparisons before it reaches the desired field.

In the structuredUser document, to get to the field address.work.zip starting at the beginning of the document, a cursor would need to perform only eight field-name comparisons. The number is smaller because some values are themselves documents, which the cursor can skip as a whole. When the cursor reaches the field called name, it can skip three fields and their comparisons (first, middle, and last) because it knows that address.work.zip won't be inside name.<field>. When it reaches the field address.home, it can likewise skip five fields and their comparisons (street, number, zip, state, and country).
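The counting argument can be checked with a minimal Python sketch. It is a model of the reasoning only, not of how the MongoDB engine is implemented. The documents are assumptions: an _id field comes first, followed by the name and address fields in the order the text lists them, which reproduces the 12 and 8 comparisons quoted above. Values are left empty because only field names matter here.

```python
def count_comparisons(doc, path):
    """Count field-name comparisons made while walking `path` from the top.

    Fields are scanned in order at each level. An embedded document is
    entered only when its name matches the current path segment, so every
    other embedded document is skipped as a whole.
    """
    comparisons = 0
    current = doc
    for segment in path:
        if not isinstance(current, dict):
            raise KeyError(segment)
        for field_name, value in current.items():
            comparisons += 1
            if field_name == segment:
                current = value
                break
        else:
            raise KeyError(segment)
    return comparisons


address_fields = ["street", "number", "zip", "state", "country"]

# Assumed field order; values are irrelevant for counting.
structured_user = {
    "_id": "",
    "name": {"first": "", "middle": "", "last": ""},
    "address": {
        "home": {f: "" for f in address_fields},
        "work": {f: "" for f in address_fields},
    },
}

flat_keys = ["_id", "name_first", "name_middle", "name_last"]
flat_keys += [f"address_home_{f}" for f in address_fields]
flat_keys += [f"address_work_{f}" for f in address_fields]
flat_user = {key: "" for key in flat_keys}

flat_count = count_comparisons(flat_user, ["address_work_zip"])
structured_count = count_comparisons(structured_user, ["address", "work", "zip"])
```

In this model the structured walk compares _id, name, and address at the top level, home and work inside address, and street, number, and zip inside work.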

To quantify the performance gain on traversing a structured schema instead of a flat schema in MongoDB, a test with the following methodology was used:

To isolate the result to be influenced just by the traversing of the documents, the MongoDB instance used was configured with in-memory storage.
Documents with 10, 25, 50, and 100 fields were utilized for the flat schema.
Documents with 2x5, 5x5, 10x5, and 20x5 fields were used for the structured schema, where 2x5 means two fields of type document with five fields for each document.
Each collection had 10,000 documents generated using faker/npm.
To force the MongoDB engine to loop through all documents and all fields inside each document, all queries were made searching for a field and value that wasn't present in the documents.
Each query was executed 100 times in a row for each document size and schema.
No concurrent operation was executed during each test.
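To make the document shapes in this methodology concrete, here is a minimal Python sketch that builds a flat document with N fields and a structured document with G groups of five fields each. It is not the code used for the test, which generated its data with faker/npm and is available in the author's repository. The field names, the random string values, and the MISSING_FILTER query shape are all hypothetical.

```python
import random
import string


def random_value(length=10):
    """Random lowercase string standing in for generated field data."""
    return "".join(random.choices(string.ascii_lowercase, k=length))


def make_flat(field_count):
    """Flat schema: every field sits at the top level."""
    return {f"field_{i}": random_value() for i in range(field_count)}


def make_structured(group_count, fields_per_group=5):
    """Structured schema: `group_count` embedded documents of equal size."""
    return {
        f"group_{g}": {
            f"field_{f}": random_value() for f in range(fields_per_group)
        }
        for g in range(group_count)
    }


# (flat field count, structured group count) pairs from the methodology:
# 10 vs 2x5, 25 vs 5x5, 50 vs 10x5, 100 vs 20x5.
SHAPES = [(10, 2), (25, 5), (50, 10), (100, 20)]

# Hypothetical filter on a field and value no document contains, so that
# a query has to examine every document without finding a match.
MISSING_FILTER = {"field_not_present": "value_not_present"}

flat_doc = make_flat(10)
structured_doc = make_structured(2)
```

Each pair in SHAPES holds the same number of leaf fields in both schemas, so any timing difference comes from how the fields are arranged rather than how many there are.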

Now, to the test results:

Fields (flat / structured)	Flat	Structured	Difference	Improvement
10 / 2x5	487 ms	376 ms	111 ms	29.5%
25 / 5x5	624 ms	434 ms	190 ms	43.8%
50 / 10x5	915 ms	617 ms	298 ms	48.3%
100 / 20x5	1384 ms	891 ms	493 ms	55.4%
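The Difference and Improvement columns can be recomputed from the two timing columns. Judging by the numbers, the improvement appears to be the difference divided by the structured time; that reading is an inference from the table, not something the article states. A minimal Python sketch:

```python
# (label, flat ms, structured ms) as published in the results table.
results = [
    ("10 / 2x5", 487, 376),
    ("25 / 5x5", 624, 434),
    ("50 / 10x5", 915, 617),
    ("100 / 20x5", 1384, 891),
]


def improvement(flat_ms, structured_ms):
    """Return (difference in ms, difference as a % of the structured time)."""
    difference = flat_ms - structured_ms
    return difference, round(difference / structured_ms * 100, 1)


computed = [improvement(flat_ms, structured_ms) for _, flat_ms, structured_ms in results]
```

The first three rows match the table exactly. The last row computes to 55.3% rather than the published 55.4%, presumably because the published timings are rounded.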

As our theory predicted, traversing a structured document is faster than traversing a flat one. The gains measured in this test shouldn't be expected in every comparison of structured and flat schemas: the improvement in traversal depends on how the nested fields and documents are organized.

This article showed how to better use your MongoDB deployment by changing the schema of your document for the same data/information. Another option to extract more performance from your MongoDB deployment is to apply the common schema patterns of MongoDB. In this case, you will analyze which data you should put in your document/schema. The article Building with Patterns has the most common patterns and will significantly help.

The code used to get the above results is available in the GitHub repository.
