Mastering DynamoDB: Essential Techniques for Modern Data Modeling

Tudor Ioan Marin
Published in ITNEXT
7 min read · Apr 27, 2024


© lebanmax — stock.adobe.com

Getting Started

Handling data in the cloud is no small feat. In a world where everything needs to scale, databases are in a league of their own. The CAP theorem states that a distributed data store can simultaneously provide only two out of three guarantees: consistency, availability, and partition tolerance. On this spectrum, DynamoDB provides high availability and partition tolerance while offering a configurable approach to consistency, making it an intriguing option that is worth exploring.

This article is intended for developers with a foundational understanding of databases, who are looking to deepen their practical knowledge of DynamoDB. I aim to equip you with the right tools to build efficient and cost-effective software, whether for your next pet project or the big feature you’re about to work on.

Fundamentals

In the realm of NoSQL databases, DynamoDB stands out for its structured approach to managing data. First of all, everything related to clusters or servers is managed by AWS, so you only care about what happens from the table layer down. A table is composed of items. Dynamo doesn’t enforce a rigid schema on the items; they can have any attributes you see fit. What it does care about is the primary key that uniquely identifies each item in a table. The primary key is not a single dedicated attribute; it is composed of the partition key and, optionally, the sort key, so each item must have a unique combination of these two attributes. Only the partition key is mandatory. However, without leveraging the sort key, Dynamo becomes less of a database and more of a key-value store.

Choosing the partition key is crucial because that’s what Dynamo uses under the hood to achieve the massive scale and performance it boasts. The partition key is passed through a hashing function and the result is mapped to one of many partitions that are running behind the scenes. That means that every time you want to read or write an item efficiently you need to know its partition key (the alternative, a full table scan, is slow and expensive). Dynamo is sharded by default, and an item only lives in the partition its key maps to. Which partition key you choose largely depends on the use case, but keep in mind that a high-cardinality key is more likely to spread evenly across partitions and less likely to create hot spots.
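To make the hashing idea concrete, here is a minimal sketch in Python. DynamoDB’s real hash function and partition count are internal and not exposed, so the MD5-based partition_for helper and the fixed NUM_PARTITIONS below are assumptions made purely for illustration.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical; DynamoDB manages the real number internally


def partition_for(partition_key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a partition key to a partition index via a stable hash (illustrative only)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def distribution(keys):
    """Count how many keys land on each partition."""
    return Counter(partition_for(key) for key in keys)


# High cardinality: many distinct keys spread across the partitions.
high = distribution(f"article-{i}" for i in range(1000))

# Low cardinality: every item shares one key, so a single partition takes all the traffic.
low = distribution("blog" for _ in range(1000))
```

In this toy model, high ends up with entries for every partition while low has exactly one, which is the hot-spot scenario a low-cardinality key tends to produce.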

Items in the same partition are stored in sort key order. As you can imagine, this is useful, as you don’t want to scan the whole partition to find a selection. Besides, keeping related data together can be used for fine-tuning queries and splitting data into smaller items. This will be clearer in the design examples ahead.
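Why ordering helps can be sketched with a sorted list standing in for a single partition. This is only a model of the idea, not how DynamoDB stores data; the sort key values and the keys_with_prefix helper are hypothetical.

```python
import bisect

# Sort keys within one hypothetical partition, kept in sorted order.
sort_keys = sorted([
    "article#2024-04-01T09:00:00Z",
    "asset#2024-04-01T09:01:00Z",
    "asset#2024-04-01T09:02:00Z",
    "comment#2024-04-02T10:00:00Z",
    "comment#2024-04-03T08:30:00Z",
])


def keys_with_prefix(keys, prefix):
    """Return the contiguous slice of sorted keys that start with the given prefix."""
    lo = bisect.bisect_left(keys, prefix)
    # "\uffff" sorts after the characters used in these keys, marking the end of the range.
    hi = bisect.bisect_left(keys, prefix + "\uffff")
    return keys[lo:hi]


assets = keys_with_prefix(sort_keys, "asset#")
comments = keys_with_prefix(sort_keys, "comment#")
```

Because matching keys sit next to each other, the lookup only needs two binary searches and a slice instead of a pass over every item in the partition.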

Since filtering data using only the partition and sort key can be limiting, DynamoDB supports indexes. A Global Secondary Index (GSI) is essentially a secondary table that uses other attributes as its primary key and into which data is replicated from the base table. Projected attributes are attributes copied over from the base table into the index. This is configurable, but the primary key of the base table will always be projected. New GSIs can be added on the fly and queries can be run against them to support use cases that were not known in the design phase.

Case Study

Right, now that we’re clear on what we’re working with, let’s start looking at a hypothetical example. Let’s say we want to model the data for a simple blog CMS (Content Management System). The blog will publish articles from multiple authors, articles can contain assets like videos and images and will support comments on each article. For this example, the comments will be anonymous.

Access Patterns

To get the most out of DynamoDB, you need to understand how the application you’re building is going to consume and produce data, also known as the application’s access patterns. It’s essential to start by listing all the entities and patterns, as these will drive the data modeling discussion.

So for our use case, we have the following entities and their cardinality:

  • Article — Medium Cardinality
  • Author — Low Cardinality
  • Comment — High Cardinality
  • Asset — High Cardinality

And the following access patterns:

  1. List articles by date
  2. Fetch an article by its ID
  3. Retrieve all assets (videos and images) related to a specific article
  4. Load comments for a specific article by date
  5. Search articles by keywords*

Table Design

Data modeling in DynamoDB differs greatly from the traditional relational model and even from the other NoSQL approaches. While you would usually have a table or a collection holding one type of entity, this is not necessarily the best approach here. There’s of course nothing stopping you from normalizing data in DynamoDB, so let’s explore the tradeoffs.

Multi-table design, with one entity per table, is by far the fastest way to get started, as you avoid the learning curve of single-table design. Beyond that, separate tables bring more flexibility and smaller indexes and streams. You can, for example, set a different IAM (Identity and Access Management) policy per table. The downside is that you will over-fetch a lot of data, because items have to be joined at some point outside of the database. And remember, while storage is billed too, throughput usually makes up the bulk of the bill. Overall, this is not the recommended path for most use cases.

Colocating multiple entities in a single table, on the other hand, allows you to fetch multiple entities in a single query because they are stored in the same partition. Not only is this essentially a join, but it is also cheaper: AWS charges reads by read request unit, each covering up to 4 KB, and a query sums the sizes of all the items it returns before rounding up, instead of rounding up each item separately. Chances are that not all entities have the same throughput needs, and sharing one table helps to alleviate overprovisioning.
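The cost difference can be worked through with some quick arithmetic. The sketch below assumes strongly consistent reads (eventually consistent reads cost half as much), and the item sizes are made up for illustration.

```python
import math

RRU_BLOCK_BYTES = 4 * 1024  # one read request unit covers up to 4 KB


def query_read_units(item_sizes):
    """Single Query: item sizes are summed first, then rounded up to 4 KB blocks."""
    return math.ceil(sum(item_sizes) / RRU_BLOCK_BYTES)


def get_item_read_units(item_sizes):
    """Separate GetItem calls: each item is rounded up to 4 KB on its own."""
    return sum(math.ceil(size / RRU_BLOCK_BYTES) for size in item_sizes)


# Hypothetical sizes in bytes: one article, two assets, five comments.
sizes = [3000, 500, 500] + [200] * 5

print(query_read_units(sizes))     # 2
print(get_item_read_units(sizes))  # 8
```

With these numbers, reading the eight colocated items in one query costs 2 read request units, while fetching them one by one costs 8.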

In practice, you would probably end up with more than one table, some of which contain multiple entities. Based on the access patterns of our CMS, we can benefit from the single-table design.

Let’s start by looking for a partition key that has a high enough cardinality and also keeps entities that are related under the same partition. Based on the fetch article by ID access pattern, we know that to display an article we would need all the associated information for that article. That sounds like a good start. While we’ve said that articles have a medium cardinality, that was by comparison to Comments and Assets.

Article Item

All articles will be unique so we can use their ID, let’s say a UUID, as the partition key. A good practice when modeling the tables is to separate the keys from the other attributes. In this case, the id, type, and createdAt attributes are separate from the pk and sk of the item, giving us more flexibility in the future should we need to alter them.

The sort key is composed of type and createdAt, separated by # as a delimiter. This will come in handy when using the sort key to filter: we can fetch, if needed, only the assets or only the comments for a single article.

We denormalized the author and added it to each item. This is information that will rarely change and we don’t care that much if an older article is not updated with the most recent author profile picture. The advantage of doing this is that we can fetch listings and individual articles in one query quickly.

Asset Item

Comments are pretty much the same. Here the timestamp will come in handy and we will be able to get them from most recent to least recent without the need to sort them.

Comment Item
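For readers who prefer a concrete form, here is a rough sketch of how the three item types might look as plain Python dictionaries. The pk, sk, id, type, and createdAt attributes follow the design described above; everything else (the UUID, title, URL, and text values, the author fields, the ISO 8601 timestamp format, and the helper names) is an assumption made for illustration.

```python
from datetime import datetime, timezone

ARTICLE_ID = "3f2b8c1e-7a4d-4e9b-9c55-0d1a2b3c4d5e"  # hypothetical UUID


def iso(ts: datetime) -> str:
    """Format a timestamp as ISO 8601 in UTC so that string order matches time order."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def make_item(article_id, item_type, created_at, **attributes):
    """Build an item whose keys (pk, sk) are kept apart from its other attributes."""
    created = iso(created_at)
    return {
        "pk": article_id,
        "sk": f"{item_type}#{created}",
        "type": item_type,
        "createdAt": created,
        **attributes,
    }


def utc(*args):
    return datetime(*args, tzinfo=timezone.utc)


# All items of one article share the same partition key.
partition = [
    make_item(ARTICLE_ID, "article", utc(2024, 4, 27, 9, 0), id=ARTICLE_ID,
              title="Hello DynamoDB",
              author={"name": "Jane Doe"}),  # author denormalized onto the article
    make_item(ARTICLE_ID, "asset", utc(2024, 4, 27, 9, 1), id="asset-1",
              url="https://example.com/cover.png"),
    make_item(ARTICLE_ID, "comment", utc(2024, 4, 28, 12, 30), id="comment-1",
              text="Great read!"),
    make_item(ARTICLE_ID, "comment", utc(2024, 4, 29, 8, 0), id="comment-2",
              text="Thanks for sharing."),
]


def items_of_type(items, item_type):
    """Emulate a sort key prefix filter on one partition, newest first."""
    matching = [item for item in items if item["sk"].startswith(f"{item_type}#")]
    return sorted(matching, key=lambda item: item["sk"], reverse=True)


latest_comments = items_of_type(partition, "comment")
```

Here latest_comments holds comment-2 followed by comment-1: because the timestamp is part of the sort key, ordering by key is the same as ordering by date.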

To recap, right now we can query based on the partition key of an article and get, in a single query, the article, the assets it contains, information about the author, and the comments already sorted by date (in descending order when the query reads the sort key in reverse). Let’s now explore how to fetch the latest articles. Since we partitioned them by their id, we can only fetch one article per query.
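As a sketch of what such a query could look like, the helper below builds the keyword arguments for the query operation of boto3’s low-level DynamoDB client without calling AWS. The table name blog-cms and the helper name are hypothetical; pk and sk are the key attribute names used in this article.

```python
def article_query_params(table_name, article_id, item_type=None):
    """Build keyword arguments for a DynamoDB Query on one article's partition."""
    params = {
        "TableName": table_name,
        "KeyConditionExpression": "pk = :pk",
        "ExpressionAttributeValues": {":pk": {"S": article_id}},
        # Read the sort key in reverse so the newest items of each type come first.
        "ScanIndexForward": False,
    }
    if item_type is not None:
        # Narrow the result to one entity type through the sort key prefix.
        params["KeyConditionExpression"] += " AND begins_with(sk, :prefix)"
        params["ExpressionAttributeValues"][":prefix"] = {"S": f"{item_type}#"}
    return params


everything = article_query_params("blog-cms", "3f2b8c1e-7a4d-4e9b-9c55-0d1a2b3c4d5e")
only_comments = article_query_params(
    "blog-cms", "3f2b8c1e-7a4d-4e9b-9c55-0d1a2b3c4d5e", item_type="comment"
)
```

With boto3 installed and credentials configured, these could be passed along as boto3.client("dynamodb").query(**only_comments). Keep in mind that a single Query call returns at most 1 MB of data, so an article with many comments may need pagination.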

Indexing the table on different keys will allow us to query all the articles. Let’s see how we can configure the GSI to accommodate this access pattern. All entities have a type associated with them. We can use that as the partition key for the GSI, and for the sort key we can use createdAt. This way, we can get all the articles in the right order to display.
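One possible way to express this configuration is shown below as plain Python dictionaries shaped like the arguments of boto3’s create_table and query operations. Nothing here talks to AWS. The table name, the index name, the on-demand billing mode, and the choice to project all attributes are assumptions for the sake of the example.

```python
GSI_NAME = "byTypeAndDate"  # hypothetical index name

table_definition = {
    "TableName": "blog-cms",  # hypothetical table name
    "BillingMode": "PAY_PER_REQUEST",
    # Only attributes used in a key schema are declared up front.
    "AttributeDefinitions": [
        {"AttributeName": "pk", "AttributeType": "S"},
        {"AttributeName": "sk", "AttributeType": "S"},
        {"AttributeName": "type", "AttributeType": "S"},
        {"AttributeName": "createdAt", "AttributeType": "S"},
    ],
    "KeySchema": [
        {"AttributeName": "pk", "KeyType": "HASH"},   # partition key
        {"AttributeName": "sk", "KeyType": "RANGE"},  # sort key
    ],
    "GlobalSecondaryIndexes": [
        {
            "IndexName": GSI_NAME,
            "KeySchema": [
                {"AttributeName": "type", "KeyType": "HASH"},
                {"AttributeName": "createdAt", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
}


def latest_articles_params(limit=10):
    """Build Query arguments that list the newest articles through the GSI."""
    return {
        "TableName": table_definition["TableName"],
        "IndexName": GSI_NAME,
        # A name placeholder avoids any clash between "type" and reserved words.
        "KeyConditionExpression": "#t = :type",
        "ExpressionAttributeNames": {"#t": "type"},
        "ExpressionAttributeValues": {":type": {"S": "article"}},
        "ScanIndexForward": False,  # newest first
        "Limit": limit,
    }


params = latest_articles_params(limit=5)
```

Projecting all attributes keeps the listing query self-sufficient at the price of a larger index. A narrower projection would be cheaper to store but may require a second read against the base table.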

Final thoughts

We’ve covered the fundamentals, explored how to approach the data modeling process, and designed a blog CMS to see some of the information in practice. Hopefully, this will be a valuable addition to your developer toolkit and will serve you well in your future projects.

(*) You might have noticed that there’s one access pattern we have not covered, searching by keywords. To do that we’ll need to explore DynamoDB streams and OpenSearch. If you’re interested in reading an article on this topic or would like to hear more on a different subject, let me know and I’ll consider writing a follow-up.

If you found this article helpful, clap and share it so it will help more people. Have questions? Drop me a line on LinkedIn.

Originally published at https://www.linkedin.com.
