AWS Cloud Project Bootcamp - DynamoDB

During the past weeks, I have been very busy with Andrew Brown's amazing free AWS Cloud Project Bootcamp. We have been through billing and architecture, containerising our application with Docker, using Honeycomb and X-Ray for distributed tracing, using Rollbar for bug tracking and monitoring, using Cognito for decentralized authentication and creating an RDS Postgres instance.

Every single week has been challenging, but the DynamoDB week has turned out to be the most demanding so far. Below I cover the main points regarding DynamoDB data modelling for our application.

Data modelling - access patterns

We started with a very insightful 2-hour live stream about data modelling. The design chosen for our database is a single-table design. It is a popular choice these days and works well in this kind of scenario, where all the data is closely linked together. The name makes it sound simple, but it turns out to be quite complicated in terms of data modelling. To get everything working and to keep the cost down, it is crucial to map your data against all the different access patterns.

When designing a relational database, you map the data into logical entities, decide which data belongs together in each table and then figure out how to access the data from these tables using joins. With DynamoDB you have to approach this from a completely different perspective: you start from your application, what data it is going to need and how. Once you know your access patterns, you can start to think about how to organize your data. You can break the rules you would follow with relational databases - data can even be duplicated if that works with your access patterns! Storage is cheap and you want your base table to support as many of your access patterns as possible, so duplicating data can make sense depending on the situation. You could also choose to save some of the data as JSON instead of separate items if it is not going to be used in any of your queries.

There are so many options for designing the data model for your database. To get the best results from DynamoDB in terms of cost-effectiveness and performance, you really need to do these initial steps correctly.

Access patterns in our application

Our application is a messaging app where the user is able to see a list of their conversations (message groups) and then click an individual message group and see all messages that belong to that message group. Additionally, the user is obviously able to send messages - these could be either completely new messages that start new message groups or further messages to existing message groups. Based on this it was possible to list our initial access patterns:

  • pattern A: show a single conversation (message group)

  • pattern B: list all conversations (message groups)

  • pattern C: create a new message

  • pattern D: add a message to an existing message group

  • pattern E: update a message group using DynamoDB streams

So the database is going to have one table, which will contain both messages and message groups. Each item will have a unique uuid among other fields such as date, display name and message content. Each message group will also be stored twice, as two individual items, one from the perspective of each of the two users who are part of the conversation. This is because a list of conversations cannot be displayed identically to both users: the person looking at their message groups wants to see the name of the other user as the topic of that message group.

Partition keys and sort keys

Then we come to the hardest part of data modelling: choosing the partition key. The partition key is an identifier for the item, and it dictates which partition DynamoDB places the item in under the hood. The partition key doesn't have to be unique - several items can share the same partition key. The sort key, in turn, uniquely identifies an item within its partition and allows items to be sorted. The primary key in DynamoDB can be either a simple primary key (partition key only) or a composite primary key (a combination of partition key and sort key). A query always requires the partition key, and only the equality operator can be used on it. A condition on the sort key is optional; leaving it out simply returns every item in that partition.
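
To make this concrete, here is a minimal query sketch with the boto3 DynamoDB client. The table name and key values are illustrative (they match the message schema used later in this post): the partition key gets an equality condition, while the range condition on the sort key is optional.

import boto3

client = boto3.client('dynamodb')

# Query one partition: equality on the partition key is mandatory,
# the condition on the sort key (a date range here) is optional.
response = client.query(
  TableName='cruddur-messages',
  KeyConditionExpression='pk = :pk AND sk BETWEEN :start AND :end',
  ExpressionAttributeValues={
    ':pk':    {'S': 'MSG#some-message-group-uuid'},
    ':start': {'S': '2023-01-01'},
    ':end':   {'S': '2023-12-31'}
  }
)
items = response['Items']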

Our application has two access patterns that relate to messages and three that relate to message groups. For messages, we have to be able to write new messages and display the messages that belong to a certain message group. The best option is to use message_group_uuid as the partition key and created_at as the sort key. This is quite logical: we want to display a single conversation, so its uuid is the easiest way to access it, and using created_at as the sort key gives us the option to display the messages within certain timeframes:

import uuid
from datetime import datetime, timezone

def create_message(client, message_group_uuid, message, my_user_uuid,
                   my_user_display_name, my_user_handle):
    table_name = 'cruddur-messages'
    created_at = datetime.now(timezone.utc).astimezone().isoformat()
    message_uuid = str(uuid.uuid4())

    # The message lives under its conversation: the message group uuid is
    # the partition key and the creation timestamp is the sort key.
    record = {
      'pk':      {'S': f"MSG#{message_group_uuid}"},
      'sk':      {'S': created_at},
      'message': {'S': message},
      'message_uuid': {'S': message_uuid},
      'user_uuid': {'S': my_user_uuid},
      'user_display_name': {'S': my_user_display_name},
      'user_handle': {'S': my_user_handle}
    }

    # Persist the item in the single table.
    client.put_item(TableName=table_name, Item=record)
    return record
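
A quick usage sketch: the client setup is standard boto3, and the uuid, message and user details below are made-up values for illustration.

import boto3

client = boto3.client('dynamodb')

create_message(
  client,
  message_group_uuid='5ae290ed-55d1-47a0-bc6d-fe2bc2700399',
  message='Hello from the other side!',
  my_user_uuid='a1b2c3d4-0000-0000-0000-000000000000',
  my_user_display_name='Andrew Brown',
  my_user_handle='andrewbrown'
)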

For message groups, it gets a little more complicated. We have to be able to list message groups, add messages to message groups and update message group details. As each user needs to see exactly the message groups that belong to them, the logical option is to use my_user_uuid as the partition key. This works well because there are two message group items for each conversation, so each participant has a version of the message group under their own user uuid. As we want to be able to sort the message groups by date, the sort key is going to be last_message_at:

def create_message_group(client, message, my_user_uuid, my_user_display_name,
                         my_user_handle, other_user_uuid, other_user_display_name,
                         other_user_handle):
    table_name = 'cruddur-messages'

    message_group_uuid = str(uuid.uuid4())
    message_uuid = str(uuid.uuid4())
    last_message_at = datetime.now(timezone.utc).astimezone().isoformat()

    # My copy of the message group: my user uuid forms the partition key and
    # the timestamp of the latest message is the sort key, while the user
    # fields describe the other participant so that my conversation list
    # shows their name as the topic.
    my_message_group = {
      'pk': {'S': f"GRP#{my_user_uuid}"},
      'sk': {'S': last_message_at},
      'message_group_uuid': {'S': message_group_uuid},
      'message': {'S': message},
      'user_uuid': {'S': other_user_uuid},
      'user_display_name': {'S': other_user_display_name},
      'user_handle': {'S': other_user_handle}
    }
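
The snippet above only builds my copy of the message group. As a sketch of how the rest of the function might continue (the exact write call is an assumption rather than the bootcamp's verbatim code), the other participant's copy and the first message are built the same way and written in one batch:

    # The other participant's copy of the same conversation, keyed by their
    # uuid and describing me, so that their conversation list shows my name.
    other_message_group = {
      'pk': {'S': f"GRP#{other_user_uuid}"},
      'sk': {'S': last_message_at},
      'message_group_uuid': {'S': message_group_uuid},
      'message': {'S': message},
      'user_uuid': {'S': my_user_uuid},
      'user_display_name': {'S': my_user_display_name},
      'user_handle': {'S': my_user_handle}
    }

    # The first message of the new conversation.
    record = {
      'pk': {'S': f"MSG#{message_group_uuid}"},
      'sk': {'S': last_message_at},
      'message': {'S': message},
      'message_uuid': {'S': message_uuid},
      'user_uuid': {'S': my_user_uuid},
      'user_display_name': {'S': my_user_display_name},
      'user_handle': {'S': my_user_handle}
    }

    # Write all three items in a single request; a transaction could be used
    # instead if the writes must succeed or fail together.
    client.batch_write_item(RequestItems={
      table_name: [
        {'PutRequest': {'Item': my_message_group}},
        {'PutRequest': {'Item': other_message_group}},
        {'PutRequest': {'Item': record}}
      ]
    })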

The catch is that the value of the sort key will of course have to be updated every time a new message is created and added to the message group so that it reflects the date of the actual latest message (access pattern E). This is where a global secondary index is needed.

Global secondary index

A GSI is a concept that takes some time to get familiar with. It is basically an index with a partition key and a sort key that can be different from those of the base table. You can think of creating a new index almost like creating a new table in SQL: it can contain the same items as the base table, but in a different order. The data is the same, but we twist it and look at it from a different angle. GSIs always add extra cost and you want to avoid them if you can - as mentioned before, your base table should support as many of your access patterns as possible.

For our final access pattern E, we want to update the sort key (last_message_at) to reflect the sort key of the latest message (created_at). This will be implemented by using a DynamoDB stream. Every time a new message is created and pushed to a message group, the DynamoDB stream catches the event and triggers a Lambda function. So, how do we get this Lambda function to update the sort key?

As previously mentioned, the message group items use the user's uuid as the partition key. So for each update there are two different message group items under two different user uuids (there are always two versions of each conversation, one from the perspective of each participant), while the new message only knows its message_group_uuid. Hence we cannot find the message groups that need updating based on the base table's partition key. We could of course do a scan with a filter, but that is not a cost-effective solution.

The best option in this situation is to use a GSI. You can think of it as a clone of the base table that uses message_group_uuid as the partition key, with the two kept in sync automatically. This GSI allows querying the table by the message_group_uuid attribute, in addition to the primary key attributes pk and sk.

The GSI was added to the schema:

GlobalSecondaryIndexes=[{
    # Query message groups by conversation instead of by user:
    # message_group_uuid becomes the index partition key and the
    # original sort key (sk, i.e. last_message_at) stays as the range key.
    'IndexName': 'message-group-sk-index',
    'KeySchema': [{
      'AttributeName': 'message_group_uuid',
      'KeyType': 'HASH'
    },{
      'AttributeName': 'sk',
      'KeyType': 'RANGE'
    }],
    # Project all attributes so queries against the index return full items.
    'Projection': {
      'ProjectionType': 'ALL'
    }
  }],
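
Assuming the index name above, a query by conversation then looks roughly like this (note that message_group_uuid also has to be declared in the table's AttributeDefinitions for the index to be created):

import boto3

client = boto3.client('dynamodb')

# Hypothetical conversation uuid, used here only for illustration.
message_group_uuid = '5ae290ed-55d1-47a0-bc6d-fe2bc2700399'

# Fetch the items that carry this conversation's uuid via the GSI,
# regardless of which user's partition they live in.
response = client.query(
  TableName='cruddur-messages',
  IndexName='message-group-sk-index',
  KeyConditionExpression='message_group_uuid = :mg_uuid',
  ExpressionAttributeValues={
    ':mg_uuid': {'S': message_group_uuid}
  }
)
message_groups = response['Items']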

Now the creation of a new message is captured by the DynamoDB stream, which triggers a Lambda function. The Lambda uses the GSI to query all message group items whose message_group_uuid matches the partition key of the new message, and then replaces their sort key (last_message_at) with the message's sort key value (created_at). Because the sort key is part of the primary key, it cannot be updated in place; in practice the old message group item is deleted and re-created with the new sort key. The sort keys of the message and the two message groups then match.

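As an illustration, here is a rough sketch of what such a stream-handler Lambda could look like. The event parsing and error handling are simplified assumptions rather than the full bootcamp implementation, and it assumes the stream is configured to emit the new image of each record.

import boto3

dynamodb = boto3.client('dynamodb')
table_name = 'cruddur-messages'

def lambda_handler(event, context):
    for record in event['Records']:
        # Only react to newly inserted message items.
        if record['eventName'] != 'INSERT':
            continue
        new_image = record['dynamodb']['NewImage']
        if not new_image['pk']['S'].startswith('MSG#'):
            continue

        message_group_uuid = new_image['pk']['S'].replace('MSG#', '')
        created_at = new_image['sk']['S']

        # Find both message group items for this conversation via the GSI.
        response = dynamodb.query(
          TableName=table_name,
          IndexName='message-group-sk-index',
          KeyConditionExpression='message_group_uuid = :mg_uuid',
          ExpressionAttributeValues={':mg_uuid': {'S': message_group_uuid}}
        )

        for item in response['Items']:
            # sk is part of the primary key, so it cannot be updated in place:
            # delete the old item and re-create it with the new sort key.
            dynamodb.delete_item(
              TableName=table_name,
              Key={'pk': item['pk'], 'sk': item['sk']}
            )
            item['sk'] = {'S': created_at}
            dynamodb.put_item(TableName=table_name, Item=item)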

There is of course a lot more that could be said about the implementation, which was challenging and involved a lot of troubleshooting and debugging. However, the whole week has been an outstanding learning experience. Now it's time to get ready for a new week of the bootcamp - ECS Fargate.