The whole enchilada

Creating a Blueprint for Microservices and Event Sourcing on AWS

Using Kinesis, Lambda, S3, DynamoDB, API Gateway and Kubernetes

Alessandro Bologna
Published in ITNEXT · Apr 6, 2018

This is the third installment of your quest for an architectural holy grail, where you have been mercilessly strangling monoliths and courageously fighting monster architectures to finally embrace the brave new world of container orchestration with Kubernetes and microservices.

It’s time now to embark on yet another exhilarating part of this story, where you will build an Event Sourcing/CQRS/microservices architecture alongside your monolithic application, all by yourself, with just a little help from your friendly neighborhood cloud provider, Amazon Web Services.

I need to add a disclaimer here to prevent unneeded cloud wars: this is about building this architecture on the AWS platform, leveraging AWS-specific managed services. While I don’t think that every cloud provider is created equal, I also think that the other two in the top tier, Azure and GCP, will provide you with enough tools to build everything, if not today then tomorrow for sure. I just happen to know AWS better.

Building an Event Sourcing/microservices architecture has been possible for many years already; none of these concepts is completely new, and companies with large resources have been building similar systems for a while. But it’s only today, leveraging fully managed cloud services, that it is actually simple enough to build with much more limited resources and effort.

You will use Kinesis Data Streams as the Event Stream, Kinesis Firehose to back up all the events to your S3 data lake, DynamoDB as a persistent Event Store to overcome Kinesis limits for message retention, Lambda to both subscribe to Kinesis events and implement simple microservices, API Gateway and/or CloudFront Lambda@Edge to route requests to the monolith or the microservices, and Kubernetes to run the rest of your services and your monolith happily together.

If you are feeling a little hesitant and skeptical about AWS PaaS (Platform as a Service) and FaaS (Function as a Service) offerings, and wonder whether leveraging all these managed services will lock you in with your cloud provider, well then the answer is yes, of course, and your concern is valid.

In my opinion, though, the effort it would take to build all of this on AWS, then again on GCP if you decide to, and then again on Azure if you have to, would still be less than building it all on your own on top of the IaaS (Infrastructure as a Service) offering of your cloud provider. It simply requires much less work, expertise and, in the end, money. It’s not just the services themselves: it’s all the monitoring, alerting and built-in security that you would otherwise have to roll on your own. It’s not for the faint of heart.

And, think about it: while you are busy phasing out your monolith and adding features to your application, the good cloud folks will be busy adding capabilities to the platform, ready for you to use.

Did you jump to the bottom of the page to write a nasty comment that I have been drinking the cloud Kool-Aid? Not yet?

Then let’s start looking at the (managed) services that you will need, and how you will use them together, and what you will have to be on the lookout for.

Writing to the Event Stream

Both your Monolith and your new budding microservices will be tasked to write events to the Event Stream. Any change (mutation) in the overall state of the system will go to the stream. Remember, in Event Sourcing the application state is in the Stream, there’s no way around it.

As simple as it gets

Assuming that mutations originate externally, any mutation to the application state will be sent from the external source either to your monolith or to a domain-specific microservice.

The routing, using the strangler pattern, will be implemented as an edge service, either with a CDN that supports request routing, or with the AWS API Gateway. Even better, anything that supports code running at the edge, like Fastly or CloudFront Lambda@Edge, will allow you to dynamically decide where to route the request, based on the user’s identity, or region of origin, or just as the result of a split test.

The mutation, once validated and possibly transformed into a resulting event, will be propagated to any other downstream service via the Event Stream.

The Event Stream is the core of this architecture, and on AWS the choice is obvious: Kinesis Data Streams. Which is not Kafka, even though they both start with K.

Kinesis Data Streams

“A taxi crossing an intersection in New York City by a busy street decorated with Christmas lights” by Alexander Redl on Unsplash

Kίνησις means movement, motion. Did I mention that I studied ancient Greek in school? I knew it would come in useful one day.

There are plenty of articles and resources you can google to compare the pros and cons of Kinesis vs Kafka. And to be fair, there are also many managed Kafka offerings that you could use, but in the end it’s easy to see that Kafka is your custom-built off-road vehicle, with a top-of-the-line entertainment system on board, and Kinesis is a yellow cab sedan (with the mildly annoying taxi TV). Which one would you choose?

The latter (Kinesis) will get you where you want to go, but the driver will politely decline if you ask to take a shortcut and climb vertically up a cliff face (or at least, most would). The former (Kafka) will do whatever you want, but it will sometimes require substantial upkeep if, after some fun jostling with the road, your alignment is off, your suspension is compromised from taking one bump too many, and the entertainment system breaks down.

Living in New York City, I am happy with just taking cabs, most of the time. You can also turn off the taxi TV.

Because it’s like a yellow cab, there are clearly documented service limits to what you can do with Kinesis, but those limits are meant to protect you, so that you stay within the constraints of what can be guaranteed to work, and don’t design something that will eventually take all your time and resources to keep running.

There are two types of constraints with Kinesis that you will need to work with: architectural and rate/throughput limits.

A theoretical and perfect Event Stream guarantees exactly once delivery and ordering of messages, and unlimited message retention, but it’s not something that you can find easily.

The unforgiving laws of physics, explained with a simple tweet.

The architectural constraints are what make Kinesis an imperfect (but close) approximation of the ideal Event Stream. Messages can sometimes be delivered more than once, and the ordering guarantee comes with some caveats. Oh, and you get a maximum of 7 days of retention for your messages, and only if you pay extra. By default, it’s 24 hours.

The rate/throughput constraints are, well, about API request rates and throughput, but don’t be discouraged by them: they are well documented and you can work within them in most cases.

As a side note, I remember once dealing with a managed datastore from a well known cloud provider, for which there was no way to get any kind of documented throughput or latency limits. The answer from the support engineers was only: “there are none”. So, if you are looking for something where you can suddenly push a googolplex of petabytes per second with 0ms latency, ask me which one to use. And also which bridge I have on sale.

Designing your system around these constraints is paramount, but fear not: you have control over the context, the problem domain, the nature of the messages that you are sending and receiving in the stream, and the logic used to process (or reject) them. Doesn’t it feel good to know that?

Rate/Throughput limitations

A single Kinesis stream or shard may not be enough to handle the throughput that your application needs.

A Kinesis stream increases its throughput by being divided into shards. Each shard supports up to 5 read transactions per second, for a maximum read rate of 2 megabytes per second. On the write side, each shard accepts up to 1,000 records per second, no record can be bigger than 1 megabyte, and the total write throughput per shard cannot exceed 1 megabyte per second.
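To get a feel for how those numbers translate into a shard count, here is a tiny back-of-the-envelope helper in Python (my own sketch, using the per-shard limits quoted above; check the current Kinesis documentation before trusting the constants):

import math

# Per-shard write limits quoted above (verify against the current Kinesis docs).
WRITE_RECORDS_PER_SEC = 1000
WRITE_MB_PER_SEC = 1

def shards_needed(records_per_sec, avg_record_kb):
    """Rough estimate of how many shards a write workload requires."""
    by_count = records_per_sec / float(WRITE_RECORDS_PER_SEC)
    by_bytes = (records_per_sec * avg_record_kb) / (WRITE_MB_PER_SEC * 1024.0)
    return max(1, int(math.ceil(max(by_count, by_bytes))))

# Example: 5,000 records/second averaging 2 KB each needs about 10 shards.
print(shards_needed(5000, 2))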

In theory, you can handle terabytes of data per second on a single stream if you want. Just shard and shard and shard again. But sharding does bring some issues with it, particularly when it comes to ordering guarantees, as we will see later.

These are the current limits. I think it is likely that in the future AWS will announce less restrictive ones, but don’t bank on it; start designing your system within these constraints, otherwise you may have to wait for wormholes and quantum teleportation to be a thing before you get anything running.

Lack of uniqueness

In Kinesis there’s no guarantee about the uniqueness of messages: you will get duplicates when you least expect them. This can happen because of network or application errors on either the producer or the consumer side.

This is probably the simplest constraint to address, by just designing the handling of your events to be idempotent: each message should represent the full state of the entity at the moment of the event, and not the delta between the previous and the current state.

Let’s go back to our Medium example from the previous post, and the claps event message.

Idempotency for the claps event

This message captures the state of the claps from user 32142459 for the post 4897249772 in its entirety. The clap button was pressed 21 times. Or maybe this user had already clapped 20 times before, and then she came back to clap once more. It doesn’t really matter. The application that produced it (either the webapp or the mobile app) knows the full state of claps for the post and transmits it as a single event. Even if this message had been transmitted or received twice, the final count of claps from this user for this post would be the same, and no damage is done.
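In code, a full-state (and therefore idempotent) claps event could look something like the sketch below; the field names are illustrative, not Medium’s actual schema.

# A full-state claps event: safe to deliver (or replay) twice, because it
# always carries the complete count for this (member, post) pair.
claps_event = {
    "type": "ClapsPerPost",
    "memberId": "32142459",
    "postId": "4897249772",
    "claps": 21,  # total claps so far from this member, not "+1"
}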

Think if instead each clap had been sent as an individual event. A duplicate would effectively inflate the number of claps, and nobody would want that.

In some ways, Event Streaming is a type of State Transfer (Roy, are you reading this?) between an event emitter and one or more event subscribers. You need to model your domain so that entities are granular enough that their state is not shared among different emitters. If you can do that, then you can achieve idempotency (and also strict ordering) with relative ease.

The count of claps for a post, sent by the claps service

Consider the entity TotalClapsPerPost. It belongs to the Post entity. The full state of this entity is undetermined in the browser, because the browser doesn’t know if somebody else is also clapping for the same story. Imagine what would happen if your posts service were just subscribing to the claps events, using them as partial updates, and summing them up as they come in for the corresponding post. Since the processing is not idempotent, duplicates would inevitably inflate the total claps again.

Just find, in your application domain, which agent is the emitter that can generate the entire state of the entity, and listen to its messages. In this example, it’s the claps service. So, for each clap event for a post, the claps service is the authoritative emitter agent for the TotalClapsPerPost entity: it will query its own datastore and emit an event with the count of claps for the post anytime you clap or unclap.
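A claps service acting as the authoritative emitter could be sketched roughly like this (the stream name, field names and the datastore query are all hypothetical):

import json
import boto3

kinesis = boto3.client("kinesis")

def count_claps_for_post(post_id):
    # Placeholder: query the claps service's own datastore here.
    return 0

def on_clap_event(clap_event):
    """React to a ClapsPerPost event by emitting the aggregated state."""
    post_id = clap_event["postId"]
    # Only the claps service knows the total for the post, because it owns
    # the claps datastore.
    total = count_claps_for_post(post_id)
    kinesis.put_record(
        StreamName="event-stream",
        PartitionKey=post_id,
        Data=json.dumps({
            "type": "TotalClapsPerPost",
            "postId": post_id,
            "totalClaps": total,
        }),
    )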

Lack of ordering

In Kinesis, exact ordering of messages within a shard is guaranteed only if you are putting records sequentially, one at a time with the PutRecord API, and also specifying the optional argument SequenceNumberForOrdering, but not if you are using the more efficient PutRecords bulk load API, where individual records in an array could be rejected and must be retried later (hence out of order).

Note that if you are using something like the Kinesis Aggregation Library, which packs multiple user records in a single, larger record, you could get by using sequential PutRecord calls with SequenceNumberForOrdering, and still maximize your throughput without losing ordering on a single shard. Please also note that using instead the Kinesis Producer Library alone is not sufficient to guarantee ordering.
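With boto3, the sequential PutRecord pattern looks roughly like this sketch, chaining each call’s returned sequence number into the next call:

import json
import boto3

kinesis = boto3.client("kinesis")

def put_in_order(stream_name, partition_key, events):
    """Write events one at a time, chaining SequenceNumberForOrdering so
    that Kinesis preserves their order within the shard."""
    last_sequence_number = None
    for event in events:
        kwargs = {
            "StreamName": stream_name,
            "PartitionKey": partition_key,
            "Data": json.dumps(event),
        }
        if last_sequence_number is not None:
            kwargs["SequenceNumberForOrdering"] = last_sequence_number
        last_sequence_number = kinesis.put_record(**kwargs)["SequenceNumber"]
    return last_sequence_number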

But if you need to push more data, using multiple shards, it is quite hard to have any reliable ordering at all, since the sequenceNumber parameter that Kinesis adds to the message is unique to the shard, and when you are reading from each shard you have no safe way to determine the proper temporal sequence, because each shard is independent of the others, and one could deliver messages “faster”.

Compared to the lack of uniqueness, this “somewhat ordered” architectural constraint is more problematic, especially when you consider that we defined an Event Stream as a ledger, and ledgers are, by definition, ordered.

On the plus side, not all types of messages need to be delivered in a strict order. For instance, if you and I are posting two stories, and they are coming to Medium out of sequence, it’s not necessarily a problem. Or, if we are clapping for the same story, it still doesn’t matter which message comes first, yours or mine.

So the thing to keep in mind, before panicking and giving up on Kinesis, is that in many cases the lack of ordering just doesn’t matter. You really have to think about the domain of your services, messages and application, and see in which contexts it is a problem that you actually need to address.

claps, unclaps and shards

Typically, ordering matters for two messages related to the same entity. For instance, you could have clapped for this story 21 times but also decided to quickly “unclap” it.

Yes, it’s tricky but possible to do that, and really unfair to the writer. Think about it.

If Medium has two or more shards in the stream, it’s likely that your claps are going to be received out of order once sent to downstream services by your claps service. Say that the claps service sends a TotalClapsPerPost message that the posts service subscribes to. In this case, there would be two events, one with no claps, and one with 21 (assuming nobody else is clapping too). So the posts service may end up recording 21 claps for this story, because the no-claps event came first, and the 21-claps event last. This is obviously an example where ordering matters. You want to be able to change your mind, and make it count when you unclap!

using a better partition key

You need to consider what has to be taken into account for the ordering requirement. In this example, ordering clearly matters only for messages from the same user, or for the same post, so you can use either the PostId or the MemberId, or combine them together into a partition key. Your claps and unclaps will go to the same shard, and ordering will still be preserved.

using a domain specific partition key

Using a domain-specific partition key, and using PutRecord with SequenceNumberForOrdering is the simplest way to guarantee strict ordering on the records that will be retrieved from each shard in the stream. Using other things, like application specific timestamps or the Kinesis generated approximateArrivalTimestamp field will just give you an approximate order, and that is probably not enough for your consistency requirements.
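Deriving the key from the domain identifiers can be as simple as this (a sketch; how you compose the ids depends on your actual ordering requirements):

def partition_key_for_claps(post_id, member_id):
    """Claps and unclaps from the same member for the same post share a
    partition key, and therefore a shard, so their relative order holds."""
    return "{}:{}".format(post_id, member_id)

# Both the clap and the later unclap use the same key, hence the same shard.
key = partition_key_for_claps("4897249772", "32142459")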

I know, right now you are feeling cheated, because I made it too simple. And what if the volume of state mutations for the same entity is so high that a single shard is not enough and then scaling out is not a solution? Before you start pushing and keeping pressed the clap button to see what happens (insert evil grin here), please consider that if you exceed the throughput limitations of a single shard for a single entity, maybe you are modeling the problem wrong.

You would need to have 1,000 state changes per second or 1MB/second of mutations for a single entity before scaling out to multiple shards stops solving the problem. That would be a really hot shard. I am sure it’s a possibility, but I am also sure that you should be able to rethink the application domain where this extremely fickle entity exists and split it into many sub-entities, using different partition keys and hence shards.

Reading the Event Stream

Photo by Jilbert Ebrahimi on Unsplash

Well well, now you have some ideas on how to produce idempotent and ordered events in the Kinesis Data Stream. What about reading them? That should be simpler, right?

Wrong.

It’s not as trivial as it seems, because in Event Sourcing, when subscribing to a Stream, the reading state is stored in the subscriber, and you need to keep track of the last event you read from each shard. You need to do that in a durable way, so that if your application dies in the middle of reading you can start from where you left off, and also so that you are able to manage dynamic shard allocations, to make sure that when your stream is split into more shards, you can still subscribe to all of them.

You can use the Kinesis Client Library, which does all of that for you, but only if you run your stream-subscribing applications on EC2 (meaning on servers) and are willing to spend some time configuring it correctly. There’s nothing wrong with that, of course, and you could leverage Kubernetes to do it elegantly as a scalable deployment. Or you can just use Lambda Kinesis subscriptions via event source mapping.

Lambda Kinesis subscriptions

Look Mom, it’s serverless!

In this model, AWS handles basically everything for you. It will periodically invoke a Lambda function for each shard in your stream, passing an object with at most BatchSize records; it will keep track of the last record that you read in the stream, restarting from the last safe checkpoint if any error happens during processing in your function; and finally it will keep track of newly added shards and instantiate new copies of your function for each one of them.
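A subscriber in this model is just a handler receiving a batch of base64-encoded records; a minimal Python sketch looks like this:

import base64
import json

def handler(event, context):
    """Invoked by the event source mapping with up to BatchSize records."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Domain-specific processing goes here; raising an exception makes
        # the whole batch be retried from the last checkpoint.
        print(record["kinesis"]["sequenceNumber"], payload)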

It’s like magic. It’s perfect. I cannot think honestly of a reason to not use it. Reading from Kinesis using Lambda is very much a code it, deploy it and forget it experience. It just works.

Until it doesn’t.

But it’s not a bug, it’s a feature. Remember when I talked about rate limits? And how 5 reads per second per shard is one of them? That means that if you have just 5 functions subscribing to the stream, and you expect to read data more frequently than once per second because you care about latency in your system, it just won’t work. Also, adding shards to the stream will not help, because for each shard you will still have 5 functions reading from it, and their combined read rate will still be at most 5 reads/second per shard.

This constraint would not be so restrictive if you could define low-latency, “high priority” functions that are invoked more often, at the expense of low-priority ones that can afford a higher latency because of less stringent consistency requirements.

That’s where you would think that the BatchSize parameter that you define for your functions would become important. Because a function that still has more records to read will be invoked more often, you could set a smaller batch size for functions with more real-time requirements, and a larger batch size for those that can tolerate a higher latency.

But, alas, it’s not enough, and it doesn’t work like that in practice. Testing this scenario with multiple functions with different batch sizes doesn’t show a clear and predictable behavior.

AWS provides a Lambda metric, IteratorAge, which measures the age of the last record for each batch of records processed. This metric is essential to track how much your function is lagging when subscribing to a stream.

The function with the smaller batch size will probably be invoked more often, but it will also be throttled more, and it may end up lagging behind, with a higher IteratorAge than the other functions. By empirical observation (I learned that from my old school buddy Galileo), it seems that contention on the shared rate limit causes the scheduler that invokes your functions to randomly succeed in executing one function but not another, with very little guarantee that it will be the one you wanted. It kind of works, but it doesn’t.

To better understand this behavior, I deployed 16 sink functions that just read from the same stream and sink the records they read. For each one of them I assigned a batch size (see the table on the left) and started pumping records into Kinesis. I thought that, in theory, I would be able to see a pattern emerge, with the sink functions from sink00 to sink03 being invoked more often and consequently having a smaller IteratorAge than the other functions, such as sink12 to sink15, which have a larger batch size. I also set up a CloudWatch dashboard to keep track of the various functions and their metrics.

What I got, after 15 minutes of pushing records in the stream, with a rate of 1.3k records/minute is this beautiful piece of modern art:

when theories fail (click to zoom)

I think it’s quite obvious that the distribution of invocations is random, and the iterator age is all over the place, between 1 and 80 seconds, also following a seemingly random distribution.

So what now? Is it doom and desperation? No, not yet. I have a clever and hacky solution that hopefully, if somebody from the amazing AWS Lambda team reads this, you won’t need in the future. But in the meantime, here it is.

Your Lambda functions are invoked synchronously by AWS, every time there’s data in the stream, using the request/response model. In other words, those are blocking calls that will not be re-issued until your function returns (with either success or failure). Wait, do you see it already? All you need to do to fix the problem is to make sure that your not-so-urgent functions take longer to execute than the ones for which you want the lowest latency.

Incidentally, this could be a consequence of having a larger batch to process, in which case there is nothing to worry about, but it could also be forced with what we all know to be the secret weapon of every good programmer: the ubiquitous sleep() function.

Did you just roll your eyes? I think I saw you rolling your eyes.

So, for each sink function, I added a pause parameter, as shown in this table. The larger the pause, the larger the batch size, to make sure that no function falls too far behind. It’s important to set a batch size that is at least double the expected throughput for the stream. I then ran the same load generator for 15 minutes, and this time I got exactly what I wanted: a way to give a function higher priority and make sure it has the lowest IteratorAge, at the expense of other, less time-critical functions that can afford a higher latency.

eureka, it worked!

If you zoom in on the graph above, you will see that the highest priority functions, sink{00–03}, have the lowest iterator age, less than a couple of seconds, while the lowest priority functions, like sink{12–15}, have a higher, but constant, iterator age.

To be clear, and before you blame me saying that you read it here: sleeping in a Lambda is absolutely an anti-pattern, and I would never, ever suggest it in any case other than this one, because you are charged by invocation time, and paying for sleeping is something you wouldn’t want to do (I mean, in real life sometimes I would, but not for my Lambdas).

On the other hand, though, when you look at the actual costs, it turns out to be much less expensive than you would think. One Lambda polling a shard every 60 seconds, with 128MB of memory assigned to it, is going to cost you a not-so-whopping $5.41/month. Even if you have 10 shards, $54 is not a lot to pay for peace of mind, and it’s not much more than what you will pay for a low-latency function that never sleeps: assuming it is invoked 5 times/second with a 100ms execution time, it will cost $5.29/month.
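If you want to sanity-check those numbers, the arithmetic is simple enough (list prices at the time of writing; verify them against the current Lambda pricing page):

# Lambda list prices at the time of writing (verify before relying on them).
GB_SECOND = 0.00001667        # dollars per GB-second
PER_MILLION_REQUESTS = 0.20   # dollars per million invocations

SECONDS_PER_MONTH = 30 * 24 * 3600
MEMORY_GB = 128 / 1024.0

# One 128MB function sleeping ~60s, invoked once per minute per shard.
sleeper_invocations = SECONDS_PER_MONTH / 60
sleeper = (sleeper_invocations * 60 * MEMORY_GB * GB_SECOND
           + sleeper_invocations / 1e6 * PER_MILLION_REQUESTS)

# One low-latency function invoked 5 times/second for 100ms each.
busy_invocations = SECONDS_PER_MONTH * 5
busy = (busy_invocations * 0.1 * MEMORY_GB * GB_SECOND
        + busy_invocations / 1e6 * PER_MILLION_REQUESTS)

print(round(sleeper, 2), round(busy, 2))  # roughly 5.41 and 5.29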

Lambda is cheap, face it.

Not my highest achievement in software engineering (click for gist)

I know that at this point you want to immediately grab this code, so here it is, in all its beautiful hackiness.
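The gist linked above has the original Javascript; the idea itself, translated into a few lines of Python, boils down to something like this sketch (the PAUSE_SECONDS environment variable is my own naming, not the gist’s):

import base64
import json
import os
import time

# How long this subscriber sleeps before returning, effectively lowering
# its own polling rate (hypothetical environment variable name).
PAUSE_SECONDS = int(os.environ.get("PAUSE_SECONDS", "0"))

def process(payload):
    # Your domain logic goes here.
    print(payload)

def handler(event, context):
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        process(payload)

    # The hack: keep the synchronous invocation open so the event source
    # mapping will not call this function again until the pause is over,
    # leaving more of the shard's 5 reads/second to higher-priority functions.
    time.sleep(PAUSE_SECONDS)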

Yes, this would have been easier in any other language than the pre-ES6 version of Javascript that you can currently run on Lambda, since you can’t really make blocking calls, but the useful serverless Babel/Webpack plugin gives you async/await, so there you go. (Edit: AWS just announced support for Node v8.10 and async/await. Yay!)

Also, if you are more of a Python person, here’s the full code that demonstrates using 16 Lambdas with different pause times for a single stream, in Python.

The good thing is that when AWS, after reading this post, implements an additional parameter in the CreateEventSourceMapping API just like this (and remember, you read it here first):

POST /2015-03-31/event-source-mappings/ HTTP/1.1
Content-type: application/json
{
    "PauseTime": number,
    "BatchSize": number,
    "Enabled": boolean,
    "EventSourceArn": "string",
    "FunctionName": "string",
    "StartingPosition": "string",
    "StartingPositionTimestamp": number
}

you will be able to remove the sleeping hack and fully control the rate of invocation.

More troubles

Another constraint that you need to keep in mind when using Lambda with Kinesis is the max payload size for a Lambda synchronous invocation. If you set your BatchSize too high and the resulting payload would be greater than 6 megabytes, AWS will not send you all the records you expected, but only as many as can fit in a single invocation, and your function may start to lag behind.

Let’s assume your data rate is 8000 records/minute, the average message size is 1 kilobyte and your batch size is 8000, with 60 seconds of pause time. At the first invocation only 6000 messages will be received, and after a minute you will have the remaining 2000 still to process (plus the next 8000), but you will still receive only 6000. You could tweak the sleep hack logic, for instance reducing the time you sleep based on the approximateArrivalTimestamp of the last record you read (if it’s too far behind, sleep for less time), or you can just increase your number of shards. In that case, each shard will effectively carry fewer records, and they can all be sent to the Lambda in a single invocation.

And finally, my favorite trouble is with CloudFormation. Don’t get me wrong. I love CloudFormation.

But if you are using it to register your Lambda subscriptions, you will soon discover that the AWS::Lambda::EventSourceMapping resource invokes the CreateEventSourceMapping API which, if you have more than a few functions reading from the same stream, will randomly fail with an exception that looks something like this:

Received Exception while reading from provided stream. Rate exceeded for stream event-store-dev-event-stream under account ******. (Service: AmazonKinesis; Status Code: 400; Error Code: LimitExceededException; Request ID: ed339cfe-2d25-df52-be02-6eef40c6e5d1)

It’s like getting a “Hi, I bailed out at the first error” message. It’s also annoying because CloudFormation will automatically try to roll back to the previous configuration, and that may also fail with the same exception, making your stack update operation end up in the infamous update rollback failed state. And it’s particularly annoying because both the Serverless Framework and AWS SAM rely on this CloudFormation resource to create the subscription.

I am sure that the amazing CloudFormation team, after reading this, will already be working on a fix (simply retrying what is essentially a benign exception). In the meantime, for the rest of us, I have pushed to the same GitHub repo I mentioned before a bit of Python code that creates a CloudFormation Custom Resource to register Lambda functions to Kinesis without bailing out on the first error. I tested it with those 16 subscriptions and it works just fine, so check it out. And you are welcome.
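The repo has the full custom resource; the heart of the workaround is just retrying the subscription call when it fails with that benign rate-limit exception, roughly like this sketch (the exact error codes you see may differ, so inspect and adjust):

import time
import boto3
from botocore.exceptions import ClientError

lambda_client = boto3.client("lambda")

def create_mapping_with_retry(function_name, stream_arn, batch_size, max_attempts=10):
    """Retry CreateEventSourceMapping with exponential backoff instead of
    bailing out on the first rate-limit error."""
    for attempt in range(max_attempts):
        try:
            return lambda_client.create_event_source_mapping(
                EventSourceArn=stream_arn,
                FunctionName=function_name,
                BatchSize=batch_size,
                StartingPosition="TRIM_HORIZON",
            )
        except ClientError as error:
            code = error.response["Error"]["Code"]
            if code not in ("LimitExceededException", "TooManyRequestsException"):
                raise
            time.sleep(2 ** attempt)
    raise RuntimeError("could not create the event source mapping")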

Alternative patterns

I should also mention that there are other solutions to the constraints imposed by Kinesis on a single stream, and they are all predicated on using multiple streams to overcome those nasty service limits.

First of all, and most simply, if your application domain allows it, you could use separate streams for separate sub-domains. For instance, your clickstream data for analytics doesn’t necessarily have to use the same stream that you are using for updating your business entities. Yes, having a single stream is nice and simple, but I don’t think you should consider it a dogma; just be practical about your actual needs and design a solution that works for you.

The other approach, the Kinesis Fanout pattern, suggests using different techniques to branch out one single input stream in many domain specific streams, each carrying a subset of the events based on the domain. It’s a valid approach, but it comes with some caveats that you need to keep in mind.

Lambda-based fanout: AWSLabs published this project some time ago, which allows you to dynamically map various outputs (not only other Kinesis streams) to a Kinesis input stream. Or you can read this good article to learn about another asynchronous Lambda fanout pattern. Or this other good one. I could keep googling and find more. The main issue that I see with all of these approaches is that you will be trading ordering for performance, or increasing the number of duplicate messages that you will receive (which is easier to handle).

Let me explain: your fanout function will be invoked with, say, 1,000 records. Based on the domain of each record, you might send 500 to Stream A and 500 to Stream B. You need to do that one record after the other, using the SequenceNumberForOrdering parameter as explained above. The problem happens when, for whatever reason, a PutRecord to a downstream stream fails. You can either retry with some exponential backoff algorithm, keep going and retry the failed records at the end, or fail the entire batch (returning an error in your Lambda) and restart from scratch. In the first case, all records after the failed one will be delayed, and if it happens a few times you start to lag behind. In the second case, you lose strict ordering (see the sketch below), and in the third case, all your downstream streams will receive duplicate messages (and you will still be lagging).
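To make the tradeoff concrete, here is a sketch of the second flavor, the one that keeps going and retries failures at the end, giving up strict ordering (the ROUTES table and stream names are made up):

import base64
import json
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical mapping from event type to a domain-specific output stream.
ROUTES = {"ClapsPerPost": "claps-stream", "PostPublished": "posts-stream"}

def handler(event, context):
    failed = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        target = ROUTES.get(payload.get("type"))
        if target is None:
            continue
        key = record["kinesis"]["partitionKey"]
        try:
            kinesis.put_record(StreamName=target, PartitionKey=key,
                               Data=json.dumps(payload))
        except Exception:
            failed.append((target, key, payload))

    # Retrying the failures last keeps the batch moving, but those records
    # now arrive downstream out of order.
    for target, key, payload in failed:
        kinesis.put_record(StreamName=target, PartitionKey=key,
                           Data=json.dumps(payload))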

Kinesis Analytics-based fanout: Kinesis Analytics is an extremely powerful tool for real-time stream analytics, using SQL to select records that you can then route to other streams based on your own logic. You don’t even need to write any actual code, and there’s an open source GitHub project to get you started. But there are limits that may make it an unfeasible solution: in particular, each record (or data row) cannot be bigger than 50 kilobytes, and the data format for your records is restricted to JSON, CSV and TSV (aggregation is not supported). You can work around this with a preprocessing Lambda, but then that’s more complexity you are adding to the system.

So, caveat emptor, it’s certainly worth experimenting with these patterns if the sleep hack is not good enough in your case. The important thing is that you run some realistic simulations of your loads, and keep testing with different solutions.

Records retention and stream snapshots

Photo by Davide Cantelli on Unsplash

In Kinesis, every message in the stream lives for up to 24 hours, so a momentary downtime in a subscribing service is not equivalent to a data loss scenario: you just fix the issue, start your function again, and it will catch up.

Since sometimes fixing the issue with your code is not trivial, you can extend this to up to 7 days with just an API call. Seven days is a lot if it’s meant to provide a buffer against one of your services going down, so that it can restore the state after the outage by quickly catching up with what it missed. But it’s obviously not enough if you are adding a new service months after events started streaming through your platform.

Also, if your event stream is the source of truth for your application state, you will probably want to make sure that it’s backed up, in more than one way.

I would certainly recommend, as a first line of defense, using the managed datastore offerings from your cloud provider, and making sure that they are automatically snapshotted, at least daily. But there’s more that you can do.

Kinesis Firehose

The first thing you will want to do is make sure that one consumer of your stream is a Kinesis Firehose delivery stream, with S3 as its target. Thankfully, that’s as simple as connecting the Firehose to Kinesis and choosing a bucket where to dump all the good stuff. Keep in mind that adding a Firehose to your stream will consume some of your read quota, but nothing comes for free. And you should know how to work around your read limits by now.
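With boto3 the wiring can be sketched along these lines (names, ARNs and buffering values are placeholders; the role needs permission to read the stream and write to the bucket):

import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="event-store-backup",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/event-stream",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-read-stream",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-write-s3",
        "BucketARN": "arn:aws:s3:::my-event-data-lake",
        "Prefix": "events/",
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
    },
)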

Once the records are in S3, you can leverage Glue or your favorite ETL process to load them and play them back into the stream. Actually, I would recommend having a separate PlayBack stream. When you are playing back events, you want to do that fast, especially if you have lots of messages that need to go through. Having a separate stream to read from, and a separate subscription, helps avoid saturating the main stream for your application.

DynamoDB Event Store

Ideally you would want to use something like DynamoDB to store your events forever. That would allow you to write quick ETLs that are selective about which events to play back into the stream. It could also be a way to create point-in-time snapshots for each entity, so that instead of having to play back everything from January 1st, 1970 (when, as we all know, the world started), you could start from yesterday.

A simple store for your time based events

It’s not as trivial as you may think, because time-based events have a tendency to happen at the same time, and if you choose a naive partition key for your table, they will create a moving hot spot in DynamoDB, like a beautiful wave of throttling errors. But here is a place to start. It’s far from complete and production ready. I have a couple of good ideas on how to improve from there, creating point-in-time snapshots and speeding up the playback of events, and PRs are welcome.
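One possible starting point (a sketch, not the repo’s actual schema) is to partition by entity id and sort by timestamp plus sequence number, so that writes spread across partitions instead of all piling onto the current time:

import boto3

dynamodb = boto3.resource("dynamodb")
events = dynamodb.Table("event-store")  # hypothetical table name

def store_event(entity_id, timestamp_ms, sequence_number, payload):
    """Partition by entity id, sort by time plus the Kinesis sequence number,
    keeping a total order of events per entity."""
    events.put_item(Item={
        "entityId": entity_id,                                           # partition key
        "eventKey": "{:020d}#{}".format(timestamp_ms, sequence_number),  # sort key
        "payload": payload,
    })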

I realize that there’s much more to cover, and in the next installment of this series I will dig deeper into the blueprint shown at the top of this post. In the meantime, feel free to chime in with questions in the comments.
