Will it scale? Let’s load test geohashing on DynamoDB

James Beswick and Richard Boyd test how far a geohashing solution can scale serverlessly.

In my article on geohashing for DynamoDB, I showed a simple way to query a list of locations to find the nearest Starbucks store. The method uses a few lines of code to load and query data with a geohashing NPM package.

But how scalable is this approach? In this article, we measure the performance of the NPM library with load testing, analyze the cost of running it, and look at architecture options for massive scale.

Today I’m working with iRobot Cloud Engineer Richard Boyd, who is helping me find some smart ways to boost performance.

What’s the NPM library doing?

The DynamoDB-Geo library manages the underlying DynamoDB table, setting up the partition key and secondary index needed for the lookups. Typically each search request runs 8 individual queries under the hood rather than fetching all the results from the DynamoDB table in one go.

This is because once the initial geohash square is identified, the algorithm also needs to know what’s in the surrounding 8 squares, since the item might be on the edge of a square, or the radius specified may cover more than one square.

Some of the performance characteristics are also attributable to our choice of hash key length: the shorter the hash key, the larger the square; the longer the hash key, the smaller the square. For example, if you have retail locations spread across the globe but use a long hash key and a large search radius, performance will be much worse than with a shorter hash key and a smaller search radius.
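For context, here’s a minimal sketch of a lookup using the dynamodb-geo package, based on its documented configuration and queryRadius call. The table name, hash key length, and coordinates are illustrative rather than the values used in the original article:

```javascript
const AWS = require('aws-sdk');
const ddbGeo = require('dynamodb-geo');

const ddb = new AWS.DynamoDB();

// Hypothetical table name; hashKeyLength controls the square size described above.
const config = new ddbGeo.GeoDataManagerConfiguration(ddb, 'StarbucksLocations');
config.hashKeyLength = 5;

const tableManager = new ddbGeo.GeoDataManager(config);

// Find every store within ~1 km of a point in Midtown Manhattan. Under the hood
// this fans out into several individual DynamoDB queries over the covering squares.
tableManager
  .queryRadius({
    RadiusInMeter: 1000,
    CenterPoint: { latitude: 40.7484, longitude: -73.9857 }
  })
  .then(results => console.log(`${results.length} stores found`));
```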

Load testing with Artillery

The likely load for this kind of service is hard to estimate, but we can create a rigorous stress test for the API very easily with the Artillery toolkit.

This Node package can simulate a reasonable number of independent, concurrent users from your local development machine. The actual number depends on your PC and network connection, since under the covers Artillery spins up many virtual users firing requests in parallel.

On my machine, I can simulate 20 users per second hitting the API, but measurements become unreliable at higher rates due to resource constraints on my PC.

Configuring Artillery with a custom function to simulate random latitude and longitude requests within a bounding box.

In this test, I have set a boundary square around the Manhattan area, using a custom function in Artillery to randomly select a latitude and longitude within these bounds. I chose this area due to the large number of Starbucks stores in New York City — I’m likely to get multiple results regardless of exactly where the test lands.
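Artillery lets you reference a JavaScript processor file from the test config and call its functions from the scenario flow. Here is a minimal sketch of the kind of custom function described above, with a rough Manhattan bounding box (the bounds, file name, and function name are illustrative):

```javascript
// processor.js, referenced from the Artillery YAML config via `processor: "./processor.js"`

// Approximate bounding box around Manhattan (illustrative values).
const BOUNDS = {
  minLat: 40.70, maxLat: 40.88,
  minLon: -74.02, maxLon: -73.91
};

function randomBetween(min, max) {
  return min + Math.random() * (max - min);
}

// Artillery invokes custom scenario functions with (context, events, done).
function generateRandomCoordinates(context, events, done) {
  context.vars.lat = Number(randomBetween(BOUNDS.minLat, BOUNDS.maxLat).toFixed(6));
  context.vars.lon = Number(randomBetween(BOUNDS.minLon, BOUNDS.maxLon).toFixed(6));
  return done();
}

module.exports = { generateRandomCoordinates };
```

The scenario flow then injects `{{ lat }}` and `{{ lon }}` into each request before firing it at the API.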

Artillery load test results.

This test runs for 2 minutes (at 20 users per second), completing 2,400 requests in total. During this time, DynamoDB consumption rises to 86 RCUs and latency drops to 7–10ms.

In this initial configuration with the code shown above and with no caching in place, we can easily serve 1.7 million API hits daily (at 20 requests a second evenly spread throughout a day). But what are the costs?

  • DynamoDB: requires under 100 RCUs, so less than $10 a month.
  • API Gateway is charged at $3.50 per million requests on the lowest tier, so 52.3 million requests monthly costs around $183.
  • Lambda: the average function execution was 382ms, billed as 400ms; with 1024MB memory and 52 million executions, that costs around $357.
  • The total estimated monthly cost to serve 52 million look-ups in this setup is around $550 (see the quick calculation below).
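For reference, here is roughly how those numbers fall out. This is a back-of-the-envelope sketch using approximate us-east-1 list prices at the time of writing (they may have changed since), not an official estimate:

```javascript
// Rough monthly cost check for a sustained ~20 requests per second.
const requestsPerSecond = 20;
const secondsPerMonth = 60 * 60 * 24 * 30.25;                 // ~2.6M seconds
const monthlyRequests = requestsPerSecond * secondsPerMonth;  // ~52.3 million

// API Gateway REST API: $3.50 per million requests on the first pricing tier.
const apiGatewayCost = (monthlyRequests / 1e6) * 3.50;        // ≈ $183

// Lambda: 382ms average is billed as 400ms (100ms increments) at 1024MB,
// plus $0.20 per million invocations.
const gbSeconds = monthlyRequests * 0.4 * (1024 / 1024);
const lambdaCost = gbSeconds * 0.0000166667 + (monthlyRequests / 1e6) * 0.20; // ≈ $360

// DynamoDB: ~100 provisioned RCUs at roughly $0.00013 per RCU-hour.
const dynamoCost = 100 * 0.00013 * 730;                       // ≈ $9.50

console.log((apiGatewayCost + lambdaCost + dynamoCost).toFixed(0)); // ≈ $550
```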

More locations, more writes

The Starbucks location list is relatively static and the activity on the database table is almost exclusively ‘reads’. Based on DynamoDB’s performance characteristics, the number of items in the table isn’t likely to affect the read performance, and the table could be substantially larger with no noticeable difference in query latency.

However, the NPM library uses the geohash as the partition key, so it’s not possible to update the location of an item already in the table (you must delete and recreate the item). This isn’t an issue for physical places like Starbucks stores, which rarely move, but if you have a list of objects that move frequently, this probably isn’t the right library to use.
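To “move” a location with this library, you delete the old point and write a new one so the keys get recomputed. A minimal sketch, assuming the putPoint/deletePoint calls from the dynamodb-geo README (the table name and store ID are made up):

```javascript
const AWS = require('aws-sdk');
const ddbGeo = require('dynamodb-geo');

const ddb = new AWS.DynamoDB();
const config = new ddbGeo.GeoDataManagerConfiguration(ddb, 'StarbucksLocations'); // hypothetical table
const tableManager = new ddbGeo.GeoDataManager(config);

// Moving a store means deleting the item at its old coordinates...
async function moveStore(storeId, oldPoint, newPoint, attributes) {
  await tableManager.deletePoint({
    RangeKeyValue: { S: storeId },
    GeoPoint: oldPoint                   // { latitude, longitude }
  }).promise();

  // ...and re-creating it at the new coordinates.
  await tableManager.putPoint({
    RangeKeyValue: { S: storeId },
    GeoPoint: newPoint,
    PutItemInput: {
      Item: attributes                   // any extra attributes, in DynamoDB attribute format
    }
  }).promise();
}
```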

Global tables

If a large retail company like Starbucks used this service to find their nearest store globally, DynamoDB Global Tables would be a great option for spreading the load across multiple regions and reducing the latency of the lookup. In this case, if 50% of the searches occurred in Europe, replicating the table to this region effectively offloads 50% of the traffic from the main table in us-east-1.
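As a sketch of how little this takes: once identical tables with streams enabled exist in each region, a global table (2017 version) can be created with a single SDK call. The table name and regions below are illustrative:

```javascript
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB({ region: 'us-east-1' });

// Assumes a table named 'StarbucksLocations' already exists in both regions,
// with DynamoDB Streams (NEW_AND_OLD_IMAGES) enabled, which global tables require.
dynamodb
  .createGlobalTable({
    GlobalTableName: 'StarbucksLocations',
    ReplicationGroup: [{ RegionName: 'us-east-1' }, { RegionName: 'eu-west-1' }]
  })
  .promise()
  .then(() => console.log('Cross-region replication enabled'))
  .catch(console.error);
```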

This approach may not be suitable for all business cases (for example, where lookups are coming from a single geographic region or where data may not be moved out of specific regions) but given the simplicity of using this feature, it’s another tool for achieving greater scale with no change to our code.

API Gateway caching

Our current approach has the user supplying their latitude and longitude coordinates, while our Lambda function computes the geohash (z-index) and finishes the lookup. This is great for our users because they don’t need to know anything about geohashing to send a request, but it’s bad for caching. Why?

Unless two users are standing in exactly the same spot, they will send slightly different coordinates, even though their positions map to the same geohash. As things stand, we can’t leverage request caching at the API level.

However, if we pull the hashing computation out of the Lambda and put it in the client-side library, a user’s request goes from POST /starbucks body={lat:45.345, lon:45.123} to GET /starbucks?geohash=876543. What’s the difference?

Users from many different coordinates will ‘collapse’ or ‘bucket’ into a single geohash index, and we can re-use their requests in a cache to reduce the load on our database. For example, suppose that (45.345, 45.123) geohashes to 876543; it is very likely that (45.145, 45.323) geohashes to the same bucket, so both requests query the same bucket and we can reuse the response from the first request to satisfy the second.
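As a sketch of the client-side change, compute the hash locally and send it as a query string parameter. I’m using the generic ngeohash package as a stand-in here; in this particular stack the client would need to reproduce the same numeric hashing scheme the table uses, so treat the hash function, precision, and URL as illustrative:

```javascript
import ngeohash from 'ngeohash';

// Collapse precise coordinates into a shared bucket: a shorter hash means a
// bigger bucket and more cache hits, at the cost of coarser results.
const lat = 45.345;
const lon = 45.123;
const geohash = ngeohash.encode(lat, lon, 6);   // base-32 geohash string at precision 6

// Every user whose coordinates fall inside this bucket now issues an identical,
// cacheable request.
fetch(`https://api.example.com/starbucks?geohash=${geohash}`)
  .then(res => res.json())
  .then(stores => console.log(stores));
```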

Because the creation of a new Starbucks is a relatively slow-moving process, we can set a high time-to-live (TTL) on the cache so that responses are remembered for several days. When you enable caching, API Gateway caches GET requests by default, and the API owner can further specify which query string parameters are used to determine whether a request “hits” or “misses” the cache.

Is there a better (serverless) way?

Richard had an interesting point-of-view on this question: “Now that we have pulled most of the logic out of the Lambda function and pushed it to the client-side of our application, our Lambda function is a cold dead husk of its former self. The most respectful thing we can do now is to thank it for the joy (and caffeine) it once provided, hug it deeply, and promptly delete it!”

It’s an interesting idea because it gets to the heart of the tradeoffs in all of these decisions. With an AWS service integration on our API Gateway resource, we can connect API Gateway directly to our DynamoDB table to query geohashes, without the added complexity of managing a Lambda function.
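Under the hood, the integration’s request mapping template translates the geohash query parameter into a DynamoDB Query. It’s shown here as the equivalent aws-sdk call for clarity; the table and attribute names assume dynamodb-geo’s defaults and should be checked against your own schema:

```javascript
const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB({ region: 'us-east-1' });

// The same query API Gateway would issue directly against the table:
// fetch everything stored under one geohash partition.
ddb
  .query({
    TableName: 'StarbucksLocations',          // hypothetical table name
    KeyConditionExpression: 'hashKey = :h',   // 'hashKey' is dynamodb-geo's default partition key name
    ExpressionAttributeValues: {
      ':h': { N: '876543' }                   // the geohash bucket from GET /starbucks?geohash=876543
    }
  })
  .promise()
  .then(data => console.log(data.Items));
```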

The major drawback to this approach is that our Lambda function could have made several query calls to the location table and aggregated the responses into a single response. We lose this capability with a service integration. Richard concludes, “I would argue that if you’re getting more than 1MB worth of location data for Starbucks near you, the second megabyte is like the second page of a Google search — it’s there, but if you haven’t decided which link to click on by now, the second page doesn’t have the answers you seek.”

Richard’s idea here dramatically reduces the cost by eliminating Lambda completely, but also improves latency and scale by creating a direct line between API Gateway and DynamoDB. It may not be suitable for all use cases but shows how some creative thinking provides other serverless solutions for this problem.

Want to learn more? Check out Richard’s cloud blog at https://rboyd.dev.
