Hi all, Dave Cline here, Software Engineer at Orum. (PS: we’re hiring!)

Around mid-July of 2022, I spent a week or two digging deep into part of our code to figure out why it was slow and how to speed it up. This document lists some of the things I learned. The biggest takeaway I have from this exercise: when you are analyzing a complex software system, try to simplify and reduce the number of variables you are looking at at any one time.

Thankfully, this is something we do frequently at Orum, both internally and for our customers.

The Process

Let’s start with Cognito, an authentication service provided by AWS.

Secret pre-auth trigger lambda

When I first started looking at the /login endpoint, I didn't actually realize how it all fit together. There is a lambda that serves that endpoint, but in our setup it's not the only lambda involved. That's because Cognito lets you attach lambda triggers to many of its operations so that you can alter the JWT or do other code-based things within the Cognito workflow; we were using this to add metadata to our JWTs on login. Here's an image explaining it a bit:

Cognito JWT generation flow
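
For reference, a bare-bones pre-token trigger handler looks something like this in TypeScript (the claim name and the lookup helper are made up for illustration, not our actual code):

```ts
// A minimal sketch of a Cognito pre-token-generation trigger, assuming
// @types/aws-lambda. The claim name and lookup below are hypothetical.
import type { PreTokenGenerationTriggerHandler } from "aws-lambda";

export const handler: PreTokenGenerationTriggerHandler = async (event) => {
  // Pretend we looked this value up somewhere (e.g. DynamoDB).
  const orgId = await lookUpOrgId(event.request.userAttributes["email"]);

  // Cognito merges these claims into the JWT it issues.
  event.response = {
    claimsOverrideDetails: {
      claimsToAddOrOverride: {
        "custom:org_id": orgId,
      },
    },
  };

  return event;
};

// Hypothetical helper standing in for whatever metadata lookup you do.
async function lookUpOrgId(email: string): Promise<string> {
  return `org-for-${email}`;
}
```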

In our previous attempts to speed up logins, we mostly looked at the login-handling lambda and added provisioned concurrency there. What we didn't realize was that a second lambda was running on every login request, and that it was actually even slower than the first one. By adding memory/CPU and provisioned concurrency to both the login and pre-token trigger lambdas, I was able to greatly speed up logins.

Provisioned What Now?

If you aren’t familiar with provisioned concurrency and how it can help speed up your lambdas, here’s a quick rundown. Basically, lambdas are little containers that start up when you need them and go away when you don’t. This saves a lot of money when your traffic is low or highly variable.

Lambda processing power moves in lock-step with event/s

When a lambda first starts up, it runs some initialization code, then passes whatever event woke it up on to a handler function. The handler function runs and returns output, and then the lambda waits for a certain amount of time before turning itself off. If it receives a second event to process within about 5 minutes, it processes that event and then resets its turn-itself-off timer. As long as the lambda is ready to receive events it’s referred to as being “warmed up”, and it won’t need to rerun its initialization code.
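
In code, that split looks roughly like this (a TypeScript sketch, not our actual handler):

```ts
// Sketch: everything outside the handler is initialization code and runs
// once per cold start; the handler body runs on every invocation.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";

// Initialization: creating SDK clients, loading config, etc.
// On a cold start, all of this runs before the first event is handled.
const dynamo = new DynamoDBClient({});

export const handler = async (_event: unknown) => {
  // Warm invocations skip straight to here and reuse `dynamo`.
  // ...handle the event...
  return { statusCode: 200, body: "ok" };
};
```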

So now to the point: when you have a lambda that is customer-facing and needs to be performant, one thing you might want to do is keep it “warmed up” forever so it can skip that pesky initialization code it would otherwise run on every cold start. This is entirely possible, and it’s called provisioned concurrency; it keeps your lambdas about as fast as they’re ever going to be. In a situation where your logins take 1 second (500ms to start up and 500ms to process the request), you can use provisioned concurrency to get consistent 500ms logins instead.
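
If you happen to define your lambdas with the AWS CDK (just one way to do it; the function, names, and numbers below are placeholders, not our actual setup), turning on provisioned concurrency is only a few lines:

```ts
// Sketch: enabling provisioned concurrency with the AWS CDK (v2).
// All names, paths, and numbers here are placeholders.
import { Stack, StackProps } from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";
import { Construct } from "constructs";

export class LoginStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const loginFn = new lambda.Function(this, "LoginFn", {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: "index.handler",
      code: lambda.Code.fromAsset("dist/login"),
      memorySize: 1024, // more memory also means more CPU
    });

    // Provisioned concurrency attaches to a version/alias, not $LATEST.
    // This keeps 5 execution environments initialized at all times.
    loginFn.addAlias("live", {
      provisionedConcurrentExecutions: 5,
    });
  }
}
```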

Base latency

I tried removing the pre-token trigger lambda entirely, and many authentication calls to Cognito still took too long. So even at its fastest, Cognito is still kind of slow.

DynamoDB

DynamoDB is a NoSQL database service provided by AWS.

Learnings and how I sped it up

The first step to speeding up a lambda that uses DynamoDB is realizing that your lambda is using DynamoDB. 😂 Once I figured out that all of our authentication calls were actually doing a round trip through our pre-token trigger lambda, I was able to dig in and find some DynamoDB queries.

These queries were taking about 250ms against a table that contains maybe 200 rows. That is abysmal performance; I would expect a similar query on RDS Postgres to take 10ms or less. I took a look and found a few things:

  • We were doing a scan instead of a query, so no indexes were being used
  • We were at a read capacity of 1 …🐌
  • We were using scan/query instead of getItem

By adding an index and bumping read capacity to 5, I was able to get query performance down to about 100-150ms. A small victory, but I’ll take it.
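
To make the scan-versus-query difference concrete, here’s a rough before/after sketch using the AWS SDK v3 document client. The table, index, and attribute names are hypothetical, not our real schema:

```ts
// Sketch: scan vs. query against a GSI. Names below are made up.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import {
  DynamoDBDocumentClient,
  QueryCommand,
  ScanCommand,
} from "@aws-sdk/lib-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Before: a scan reads every item in the table, then applies the filter.
export async function findUserByEmailSlow(email: string) {
  const { Items } = await doc.send(
    new ScanCommand({
      TableName: "users",
      FilterExpression: "email = :email",
      ExpressionAttributeValues: { ":email": email },
    })
  );
  return Items?.[0];
}

// After: a query against a GSI keyed on email reads only matching items.
export async function findUserByEmailFaster(email: string) {
  const { Items } = await doc.send(
    new QueryCommand({
      TableName: "users",
      IndexName: "email-index",
      KeyConditionExpression: "email = :email",
      ExpressionAttributeValues: { ":email": email },
    })
  );
  return Items?.[0];
}
```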

Additional index learnings

A second thing I learned about DynamoDB during my investigation is that its indexes work very differently from Postgres indexes.

Postgres indexes versus roughly correct DynamoDB indexes

In Postgres, when you add an index, the database does some magic with B-trees (its default index type) and makes your queries faster when you filter on the indexed columns. In DynamoDB, an index is essentially a copy of whatever data you tell it to include, written into a new table and rearranged so that lookups by the index key are quicker.

The key takeaway is that if you tell Dynamo to copy (or “project”) only certain attributes into the index, such as just the keys you’re actually going to be querying on, then queries against that index will return only those keys and nothing else, because nothing else exists in the “index” table it created for you. Only by projecting everything, a complete copy of the original table, will the index both speed up queries AND give you the full set of data the original table contains.
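
Here’s what that choice looks like if you define the index with the CDK (again, just a sketch with made-up table and index names):

```ts
// Sketch: GSI projection types in the AWS CDK (v2). Names are placeholders.
import { Stack } from "aws-cdk-lib";
import * as dynamodb from "aws-cdk-lib/aws-dynamodb";
import { Construct } from "constructs";

export class UsersTableStack extends Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    const users = new dynamodb.Table(this, "Users", {
      partitionKey: { name: "id", type: dynamodb.AttributeType.STRING },
    });

    users.addGlobalSecondaryIndex({
      indexName: "email-index",
      partitionKey: { name: "email", type: dynamodb.AttributeType.STRING },
      // KEYS_ONLY projects only the key attributes into the index, so
      // queries against it return keys and nothing else. Use
      // ProjectionType.ALL (a full second copy of the table) if queries
      // on the index need to return whole items.
      projectionType: dynamodb.ProjectionType.KEYS_ONLY,
    });
  }
}
```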

Bonus index learnings

Another thing I discovered while setting up the DynamoDB index is that indexes aren’t actually all that helpful here. Most query() calls still take over 100ms, which is still insane for a table with only 200 rows in it. If you really want things to go fast, the partition key (roughly DynamoDB’s version of a primary key) needs to be the value you’re looking up by, so you can use getItem() calls, which usually return in less than 10ms. We can’t really change that on our existing table, but we could consider switching to a new table where the partition key is the user’s email in the future.
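
If we did make that switch, the lookup would turn into something like this (a sketch with the SDK v3 document client and a hypothetical table name):

```ts
// Sketch: with a (hypothetical) table keyed on email, the lookup becomes a
// single getItem call instead of a query or scan.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand } from "@aws-sdk/lib-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export async function getUserByEmail(email: string) {
  const { Item } = await doc.send(
    new GetCommand({
      TableName: "users-by-email", // hypothetical table, email as partition key
      Key: { email },
    })
  );
  return Item;
}
```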

Datadog Profiling

Datadog is an excellent logging and profiling tool.

Lambda layers

To add profiling to a lambda, you need to add Datadog’s Lambda layers. This appears to add up to 1s to lambda warmup times. For internal lambdas this doesn’t really matter, but you may want to be more careful with time-sensitive operations like logins.
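
For what it’s worth, the layers attach like any other lambda layer; here’s a rough CDK sketch (the helper function and placeholder ARN are mine; check Datadog’s docs for the real layer ARN for your region and runtime):

```ts
// Sketch: attaching the Datadog layer to an existing lambda with the CDK.
// The ARN below is a placeholder, not a real Datadog layer ARN.
import * as lambda from "aws-cdk-lib/aws-lambda";
import { Construct } from "constructs";

export function addDatadogLayer(scope: Construct, fn: lambda.Function): void {
  fn.addLayers(
    lambda.LayerVersion.fromLayerVersionArn(
      scope,
      "DatadogLayer",
      "arn:aws:lambda:<region>:<account>:layer:<datadog-layer>:<version>"
    )
  );
}
```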

Conclusion

Development is tough, Orum devs are tougher 💪 🤓