Be careful when using global variables in AWS Lambda!

I discovered some unexpected behavior with AWS Lambda that I thought was worth pointing out.  Each time a Lambda function executes, there is no guarantee that it will execute in a new interpreter or context – what this means in practice is that if you are using global variables within your Lambda functions, they may contain data from previous Lambda executions.  As such, you should always reinitialize them at the beginning of your Lambda entry point.

I had been using Lambdas triggered from Kinesis streams, and they seemed to be a lot less lightweight than I’d expected.  Turns out I was processing and shipping around 50,000,000 records per minute whereas the actual figure should’ve been somewhere around 20,000,000 per minute, and with each Lambda execution within the same interpreter, the number of records processed roughly doubled, as did the execution time.  Whoops!

This behavior does make sense because I’m sure it’s super expensive to spin up new interpreters for functions that might execute 10,000 or more times per second, but for some reason I had wrongly assumed that the environment would be fresh for each execution.

The really interesting thing here is that, knowing this behavior allows us to have a shared state between multiple Lambda executions with a single interpreter (albeit unpredictably with regards to what data will be processed or how many cycles will execute before the interpreter is respawned).  For fun, I implemented some basic hash-based log deduplication making use of this behavior.  I wonder what else would be interesting to do using this shared state.