When we suggested SQLite to our bosses as a production solution for our large-scale servers, they thought we were kidding.
Could SQLite be appropriate for a high-throughput server (up to a billion requests a day) in a production environment?
We believed the answer was yes, at least for the specific use case we needed to solve.
Background
SQLite has a reputation for being unsuitable for production server workloads, for a few well-known reasons:
- Its light-weight implementation is deliberately simple, so it does not handle high-contention locking well
- It is not scalable
- It is easily corruptible
A production use-case where SQLite might still make sense
My colleague Ariel was the one who originally suggested this idea: use SQLite as a local, relational cache.
The problem we are trying to solve is that of an Ad Server – it gets a lot of requests and is supposed to choose the best available ad to serve to the requesting user. It is a well-worn problem: the server needs to get all the available ads, filter out the ones that do not fit the request (the wrong size, locale, language, etc.), sort them according to our ‘secret sauce’, and serve the best ads.
The solutions are also well-worn – the most naive is to connect to the backing repository (most likely a relational database cluster like PostgreSQL) on each request, run a big query returning the relevant ads, and then sort them in code. To prevent an internal network bottleneck, or overwhelming the repository itself, a caching layer is in order – like a local Memcached or Redis service, which retains a snapshot of the available ads for a set period and alleviates the stress on the backing repository.
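To make that concrete, here is a minimal sketch of the naive approach, assuming Python with psycopg2 and a hypothetical ads/creatives schema – the table and column names are our illustration, not the real production schema:

```python
# A sketch of the naive approach: hit the backing PostgreSQL cluster with
# one big query per request, then sort the results in code. The table and
# column names (ads, creatives, width, height, locale, language, score)
# are illustrative assumptions, not the actual production schema.
import psycopg2

def fetch_and_rank(conn, width, height, locale, language, limit=10):
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT a.id, c.id, a.score
            FROM ads AS a
            JOIN creatives AS c ON c.ad_id = a.id
            WHERE c.width = %s AND c.height = %s
              AND c.locale = %s AND c.language = %s
            """,
            (width, height, locale, language),
        )
        rows = cur.fetchall()
    # 'Secret sauce' stand-in: rank the candidates in application code.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]

# conn = psycopg2.connect("dbname=ads host=ads-db.internal")  # hypothetical DSN
# best = fetch_and_rank(conn, 300, 250, "US", "en")
```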
Unfortunately, once you add the caching layer you forgo the relational advantages of the backing repository: you need to unmarshal a large chunk of the cached data on each request and filter it in code.
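For contrast, a rough sketch of what the per-request work looks like with a plain Redis cache; the key layout and field names are assumptions for illustration:

```python
# A sketch of the per-request path with a classic Redis cache: read one
# big JSON blob for a coarse facet, then unmarshal it and filter in code.
# The key layout ("ads:<locale>") and field names are illustrative only.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def choose_ads(width, height, locale, language, limit=10):
    blob = r.get(f"ads:{locale}")       # one large cached snapshot
    if blob is None:
        return []                       # cache miss: fall back to the backing store
    ads = json.loads(blob)              # CPU-heavy unmarshaling on every request
    candidates = [
        ad for ad in ads
        if any(
            c["width"] == width and c["height"] == height
            and c["language"] == language
            for c in ad["creatives"]
        )
    ]
    candidates.sort(key=lambda ad: ad["score"], reverse=True)
    return candidates[:limit]
```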
Could SQLite stand in for the classic cache layers, while retaining the relational advantages of the mothership?
It might, but what about its shortcomings? What assumptions do we need in order to avoid the known problems of SQLite in a high-stress production environment?
Well, the main assumption that makes this solution theoretically possible is that our SQLite database is strictly read-only – our server code never changes it. This sidesteps the locking problems, since concurrent readers do not block each other when nothing is writing. We could find no one on the internet reporting problems with high-throughput reads from SQLite in the absence of writes – it could be possible!
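As a sketch of how the read-only assumption can be enforced in practice (using Python's built-in sqlite3 module; the file name is hypothetical):

```python
# A sketch of enforcing the read-only assumption at the connection level.
# The file name is hypothetical.
import sqlite3

def open_read_only(path="ads_cache.db"):
    # mode=ro is a standard SQLite URI parameter; the connection refuses writes.
    # If the file is guaranteed never to change while it is open, adding
    # immutable=1 to the URI lets SQLite skip file locking entirely.
    conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    # Belt and braces: query_only rejects any statement that would modify
    # the database, even on this already read-only connection.
    conn.execute("PRAGMA query_only = ON")
    return conn
```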
The other two shortcomings I listed above are also not critical for a caching solution since:
- caching is local, so horizontal scale is not an issue
- the order of magnitude of our data is capped at tens to hundreds of thousands of entries (rather than the millions that user data would amount to), so vertical scale is also not an issue
- caching only holds data gleaned from other sources, so data corruption never results in data loss – simply rebuild the cache from the master data store, as sketched below!
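As a rough sketch of what that rebuild can look like in practice – build a fresh SQLite file on the side and swap it in atomically. fetch_ads_from_master is a hypothetical stand-in for the actual loading logic, and the schema mirrors the earlier illustrative sketches:

```python
# A sketch of rebuilding the local cache from the master data store and
# swapping it in atomically. fetch_ads_from_master() is a hypothetical
# helper, not a real API; the schema is the same illustrative one as above.
import os
import sqlite3

def rebuild_cache(live_path="ads_cache.db"):
    tmp_path = live_path + ".tmp"
    if os.path.exists(tmp_path):
        os.remove(tmp_path)

    conn = sqlite3.connect(tmp_path)
    conn.executescript("""
        CREATE TABLE ads (id INTEGER PRIMARY KEY, score REAL, payload BLOB);
        CREATE TABLE creatives (
            id INTEGER PRIMARY KEY,
            ad_id INTEGER REFERENCES ads(id),
            width INTEGER, height INTEGER, locale TEXT, language TEXT
        );
        CREATE INDEX idx_creatives_filter
            ON creatives (width, height, locale, language);
    """)
    for ad_row, creative_rows in fetch_ads_from_master():   # hypothetical helper
        conn.execute("INSERT INTO ads VALUES (?, ?, ?)", ad_row)
        conn.executemany("INSERT INTO creatives VALUES (?, ?, ?, ?, ?, ?)", creative_rows)
    conn.commit()
    conn.close()

    # Atomic swap: connections opened after this point see the fresh
    # snapshot, and the live file is never written to in place.
    os.replace(tmp_path, live_path)
```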
Benchmarks
|        | Throughput (rpm) | Response time (ms) |
|--------|------------------|--------------------|
| SQLite | 1800             | 6                  |
| Redis  | 250              | 150                |
Wow! This is a bit one-sided, isn’t it? How do we explain this?
Well, our data contained a couple of thousand available ads, each having up to a couple of hundred creatives (each of a different size, language or design).
In our Redis solution we partitioned the cache by a few facets (locale and provider, for example) to avoid going through the whole database on each request, but it still meant reading a big chunk of data from the cache for every request. Since the Redis objects are actually a text (JSON) representation of our objects, the server was busy unmarshaling data on each request, resulting in a CPU bottleneck that limited both the response time and the throughput.
In SQLite this is not a problem, since the filtering happens on the indices before any data is unmarshaled, so each request ultimately loads a much smaller data set – one far better tailored to the current scenario.
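As an illustration (reusing the hypothetical schema and index from the sketches above), the per-request read path might look like this:

```python
# A sketch of the per-request read path against the local SQLite cache.
# The schema, column and index names follow the earlier illustrative
# sketches; they are assumptions, not the actual production schema.
def choose_ads(conn, width, height, locale, language, limit=10):
    rows = conn.execute(
        """
        SELECT a.id, c.id, a.score, a.payload
        FROM creatives AS c
        JOIN ads AS a ON a.id = c.ad_id
        WHERE c.width = ? AND c.height = ?        -- served by idx_creatives_filter,
          AND c.locale = ? AND c.language = ?     -- so only matching rows are read
        ORDER BY a.score DESC
        LIMIT ?
        """,
        (width, height, locale, language, limit),
    ).fetchall()
    # Only this small, pre-filtered result set is ever deserialized.
    return rows
```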
To prove that point – constraining the number of creatives per ad to 10 improved the Redis response time to 25ms.