Unconventional Databases

Friday, February 12, 2021

When we think about "databases", we probably think about SQL stores (MySQL, Postgres), NoSQL stores (MongoDB, Cassandra), and maybe some vendor-specific blends like BigQuery, DynamoDB, or Firebase. But not many people have considered that anything can be a database. In this post, I'll share my experience with using "unconventional" databases and when I think it's okay to use them.

Redis

The first thing that comes to mind when most people think about Redis is "volatile cache, use only as a speed optimisation". That's fair, but it's not the whole story. Redis actually provides some persistence and backup options that allow you to persist data beyond the lifetime of the Redis server.

I use Redis as a database to store user settings (and a bit more) on two projects:

  • On @RemindMe_OfThis, I store the user's chosen timezone, a flag to enable push notifications and the Firebase notification token.
  • On @this_vid, I store the push notification flag and Firebase token.

An important detail: on both of these systems, a "user" is a Twitter user. My app doesn't handle any user management or store any user accounts. Authentication is handled by Twitter.

Given that, the implementation is pretty straightforward: The app uses the default settings for everyone; when a user logs in with Twitter for the first time, I create their own settings and store as a JSON string in a settings-<username> key (eg settings-theshalvah) and no TTL.

Files

I use a log file as a database for a part of Tentacle. I have an internal API (the "engine") that's responsible for fetching posts on a blog. I want to track the performance of this API, so I modified the code in the main app so it logs some performance details after any request to the API. It uses Guuzle's on_stats parameter for this:

$response = $this->client->post(
    'fetch',
    [
        'json' => ['sites' => $sitesToFetch],
        'on_stats' => function (TransferStats $stats) use ($sitesToFetch) {
            $timeTaken = $stats->getTransferTime();
            Log::channel('engine_tracker')->info("sites=".json_encode($sitesToFetch) . " count=".count($sitesToFetch)." time={$timeTaken}");
        },
    ]);

This produces a log like this:

[2021-02-12 07:45:45] production.INFO: sites=["schlein.net\/blog"] count=1 time=3.779627  

And thanks to Laravel's log channels, I can send it to a different file from my regular logs with Log::channel('engine_tracker').

And then on the /open page (it only shows for my account, sorry😜), I convert this data into a graph that shows me how the engine has been performing lately. The code goes something like this:

$contents = file_get_contents(realpath(__DIR__ . "/logs/engine_tracker.log"));
if ($contents) {
    $lines = explode("\n", trim($contents));
    $engineResponseTimes = array_map(function ($line) {
        $line = trim($line);
        // Multiplying timestamp by 1000 because it's JS on the frontend
        $date = Carbon::createFromTimeString(substr($line, 1, 19))->timestamp * 1000;
        $responseTime = floatval(substr($line, -8));
        return ['x' => $date, 'y' => $responseTime];
    }, $lines);
}

I then pass the engineResponseTimes to the view and use Chart.js to render it into a chart.

Google Sheets

Yep. In a small Slack app I built for my previous workplace, I used a Google Sheet to store user tokens for updating their Slack profile.

I used Sheety, which provides a REST-ish API for the Google Sheet. All I had to do was create the sheet with the appropriate name ("Access Tokens") and columns ("Team ID"", "User ID", "Access Token|), and then connect Sheety to it, and it automatically generated a set of endpoints. Writing to the sheet was as simple as calling the Sheety API:

await sheetyApi.post({
    json: {
        accessToken: {
            userId: slackAuthResponse.user_id,
            accessToken: slackAuthResponse.access_token,
            teamId: slackAuthResponse.team_id,
        }
    }
})

And fetching the data was:

// The ?userId parameter filters by the "User ID" column
await sheetyApi.get(`?userId=${user_id}`).json();

How did I get here?

A few years ago, I was one of those people that believed SQLite wasn't meant for production use. And then I discovered Pieter Levels, who runs sites with millions of visits per month on SQLite databases. I realised that, though there are good reasons behind the "anti-SQLite" belief, the eventual conclusion is wrong. Depending on your needs, SQLite may be just fine. I also started following Indie Hackers around then, and started coming across folks using weird things like Google Sheets, Airtable and Notion as databases. And so I began to expand my thinking.

Sometime later, I began thinking of adding a database to @this_vid. At the time, I was only using Redis to cache downloads, but I wanted to add push notifications, so I needed somewhere to store the configuration for each user. I didn't need any strong consistency requirements, so I was thinking about a NoSQL solution like MongoDB, but then I talked with a friend who's a big fan of Redis and uses it for nearly everything. That was my lightbulb moment. You already have a database; why are you looking for another? So I tried Redis out, and everything went fine. And so I crossed over to the dark side.😈

Of course, there were tradeoffs to make. For instance, Redis is a key-value store, so your data should ideally fit in that. But it worked fine for my use case.

Why?

I think there are five major factors I evaluate when picking an unconventional database.

Data structure

What's the data like? Does it fit within the capabilities of the datastore, or are you trying to force it? As I mentioned earlier, Redis works best for key-value data, so I can easily do this:

const settings = {notifications: true, timezone: 0};
await cache.setAsync(`settings-${username}`, JSON.parse(settings));
// Retrieve the full settings object
const settings = JSON.parse(await cache.getAsync(`settings-${username}`));

But if I need to retrieve part of the settings data, or query specific fields in the data, such as fetch only users that have notifications, then it will suck, because that's not what's Redis is designed for. I thought long and hard about this, and decided Redis fit with my needs.

For spreadsheets: When you think about it, spreadsheets and SQL databases have the same basic structure: tables/sheets, rows and columns. So, since that the data I wanted to store (teamID - user ID - token) fit nicely into an SQL shape, it was obvious a sheet would work as well.

Files are the most flexible of all. They can store anything in any format you want, but you have to write the querying logic yourself, so it's best to stick to simple formats and don't use them for data where you need complex queries. I think sticking with simple data that fits in a key=value format or JSON (like lowdb) is best.

Access requirements

After thinking about the data structure, you naturally have to think about accessing the data. How often are you going to be reading the data? Does the datastore allow you to get to your data in a simple and performant way? If you'll have to write complicated logic to retrieve and parse the data, it might not be worth it.

For Redis, this wasn't a problem. JSON.stringify -> redis.set -> redis.get -> JSON.parse, and I'm good. And it's Redis, so it's pretty much the fastest possible database I could have used for this project.πŸ˜‰

Files and Google Sheets are a bit dicier. Reading from files isn't necessarily a slow process, but has a number of caveats to be aware of (like resource usage). Google Sheets is decidedly slower β€” I'm making an API call (to Sheety) which makes another API call (to Google Sheets), which then talks to its database. But remember that all this is in the context of your application. For Tentacle, I only visit that page occasionally, so waiting a few extra seconds isn't a huge price. And for the Slack bot, users typically used it once a day and the response time was (mostly) bearable.

Risk profile

Another factor is how important is the data I want to store? What persistence guarantees does the datastore give? What's the worst that can happen if I lose all this data? In the scenarios above:

  • user settings in Redis (@RemindMe_OfThis): I have persistence and backups enabled, but if I somehow lose this data (for instance, it's not a big deal. A user can log in once again and change their settings.
  • video downloads in Redis (@this_vid): There's a slightly bigger annoyance problem here, since the site says it has downloads in the last 48 hours, but still, it's a minor inconvenience.
  • performance data in file: Meh. I'll wait a couple of days and have data again.
  • user tokens in Sheets: If the sheet gets cleared, the Slack command will fail with a "You need to authorize this app" message + a link to the authorization URL, which will trigger the flow and store the token again.

Security

And then security considerations. How safe is this datastore? Are there any security risks involved? Of course, this is also dependent on how sensitive the data you're storing is.

I'd already secured my Redis instance long ago, so there wasn't a worry there. More importantly, the data there isn't very sensitive β€” I don't store any auth data; the Firebase tokens are only used for sending notifications and can be easily invalidated. There was one thing I had to be careful about, though β€” not passing user input directly into the cache. It wouldn't have exposed me to any direct attacks, but it's best to be safe all the same.

The Sheets issue was a bit trickier because it's very easy to accidentally expose a Google Sheet to the public. But I made sure to set proper permissions on the sheet, so only an authorized account could access it. Additionally, the data isn't very sensitive. The user tokens only have the users.profile:write, which allows the app to make a few profile changes.

Storage capacity

Kinda goes without saying, but you also need to consider capacity. What's the storage limit of the datastore? What happens when it gets full?

For Redis, I had a storage capacity of several hundred MB, and I had only used a tiny percentage of it, so I knew I was likely not going to hit that limit. And Redis automatically evicts older keys (based on your eviction policy) when Redis is full, so the app would still keep running. Of course, I still have monitoring in place that will alert me when I use up a certain percentage of my storage.

Files have a very large storage limit, but you obviously don't want your files getting too large, otherwise you risk crashing your server just trying to read a file. Managing a file's size is a bit more involved; you can use logrotate to rotate files at a certain size, but ideally you want to remove older data. I haven't set up a system for this yet, but I spent some time monitoring the file growth to know what my window was, and I've still got a lot of time before I need to worry about the file size being a problem.

I didn't check the maximum size of a Google Sheet, but, based on the size of the company and the amount of data we were storing, I knew we didn't have to worry about that.


So there you have it. Using unconventional databases is one of the things that has helped me do more with less (ie run projects at a low cost). I spent < $100 in cloud costs last month, while serving over 10k Twitter mentions per day. A database is just a datastore that meets your needs. Find one that works, and go with it.


HeyπŸ‘‹. I write about interesting software engineering challenges. Want to get updated when I publish new posts? Just visit tntcl.app/blog.shalvah.me.

(Confession: I built Tentacle.βœ‹ It helps you keep a clean inbox by combining your favourite blogs into one weekly newsletter.)

Powered By Swish