IDs, UUIDs and alternatives

Tuesday, September 20, 2022

IDs. We need them to reference entities uniquely. But not all kinds of IDs are equal.

Features

Some IDs are the main identifier (the primary key) of an entry in our datastore, so they're stored alongside it, and never changed. Other IDs are simply used for display purposes and are translated back to some persistent data, or even another ID.

Some IDs are sortable. For instance, if your primary key is an autoincrementing integer, you can sort by ID to get items in creation order. Other IDs don't give you any such guarantee.

Some IDs are opaque. They don't mean anything by themselves. They need to be stored in a datastore alongside the entity they're referencing. Other IDs encode some information, such as when they were generated or some details about the entity they reference.

Some IDs are random. You're extremely unlikely to get the same value more than once. Others are more predictable. In fact, for some IDs, given the entity and some details about the means of generation, you can recreate the ID.

Some IDs are independent. Multiple servers can generate these IDs at the same time without conflicts, making them suitable for distributed systems, or cases where IDs need to be sent from the client. Other IDs need to be generated by (or coordinated with) a single server or database.

IDs come in all shapes and sizes. Some are case-sensitive. Some are strings of a fixed length. Others are variable-length. Some are restricted to certain characters. Some are plain numbers. Some are just a series of bytes.

Use cases

These features of IDs are good or bad depending on your needs. For instance, let's take a realtime chat app with multiple clients. Each client may need to send multiple messages in quick succession, track message delivery status and order, and possibly send actions like deleteMessage or editMessage, even before the server has acknowledged the message. Having each client wait for the message to be stored on the server and an ID created and returned would limit this functionality (and likely lead to some race conditions and intermittent failures). You'd probably want message IDs to be independently generated on each client.
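Here's a rough sketch of what that could look like on the client. The outbox, the message shape and the sendMessage function are all made up for illustration (only the deleteMessage and editMessage names come from the scenario above), and crypto.randomUUID() assumes Node 14.17 or newer.

```javascript
const crypto = require('crypto');

// Hypothetical client-side outbox: messages are keyed by an ID the client
// generates itself, so they can be referenced before the server replies.
const outbox = new Map();

function sendMessage(text) {
  const message = { id: crypto.randomUUID(), text, status: 'pending' };
  outbox.set(message.id, message);
  // (a real client would also transmit the message to the server here)
  return message;
}

function editMessage(id, newText) {
  const message = outbox.get(id);
  if (!message) throw new Error(`Unknown message ID: ${id}`);
  message.text = newText;
  return message;
}

function deleteMessage(id) {
  return outbox.delete(id);
}

const first = sendMessage('Helo');
editMessage(first.id, 'Hello'); // no round trip needed to learn the ID
console.log(first);
```

Because the ID exists the moment the message is created, the edit can be queued right behind the original send.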

You might also want your message IDs to be relatively small, so they're quick to generate and transmit. On the other hand, you might not care about sorting (you can easily add a created_at field) or encoding any information (a random ID is fine, as long as it's stored with the message).

Another scenario: a URL shortener. A user gives you http://google.com, and you need to convert it to something like short.ly/f4g. The f4g is an ID, because you use it to reference the original URL. You need a short ID, because it's a URL short-ener. One way to go is to generate a short random string for any URL entered and save that to the database. Of course, you have to calculate the possible number of IDs you can generate and when to anticipate collisions. Another way would be to map an integer primary key in the database to a short string ID for display.
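To get a feel for the collision question, here's a small sketch using the standard birthday-problem approximation. It assumes IDs are picked uniformly at random; the function names and the 62-character alphabet (a-z, A-Z, 0-9) are just illustrative choices.

```javascript
// Number of distinct IDs of a given length over a given alphabet size.
function idSpace(alphabetSize, length) {
  return Math.pow(alphabetSize, length);
}

// Birthday-problem approximation: chance that at least two of `count`
// uniformly random IDs collide, out of `space` possible IDs.
function collisionProbability(count, space) {
  return 1 - Math.exp(-(count * (count - 1)) / (2 * space));
}

const space = idSpace(62, 3); // 3 characters from a-z, A-Z, 0-9
console.log(space); // 238328
console.log(collisionProbability(100, space).toFixed(3));
console.log(collisionProbability(1000, space).toFixed(3));
```

With three characters, the estimate is already around 2% after 100 URLs and close to 88% after 1,000, so a random scheme this short needs either longer IDs or a retry-on-collision check.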

You'd probably also want a URL-safe ID (that is, a string that can be typed directly in a URL without getting encoded). p3+j?d is URL-unsafe, because the + and ? need to be encoded, otherwise the URL might be interpreted differently.
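A quick way to check this in JavaScript is to see whether percent-encoding changes the string. This is only a rough test (isUrlSafe is a made-up helper, and encodeURIComponent leaves a few punctuation characters such as ! and * alone), but it catches the example above.

```javascript
// An ID is treated as URL-safe here if percent-encoding leaves it unchanged.
function isUrlSafe(id) {
  return encodeURIComponent(id) === id;
}

console.log(isUrlSafe('f4g'));             // true
console.log(isUrlSafe('p3+j?d'));          // false
console.log(encodeURIComponent('p3+j?d')); // p3%2Bj%3Fd
```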

One more scenario: an app where users sometimes call your customer support to resolve issues, and they often need to read out their user ID. This is the same as the "display" problem—you need something short that users can easily read, but you don't necessarily want to expose the primary key in your database ("Your user ID is 3" sounds really weird). A good option would probably be generating some other integer (e.g. 7863409) and mapping or storing it as the user-facing ID. You probably don't want letters in here, as they can look alike (lowercase l vs uppercase i), as well as cause confusion over case-sensitivity. A similar thing might apply to a meeting app like Zoom, where users might need to manually enter a meeting ID.
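As a sketch of the "generate some other integer" option: the Set below is a stand-in for a uniqueness check against your database, and generateDisplayId is a hypothetical helper. crypto.randomInt needs a reasonably recent Node (14.10 or newer).

```javascript
const crypto = require('crypto');

// Stand-in for a uniqueness check against the database (hypothetical).
const takenIds = new Set();

// Returns a random 7-digit number that hasn't been handed out yet.
function generateDisplayId() {
  for (let attempt = 0; attempt < 10; attempt++) {
    const candidate = crypto.randomInt(1000000, 10000000); // upper bound is exclusive
    if (!takenIds.has(candidate)) {
      takenIds.add(candidate);
      return candidate;
    }
  }
  throw new Error('Could not find a free display ID');
}

console.log(generateDisplayId());
```

The retry loop matters more as the ID space fills up; with only 9 million possible values, you'd want to widen the range well before you have millions of users.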

I can't say for sure what IDs you should be using, but here are a few I've learnt about. (I'm assuming you already know when auto-incrementing integer primary keys are not good enough.)

UUID

Many devs can recognise a UUID on sight (like 9e5368ab-6531-4c9a-8afd-c547275304e9). UUIDs (Universally Unique Identifiers) are 128-bit values. The idea is that they're unique and independent, so multiple programs or machines can generate a UUID at the same moment and have different results.

Note that I said "values" earlier, not "strings". UUIDs aren't strings; they're a set of 16 bytes (16 bytes = 128 bits). When represented as strings, each byte is written in hexadecimal format, making it 32 characters (+ 4 hyphens for easier reading). So the UUID above is really a sequence of the following bytes:

9E 53 68 AB 65 31 4C 9A 8A FD C5 47 27 53 04 E9

This is the reason why some databases have a UUID column type; they're not storing the UUID as a string, but as the actual bytes. Why is that better? Well, strings in UTF-8 have at least 8 bits per character (plugging my article on string encodings again), so the text version of the UUID is 36 * 8 = 288 bits. That's more than double the actual number of bits in the UUID (128). So, yes, if your database has a UUID column type, you should use that when storing UUIDs.
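To make the size difference concrete, here's a small sketch that converts the UUID from above between its text form and its 16 raw bytes. The helper names are made up for this example.

```javascript
const uuid = '9e5368ab-6531-4c9a-8afd-c547275304e9';

// Strip the hyphens and read the remaining 32 hex digits as 16 raw bytes.
function uuidToBytes(text) {
  return Buffer.from(text.replace(/-/g, ''), 'hex');
}

// Format 16 bytes back into the 8-4-4-4-12 text form.
function bytesToUuid(bytes) {
  const hex = bytes.toString('hex');
  return [
    hex.slice(0, 8), hex.slice(8, 12), hex.slice(12, 16),
    hex.slice(16, 20), hex.slice(20),
  ].join('-');
}

const bytes = uuidToBytes(uuid);
console.log(bytes.length);                // 16 bytes (128 bits)
console.log(Buffer.byteLength(uuid));     // 36 bytes (288 bits) as UTF-8 text
console.log(bytesToUuid(bytes) === uuid); // true
```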

Not all UUIDs are equal, though. The UUID spec has versions, so it's important to know which version you're using.

  • Version 1 and version 2 include the MAC address of the device they were generated on, as well as the date/time. Most people don't use these anymore.
  • Versions 3 and 5 are different: you give them some text and they use a hash function (MD5 for v3, SHA-1 for v5) to generate a UUID. This means they're not random; you'll always get the same UUID for the same input. This StackOverflow answer explains the process in detail.
  • Version 4 is the most popular one in use today. It's random (apart from a few fixed bits that mark the version and variant), and doesn't include any date/time or device data.
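The difference between the random and the hash-based versions is easy to see in code. The v4 line uses Node's built-in crypto.randomUUID() (Node 14.17 or newer). The uuidV5 function is a hand-rolled sketch of the v5 recipe for illustration only; in a real project you'd more likely reach for a library. 'example.com' is just a sample input.

```javascript
const crypto = require('crypto');

// Version 4: random.
const v4 = crypto.randomUUID();

// Version 5, hand-rolled for illustration: SHA-1 over namespace bytes + name,
// keep the first 16 bytes, then stamp in the version and variant bits.
function uuidV5(name, namespace) {
  const namespaceBytes = Buffer.from(namespace.replace(/-/g, ''), 'hex');
  const hash = crypto.createHash('sha1')
    .update(namespaceBytes)
    .update(name, 'utf8')
    .digest();
  const bytes = hash.subarray(0, 16);
  bytes[6] = (bytes[6] & 0x0f) | 0x50; // version 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80; // variant bits
  const hex = bytes.toString('hex');
  return `${hex.slice(0, 8)}-${hex.slice(8, 12)}-${hex.slice(12, 16)}-${hex.slice(16, 20)}-${hex.slice(20)}`;
}

// The DNS namespace is one of the namespaces predefined in the UUID spec.
const DNS_NAMESPACE = '6ba7b810-9dad-11d1-80b4-00c04fd430c8';

console.log(v4); // different on every run
console.log(uuidV5('example.com', DNS_NAMESPACE));
console.log(uuidV5('example.com', DNS_NAMESPACE)); // same input, same UUID
```

You can spot the version in the text form: it's the first character of the third group (4 or 5 here).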

There are also non-standard UUID formats. (You can invent yours, as long as it's 128 bits and matches the needed format.) For instance, Laravel offers an "ordered UUID", which is a v4 UUID with some of the starting bits replaced with the timestamp, so that the UUID can be mostly random but still sortable by time.

And there are newer proposals as well: v6 is like v1, but the time part comes first, so the UUIDs are sortable. v7 is also time-first, but uses a Unix timestamp in milliseconds followed by random bits, while v8 is reserved for custom formats.

These days, most people use v4 UUIDs when they need a really random, independent ID for the database. And if they want to be able to sort by ID, they often use custom implementations like ordered UUID, or the more recent v6 or v7.
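To show why a time-first layout makes IDs sortable, here's a minimal sketch in the style of v7: a 48-bit millisecond timestamp up front, random bits after. It's an illustration of the idea rather than a compliant implementation, and timeOrderedUuid is a made-up name.

```javascript
const crypto = require('crypto');

// Sketch of a v7-style UUID: 48-bit Unix timestamp (ms), then random bits.
function timeOrderedUuid(now = Date.now()) {
  const bytes = crypto.randomBytes(16);
  bytes.writeUIntBE(now, 0, 6);        // first 6 bytes: milliseconds since epoch
  bytes[6] = (bytes[6] & 0x0f) | 0x70; // version nibble: 7
  bytes[8] = (bytes[8] & 0x3f) | 0x80; // variant bits
  const hex = bytes.toString('hex');
  return `${hex.slice(0, 8)}-${hex.slice(8, 12)}-${hex.slice(12, 16)}-${hex.slice(16, 20)}-${hex.slice(20)}`;
}

const older = timeOrderedUuid(Date.now() - 1000);
const newer = timeOrderedUuid();
console.log(older);
console.log(newer);
console.log(older < newer); // true: plain string comparison follows creation time
```

IDs created within the same millisecond share a timestamp, so their relative order in this sketch comes down to the random part.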

ULID

ULID (Universally Unique Lexicographically Sortable Identifier) aims to address some of the perceived limitations of UUIDs, while keeping compatibility with them.

A ULID is also 128 bits, so you can store it in a UUID column, but, unlike UUIDs, the string version of a ULID isn't just the bytes printed out. Instead, it passes them through Crockford's base32 encoding to create a 26-character string that looks like this: 01ARZ3NDEKTSV4RRFFQ69G5FAV. It's shorter than a UUID string, and there are no hyphen separators, so it's up to you which look you prefer.

The second major difference is how they're generated. The first 48 bits of a ULID are the timestamp, allowing them to be sortable by time. They're lexicographically sortable, which means you can sort the generated ULID strings alphabetically (you don't need to convert to the original bytes or timestamp value).
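Here's a rough sketch of that layout: 10 characters for the timestamp, 16 for the random part, all drawn from Crockford's base32 alphabet. It's meant to show the structure, not to replace a ULID library (it skips details like keeping IDs ordered within the same millisecond), and ulidLike is a made-up name.

```javascript
const crypto = require('crypto');

const CROCKFORD = '0123456789ABCDEFGHJKMNPQRSTVWXYZ'; // no I, L, O or U

function ulidLike(now = Date.now()) {
  // 48-bit timestamp -> 10 characters, most significant first.
  let timePart = '';
  let remaining = now;
  for (let i = 0; i < 10; i++) {
    timePart = CROCKFORD[remaining % 32] + timePart;
    remaining = Math.floor(remaining / 32);
  }
  // 80 random bits -> 16 characters (5 bits each).
  let randomPart = '';
  for (const byte of crypto.randomBytes(16)) {
    randomPart += CROCKFORD[byte & 31];
  }
  return timePart + randomPart;
}

const ids = [ulidLike(Date.now() + 5000), ulidLike(Date.now())];
console.log(ids.sort()); // alphabetical sort puts the older ID first
```

The alphabetical sort works because the alphabet is in ascending character order and the timestamp is written most significant digit first.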

One thing I personally like about ULIDs is that they're "double-click-selectable". There are no hyphens or special characters, so when you double click on one in your browser, it selects the whole ULID. Of course, this isn't a reason to choose them over others, but it helps if you expect users to copy/paste the ID frequently.

Nano ID

Nano ID bills itself as a shorter UUID alternative. It doesn't generate 128-bit values like UUID; it generates a random string directly, and the default of 21 characters from a 64-character alphabet works out to 126 random bits, comparable to the 122 random bits in a v4 UUID. Like ULID, it uses a different form of representation, so the string ID is much shorter (21 characters by default). A Nano ID looks like this: V1StGXR8_Z5jdHi6B-myT, but it also allows you to supply your custom alphabet and choose your ID length. It's not sortable, though; it's purely random.
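For a sense of how little is involved, here's a sketch of a random ID in the same style: 21 characters drawn from 64 URL-safe symbols. This is not Nano ID's own code, and the ordering of the alphabet below is arbitrary; randomId is a made-up helper.

```javascript
const crypto = require('crypto');

// 64 URL-safe characters (the ordering here is arbitrary).
const ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_-';

function randomId(length = 21) {
  let id = '';
  for (const byte of crypto.randomBytes(length)) {
    id += ALPHABET[byte & 63]; // 256 is a multiple of 64, so every symbol is equally likely
  }
  return id;
}

console.log(randomId());
console.log(21 * Math.log2(64)); // 126 bits of randomness at the default length
```

The masking trick only stays unbiased when the alphabet size divides 256 evenly; a custom alphabet of, say, 36 characters would need rejection sampling instead.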

I'm personally not a fan of Nano ID, although it seems to be gaining popularity. There isn't a documented spec I could find, and the latest version of its JS package only supports ES modules. I'd rather go with ULIDs if I want a shorter UUID-compatible ID. However, either Nano IDs or ULIDs might be a good option for our message IDs problem from earlier.

CUID

CUIDs (the "C" is from "Collision-resistant") are another new system. The spec is also not clearly defined, but they're made up of the letter c followed by the timestamp of generation, a counter, a fingerprint specific to the device of generation, and a random value, all as base-36 numbers, giving you a string like this: ckqs83wwv00033d61mvknp45t. They're at least 25 characters in length, so shorter than a UUID string and about the same length as a ULID. They seem to be lexicographically sortable, since the timestamp comes before the random parts, but that doesn't seem to be encouraged.
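Since the timestamp is stored as a plain base-36 number, you can read it back out. The sketch below assumes the timestamp occupies the 8 characters right after the leading c, which fits the 25-character example above but is an assumption about the layout rather than something taken from a spec. cuidTimestamp is a made-up helper.

```javascript
// Assumed layout: 'c' + 8-character base-36 timestamp + counter + fingerprint + random.
function cuidTimestamp(cuid) {
  if (cuid[0] !== 'c') throw new Error('Not a CUID');
  const milliseconds = parseInt(cuid.slice(1, 9), 36);
  return new Date(milliseconds);
}

// Prints the creation time encoded in the example ID.
console.log(cuidTimestamp('ckqs83wwv00033d61mvknp45t').toISOString());
```

This is also a reminder that such IDs aren't opaque: anyone who sees one can tell when it was generated.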

Hashids

Hashids ("hash" + "ID") are a different kind of ID.

  • They need an input (similar to UUID v5).
  • The input must be an integer (or a list of integers).
  • You also need to pass a secret value (a salt) when generating.
  • The output is a string of variable length (but typically short). It's not a set of bytes like UUID.
  • They're reversible. You can decode a hashid to get the original integers, as long as you have the same salt that was passed in.

const Hashids = require('hashids');
const hashids = new Hashids("a salt");
const id = hashids.encode(20); // "jE"
const numbers = hashids.decode(id); // [20]

In most cases, you won't be storing hashids. You'd rather store the integers that correspond to them, and the secret. Hashids are mostly for display; you might want to provide a user with a link to something which has a numeric ID, but you'd rather not expose that ID, such as in our URL shortener scenario.

There's also Optimus, which does the same thing as hashids, but its output is another number.

// Optimus requires three specific values: a large prime, its modular inverse, and a random integer
$optimus = new Jenssegers\Optimus\Optimus(1580030173, 59260789, 1163945558);
$encoded = $optimus->encode(20); // 1535832388
$original = $optimus->decode(1535832388); // 20

I think Optimus would be a good fit for our "customer support" problem earlier.
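If you're curious how a number can be scrambled into another number and back, here's a sketch of one well-known technique: multiply by an odd number modulo 2^31, XOR with a key, and undo it with the modular inverse. This illustrates the general idea only; it isn't Optimus's source, the two constants are simply reused from the example above as sample inputs, and the output isn't expected to match the library's.

```javascript
// Sketch of reversible integer obfuscation modulo 2^31 (BigInt avoids overflow).
const MASK = (1n << 31n) - 1n;

// Modular inverse of an odd number modulo 2^31, via Newton's iteration.
function inverseOf(odd) {
  let inverse = odd; // already correct in the lowest 3 bits for any odd number
  for (let i = 0; i < 5; i++) {
    inverse = (inverse * (2n - odd * inverse)) & MASK; // doubles the correct bits each round
  }
  return inverse;
}

function makeObfuscator(multiplier, xorKey) {
  const inverse = inverseOf(multiplier);
  return {
    encode: (n) => Number(((BigInt(n) * multiplier) & MASK) ^ xorKey),
    decode: (n) => Number(((BigInt(n) ^ xorKey) * inverse) & MASK),
  };
}

const obfuscator = makeObfuscator(1580030173n, 1163945558n);
const encoded = obfuscator.encode(20);
console.log(encoded);
console.log(obfuscator.decode(encoded)); // 20
```

Keep in mind this is obfuscation, not encryption: anyone who learns the two numbers can decode every ID.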

One disadvantage with both of these is that you have to also store your salt, otherwise you won't be able to map your generated IDs back to the originals again.

Roll your own

Finally, you could implement your own method, especially if it's for display purposes only. Perhaps all the popular methods you could find generate IDs that are too long or too short, or you'd rather not store a salt. There are no rules against going your own way, as long as you understand the limitations. Twitter and Instagram each did theirs (both 64-bit IDs), and so did Firebase (120 bits, then encoded into base64). Heck, CUID, Nano ID and Hashids were all people's systems that ended up becoming popular.

For instance, you might like the look of Hashids or Optimus, but not want to store a salt and take the risk of being unable to decode your hashids. You could:

  • generate a random string or integer as the "display ID" and store it in the database, alongside the entity and the regular ID, or
  • encode the raw integer ID from your database in your preferred scheme. For instance, 43 (written out as the text "43") becomes NDM in base64-encoding, GQZQ in base32, and 4yQ in base58, with padding characters dropped.
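The second option can be sketched in a few lines. The helpers below encode the text "43" (the digits as characters, not the number's binary form), using the RFC 4648 alphabet for base32 and the Bitcoin-style alphabet for base58. With those choices, base32 comes out as GQZQ. All three functions are simplified illustrations: padding is dropped and the base58 helper ignores leading zero bytes.

```javascript
const BASE32 = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ234567'; // RFC 4648 alphabet
const BASE58 = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz';

function toBase64(bytes) {
  return bytes.toString('base64').replace(/=+$/, ''); // drop padding
}

function toBase32(bytes) {
  // Write the bytes out as bits, then read them back 5 bits at a time.
  let bits = '';
  for (const byte of bytes) bits += byte.toString(2).padStart(8, '0');
  let out = '';
  for (let i = 0; i < bits.length; i += 5) {
    out += BASE32[parseInt(bits.slice(i, i + 5).padEnd(5, '0'), 2)];
  }
  return out;
}

function toBase58(bytes) {
  // Treat the bytes as one big number and convert it to base 58.
  let value = bytes.length ? BigInt('0x' + bytes.toString('hex')) : 0n;
  let out = '';
  while (value > 0n) {
    out = BASE58[Number(value % 58n)] + out;
    value /= 58n;
  }
  return out;
}

const raw = Buffer.from(String(43)); // the two characters "4" and "3"
console.log(toBase64(raw)); // NDM
console.log(toBase32(raw)); // GQZQ
console.log(toBase58(raw)); // 4yQ
```

Note that none of these hide anything: they're trivially reversible by anyone, which is fine for display but not for secrecy.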

Of course, you should find out details like the characters used and output lengths before deciding on a scheme. You should understand the security status of your IDs—can something undesirable happen if someone decodes some of your IDs? Good luck!


