Base64's goodness

I had a use case some time ago where base64 saved my ass, and I thought I'd write a short refresher for myself on what it can do, and how it can be a useful tool for any software engineer.

First, base64 is a character set (remember those?). There are 64 characters in this set: the 26 uppercase and 26 lowercase English letters (A-Z, a-z), the 10 digits (0-9) and two other characters (usually + and /). Every character here maps to a number from 0 to 63.
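
To make the mapping concrete, here's a small sketch in JavaScript (assuming Node.js) that writes out the standard alphabet and looks up a few entries. The constant name BASE64_ALPHABET is my own, purely for illustration.

```javascript
// The standard base64 alphabet: index 0 is "A", index 63 is "/".
const BASE64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
  "abcdefghijklmnopqrstuvwxyz" +
  "0123456789" +
  "+/";

console.log(BASE64_ALPHABET.length);       // 64
console.log(BASE64_ALPHABET[0]);           // A
console.log(BASE64_ALPHABET[26]);          // a
console.log(BASE64_ALPHABET.indexOf("/")); // 63
```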

Encoding

More importantly, base64 is a byte encoding. You give it a bunch of bytes, and it will convert them to a string that contains only those 64 characters (and in some cases, an = sign). Note that these bytes don't have to be just text. You can supply a file or any random sequence of bytes. The goal of the encoding is to represent any byte stream as printable ASCII characters, because they're "simple" and most computers support them (in the Western world).

Every programming language and OS has a bunch of tools for converting to and from base64, but let's do it ourselves. How do we encode a string like hi in base64?

First off, forget about the "string" part. Base64 works on bytes, so it doesn't care whether or not the input is a string. The first thing to do is to write the input as a sequence of bytes. In Unicode/ASCII, h is the codepoint 104₁₀ and i is 105₁₀. (I'm using ₁₀ to indicate these are base-10 numbers.) Now we write these as base-2 numbers, one 8-bit byte each:

01101000 01101001
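
If you'd rather not do the binary conversion by hand, this step can be sketched in JavaScript (assuming Node.js, for its Buffer class). The variable names are just mine.

```javascript
// Turn the string into its bytes, then print each byte as 8 binary digits.
const bytes = Buffer.from("hi", "utf8");

const bits = [...bytes].map((byte) => byte.toString(2).padStart(8, "0"));

console.log([...bytes]);     // [ 104, 105 ]
console.log(bits.join(" ")); // 01101000 01101001
```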

And now we encode!

  1. We put everything together, then group them in 6-bit groups (add 0s at the end if there aren't enough bits).
  2. Take each group as a new number and look it up in the base64 alphabet. For instance, if the number is 3, we replace it with character number 3 in the base64 alphabet, which is D (A = 0, B = 1...). Full character set here.

// Put everything together
0110100001101001

// Group them by 6 bits. Last group has only 4, so add two zeros.
011010 000110 100100

// Convert them to base64 numbers
011010 = 26₁₀ = a₆₄
000110 = 6₁₀ = G₆₄
100100 = 36₁₀ = k₆₄

So hi is aGk in base64 encoding. You'll more likely see it written as aGk=, because base64 strings are often padded with one or two = characters to ensure their length is a multiple of 4 (for reasons such as verifying that the received string is complete).
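
Here's the whole procedure as a minimal JavaScript sketch (assuming Node.js). It follows the two steps above literally, working on a string of bits, which is slow but easy to follow. The encodeBase64 helper is hypothetical and only for illustration; in real code you'd use the built-in encoder, which the last line uses as a cross-check.

```javascript
const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Illustrative helper, not a production encoder.
function encodeBase64(bytes) {
  // Step 1: put all the bits together in one long string.
  let bits = "";
  for (const byte of bytes) {
    bits += byte.toString(2).padStart(8, "0");
  }

  // Add zeros at the end so the last group has a full 6 bits.
  while (bits.length % 6 !== 0) {
    bits += "0";
  }

  // Step 2: read each 6-bit group as a number and look it up.
  let output = "";
  for (let i = 0; i < bits.length; i += 6) {
    output += ALPHABET[parseInt(bits.slice(i, i + 6), 2)];
  }

  // Pad with = until the length is a multiple of 4.
  while (output.length % 4 !== 0) {
    output += "=";
  }
  return output;
}

console.log(encodeBase64(Buffer.from("hi")));      // aGk=
console.log(Buffer.from("hi").toString("base64")); // aGk=
```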

Use cases

The important thing about base64 is that it can represent any byte in these "regular" text characters, making it useful for a bunch of things:

Transferring binary data

You could use this to transfer non-text data (such as a file) over mechanisms designed for text. These mechanisms would normally break or result in corrupted data, because you're passing non-text bytes to them. By encoding in base64, you can safely send such data and then decode it to get the original file without any loss.

In fact, this was my initial need. I was working with a remote server, and I needed to generate a PDF on the server and check its content. I only had access to the terminal, so I couldn't open a fancy app like Adobe Reader on the server. I didn't have access to SCP or other such tools. I had to generate the file on the server, then download it to my machine and open it locally. With base64 this was:

heroku run rails r "puts Base64.encode64(thing.generatePdf.string)" | base64 --decode > file.pdf

(This is a Ruby-based example, but the same thing applies everywhere.)

The heroku run rails r bit is the wrapper code to run this command on my remote Rails app via Heroku. We run the thing.generatePdf method, which returns a StringIO instance which can be saved to a file, but in this case, we call .string on it to read the content as a string. Then we pass that into the base64 encoder and print that to stdout (puts). The pipe (|) then passes that output to the base64 utility on my machine, which decodes it and stores it to file.pdf. And just like that, I could open the file locally.

(I'm ashamed to admit that I first tried just printing the original string directly into a file on my machine, but it didn't work. The file was corrupted. Which makes sense, because there are many bytes in PDFs that aren't text, and terminals and SSH are designed for text.)
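
You can reproduce that kind of corruption in miniature. This is a hedged sketch in JavaScript (assuming Node.js): the eight bytes below are made up to stand in for a binary file, and "treating them as text" is simulated by decoding them as UTF-8 and encoding them again, which is not exactly what a terminal does but fails for a similar reason.

```javascript
// A few bytes that are not valid UTF-8 text, standing in for a binary file.
const original = Buffer.from([0x25, 0x50, 0x44, 0x46, 0xff, 0xfe, 0x00, 0x80]);

// Treating the bytes as text and converting back loses information:
// invalid sequences get swapped for a replacement character while decoding.
const viaText = Buffer.from(original.toString("utf8"), "utf8");

// Going through base64 keeps every byte intact.
const viaBase64 = Buffer.from(original.toString("base64"), "base64");

console.log(original.equals(viaText));   // false
console.log(original.equals(viaBase64)); // true
```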

One trick you can do with this is JSON file uploads. JSON is a text format, so you can't upload files in it. You'd have to switch to multipart request bodies. But, if you really wanted to stick to JSON, you could encode the file in base64 and pass it like a regular string. (Be warned, it'd likely be a very long string.)

{
  "photo": "N7c2PJAxPJ9mIGFueSBjYXJuYWwgcGxlYXN1cmUu..."
}

You probably shouldn't, though; we'll come to that.
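
For completeness, here's roughly what both ends of such an upload could look like in JavaScript (assuming Node.js). The bytes are a stand-in for a real image, there is no actual HTTP request, and the variable names are mine; only the "photo" field comes from the example above.

```javascript
// Pretend these bytes are the contents of an image file.
const fileBytes = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// Client side: embed the file as a base64 string in a JSON body.
const requestBody = JSON.stringify({ photo: fileBytes.toString("base64") });

// Server side: parse the JSON and decode the string back into bytes.
const parsed = JSON.parse(requestBody);
const received = Buffer.from(parsed.photo, "base64");

console.log(requestBody);                // {"photo":"iVBORw0KGgo="}
console.log(received.equals(fileBytes)); // true
```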

Getting around unwanted characters

Another reason you might want to use base64 is to serialize strings safely. For instance, suppose you have an arbitrary string and you want to embed it inside a quoted string literal, like in a generated code snippet:

const str = "Some string";

const codeSnippet = `const x = "${str}";`;

This produces invalid code if str contains double quotes:

const str = 'There are "quotes" here.';

const codeSnippet = `const x = "${str}";`;
// Output:
// const x = "There are "quotes" here.";

One way you could get around this is to manually escape the quotes before interpolating. Another option would be to use the base64-encoded version instead, which is guaranteed to not have quotes. (I learnt this trick from Caleb Porzio.)

const str = 'There are "quotes" here.';

const codeSnippet = `const x = "${btoa(str)}";`;
// Output:
// const x = "VGhlcmUgYXJlICJxdW90ZXMiIGhlcmUu";

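Unfortunately, you lose readability here, and whatever consumes the snippet has to decode the value (with atob, for instance) before using it. If readability is important to you, you should probably escape manually.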
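
Here's a hedged sketch of the full round trip in JavaScript, with the generated code decoding the value itself. It assumes an environment where btoa and atob are globals (browsers, or a recent Node.js), and it runs the generated code with new Function purely to check that it is valid. One caveat: btoa only accepts characters in the Latin-1 range, so arbitrary Unicode strings need to be converted to bytes first.

```javascript
const str = 'There are "quotes" here.';

// Embed the encoded value, and have the generated code decode it at runtime.
const codeSnippet = `const x = atob("${btoa(str)}"); return x;`;

// Run the generated code to check that it is valid and yields the original.
const result = new Function(codeSnippet)();

console.log(btoa(str).includes('"')); // false
console.log(result === str);          // true
```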

Serializing to string

Another use I had recently was replicating an object in memory. Essentially, I wanted to make a copy of some object (e.g. an instance of a Person class, and related objects) from one environment and import that as a fully formed object into another. These objects were big, so I didn't have the time to manually copy each property over. I remembered serialization, which was designed for things like these. I could simply serialize the objects in one environment, copy the output, and deserialize that in the other:

// (1) env1.php: in one environment
$object = (...); // instance of Person
echo serialize($object);

// (2) env2.php: in another environment
$input = trim(fgets(STDIN));
$object = unserialize($input); // instance of Person

# Pipe serialized output from (1) as input to (2)
php env1.php | php env2.php

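This works in PHP because PHP's serialization output is mostly plain text (although private and protected properties are stored with null bytes around their names, so even there it isn't guaranteed for every object). Ruby's Marshal, on the other hand, produces non-text bytes (as explained in my post on serialization), so this would break. The solution? Base64.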

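# in one environment
require "base64"
object = (...)
# strict_encode64 emits one line (encode64 wraps every 60 chars), so `gets` reads it all
puts Base64.strict_encode64(Marshal.dump(object))

# in another environment
require "base64"
input = gets.strip
object = Marshal.load(Base64.decode64(input))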
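
The same pattern works anywhere the serializer emits binary. As a rough analogue in JavaScript, this sketch uses Node's built-in v8 module (an assumption that you're on Node.js; the person object is made up). Unlike Ruby's Marshal, as far as I know this does not restore instances of your own classes, so treat it as an illustration of the base64 step rather than a drop-in equivalent.

```javascript
const v8 = require("node:v8");

// A value that plain JSON can't represent faithfully (it has a Date and a Map).
const person = {
  name: "Ada",
  born: new Date(0),
  tags: new Map([["role", "admin"]]),
};

// Environment 1: serialize to binary, then base64 so it's safe to print and pipe.
const line = v8.serialize(person).toString("base64");

// Environment 2: read the line, decode the base64, and deserialize.
const copy = v8.deserialize(Buffer.from(line.trim(), "base64"));

console.log(copy.name);                 // Ada
console.log(copy.tags.get("role"));     // admin
console.log(copy.born instanceof Date); // true
```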

Downsides and points to note

  • Base64's output is larger than the input. Every 3 bytes of data get encoded to 4 characters, so that's around a 33% increase in size.
  • You can use base64 to force file uploads via JSON, but you probably shouldn't. It takes up more data and involves unnecessary extra processing (encoding and decoding). Multipart is a scheme designed for binary files over HTTP, so favour that.
  • Base64 is not a security mechanism. Encoding is not encryption. Anyone can read a Base64 string. JWTs are base64-encoded, but that's just for ease of transport. Anyone who gets a hold of your JWT can decode it and read the information you're storing in it, so be careful about what you store there.
  • Base64 isn't entirely URL-safe; you can put it in a URL, but the +, / and = characters have special meanings there and might be misinterpreted, so they often need to be percent-encoded. If you need something that can be put in a URL as-is, the URL-safe base64 variant (which uses - and _ instead of + and /) or other encodings (such as base32) might be a better fit.
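
A few of these points are easy to check for yourself. The sketch below is JavaScript and assumes a reasonably recent Node.js, since the "base64url" Buffer encoding isn't available in older versions. The encodedLength helper and the sample payload are mine, just for illustration.

```javascript
// Size: every 3 input bytes become 4 output characters (padding included).
function encodedLength(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

const data = Buffer.alloc(3000); // 3000 zero bytes, standing in for a file
console.log(data.toString("base64").length); // 4000
console.log(encodedLength(3000));            // 4000

// Not encryption: anyone can decode a base64 string, no key needed.
const encoded = Buffer.from('{"admin":true}').toString("base64");
console.log(Buffer.from(encoded, "base64").toString("utf8")); // {"admin":true}

// URL safety: the standard alphabet uses + and /, the URL-safe one uses - and _.
const tricky = Buffer.from([0xfb, 0xff]);
console.log(tricky.toString("base64"));    // +/8=
console.log(tricky.toString("base64url")); // -_8
```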

Encodings are fun. 😄 Base64 isn't a character encoding in the same way UTF-8 is; it's more of a "data encoding". Here's an interesting short post on some different things we know as "encodings".



I write about my software engineering learnings and experiments. Stay updated with Tentacle: tntcl.app/blog.shalvah.me.
