Goal: A brief exploration of what it means to "pack" and "unpack" bytes.
Inspiration
I've come across Ruby's Array#pack and String#unpack methods, but never had the time to dive into them. While researching another article, I came across this question and decided to stop and explore it.
Exploration 1: Packing into two bytes
I can't define "packing", but I've gathered that it's a term for representing a series of bytes as a string. And depending on how you do it, you can even do this in fewer bytes than the original. Unpacking is the reverse: recovering the original information.
Let me try an example based on the Stack Overflow question. I have a bunch of bytes, i.e. values between 0 (00000000) and 255 (11111111). Suppose I take two at random, maybe 126 and 2.
let [a, b] = [126, 2]
console.log(a.toString(2).padStart(8, '0')) // 01111110
console.log(b.toString(2).padStart(8, '0')) // 00000010
I could represent them in a string by using the JS hexadecimal escape sequence:
console.log(a.toString(16).padStart(2, '0')) // 7e
console.log(b.toString(16).padStart(2, '0')) // 02
console.log('\x7E\x02') // "~" followed by an invisible control character (U+0002)
However, this isn't what I want, as this string has two characters. JavaScript strings are UTF-16 [note 1], so this string has 4 bytes, which is more than the original.
Buffer.from('\x7E\x02', 'utf16le').byteLength
// 4
This string has two characters of two bytes each: 00 7e and 00 02. I want to pack the bytes so the string has only one character, 7e 02. Here's how:
let char = String.fromCharCode((a << 8) | b)
console.log(char); // "縂"
Buffer.from(char, 'utf16le').byteLength // 2
This is a bit of bit arithmetic (haha).
- a << 8 means "shift the bits in a left 8 times". Shifting 126 (01111110) left 8 times gives us 01111110 00000000.
- | b is a bitwise OR operation. 01111110 00000000 ORed with 2 (00000010) gives 01111110 00000010, which is what I want (7E 02).
So there it is. I started with two bytes, and was able to fit them into a 2-byte character [note 2]. How about unpacking? Some more bitwise magic.
let bytes = char.charCodeAt(0)
let byteA = bytes >> 8 // Shift the bits to the right 8 times to get the first byte
let byteB = bytes & 0xFF // Bitwise AND the bits with 11111111 to keep only the second byte
// Alternative:
// byteB = bytes ^ (byteA << 8)
console.log(byteA, byteB) // 126, 2
Cool, cool.
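The pack and unpack steps above can be wrapped into two small functions and checked against every possible pair of bytes. This is only a sketch: packPair and unpackPair are illustrative names, not built-ins.

```javascript
// Pack two bytes (0-255 each) into one UTF-16 code unit.
// packPair and unpackPair are illustrative names, not built-ins.
function packPair(a, b) {
  return String.fromCharCode((a << 8) | b)
}

// Split one UTF-16 code unit back into its high and low bytes.
function unpackPair(char) {
  const codeUnit = char.charCodeAt(0)
  return [codeUnit >> 8, codeUnit & 0xff]
}

const packed = packPair(126, 2)
console.log(packed.length)      // 1
console.log(unpackPair(packed)) // [ 126, 2 ]

// Check that every possible pair survives the round trip,
// including pairs that produce surrogate code units such as 0xD800.
let allOk = true
for (let a = 0; a < 256; a++) {
  for (let b = 0; b < 256; b++) {
    const [x, y] = unpackPair(packPair(a, b))
    if (x !== a || y !== b) allOk = false
  }
}
console.log(allOk) // true
```

One caveat: some pairs produce lone surrogates (code units in the range D800 to DFFF), which are fine inside a JS string but are not valid Unicode text, so they can get replaced if the string is later encoded as UTF-8.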
I also found out you can do this packing natively with the TextDecoder API! [note 3]
let byteArray = new Uint8Array([a, b])
let packedStr = new TextDecoder('utf-16be').decode(byteArray)
console.log(packedStr) // "縂"
However, unpacking with TextEncoder gives wrong results for this use case, since it only supports UTF-8:
let unpackedArray = new TextEncoder().encode(packedStr)
console.log(unpackedArray) // Uint8Array [231, 184, 130]
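Since TextEncoder can't help here, one way to unpack a whole string is to loop over its code units and split each one by hand. A minimal sketch, assuming the string was packed high byte first as in the custom packer; unpackString is a hypothetical helper, not a Web API.

```javascript
// unpackString is a hypothetical helper, not a Web API.
// It assumes each code unit holds two bytes, high byte first.
function unpackString(str) {
  const bytes = new Uint8Array(str.length * 2)
  for (let i = 0; i < str.length; i++) {
    const codeUnit = str.charCodeAt(i)
    bytes[2 * i] = codeUnit >> 8       // high byte first (big-endian)
    bytes[2 * i + 1] = codeUnit & 0xff // low byte second
  }
  return bytes
}

console.log(unpackString('\u7E02')) // Uint8Array(2) [ 126, 2 ]
```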
Exploration 2: packing into one byte
Speaking of UTF-8, it's time to try that. But I'm changing some things:
- I won't use JS here, since its strings are UTF-16. I probably can use it, but I don't want that headache. Plus, I love any excuse to work with Ruby.
- All the bytes I'll pack are in the range 0 to 15. I've intentionally made it smaller so that I can pack two bytes into one UTF-8 character (one byte). I'll use 13 and 2 as my test bytes.
Packing in Ruby is pretty similar:
a, b = 13, 2
puts a.to_s(2).rjust(8, '0') # 00001101
puts b.to_s(2).rjust(8, '0') # 00000010
# hex
puts a.to_s(16).rjust(2, '0') # 0d
puts b.to_s(16).rjust(2, '0') # 02
char = ((a << 4) | b).chr # Shift by 4 bits, not 8, since I'm now packing in one byte
puts char # => "\xD2"
puts char.length # => 1
puts char.bytes.length # => 1
bytes = char[0].ord
byteA = bytes >> 4
byteB = bytes & 0x0F # AND with 0F, not FF, since I'm splitting up one byte
puts byteA, byteB # 13, 2
The output string here is a single byte, "\xD2", which is simply the original 0D and 02 bytes packed together 😀 Unfortunately, it's not a valid printable character, so printing it shows �, but it's there.
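For comparison with Exploration 1, the same 4-bit trick can be sketched in JavaScript by working on a Uint8Array rather than a string, which sidesteps the UTF-16 question entirely. packNibbles and unpackNibbles are made-up names for this sketch.

```javascript
// packNibbles and unpackNibbles are made-up names for this sketch.
// Each input must fit in 4 bits (0-15).
function packNibbles(a, b) {
  if (a < 0 || a > 15 || b < 0 || b > 15) {
    throw new RangeError('values must be between 0 and 15')
  }
  return Uint8Array.of((a << 4) | b)
}

function unpackNibbles(bytes) {
  return [bytes[0] >> 4, bytes[0] & 0x0f]
}

const packedByte = packNibbles(13, 2)
console.log(packedByte[0].toString(16)) // d2
console.log(packedByte.byteLength)      // 1
console.log(unpackNibbles(packedByte))  // [ 13, 2 ]
```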
As mentioned earlier, Ruby has inbuilt pack and unpack methods, but with a directive like c* they map byte to byte, so I couldn't use them for this example.
packed = [a, b].pack('c*') # => "\r\x02"
packed.unpack('c*') # => [13, 2]
But they work with the original UTF-16 example:
a, b = 126, 2
packed = [a, b].pack('c*') # => "~\x02"
packed.unpack('c*') # => [126, 2]
It may not look like it, but the packed version here ("~\x02") is exactly the same as my manually packed JavaScript version. It contains the exact two bytes, 7E 02. The difference is the encoding: in Ruby, pack returns this string tagged as binary (ASCII-8BIT), so it's rendered byte by byte. But I can change the encoding and see for myself!
packed.force_encoding 'utf-16be' # => "\u7E02"
packed.length # => 1
packed.bytes.length # => 2
Possible uses of packing
Why would you want to pack, though? I'm thinking, perhaps in a constrained environment like gaming over the Internet. If there is a limited number of possible buttons a player can press (say 12), instead of transmitting each button press as one byte, I could:
- wait for a few milliseconds, to gather the next few keypresses and send in a batch
- pack these keypresses into a byte. 12 possible buttons can fit in 4 bits (2^4 = 16), so two keypresses can go in one byte (8 bits).
In this, packing serves as a form of compression, to send less data over the network and improve the gaming experience (less data to download, so responses can be faster).
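A rough sketch of that batching idea follows. Everything here is an assumption made for illustration: buttons are numbered 0 to 11, and the spare value 15 (0xF) is used as "no keypress" padding when a batch has an odd number of presses.

```javascript
// Hypothetical setup: 12 buttons with IDs 0-11.
// 0xF is never a real button, so it can mark an empty slot.
const NO_PRESS = 0xf

// Pack two keypresses per byte: first press in the high 4 bits,
// second press (or padding) in the low 4 bits.
function packPresses(presses) {
  const out = new Uint8Array(Math.ceil(presses.length / 2))
  for (let i = 0; i < presses.length; i += 2) {
    const first = presses[i]
    const second = i + 1 < presses.length ? presses[i + 1] : NO_PRESS
    out[i / 2] = (first << 4) | second
  }
  return out
}

function unpackPresses(bytes) {
  const presses = []
  for (const byte of bytes) {
    presses.push(byte >> 4)
    const second = byte & 0x0f
    if (second !== NO_PRESS) presses.push(second)
  }
  return presses
}

const batch = [3, 11, 0, 7, 5]
const wire = packPresses(batch)
console.log(wire.byteLength)     // 3
console.log(unpackPresses(wire)) // [ 3, 11, 0, 7, 5 ]
```

Five keypresses go out as three bytes instead of five. The padding trick only works because 12 buttons leave some 4-bit values unused; with exactly 16 buttons, the batch length would have to be sent separately.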
I also found this question, from a user who wanted to send a UUID as binary data. This is a valid use, since UUIDs are often rendered as strings, but they're actually a sequence of 16 bytes. Sending them as a string would take 36 bytes, so packing is useful here. You could also do this for other "binary-but-look-like-strings" data, like SHA-512 hashes for instance.
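To make the UUID case concrete, here's a minimal sketch that turns the 36-character string form into 16 bytes and back. uuidToBytes and bytesToUuid are illustrative names, and the UUID below is just an example value.

```javascript
// uuidToBytes and bytesToUuid are illustrative names, not built-ins.
function uuidToBytes(uuid) {
  const hex = uuid.replace(/-/g, '')
  if (!/^[0-9a-fA-F]{32}$/.test(hex)) throw new Error('not a UUID string')
  const bytes = new Uint8Array(16)
  for (let i = 0; i < 16; i++) {
    // Every two hex digits become one byte
    bytes[i] = parseInt(hex.slice(2 * i, 2 * i + 2), 16)
  }
  return bytes
}

function bytesToUuid(bytes) {
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('')
  // Re-insert the hyphens in the usual 8-4-4-4-12 layout
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join('-')
}

const uuid = '123e4567-e89b-12d3-a456-426614174000' // example value
const packedUuid = uuidToBytes(uuid)
console.log(uuid.length)                      // 36
console.log(packedUuid.byteLength)            // 16
console.log(bytesToUuid(packedUuid) === uuid) // true
```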
Let me know if you can think of any other uses.
Notes
1. The ECMAScript spec says:
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit.
So JS strings are UTF-16. However, many modern Web APIs, like Blob and TextEncoder, and even older Node.js ones like Buffer, assume (or accept only) UTF-8. My guess is that they expect the string to be from the outside world (reading a file, an API response, etc), in which case, it's most likely UTF-8.
2. The only reliable way I found to get the byte length of a native JS string (UTF-16) is Buffer.from(string, 'utf16le').byteLength. Commonly suggested ways I found include TextEncoder and Blob, but they always assume UTF-8.
3. For this to work as expected, I had to specify UTF-16 Big Endian (utf-16be) as the encoding. UTF-16 because I want 2 bytes per character, and big-endian because I want the most significant byte first, like I did in the custom packer.
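The effect of byte order can be seen without any text decoding at all. As a small sketch, DataView can read the same two bytes as a 16-bit number in either order:

```javascript
// Read the bytes 7E 02 as one 16-bit number in both byte orders.
const view = new DataView(new Uint8Array([0x7e, 0x02]).buffer)

const bigEndian = view.getUint16(0, false)   // first byte is the high byte
const littleEndian = view.getUint16(0, true) // first byte is the low byte

console.log(bigEndian.toString(16))    // 7e02
console.log(littleEndian.toString(16)) // 27e

// Only the big-endian reading matches the custom packer's (a << 8) | b
console.log(bigEndian === ((0x7e << 8) | 0x02)) // true
```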