An Exercise in Shortening Ids

Maga D. Zandaqo
ITNEXT
Published in
3 min readAug 3, 2022

--

On the Web Platform, we constantly deal with numerical and binary labels or identifiers (ids), and we often use them in URIs for our API endpoints or filenames. A couple of issues arise in the process:

  1. Sequential ids simplify id discovery and can lead to a vulnerability known as “insecure direct object reference”¹ (if access control is lacking). They also reveal the size and velocity of data they identify, leading to business intelligence leaks.
  2. Randomized ids like UUID or MongoDB’s ObjectId are better. Although, they are 2 to 4 times larger than sequential ids for storage, and their default hexadecimal encoding is suboptimal for URI, i.e. results in longer than necessary strings.

A popular solution to exposing ids is to store a separate randomized id as a “public” id alongside a “private” sequential one. There are two main issues with this approach:

  1. We have to store and index a significantly larger extra field.
  2. Architecture-wise, it violates the separation of concerns by mixing an application-specific concern with our domain entities.

What if we could encode our existing ids in such a way as to:

  1. Obfuscate and make them reasonably hard to predict;
  2. Shorten as much as URI permits;
  3. Do it faster than the alternatives?

Below is a short overview of one such solution implemented by objectid64.

Base64

Our requirement to have a URI compatible string limits us to the following set of 66 characters²: A-Z a-z 0-9 - _ . ~. Naturally, we will round the number down to the closest power of two leaving us with 64 characters.

Using an off-the-shelf Base64 encoder won’t obfuscate much. We can, however, write our own encoder with a configurable alphabet, where changing the arrangement of characters produces different encodings:

It should be noted, that this is not encryption: it merely obfuscates and shortens ids enough to dissuade amateurs. Use encryption when ids need to be truly hidden, though, mind its performance impact.

Bit Twiddling

To encode numerical ids (i.e. numbers) in Base64, we can use the classic base conversion algorithm with division and modulo operations. However, since our base is a power of two, we can speed up the implementation quite a bit (2 to 5 times faster encoding in synthetic benchmarks) by using bitwise shift instead of division and bitwise AND instead modulo³:

In the same way, we can handle the 64-bit sequential ids by using JavaScript’s now built-in bigint type.

UUID Shortening

Hexadecimal encoding of UUID and ObjectId used in the wild is suboptimal for our use case of encoding for URI where we can use Base64 that is 25% more efficient (i.e. produce shorter strings). While we can convert from Base16 to Base64 using the base conversion algorithm above, we can do much better by creating a lookup table.

Notice that Base16 uses 16 characters that encode 4 bits (2⁴ = 16), whereas Base64 has 64 characters encoding 6 bits (2⁶ = 64). Thus, three hexadecimal characters can encode 12 bits, the same as two characters of Base64. Since the amount of hex triplets is limited to 16³ = 64² = 4096, we can create a reasonable lookup table linking each triplet to a pair of equivalent Base64 characters and use it to convert from one base to another without any calculations.

By converting from hex to Base64, we shorten ObjectId from 24 characters to 16 and UUID from 36 to 22:

And do so faster than alternatives, albeit by trading a bit of memory for the speed.

Of course, if we get the ids in their binary forms, we can forego hex encoding altogether and directly encode into our Base64:

Thus, a simple task of making ids URL-friendly led us to contemplating architecture, learning in-depth about encodings and base conversions, employing bit twiddling hacks and memoization. In the end, your choice of ids should not be dictated by the way they look in URIs or even by the security concerns due to their exposure — you can always encode (or encrypt) into a preferred shape.

--

--