The primary key dilemma: ID vs UUID and some practical solutions

The problem

Databases need a unique identifier called primary key (PK) to address their data units (records). There are two types of primary keys: natural keys, which are extracted from the data (i.e. first and last names or the social number), and surrogate keys which are sequential serial numbers added externally (usually by the database itself) and called sequential ID.

The most common form of sequential ID is an 8 byte integer which is very compact and performant in the context of a single database but it’s not very helpful in a distributed environment where key uniqueness among all systems is paramount. It is analogous to the difference between a local serial number added to a book by a library and the ISBN code: the former is useful, practical and efficient within the local indexes and users but only the latter can be used effectively to exchange books with other libraries.

Solutions for a distributed system

We need an identifier that maintains its uniqueness in a context with many different interacting systems. There are three solutions available:

  • An hybrid natural/surrogate key model
  • A centralized identifier generator factory
  • Universal identifiers

Hybrid natural/surrogate key model

If the data unit has a fitting natural key that is assured stable, unique, easily indexable and compact it might be used as an external identifier while still using a surrogate key internally to keep the database performant. The natural key could be substituted by an UUID (see later) with similar results. Applying this method would cost an extra index to be paid in speed and space and the different databases involved are bound to be managed separately because of the non unique primary key. This option might still be viable for loosely interacting systems that are managed separately (i.e. in a wide area network vs cloud). Natural keys are commonly discouraged though because of the many concerns about stability, security and privacy (it might be a Personal Data).

Centralized identifier generator factory

A centralized identifier generator factory can be a solution where a strict control over identifiers is needed (i.e. authorizations). It requires specific measures to address performance bottleneck, security and availability.

Universal identifier

A big random number can be assigned as an identifier by every system independently. A commonly used one is the UUID: Universally Unique IDentifier.

Random UUIDs are guaranteed unique (no need for a centralized factory), secure (non guessable), compact (16 bytes long) and easy to use but they have one huge problem: their own randomness! They cannot be efficiently indexed by the B-Tree algorithm used by most databases leading to a significant drop in performances compared to sequential IDs.

On the other hand universal identifiers, avoiding key collisions over multiple systems, improve considerably the management of a distributed data layer allowing for a far easier data migration, merging, distribution and backup.

Sequential UUID

To address the B-Tree indexing problem there is a proposal for an extension to the UUID specification to include monotonically incremented, node and time dependent UUID versions (sequential UUID, v6, v7, v8) that improve indexing performances dramatically. Note that UUID version 1, incrementing its most significant bits, doesn’t get those improvements.

Unfortunately even sequential UUIDs, being double the size of a normal ID, suffers some performance loss for the intrinsic slowness of some common operations like sorting by ID and a noticeable increment in the overall database size.

Other solutions

Cloud companies have created a plethora of different types of universal identifiers with various mix of characteristics that aim at solving the two big problems of UUID:

  • indexing performances
  • size
Hybrid ID/UUID solution

One possible solution would be to use a sequential UUID as a secondary index and a normal sequential ID as a primary key. This is based on the premise that an object will be referred much more internally by means of inner relationships, indexes, views, caches etc than from external systems (internal cohesion). This is very similar to the hybrid natural key-surrogate model presented before, it only uses the UUID as an address to be searched externally but it completely misses any advantages it could bring to the data layer management.

Small and efficient: TSID

One interesting solution is TSID which is a monotonically incremented, node and time dependent universal identifier of only 8 bytes (against the 16 bytes of a standard UUID). Being sequential and of the same size of a normal ID it keeps all its benefits in terms of efficiency on the database. Being universal it’s unique among all involved systems with benefits on the distributed data layer that sequential ID doesn’t have.

TSID mimics and extends the notorious Snowflake ID which was created by Twitter and now used by many others including Instagram and Discord. It’s very configurable. Its 64 bit payload is divided between a node identifier ranging from 0 to 20 bits (0 to 1,048,576 maximum nodes), a millisecond accurate timestamp of 42 bits (will be running for the next 139 years) and a counter of 2 to 22 bits for infra millisecond generations (22 – node bits, able to allocate from 4 to 4,194,304 TSIDs for each millisecond). The more bits are allocated to the node, the less are available for the infra millisecond counter but the generator can allocate values from the next milliseconds mitigating this problem if a generation burst should happen. This is probably its only flaw: given the right condition (low assigned infra-millisecond counter bits) it’s fairly easy to guess the next sequence value. A way to prevent that is to export it encrypted (at the cost of losing its sortability).

Note that because it uses all of its 64 bits the resulting long numbers cannot be managed by Javascript and must be encoded into a string (TSID has an helper that returns a 13 characters string representation of itself). That’s because a Javascript number is a IEEE a double-precision 64-bit binary format IEEE 754 value and its integer part can only store 52 bit (see Javascript Number.MAX_SAFE_INTEGER), a bigger value will be rounded (with imaginable consequences). The upper limit of 252 is a huge number (more than 9⋅1015) and will not pose any practical limitations for long sequential IDs (that’s why you have probably never heard of this problem before).

Further readings on my other blog article about “How to not use TSID factories“.

Next Sequence Value Attack

Being in a sequence and because of its very limited bit count it’s vulnerable to next sequence value guessing attack. This means that given a TSID, under some conditions, the next values could be guessed quite easily. To avoid this, and if required, TSID can be encrypted before being sent externally at the JSON serialization phase (API boundaries).

This Java project helps to encrypt various identifiers: GitHub: Id-Encryptor project.

Bibliography

3 thoughts on “The primary key dilemma: ID vs UUID and some practical solutions

  1. I still don’t understand why TSID counter bytes resets to random bits and not by zeros like generic snowflake. Can you explain what problem this solves in more detail in next post maybe?

    1. Hi, it’s a defense against:
      1) concurrent requests for a TSID where the generation could not be made thread safe (i.e. synchronized)
      2) next value attack where the next TSID could be guessed knowing the previous one
      Of course, giving the very few bits available this defense is not very strong but still better than nothing…

  2. It’s a possible mitigation for the next value guess attack. I suppose Snowflake doesn’t care about that by design (it isn’t meant to protect sensitive information).
    The problem is if you need to disclose the address of an object without giving away the addresses of other objects in the same repository (without having to use access controls).
    Note that the mitigation is as effective as the number of bits given to the counter: to be safe you should really encrypt them.

Leave a comment