The problem
Databases need a unique identifier called primary key (PK) to address their data units (records). There are two types of primary keys: natural keys, which are extracted from the data (i.e. first and last names or the social number), and surrogate keys which are sequential serial numbers added externally (usually by the database itself) and called sequential ID.
The most common form of sequential ID is an 8 byte integer which is very compact and performant in the context of a single database but it’s not very helpful in a distributed environment where key uniqueness among all systems is paramount. It is analogous to the difference between a local serial number added to a book by a library and the ISBN code: the former is useful, practical and efficient within the local indexes and users but only the latter can be used effectively to exchange books with other libraries.
Solutions for a distributed system
We need an identifier that maintains its uniqueness in a context with many different interacting systems. There are three solutions available:
- An hybrid natural/surrogate key model
- A centralized identifier generator factory
- Universal identifiers
Hybrid natural/surrogate key model
If the data unit has a fitting natural key that is assured stable, unique, easily indexable and compact it might be used as an external identifier while still using a surrogate key internally to keep the database performant. The natural key could be substituted by an UUID (see later) with similar results. Applying this method would cost an extra index to be paid in speed and space and the different databases involved are bound to be managed separately because of the non unique primary key. This option might still be viable for loosely interacting systems that are managed separately (i.e. in a wide area network vs cloud). Natural keys are commonly discouraged though because of the many concerns about stability, security and privacy (it might be a Personal Data).
Centralized identifier generator factory
A centralized identifier generator factory can be a solution where a strict control over identifiers is needed (i.e. authorizations). It requires specific measures to address performance bottleneck, security and availability.
Universal identifier
A big random number can be assigned as an identifier by every system independently. A commonly used one is the UUID: Universally Unique IDentifier.
Random UUIDs are guaranteed unique (no need for a centralized factory), secure (non guessable), compact (16 bytes long) and easy to use but they have one huge problem: their own randomness! They cannot be efficiently indexed by the B-Tree algorithm used by most databases leading to a significant drop in performances compared to sequential IDs.
On the other hand universal identifiers, avoiding key collisions over multiple systems, improve considerably the management of a distributed data layer allowing for a far easier data migration, merging, distribution and backup.
Sequential UUID
To address the B-Tree indexing problem there is a proposal for an extension to the UUID specification to include monotonically incremented, node and time dependent UUID versions (sequential UUID, v6, v7, v8) that improve indexing performances dramatically. Note that UUID version 1, incrementing its most significant bits, doesn’t get those improvements.
Unfortunately even sequential UUIDs, being double the size of a normal ID, suffers some performance loss for the intrinsic slowness of some common operations like sorting by ID and a noticeable increment in the overall database size.
Other solutions
Cloud companies have created a plethora of different types of universal identifiers with various mix of characteristics that aim at solving the two big problems of UUID:
- indexing performances
- size
Hybrid ID/UUID solution
One possible solution would be to use a sequential UUID as a secondary index and a normal sequential ID as a primary key. This is based on the premise that an object will be referred much more internally by means of inner relationships, indexes, views, caches etc than from external systems (internal cohesion). This is very similar to the hybrid natural key-surrogate model presented before, it only uses the UUID as an address to be searched externally but it completely misses any advantages it could bring to the data layer management.
Small and efficient: TSID
One interesting solution is TSID which is a monotonically incremented, node and time dependent universal identifier of only 8 bytes (against the 16 bytes of a standard UUID). Being sequential and of the same size of a normal ID it keeps all its benefits in terms of efficiency on the database. Being universal it’s unique among all involved systems with benefits on the distributed data layer that sequential ID doesn’t have.
TSID mimics and extends the notorious Snowflake ID which was created by Twitter and now used by many others including Instagram and Discord. It’s very configurable. Its 64 bit payload is divided between a node identifier ranging from 0 to 20 bits (0 to 1,048,576 maximum nodes), a millisecond accurate timestamp of 42 bits (will be running for the next 139 years) and a counter of 2 to 22 bits for infra millisecond generations (22 – node bits, able to allocate from 4 to 4,194,304 TSIDs for each millisecond). The more bits are allocated to the node, the less are available for the infra millisecond counter but the generator can allocate values from the next milliseconds mitigating this problem if a generation burst should happen. This is probably its only flaw: given the right condition (low assigned infra-millisecond counter bits) it’s fairly easy to guess the next sequence value. A way to prevent that is to export it encrypted (at the cost of losing its sortability).
Note that because it uses all of its 64 bits the resulting long numbers cannot be managed by Javascript and must be encoded into a string (TSID has an helper that returns a 13 characters string representation of itself). That’s because a Javascript number is a IEEE a double-precision 64-bit binary format IEEE 754 value and its integer part can only store 52 bit (see Javascript Number.MAX_SAFE_INTEGER
), a bigger value will be rounded (with imaginable consequences). The upper limit of 252 is a huge number (more than 9⋅1015) and will not pose any practical limitations for long sequential IDs (that’s why you have probably never heard of this problem before).
Further readings on my other blog article about “How to not use TSID factories“.
Next Sequence Value Attack
Being in a sequence and because of its very limited bit count it’s vulnerable to next sequence value guessing attack. This means that given a TSID, under some conditions, the next values could be guessed quite easily. To avoid this, and if required, TSID can be encrypted before being sent externally at the JSON serialization phase (API boundaries).
This Java project helps to encrypt various identifiers: GitHub: Id-Encryptor project.
Bibliography
- UUID IEFT RFC
Reference documentation about UUID - IEFT: New UUID Formats
Proposed sequential UUID extensions (v6,v7,v8). Lists other non standard UUID implementations. - Postgresql Tutorial: The Basics Of PostgreSQL UUID Data Type
Explains the native PostregSQL UUID data type. - Thorben Janssen: How to generate UUIDs as primary keys with Hibernate
Very useful article on how to generate and use UUID with Hiberante. - JPA implementation patterns: Using UUIDs as primary keys
- Using UUID on Spring Data JPA Entities
Shows a trick to know if a JPA entity using UUID has been saved by using JPA@Version
annotation - JpaBuddy: The Ultimate Guide on Client-Generated IDs in JPA Entities (Andrey Belyaev)
Shows various drawbacks in using UUID. - Cybertec: UUID, serial or identity columns for PostgreSQL auto-generated primary keys?
“a change in PostgreSQL v11 made sure that monotonically increasing values will fill an index more efficiently than random inserts ever could” - Vlad Mihalcea: The best UUID type for a database Primary Key
A very good explanation of UUID and its pitfalls. Introduces TSID. Java approach. - Percona: UUIDs are Popular, but Bad for Performance — Let’s Discuss (2019)
Shows some performance tests on using UUID. - Percona: Storing UUID and Generated Columns (2017)
Other performances tests for UUID. - Soham Kamani: A Complete Guide to UUID Versions (v1, v4, v5) – With Examples
- Tom Harrison: UUID or GUID as Primary Keys? Be Careful!
Analyzes pitfalls of UUID. Introduces the hybrid UUID/ID approach that I expanded with encryption. - Identity Crisis: Sequence v. UUID as Primary Key
Evaluates slowness in using UUID. - Baeldung: An Overview of Identifiers in Hibernate/JPA
Very nice introduction to various ID strategies. - Supabase: Choosing a Postgres Primary Key
Explains the meaning of various standard and non standard UUID versions. - Percona: Storing UUID Values in MySQL (2014)
- Mozilla Developer: Number.MAX_SAFE_INTEGER
I still don’t understand why TSID counter bytes resets to random bits and not by zeros like generic snowflake. Can you explain what problem this solves in more detail in next post maybe?
Hi, it’s a defense against:
1) concurrent requests for a TSID where the generation could not be made thread safe (i.e. synchronized)
2) next value attack where the next TSID could be guessed knowing the previous one
Of course, giving the very few bits available this defense is not very strong but still better than nothing…
It’s a possible mitigation for the next value guess attack. I suppose Snowflake doesn’t care about that by design (it isn’t meant to protect sensitive information).
The problem is if you need to disclose the address of an object without giving away the addresses of other objects in the same repository (without having to use access controls).
Note that the mitigation is as effective as the number of bits given to the counter: to be safe you should really encrypt them.