3. DB Primary Key Type

šŸ”™

Date: 2025-01-23

Status

Accepted

Context

It’s possible to design a format containing any meaningful information, possibly encrypted or signed with an RSA key.

The keys must:

  • be resistant to guessing / brute force
  • have sufficient capacity for big data
  • be fast for searching
  • not take up much disk space
  • be - if printing on receipts / showing to end users is required - not too long and easily readable (all characters in one case, without similar looking characters)

Desirable:

  • convenience of input on smartphones, ideally ā€œnumbers onlyā€, so that a compact numeric keypad is sufficient
  • ability to copy-paste with one click (as one word, i.e. no whitespaces or punctuation)
  • validation at the DB level
  • some form of syntax and/or check digit to filter out obviously non-existent values at the application level, before querying the DB
  • global uniqueness
  • avoid accidental word creation, for example by excluding vowels, especially the letter U
  • only allow decimal digits and Latin letters in strings, preferably uppercase and not similar to digits (without 0/O, 1/I, 1/L)
Realistic upper bound estimate

The largest online stores deliver about 4 million orders per day. This is 124 million per month, or 1.5 billion per year, or 100 billion orders in 60-70 years.

Decision

For values used in a purely technical sense, i.e. not for showing to end users - UUID version 4 (random number). If sorting keys by creation time is desirable and revealing this time does not carry business risks, then - UUID version 7 (timestamp + random number).

For values used in a business sense, i.e. those printed on receipts, shown to end users, typed in web forms - numeric FPE string in the form 012-345-678. Clients are only required to enter digits (on the smartphone’s compact numeric keypad) - separators are inserted automatically.

The rightmost digit is a check digit, so the above template digits can accomodate 100 million (minus 1) orders. The check digit is the same as for credit card numbers with the exception for all-zero strings. Such strings are unacceptable and if generated, must be re-generated (see Check Digit Calculation for details).

For documents and packaging, it is recommended to generate a QR code with the order number or a link to a payment form which has the order total pre-filled. This will make it easier to pay for orders using a smartphone.

Consequences

DB records - potentially big data - are quickly searched and uniquely indexed in a way that does not impose business risks. Identifiers intended for display to customers and/or input by customers do not reveal internal statistics, such as the total number of records.

Options

Check Digit Calculation

šŸ”

Manually entered numbers must be protected against typos by means of a check digit. It’s crucial for routing payments to the intended order number or bank account.

For alphanumeric strings (letters A-Z and digits 0-9) it’s reasonable to use the ISIN validation algorithm which is an extension to the Luhn algorithm. The latter operates on decimal digits only; the former first replaces: A=10, B=11, …, Z=35 and then invokes the Luhn algorithm. Thus, for decimal strings, the ISIN algorithm equals the Luhn algorithm which protects credit card numbers.

In general, any algorithm can be applied to arbitrary input by converting the input to digits first - by replacing each character with its index in the alphabet, for example in Unicode which accomodates slightly over a million entries.

All-zero strings

All modulo-based algorithms (Luhn, ISIN, EAN13, ISBN) fail on all-zero strings (0, 00, etc.) - at least their implementation in the Apache Commons Validator library do. If such strings are OK, consider leveraging the Verhoeff algorithm (but it accepts decimal strings only). As an option, such strings must not be accepted; the source DB sequence must be increased once again to generate another value. The final value (with the check digit appended) must not be all-zero because such an order number looks weird.

Sequential Number

šŸ”

Displayed in groups of 3 digits (minimum 9 digits, leading zeros for small values): 000-111-222, 333-444-555-6. Stored in DB and mentioned in logs in the same format to avoid discrepancies and problems with searching, input forms, copying, forwarding, etc.

Pros and Cons

Pros
  • Fastest generation at the DB level.
  • Fastest search both by primary key and in joins.
  • Lowest disk space consumption. This applies to secondary keys too: duplication of primary keys in other tables gives the smallest index volume.
  • Sufficient for big data: 2^63 - 1 equals 9 followed by 18 more digits.
  • No dependency on any secret key.
Cons
  • Values can be guessed and enumerated incrementally. One can find out the total number of, for example, orders in a store simply by creating a new order. This is partially mitigated by starting the autoincrement from some large value, but even then - one can estimate the number of orders per day by creating one at 00:00 and another at 23:59.
  • No global uniqueness.

Sequential Number With Random Delta

šŸ”

Displayed in groups of 3 digits (minimum 9 digits, leading zeros for small values): 000-111-222, 333-444-555-6. Stored in DB and mentioned in logs in the same format to avoid discrepancies and problems with searching, input forms, copying, forwarding, etc.

The numbers are based in a DB sequence, but a random delta is added each time. Besides the main goal - hiding the total number of records - this > serves as a protection against typos during manual input, so a check digit is not required (it would reduce the number capacity by 10 times). If the minimum delta is 100 and maximum is 200, then 9 digits are sufficient for (minus the initial DB sequence value):

  • worst case - 5 million records
  • best case - 10 million records

The very first number is a sufficiently large one, without trailing zeros. For example:

Previous Record Number Random Delta New Record Number Displayed As Comment
23 456 23 456 002-345-6 generation of the very first number
23 456 101 23 557 002-355-7
23 557 200 23 757 002-375-7
23 759 153 23 910 002-391-0
… … …
999 999 999 100 1 000 000 099 100-000-009-9 10th digit appears

For the upper estimate (1.5 billion records per year), 11 digits would be enough for 60-70 years. However, considering the reduction in number capacity by 100-200 times, 2 more digits are required.

To obtain the next value, execute in Postgres (this can be wrapped in a procedure and called automatically):

BEGIN;
ALTER SEQUENCE seq_name INCREMENT BY <random number>;
SELECT nextval('seq_name');
COMMIT;

Pros and Cons

Pros
  • Same as for sequential number.
  • Due to ā€œsparsenessā€ of values, guessing/enumeration is more difficult, and the total number of records is obfuscated (obscurity is not security!).
Cons
  • Due to DB sequence locking, parallel record creation works slower than usual.
  • Numeric digit capacity is not fully utilized, so more digits are needed than usual.
  • No fixed length, which makes validation in input forms more difficult (or leave many leading zeros as a reserve). However, the length can be limited based on the 8-byte limit (signed Java long): 19 significant digits plus a check digit.
  • It’s still possible to give a rough estimate (for the example above - 100-200 times higher than reality) of the total number of orders in the store. And if the increment pattern is inferred by creating many orders in a row, a more accurate estimate becomes possible.
  • No global uniqueness.

Sequential Number With Time-Dependent Prefix

šŸ”

Let the first 3 digits be allocated for the number of months starting from a certain date (83 years = 996 months). Then, for the upper estimate (124 million records per month) - and even for 1 billion - 9 digits suffice, which, considering the reduction in number capacity by 100-200 times (the random delta), becomes 11. This sums up to 14 digits again, like in the previous section, just with fewer leading zeros. However, all the digits must always exist and be filled with zeroes.

The prefix can extend to the left as much as needed, but the number of records per month is limited. This is a characteristic of all formats that have a prefix.

In the 1st month, the numbers will look like 001-000-00-000-000, in the 2nd - 002-000-00-000-000. Each prefix needs its own DB sequence. When creating a record, the DB sequence corresponding to the prefix (number of months) is updated.

Pros and Cons

Pros
Cons

FPE String

šŸ”

Theory

It has been mathematically proven possible (FPE - Format-Preserving Encryption), using a secret key, to encrypt any string so that the result:

  • consists of the same alphabet as the original string
  • has the same length as the original string
  • is always the same for the same input string (no randomness, unlike in regular cryptography)
  • can be decrypted back to the original string

The alphabet is the complete set of all possible characters. For example, if the alphabet consists of digits + Latin letters + hyphen, the string ā€˜567’ can turn into ā€˜0T-’ (if there is a requirement to keep letters as letters etc., then the string will have to be encrypted in parts, with different alphabets). If the alphabet consists of digits only, then one set of digits will always result in another set of digits, however, this will not be a number, but a string where leading zeros may appear.

For example, the number 12 - as string ā€˜12’ - might turn into ā€˜07’, and the number 123 - as string ā€˜123’ - might turn into ā€˜007’. If we interpret the result as a number, then both would be 7, which is incorrect. If we exclude zero from the alphabet so that the result is a raw number, then the algorithm, strangely enough, will encrypt, but it will be impossible to decrypt the result, because the algorithm Ā«doesn’t knowĀ» about the existence of zero.

An arbitrary alphabet can be used, for example hexadecimal (0-9A-F), for which the original number needs to be first encoded with this alphabet to a string; the string is then transformed. FPE, by definition, is designed for mapping characters from and to the same alphabet.

Implementation

The recommended FPE implementation is FF3 version FF3-1 approved as a NIST standard - see their document, section 5.2. A Java implementation has been tested that accepts, when using the decimal alphabet, strings 6 to 56 characters long. This limit depends on the alphabet and is due to the following FF3-1 requirement: the number of possible values of the input string must be at least 1 million.

In this (decimal) case, 10^6 = 1 000 000. Therefore, before encryption, it’s necessary to left-pad the input decimal string with zeros if it’s shorter than 6 characters. As for the maximum length (56), the largest 8-byte signed integer (Java long) consists of 19 digits.

The secret key length can be 128, 192, or 256 bits. An additional key, called tweak, is 56 bits long.

Format Description

The format (alphabet + group separator) is determined by the business requirements. This section merely describes a few options.

Let’s define the ID as a variable-length string in the format 012-345-678 - minimum 9 decimal digits, the rightmost one being a check digit, so the string length remains constant for up to 100 million (minus 1) records. Any number of leading zeros is allowed, and they have a meaning, they can’t be stripped out.

Digits are separated from left to right by hyphens in groups of 3 (for visual harmony). The separators are part of the format, because if we only use them for display and store 012345678 (with the leading zero!) for 012-345-678 in DB, there will be ambiguity and problems with searching, input forms, copy-pasting, forwarding, etc. The separator can be anything, for example a dot (012.345.678), a forward slash (012/345/678).

The source of the string is an ordinal number obtained from a DB sequence. The string can extend to the left as the source number increases: 000-000-000 → 000-000-000-0 and so on. Formally, the extension is infinite, but in reality it will not exceed 20 digits: the maximum 8-byte signed integer (Java long) has 19 digits (9 followed by 18 digits), plus a check digit. Thus, the approximate maximum possible value takes 26 characters: 900-000-000-000-000-000-0c where c is a check digit depending on the algorithm.

For the upper realistic estimate - 1.5 billion records per year - 11 significant digits plus a check digit are sufficient for 60-70 years (100 billion records minus 1): 999-999-999-99c - which, with group separators added, takes 15 characters.

A different alphabet can be used, for example Crockford’s Base32 famous for the absence of ambiguously looking characters (0/O, 1/I, 1/L): 0123456789ABCDEFGHJKMNPQRSTVWXYZ. Then the max. record number will take not 26 characters, but only 18: 2G1-D1V-5J0-000-0c. The upper estimate - 100 billion records minus 1 - will require not 15 characters, but only 10: 2X4-7FT-Zc. On the other hand, this reduces readability, and sometimes 5 can be confused with an S.

Transformation Algorithm

Original sequence numbers (positive integers obtained from a DB sequence) are, after conversion to string, left-padded with zeros to 8 digits: 123 → 00000123. If the original number has more than 8 digits, it remains unchanged: 12 345 678 → 12345678. Inherently, 8 zeros and 9 zeros are real and different values; see their decryption below.

After the string has been encrypted:

  • a check digit is appended
  • group separators are inserted appropriately

Then, for self-validation:

  • string syntax is analyzed, taking into account the minimum (11) maximum (26) allowed length, character set, and group separators in correct positions
  • group separators and the check digit are removed
  • a check digit is calculated and compared to the original one

Here’s an example of algorithm operation for the 128-bit secret key 2DE79D232DF5585D68CE47882AE256D6 and the 56-bit ā€˜tweak’ CBD09280979564:

Sequence Number → Padding
(if needed) →
Encrypted → Final Number
(w/ISIN check digit)
1 000001 943130351 943-130-351
2 000002 887763811 887-763-811
3 000003 308392836 308-392-836
1 000 001000 568737225 568-737-225
10 000 010000 216388769 216-388-769
100 000 100000 151996170 151-996-170
1 234 567 1234567 683768261 683-768-261

Since any combination of digits corresponds to some original (sequential) number, we can find out at which sequence number the client will receive known values:

Final Number
(ā€˜c’ is the check digit) →
Decrypted → Sequence Number
000-000-00c 08959589 8 959 589
000-000-000-c 746999857 746 999 857
000-000-01c 93005957 93 005 957
000-000-02c 62197228 62 197 228
000-000-10c 69986121 69 986 121

Pros and Cons

Pros
  • The numbers look random.
  • Any alphabet can be used.
  • Sufficient for big data - no length limitation, numbers grow to the left.
  • Numeric digit capacity is fully utilized.
  • Validation (length + character set + check digit) takes place in the source code before sending request to DB. This reduces DB load by filtering out obviously non-existent values.
Cons
  • Strings have the slowest search speed.
  • A secret key must be stored safely. Its leak leads to the possibility of not only decryption but also generation of, for example, discount coupons or paid software serial keys.
  • Encryption takes some time. A rough estimate is 100 000 results per second on a MacBook.
  • Several times more disk space is required as compared to binary formats, for example the maximum 8-byte signed integer (Java long) takes 19 bytes in string format. This also affects secondary index sizes.
  • No fixed length, making validation of input forms more difficult. But there are min. and max. lengths. This applies to the DB column definition as well.
  • Any sequences are generated, including Ā«all zerosĀ». Therefore, such a check digit is required that’s not equal to zero for such sequence (i.e. not modulo addition, but the Verhoeff algorithm), otherwise it would be quite a strange number 😲 As an option, if we get Ā«all zerosĀ», repeat the generation, taking a new value from the DB sequence, and then apply any checksum calculation algorithm.
  • The string length provides an upper bound estimate for the total number of records, e.g. 3 decimal digits mean Ā«less than a thousandĀ». This is mitigated by artificial lengthening of the original string - left-padding with zeros.
  • Copy-pasting is inconvenient - can’t select with one click, as punctuation is recognized by editors as word boundary.
  • No global uniqueness.

UUID

šŸ”

The UUID length is 16 bytes (128 bits). The most common variant is just a random number - version 4.

Pros and Cons

Pros
  • Less disk space consumption than for strings (16 bytes are stored in binary format if DB supports this column type). Just to compare, UUID in string format (6cb32aba-88a7-4188-b32a-ba88a751883c) takes 36 bytes.
  • Native column type in Postgres, so there’s both validation during record creation/update and autogeneration (however, in Hibernate its generator needs to be connected).
  • Effortless validation in REST controllers, which allows to avoid looking up invalid values in DB.
  • UUID capacity is sufficient for both big data and global (planet-wide) uniqueness. Usually 6 bits out of the 128 are allocated for metadata, therefore the number of combinations (2^122) equals approximately this number: ā€˜5’ followed by 36 zeros.
  • Thanks to its global uniqueness, UUID generation can be delegated to client applications (replace HTTP POST with PUT - the so-called idempotency1). This, for example, avoids order duplication from mobile apps in unstable network conditions when the order reached the server, but the ā€œorder acceptedā€ response didn’t reach the client.
  • With hex digits, there are no such issues as ā€œdigit looks like letterā€ or ā€œconfusing upper and lower caseā€.
  • B-tree index performance can be improved by using the new UUID v7 format, where the prefix is a timestamp. This allows for time-based sorting, and the keys with close creation times will be adjacent on disk.
  • No dependency on any secret key.
Cons
  • Not suitable for printing on receipts, showing to clients, entering in payment terminals, as it’s a string like 6cb32aba-88a7-4188-b32a-ba88a751883c.
  • UUID represents a random number (in UUID v7 it’s supplemented with a millisecond precision timestamp). Therefore, strictly speaking, collisions are possible, but due to the very long length its probability is negligible. UUID is widely used everywhere, and nobody worries 🤷.
  • UUID v7 may create business risks as it contains information about its creation time.
  • No check digit.
  • Copy-pasting is inconvenient - can’t select with one click, as punctuation is recognized by editors as word boundary.

  1. Idempotency - a property of an object or operation that guarantees that repeated application of the operation to the object yields the same result as the first application. For example, if the object has already been created, it won’t be created again.

    ↩