Date: 2025-01-23
Accepted
Itās possible to design a format containing any meaningful information, possibly encrypted or signed with an RSA key.
The keys must:
Desirable:
The largest online stores deliver about 4 million orders per day. This is 124 million per month, or 1.5 billion per year, or 100 billion orders in 60-70 years.
For values used in a purely technical sense, i.e. not for showing to end users - UUID version 4 (random number). If sorting keys by creation time is desirable and revealing this time does not carry business risks, then - UUID version 7 (timestamp + random number).
For values used in a business sense, i.e. those printed on receipts, shown to end users, typed in web forms - numeric FPE string in the form 012-345-678
. Clients are only required to enter digits (on the smartphoneās compact numeric keypad) - separators are inserted automatically.
The rightmost digit is a check digit, so the above template digits can accomodate 100 million (minus 1) orders. The check digit is the same as for credit card numbers with the exception for all-zero strings. Such strings are unacceptable and if generated, must be re-generated (see Check Digit Calculation for details).
For documents and packaging, it is recommended to generate a QR code with the order number or a link to a payment form which has the order total pre-filled. This will make it easier to pay for orders using a smartphone.
DB records - potentially big data - are quickly searched and uniquely indexed in a way that does not impose business risks. Identifiers intended for display to customers and/or input by customers do not reveal internal statistics, such as the total number of records.
Manually entered numbers must be protected against typos by means of a check digit. Itās crucial for routing payments to the intended order number or bank account.
For alphanumeric strings (letters A-Z
and digits 0-9
) itās reasonable to use the ISIN validation algorithm which is an extension to the Luhn algorithm. The latter operates on decimal digits only; the former first replaces: A=10, B=11, ā¦, Z=35 and then invokes the Luhn algorithm. Thus, for decimal strings, the ISIN algorithm equals the Luhn algorithm which protects credit card numbers.
In general, any algorithm can be applied to arbitrary input by converting the input to digits first - by replacing each character with its index in the alphabet, for example in Unicode which accomodates slightly over a million entries.
All modulo-based algorithms (Luhn, ISIN, EAN13, ISBN) fail on all-zero strings (0
, 00
, etc.) - at least their implementation in the Apache Commons Validator library do. If such strings are OK, consider leveraging the Verhoeff algorithm (but it accepts decimal strings only). As an option, such strings must not be accepted; the source DB sequence must be increased once again to generate another value. The final value (with the check digit appended) must not be all-zero because such an order number looks weird.
Displayed in groups of 3 digits (minimum 9 digits, leading zeros for small values): 000-111-222
, 333-444-555-6
. Stored in DB and mentioned in logs in the same format to avoid discrepancies and problems with searching, input forms, copying, forwarding, etc.
2^63 - 1
equals 9
followed by 18 more digits.Displayed in groups of 3 digits (minimum 9 digits, leading zeros for small values): 000-111-222
, 333-444-555-6
. Stored in DB and mentioned in logs in the same format to avoid discrepancies and problems with searching, input forms, copying, forwarding, etc.
The numbers are based in a DB sequence, but a random delta is added each time. Besides the main goal - hiding the total number of records - this
The very first number is a sufficiently large one, without trailing zeros. For example:
Previous Record Number | Random Delta | New Record Number | Displayed As | Comment |
---|---|---|---|---|
23 456 | 23 456 | 002-345-6 | generation of the very first number | |
23 456 | 101 | 23 557 | 002-355-7 | |
23 557 | 200 | 23 757 | 002-375-7 | |
23 759 | 153 | 23 910 | 002-391-0 | |
⦠| ⦠| ⦠| ||
999 999 999 | 100 | 1 000 000 099 | 100-000-009-9 | 10th digit appears |
For the upper estimate (1.5 billion records per year), 11 digits would be enough for 60-70 years. However, considering the reduction in number capacity by 100-200 times, 2 more digits are required.
To obtain the next value, execute in Postgres (this can be wrapped in a procedure and called automatically):
BEGIN;
ALTER SEQUENCE seq_name INCREMENT BY <random number>;
SELECT nextval('seq_name');
COMMIT;
Let the first 3 digits be allocated for the number of months starting from a certain date (83 years = 996 months). Then, for the upper estimate (124 million records per month) - and even for 1 billion - 9 digits suffice, which, considering the reduction in number capacity by 100-200 times (the random delta), becomes 11. This sums up to 14 digits again, like in the previous section, just with fewer leading zeros. However, all the digits must always exist and be filled with zeroes.
The prefix can extend to the left as much as needed, but the number of records per month is limited. This is a characteristic of all formats that have a prefix.
In the 1st month, the numbers will look like 001-000-00-000-000
, in the 2nd - 002-000-00-000-000
. Each prefix needs its own DB sequence. When creating a record, the DB sequence corresponding to the prefix (number of months) is updated.
It has been mathematically proven possible (FPE - Format-Preserving Encryption), using a secret key, to encrypt any string so that the result:
The alphabet is the complete set of all possible characters. For example, if the alphabet consists of digits + Latin letters + hyphen, the string ā567ā can turn into ā0T-ā (if there is a requirement to keep letters as letters etc., then the string will have to be encrypted in parts, with different alphabets). If the alphabet consists of digits only, then one set of digits will always result in another set of digits, however, this will not be a number, but a string where leading zeros may appear.
For example, the number 12 - as string ā12ā - might turn into ā07ā, and the number 123 - as string ā123ā - might turn into ā007ā. If we interpret the result as a number, then both would be 7, which is incorrect. If we exclude zero from the alphabet so that the result is a raw number, then the algorithm, strangely enough, will encrypt, but it will be impossible to decrypt the result, because the algorithm Ā«doesnāt knowĀ» about the existence of zero.
An arbitrary alphabet can be used, for example hexadecimal (0-9A-F), for which the original number needs to be first encoded with this alphabet to a string; the string is then transformed. FPE, by definition, is designed for mapping characters from and to the same alphabet.
The recommended FPE implementation is FF3 version FF3-1 approved as a NIST standard - see their document, section 5.2. A Java implementation has been tested that accepts, when using the decimal alphabet, strings 6 to 56 characters long. This limit depends on the alphabet and is due to the following FF3-1 requirement: the number of possible values of the input string must be at least 1 million.
In this (decimal) case, 10^6 = 1 000 000. Therefore, before encryption, itās necessary to left-pad the input decimal string with zeros if itās shorter than 6 characters. As for the maximum length (56), the largest 8-byte signed integer (Java long) consists of 19 digits.
The secret key length can be 128, 192, or 256 bits. An additional key, called tweak, is 56 bits long.
The format (alphabet + group separator) is determined by the business requirements. This section merely describes a few options.
Letās define the ID as a variable-length string in the format 012-345-678
- minimum 9 decimal digits, the rightmost one being a check digit, so the string length remains constant for up to 100 million (minus 1) records. Any number of leading zeros is allowed, and they have a meaning, they canāt be stripped out.
Digits are separated from left to right by hyphens in groups of 3 (for visual harmony). The separators are part of the format, because if we only use them for display and store 012345678
(with the leading zero!) for 012-345-678
in DB, there will be ambiguity and problems with searching, input forms, copy-pasting, forwarding, etc. The separator can be anything, for example a dot (012.345.678
), a forward slash (012/345/678
).
The source of the string is an ordinal number obtained from a DB sequence. The string can extend to the left as the source number increases: 000-000-000
ā 000-000-000-0
and so on. Formally, the extension is infinite, but in reality it will not exceed 20 digits: the maximum 8-byte signed integer (Java long) has 19 digits (9
followed by 18 digits), plus a check digit. Thus, the approximate maximum possible value takes 26 characters: 900-000-000-000-000-000-0c
where c
is a check digit depending on the algorithm.
For the upper realistic estimate - 1.5 billion records per year - 11 significant digits plus a check digit are sufficient for 60-70 years (100 billion records minus 1): 999-999-999-99c
- which, with group separators added, takes 15 characters.
A different alphabet can be used, for example Crockfordās Base32 famous for the absence of ambiguously looking characters (0/O, 1/I, 1/L): 0123456789ABCDEFGHJKMNPQRSTVWXYZ
. Then the max. record number will take not 26 characters, but only 18: 2G1-D1V-5J0-000-0c
. The upper estimate - 100 billion records minus 1 - will require not 15 characters, but only 10: 2X4-7FT-Zc
. On the other hand, this reduces readability, and sometimes 5
can be confused with an S
.
Original sequence numbers (positive integers obtained from a DB sequence) are, after conversion to string, left-padded with zeros to 8 digits: 123
ā 00000123
. If the original number has more than 8 digits, it remains unchanged: 12 345 678
ā 12345678
. Inherently, 8 zeros and 9 zeros are real and different values; see their decryption below.
After the string has been encrypted:
Then, for self-validation:
Hereās an example of algorithm operation for the 128-bit secret key 2DE79D232DF5585D68CE47882AE256D6
and the 56-bit ātweakā CBD09280979564
:
Sequence Number ā | Padding
(if needed) ā |
Encrypted ā | Final Number
(w/ISIN check digit) |
---|---|---|---|
1 | 000001 | 943130351 | 943-130-351 |
2 | 000002 | 887763811 | 887-763-811 |
3 | 000003 | 308392836 | 308-392-836 |
1 000 | 001000 | 568737225 | 568-737-225 |
10 000 | 010000 | 216388769 | 216-388-769 |
100 000 | 100000 | 151996170 | 151-996-170 |
1 234 567 | 1234567 | 683768261 | 683-768-261 |
Since any combination of digits corresponds to some original (sequential) number, we can find out at which sequence number the client will receive known values:
Final Number
(ācā is the check digit) ā |
Decrypted ā | Sequence Number |
---|---|---|
000-000-00c | 08959589 | 8 959 589 |
000-000-000-c | 746999857 | 746 999 857 |
000-000-01c | 93005957 | 93 005 957 |
000-000-02c | 62197228 | 62 197 228 |
000-000-10c | 69986121 | 69 986 121 |
The UUID length is 16 bytes (128 bits). The most common variant is just a random number - version 4.
6cb32aba-88a7-4188-b32a-ba88a751883c
) takes 36 bytes.6cb32aba-88a7-4188-b32a-ba88a751883c
.Idempotency - a property of an object or operation that guarantees that repeated application of the operation to the object yields the same result as the first application. For example, if the object has already been created, it wonāt be created again.
ā©