Unique Identifier Systems

A lot of the time you will need a way to uniquely identify items, be it in a database, a file system, or a generalised heap. You can start counting at 1, the second item would be 2, the third 3, and so on. This is fine for a system with a single authority, but where there isn't a single authority over the identification scheme you need an alternative.
One option is to give each authoritative server its own identifier range. I have worked on a system that did this in the past. Back when the internet was a fraction of the speed it is now, it was faster for a man on a motorbike to transport a bunch of 40 GB tape drives around the country than to send the data over the wires. Each system started its IDs at a ridiculously high number, the first at 1×10^12, the second at 2×10^12, the third at 3×10^12, and so on. The exact figure escapes me, but it was large enough that the ranges were unlikely ever to overlap.
So what if it needs to be truly unique? Then you will probably reach for a UUID, which stands for Universally Unique IDentifier. These are seemingly random values of the form 123e4567-e89b-12d3-a456-426614174000. I have used them in the past. One company I worked at had the following at the start of every header:
#ifndef EXAMPLE_H_123e4567_e89b_12d3_a456_426614174000
#define EXAMPLE_H_123e4567_e89b_12d3_a456_426614174000
The problem I have with UUIDs is that there are a lot of wasted bytes if you don't need that level of randomness. Take an online game server. If you send the ID in its normal textual representation you will be transmitting 36 bytes. Even if you send it as binary you will still be sending 16 identification bytes with every packet. In a fast-paced twitch game that mounts up to a lot of bytes, especially if you are sending every 20 ms. A lot of bits go into recording the time since 1582 in 100-nanosecond increments, and in some variants more bits are spent recording the MAC address to further differentiate the value. So maybe the UUID is not for us. What other algorithms are out there?
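Before answering, a quick size check makes the overhead concrete. This is a minimal sketch assuming libuuid is available (link with -luuid); the library choice is just an assumption for illustration.
#include <stdio.h>
#include <string.h>
#include <uuid/uuid.h>

int main(void)
{
    uuid_t id;               /* 16 raw bytes */
    char   text[37];         /* 36 characters plus the terminating NUL */

    uuid_generate(id);       /* fill with a random UUID */
    uuid_unparse(id, text);  /* "123e4567-e89b-12d3-a456-426614174000" style */

    printf("binary: %zu bytes, text: %zu bytes\n", sizeof(id), strlen(text));
    return 0;
}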
One is the Snowflake ID, created by Twitter and used for the IDs of tweets. It is 8 bytes in length, half that of a UUID, or 16 bytes if you send it as hexadecimal text. Like a UUID it has a component of time since an epoch (41 bits), but at millisecond resolution. It also has a component for the machine ID and 12 bits of sequence, meaning it can generate 4096 IDs per machine per millisecond.
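A rough sketch of that packing, squeezing time, machine and sequence into one 64-bit value; the epoch constant below is the commonly quoted Twitter one and should be treated as an assumption.
#include <stdint.h>

#define CUSTOM_EPOCH_MS 1288834974657ULL  /* commonly quoted Twitter epoch (2010) */

/* 41 bits of milliseconds since the custom epoch, 10 bits of machine id,
   12 bits of sequence. */
uint64_t snowflake(uint64_t now_ms, uint64_t machine_id, uint64_t sequence)
{
    return ((now_ms - CUSTOM_EPOCH_MS) & ((1ULL << 41) - 1)) << 22   /* time    */
         | ((machine_id               & ((1ULL << 10) - 1)) << 12)   /* machine */
         |  (sequence                 & ((1ULL << 12) - 1));         /* 0..4095 */
}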
That may be enough for our needs. But can we reduce the size? If we refine our use case we can look for optimisations. Say we contact a server with an all-zero identifier; the server then assigns us a value which becomes our identifier. We could just hand out an ever-increasing number, but we want something fairly non-guessable as a first level of defence (remember, obfuscation is not security!). With this knowledge we can drop the need for data that identifies the machine. Let's just say that the server hits come in waves as new lobbies open.
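As a sketch of that handshake (the generator name is hypothetical, and again this is obfuscation, not security):
#include <stdint.h>

uint32_t next_obfuscated_id(void);  /* hypothetical generator, defined elsewhere */

/* A client presenting an all-zero id is given a fresh one; anything else is
   treated as a returning client. */
uint32_t resolve_client_id(uint32_t presented_id)
{
    if (presented_id == 0)
        return next_obfuscated_id();  /* new client: assign an identifier */
    return presented_id;              /* known client: keep the identifier */
}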
By varying the resolution of the time and the number of sequence bits we can see the effect. Say we have one-second resolution from an epoch and reserve 8 bits for sequencing: if 10,000 people all join at the same time it will take nearly 40 seconds to assign all the IDs. With 10 bits that drops to under 10 seconds. A further 2 bits reduces it to about 2.5 seconds, which may be acceptable, but do we have the bits to spare?
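First, a quick back-of-the-envelope check of those service times, assuming one-second time resolution and 2^N IDs per second:
#include <stdio.h>

int main(void)
{
    const int players = 10000;
    for (int bits = 8; bits <= 12; bits += 2) {
        int per_second = 1 << bits;   /* ids we can mint per second */
        printf("%2d sequence bits: %5d ids/s, %5.1f s to serve %d players\n",
               bits, per_second, (double)players / per_second, players);
    }
    return 0;
}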
So let's see if we do: the standard Unix epoch is January 1, 1970, and at the moment of writing 1,680,009,514 seconds have elapsed, or just over a billion and a half. It doesn't take a maths major to see that we need 31 bits to store this value. How about we shift the epoch? Dropping a billion would give us a wraparound date of around 2035, which may be enough, but with some games being around a long, long time (Counter-Strike, World of Warcraft, etc.) it may not be. If we shift the epoch to 1.5 billion seconds after the Unix epoch that leaves 180+ million to store, which takes our requirement down to 28 bits. That only leaves us 4 bits for sequencing (if we are aiming for 32 bits) and a wraparound date of 2026.
So we reduce further: an epoch 1.6 billion seconds after the Unix epoch gives us a 27-bit value, which puts the wraparound date at December of next year. The next step is to drop the full 1.68 billion, which funnily enough puts the epoch at about 3 hours before the time of writing. Allowing room for an 8-digit count of seconds means 26 bits, which only extends the wraparound to mid-2025, and that is not enough.
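The same walkthrough as a quick calculation; each pair of epoch offset and time-bit budget below comes from the figures above, and the program prints the date at which the stored value wraps.
#include <stdio.h>
#include <stdint.h>
#include <time.h>

int main(void)
{
    struct { uint64_t epoch_offset; int time_bits; } cases[] = {
        { 1000000000ULL, 30 },  /* drop a billion          */
        { 1500000000ULL, 28 },  /* 1.5 billion after 1970  */
        { 1600000000ULL, 27 },  /* 1.6 billion after 1970  */
        { 1680000000ULL, 26 },  /* roughly "3 hours ago"   */
    };

    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        time_t wrap = (time_t)(cases[i].epoch_offset + (1ULL << cases[i].time_bits));
        printf("epoch +%10llu s, %2d time bits -> wraps %s",
               (unsigned long long)cases[i].epoch_offset, cases[i].time_bits,
               ctime(&wrap));   /* ctime() already appends a newline */
    }
    return 0;
}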
All in all, you can't really get a unique time-based identifier into 32 bits without some serious limitations.
So what can we do? Increase the bits. 36 bits would be enough to store the seconds since the epoch with a massive wraparound date, plus 8 bits of sequencing, which may or may not be enough. But 44 bits rounds up to an 8-byte field, and accessing 8 bytes can dramatically increase access time and space due to padding and cache lines. And remember the 40 seconds to service 10,000 users?!
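A small demonstration of the padding point, assuming a typical 64-bit ABI where an 8-byte id forces 8-byte alignment:
#include <stdio.h>
#include <stdint.h>

struct wide_id   { uint64_t id; uint32_t state; };  /* usually 16 bytes: 4 bytes of padding */
struct narrow_id { uint32_t id; uint32_t state; };  /* usually  8 bytes: no padding         */

int main(void)
{
    printf("64-bit id struct: %zu bytes, 32-bit id struct: %zu bytes\n",
           sizeof(struct wide_id), sizeof(struct narrow_id));
    return 0;
}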
By using 48 bits of storage you can keep a fairly recent epoch (2015) at millisecond accuracy with 41 bits of time (the wraparound is roughly 70 years away) and 7 bits of sequence, which means we can generate 128 IDs every millisecond. That means our 10,000 users would be serviced in 78 milliseconds. So while you can reduce the bytes used for identification, you have to know your use case and tailor the scheme to it.
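A minimal sketch of that 48-bit layout, assuming a 2015-01-01 epoch (the constant is illustrative, not prescriptive):
#include <stdint.h>

#define GAME_EPOCH_MS 1420070400000ULL  /* 2015-01-01T00:00:00Z in milliseconds, assumed */

/* Pack 41 bits of elapsed milliseconds and 7 bits of sequence into the low
   48 bits of a 64-bit value. */
uint64_t make_id(uint64_t now_ms, uint64_t sequence)
{
    uint64_t elapsed = (now_ms - GAME_EPOCH_MS) & ((1ULL << 41) - 1);
    return (elapsed << 7) | (sequence & 0x7F);
}
On the wire that is only the low 6 bytes per packet, rather than the 16 a binary UUID would need.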
