Every time you try to create a new account on any of the websites, you begin with your name and, more often than not, you get the response “Username is already taken.”
Then, you add “your name + date of birth”, to realize it also has been “already taken” to finally end up with “your name + date of birth + license plate + graduation” to create the account.
I’m sure a lot of you are nodding and saying “been there, done that.”
But how many of you have wondered how these giant sites like Facebook, Instagram, and Gmail verify whether the username is taken or not?
Let’s start with the two possible approaches
A linear search may not be a good idea
Let’s assume that Facebook stores all the data in its directory.
And the software simply checks each name on the list one at a time and if it doesn’t find a match, it tells you your desired username is available.
Doesn’t sound sensible, does it?
The software has to look at each name every time a username needs to be verified.
The technique is unreasonable when you compare it with the Facebook database, which has over 1.5 billion users, and Twitter, which has 300 million users.
What if they use a Binary search?
This makes more sense, with all the brains working at Facebook.
Facebook keeps all the data sorted and arranged in an alphabetized list.
The list is 1.5 billion characters long, stored like a, aa,aaa……xyy, XYZ, yaa,yaa,yxz, zaa, zac and is very similar to your dictionary.
When you enter a name, it matches the entry with the username exactly in the middle of the list. If it matches, the software rejects the new username.
If it doesn’t match (which has a lot of possibilities), the next question the software addresses are “ If searched alphabetically, does the requested username come before or after this username in the middle?”
If it comes before, then the software knows that all the 750 million people after the username found in the middle of the list is of no use for the current search.
That eliminates 750 million possibilities in a single comparison.
If the desired username comes after the name in the middle (alphabetically), it eliminates all the names before it.
Either way, the software eliminates almost 750 million names for search in the first comparison.
Next, it takes the selected half of the list and immediately matches the requested username with the name in the middle of the remaining list.
If it matches, the requested name is rejected and if it doesn’t, the requested name is again checked for the possibility of it occurring before or after the name in the middle.
If it is before, reject the 350 million names after the name.
And go ahead with divide and conquer for the rest of the names as done earlier.
If the requested name is after the middle string, reject the names before it and try with the 350 million remaining names.
By dividing the list every time, you can compare the required username with the names in the list quite quickly…
But the question is…how quickly?
You will continue dividing the list into two until you can no longer do so.
And when you are left with one name in the database, you match it with your desired username.
This would be the last step before you find whether the username “chosen” is available or not.
For data as big as 1.5 billion, this method would need no more than 30 steps. 2 to the power of 31 gives you 2.14 billion, which is closest to our expectation of 1.5 billion users on Facebook.
This means fewer steps and complications for the same data when searched with a linear search.
What if the developers are very smart and use a Bloom filter as the solution?
Before you understand Bloom filters, you need to understand the concept of Hashing.
Hashing is like the license plate of your car.
A hash function takes data of any length as input and gives you a smaller identifier of a smaller, fixed length, which is often used to identify or compare data.
Bloom filters work simply – Test and Add.
Test whether the element is present in the list:
- If it returns false, the element is definitely not on the list.
- If it returns true, the element might probably be on the list. This false positive (will discuss it below) is a function of the Bloom filter and depends on the size and is independent of the hash function used.
A Bloom filter divides the memory area into buckets, which are filled with various hashes generated from one or many hash functions.
Let’s understand with an example.
Suppose, you have a memory bucket of size 10 and 3 hash functions which will give you three unique identifiers.
Suppose, you enter Ronaldo into this memory bucket.
Ronaldo, when passed through these hash functions, gives the value of 1,4, and 5. The filter quickly fills the memory in the bucket with these identifiers.
Now, you enter Messi into the memory bucket. Messi, when passed through the hash function, gives its own unique identifier. In this case, it is 3,7, and 8 and the filter fills the bucket.
As the functions always return the same value for similar inputs, we can be sure that when the name Ronaldo is given to the filter, it would check in locations 1,4, and 5 to find it full, which means that the name Ronaldo is already on the list.
Let’s continue with another example of entering Rooney into the memory.
Rooney, when passed through the hash function, returns 2,6, and 8. The filters check the memory to find that though 8 is full 2 and 6 are empty, which means you don’t have Rooney in the memory.
Therefore, the name is available.
But when the name Neymar is passed through the hash functions, it returns the value of 1,3 and 7 which eventually makes the filter believe that the name Neymar is already present on the list.
This scenario explains the concept of false positives used in Bloom filters. One can control the false positive by controlling the size of the Bloom filter.
More space is inversely proportional to false positives.
Each of the above-mentioned techniques comes with its own advantages and disadvantages.
With technology and computers getting smarter and faster every day, even the brute-force method seems feasible.
But with space and time complexity, many companies, such as Reddit, prefer Binary search, whereas some others, such as Medium, use Bloom filters smartly to suggest articles for you without repeating them again on your timeline.
Register now before your username is taken on the HackerEarth platform.