Probably There, Definitely Not There – The Magic of Bloom Filter

When signing up for an email or to a website do you remember the exhausting procedure of selecting a username? First, you try with something simple like your first name or last name. Unless you have a really unique name, good luck with that. Next, you append your birth year to it. Still no luck? I am sure you have tried adding your phone number too. By this time, you already start loathing your username.

We can only try so many usernames, and we feel mentally exhausted. Imagine what goes under the hood. For each username you try, the computer must check against all the username it has stored in its database. And for applications such as Facebook, Twitter, and Instagram, the task is a challenge in itself.

How do you think these social media giants with hundreds of millions of users manage to check if a username is available or not and within milliseconds inform you? Clearly simple solutions like linear search are inefficient. Well, one possible optimal solution is the use of Bloom Filter.

What is Bloom Filter?

Bloom Filter is a probabilistic time and space efficient data structure used to represent a set and perform membership queries i.e. query whether an element is a member of the set or not. Burton H. Bloom first purposed bloom filter in his paper Space/Time Trade-offs in Hash Coding with Allowable Errors in 1970.

There are two parts to a Bloom Filter:

  1. Bit Vector
  2. Hash Function

Bit Vector

A bit vector is an array of n x 1 dimensions consisting of either 1 or 0 as its member. All the bit values are first initialized to 0.

bit vector in bloom filter

Hash Function

A hash function takes an input and generates a fixed-bit-size hash code. Here in our case, we use the hash function to generate an index value for the bit vector.

Here is how the two combine to implement bloom filter.

For every member (username), to be stored in the bit vector, it must first pass through the hash function. The hash function generates a hash code or index value. The bit value at the corresponding index of bit vector is set to 1. We can use multiple hash functions as well which will alter the bit at that many index positions.

hashing in bloom filter

Now, if we want to check whether a member (username) is in the bit vector or not, we simply pass it through the hash function and generate an index value. Next, we check the bit value at that index position. If the value is 1, the member is in the bloom filter, else it is not.

The Probabilistic Part

Bloom Filter is called a probabilistic data structure and here is the reason why.

When we query the membership of an element into a bloom filter one of two results is possible. Either the member is not present i.e. get 0 as the return value or the member is present i.e. get 1 as return value at the index position generated by the hash function.

When we get a 0 as the return value, it signifies that the element is definitely not present in the bloom filter. However, if we get a 1 as the return value, it does not guarantee that the element is present in the bloom filter. It signifies that the element may be present. The reason behind it is the use of the hash function.  A hash function maybe collision-prone i.e. the hash function may generate same index value even for different members. As a result, the hash function may generate an index value at which the bit value is already 1. Hence, even though the member is not present in the bloom filter, the result will be that the member is already present. This is the case of false positive.

False positives in bloom filter can be minimized by using bit values at multiple positions to represent a single element. One of the ways of achieving it is the use of multiple hash functions. Or we can use a hash function with good uniform distribution.

Bloom Filter is a simple and elegant solution for membership queries.

Time Efficiency

Bloom Filter implements the concept of indexing which has a time complexity of O(1). If we use k number of hash functions, it becomes O(k).

Space Efficiency

The most attractive feature of a bloom filter is its space efficiency. All that a bloom filter stores are bit values. One member is represented by a single bit value. Hence, we can store billions of members in a bloom filter that takes only a few hundred MB of memory. As a result, we can store the data even in primary memory for fast access.

Detecting Duplicate Username While Registration using a Bloom Filter

Now that we are clear what a bloom filter is, how it works, and what its advantages are, let’s implement it in detecting duplicate username while registering to a dummy email application.

We will implement it using Node.js and we will use Redis database to implement the bit vector and use Murmur hash function.

Part I: Bit Vector using Redis Database

Redis is an open-source in-memory database that stores data in memory rather than hard disk which means database query is faster.

Redis supports different types of data structures such as string, sets, bit set etc. We will use the bitmap data structure to implement the bit vector. Redis implements the bitmap using its string data structure i.e. it stores the bit vector as a string.

The string is capable of storing 32 bits while occupying a memory space of just 512 MB. That means to store 4 billion members, it will only occupy 512 MB of space.

The bitmap has two function calls:

  1. setbit: to set bit value to 1 at specified index position
  2. getbit: to get the bit value at specified index position
var redis = require('redis');  // Import Redis module
var murmurHash3 = require('murmurhash3js');  // Import MurmurHash3 module

var client = redis.createClient();  // Create Redis client

client.on('connect', function () {  // Connect to Redis Database
    console.log("Connected to redis.");
});

var bit;

module.exports = {
    /* Set bit value to 1 at index position generated by hash function for 'string'
      *  'bloomFilter' is the name of the bit vector
    */
    set: function(username) {
        client.setbit('bloomFilter', murmurHash3.x86.hash32(username), 1);
    },
    /* Check the bit value at index position generated by hash function */
    get: function(username) {
        client.getbit('bloomFilter', murmurHash3.x86.hash32(username), function(err, value){
            if (err) throw err;
            bit = value;
        });
        return bit;
    }
};

Part II: Hash Function using MurmurHash

We are using a 32-bit version MurmurHash function. We will install the MurmurHash3js module and call murmurHash3.x86.hash32(string) function

We are using the MurmurHash function because it is an excellent choice for hash-based lookups.  In a test conducted by Colin Pesyna, Murmur hash function was
found to have no collision while testing with 42 million keys. Further tests concluded that MurmurHash function had random uniform distribution and good avalanche effect. All these properties make Murmur hash an excellent choice for use in bloom filter.

Part III: Implementation in Registration Logic

First, we check if the username is available or not.

var bf = require('../BloomFilter');
var bit = bf.get(req.body.username);
res.send({
    bit: bit
});

If the username is available, then we allow registration. First, enter the user details in the database and then set the bit value corresponding to the username to 1.

var sql = "INSERT INTO users 
   (first_name, last_name, user_name, password, gender, date_of_birth, phone_number) " +
   "VALUES ('" + firstName + "','" + lastName + "','" + username + "','" + 
   password + "','" + gender + "','" + dateOfBirth + "','" + phoneNumber + "')";

db.query(sql, function (err, result) {
    if (err) throw err;
    bf.set(username); 
});

Screenshots

bloom-filter-username-available

Case I: Username is available

bloom-filter-username-not-available

Case II: Username is not available

You can find the full project on my GitHub account.

Bloom Filter is such a versatile data structure. There are many more variants of the data structure and it has a wide range of applications in networking, security, big data. Here is a good resource that highlights just that.

Build an Animal Guessing Game in Kotlin in 5 Minutes

Animal Guessing Game is a toy problem in Artificial Intelligence and is one of the best introductory projects for programmers who want to learn AI. I am taking Artificial Intelligence course for my fifth semester and this is my first AI project which I wrote in Kotlin.

Animal Guessing Game as the name suggests the guesses the animal that the player is thinking of. Here is how the game goes:

Description

First, the Animal Guessing Game agent asks the player to think of an animal. Then the agent will ask a series of Yes/No questions to narrow down its guess. When the agent runs out of questions to ask, it makes a guess and asks if it got is correct.

If the agent got the answer correct, it wins. Otherwise, the agent asks the player what it was thinking of and also asks what question should the agent had asked to make the correct guess. In other words, the agent learns from the player so that it can make a better guess next time.

Demo Snapshot

Here is a snapshot of a run in the game.

Kotlin Code

A snapshot of a run in Animal Guessing Game

Understanding The Problem

Here, it is very important to understand the problem. This is a simple binary problem i.e. ask the player a Yes/No question; if the answer is ‘Yes’ go to this question; if the answer is ‘No’ go to that question. Hence, it is best implemented in a Binary Search Tree which I have used.

Implementation

Binary Search Tree

Create a simple Binary search tree [Wiki] with Nodes containing data (A Yes/No question or an animal name) and two children nodes. The binary search tree consists of a function to insert a child node which the agent uses to store new information.

class Node ( /* Node data structure for binary search tree */
        var data: String, 
        var leftChild: Node? = null, 
        var rightChild: Node? = null
)

class BinarySearchTree {

    var rootNode = Node("") /* Initialize root node */
    var currentNode = Node("") /* A pointer variable to the current node*/

    /* Initialize the Binary Search Tree with initial data */
    fun initialize(data: String, yesAnswer: String, noAnswer: String) {
        rootNode = Node(data, Node(noAnswer), Node(yesAnswer)) /* Create new node and set it as root node */
    }

    /* Function to insert children node for left and right node of current node */
    fun insertChildren(currentNode: Node, data: String, userAnswer: String) {
        currentNode.leftChild = Node(currentNode.data) /* Shift the current answer to a new left node */
        currentNode.rightChild = Node(userAnswer) /* Store the user answer */
        currentNode.data = data /* Replace answer with new hint given by user */
    }
}

Animal Guessing Game Agent

Next, create the Animal Guessing Game Agent. It will contain the following functions:

  • A function to start the game. Here, we will initialize the game with a question and two answers (one for a ‘Yes’ answer and the other for a ‘No’ answer). In the above run, the initial question is ‘Can it fly?’ and the answer is Pigeon for ‘Yes’ and Elephant for ‘No’.
  • A function to play the game. This function whenever called points to the topmost node and traverses down. It contains a loop that iterates until a node is found null. Then it exits the loop and makes a guess.
  • A function to learn new information. This is what makes the agent intelligent. It asks the player the name of the animal that the player was thinking of and also a hint related to that animal. Then, the agent inserts the new information into its binary search tree.
class AnimalGuessingAgent {

    var BST = BinarySearchTree() /* Initialze binary search tree */
    var counter: Int = 0 /* Initialize counter to calculate number of steps */

    fun start() {
        println ("Think of an animal. I will try to guess which animal you are thinking of.")
        println ("I will ask some questions. Please, answer with a YES or NO.")

        BST.initialize("Can it fly?", "Pigeon", "Elephant")
        play()
    }

    private fun play() {

        BST.currentNode = BST.rootNode /* Each time the user plays the game, the pointer points to the initial node */
        counter = 0 /* Set counter to zero for new game */

        while (BST.currentNode.leftChild != null || BST.currentNode.rightChild != null){ /*  Traverse until a child node is null. */
            askQuestion()
            var userResponse: String = readLine()!!
            when (userResponse.toUpperCase()) {
                "YES" -> BST.currentNode = BST.currentNode.rightChild!! /* Traverse right */
                "NO" -> BST.currentNode = BST.currentNode.leftChild!! /* Traverse left */
                else -> println("My creator only designed me to understand a YES or NO. 🙁 ")
            }
            counter++ /* Increment counter */
        }
            guess() /* Make a guess */
    }

    private fun learn() {

        println ("Ok. I give up. What is it?")
        var userAnswer: String = readLine()!!

        println ("What question should have I asked?")
        var newHint: String = readLine()!!

        BST.insertChild(BST.currentNode, newHint, userAnswer) /* Insert newly learned information */
        println("Ok. Got it.")
        replay()
    }
}

You can find the full executable code here.

© 2018 Awale Sushil

Theme by Anders NorénUp ↑