Bloom Filters

Based on https://www.geeksforgeeks.org/bloom-filters-introduction-and-python-implementation/ ; https://www.baeldung.com/cs/bloom-filter

Introduction

Have we ever wondered how websites validate millions of records within seconds? One of the solutions is using the Bloom filter.

What is it ?

A bloom filter is a probabilistic data structure meant for checking whether a given entry does not occur in a list. It is meant to be quite fast, and is used as a way of not doing costly queries when it can be determined that no results will be returned.

Under the hood, the Bloom filter is just an array of bits with all bits set to zero initially. The below diagram depicts the Bloom filter of size 19:

On this page.

What are the Bloom Filter Operations ?

Insert

We can perform Insert operations in constant space and time complexity. Bloom filter performs the below steps for Insert operation:

Hash the input value
Mod the result by the length of the array
Set corresponding bit to 1

Let’s illustrate this by inserting two strings – John Doe and Jane Doe into the Bloom filter. Let’s assume that hash values for these two strings are 1355 and 8337, respectively. Now, let’s perform the modulo operation to get an index within the bounds of the bit array: 1355%19 = 6 and 8337%19 = 15.

The Bloom filter will look like this after inserting values:

Lookup

We can perform a Lookup operation in constant time complexity. Bloom filter performs the below steps as a part of the Lookup operation:

Hash the input value
Mod the result by the length of the array
Check if the corresponding bit is 0 or 1

If the bit is 0, then that input definitely isn’t a member of the set. But if the bit is 1, then that input might be a member of a set. So let’s check if string John is present or not in the Bloom filter:

Note : the bit at the 10th index is 0, which indicates that the given string isn’t present in the Bloom filter.

What are the False Positive Analysis ?

Bloom filter is a space and time-efficient data structure. However, the tradeoff for that efficiency is that it’s probabilistic in nature. It means that searching for a nonexistent element can give an incorrect answer. Let’s check if the string James Bond is present or not in the Bloom filter. Assume that the hash value of the input string is 1355, which is the same as John Doe.

In this example, the Lookup operation returns the true result. However, we never inserted the string James Bond in the Bloom filter.

The false-positive scenario occurs due to hash collision. We can use multiple hash functions to reduce the hash collision frequency. So instead of setting only one bit, multiple bits will be set for a single input.

Where are used the Bloom Filter ?

Bloom filters are used by many popular applications due to their efficiency :

Spell Checker: In the early days, spell checkers were implemented using the Bloom filter
Databases: Many popular databases use Bloom filters to reduce the costly disk lookups for non-existent rows or columns. This technique is used by PostgreSQL, Apache Cassandra, Cloud Bigtable, etc.
Search Engines: BitFunnel is a search engine indexing algorithm. It uses the Bloom filter for its search indexes
Security: Bloom filters are used to detect weak passwords, malicious URLs, etc.