This article was originally posted on my personal blog.
A Bloom Filter is a space-efficient probabilistic data structure, created by Burton Howard Bloom in 1970. It is used to test whether an element is a member of a set with no false negatives.
In Bloom Filter, we hash an element with k-hashing algorithms, generate indices from these hashes and then set the value at these indices to 1 in the filter-list. To check if an element is in the set, we run the same k-hashing algorithms on that element and check if all the values at the indices produced by the hashes are 1. If all the values are 1 then the element is probably in the set and if any one of the value is 0, then the element is definitely not in the set.
In the figure above, w
is definitely not in the set since one of the value in the filter-list is 0.
Where is Bloom Filter Used?
Bloom Filter is most frequently used in cache layers to prevent caching of “One Hit Wonders”. It avoids the caching of data requested only once. The data is cached only after multiple requests to the same data. Similarly, it is also used to avoid recommending articles a user has previously read. There are many other applications of Bloom Filter which can be found here.
Implementing a Simple Bloom Filter
We’re going to learn about Bloom Filter by writing a simple implementation using Python as the programming language. In this program, we want to check if a query is definitely new or probably old using Bloom Filter.
Filter-List
A bloom filter can be implemented by using a list of a certain length. Initially all the members of the list are initialized to zero.
bloom_len = 10
bloom = [0] * bloom_len
Getting User Input
First, we get the input from the user inside a while loop so that we can take multiple inputs. We also need to encode the input text in ‘utf-8’ for hashing which will be described next.
while True:
query = str(input("Enter a query: ")).encode('utf-8')
Hashing
Here we use three hashing algorithms – md5, sha1 and sha224 since they are readily available in the hashlib
package. These three hashing algorithms produce three output for the same query
the user inputs. The outputs of the hash functions are stored in a list for further processing.
# query = str(input("Enter a query: ")).encode('utf-8')
m1 = hashlib.md5()
m2 = hashlib.sha1()
m3 = hashlib.sha224()
m1.update(query)
m2.update(query)
m3.update(query)
hash_values = [m1.digest(), m2.digest(), m3.digest()]
Find the Filter-Indices
After the hash values are calculated which are of bytes
data type, we convert them into integers and store them in a list. The integers are then modulo divided by the length of the Bloom Filter to find the indices that need to be set inside the Bloom Filter.
# bloom = [0] * bloom_len
def bytes_to_int(hash_value):
return int.from_bytes(hash_value, byteorder='big')
def bloom_index(hashint):
return hashint % bloom_len
# while True:
# hash_values = [m1.digest(), m2.digest(), m3.digest()]
hashints = list(map(bytes_to_int, hash_values))
indices = list(map(bloom_index, hashints))
Filtering the Query
Now, we check if the query is new or not. As we already mentioned, Bloom Filter finds if the query is definitely new or probably old. To do this, we check the values present in the filter-list on the indices that we got by hashing the query. If any of the value in the filter at those indices is zero, we can be sure that the query is a new one. If all the values are one then the query is probably old.
If we’re sure that the query is a new one, then we set the values at the indices to one for filtering future queries.
# return hashint % bloom_len
def new_query(indices):
new = False
for i in indices:
if bloom[i] == 0:
new = True
break
return new
def set_bloom(indices):
for i in indices:
bloom[i] = 1
# indices = list(map(bloom_index, hashints))
if new_query(indices):
print("Definitely new query")
set_bloom(indices)
else:
print("Probably old query")
Resetting the Filter
As more and more queries are processed, the values in the filter-list are updated This means that the rate of false positive increase steadily until every the element in the list becomes 1. To prevent this from happening, in our simple implementation, we’ll reset the list if the number of 1s in the list is greater than 80% of the total elements in the list.
# while True:
if fill_percentage() > 80.0:
print("Resetting Bloom Filter-List")
bloom = [0] * bloom_len
# word = str(input("enter a word: ")).encode('utf-8')
Conclusion
You can try the program to check for the output generated after each query by giving in inputs. You can also change the value of bloom_len
to play around with the program. This small implementation’s design is not meant to be used for real systems but it is helpful to learn how Bloom Filter generally works.
Here’s the complete code for the program:
import hashlib
bloom_len = 10
bloom = [0] * bloom_len
def bytes_to_int(hash_value):
return int.from_bytes(hash_value, byteorder='big')
def bloom_index(hashint):
return hashint % bloom_len
def fill_percentage():
return bloom.count(1) / bloom_len * 100;
def new_query(indices):
new = False
for i in indices:
if bloom[i] == 0:
new = True
break
return new
def set_bloom(indices):
for i in indices:
bloom[i] = 1
while True:
if fill_percentage() > 80.0:
print("Resetting Bloom Filter-List")
bloom = [0] * bloom_len
word = str(input("enter a word: ")).encode('utf-8')
m1 = hashlib.md5()
m2 = hashlib.sha1()
m3 = hashlib.sha224()
m1.update(word)
m2.update(word)
m3.update(word)
hash_values = [m1.digest(), m2.digest(), m3.digest()]
hashints = list(map(bytes_to_int, hash_values))
indices = list(map(bloom_index, hashints))
if new_query(indices):
print("Definitely new query")
set_bloom(indices)
else:
print("Probably old query")
External Resources
You can play with the interactive Bloom Filter demonstration by Jason Davies to help visualize the concepts. Wikipedia also has a comprehensive page on Bloom Filter.
References
原文链接:Implementing a Simple Bloom Filter Data Structure in Python
暂无评论内容