Implementing a Bloom Filter in Go

A Practical Guide

Francisco Escher
ITNEXT


This article presents the concept of a simple Bloom filter and a more scalable one, and implements both in Go. The full code can be seen and used at https://github.com/franciscoescher/gobloom. While the article involves some technical terms and discussions, I will try to explain them in language a wider audience can understand.

Introduction

Imagine you’re at a vast library, with millions of books. You want to know if a particular book is in the library, but checking each shelf is impractical. Wouldn’t it be nice to know beforehand when a book is not there, so you can avoid searching for it? Here’s where a Bloom Filter comes into play — think of it as a pre-filter, a first line of defence in data retrieval. It quickly sifts through massive volumes of data, indicating whether an item is possibly in the set or definitely not, saving you from exhaustive searches through every item.

What is a Bloom Filter? It’s not a physical filter, but a clever data structure. It allows us to check the presence of an element in a set with incredible speed and surprising space efficiency. The beauty of a Bloom Filter lies in its ability to handle massive data sets while occupying minimal space.

False Positives, But No False Negatives: Bloom Filters have an intriguing quirk: they can mistakenly tell you that a non-existent item is in the set (a false positive), but they will never miss an item that is actually there. This unique trait makes them invaluable when speed and space efficiency matter more than absolute accuracy, like internet routers checking IP addresses or spell checkers in word processors.

Why This Post? As a developer, I found the elegance of Bloom Filters captivating. However, implementing them efficiently can be challenging. This post aims to bridge that gap. I’ll take you through a practical implementation of a Bloom Filter in Go, using my own code as a guide.

Understanding the Basics

The Mechanics of a Bloom Filter

  • Hash Functions at Work: A Bloom Filter uses multiple hash functions to map each element to several positions in a bit array. When you add an item, the Bloom Filter sets the bits at these positions to true. This multi-hash approach is what makes Bloom Filters highly space-efficient and quick to query.
  • The Bit Array: At the heart of a Bloom Filter is a simple array of bits (or boolean flags). Initially, all bits are set to false. As elements are added, specific bits determined by the hash functions are set to true. The size of this array and the number of hash functions are critical for the efficiency of the filter.

Probability and Efficiency

  • False Positive Rate: Understanding the false positive rate is crucial. It’s the probability that the Bloom Filter incorrectly indicates the presence of an element. This rate is affected by the size of the bit array and the number of elements added.
  • Calculating Size and Hash Functions: The magic lies in balancing the size of the bit array (m) and the number of hash functions (k). A larger array with more hash functions reduces the false positive rate but requires more memory and computation. There's an optimal balance based on the expected number of elements and the acceptable false positive rate; a quick worked example follows this list.
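
To make this concrete, here is a back-of-the-envelope calculation using the standard formulas m = ceil(-n*ln(p) / (ln 2)^2) and k = ceil((m/n) * ln 2), the same ones we implement later in getOptimalParams. For 10,000 expected elements, note how tightening p from 1% to 0.1% costs roughly 50% more memory and three extra hash functions:

package main

import (
    "fmt"
    "math"
)

// optimalParams applies the standard Bloom filter sizing formulas:
// m = ceil(-n*ln(p) / (ln 2)^2) and k = ceil((m/n) * ln 2).
func optimalParams(n, p float64) (float64, float64) {
    m := math.Ceil(-n * math.Log(p) / math.Pow(math.Log(2), 2))
    k := math.Ceil(m / n * math.Log(2))
    return m, k
}

func main() {
    for _, p := range []float64{0.01, 0.001} {
        m, k := optimalParams(10000, p)
        fmt.Printf("p=%.3f: m=%.0f bits (~%.1f KB), k=%.0f hash functions\n", p, m, m/8/1024, k)
    }
    // Output:
    // p=0.010: m=95851 bits (~11.7 KB), k=7 hash functions
    // p=0.001: m=143777 bits (~17.5 KB), k=10 hash functions
}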

The Trade-Offs

  • Space vs. Accuracy: Bloom Filters offer a trade-off between space and accuracy. They are incredibly space-efficient compared to other data structures like hash tables or lists, but this comes at the cost of occasional false positives.
  • Speed vs. Precision: They provide a quick way to eliminate non-existent elements, speeding up searches significantly. However, for applications where precision is paramount, additional checks might be necessary following a positive result from the Bloom Filter.

Implementation in Go

Let's first define the interfaces:

package gobloom

import "hash"

// Interface defines the core operations of any Bloom filter implementation.
type Interface interface {
    // Add inserts an element into the filter.
    Add([]byte)
    // Test reports whether an element might be in the filter.
    Test([]byte) bool
}

// Hasher produces the hash functions used by a Bloom filter.
type Hasher interface {
    // GetHashes returns n independent 64-bit hash functions.
    GetHashes(n uint64) []hash.Hash64
}

The “Hasher” interface will be used to generate the hash functions for our Bloom filter implementation. This decouples responsibilities and allows us to pass the hasher in as a dependency.

Why would we want to make the hashes configurable? The choice of hash functions is a critical decision that influences the filter's accuracy, efficiency, and overall suitability for a given application. It is a balance between collision rate, uniform distribution, computational efficiency, and the specific nature of the data being handled. The Hasher interface provides the flexibility to experiment with different hash functions and find the optimal setup for a specific use case.

The “Interface” interface defines the core functionality expected from any Bloom Filter implementation: adding an element, and testing whether an element might already be present.
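
This interface also makes the typical usage pattern easy to express: a negative Test lets you skip an expensive lookup entirely, while a positive one still needs confirmation against the authoritative data source (the trade-off discussed above). A minimal sketch, where Lookup and the map-backed store are hypothetical stand-ins:

package cache

import "github.com/franciscoescher/gobloom"

// Lookup consults the Bloom filter before touching the backing store. The
// map stands in for a hypothetical expensive store (a database, a remote
// cache, and so on).
func Lookup(f gobloom.Interface, store map[string][]byte, key []byte) ([]byte, bool) {
    if !f.Test(key) {
        // Definitely absent: skip the expensive lookup entirely.
        return nil, false
    }
    // Possibly present: confirm against the store, since this could be a
    // false positive.
    v, ok := store[string(key)]
    return v, ok
}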

Let’s implement it below:

package gobloom

import (
    "fmt"
    "hash"
    "math"
    "sync"
)

// BloomFilter represents a single Bloom filter structure.
type BloomFilter struct {
    bitSet []bool        // The bit array, represented as a slice of bool
    m      uint64        // The number of bits in the bit set (shortcut for len(bitSet))
    hashes []hash.Hash64 // The hash functions to use
    k      uint64        // The number of hash functions to use (shortcut for len(hashes))
    mutex  sync.Mutex    // Mutex to ensure thread safety
}

// NewBloomFilterWithHasher creates a new Bloom filter sized for the given
// expected number of elements (n) and false positive rate (p), using the
// hash functions provided by h.
func NewBloomFilterWithHasher(n uint64, p float64, h Hasher) (*BloomFilter, error) {
    if n == 0 {
        return nil, fmt.Errorf("number of elements cannot be 0")
    }
    if p <= 0 || p >= 1 {
        return nil, fmt.Errorf("false positive rate must be between 0 and 1")
    }
    if h == nil {
        return nil, fmt.Errorf("hasher cannot be nil")
    }
    m, k := getOptimalParams(n, p)
    return &BloomFilter{
        m:      m,
        k:      k,
        bitSet: make([]bool, m),
        hashes: h.GetHashes(k),
    }, nil
}

// getOptimalParams calculates the optimal parameters for the Bloom filter:
// the number of bits in the bit set (m) and the number of hash functions (k).
func getOptimalParams(n uint64, p float64) (uint64, uint64) {
    m := uint64(math.Ceil(-1 * float64(n) * math.Log(p) / math.Pow(math.Log(2), 2)))
    if m == 0 {
        m = 1
    }
    k := uint64(math.Ceil((float64(m) / float64(n)) * math.Log(2)))
    if k == 0 {
        k = 1
    }
    return m, k
}

The parameters used to instantiate the struct are crucial for calculating the optimal parameters of the Bloom Filter, namely the size of the bit array (m) and the number of hash functions (k). This optimization is based on the number of elements (n) you expect to store and the acceptable false positive rate (p); for instance, getOptimalParams(10000, 0.01) yields m = 95851 and k = 7, matching the back-of-the-envelope calculation earlier. The mutex field makes the filter safe for concurrent use. Here are some considerations on how to choose these parameters:

  • Adjusting to Expected Data: The function tailors the Bloom Filter based on how much data you expect to store in it (n) and how often you're willing to accept false positives (p).
  • Balancing Size and Accuracy: The goal is to find a balance where the Bloom Filter is not too big (which wastes memory) and not too inaccurate (which leads to more false positives).
  • Sizing the Bit Array: The size of the bit array (m) is calculated from the expected number of elements and the desired false positive rate. A larger bit array reduces the chance of false positives but uses more memory; the function aims for a size that's "just right."
  • Choosing the Number of Hash Functions: The number of hash functions (k) determines how the data is spread across the bit array. Too few hash functions might not spread out the data enough, leading to false positives. Too many, and the Bloom Filter slows down because it's doing more work than necessary.

Now let’s implement the Add and Test functions:

// Add adds an item to the Bloom filter.
func (bf *BloomFilter) Add(data []byte) {
    bf.mutex.Lock()
    defer bf.mutex.Unlock()
    for _, h := range bf.hashes {
        h.Reset()
        h.Write(data)
        hashValue := h.Sum64() % bf.m
        bf.bitSet[hashValue] = true
    }
}

// Test checks whether an item might be in the Bloom filter.
func (bf *BloomFilter) Test(data []byte) bool {
    bf.mutex.Lock()
    defer bf.mutex.Unlock()
    for _, h := range bf.hashes {
        h.Reset()
        h.Write(data)
        hashValue := h.Sum64() % bf.m
        if !bf.bitSet[hashValue] {
            return false
        }
    }
    return true
}

The Add function inserts an element into the Bloom Filter: it runs the data through each of the filter's hash functions and sets the corresponding bits to true.

The Test function checks whether an element is possibly in the Bloom Filter by verifying that all of the corresponding hashed bits are set to true in the bitSet. If any of them is not, the element was definitely never added to the filter.
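
One note before we test it: the demo below calls gobloom.NewBloomFilter, a convenience constructor that is not shown above. A minimal sketch of it, assuming it simply wires in the MurMur3 hasher we implement later in this article:

// NewBloomFilter creates a Bloom filter with a default hasher.
// (Sketch: assumes NewMurMur3Hasher, defined later, is the default.)
func NewBloomFilter(n uint64, p float64) (*BloomFilter, error) {
    return NewBloomFilterWithHasher(n, p, NewMurMur3Hasher())
}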

Testing the filter

Here is a small application to test our new filter:

package main

import (
    "fmt"

    "github.com/franciscoescher/gobloom"
)

func main() {
    bf, _ := gobloom.NewBloomFilter(1000, 0.01)
    bf.Add([]byte("foo"))
    bf.Add([]byte("bar"))
    bf.Add([]byte("baz"))

    fmt.Println(bf.Test([]byte("foo"))) // true
    fmt.Println(bf.Test([]byte("bar"))) // true
    fmt.Println(bf.Test([]byte("baz"))) // true
    fmt.Println(bf.Test([]byte("qux"))) // false (almost certainly; false positives are possible)
}

Implementing the hasher

The hasher struct simply needs to return n hash functions. Let's use the github.com/spaolacci/murmur3 package for this:

package gobloom

import (
    "hash"

    "github.com/spaolacci/murmur3"
)

// MurMur3Hasher generates murmur3 hash functions with distinct seeds.
type MurMur3Hasher struct{}

func NewMurMur3Hasher() *MurMur3Hasher {
    return &MurMur3Hasher{}
}

// GetHashes returns n murmur3 hash functions, each seeded with its index.
func (h *MurMur3Hasher) GetHashes(n uint64) []hash.Hash64 {
    hashers := make([]hash.Hash64, n)
    for i := uint64(0); i < n; i++ {
        hashers[i] = murmur3.New64WithSeed(uint32(i))
    }
    return hashers
}

Note: if you want to write your own hasher, make sure to use an algorithm that allows you to seed your hashes. By using different seeds for each hash function, you ensure that each function behaves differently. This diversity in behavior significantly reduces the likelihood of collisions, as the same input will produce different hash outputs in each function.

For a Bloom Filter to be efficient, it’s essential that elements are uniformly distributed across the bit array. A uniform distribution ensures that each bit in the array has an equal probability of being set, minimizing the chances of false positives.

Different seeds help in creating hash functions that spread the elements more evenly across the bit array. If all hash functions were seeded the same, they might exhibit similar distribution patterns, leading to clustering of set bits and thus higher false positives.
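
If you want to build a hasher on top of an algorithm that has no seed parameter (such as the standard library's FNV), one workaround is to fold a per-function seed back into the state on every Reset, since our filter calls Reset before each Write. A purely illustrative sketch; seededFNV and FNVHasher are hypothetical names, not part of the repository:

package gobloom

import (
    "hash"
    "hash/fnv"
)

// seededFNV wraps the (unseeded) stdlib FNV-1a hash so that a per-function
// seed survives Reset, which our filter calls before every Write.
type seededFNV struct {
    hash.Hash64
    seed []byte
}

// Reset restores the base state and immediately re-applies the seed, so two
// wrappers with different seeds keep producing different outputs for the
// same input.
func (s *seededFNV) Reset() {
    s.Hash64.Reset()
    s.Hash64.Write(s.seed)
}

// FNVHasher is a hypothetical Hasher built on seeded FNV-1a wrappers.
type FNVHasher struct{}

func (h *FNVHasher) GetHashes(n uint64) []hash.Hash64 {
    hashers := make([]hash.Hash64, n)
    for i := uint64(0); i < n; i++ {
        hashers[i] = &seededFNV{
            Hash64: fnv.New64a(),
            seed:   []byte{byte(i), byte(i >> 8)},
        }
    }
    return hashers
}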

Scaling the Bloom Filter

If you paid attention, you might have noticed that this implementation is static in capacity: once you set the expected number of elements, you are stuck with it.

What if you are not sure about these parameters and instead want a filter that grows as you use it?

Why Scalable?

Traditional Bloom Filters have a fixed size, which means their efficiency and accuracy are predetermined by their initial configuration. However, in real-world applications, data often grows unpredictably, and a fixed-size filter might either waste space or become less accurate over time. I will present a solution that addresses this by dynamically adjusting its capacity and false positive rate as more elements are added, making it an ideal solution for applications dealing with evolving data sets.

Let’s see how this goes:

Components of the Scalable Bloom Filter

// ScalableBloomFilter combines multiple BloomFilter slices to adapt to a growing number of elements.
type ScalableBloomFilter struct {
    filters  []*BloomFilter // A slice of BloomFilter pointers, representing each layer of the scalable filter
    hasher   Hasher         // The hasher reused to build every layer, so all layers share the same hash family
    n        uint64         // The number of items that have been added
    fpRate   float64        // The base false positive rate of the first filter slice
    fpGrowth float64        // Factor by which the false positive probability should increase for each additional filter slice
}

  • filters: A slice of BloomFilter pointers, each representing a layer in the scalable structure. These layers allow the filter to expand its capacity.
  • hasher: The Hasher used to create every layer, kept so that newly added layers use the same family of hash functions.
  • n: This field tracks the total number of items that have been added across all filter layers.
  • fpRate: The base false positive rate of the first filter layer, dictating the likelihood of the filter incorrectly indicating the presence of an element.
  • fpGrowth: A factor that determines how the false positive probability increases with each additional filter layer.

The logic of the ScalableBloomFilter is designed to handle growing datasets while maintaining a manageable false positive rate. This is achieved by using a series of standard Bloom Filters, each acting as a "layer" within the scalable filter. Here's a breakdown of the logic and how it manages the dynamic nature of the data it stores.

package gobloom

import (
    "errors"
    "fmt"
    "math"
)

// NewScalableBloomFilterWithHasher instantiates a new ScalableBloomFilter.
func NewScalableBloomFilterWithHasher(initialSize uint64, fpRate float64, fpGrowth float64, h Hasher) (*ScalableBloomFilter, error) {
    if initialSize == 0 {
        return nil, errors.New("invalid initial size, must be greater than 0")
    }
    if fpRate <= 0 || fpRate >= 1 {
        return nil, fmt.Errorf("invalid false positive rate, must be between 0 and 1, got %f", fpRate)
    }
    if fpGrowth <= 0 {
        return nil, fmt.Errorf("invalid false positive growth rate, must be greater than 0, got %f", fpGrowth)
    }

    bf, err := NewBloomFilterWithHasher(initialSize, fpRate, h)
    if err != nil {
        return nil, err
    }

    // Return a new scalable Bloom filter struct with the initialized slice and parameters.
    return &ScalableBloomFilter{
        filters:  []*BloomFilter{bf}, // Start with one filter slice
        hasher:   h,                  // Remember the hasher so new slices use the same hash family
        fpRate:   fpRate,             // Set the initial false positive rate
        fpGrowth: fpGrowth,           // Set the growth rate for false positives as the filter scales
        n:        0,                  // Initialize with zero elements added
    }, nil
}

NewScalableBloomFilterWithHasher initializes a new scalable Bloom filter with an estimated initial size, a target false positive rate, and a growth rate for the false positive probability with each new filter slice. It is essential to select these values carefully, based on the use-case requirements, to maintain performance and accuracy as the dataset grows.

Parameters:

  • initialSize (uint64): The estimated number of elements you expect to store in the bloom filter initially. This is not a hard limit but rather a guideline for preparing the initial Bloom filter layer. A size that’s too small could lead to a rapid addition of new slices, increasing memory usage, and a size that’s too large could waste memory upfront.
  • fpRate (float64): The desired false positive probability for the first Bloom filter slice. It determines how likely it is that a ‘Test’ operation falsely indicates the presence of an element. A smaller fpRate increases the number of bits used in the initial filter (decreasing the chance of false positives) but also increases memory consumption. Typical values are between 0.01 and 0.001 (1% to 0.1%).
  • fpGrowth (float64): The growth rate of the false positive probability with each subsequent filter slice. If set to a value greater than 1, each new filter layer tolerates a higher false positive rate, which can be useful to postpone the addition of new layers and control memory usage. Commonly this parameter is set to a modest value (e.g. 1.5 to 2), as a higher growth rate can lead to a rapidly deteriorating false positive rate.
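
For example, a filter sized for roughly 1,000 initial elements, with a 1% base false positive rate and a growth factor of 2, could be created like this (a small sketch combining the constructor above with the MurMur3 hasher from earlier):

package main

import (
    "fmt"

    "github.com/franciscoescher/gobloom"
)

func main() {
    sbf, err := gobloom.NewScalableBloomFilterWithHasher(1000, 0.01, 2.0, gobloom.NewMurMur3Hasher())
    if err != nil {
        panic(err)
    }
    sbf.Add([]byte("foo"))
    fmt.Println(sbf.Test([]byte("foo"))) // true
    fmt.Println(sbf.Test([]byte("qux"))) // false (almost certainly)
}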

Best Practices

  • Conduct a pre-analysis based on expected data growth to estimate initialSize and what fpGrowth rate could be appropriate. Consider both current and future memory availability and access patterns of the data.
  • Create logging or metrics around the scalable Bloom filter behavior, monitoring the number of slices created and memory consumption over time, which can provide insights into usage patterns and need for adjustment.
  • Avoid using very small fpGrowth values, as these will lead to frequent addition of filter slices, compounding memory usage and potentially causing performance bottlenecks as the ‘Test’ function needs to check more slices.
  • In distributed systems where consistency across nodes is necessary, ensure that all instances are initialized with the same parameters and handle the updates to filter slices identically.
  • As with any Bloom filter, using multiple, independent hash functions improves the spread of elements across the bit set. In a Scalable Bloom Filter, these hash functions need to maintain their properties as additional layers are added.

Adding and Testing values

// Add inserts the given item into the scalable Bloom filter.
// If the current filter slice exceeds its capacity, a new slice is added.
func (sbf *ScalableBloomFilter) Add(data []byte) {
    // Add the item to all existing filter slices. Each BloomFilter.Add
    // already applies all of its k hash functions internally.
    for _, filter := range sbf.filters {
        filter.Add(data)
    }

    // Increment the total number of items added across all filter slices.
    sbf.n++

    // Check whether the most recent filter slice has reached its capacity;
    // if so, append a new slice with a relaxed false positive rate.
    currentFilter := sbf.filters[len(sbf.filters)-1]

    // Heuristic capacity estimate for the current slice, derived from its
    // bit-set size and the configured growth factor.
    currentCapacity := float64(currentFilter.m) * math.Log(sbf.fpGrowth) / math.Log(2)

    // If the number of items exceeds the current capacity of the filter:
    if float64(sbf.n) > currentCapacity {
        // Each new slice tolerates a higher false positive rate, scaled by
        // fpGrowth raised to the number of existing slices.
        newFpRate := sbf.fpRate * math.Pow(sbf.fpGrowth, float64(len(sbf.filters)))
        // Create and append the new filter slice, reusing the original hasher.
        // (If newFpRate has grown out of range, the constructor errors and we
        // simply keep the existing slices.)
        nbf, err := NewBloomFilterWithHasher(sbf.n, newFpRate, sbf.hasher)
        if err == nil {
            sbf.filters = append(sbf.filters, nbf)
        }
    }
}

// Test reports whether the given item might be in the scalable Bloom filter.
func (sbf *ScalableBloomFilter) Test(data []byte) bool {
    // Check the item against all filter slices, from the oldest to the newest.
    for _, filter := range sbf.filters {
        // BloomFilter.Test already checks every bit position produced by the
        // slice's k hash functions; if all of them are set, the item is
        // potentially present (subject to that slice's false positive rate).
        if filter.Test(data) {
            return true
        }
        // Otherwise, continue checking the next slice.
    }

    // If no slice had all bits set, the item is definitely not in the set.
    return false
}

Adding Elements (Add Function)

  • Distribution Across Layers: When an element is added, it is added to all existing Bloom Filter layers. This ensures that the element is tracked no matter how much the scalable filter grows.
  • Item Counting: The filter maintains a count of all elements added (n), which is crucial for knowing when to add a new layer.
  • Capacity Check and Expansion:
      • The filter checks if the current (most recent) layer has reached its capacity to maintain the desired false positive rate.
      • If the current layer is deemed full (based on currentCapacity, which considers the fpGrowth), a new Bloom Filter layer is created and added to the filter.
      • This new layer has a higher false positive rate, calculated from the original fpRate and the number of existing layers. The fpGrowth factor determines how much the false positive rate increases with each new layer.

Testing for Elements (Test Function)

  • Layer-by-Layer Testing: To check if an element is in the scalable filter, the Test function checks the element against each Bloom Filter layer, starting from the oldest.
  • Early Exit on Positive Find: If any layer reports that the element might be present (all bits corresponding to the element’s hash are set), the function immediately returns true.
  • Definitive Absence: If none of the layers indicate the element’s presence, the function returns false, confirming that the element is definitely not in the filter.

Managing False Positives

  • Growth Factor: The fpGrowth parameter controls how the false positive rate increases with each new layer. As more layers are added, the likelihood of false positives increases, but this is the trade-off for handling more elements without significantly increasing memory usage per layer. Since an item is reported present if any layer matches, the overall false positive rate is roughly the sum of the per-layer rates: with a base rate of 1% and fpGrowth = 2, three layers give approximately 1% + 2% + 4% ≈ 7%.
  • Balancing Act: This approach is a balancing act between maintaining a low false positive rate and accommodating an increasing number of elements. The filter allows for the dataset to grow while ensuring that the probability of false positives remains within an acceptable range.

Some use cases

  1. Network Security: In cybersecurity, Bloom Filters can quickly check if an IP address or domain is part of a blacklist, efficiently filtering out malicious traffic.
  2. Web Caching Mechanisms: Web browsers or proxy servers can use Bloom Filters to determine if a web resource is cached, reducing unnecessary network requests and speeding up web browsing.
  3. Database Query Optimization: Databases can employ Bloom Filters to check if a query result is stored in a cache, reducing the load on the database server by avoiding redundant queries.
  4. Spell Checking: Word processors and text editing software can use Bloom Filters to quickly check if a word is in a dictionary, offering a fast way to identify misspelled words.
  5. Sybil Attack Prevention in P2P Networks: In peer-to-peer (P2P) networks, Bloom Filters can help detect and prevent Sybil attacks by efficiently checking if a node is already part of the network.
  6. Blockchain and Cryptocurrency: Blockchain applications, like Bitcoin, use Bloom Filters to efficiently synchronize and verify transactions without downloading the entire blockchain.
  7. Anti-Fraud Systems: Financial institutions can use Bloom Filters to screen transactions or account access requests against known fraud patterns or compromised account lists.
  8. Email Spam and Phishing Protection: Email servers can utilize Bloom Filters to filter out spam or phishing emails by checking incoming messages against a list of known spam characteristics.
  9. Bioinformatics: In genome sequencing, Bloom Filters assist in quickly querying massive DNA databases, helping identify gene sequences without the need for exhaustive database searches.
  10. Distributed Systems and Big Data: In large-scale distributed systems, Bloom Filters can efficiently manage data distribution and replication, ensuring that operations like data sharding and load balancing are conducted optimally.

Conclusion

We’ve delved into the intricacies of implementing both a simple and a scalable Bloom Filter in Go. These filters, while conceptually straightforward, embody a powerful approach to managing data efficiently in scenarios where traditional data structures might falter due to size constraints or performance requirements.

I invite readers to not only understand the theory but also to experiment with these implementations. The world of data is vast and ever-changing, and tools like Bloom Filters are key to navigating this landscape with confidence and skill.

Happy coding!
