Consistent Hashing, Bloom Filters, Skip Lists, B-Trees, LRU/LFU

1. Consistent Hashing

Consistent hashing is a hashing technique that minimizes the need for remapping when a hash table is resized. In traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped. With consistent hashing, only n/m keys need to be remapped on average, where n is the number of keys and m is the number of slots. This is particularly useful in load balancing: the naive approach hashes a blob and takes the result modulo the number of servers to decide where the blob is placed, which reshuffles almost every blob whenever a server is added or removed, whereas consistent hashing places both servers and blobs on a shared hash ring so that only the blobs adjacent to the changed server move. The technique is used in various data storage and distribution systems, including OpenStack’s Object Storage Service Swift, Amazon’s storage system Dynamo, and the Akamai content delivery network.

Consistent hashing is widely used in distributed system design for several use cases, including but not limited to:

  • Load balancing: Consistent hashing allows for easy distribution of load across multiple servers. When a new server is added to the cluster or an existing server is removed, only a small subset of keys need to be remapped to the new or remaining servers, minimizing the impact of the change on the overall system.

  • Caching: Consistent hashing is used to determine which cache node should store a particular key-value pair. Because adding or removing a node invalidates only the keys that map to it, the number of cache misses after a topology change stays small, improving the overall performance of the system.

  • Content delivery: Consistent hashing is used to map content to edge servers in a content delivery network (CDN). This helps ensure that content is served from the server closest to the user, reducing latency and improving the user experience.

  • Database sharding: Consistent hashing is also used to shard data across multiple database servers, allowing for efficient distribution of data and improved performance. In this case, each shard is allocated to a particular server based on its hash value.

Below is a simple Java sketch of a consistent-hash ring, assuming a TreeMap as the sorted ring structure and MD5 for placing points on the ring; the class and method names (ConsistentHashRing, getServer, and so on) are illustrative. Virtual nodes are included because a single point per server distributes load poorly.
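```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Hash ring: each server is placed at several points ("virtual nodes") on a
// 0..2^32 ring; a key is assigned to the first server clockwise from its hash.
public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    public void addServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    public void removeServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(server + "#" + i));
        }
    }

    // Finds the first virtual node at or after the key's hash, wrapping around.
    public String getServer(String key) {
        if (ring.isEmpty()) return null;
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    // MD5-derived ring position; any well-distributed hash function works here.
    private long hash(String input) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            return ((long) (d[3] & 0xFF) << 24) | ((d[2] & 0xFF) << 16)
                    | ((d[1] & 0xFF) << 8) | (d[0] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(100);
        ring.addServer("server-a");
        ring.addServer("server-b");
        ring.addServer("server-c");
        System.out.println("user:42 -> " + ring.getServer("user:42"));
        ring.removeServer("server-b"); // only keys that mapped to server-b move
        System.out.println("user:42 -> " + ring.getServer("user:42"));
    }
}
```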

2. Bloom Filters

A Bloom filter is a probabilistic data structure used to test whether an element is a member of a set. It works by hashing elements and setting the corresponding bits in a bit array. When testing membership, the Bloom filter hashes the element and checks whether all of the corresponding bits in the array are set. If any bit is not set, the element is definitely not in the set. If all of the bits are set, the element may or may not be in the set, since another element (or combination of elements) could have set the same bits.

Bloom filters are highly space efficient, requiring far less memory than structures such as hash tables or binary search trees, and insertion and lookup both take O(k) time, where k is the number of hash functions. The trade-off is false positives: the probability that the filter incorrectly reports an element as present when it is not. This probability can be controlled by adjusting the size of the bit array and the number of hash functions used.
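For reference, with a bit array of m bits, k hash functions, and n inserted elements, the false positive probability is approximately (1 − e^(−kn/m))^k, and it is minimized by choosing k = (m/n) · ln 2 hash functions.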

Bloom filters are commonly used in various applications, including:

  • Spell checkers: Bloom filters can be used to quickly check if a word is in a dictionary, reducing the need to search through the entire dictionary word by word.

  • Network routers and firewalls: Bloom filters can be used to store information about the network traffic, such as the IP addresses of known malware or spam sources, and quickly identify if incoming traffic is from a known source.

  • Databases and search engines: Bloom filters can be used to pre-filter candidate records or documents to avoid expensive disk access and CPU processing.

  • Web browsers: Bloom filters can be used to store information about visited URLs for fast phishing and malware detection.

A simple Java sketch follows. It approximates k independent hash functions with double hashing on top of String.hashCode(); the class name and the mixing constant in the second hash are arbitrary illustrative choices, not a canonical implementation.
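```java
import java.util.BitSet;

// Bloom filter sketch: the k bit positions for an element are derived from
// two base hashes via double hashing (h1 + i*h2), a standard way to
// simulate k hash functions cheaply.
public class BloomFilter {
    private final BitSet bits;
    private final int size;       // m: number of bits
    private final int hashCount;  // k: number of hash functions

    public BloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    public void add(String element) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(element, i));
        }
    }

    // false -> definitely not in the set
    // true  -> possibly in the set (false positives are possible)
    public boolean mightContain(String element) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(element, i))) return false;
        }
        return true;
    }

    private int position(String element, int i) {
        int h1 = element.hashCode();
        int h2 = Integer.reverse(h1) * 0x9E3779B9; // cheap second hash (illustrative)
        return Math.floorMod(h1 + i * h2, size);
    }

    public static void main(String[] args) {
        BloomFilter filter = new BloomFilter(10_000, 4);
        filter.add("apple");
        filter.add("banana");
        System.out.println(filter.mightContain("apple"));  // true
        System.out.println(filter.mightContain("cherry")); // false (almost surely)
    }
}
```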

3. Skip Lists

Skip lists are a type of probabilistic data structure, similar to a multilevel linked list, that can efficiently search, insert, and delete elements in a sorted sequence. They are designed to have an average search, insertion, and deletion time complexity of O(log n), making them a suitable alternative to balanced trees in some cases.

Skip lists allow for efficient searching by maintaining multiple levels of links between nodes: the bottom level is an ordinary sorted linked list containing every element, and each higher level acts as an express lane that skips over progressively more of the nodes below it. Levels are assigned randomly, so each element appears at a given level with a probability that shrinks as the level rises. This randomness keeps the skip list balanced on average, even as elements are added or removed.

Skip lists are used in a variety of applications, including databases, game programming, and network routing. They are particularly well-suited for applications that require fast search times and frequent insertions, and may be a better choice than balanced trees in situations where the cost of rebalancing becomes prohibitive. However, because they rely on randomness, their performance can be difficult to predict precisely, and they may not be as suitable for applications that require very tight control over performance characteristics.

Here is a simple Java sketch supporting search and insert (deletion is omitted for brevity); MAX_LEVEL and the coin-flip promotion probability of 1/2 are conventional but arbitrary choices, and the names are illustrative.
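```java
import java.util.Random;

// Skip list sketch: each node gets a random height; level 0 is a full sorted
// linked list and each higher level skips over more nodes, so a search walks
// top-down and rightward in O(log n) steps on average.
public class SkipList {
    private static final int MAX_LEVEL = 16;
    private final Node head = new Node(Integer.MIN_VALUE, MAX_LEVEL);
    private final Random random = new Random();
    private int level = 0; // highest level currently in use

    private static final class Node {
        final int value;
        final Node[] forward; // forward[i] = next node on level i
        Node(int value, int levels) {
            this.value = value;
            this.forward = new Node[levels + 1];
        }
    }

    // Flip coins: each extra level is granted with probability 1/2.
    private int randomLevel() {
        int lvl = 0;
        while (lvl < MAX_LEVEL - 1 && random.nextBoolean()) lvl++;
        return lvl;
    }

    public boolean contains(int value) {
        Node current = head;
        for (int i = level; i >= 0; i--) {
            while (current.forward[i] != null && current.forward[i].value < value) {
                current = current.forward[i];
            }
        }
        current = current.forward[0];
        return current != null && current.value == value;
    }

    public void insert(int value) {
        Node[] update = new Node[MAX_LEVEL + 1];
        Node current = head;
        for (int i = level; i >= 0; i--) {
            while (current.forward[i] != null && current.forward[i].value < value) {
                current = current.forward[i];
            }
            update[i] = current; // last node before the insertion point on level i
        }
        int newLevel = randomLevel();
        if (newLevel > level) {
            for (int i = level + 1; i <= newLevel; i++) update[i] = head;
            level = newLevel;
        }
        Node node = new Node(value, newLevel);
        for (int i = 0; i <= newLevel; i++) {
            node.forward[i] = update[i].forward[i];
            update[i].forward[i] = node;
        }
    }

    public static void main(String[] args) {
        SkipList list = new SkipList();
        for (int v : new int[]{3, 7, 1, 9, 4}) list.insert(v);
        System.out.println(list.contains(7)); // true
        System.out.println(list.contains(5)); // false
    }
}
```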

4. B-Tree

A B-tree is a self-balancing tree data structure that maintains sorted data and allows efficient search, insertion, and deletion operations in logarithmic time. Unlike self-balancing binary search trees, B-trees can deal with much larger data sets and are commonly used in storage systems such as databases and file systems. They generalize binary search trees by allowing nodes to have more than two children, up to a fixed maximum known as the order, which in turn bounds the number of keys each node can store.

In a B-tree, internal nodes can have a variable number of child nodes within an ordered range, with each node storing keys and pointers to other nodes. The keys within a node are kept in sorted order, and the pointers are used to navigate between nodes during search operations. In a plain B-tree, keys and their associated data may live in internal nodes as well as in leaves; the variant that keeps all data in the leaves is the B+ tree, described below.

B-trees maintain balance, which keeps search operations efficient, through a process of node splitting and merging: when a key is inserted into a full node or removed from an underfull one, the tree rebalances itself so that all leaves stay at the same depth and the height grows only logarithmically with the number of keys.

B-trees have logarithmic time complexity for search, insertion, and deletion operations, which makes them a popular choice for applications that require quick access to large data sets. They are also used extensively in modern database systems to provide fast indexing and querying capabilities.

Below is a minimal Java sketch of B-tree search and insertion along the lines of the classic textbook (CLRS) formulation, with deletion omitted; the minimum degree T = 3 is an arbitrary illustrative choice.
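```java
// Minimal B-tree sketch (search + insertion) with minimum degree T:
// every node holds at most 2T-1 keys, and full nodes are split on the
// way down during insertion so splits never need to propagate upward.
public class BTree {
    private static final int T = 3; // minimum degree (branching factor)

    private static final class Node {
        int n;                                     // number of keys in use
        final int[] keys = new int[2 * T - 1];
        final Node[] children = new Node[2 * T];
        boolean leaf = true;
    }

    private Node root = new Node();

    public boolean search(int key) { return search(root, key); }

    private boolean search(Node node, int key) {
        int i = 0;
        while (i < node.n && key > node.keys[i]) i++;
        if (i < node.n && node.keys[i] == key) return true;
        return !node.leaf && search(node.children[i], key);
    }

    public void insert(int key) {
        if (root.n == 2 * T - 1) {        // root is full: grow the tree upward
            Node newRoot = new Node();
            newRoot.leaf = false;
            newRoot.children[0] = root;
            splitChild(newRoot, 0);
            root = newRoot;
        }
        insertNonFull(root, key);
    }

    private void insertNonFull(Node node, int key) {
        int i = node.n - 1;
        if (node.leaf) {
            while (i >= 0 && key < node.keys[i]) { node.keys[i + 1] = node.keys[i]; i--; }
            node.keys[i + 1] = key;
            node.n++;
        } else {
            while (i >= 0 && key < node.keys[i]) i--;
            i++;
            if (node.children[i].n == 2 * T - 1) {
                splitChild(node, i);
                if (key > node.keys[i]) i++;
            }
            insertNonFull(node.children[i], key);
        }
    }

    // Splits the full child at index i, moving its median key up into parent.
    private void splitChild(Node parent, int i) {
        Node full = parent.children[i];
        Node right = new Node();
        right.leaf = full.leaf;
        right.n = T - 1;
        System.arraycopy(full.keys, T, right.keys, 0, T - 1);
        if (!full.leaf) System.arraycopy(full.children, T, right.children, 0, T);
        full.n = T - 1;
        System.arraycopy(parent.children, i + 1, parent.children, i + 2, parent.n - i);
        parent.children[i + 1] = right;
        System.arraycopy(parent.keys, i, parent.keys, i + 1, parent.n - i);
        parent.keys[i] = full.keys[T - 1];
        parent.n++;
    }

    public static void main(String[] args) {
        BTree tree = new BTree();
        for (int k : new int[]{10, 20, 5, 6, 12, 30, 7, 17}) tree.insert(k);
        System.out.println(tree.search(6));  // true
        System.out.println(tree.search(15)); // false
    }
}
```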

A B+ tree is a variant of the B-tree in which each internal node holds up to a fixed number m of child pointers (an m-way tree) but stores only index keys; all records live in the leaf nodes. The leaves are linked together, so data can be scanned sequentially along the leaf level, which gives B+ trees excellent performance for range queries along with good space utilization. Like the B-tree, a B+ tree is self-balancing, keeping search, insertion, and deletion logarithmic, and it is widely used in databases and file systems where the data resides in external storage.

The sketch below illustrates the two distinguishing B+ tree features just described, index-only internal nodes and linked leaves, using a small hand-built two-level tree; insertion and node splitting are omitted, and all names are illustrative.
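```java
import java.util.ArrayList;
import java.util.List;

// Structural B+ tree sketch: internal nodes hold only router keys, all
// records live in the leaves, and the leaves are linked so a range query
// is one descent plus a sequential scan along the leaf level.
public class BPlusTreeSketch {
    static final class Leaf {
        final int[] keys;
        Leaf next;              // sibling link enabling sequential range scans
        Leaf(int... keys) { this.keys = keys; }
    }

    static final class Internal {
        final int[] routerKeys; // separators only; no records stored here
        final Leaf[] children;
        Internal(int[] routerKeys, Leaf[] children) {
            this.routerKeys = routerKeys;
            this.children = children;
        }
    }

    // Descend once to the starting leaf, then follow sibling links.
    static List<Integer> rangeQuery(Internal root, int from, int to) {
        int i = 0;
        while (i < root.routerKeys.length && from >= root.routerKeys[i]) i++;
        List<Integer> result = new ArrayList<>();
        for (Leaf leaf = root.children[i]; leaf != null; leaf = leaf.next) {
            for (int key : leaf.keys) {
                if (key > to) return result;
                if (key >= from) result.add(key);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Leaf l1 = new Leaf(1, 4), l2 = new Leaf(9, 12), l3 = new Leaf(20, 25);
        l1.next = l2;
        l2.next = l3;
        Internal root = new Internal(new int[]{9, 20}, new Leaf[]{l1, l2, l3});
        System.out.println(rangeQuery(root, 4, 21)); // [4, 9, 12, 20]
    }
}
```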

5. LRU and LFU

LRU (least recently used) and LFU (least frequently used) are the two most common cache eviction policies. In a typical LRU implementation, you maintain a map of key-value pairs and a doubly linked list of keys in the order of their access. Whenever an item is accessed, you move its key to the head of the list; when you need to evict an item due to capacity constraints, you remove the least recently accessed item from the tail of the list.
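A minimal Java sketch of this idea: LinkedHashMap can maintain the access-order list for you when constructed with accessOrder = true, so a small subclass suffices (the class name LruCache is illustrative).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache sketch: an access-ordered LinkedHashMap already maintains the
// recency list described above; overriding removeEldestEntry evicts the
// least recently used entry once capacity is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: get() moves an entry to the tail
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");                     // "a" becomes most recently used
        cache.put("c", 3);                  // evicts "b", the least recently used
        System.out.println(cache.keySet()); // [a, c]
    }
}
```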

An LFU eviction policy can be implemented with a combination of a hash map and a priority queue: you maintain a hash map of key-value pairs and a priority queue of entries ordered by usage frequency. Whenever an entry is accessed, you increment its usage count and update its position in the priority queue; when you need to evict an item due to capacity constraints, you remove the least frequently used item from the head of the priority queue.
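A Java sketch of an LFU cache follows. Rather than the priority queue described above (with Java's PriorityQueue, repositioning an entry is linear-time, since it has no decrease-key operation), this version uses the common frequency-bucket layout, which implements the same policy in O(1) amortized time per operation; all names are illustrative.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;

// LFU cache sketch using frequency buckets: keys with the same usage count
// share an insertion-ordered bucket (so ties break toward least recently
// used), and minFreq tracks which bucket to evict from.
public class LfuCache<K, V> {
    private final int capacity;
    private final Map<K, V> values = new HashMap<>();
    private final Map<K, Integer> counts = new HashMap<>();
    private final Map<Integer, LinkedHashSet<K>> buckets = new HashMap<>();
    private int minFreq = 0;

    public LfuCache(int capacity) { this.capacity = capacity; }

    public V get(K key) {
        if (!values.containsKey(key)) return null;
        touch(key);
        return values.get(key);
    }

    public void put(K key, V value) {
        if (capacity <= 0) return;
        if (values.containsKey(key)) {
            values.put(key, value);
            touch(key);
            return;
        }
        if (values.size() >= capacity) {   // evict from the minimum-frequency bucket
            K evict = buckets.get(minFreq).iterator().next();
            buckets.get(minFreq).remove(evict);
            values.remove(evict);
            counts.remove(evict);
        }
        values.put(key, value);
        counts.put(key, 1);
        buckets.computeIfAbsent(1, f -> new LinkedHashSet<>()).add(key);
        minFreq = 1;
    }

    // Move key from its current frequency bucket to the next one up.
    private void touch(K key) {
        int freq = counts.get(key);
        counts.put(key, freq + 1);
        buckets.get(freq).remove(key);
        if (buckets.get(freq).isEmpty() && freq == minFreq) minFreq++;
        buckets.computeIfAbsent(freq + 1, f -> new LinkedHashSet<>()).add(key);
    }

    public static void main(String[] args) {
        LfuCache<String, Integer> cache = new LfuCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");                     // a: freq 2, b: freq 1
        cache.put("c", 3);                  // evicts b, the least frequently used
        System.out.println(cache.get("b")); // null
        System.out.println(cache.get("a")); // 1
    }
}
```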

The choice between LRU and LFU depends on the specific use case and access patterns of the cache. LRU is typically a good choice when the workload has high locality of reference, i.e., the most recently accessed items are likely to be accessed again in the near future. LFU is a good choice when a small set of items is accessed frequently and should stay in memory at all times. There are also many variations of both, as well as hybrid policies such as LRU-K and Adaptive Replacement Cache (ARC), which combine the benefits of recency- and frequency-based eviction.
