Count-min Sketch, Hierarchial timing, Operational transform, Last write Wins, Vector clocks

26. Count-min Sketch

Count-min Sketch is a probabilistic data structure used for estimating the frequency of events in a stream of data . It is similar to HyperLogLog in that it uses significantly less memory compared to exact methods, and provides estimated counts with high accuracy. Count-min Sketch uses a hash table with multiple hash functions to map events to a set of counters. Each event is hashed multiple times, and the corresponding counters are incremented. To estimate the frequency of an event, the hash functions are used to retrieve the counters associated with the event, and then the minimum value is returned. Count-min Sketch is frequently used in big data and streaming applications due to its high scalability and accuracy, and can be used for applications such as spam filtering, network monitoring, and intrusion detection, etc.

27. Hierarchical timing wheels

Hierarchical timing wheels is a technique used for implementing timers in software systems. It is based on the idea of using multiple timing wheels with different granularities to handle timer events 12. The timers are organized into multiple wheels, with each wheel handling a range of timer intervals. A higher-level wheel manages a larger range of timer intervals than a lower-level wheel. When a timer fires, the corresponding event is processed by the lowest-level wheel. If the timer interval falls outside the range of the lowest-level wheel, it is automatically promoted to the next higher-level wheel. This process continues until the timer interval is accommodated by one of the wheels in the hierarchy. Hierarchical timing wheels are commonly used in event-driven systems, and have advantages over other timer implementations such as accuracy, scalability, and avoidance of unnecessary overheads.

28. Operational transformation (OT)

Operational transformation (OT) is a technology used to support collaborative functionalities in advanced collaborative software systems . Originally developed for consistency maintenance and concurrency control in collaborative text document editing, OT has since been expanded to include capabilities such as group undo, locking, conflict resolution, operation notification and compression, group-awareness, HTML/XML and tree-structured document editing, among others. OT uses algorithms to ensure that concurrent operations on shared data result in consistent and predictable outcomes. It has numerous applications in real-time collaborative software systems, such as Google Docs and Microsoft Office, and is used for minimizing latency while preserving document consistency.

The basic idea of OT is to transform operations on a shared object so that they can be applied in a consistent and deterministic way, even when they are applied concurrently by multiple users. In other words, OT ensures that every user sees a consistent view of the shared object, even when there are multiple conflicting updates.

The OT algorithm works by transforming each operation based on the previous operations that have been applied to the shared object. When a user makes an update, their operation is transformed based on the previous operations that have been applied to the shared object. This ensures that the update is applied correctly, even if other users have made conflicting updates.

The OT algorithm is typically implemented using a data structure called an operation history, which stores all the operations that have been applied to the shared object. The history is used to transform new operations and to resolve conflicts between concurrent updates.

29. Last Write Wins (LWW)

Last Write Wins (LWW) is a conflict resolution strategy used in distributed systems. When multiple nodes or clients attempt to update the same data at the same time, conflicts can arise. LWW is a simple approach to handling such conflicts, where the system chooses the update with the most recent timestamp or version number as the winner and discards the other updates.

In LWW, each update is timestamped or given a version number, and when a conflict arises, the system compares the timestamps or version numbers of the conflicting updates. The update with the latest timestamp or version number is chosen as the winner, and its value is used to update the data in the system. The other updates are discarded, and their values are lost.

LWW is easy to implement and works well in situations where conflicts are rare or where losing updates is acceptable. However, LWW can lead to data loss and inconsistencies when multiple updates are made to the same data in quick succession or when timestamps or version numbers are not synchronized correctly across the distributed system.

LWW is often used in distributed databases and other systems where data consistency is not critical or where conflicts are infrequent. In more complex scenarios, such as collaborative editing or real-time collaboration, more sophisticated conflict resolution strategies, such as Operational Transformation (OT), may be required to ensure data integrity and consistency.

Operational Transformation (OT) and Last Write Wins (LWW) are two techniques used to handle conflicts in distributed systems, but they have different approaches and trade-offs.

30. Vector clocks

Vector clocks are a mechanism used in distributed systems to track causality and ordering of events between different nodes. In a distributed system, events may occur concurrently on different nodes, and it can be challenging to determine the ordering of these events. Vector clocks are a way to track the relative ordering of events and detect causality between them.

In a vector clock, each node maintains a vector of integers that represent the number of events that have occurred on that node. When a node generates an event, it increments its own component in the vector. When a node receives an event from another node, it updates its vector to include the maximum value of each component from both vectors. This way, the vector clock of each node reflects the relative ordering of events between all nodes in the system.

Vector clocks are useful in many applications, including distributed databases, distributed file systems, and distributed consensus algorithms. They are used to determine whether two events are causally related, whether a node has seen all events from another node, and whether a node’s state is consistent with the state of other nodes. By maintaining vector clocks and comparing them across nodes, distributed systems can ensure correct ordering of events and maintain consistency across different nodes.

Some specific uses of vector clocks in distributed systems include:

Conflict detection and resolution: Vector clocks can be used to detect and resolve conflicts when multiple nodes concurrently modify the same data. By comparing the vector clocks associated with the conflicting updates, a system can determine whether the updates are causally related and which update should be applied.
Consistency maintenance: Vector clocks can be used to ensure that all nodes in a distributed system have an up-to-date view of the system state. By comparing the vector clocks associated with different versions of data, a system can determine whether a node has seen all updates to the data and can safely apply those updates.
Distributed consensus: Vector clocks can be used in consensus algorithms, such as Paxos or Raft, to ensure that all nodes agree on the ordering of events and decisions. By exchanging and comparing vector clocks, nodes can ensure that they have seen all relevant events before making a decision.

Implementing vector clocks typically involves assigning a vector of integer counters to each node in the distributed system. Whenever a node generates an event, it increments its own counter in the vector, and the updated vector is propagated to other nodes along with the event. When a node receives an event and its associated vector clock, it updates its own vector clock to include the maximum value of each component from both vectors.

PreviousGlobal state collection, Gossip, Replica management, Self-stabilization, HyperLoglog NextSystems design

Last updated 1 year ago