Topics 6
Last updated
Last updated
How to avoid crawling duplicate URLs at Google scale?
Option 1: Use a Set data structure to check if a URL already exists or not. Set is fast, but it is not space-efficient.
Option 2: Store URLs in a database and check if a new URL is in the database. This can work but the load to the database will be very high.
Option 3: 𝐁𝐥𝐨𝐨𝐦 𝐟𝐢𝐥𝐭𝐞𝐫. This option is preferred. Bloom filter was proposed by Burton Howard Bloom in 1970. It is a probabilistic data structure, that is used to test whether an element is a member of a set.
🔹 false: the element is definitely not in the set.
🔹 true: the element is probably in the set.
False-positive matches are possible, but false negatives are not.
The diagram below illustrates how the Bloom filter works. The basic data structure for the Bloom filter is Bit Vector. Each bit represents a hashed value.
Step 1: To add an element to the bloom filter, we feed it to 3 different hash functions (A, B, and C) and set the bits at the resulting positions. Note that both “www.myweb1.com” and “www.myweb2.com” mark the same bit with 1 at index 5. False positives are possible because a bit might be set by another element.
Step 2: When testing the existence of a URL string, the same hash functions A, B, and C are applied to the URL string. If all three bits are 1, then the URL may exist in the dataset; if any of the bits is 0, then the URL definitely does not exist in the dataset.
Hash function choices are important. They must be uniformly distributed and fast. For example, RedisBloom and Apache Spark use murmur, and InfluxDB uses xxhash.
Question - In our example, we used three hash functions. How many hash functions should we use in reality? What are the trade-offs?
This is the flowchart of how slack decides to send a notification.
It is a great example of why a simple feature may take much longer to develop than many people think.
When we have a great design, users may not notice the complexity because it feels like the feature just working as intended.
What’s your takeaway from this diagram?
Image source: Slack Eng blog
How does Amazon build and operate software?
In 2019, Amazon released The Amazon Builders' Library. It contains architecture-based articles that describe how Amazon architects, releases, and operates technology.
As of today, it published 26 articles. It took me two weekends to go through all the articles. I’ve had great fun and learned a lot. Here are some of my favorites:
🔹Making retries safe with idempotent APIs
🔹Timeouts, retries, and backoff with jitter
🔹Beyond five 9s: Lessons from our highest available data planes
🔹Caching challenges and strategies
🔹Ensuring rollback safety during deployments
🔹Going faster with continuous delivery
🔹Challenges with distributed systems
🔹Amazon's approach to high-availability deployment
How to design secure web API access for your website?
When we open web API access to users, we need to make sure each API call is authenticated. This means the user must be who they claim to be.
In this post, we explore two common ways:
Token based authentication
HMAC (Hash-based Message Authentication Code) authentication
The diagram below illustrates how they work.
Token based
Step 1 - the user enters their password into the client, and the client sends the password to the Authentication Server.
Step 2 - the Authentication Server authenticates the credentials and generates a token with an expiry time.
Steps 3 and 4 - now the client can send requests to access server resources with the token in the HTTP header. This access is valid until the token expires.
HMAC based
This mechanism generates a Message Authentication Code (signature) by using a hash function (SHA256 or MD5).
Steps 1 and 2 - the server generates two keys, one is Public APP ID (public key) and the other one is API Key (private key).
Step 3 - we now generate a HMAC signature on the client side (hmac A). This signature is generated with a set of attributes listed in the diagram.
Step 4 - the client sends requests to access server resources with hmac A in the HTTP header.
Step 5 - the server receives the request which contains the request data and the authentication header. It extracts the necessary attributes from the request and uses the API key that’s stored on the server side to generate a signature (hmac B.)
Steps 6 and 7 - the server compares hmac A (generated on the client side) and hmac B (generated on the server side). If they are matched, the requested resource will be returned to the client.
Question - How does HMAC authentication ensure data integrity? Why do we include “request timestamp” in HMAC signature generation?
How do microservices collaborate and interact with each other?
There are two ways: orchestration and choreography
The diagram below illustrates the collaboration of microservices.
Choreography is like having a choreographer set all the rules. Then the dancers on stage (the microservices) interact according to them. Service choreography describes this exchange of messages and the rules by which the microservices interact.
Orchestration is different. The orchestrator acts as a center of authority. It is responsible for invoking and combining the services. It describes the interactions between all the participating services. It is just like a conductor leading the musicians in a musical symphony. The orchestration pattern also includes the transaction management among different services.
The benefits of orchestration:
Reliability - orchestration has built-in transaction management and error handling, while choreography is point-to-point communications and the fault tolerance scenarios are much more complicated.
Scalability - when adding a new service into orchestration, only the orchestrator needs to modify the interaction rules, while in choreography all the interacting services need to be modified.
Some limitations of orchestration:
Performance - all the services talk via a centralized orchestrator, so latency is higher than it is with choreography. Also, the throughput is bound to the capacity of the orchestrator.
Single point of failure - if the orchestrator goes down, no services can talk to each other. To mitigate this, the orchestrator must be highly available.
Real-world use case: Netflix Conductor is a microservice orchestrator and you can read more details on the orchestrator design.
Deploying or upgrading services is risky. In this post, we explore risk mitigation strategies.
The diagram below illustrates the common ones.
Multi-Service Deployment
In this model, we deploy new changes to multiple services simultaneously. This approach is easy to implement. But since all the services are upgraded at the same time, it is hard to manage and test dependencies. It’s also hard to rollback safely.
Blue-Green Deployment
With blue-green deployment, we have two identical environments: one is staging (blue) and the other is production (green). The staging environment is one version ahead of production. Once testing is done in the staging environment, user traffic is switched to the staging environment, and the staging becomes the production. This deployment strategy is simple to perform rollback, but having two identical production quality environments could be expensive.
Canary Deployment
A canary deployment upgrades services gradually, each time to a subset of users. It is cheaper than blue-green deployment and easy to perform rollback. However, since there is no staging environment, we have to test on production. This process is more complicated because we need to monitor the canary while gradually migrating more and more users away from the old version.
A/B Test
In the A/B test, different versions of services run in production simultaneously. Each version runs an “experiment” for a subset of users. A/B test is a cheap method to test new features in production. We need to control the deployment process in case some features are pushed to users by accident.
Over to you - Which deployment strategy have you used? Did you witness any deployment-related outages in production and why did they happen?
One picture is worth more than a thousand words. In this post, we will take a look at how to design Google Docs
1️⃣ Clients send document editing operations to the WebSocket Server.
2️⃣ The real-time communication is handled by the WebSocket Server.
3️⃣ Documents operations are persisted in the Message Queue.
4️⃣ The File Operation Server consumes operations produced by clients and generates transformed operations using collaboration algorithms.
5️⃣ Three types of data are stored: file metadata, file content, and operations.
One of the biggest challenges is real-time conflict resolution. Common algorithms include:
🔹 Operational transformation (OT)
🔹 Differential Synchronization (DS)
🔹 Conflict-free replicated data type (CRDT)
Google Doc uses OT according to its Wikipedia page and CRDT is an active area of research for real-time concurrent editing.
Over to you - Have you encountered any issues while using Google Docs? If so, what do you think might have caused the issue?
Interesting read: Software Architecture and Design InfoQ Trends Report — April 2022
Key takeaways: “Data plus architecture" is the idea that, more frequently, software architecture is adapting to consider data. This holistically includes data quality, data pipelines, and traceability to understand how data influenced decisions and AI models.
Innovative software architecture is facilitating data quality the way we’ve improved code quality. Catching bad data early is as important as catching bugs early.
The practice of software architecture does not belong solely to people with the job title of architect. Every engineer can actively participate in the architecture, and architects should help facilitate that process.
One positive benefit of the pandemic and the shift to remote and hybrid work is increased asynchronous communication, which can manifest as Architecture Decision Records (ADRs).
Software architects are adapting their feedback loops, which can be challenging when dealing with colleagues across many time zones or other remote work constraints. Good architects are learning from distributed working how to design better distributed systems.“ [1]
To better understand this question, let’s first take a look at what is a Program. A Program is an executable file containing a set of instructions and passively stored on disk. One program can have multiple processes. For example, the Chrome browser creates a different process for every single tab.
A Process means a program is in execution. When a program is loaded into the memory and becomes active, the program becomes a process. The process requires some essential resources such as registers, program counter, and stack.
A Thread is the smallest unit of execution within a process.
The following process explains the relationship between program, process, and thread.
The program contains a set of instructions.
The program is loaded into memory. It becomes one or more running processes.
When a process starts, it is assigned memory and resources. A process can have one or more threads. For example, in the Microsoft Word app, a thread might be responsible for spelling checking and the other thread for inserting text into the doc.
Main differences between process and thread:
🔹 Processes are usually independent, while threads exist as subsets of a process.
🔹 Each process has its own memory space. Threads that belong to the same process share the same memory.
🔹 A process is a heavyweight operation. It takes more time to create and terminate.
🔹 Context switching is more expensive between processes.
🔹 Inter-thread communication is faster for threads.
Over to you:
1). Some programming languages support coroutine. What is the difference between coroutine and thread?
2). How to list running processes in Linux?