Data stored in a database is often useful to many other data systems, such as analytics and AI systems. If we have thousands of data systems, do we have to write thousands of converters?
The answer is NO. Change data capture (CDC) is a process that can solve the problem. This is how CDC works:
Step 1 - Data is written to the database normally.
Step 2 - The database uses its transaction log to record the modifications.
Step 3 - The CDC software uses a source connector to connect to the database and read the transaction log.
Step 4 - The source connector publishes the log entries to a message queue.
Step 5 - The CDC software uses its sink connector to consume the log entries.
Step 6 - The sink connector writes the log content to the destination.
All these operations except step 1 are transparent to the user. Popular CDC solutions, such as Debezium, have connectors for most databases, such as MySQL, PostgreSQL, DB2, Oracle, etc. We only need to set up the CDC link between two databases and the data will automatically flow to the destination.
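As a purely illustrative sketch, this is roughly how a Debezium MySQL source connector could be registered with a Kafka Connect cluster over its REST API. The host names, credentials, connector name, and table list below are placeholders, and the option names follow recent Debezium releases.

```python
# Sketch: register a Debezium MySQL source connector via the Kafka Connect
# REST API. All hosts, credentials, and names are placeholders.
import json
import requests

connector = {
    "name": "orders-cdc",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.internal",   # placeholder host
        "database.port": "3306",
        "database.user": "cdc_user",              # placeholder credentials
        "database.password": "cdc_password",
        "database.server.id": "5400",
        "topic.prefix": "shop",                   # prefix for the Kafka topics Debezium writes to
        "table.include.list": "shop.orders",      # capture only the orders table
    },
}

# Kafka Connect exposes a REST endpoint (port 8083 by default) for managing connectors.
resp = requests.post(
    "http://connect.internal:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```

A matching sink connector on the destination side completes the pipeline, so no hand-written converter is needed for each downstream system.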
Over to you: can we use CDC for NoSQL/NewSQL data systems, such as Redis, Cassandra, MongoDB, ElasticSearch, etc?
Some failures are transient, which makes them good candidates for retrying the request.
If an application detects a failure when it tries an operation, it can handle the failure using the following strategies:
🔹 Cancel: the client can cancel the request.
🔹 Immediate retry: client immediately resends a request.
🔹 Fixed intervals: the client waits a fixed amount of time between the failed request and each new retry attempt.
🔹 Incremental intervals: client waits for a short time for the first retry, and then incrementally increases the time for subsequent retries.
🔹 Exponential backoff: double the waiting time between retries after each failed retry. For example, when a request fails for the first time, we retry after 1 second; if it fails a second time, we wait 2 seconds before the next retry; if it fails a third time, we wait 4 seconds before another retry.
🔹 Exponential backoff with jitter: if all the failed calls back off at the same time, they cause contention or overload again when they retry. Jitter adds some amount of randomness to the backoff to spread out the retries.
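Below is a minimal sketch of exponential backoff with full jitter. The `call` argument, attempt limit, and delay caps are illustrative placeholders rather than values from the text.

```python
# Sketch: retry with exponential backoff and full jitter.
# `call`, the attempt limit, and the delay caps are illustrative placeholders.
import random
import time

def retry_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0):
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Exponential backoff: 1s, 2s, 4s, ... capped at max_delay.
            backoff = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random amount in [0, backoff) to spread retries out.
            time.sleep(random.uniform(0, backoff))
```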
Issues
Retry is not perfect. It can cause issues such as overloading the system, executing the same operation multiple times, and amplifying a problem by creating a storm of requests.
Rate limiting and circuit breaker patterns are commonly used to limit the load and avoid service overload.
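As a rough illustration of the circuit breaker pattern (a sketch only; the thresholds and class shape are made up for this example), the idea is to stop sending requests to a failing dependency for a cool-down period instead of hammering it with retries:

```python
# Sketch of a minimal circuit breaker: after too many consecutive failures
# the breaker "opens" and rejects calls until a cool-down period has passed.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: request rejected")
            self.opened_at = None  # half-open: let one trial request through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```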
This is just an overview of retry patterns. More to come later.
Evolution of the Netflix API architecture.
The Netflix API architecture went through 4 main stages.
Monolith. The application is packaged and deployed as a monolith, such as a single Java WAR file, Rails app, etc. Most startups begin with a monolith architecture.
Direct access. In this architecture, a client app can make requests directly to the microservices. With hundreds or even thousands of microservices, exposing all of them to clients is not ideal.
Gateway aggregation layer. Some use cases span multiple services, so we need a gateway aggregation layer. Imagine the Netflix app needs 3 APIs (movie, production, talent) to render the frontend. The gateway aggregation layer makes this possible.
Federated gateway. As the number of developers grew and domain complexity increased, developing the API aggregation layer became increasingly harder. GraphQL federation allows Netflix to set up a single GraphQL gateway that fetches data from all the other APIs.
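To make the gateway aggregation idea from the third stage concrete, here is a rough sketch of an aggregation endpoint that fans out to three hypothetical backend services concurrently and merges the results. The service URLs and field names are invented for illustration; this is not Netflix's implementation.

```python
# Sketch of a gateway aggregation endpoint: fan out to three backend
# services concurrently and merge their responses into a single payload.
# The service URLs are hypothetical placeholders.
import asyncio
import aiohttp

SERVICES = {
    "movie": "http://movie-service/api/movie/42",
    "production": "http://production-service/api/production/42",
    "talent": "http://talent-service/api/talent/42",
}

async def fetch_json(session, url):
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.json()

async def aggregate():
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            *(fetch_json(session, url) for url in SERVICES.values())
        )
    # The client gets one merged response instead of making three round trips.
    return dict(zip(SERVICES.keys(), results))

if __name__ == "__main__":
    print(asyncio.run(aggregate()))
```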
Over to you - why do you think Netflix uses GraphQL instead of REST?
There is an exciting class of storage software like ScyllaDB and Redpanda that boasts at least an order of magnitude improvement in performance compared to Apache Cassandra and Apache Kafka, respectively.
They take full advantage of some of the explosive trends in the last decade in computer architecture. What are these trends?
When Apache Cassandra came out around the late 2000s, AWS EC2 instances with a few physical cores and 64GB of RAM were considered high end.
When Apache Kafka came out in the early 2010s, an SSD was about 30 times more expensive per GB than spinning disks.
What happened in the ensuing decade?
We can now rent an AWS EC2 instance with 36 physical cores, 512 GB of RAM, and 15 TB of NVMe SSD storage. Network bandwidth of 25 Gbps is commonplace, with some instances supporting 100 Gbps. An NVMe SSD drive is about 100 times faster than a spinning disk from a decade ago.
In order to take full advantage of these advances, high performance software requires new designs.
This new class of storage software takes full advantage of these improvements with the following fundamental architectural decisions.
First, they all use shared-nothing architecture. In this architecture, each request is serviced by a single core, and each thread is pinned to a core. Instead of sharding at the server level, we can think of this as sharding at the CPU core level. There is no memory contention between cores, and the use of locks is practically eliminated.
Also, this architecture recognizes the high cost of traditional threading models. At the high core count of modern servers, context switching is extremely costly, with large thread stacks polluting the caches and slowing everything down.
To complement the shared-nothing architecture, an asynchronous programming model is used throughout. Async networking was already common in the previous generation of storage software; with this class of software, everything is asynchronous, including file I/O and even communication between CPU cores.
They run their own co-operative scheduler, instead of relying on the general purpose kernel scheduler. ScyllaDB and Redpanda use the same underlying C++ library called Seastar for the implementation of shared-nothing architecture and asynchronous operations.
These two design choices together allow this class of software to fully utilize CPU, memory, and I/O resources of modern servers.
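The core-level sharding described above can be illustrated with a toy sketch. This is a conceptual Python analogy, not how Seastar (a C++ framework) works internally: each worker owns one shard of the data, is pinned to one core, and receives only the keys that hash to it, so no locks are needed.

```python
# Toy illustration of shared-nothing, core-level sharding: each worker
# process is pinned to one CPU core and owns one private shard of the data.
# Requests are routed to the owning core by hashing the key, so shards never
# share state and need no locks. (os.sched_setaffinity is Linux-only.)
import os
import multiprocessing as mp

NUM_CORES = os.cpu_count() or 1

def shard_worker(core_id, inbox):
    os.sched_setaffinity(0, {core_id})  # pin this process to a single core
    store = {}                          # this shard's private data, no locking
    for key, value in iter(inbox.get, None):
        store[key] = value
    print(f"core {core_id} owns {len(store)} keys")

def owning_core(key):
    return hash(key) % NUM_CORES        # each key belongs to exactly one core

if __name__ == "__main__":
    inboxes = [mp.Queue() for _ in range(NUM_CORES)]
    workers = [mp.Process(target=shard_worker, args=(i, inboxes[i]))
               for i in range(NUM_CORES)]
    for w in workers:
        w.start()
    for i in range(10_000):
        key = f"user:{i}"
        inboxes[owning_core(key)].put((key, i))
    for inbox in inboxes:
        inbox.put(None)                 # shutdown signal
    for w in workers:
        w.join()
```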
Second, this new class of software keeps the external interface the same as the previous generation but re-implements everything under the hood in a low-level language. Both ScyllaDB and Redpanda are written in C++. There is no JVM, and there is no production tuning for garbage collection. The tail latency is low and very predictable as workloads scale.
Third, instead of relying on the kernel to handle file I/O and page cache, this new class of software handles their own I/O and caching. While the kernel is a very capable general purpose operating system, operating at this level of performance requires controlling everything. This includes caching, file I/O, and task scheduling.
What is the drawback of this new class of software? Performance does not come for free. The level of complexity of this class of software is higher than the ones from the previous generation. C++ is already difficult to program in. The asynchronous programming model enforced by Seastar makes it even harder to reason about.
Having their own co-operative scheduler means taking full responsibility for managing long-running tasks. It is challenging to ensure that every task completes as quickly as possible. Any latency impact from errant tasks could be felt throughout the entire stack.
What do WhatsApp, Discord, and Facebook Messenger have in common? Authored by Sahn Lam.
They are all powered by Erlang. What is so special about Erlang that makes it the technology choice behind these popular chat services?
Erlang was developed in the 80’s by Ericsson for telecom switches that demanded extreme reliability. It is upon this rock-solid foundation that these modern chat services were built.
When people talk about Erlang, they are usually referring to the entire ecosystem called Erlang/OTP. It consists of the BEAM virtual machine, the OTP runtime and libraries, and the Erlang language itself. They work together to provide a highly fault-tolerant programming environment.
There are several secret ingredients that make Erlang special.
Erlang processes are extremely lightweight. It is fast and cheap to create an Erlang process. A big machine could have millions of these processes.
Erlang processes are isolated. They communicate with each other only through messages. It is easy to send a message to another process, whether it is on the same machine or a different machine. This makes it easy to scale an Erlang application either horizontally by adding more machines, or vertically by adding more cores.
Erlang’s “let it crash” design principle provides a unique solution to fault tolerance. Erlang implementers viewed software and hardware faults as inevitable. Erlang implements a concept called supervisor tree to allow Erlang applications to recover from these faults quickly and reliably. A supervisor monitors its child processes and decides how to recover when a fault occurs. This is central to any well-designed Erlang application.
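The supervision idea can be sketched outside of Erlang as well. The following is a rough Python analogy, not Erlang/OTP code: a supervisor watches a child worker process and restarts it whenever it crashes, in the spirit of "let it crash".

```python
# Rough analogy of an Erlang-style supervisor (not OTP code): the supervisor
# watches a child process and restarts it whenever it crashes.
import multiprocessing as mp
import random
import time

def worker():
    # A worker that occasionally crashes, simulating an inevitable fault.
    while True:
        time.sleep(1)
        if random.random() < 0.2:
            raise RuntimeError("simulated fault")

def supervisor(max_restarts=5):
    restarts = 0
    while restarts <= max_restarts:
        child = mp.Process(target=worker)
        child.start()
        child.join()                    # returns when the child exits or crashes
        if child.exitcode == 0:
            return                      # clean exit, nothing to recover
        restarts += 1
        print(f"child crashed (exit code {child.exitcode}), restart #{restarts}")
    print("restart limit reached, escalating the failure")

if __name__ == "__main__":
    supervisor()
```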
There are now new languages built on top of the Erlang/OTP foundation. These languages improve the developer experience while relying on the rock-solid fault-tolerant foundation that is Erlang/OTP.
Besides chat apps, Erlang is used in many other mission-critical applications. Can you name any other well-known Erlang applications?
What is ELK Stack and why is it so popular for log management?
The ELK Stack is composed of three open-source products. ELK stands for Elasticsearch, Logstash, and Kibana.
🔹 Elasticsearch is a full-text search and analytics engine that uses the Apache Lucene library at its core.
🔹 Logstash collects data from all kinds of edge collectors, then transforms that data and sends it to various destinations for further processing or visualization.
To scale edge data ingestion, a new product, Beats, was later developed: a family of lightweight agents installed on edge hosts to collect and ship logs to Logstash.
🔹 Kibana is a visualization layer with which users analyze and visualize the data.
The diagram below shows how ELK Stack works:
Step 1 - Beats collects data from various data sources. For example, Filebeat and Winlogbeat work with logs, and Packetbeat works with network traffic.
Step 2 - Beats sends data to Logstash for aggregation and transformation. If we work with massive data, we can add a message queue (Kafka) to decouple the data producers and consumers.
Step 3 - Logstash writes data into Elasticsearch for data indexing and storage.
Step 4 - Kibana builds on top of Elasticsearch and provides users with various search tools and dashboards with which to visualize the data.
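To give a feel for step 3, here is a minimal sketch of indexing a single log document into Elasticsearch over its REST API. The host, index name, and fields are placeholders; in an ELK deployment, Logstash performs this step for you.

```python
# Sketch: index one log document into Elasticsearch via its REST API.
# Host, index name, and document fields are placeholders; in a real ELK
# pipeline, Logstash does this automatically.
import json
import requests

log_doc = {
    "@timestamp": "2023-01-01T12:00:00Z",
    "level": "ERROR",
    "service": "checkout",
    "message": "payment gateway timed out",
}

resp = requests.post(
    "http://localhost:9200/app-logs/_doc",   # POST /<index>/_doc indexes a document
    headers={"Content-Type": "application/json"},
    data=json.dumps(log_doc),
)
resp.raise_for_status()
print(resp.json()["_id"])                    # Elasticsearch assigns a document id
```

Once indexed, the document becomes searchable and can be charted in Kibana.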
ELK Stack is pretty convenient for troubleshooting and monitoring. It became popular by providing a simple and robust suite in the log analytics space, for a reasonable price.
Over to you: which other log management products have you used in production? How do they compare with ELK Stack?
Gergely Orosz wrote an excellent article about this topic and he has kindly agreed to share excerpts with newsletter readers.
Startups: Typically do fewer quality checks than other companies.
Startups tend to prioritize moving fast and iterating quickly, and often do so without much of a safety net. This makes perfect sense if they don't – yet – have customers. As the company attracts users, these teams need to start to find ways to not cause regressions or ship bugs. They then have the choice of going down one of two paths: hire QAs or invest in automation.
Traditional companies: Tend to rely more heavily on QA teams.
While automation is sometimes present in more traditional companies, it's very typical that they rely on large QA teams to verify what they build. Working on branches is also common; it's rare to have trunk-based development in these environments.
Large tech companies: Typically invest heavily in infrastructure and automation related to shipping with confidence.
These investments often include automated tests running quickly and delivering rapid feedback, canarying, feature flags and staged rollouts.
Facebook core: Has a sophisticated and effective approach few other companies possess.
Facebook's core product is an interesting one. It has fewer automated tests than many would assume, but, on the other hand, it has an exceptional automated canarying functionality, where the code is rolled out through 4 environments: from a testing environment with automation, through one that all employees use, through a test market of a smaller region, to all users. In every stage, if the metrics are off, the rollout automatically halts.
Over to you: how does your company ship code to production? How well does it work?
If you want to read the full article, you can find it here:
To understand Linux file permissions, we need to understand Ownership and Permission.
Ownership
Every file or directory is assigned 3 types of owner:
🔹Owner: the owner is the user who created the file or directory.
🔹Group: a group can have multiple users. All users in the group have the same permissions to access the file or directory.
🔹Other: all users who are neither the owner nor members of the group.
Permission
There are only three types of permissions for a file or directory.
🔹Read (r): the read permission allows the user to read a file.
🔹Write (w): the write permission allows the user to change the content of the file.
🔹Execute (x): the execute permission allows a file to be executed.
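As a small illustration, the same ownership and permission bits can be inspected programmatically with Python's standard stat module (the file path below is just a placeholder):

```python
# Sketch: decode a file's permission bits into the familiar rwx form
# for owner, group, and other. The path is only a placeholder.
import os
import stat

mode = os.stat("/etc/hosts").st_mode

print(stat.filemode(mode))           # e.g. "-rw-r--r--"
print(oct(stat.S_IMODE(mode)))       # e.g. "0o644"

# Check individual permission bits for each class of user.
print("owner can write:  ", bool(mode & stat.S_IWUSR))
print("group can read:   ", bool(mode & stat.S_IRGRP))
print("other can execute:", bool(mode & stat.S_IXOTH))
```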
Over to you: what are some of the commonly used Linux commands to change file permissions?
I put together a list of algorithms and explained why they are important. Those algorithms are not only useful for interviews but also good for any software engineer to understand.
One thing to keep in mind is that understanding “how those algorithms are used in real-world systems” is generally more important than the implementation details in a system design interview.
What do the stars mean in the diagram?
It’s very difficult to rank algorithms by importance objectively. I’m open to suggestions and making adjustments.
Five-star: Very important. Try to understand how it works and why.
Three-star: Important to some extent. You may not need to know the implementation details.
One-star: Advanced. Good to know for senior candidates.
Dedeepya Bonthu wrote an excellent engineering blog that captures how Hotstar handles emoji reactions at scale. Here is my understanding of how the system works.
Clients send emojis through standard HTTP requests. You can think of the Golang service as a typical web server. Go is chosen because it supports concurrency well: goroutines, Go's lightweight threads, make it cheap to handle many connections.
Since the write volume is very high, Kafka (message queue) is used as a buffer.
Emoji data is aggregated by a stream processing service, Spark. It aggregates data every 2 seconds, and the interval is configurable. There is a trade-off to be made on the interval: a shorter interval means emojis are delivered to other clients faster, but it also means more computing resources are needed. (A sketch of this step appears after these steps.)
Aggregated data is written to another Kafka topic.
The PubSub consumers pull aggregated emoji data from Kafka.
Emojis are delivered to other clients in real-time through the PubSub infrastructure.
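Here is a sketch of what the 2-second aggregation step could look like with Spark Structured Streaming. The topic names, message schema, and paths are assumptions for illustration, not details from the referenced blog.

```python
# Sketch: aggregate emoji counts over 2-second windows with Spark
# Structured Streaming. Topic names, schema, and paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_json, struct, window
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("emoji-aggregator").getOrCreate()

schema = (StructType()
          .add("emoji", StringType())
          .add("sent_at", TimestampType()))

# Read raw emoji events from the ingestion Kafka topic.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "emoji-events")
       .load())

events = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
             .select("e.*"))

# Count each emoji over 2-second windows (the configurable interval).
agg = (events
       .withWatermark("sent_at", "2 seconds")
       .groupBy(window(col("sent_at"), "2 seconds"), col("emoji"))
       .count())

# Publish the aggregated counts to a downstream Kafka topic for the PubSub layer.
query = (agg.select(to_json(struct("window", "emoji", "count")).alias("value"))
            .writeStream.format("kafka")
            .option("kafka.bootstrap.servers", "kafka:9092")
            .option("topic", "emoji-aggregates")
            .option("checkpointLocation", "/tmp/emoji-agg-checkpoint")
            .outputMode("update")
            .start())

query.awaitTermination()
```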
The PubSub infrastructure is interesting. Hotstar considered the following protocols: Socket.IO, NATS, MQTT, and gRPC, and settled on MQTT. For those interested in the trade-off discussion, see [2].
A similar design is adopted by LinkedIn which streams a million likes/sec [3].
Over to you: What are some of the off-the-shelf Pub-Sub services available? Is there anything you would do differently in this design?
How do we design a system for internationalization?
The diagram below shows how we can internationalize a simple e-commerce website.
Different countries have differing cultures, values, and habits. When we design an application for international markets, we need to localize the application in several ways:
🔹 Language
Extract and maintain all texts in a separate system. For example:
We shouldn’t put any prompts in the source code.
We should avoid string concatenation in the code.
We should remove text from graphics.
Use complete sentences and avoid dynamic text elements.
Display business data, such as currencies, in different languages.
🔹 Layout
Account for varying text lengths and reserve enough space around the text for different languages.
Plan for line wrap and truncation
Keep text labels short on buttons
Adjust the display for numerals, dates, timestamps, and addresses
🔹 Time zone
The time display should be separated from timestamp storage. Common practice is to store UTC (Coordinated Universal Time) timestamps in the database and backend services and to convert to the local time zone for the frontend display (see the sketch after this list).
🔹 Currency
We need to define the displayed currencies and the settlement currency. We also need to design a foreign exchange service for quoting prices.
🔹 Company entity and accounting
Since we need to set up different entities for individual countries, and these entities follow different regulations and accounting standards, the system needs to support multiple bookkeeping methods. Company-level treasury management is often needed. We also need to extract business logic to account for different usage habits in different countries or regions.
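Below is a minimal sketch of the UTC-in-storage, local-time-in-display convention mentioned under the time zone point, using Python's standard zoneinfo module (the example time zone is arbitrary):

```python
# Sketch: store timestamps in UTC, convert to the user's local time zone
# only at display time. The chosen time zone is just an example.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Backend and database: always record UTC.
stored_at = datetime.now(timezone.utc)

# Frontend display: convert to the viewer's local time zone on the way out.
local_time = stored_at.astimezone(ZoneInfo("Asia/Tokyo"))

print(stored_at.isoformat())    # e.g. 2023-01-01T03:00:00+00:00
print(local_time.isoformat())   # e.g. 2023-01-01T12:00:00+09:00
```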
Parse HTML and generate Document Object Model (DOM) tree
When the browser receives the HTML data from the server, it immediately parses it and converts it into a DOM tree.
Parse CSS and generate CSSOM tree
The styles (CSS files) are loaded and parsed to the CSSOM (CSS Object Model).
Combine DOM tree and CSSOM tree to construct the Render Tree
With the DOM and CSSOM, the browser builds a render tree. The render tree contains all visible DOM nodes and excludes invisible elements (such as elements styled with display: none). In other words, the render tree is a visual representation of the DOM.
Layout
The geometric information (position and size) of each element in the render tree is calculated; this step is called layout.
Painting
After the layout is complete, the rendering tree is transformed into the actual content on the screen. This step is called painting. The browser gets the absolute pixels of the content.
Display
Finally, the browser sends the pixel data to the GPU, which displays it on the screen.