Built on the Lucene library, Elasticsearch is a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. The diagram below shows the outline.
Features of ElasticSearch:
Real-time full-text search
Analytics engine
Distributed Lucene
ElasticSearch use cases:
Product search on an eCommerce website
Log analysis
Autocomplete, spell checking
Business intelligence analysis
Full-text search on Wikipedia
Full-text search on StackOverflow
The core of Elasticsearch lies in its data structures and indexing. It is important to understand how ES builds the term dictionary using an LSM tree (Log-Structured Merge tree).
What are its principles, methods, constraints, and best practices? I hope the diagram below gives you a quick overview.
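To make the LSM-tree idea concrete, here is a minimal Python sketch of the write and read path: writes land in an in-memory memtable, full memtables are flushed to immutable sorted runs ("SSTables"), and reads check the memtable first, then the runs from newest to oldest. The class and its size limit are illustrative only; real engines such as Lucene and RocksDB add write-ahead logs, bloom filters, and background compaction.

```python
# Illustrative LSM-tree write/read path (not a real storage engine).

class LSMTree:
    def __init__(self, memtable_limit=3):
        self.memtable = {}             # mutable in-memory buffer
        self.sstables = []             # immutable sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Flush the memtable as an immutable sorted run ("SSTable").
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Check the memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):
            for k, v in run:
                if k == key:
                    return v
        return None

tree = LSMTree()
for i in range(5):
    tree.put(f"term{i}", i)
print(tree.get("term1"))  # found in a flushed SSTable, not the memtable
```

Because flushed runs are sorted and immutable, sequential disk writes stay cheap; the price is that reads may have to consult several runs, which is why real systems merge runs via compaction.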
ChatGPT and copy.ai brought attention to AIGC (AI-Generated Content). Why is AIGC gaining explosive growth?
The diagram below summarizes the development in this area.
OpenAI has been developing GPT (Generative Pre-trained Transformer) models since 2018.
GPT-1 was trained on the BooksCorpus dataset (5 GB), with a main focus on language understanding.
On Valentine’s Day 2019, GPT-2 was released with the slogan “too dangerous to release”. It was trained on articles linked from Reddit posts with over 3 upvotes (40 GB). The training cost was about $43k.
Later, GPT-2 was used to generate music in MuseNet and JukeBox.
In June 2020, GPT-3 was released, trained on a much more comprehensive dataset.
More applications were developed on top of GPT-3, including:
DALL-E: creating images from text
CLIP: connecting text and images
Whisper: multi-lingual voice to text
ChatGPT: chatbot, article writer, code writer
With the development of AIGC algorithms, many companies have built applications to generate text, images, code, voice, and video.
I strongly recommend that you play with these applications. The results are astonishing!
Why is a DDoS attack hazardous to online services? Here is an example of how DDoS works.
The purpose of a DDoS attack is to disrupt the normal traffic of the victim servers through malicious requests. As a result, the servers are swamped with malicious requests and have no buffer to handle normal requests.
Steps 1 and 2: An attacker takes control of a network of compromised devices (“zombies”) via a controller and issues commands to them remotely.
Step 3: The zombies can send requests to the victim servers, exhausting the servers' resources. Since zombies are legitimate internet devices, it is difficult to distinguish DDoS traffic from normal traffic.
A common example of a DDoS attack is a SYN flood.
Normally, the client and server establish a TCP connection via a 3-way handshake. In a SYN flood attack, zombies send large numbers of SYN requests to the server but never acknowledge the server’s SYN-ACK replies.
As a result, the victim server accumulates many half-open TCP connections and eventually exhausts its resources.
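The mechanics of that exhaustion can be sketched in a few lines of Python. The model below is a toy: the backlog size, the function names, and the set-based queue are all illustrative, and real kernels mitigate this with SYN cookies, timeouts, and much larger backlogs.

```python
# Toy model of a server's half-open (SYN-received) connection queue,
# showing how SYNs that are never acknowledged exhaust it.

BACKLOG = 128          # illustrative limit on tracked half-open connections

half_open = set()

def on_syn(client_id):
    """Server receives a SYN; reserves a slot and replies with SYN-ACK."""
    if len(half_open) >= BACKLOG:
        return "dropped"        # backlog full: legitimate clients suffer
    half_open.add(client_id)
    return "syn-ack sent"

def on_ack(client_id):
    """Client completes the handshake; the half-open slot is freed."""
    half_open.discard(client_id)
    return "established"

# Zombies send SYNs but never ACK the server's SYN-ACK.
for zombie in range(BACKLOG):
    on_syn(f"zombie-{zombie}")

print(on_syn("legit-user"))   # "dropped" -- queue is full of half-opens
```

The key point the sketch captures: each unanswered SYN holds a slot until a timeout, so an attacker only needs to send SYNs faster than slots are reclaimed.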
About 30 years ago, Peter Deutsch drafted a list of eight fallacies in distributed computing environments, now known as "The 8 fallacies of distributed computing". Many years later, the fallacies remain.
Data is cached everywhere, from the front end to the back end!
This diagram illustrates where we cache data in a typical architecture.
There are multiple layers along the flow.
Client apps: HTTP responses can be cached by the browser. The first time we request data over HTTP, it is returned with an expiry policy in the HTTP header; on subsequent requests, the client app tries to retrieve the data from the browser cache first.
CDN: CDN caches static web resources. The clients can retrieve data from a CDN node nearby.
Load Balancer: The load balancer can cache resources as well.
Messaging infra: Message brokers store messages on disk first, and then consumers retrieve them at their own pace. Depending on the retention policy, the data is cached in Kafka clusters for a period of time.
Services: There are multiple layers of cache in a service. If the data is not cached in the CPU cache, the service will try to retrieve the data from memory. Sometimes the service has a second-level cache to store data on disk.
Distributed Cache: A distributed cache like Redis holds key-value pairs for multiple services in memory. It provides much better read/write performance than the database.
Full-text Search: We sometimes need full-text search engines like Elasticsearch for document search or log search. A copy of the data is indexed in the search engine as well.
Database: Even in the database, we have different levels of caches:
WAL (Write-ahead Log): data is written to the WAL first, before the B-tree index is updated
Buffer pool: a memory area allocated to cache query results
Materialized View: Pre-compute query results and store them in the database tables for better query performance
Transaction log: record all the transactions and database updates
Replication Log: used to record the replication state in a database cluster
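The service-to-distributed-cache layer above is usually wired up with the cache-aside pattern. Here is a minimal Python sketch of it; the `redis_cache` dict and `slow_db` dict stand in for real Redis and database clients, and the key names and TTL are made up for illustration.

```python
# Cache-aside pattern sketch: read through the cache, invalidate on write.

import time

redis_cache = {}                        # stand-in for a Redis client
slow_db = {"user:42": {"name": "Bob"}}  # stand-in for the database

def get_user(key, ttl=60):
    entry = redis_cache.get(key)
    if entry and entry["expires_at"] > time.time():
        return entry["value"]           # cache hit
    value = slow_db.get(key)            # cache miss: read from the database
    redis_cache[key] = {"value": value, "expires_at": time.time() + ttl}
    return value

def update_user(key, value):
    slow_db[key] = value
    redis_cache.pop(key, None)          # invalidate so readers refetch

print(get_user("user:42"))  # miss: hits the database, then caches
print(get_user("user:42"))  # hit: served from the cache
```

Invalidate-on-write (rather than update-on-write) is the common choice here because it avoids caching a value that a concurrent writer immediately makes stale.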
Over to you: With the data cached at so many levels, how can we guarantee the sensitive user data is completely erased from the systems?
A CI/CD pipeline is a tool that automates the process of building, testing, and deploying software.
It integrates the different stages of the software development lifecycle, including code creation and revision, testing, and deployment, into a single, cohesive workflow.
The diagram below illustrates some of the tools that are commonly used.
Below you will find a diagram showing the microservice tech stack, both for the development phase and for production.
Pre-production
Define API - This establishes a contract between frontend and backend. We can use Postman or OpenAPI for this.
Development - React is popular for frontend development, and Java/Python/Go (or Node.js) for backend development. We also need to update the API gateway configuration according to the API definitions.
Continuous Integration - JUnit and Jenkins for automated testing. The code is packaged into a Docker image and deployed as microservices.
Production
Nginx is a common choice for load balancers. Cloudflare provides the CDN (Content Delivery Network).
API Gateway - We can use Spring Boot for the gateway, and Eureka/ZooKeeper for service discovery.
The microservices are deployed on clouds. We have options among AWS, Microsoft Azure, and Google Cloud (GCP).
Cache and Full-text Search - Redis is a common choice for caching key-value pairs. ElasticSearch is used for full-text search.
Communications - For services to talk to each other, we can use messaging infrastructure such as Kafka, or RPC.
Persistence - We can use MySQL or PostgreSQL for a relational database, and Amazon S3 for object store. We can also use Cassandra for the wide-column store if necessary.
Management & Monitoring - To manage so many microservices, the common Ops tools include Prometheus, Elastic Stack, and Kubernetes.
Below is a diagram showing the evolution of architecture and processes since the 1980s.
Organizations can build and run scalable applications on public, private, and hybrid clouds using cloud native technologies.
This means the applications are designed to leverage cloud features, so they are resilient to load and easy to scale.
Cloud native includes 4 aspects:
Development process - This has progressed from waterfall to agile to DevOps.
Application architecture - The architecture has gone from monolithic to microservices. Each service is designed to be small and adaptive to the limited resources in cloud containers.
Deployment & packaging - Applications used to be deployed on physical servers. Then, around 2000, applications that were not sensitive to latency were usually deployed on virtual servers. Cloud native applications are packaged into Docker images and deployed in containers.
Application infrastructure - Applications are massively deployed on cloud infrastructure instead of self-hosted servers.
Why is a credit card called a “credit” card? Why is a debit card called a “debit” card?
An example of a debit card payment is shown in the diagram below.
Each transaction in the business system is transformed into at least two journal lines in the ledger system. This is called double-entry accounting, where every transaction must have a source account and a target account.
Each journal line is booked to an account.
Each account belongs to one of the three components of the balance sheet: assets, liabilities, or equity.
Let’s look at the issuing bank’s ledger as an example:
Bob pays $100 to the merchant with a debit card. We have two accounts involved in this transaction:
Journal line 1 - From the issuing bank’s point of view, Bob’s bank account is a liability (because the bank owes Bob money). Bob’s bank account is deducted $100. This is a debit record.
Journal line 2 - The bank’s cash is an asset, and it is deducted by $100. This is a credit record.
The balance sheet equation still balances with the two journal lines recorded in the ledger.
Bob’s card is called a “debit” card because it is a debit record when paying with a debit card.
👉 Why is this important? This is how a ledger system is designed, only a real ledger is more complicated. Applying these strict accounting rules makes reconciliation much easier!
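The debit-card transaction above can be booked with a few lines of Python. This is a minimal sketch, not a real ledger schema: the account names, the `JournalLine` type, and the cents-based amounts are all assumptions made for illustration. The invariant it enforces is the double-entry rule: a transaction's debits must equal its credits.

```python
# Minimal double-entry booking for the debit-card example.

from dataclasses import dataclass

@dataclass
class JournalLine:
    account: str
    side: str      # "debit" or "credit"
    amount: int    # amount in cents, always positive

def post(transaction):
    """Accept a list of journal lines only if debits equal credits."""
    debits = sum(l.amount for l in transaction if l.side == "debit")
    credits = sum(l.amount for l in transaction if l.side == "credit")
    if debits != credits:
        raise ValueError("unbalanced transaction")
    return transaction

# Bob pays $100: debit his account (a bank liability), credit the
# bank's cash (an asset). Both entries carry the same amount.
txn = post([
    JournalLine("liability:bob_checking", "debit", 10_000),
    JournalLine("asset:cash", "credit", 10_000),
])
print(len(txn), "journal lines booked")
```

Rejecting unbalanced transactions at write time is what makes reconciliation tractable later: any imbalance found downstream must come from a missing or corrupted record, not from the booking logic.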
Uber’s API gateway went through 3 main stages.
First gen: the organic evolution. Uber's architecture in 2014 had two key services: dispatch and API. The dispatch service connects a rider with a driver, while the API service stores the long-term data of users and trips.
Second gen: the all-encompassing gateway. Uber adopted a microservice architecture very early on. By 2019, Uber's products were powered by 2,200+ microservices as a result of this architectural decision.
Third gen: self-service, decentralized, and layered. As of early 2018, Uber had completely new business lines and numerous new applications. Freight, ATG, Elevate, groceries, and more are among the growing business lines. With a new set of goals comes the third generation.
An HTTP server cannot automatically initiate a connection to a browser. As a result, the web browser is the initiator. What should we do next to get real-time updates from the HTTP server?
Both the web browser and the HTTP server could be responsible for this task.
Web browsers do the heavy lifting: short polling or long polling. With short polling, the browser repeatedly asks the server for the latest data at fixed intervals. With long polling, the HTTP server doesn’t return a response until new data has arrived (or a timeout is reached).
HTTP server and web browser cooperate: WebSocket or SSE (server-sent events). In both cases, the HTTP server can directly send the latest data to the browser after the connection is established. The difference is that SSE is uni-directional, so the browser cannot send requests to the server over the same connection, while WebSocket is full-duplex, so the browser can keep sending new requests.
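The long-polling loop can be sketched as below. This is a simulation, not real HTTP: `fetch_updates` stands in for a GET request that the server holds open until data arrives or a timeout passes, and the queue stands in for the server's new-data source.

```python
# Client-side long-polling loop, simulated with a blocking queue.

import queue
import threading

events = queue.Queue()    # stand-in for the server's new-data source

def fetch_updates(timeout=5):
    """Simulates one long-poll request: blocks until data or timeout."""
    try:
        return events.get(timeout=timeout)   # server holds the request open
    except queue.Empty:
        return None                          # timeout: client polls again

def client_loop(max_polls=3):
    received = []
    for _ in range(max_polls):
        data = fetch_updates(timeout=1)
        if data is not None:
            received.append(data)            # got an update; poll again
    return received

# Publish an event shortly after the client starts polling.
threading.Timer(0.2, lambda: events.put("new message")).start()
print(client_loop())
```

Note the trade-off the sketch makes visible: unlike short polling, the client issues no wasted requests while nothing is happening, but each update still costs a full request/response round trip, which is why SSE or WebSocket is preferred for high-frequency updates.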
👉 Over to you: of the four solutions (long polling, short polling, SSE, WebSocket), which ones are commonly used, and for what use cases?