This post is a summary of a tech talk given by Twitter in 2013. Let’s take a look.
The life of a Tweet
1️⃣ A tweet comes in through the Write API.
2️⃣ The Write API routes the request to the Fanout service.
3️⃣ The Fanout service does a lot of processing and stores the results in the Redis cache.
4️⃣ The Timeline service is used to find the Redis server that has the home timeline on it.
5️⃣ A user pulls their home timeline through the Timeline service.
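The five steps above can be sketched as a toy fanout-on-write pipeline. This is only an illustration, not Twitter's implementation: the follower graph, the `TIMELINE_LIMIT` cap, and the function names are assumptions, and a plain in-memory dictionary stands in for the Redis cache.

```python
from collections import defaultdict, deque

# Hypothetical stand-ins: a follower graph and one capped home timeline
# per user (playing the role of the Redis cache).
FOLLOWERS = {"alice": ["bob", "carol"], "bob": ["carol"]}
TIMELINE_LIMIT = 800  # assumed cap on cached entries per timeline
home_timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LIMIT))


def write_api(author: str, tweet_id: int) -> None:
    """Steps 1-2: accept the tweet and route it to the fanout step."""
    fanout(author, tweet_id)


def fanout(author: str, tweet_id: int) -> None:
    """Step 3: push the tweet id onto every follower's cached timeline."""
    for follower in FOLLOWERS.get(author, []):
        home_timelines[follower].appendleft(tweet_id)  # newest first


def timeline_service(user: str, count: int = 20) -> list:
    """Steps 4-5: find the user's cached timeline and return the newest ids."""
    return list(home_timelines[user])[:count]


write_api("alice", 1)
write_api("bob", 2)
print(timeline_service("carol"))  # [2, 1]
print(timeline_service("bob"))    # [1]
```

The work happens at write time, so reading a home timeline is a single cache lookup. The cost is that a tweet from an author with many followers triggers many writes.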
Search & Discovery
🔹 Ingester: annotates and tokenizes Tweets so the data can be indexed.
🔹 Earlybird: stores search index.
🔹 Blender: creates the search and discovery timelines.
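To make the three roles more concrete, here is a minimal sketch of tokenizing text into an inverted index and querying it. It is a simplification under assumptions: the function names, the regular expression used for tokenizing, and the ranking (newest id first) are illustrative and not taken from the talk.

```python
import re
from collections import defaultdict

TOKEN_PATTERN = re.compile(r"[a-z0-9#@_]+")  # assumed tokenization rule
index = defaultdict(set)  # token -> tweet ids (the role Earlybird plays)


def ingest(tweet_id: int, text: str) -> None:
    """Ingester role: tokenize the tweet and add it to the index."""
    for token in TOKEN_PATTERN.findall(text.lower()):
        index[token].add(tweet_id)


def search(query: str) -> list:
    """Blender role, heavily simplified: ids matching every query token."""
    tokens = TOKEN_PATTERN.findall(query.lower())
    if not tokens:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(matches, reverse=True)  # newest id first


ingest(1, "Redis makes timelines fast")
ingest(2, "Search index built on Lucene")
ingest(3, "Fast search with Redis")
print(search("redis fast"))  # [3, 1]
print(search("search"))      # [3, 2]
```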
Push Compute
🔹HTTP push
🔹Mobile push
Disclaimer: This article is based on the tech talk given by Twitter in 2013 (https://bit.ly/3vNfjRp). Even though many years have passed, it’s still quite relevant. I redrew the diagram because the original is difficult to read.
Over to you: Do you use Twitter? What are some of the biggest differences between LinkedIn and Twitter that might shape their system architectures?
Picking a database is a long-term commitment so the decision shouldn’t be made lightly. The important thing to keep in mind is to choose the right database for the right job.
Data can be structured (SQL table schema), semi-structured (JSON, XML, etc.), and unstructured (Blob).
Common database categories include:
🔹 Relational
🔹 Columnar
🔹 Key-value
🔹 In-memory
🔹 Wide column
🔹 Time Series
🔹 Immutable ledger
🔹 Geospatial
🔹 Graph
🔹 Document
🔹 Text search
🔹 Blob
IDs are very important for the backend. Do you know how to generate globally unique IDs?
In this post, we will explore common requirements for IDs that are used in social media such as Facebook, Twitter, and LinkedIn.
Requirements:
🔹Globally unique
🔹Roughly sorted by time
🔹Numerical values only
🔹64 bits
🔹Highly scalable, low latency
The implementation details of the algorithms can be found online so we will not go into detail here.
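For readers who want a concrete picture anyway, here is a minimal sketch in the style of Twitter's Snowflake that meets the requirements listed above. The bit layout (41-bit millisecond timestamp, 10-bit machine id, 12-bit sequence), the custom epoch, and the class name are assumptions chosen for illustration.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # arbitrary custom epoch (assumption)
MACHINE_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class IdGenerator:
    """Generates 64-bit, numeric, roughly time-sorted ids on one machine."""

    def __init__(self, machine_id: int) -> None:
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id out of range")
        self.machine_id = machine_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = time.time_ns() // 1_000_000
            if now < self.last_ms:
                now = self.last_ms  # clock moved backwards: reuse last timestamp
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self.last_ms:
                        now = time.time_ns() // 1_000_000
            else:
                self.sequence = 0
            self.last_ms = now
            timestamp = now - EPOCH_MS
            return (
                (timestamp << (MACHINE_BITS + SEQUENCE_BITS))
                | (self.machine_id << SEQUENCE_BITS)
                | self.sequence
            )


generator = IdGenerator(machine_id=7)
first, second = generator.next_id(), generator.next_id()
print(first < second)                    # True
print((first >> SEQUENCE_BITS) & 0x3FF)  # 7, the machine id
```

Each machine generates ids without coordinating with the others, which is where the scalability and low latency come from. Global uniqueness then depends on every machine being assigned a distinct `machine_id`.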
Over to you: What kind of ID generators have you used?
Hypertext Transfer Protocol Secure (HTTPS) is an extension of the Hypertext Transfer Protocol (HTTP). HTTPS transmits encrypted data using Transport Layer Security (TLS). If the data is intercepted in transit, all the attacker gets is unreadable ciphertext.
How is the data encrypted and decrypted?
Step 1 - The client (browser) and the server establish a TCP connection.
Step 2 - The client sends a “client hello” to the server. The message contains the set of encryption algorithms (cipher suites) and the latest TLS version the client supports. The server responds with a “server hello” that tells the browser which cipher suite and TLS version it has chosen.
The server then sends the SSL certificate to the client. The certificate contains the public key, hostname, expiry dates, etc. The client validates the certificate.
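Step 3 - After validating the SSL certificate, the client generates a session key and encrypts it using the server's public key. The server receives the encrypted session key and decrypts it with its private key. (This is the classic RSA key exchange; TLS 1.3 instead derives the session key through an ephemeral Diffie-Hellman exchange.)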
Step 4 - Now that both the client and the server hold the same session key (symmetric encryption), the encrypted data is transmitted in a secure bi-directional channel.
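As a rough illustration of what the client brings to this handshake, the sketch below inspects the default TLS settings of Python's standard `ssl` module. It does not perform the key exchange itself. The helper `negotiated_parameters` is a hypothetical name, needs network access, and is defined but not called here.

```python
import socket
import ssl

# Client-side settings a Python program would offer during the handshake.
ctx = ssl.create_default_context()

# Step 2, certificate validation: the default context verifies the
# certificate chain and checks that it matches the requested hostname.
print(ctx.verify_mode == ssl.CERT_REQUIRED)
print(ctx.check_hostname)

# Step 2, "client hello": the cipher suites this context has enabled.
cipher_names = [cipher["name"] for cipher in ctx.get_ciphers()]
print(len(cipher_names), "cipher suites enabled")


def negotiated_parameters(host: str, port: int = 443) -> tuple:
    """Run steps 1-3 against a real server and report what was agreed."""
    with socket.create_connection((host, port), timeout=5) as sock:  # step 1: TCP
        with ctx.wrap_socket(sock, server_hostname=host) as tls:     # steps 2-3
            return tls.version(), tls.cipher()
```

Calling `negotiated_parameters` with a hostname of your choice returns the TLS version and cipher suite the server selected. The exact values depend on the server and on the local OpenSSL build.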
Why does HTTPS switch to symmetric encryption during data transmission? There are two main reasons:
Security: Asymmetric encryption only works in one direction. If the server encrypted data with its private key to send it back to the client, anyone could decrypt that data using the public key.
Server resources: Asymmetric encryption is computationally expensive, so it is not suitable for transmitting data over long sessions.
Over to you: how much performance overhead does HTTPS add, compared to HTTP?
Things Not to do
🔹 Storing passwords in plain text is not a good idea because anyone with internal access can see them.
🔹 Storing password hashes directly is not sufficient because it is prone to precomputation attacks, such as rainbow tables.
🔹 To mitigate precomputation attacks, we salt the passwords.
What is salt?
According to OWASP guidelines, “a salt is a unique, randomly generated string that is added to each password as part of the hashing process”.
How to store a password and salt?
1️⃣ A salt is not meant to be secret and it can be stored in plain text in the database. It is used to ensure the hash result is unique to each password.
2️⃣ The password can be stored in the database using the following format: hash(password + salt)
How to validate a password?
To validate a password, it can go through the following process:
1️⃣ A client enters the password.
2️⃣ The system fetches the corresponding salt from the database.
3️⃣ The system appends the salt to the password and hashes it. Let’s call the hashed value H1.
4️⃣ The system compares H1 and H2, where H2 is the hash stored in the database. If they are the same, the password is valid.
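The storage and validation steps above can be sketched with Python's standard library. Note one deliberate swap: instead of a single `hash(password + salt)`, this sketch uses PBKDF2 (`hashlib.pbkdf2_hmac`), a deliberately slow function that takes the salt as a separate argument. The iteration count and function names are assumptions for illustration.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 200_000  # assumed work factor; tune it for your hardware


def hash_password(password: str) -> tuple:
    """Create a fresh random salt and return (salt, hash) for storage."""
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, stored_hash: bytes) -> bool:
    """Recompute H1 from the entered password and compare it with H2."""
    h1 = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(h1, stored_hash)  # constant-time comparison


salt, h2 = hash_password("correct horse")
print(verify_password("correct horse", salt, h2))  # True
print(verify_password("wrong guess", salt, h2))    # False
```

Because every password gets its own salt, two users with the same password end up with different stored hashes, so a precomputed table is of no use.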
Over to you: what other mechanisms can we use to ensure password safety?
A friend recently went through the irksome experience of being signed out from a number of websites they use daily. This event will be familiar to millions of web users, and it is a tedious process to fix. It can involve trying to remember multiple long-forgotten passwords, or typing in the names of pets from childhood to answer security questions. SSO removes this inconvenience and makes life online better. But how does it work?
Basically, Single Sign-On (SSO) is an authentication scheme. It allows a user to log in to different systems using a single ID.
The diagram below illustrates how SSO works.
Step 1: A user visits Gmail, or any email service. Gmail finds the user is not logged in and so redirects them to the SSO authentication server, which also finds the user is not logged in. As a result, the user is redirected to the SSO login page, where they enter their login credentials.
Steps 2-3: The SSO authentication server validates the credentials, creates the global session for the user, and creates a token.
Steps 4-7: Gmail validates the token in the SSO authentication server. The authentication server registers the Gmail system, and returns “valid.” Gmail returns the protected resource to the user.
Step 8: From Gmail, the user navigates to another Google-owned website, for example, YouTube.
Steps 9-10: YouTube finds the user is not logged in, and then requests authentication. The SSO authentication server finds the user is already logged in and returns the token.
Steps 11-14: YouTube validates the token in the SSO authentication server. The authentication server registers the YouTube system, and returns “valid.” YouTube returns the protected resource to the user.
The process is complete and the user gets back access to their account.
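The flow above can be sketched as a toy model with one authentication server and two services. Everything here is an assumption made for illustration: the class and method names are hypothetical, redirects are reduced to return values, and the demo keeps the password in plain text only to stay short.

```python
import secrets
from typing import Optional


class SsoServer:
    """Hypothetical SSO authentication server holding global sessions."""

    def __init__(self, users: dict) -> None:
        self.users = users    # username -> password (plain text for the demo only)
        self.tokens = {}      # token -> username (the global session)
        self.registered = {}  # token -> services that validated the token

    def login(self, username: str, password: str) -> Optional[str]:
        """Steps 2-3: validate credentials, create a session and a token."""
        if self.users.get(username) != password:
            return None
        token = secrets.token_urlsafe(16)
        self.tokens[token] = username
        self.registered[token] = set()
        return token

    def validate(self, token: str, service: str) -> bool:
        """Steps 4-7 and 11-14: check the token and register the service."""
        if token not in self.tokens:
            return False
        self.registered[token].add(service)
        return True


class Service:
    def __init__(self, name: str, sso: SsoServer) -> None:
        self.name = name
        self.sso = sso

    def get_resource(self, token: Optional[str]) -> str:
        if token is None or not self.sso.validate(token, self.name):
            return "redirect to SSO login"
        return f"{self.name}: protected resource for {self.sso.tokens[token]}"


sso = SsoServer({"ann": "s3cret"})
gmail, youtube = Service("Gmail", sso), Service("YouTube", sso)
print(gmail.get_resource(None))     # redirect to SSO login
token = sso.login("ann", "s3cret")
print(gmail.get_resource(token))    # Gmail: protected resource for ann
print(youtube.get_resource(token))  # YouTube: protected resource for ann
```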
Over to you:
Question 1: have you implemented SSO in your projects? What is the most difficult part?
Question 2: what’s your favorite sign-in method and why?
Programming languages come and go. Some stand the test of time, some are already shooting stars, and some are rising rapidly on the horizon.
I drew a diagram that puts the 38 most commonly used programming languages in one place, sorted by year. Data source: StackOverflow survey.
The diagram below illustrates the differences between IaaS (Infrastructure-as-a-Service), PaaS (Platform-as-a-Service), and SaaS (Software-as-a-Service).
For a non-cloud application, we own and manage all the hardware and software. We say the application is on-premises.
With cloud computing, cloud service vendors provide three kinds of models for us to use: IaaS, PaaS, and SaaS.
IaaS provides us access to cloud vendors' infrastructure, like servers, storage, and networking. We pay for the infrastructure service and install and manage supporting software on it for our application.
PaaS goes further. It provides a platform with a variety of middleware, frameworks, and tools to build our application. We only focus on application development and data.
SaaS delivers the complete application, which runs in the cloud and is managed by the vendor. We pay a monthly or annual fee to use the SaaS product.
Over to you: which IaaS/PaaS/SaaS products have you used? How do you decide which architecture to use?
Database isolation allows a transaction to execute as if there are no other concurrently running transactions.
The diagram below illustrates four isolation levels.
🔹Serializable: This is the highest isolation level. Concurrent transactions are guaranteed to produce the same result as if they were executed one after another.
🔹Repeatable Read: Data read during the transaction stays the same as it was when the transaction started.
🔹Read Committed: Data modification can only be read after the transaction is committed.
🔹Read Uncommitted: The data modification can be read by other transactions before a transaction is committed.
The isolation is guaranteed by MVCC (Multi-Version Concurrency Control) and locks.
The diagram below takes Repeatable Read as an example to demonstrate how MVCC works:
There are two hidden columns for each row: transaction_id and roll_pointer. When transaction A starts, a new Read View with transaction_id=201 is created. Shortly afterward, transaction B starts, and a new Read View with transaction_id=202 is created.
Now transaction A modifies the balance to 200. A new row version is created in the log, and its roll_pointer points to the old row. Before transaction A commits, transaction B reads the balance data. Transaction B finds that transaction_id 201 is not committed, so it follows the roll_pointer and reads the previous committed record (transaction_id=200).
Even when transaction A commits, transaction B still reads data based on the Read View created when transaction B starts. So transaction B always reads the data with balance=100.
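The walkthrough above can be reproduced with a small model of a version chain and a Read View. This is a simplified sketch loosely modeled on how such read views are commonly described, not the code of any specific database. The field names `active_ids` and `next_id` and the visibility rule are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Version:
    transaction_id: int                # hidden column: who wrote this version
    balance: int
    roll_pointer: Optional["Version"]  # hidden column: the previous version


@dataclass(frozen=True)
class ReadView:
    creator_id: int
    active_ids: frozenset  # transactions still uncommitted when the view was created
    next_id: int           # ids at or above this started after the view

    def can_see(self, version: Version) -> bool:
        tid = version.transaction_id
        if tid == self.creator_id:
            return True  # a transaction always sees its own writes
        return tid < self.next_id and tid not in self.active_ids


def read_balance(latest: Version, view: ReadView) -> Optional[int]:
    """Walk the version chain until a version visible to this view is found."""
    version = latest
    while version is not None and not view.can_see(version):
        version = version.roll_pointer
    return None if version is None else version.balance


row = Version(transaction_id=200, balance=100, roll_pointer=None)
view_a = ReadView(creator_id=201, active_ids=frozenset({201}), next_id=202)
view_b = ReadView(creator_id=202, active_ids=frozenset({201, 202}), next_id=203)

row = Version(transaction_id=201, balance=200, roll_pointer=row)  # A updates

print(read_balance(row, view_a))  # 200: A sees its own change
print(read_balance(row, view_b))  # 100: B skips the version written by 201
```

The Read View is created once and never updated, so transaction B keeps reading 100 even after transaction A commits. A transaction that starts after the commit gets a fresh view and reads 200.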
Over to you: have you seen isolation levels used in the wrong way? Did it cause serious outages?
I was doing some log parsing today and totally forgot what commands to use. After some Googling, I found this awesome cheat sheet by Thomas Roccia.
Log parsing commands are useful for:
🔹Searching patterns in text files
🔹Analyzing network packets
🔹Parsing fields from delimited logs
🔹Replacing strings in a file
🔹Sorting a file
🔹Displaying differences in files by comparing line by line