Data modeling can be challenging because business needs vary widely. That’s why we wrote a comprehensive e-book that walks through 8 different scenarios and shows how to model each one in Redis, along with code snippets. In it, you will learn:
The Embedded Pattern (1-to-1, 1-to-many, and many-to-many; a minimal sketch follows this list)
The Partial Embed Pattern (1-to-1, 1-to-many, and many-to-many)
The Aggregate Pattern
The Polymorphic Pattern
The Bucket Pattern
The Revision Pattern
The Tree and Graph Pattern
The Schema Version Pattern
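As a small taste of the first pattern listed above, here is a minimal sketch in Python (using the redis-py client) of the Embedded Pattern for a 1-to-1 relationship: the user's address lives inside the user document itself. The key names and fields are illustrative, not taken from the e-book.

```python
import json
import redis

# Connect to a local Redis instance (host/port are assumptions for this sketch).
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Embedded Pattern (1-to-1): the address lives inside the user document,
# so a single GET returns everything the application needs.
user = {
    "id": "user:1001",
    "name": "Ada Lovelace",
    "address": {            # embedded sub-document
        "street": "12 Analytical Way",
        "city": "London",
    },
}

# Store the whole document as a JSON string under a single key.
r.set(user["id"], json.dumps(user))

# Read it back in one round trip; no joins or extra lookups needed.
loaded = json.loads(r.get("user:1001"))
print(loaded["address"]["city"])  # -> London
```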
The diagram below shows the details of how an API gateway handles a request.
Step 1 - The client sends an HTTP request to the API gateway.
Step 2 - The API gateway parses and validates the attributes in the HTTP request.
Step 3 - The API gateway performs allow-list/deny-list checks.
Step 4 - The API gateway talks to an identity provider for authentication and authorization.
Step 5 - The rate limiting rules are applied to the request. If it is over the limit, the request is rejected.
Steps 6 and 7 - Now that the request has passed basic checks, the API gateway finds the relevant service to route to by path matching.
Step 8 - The API gateway transforms the request into the appropriate protocol and sends it to backend microservices.
Steps 9 - 12 - The API gateway handles errors properly, and if an error takes a longer time to recover from, it breaks the circuit (circuit breaking). It can also leverage the ELK (Elasticsearch, Logstash, Kibana) stack for logging and monitoring. We sometimes cache data in the API gateway.
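To make steps 5 through 7 more concrete, here is a minimal sketch in Python of a fixed-window rate limiter and prefix-based path matching. The route table, limits, and client IDs are assumptions for illustration; a real gateway would back both with shared storage.

```python
import time

# Hypothetical route table mapping URL path prefixes to backend services (steps 6 and 7).
ROUTES = {
    "/orders": "http://order-service.internal",
    "/users": "http://user-service.internal",
}

# Fixed-window rate limiter state: client_id -> (window_start, request_count).
RATE_LIMIT = 100          # max requests per window (assumed value)
WINDOW_SECONDS = 60
_counters = {}

def allow_request(client_id: str) -> bool:
    """Step 5: reject the request if the client exceeded its quota in this window."""
    now = time.time()
    window_start, count = _counters.get(client_id, (now, 0))
    if now - window_start >= WINDOW_SECONDS:
        window_start, count = now, 0          # start a new window
    if count >= RATE_LIMIT:
        return False
    _counters[client_id] = (window_start, count + 1)
    return True

def route(path: str):
    """Steps 6 and 7: find the backend service by the longest matching path prefix."""
    matches = [p for p in ROUTES if path.startswith(p)]
    return ROUTES[max(matches, key=len)] if matches else None

if allow_request("client-42"):
    backend = route("/orders/123")
    print("forward to", backend)   # -> forward to http://order-service.internal
else:
    print("429 Too Many Requests")
```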
The diagram below shows a quick comparison between REST and GraphQL.
GraphQL is a query language for APIs developed by Meta. It provides a complete description of the data in the API and gives clients the power to ask for exactly what they need.
GraphQL servers sit in between the client and the backend services.
GraphQL can aggregate multiple REST requests into one query. The GraphQL server organizes resources in a graph.
GraphQL supports queries, mutations (applying data modifications to resources), and subscriptions (receiving real-time notifications when data changes).
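Here is a minimal sketch of what "ask for exactly what you need" looks like from a Python client, posting a query to a hypothetical /graphql endpoint. The schema fields (user, orders, total) are invented for illustration.

```python
import requests

# One query fetches a user and only the fields the client asked for,
# instead of calling several REST endpoints and over-fetching.
query = """
query {
  user(id: "42") {
    name
    orders {          # nested resource resolved by the GraphQL server
      total
    }
  }
}
"""

# Hypothetical endpoint; a real GraphQL server exposes a single URL like this.
response = requests.post("https://api.example.com/graphql", json={"query": query})
print(response.json()["data"]["user"])
```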
Choosing the right database is often the most important decision we'll ever make.
We are talking about a database for a real growing business, where a bad choice would lead to extended downtime, customer impact, and even data loss.
This take is probably a bit controversial.
First, are we positive that we need a different database?
Is the existing database bursting at the seams? Maybe the p95 latency is through the roof. Maybe the working set no longer fits in the available memory, and even the most basic requests have to go to disk.
Whatever the issues are, make sure they are not easily solvable.
Let’s read the manual for our current database system. There could be a configuration knob or two that we can tweak to give us a bit more breathing room.
Can we put a cache in front of it to buy a few more months of runway?
Can we add read replicas to shed some read load?
Can we shard the database, or partition the data in some way?
The bottom line is this: Migrating live production data is risky and costly. We better be damn sure that there is no way to keep using the current database.
We have exhausted all avenues for the current database.
How do we go about choosing the next one?
We developers are naturally drawn to the new and shiny, like moths to a flame. When it comes to databases, though, boring is good.
We should prefer the ones that have been around for a long time, and have been battle tested.
Software engineering at scale is about tradeoffs. When it comes to databases, this is even more true.
Instead of reading the shiny brochures, go read the manual. There is usually a page called “Limits”. That page is a gem.
Learn as much as possible about the candidate now. The investment is relatively small at this juncture.
Once we narrow down the database options, what’s next?
Create a realistic test bench for the candidates using our data, with our real-world access patterns.
During benchmarking, pay attention to the outliers. Measure the p99 of everything; the average is not meaningful.
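A small sketch of why the average hides the tail: computing the mean and the p99 over a batch of made-up latency samples in which 2% of requests hit a slow path.

```python
# Illustrative latency samples in milliseconds; 2% of requests hit a slow path.
samples = [12, 11, 13, 12, 14, 11, 12, 13, 12, 11] * 98 + [950] * 20

def percentile(values, pct):
    """Nearest-rank percentile: small, dependency-free, good enough for a benchmark report."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

mean = sum(samples) / len(samples)
print(f"mean = {mean:.1f} ms")                 # looks tolerable on its own
print(f"p99  = {percentile(samples, 99)} ms")  # reveals the slow outliers
```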
After everything checks out, plan the migration carefully. Write out a detailed step-by-step migration plan.
Picking the right database is not glamorous, and there is a lot of hard work involved. Migrating to a new database in the real world could take years at a high scale.
Good luck.
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with the counting Bloom filter variant); the more items added, the larger the probability of false positives.
An example of a Bloom filter, representing the set {x, y, z} . The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z} , because it hashes to one bit-array position containing 0. For this figure, m = 18 and k = 3.
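Here is a minimal sketch of the idea in Python, assuming m = 18 bits and k = 3 hash functions as in the figure; deriving the bit positions from salted SHA-1 digests is just an illustrative choice.

```python
import hashlib

M = 18  # number of bits in the filter (as in the figure above)
K = 3   # number of hash functions

bits = [0] * M

def _positions(item: str):
    """Derive K bit positions for an item by salting a SHA-1 hash (illustrative choice)."""
    for seed in range(K):
        digest = hashlib.sha1(f"{seed}:{item}".encode()).hexdigest()
        yield int(digest, 16) % M

def add(item: str):
    """Adding an element sets its K bits; elements cannot be removed."""
    for pos in _positions(item):
        bits[pos] = 1

def might_contain(item: str) -> bool:
    """False means definitely not in the set; True means possibly in the set."""
    return all(bits[pos] for pos in _positions(item))

for element in ("x", "y", "z"):
    add(element)

print(might_contain("x"))  # True  (possibly in set)
print(might_contain("w"))  # likely False: at least one of w's bits should still be 0
```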
Have you noticed that the largest incidents are usually caused by something very small?
A minor error starts a snowball effect that keeps building. Suddenly, everything is down.
Here are 8 cloud design patterns to reduce the damage done by failures.
Timeout
Retry
Circuit breaker
Rate limiting
Load shedding
Bulkhead
Back pressure
Let it crash
These patterns are usually not used alone. To apply them effectively, we need to understand why we need them, how they work, and their limitations.
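As an illustration of how these patterns combine, here is a minimal sketch of a timeout wrapped in a retry with exponential backoff; the call_backend function and its failure behavior are hypothetical.

```python
import random
import time

def call_backend(timeout: float) -> str:
    """Hypothetical downstream call; raises TimeoutError when it is too slow."""
    if random.random() < 0.5:  # simulate a flaky dependency
        raise TimeoutError(f"backend did not answer within {timeout:.1f}s")
    return "ok"

def call_with_retry(max_attempts: int = 3, timeout: float = 1.0) -> str:
    """Timeout + Retry: bound each attempt, back off between attempts, then give up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call_backend(timeout)
        except TimeoutError:
            if attempt == max_attempts:
                raise                           # out of attempts; let the caller handle it
            time.sleep(0.1 * 2 ** attempt)      # exponential backoff before retrying

try:
    print(call_with_retry())
except TimeoutError:
    print("giving up; a circuit breaker could now stop further calls for a while")
```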
The study below runs 10 benchmark problems in 27 languages [1]. It measures the runtime, memory usage, and energy consumption of each language. This take might be controversial.
“This paper presents a study of the runtime, memory usage, and energy consumption of twenty-seven well-known software languages. We monitor the performance of such languages using ten different programming problems, expressed in each of the languages. Our results show interesting findings, such as slower/faster languages consuming less/more energy, and how memory usage influences energy consumption. We show how to use our results to provide software engineers support to decide which language to use when energy efficiency is a concern”. [2]
The diagram below lists 5 common access control mechanisms. 👇
ACL (Access Control List) - An ACL is a list of rules that specifies which users are granted or denied access to a particular resource.
Pros - easy to understand. Cons - error-prone, and the maintenance cost is high.
DAC (Discretionary Access Control) - Built on top of ACLs. It grants or restricts object access via an access policy determined by the object's owner group.
Pros - easy and flexible; the Linux file system supports DAC. Cons - permission control is scattered, and the object's owner group has too much power.
MAC (Mandatory Access Control) - Both resource owners and resources carry classification labels, and different labels are granted different permissions.
Pros - strict and straightforward. Cons - not flexible.
ABAC (Attribute-Based Access Control) - Evaluates permissions based on attributes of the resource owner, action, resource, and environment.
Pros - flexible. Cons - the rules can get complicated and the implementation is hard, so it is not commonly used.
RBAC (Role-Based Access Control) - Evaluates permissions based on roles.
Pros - flexible role assignment.
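Here is a minimal sketch of an RBAC check in Python; the roles, users, and permissions are made up for illustration.

```python
# Role -> permissions (illustrative values, not a standard).
ROLE_PERMISSIONS = {
    "admin": {"read", "write", "delete"},
    "editor": {"read", "write"},
    "viewer": {"read"},
}

# User -> roles; permissions are evaluated through the role, never per user.
USER_ROLES = {
    "alice": {"admin"},
    "bob": {"viewer"},
}

def is_allowed(user: str, action: str) -> bool:
    """RBAC: a user may perform an action if any of their roles grants it."""
    return any(action in ROLE_PERMISSIONS.get(role, set())
               for role in USER_ROLES.get(user, set()))

print(is_allowed("alice", "delete"))  # True  (admin role grants delete)
print(is_allowed("bob", "write"))     # False (viewer role grants read only)
```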
According to Wikipedia, the drawing appeared in The New York Times, and Microsoft CEO Satya Nadella said it was what persuaded him to change Microsoft's culture.
Link to the drawing: https://lnkd.in/eh7RiMKa
The drawing was published in 2011. More than 10 years have passed. How relevant is this now?
How do we generate unique IDs in distributed systems? How do we avoid ID conflicts?
The diagram below shows 5 ways. 👇
Assume the design requirements for distributed unique IDs are (a sketch of one approach follows this list):
Globally unique.
Availability. The ID generator must be available under high concurrency.
Ordered. The IDs are sorted by certain rules. For example, sorted by time.
Distributed. The ID generator doesn’t rely on a centralized service.
Security. Depending on the use case, some IDs cannot simply be incremental integers, because they might expose sensitive information. For example, people might guess the total number of users by looking at sequential IDs.
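To tie these requirements together, here is a minimal sketch of a Snowflake-style generator, one common approach: a timestamp, a machine ID, and a per-millisecond sequence packed into a 64-bit integer. The bit widths and epoch below follow the commonly cited Snowflake layout, but treat the code as illustrative.

```python
import threading
import time

class SnowflakeGenerator:
    """64-bit IDs: 41 bits of milliseconds since a custom epoch,
    10 bits of machine ID, and 12 bits of per-millisecond sequence."""

    EPOCH_MS = 1_288_834_974_657  # Twitter's original epoch; any fixed epoch works

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024            # must fit in 10 bits
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF   # wrap within 12 bits
                if self.sequence == 0:                        # sequence exhausted:
                    while now_ms <= self.last_ms:             # wait for the next millisecond
                        now_ms = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now_ms
            return ((now_ms - self.EPOCH_MS) << 22) | (self.machine_id << 12) | self.sequence

gen = SnowflakeGenerator(machine_id=1)
print(gen.next_id())  # time-ordered, unique per machine, no central coordinator
```

The machine ID keeps generators on different hosts from colliding without any central service, and the timestamp prefix keeps the IDs roughly sorted by creation time.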