Add specification for a new cluster implementation. #10875
base: unstable
Conversation
HLD still seems good to me, just commenting about the details. I haven't had much time to really think about it until today.
- Has a consistent source of truth for all cluster metadata
- Requires the shards to reconcile their state according to the above source
- Maintains the ability to scale down to a bare minimum 3 redis-server processes
- Is called Flotilla :)
Can we drop the flotilla name now? I think it will just cause confusion moving forward.
@@ -0,0 +1,450 @@
# Redis Cluster v2 Proposal
Still no conversation about how to keep Functions consistent across the cluster. I think that is still an important item to try to solve. It doesn't sound all that difficult to store it in the TDs, but it will require interleaving functions with replication traffic. (Mixing CP and DP functionality, which isn't great.)
This is a difficult one.
If we perceive functions (and probably Lua as well) as configuration, then it makes sense to use the TD/FC layers to distribute them.
It might be worth getting some actual usage stats on Lua (I doubt there's enough on Functions at the moment) - would current usage easily lend itself to a 'configuration' pattern of infrequent changes?
I don't think Functions are an essential part of the cluster V2 design. Given that Functions are non-sharded user data, I feel they warrant their own host servers so they can scale independently. A container-registry-like design could be an option IMO.
Unlike EVAL (which is the user's app responsibility), we should think of Functions as lightweight Modules.
They extend the API of the database and should be propagated to all nodes at function registration time, and later when new nodes join.
Unlike EVAL, the user's app shouldn't be the one that loads them, but rather the admin, so either the admin tools or the cluster itself must be responsible for making sure they exist on all nodes.
I suppose that the difference between functions and modules in that respect is that for modules we don't want to offer any code to distribute the binaries to different hosts and load them (we should leave that to some orchestration system that deploys the software and creates the config files), but for functions we want a lower entry barrier.
We need to think about this more, and when we do, let's consider getting ACL into that mix too.
> Given that Functions are non-sharded user data, I feel they warrant their own host servers so they can scale independently.

I think this was just a choice made out of practicality. We wanted functions in 7, but we didn't really have a good way to expose "non-sharded" Redis data outside of asking users to apply them onto all nodes. I think if we had time, we would have tried to solve this problem, but it's hard. I think cluster v2 should at least be able to solve this problem.
I think propagating by the control path, in this case, is probably what we want. In a lot of ways it is like an external operator going in and calling a command on every node in the cluster and making sure it applied correctly. The expectation should be that functions are low throughput. I would even posit we could be aggressive and say you can only propagate one function at a time, just to artificially slow down the rate of propagation.
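To make the "external operator" analogy concrete, here is a toy Python sketch of control-path propagation of a single function at a time, with verification before moving to the next node. It is not from the spec; `Node`, `load_function`, and `propagate_function` are all illustrative names.

```python
# Hypothetical sketch: propagate one function through the control path,
# like an operator calling a command on every node and checking it applied.
class Node:
    def __init__(self, name):
        self.name = name
        self.functions = {}  # registered function name -> body

    def load_function(self, fname, body):
        self.functions[fname] = body
        return True

def propagate_function(nodes, fname, body):
    """Apply one function to every node sequentially and verify it applied
    before moving on -- naturally rate-limiting the propagation."""
    for node in nodes:
        ok = node.load_function(fname, body)
        if not ok or node.functions.get(fname) != body:
            raise RuntimeError(f"function {fname} failed to apply on {node.name}")

nodes = [Node("a"), Node("b"), Node("c")]
propagate_function(nodes, "myfunc", "return 1")
assert all(n.functions["myfunc"] == "return 1" for n in nodes)
```

Propagating sequentially (rather than fanning out) keeps the control plane simple and bounds the blast radius of a bad function, at the cost of slower registration.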
Treating them as configuration makes a lot of sense to me, but since we don't have sufficient knowledge on how functions will be used in the wild - maybe defer the decision?
In the next couple of months we may see a more 'lua-ish' usage which would then completely break once we GA Flotilla.
Maybe we could defer that decision, but I'd like to point out two things:
First, we should be the ones to decide how functions are used in the wild, and adding mechanisms around their propagation could assist. Standing aside to see what users would do is likely to promote misuse.
Of course, in the context of Flotilla the above statement is not relevant, since it's not yet available to users, so indeed we can take the time. (I just like to avoid taking a similar approach elsewhere.)
Secondly, arguably sooner or later we'll need to propagate some (non-cluster related) configuration (be it ACL, Functions or something else) through the cluster, so I think it may be better to include this topic in the design at an early stage.
Agree wrt ACLs and other configs - will add to the design.
+1 on adding a section for propagating configuration in general.
Very interesting!!!
I read the rendered doc (... -> View file) and spotted some Markdown formatting issues.
Elaborate on fencing behavior Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
Fix typo Co-authored-by: Viktor Söderqvist <viktor@zuiderkwast.se>
Typo fix Co-authored-by: Tian <skylypig@gmail.com>
I just took a quick look at the new design for cluster V2, and I'll need some time to digest it. But I have one question: it looks like no concept or plan for supporting multi-tenancy or multiple users is mentioned. Could someone confirm this? Or is there no plan to support multi-tenancy in the open source version? Thanks
Redis already supports multiple users, and the new design does not alter that. If by multi-tenancy you mean multi-database (with the SELECT command), that's only supported with unclustered Redis, and this new spec does not talk about adding it to clustered mode at the moment.
Thanks for the clarification.
We do intend for this system to support un-sharded configurations though, in which we could / would support multiple databases and SELECT (and terminate Sentinel some day).
After a rough read, the point that most concerns me is that the two-tier architecture seems too flexible: users may find it hard to understand, and may abuse the deployment options. There are five deployment modes provided, but users may have difficulty choosing -- which one is best for my scenario?

It also brings complexity into the implementation. For instance, we should support transitions between the five modes, which means there are 4*5=20 cases to consider; maybe some can be eliminated, but it is still a substantial number. Besides, some optimizations may only affect certain deployment modes. For example, when the FC and TD are deployed together, the Raft group may be shared between the two modules, but not otherwise. So whether we should do this optimization is undetermined, since it's only available for some deployments.

If we put the FC and TD in one deployment unit by design, there would be far fewer variants, and only two deployment modes would be exposed to users: Native and Midrange. For a beginner, a fast setup, or a small cluster, Native is sufficient; for advanced users, Midrange could be a better choice. There is no puzzle over which one to take. With Raft membership changes and configuration changes, we can easily implement the transition between the two modes, as well as add or remove a control plane node.
@ushachar I reviewed all the discussions and the description of the cluster v2 spec. I have one concern (though you could argue this is a bonus for the design): a blocker arises when clients want to migrate data from non-cluster mode to cluster mode. For data with the same name belonging to two different users, there is no clean way to store both in the cluster-mode Redis instance, which creates trouble. I think one way to support multi-tenancy is using a namespace or a similar concept.
I consider this out of scope for this design - it requires changes in additional areas and this is a big enough task as-is...

### Roles
While the Flotilla spec mostly discusses data nodes (and not Failover Coordinator & Topology Director implementation), in our reference implementation |
I agree that we can discuss the control plane implementation separately. But I fear there might be some influence on the data plane design.
For example, handling of FC and TD failovers: will we maintain a replica for the FC and TD? If so, can we face a condition of epoch downgrading?
Administrators often place replicas in separate geo-locations for survivability - maybe we should consider having a replica-FC to better support these cases? (Same goes for the TD.)
This PR does not address this aspect of the TD/FC internal implementation other than specifying they need to be CP entities (so epoch downgrades cannot happen).
In the upcoming reference implementation we'll have a module which incorporates Raft with 3+ instances to achieve the required guarantees and provide the ability for an admin to use different AZs for HA.
- The heartbeat results from all the nodes in the same shard (and their 'tick' value)
- If needed, the updated cluster topology.
If a replica node detects that its primary has not sent a heartbeat for more than N ticks it checks if it is the most |
In case the replica-primary keepalive is broken BUT the primary-FC heartbeat is OK - will this also be validated by the FC?
There are cases where the replication stream is broken (bug/route issues etc.) but that does not mean we should failover...
The FC does not validate anything in the heartbeat beyond epoch correctness -- it's the responsibility of the individual nodes to trigger failovers.
In the case you outlined:
- If the replica <-> FC connectivity is OK, the replica will see that the primary is still alive (since the FC shares that info in the heartbeat response) and will not initiate a failover
- If the replica <-> FC connectivity is down, the replica can't promote itself even if it wanted to
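The two bullets above reduce to a small decision rule on the replica side. Here is a hedged Python sketch of that rule (not from the spec; `should_initiate_failover` and its parameters are illustrative names):

```python
def should_initiate_failover(fc_reachable, ticks_by_node, current_tick,
                             primary, n_ticks):
    """Replica-side failover decision as described in the thread:
    promote only if the FC is reachable AND the FC's heartbeat response
    shows the primary has been silent for more than N ticks."""
    if not fc_reachable:
        # Without the FC, the replica cannot promote itself at all.
        return False
    primary_tick = ticks_by_node.get(primary, 0)
    # ticks_by_node comes from the FC heartbeat response, so a broken
    # replica<->primary replication link alone never triggers a failover.
    return current_tick - primary_tick > n_ticks

# Primary last heard from at tick 1, now tick 5, threshold 3 -> failover.
assert should_initiate_failover(True, {"p": 1}, 5, "p", 3) is True
# FC unreachable -> never promote, regardless of staleness.
assert should_initiate_failover(False, {"p": 1}, 5, "p", 3) is False
```

Routing the decision through the FC-provided view (rather than the replica's own link to the primary) is what makes the broken-replication-link case above a non-event.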
### Failover
Every K milliseconds, every node sends a heartbeat to the Failover Coordinator which owns it. This heartbeat includes: |
There are many issues related to cluster failover due to long-running commands (Lua scripts, large set joins etc.). I wonder if we want to improve the heartbeat mechanism to overcome these issues. Maybe this is a separate discussion, but I think that redesigning the cluster should allow users to run long commands without having to keep adjusting the timeout values.
This is something I've mentioned before and want to address. I want the heartbeating mechanism to be pulled off the main thread of Redis so we can heartbeat to the FCs inside of commands that take single digit seconds.
It's possible to do, and won't impact the rest of the design -- but is it always desirable? I'd argue that after a certain threshold we do want to failover a primary with a non-responsive main thread.
The ideal solution would probably be to have a heartbeat thread which also monitors the main one and can decide to stop heartbeating if the main thread is unresponsive for a long enough time.
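The "heartbeat thread that monitors the main one" idea can be sketched in a few lines. This is a toy Python model, not the proposed implementation; `HeartbeatThread`, `main_thread_tick`, and `unresponsive_limit` are illustrative names:

```python
import time

class HeartbeatThread:
    """Toy model: a dedicated heartbeat sender that watches a liveness
    timestamp updated by the main thread, and deliberately goes silent
    once the main thread has been unresponsive too long -- letting the
    FC eventually declare the node unhealthy and fail it over."""
    def __init__(self, send_heartbeat, unresponsive_limit):
        self.send_heartbeat = send_heartbeat      # callable that pings the FC
        self.unresponsive_limit = unresponsive_limit  # seconds
        self.last_main_progress = time.monotonic()

    def main_thread_tick(self):
        # Called by the main loop whenever it makes progress.
        self.last_main_progress = time.monotonic()

    def maybe_heartbeat(self):
        # Called periodically on the heartbeat thread.
        stalled = time.monotonic() - self.last_main_progress
        if stalled > self.unresponsive_limit:
            return False   # stop heartbeating; allow failover to proceed
        self.send_heartbeat()
        return True
```

The key design point is that the heartbeat thread keeps the node alive through commands lasting single-digit seconds, but does not mask a truly wedged main thread.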
> It's possible to do, and won't impact the rest of the design

100% correct, it's an implementation decision.

> but is it always desirable?

I think so, we've had quite a bit of issue reliably determining the state of the execution thread in Redis since it is synchronous.

> I'd argue that after a certain threshold we do want to failover a primary with a non-responsive main thread.

This is mostly what I'm expecting. I suppose we would have two thresholds.
- No heartbeat received after X seconds: The process is bricked for some reason and we should declare it unhealthy so failover can happen.
- Too many heartbeats of the same type received for Y seconds: Something is stuck and not making progress fast enough; maybe a sync flush was executed. We should still failover eventually, but give some more room here.
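The two-threshold scheme above can be expressed as a small classifier. A minimal Python sketch, assuming illustrative names (`classify_health`, `last_hb_age`, `stalled_duration`) and that the heartbeat carries a progress indicator so the FC can measure how long the node has reported the same state:

```python
def classify_health(last_hb_age, stalled_duration, x_seconds, y_seconds):
    """Two-threshold failure detection as discussed:
    - no heartbeat for X seconds  -> process is bricked, fail over now
    - heartbeats arriving but reporting no progress for Y seconds
      -> stuck (e.g. a sync flush); fail over eventually (Y > X gives
         long-running commands more room)."""
    if last_hb_age >= x_seconds:
        return "unhealthy"   # no heartbeat at all
    if stalled_duration >= y_seconds:
        return "stuck"       # heartbeating but not making progress
    return "healthy"

assert classify_health(12, 0, x_seconds=10, y_seconds=30) == "unhealthy"
assert classify_health(1, 45, x_seconds=10, y_seconds=30) == "stuck"
```

Keeping Y larger than X encodes the intuition in the thread: a silent process is worse news than a slow one.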
Conceptually speaking, failure detection is orthogonal to the cluster V2 spec discussion, so I agree it can wait or be explored in parallel.

I think the key question in the domain of failure detection is "progress", i.e., what do we mean by "progress", especially when a script is involved? It is not a trivial problem IMO to 100% reliably determine if an arbitrary command is running in an infinite loop, or in other words, making "progress", or not. Therefore, I think there should always be an upper bound on how long a command can run.

Whether or not we need a separate thread to pump h/b is a secondary question to me, as I think the main thread is the best place to implement the logic to determine "progress". The separate thread is more like the IO threads, which can pick up the "progress" breadcrumb left by the main thread and report it with the h/b. It is just there to help reduce the work on the main thread, but the main thread could just do it on its own as well.
* Describe how ClusterV2 will be implemented based on that.
-->
#### Potential Enhancements |
I vaguely thought we discussed this at some point, but what is the authentication between data nodes and the TD/FC?
Why not merge this spec in the redis-doc repo? That's where we have the legacy cluster spec and other specs, and it'd show up on the redis.io website.
We actually have https://github.com/redis/redis-specifications for this purpose.
Yeah, I don't think we should merge this document into main redis repository. |
While Flotilla can be implemented using any sort of strongly consistent system, the reference implementation we propose
uses Redis to host the Topology Director & Failover Coordinator logic. This allows for different deployment modes depending on
the needs and orchestrating abilities of the administrator:
- Full Blown: Dedicated nodes holding the Topology Directors and multiple Failover Coordinators dividing the data nodes among them.
For Full Blown mode, do we have an option where one TD can support multiple clusters? Thus we could divide the system into two layers: computing and storage. @ushachar
Apologies for the (very) late reply - missed this comment.
It's possible to enhance the spec to support this mode, but not something that's actually planned at this time due to the additional complexity it introduces.
(In any case, it does not allow dividing the compute/storage requirement - it only reduces the control plane overhead when orchestrating lots of small DBs)
#### Failover Coordinator Tick
In order to avoid relying on multiple clocks when making failover decisions, the node relies on a monotonically increasing |
@ushachar I'm not sure if I understand the tick correctly here. Suppose we have a shard with three nodes A, B, and C.
A, B, and C send heartbeat packets to the FC every K milliseconds, including:
- Replication offset (for replicas)
- Last known cluster topology epoch
- Last known shard topology epoch
- Role (primary/replica)
The FC will record a map node->tick, such as {A:4, B:4, C:1}, assuming the tick at this time is 4, but C has not reported a heartbeat for a long time.
Then the FC adds the {A:4, B:4, C:1} field to the reply of the heartbeat packet to let the other nodes know that C has not been updated for K*(4-1) = K*3 ms.
This means the tick is only incremented on the FC and assigned to the different nodes.
It's an implementation detail, but the tick gets saved as part of the heartbeat, and not in a separate map.
If A & B both send a heartbeat within the relevant period, but C hasn't updated in a while, then yes - the heartbeat response will contain the information that A&B were heard from in tick 4, and C was heard from in tick 1.
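The tick bookkeeping described above is easy to model. A toy Python sketch (not the implementation; `FailoverCoordinator`, `advance_tick`, and `on_heartbeat` are illustrative names, and tick stamping is shown as a map only for clarity, since per the comment above it would actually be saved with each heartbeat):

```python
class FailoverCoordinator:
    """Toy model: the FC owns a single monotonically increasing tick and
    stamps each node's latest heartbeat with it, so replicas can reason
    about staleness without comparing wall clocks across machines."""
    def __init__(self):
        self.tick = 0
        self.last_seen = {}   # node -> tick of its most recent heartbeat

    def advance_tick(self):
        self.tick += 1        # driven by the FC's own timer, every K ms

    def on_heartbeat(self, node):
        self.last_seen[node] = self.tick
        # The heartbeat response carries everyone's last-seen ticks.
        return dict(self.last_seen)

fc = FailoverCoordinator()
fc.advance_tick()            # tick = 1
fc.on_heartbeat("C")         # C last seen at tick 1
for _ in range(3):
    fc.advance_tick()        # tick = 4
fc.on_heartbeat("A")
reply = fc.on_heartbeat("B")
assert reply == {"A": 4, "B": 4, "C": 1}   # matches the example above
```

Because only the FC's tick counter advances the clock, a node that sees `C:1` in a tick-4 response knows C has been silent for roughly K*3 ms, regardless of C's local clock.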
This PR reworks the clustering code to support multiple clustering implementations: specifically, the current "legacy" clustering implementation or, although not part of this PR, Flotilla (see #10875). Which implementation is used could be a compile-time flag (will be added later). Legacy clustering functionality remains unchanged.

The basic idea is as follows. The header cluster.h now contains function declarations that define the "Cluster API." These are the contract and interface between any clustering implementation and the rest of the Redis source code. Some of the function definitions are shared between all clustering implementations; these functions are in cluster.c. The functions and data structures specific to legacy clustering are in cluster-legacy.c/h. One consequence of this is that the structs clusterNode and clusterState, which were previously "public" to the rest of Redis, are now hidden behind the Cluster API.

The PR is divided up into commits, each with a commit message explaining the changes. Some are just mass renames or moving code between files (may not require close inspection / review); others are manual changes. One other, related change is:

- The "failover" command is now plugged into the Cluster API so that the clustering implementation can (a) enable/disable the command to begin with and, if enabled, (b) perform the actual failover. The "failover" command remains disabled for legacy clustering.
Uri Shachar seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
The new approach (Fully outlined in CLUSTER.md) has been reviewed by the core Redis team and is now ready for a wider community review.
It is proposed in order to:
entire clustering functionality at compile time.
This new approach is:
administrative commands (SETSLOT, FORGET, MEET,...)
And it is intended to fully replace the current clustering implementation after a deprecation period in which the two
implementations will live side-by-side with a compile-time flag to choose between the two.
Currently drawings are placed under /images (both source and rendered jpgs).