Add specification for a new cluster implementation. #10875
base: unstable
Conversation
HLD still seems good to me, just commenting about the details. I haven't had much time to really think about it until today.
- Has a consistent source of truth for all cluster metadata
- Requires the shards to reconcile their state according to the above source
- Maintains the ability to scale down to a bare minimum 3 redis-server processes
- Is called Flotilla :)
Can we drop the flotilla name now? I think it will just cause confusion moving forward.
@@ -0,0 +1,450 @@
# Redis Cluster v2 Proposal
Still no conversation about how to keep Functions consistent across the cluster. I think that is still an important item to try to solve. It doesn't sound all that difficult to store it in the TDs, but it will require interleaving functions with replication traffic. (Mixing CP and DP functionality, which isn't great.)
This is a difficult one.
If we perceive functions (and probably Lua as well) as configuration, then it makes sense to use the TD/FC layers to distribute them.
It might be worth getting some actual usage stats on Lua (I doubt there's enough on Functions at the moment) - would current usage easily lend itself to a 'configuration' pattern of infrequent changes?
I don't think Functions are an essential part of the cluster V2 design. Given that Functions are non-sharded user data, I feel they warrant their own host servers so they can scale independently. A container-registry-like design could be an option IMO.
Unlike EVAL (which is the user's app responsibility), we should think of Functions as lightweight Modules.
They extend the API of the database and should be propagated to all nodes at function registration time, and later when new nodes join.
Unlike EVAL, the user's app shouldn't be the one that loads them, but rather the admin, so either the admin tools or the cluster itself must be responsible for making sure they exist on all nodes.
I suppose that the difference between functions and modules in that respect is that for modules we don't want to offer any code to distribute the binaries to different hosts and load them (we should leave that to some orchestration system that deploys the software and creates the config files), but for functions we want a lower entry barrier.
We need to think about this more, and when we do, let's consider getting ACL into that mix too.
> Given that Functions are non-sharded user data, I feel they warrant their own host servers so they can scale independently.

I think this was just a choice made out of practicality. We wanted functions in 7, but we didn't really have a good way to expose "non-sharded" Redis data outside of asking users to apply them onto all nodes. I think if we had time, we would have tried to solve this problem, but it's hard. I think cluster v2 should at least be able to solve this problem.
I think propagating by the control path, in this case, is probably what we want. In a lot of ways it is like an external operator going in and calling a command on every node in the cluster and making sure it applied correctly. The expectation should be that functions are low throughput. I would even posit we could be aggressive and say you can only propagate one function at a time, just to artificially slow down the rate of propagation.
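To make the "external operator" analogy concrete, here is a toy Python sketch of control-path propagation of a single function at a time, with verification before moving to the next node. It is not from the spec; `Node`, `load_function`, and `propagate_function` are all illustrative names.

```python
# Hypothetical sketch: propagate one function through the control path,
# like an operator calling a command on every node and checking it applied.
class Node:
    def __init__(self, name):
        self.name = name
        self.functions = {}  # registered function name -> body

    def load_function(self, fname, body):
        self.functions[fname] = body
        return True

def propagate_function(nodes, fname, body):
    """Apply one function to every node sequentially and verify it applied
    before moving on -- naturally rate-limiting the propagation."""
    for node in nodes:
        ok = node.load_function(fname, body)
        if not ok or node.functions.get(fname) != body:
            raise RuntimeError(f"function {fname} failed to apply on {node.name}")

nodes = [Node("a"), Node("b"), Node("c")]
propagate_function(nodes, "myfunc", "return 1")
assert all(n.functions["myfunc"] == "return 1" for n in nodes)
```

Propagating sequentially (rather than fanning out) keeps the control plane simple and bounds the blast radius of a bad function, at the cost of slower registration.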
Treating them as configuration makes a lot of sense to me, but since we don't have sufficient knowledge on how functions will be used in the wild - maybe defer the decision?
In the next couple of months we may see a more 'lua-ish' usage which would then completely break once we GA Flotilla.
Maybe we could defer that decision, but I'd like to point out two things:
First, we should be the ones to decide how functions are used in the wild, and adding mechanisms around their propagation could assist. Standing aside to see what users would do is likely to promote misuse.
Of course, in the context of Flotilla the above statement is not relevant, since it's not yet available to users, so indeed we can take the time. (I just like to avoid taking a similar approach elsewhere.)
Secondly, arguably sooner or later we'll need to propagate some (non-cluster related) configuration (be it ACL, Functions or something else) through the cluster, so I think it may be better to include this topic in the design at an early stage.
Agree wrt ACLs and other configs - will add to the design.
+1 on adding a section for propagating configuration in general.
Very interesting!!!
I read the rendered doc (... -> View file) and spotted some Markdown formatting issues.
Elaborate on fencing behavior Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
Fix typo Co-authored-by: Viktor Söderqvist <viktor@zuiderkwast.se>
Typo fix Co-authored-by: Tian <skylypig@gmail.com>
I just took a quick look at the new design for cluster V2, and I'll need some time to digest it. But I have one question: it looks like no concept or plan for supporting multi-tenancy or multiple users is mentioned. Could someone confirm this? Or is there no plan to support multi-tenancy in the open source version? Thanks
Redis already supports multiple users, and the new design does not alter that. If by multi-tenancy you mean multi-database (with the SELECT command), that's only supported with unclustered Redis, and this new spec does not talk about adding it to clustered mode at the moment.
Thanks for the clarification.
We do intend for this system to support un-sharded configurations though, in which we could / would support multiple databases and SELECT (and terminate Sentinel some day).
After a rough read, the point that most concerns me is that the two-tier architecture seems too flexible: users may find it hard to understand, and may abuse the deployment options. There are five deployment modes provided, but users may have difficulty choosing -- which one is best for my scenario?

It also brings complexity into the implementation. For instance, we should support transitions between the five modes, which means there are 4*5=20 cases to consider; maybe some can be eliminated, but it is still a substantial number. Besides, some optimizations may only affect certain deployment modes. For example, when the FC and TD are deployed together, the Raft group may be shared between the two modules, but not otherwise. So whether we should do this optimization is undetermined, since it's only available for some deployments.

If we put the FC and TD in one deployment unit by design, there would be far fewer variants, and only two deployment modes would be exposed to users: Native and Midrange. For a beginner, a fast setup, or a small cluster, Native is sufficient; for advanced users, Midrange could be a better choice. There is no puzzle over which one to take. With Raft membership changes and configuration changes, we can easily implement the transition between the two modes, as well as add or remove a control plane node.
@ushachar I reviewed all the discussions and the description of the cluster v2 spec. I have one concern (though you could argue this is a bonus for the design): a blocker arises when clients want to migrate data from non-cluster mode to cluster mode. For data with the same name belonging to two different users, there is no clean way to store both in the cluster-mode Redis instance, which creates trouble. I think one way to support multi-tenancy is using a namespace or a similar concept.
I consider this out of scope for this design - it requires changes in additional areas and this is a big enough task as-is...

### Roles
While the Flotilla spec mostly discusses data nodes (and not Failover Coordinator & Topology Director implementation), in our reference implementation |
I agree that we can discuss the control plane implementation separately. But I fear there might be some influence on the data plane design.
For example, handling of FC and TD failovers: will we maintain a replica for the FC and TD? If so, can we face a condition of epoch downgrading?
Administrators often place replicas in separate geo-locations for survivability - maybe we should consider having a replica-FC to better support these cases? (Same goes for the TD.)
This PR does not address this aspect of the TD/FC internal implementation other than specifying they need to be CP entities (so epoch downgrades cannot happen).
In the upcoming reference implementation we'll have a module which incorporates Raft with 3+ instances to achieve the required guarantees and provide the ability for an admin to use different AZs for HA.
- The heartbeat results from all the nodes in the same shard (and their 'tick' value)
- If needed, the updated cluster topology.
If a replica node detects that its primary has not sent a heartbeat for more than N ticks it checks if it is the most |
In case the replica-primary keepalive is broken BUT the primary-FC heartbeat is OK - will this also be validated by the FC?
There are cases where the replication stream is broken (bug/route issues etc.) but that does not mean we should failover...
The FC does not validate anything in the heartbeat beyond epoch correctness -- it's the responsibility of the individual nodes to trigger failovers.
In the case you outlined:
- If the replica <-> FC connectivity is OK, the replica will see that the primary is still alive (since the FC shares that info in the heartbeat response) and will not initiate a failover
- If the replica <-> FC connectivity is down, the replica can't promote itself even if it wanted to
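The two bullets above reduce to a small decision rule on the replica side. Here is a hedged Python sketch of that rule (not from the spec; `should_initiate_failover` and its parameters are illustrative names):

```python
def should_initiate_failover(fc_reachable, ticks_by_node, current_tick,
                             primary, n_ticks):
    """Replica-side failover decision as described in the thread:
    promote only if the FC is reachable AND the FC's heartbeat response
    shows the primary has been silent for more than N ticks."""
    if not fc_reachable:
        # Without the FC, the replica cannot promote itself at all.
        return False
    primary_tick = ticks_by_node.get(primary, 0)
    # ticks_by_node comes from the FC heartbeat response, so a broken
    # replica<->primary replication link alone never triggers a failover.
    return current_tick - primary_tick > n_ticks

# Primary last heard from at tick 1, now tick 5, threshold 3 -> failover.
assert should_initiate_failover(True, {"p": 1}, 5, "p", 3) is True
# FC unreachable -> never promote, regardless of staleness.
assert should_initiate_failover(False, {"p": 1}, 5, "p", 3) is False
```

Routing the decision through the FC-provided view (rather than the replica's own link to the primary) is what makes the broken-replication-link case above a non-event.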
### Failover
Every K milliseconds, every node sends a heartbeat to the Failover Coordinator which owns it. This heartbeat includes: |
There are many issues related to cluster failover due to long-running commands (Lua scripts, large set joins etc.). I wonder if we want to improve the heartbeat mechanism to overcome these issues. Maybe this is a separate discussion, but I think that redesigning the cluster should allow users to run long commands without having to keep adjusting the timeout values.
This is something I've mentioned before and want to address. I want the heartbeating mechanism to be pulled off the main thread of Redis so we can heartbeat to the FCs inside of commands that take single digit seconds.
It's possible to do, and won't impact the rest of the design -- but is it always desirable? I'd argue that after a certain threshold we do want to failover a primary with a non-responsive main thread.
The ideal solution would probably be to have a heartbeat thread which also monitors the main one and can decide to stop heartbeating if the main thread is unresponsive for a long enough time.
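The "heartbeat thread that monitors the main one" idea can be sketched in a few lines. This is a toy Python model, not the proposed implementation; `HeartbeatThread`, `main_thread_tick`, and `unresponsive_limit` are illustrative names:

```python
import time

class HeartbeatThread:
    """Toy model: a dedicated heartbeat sender that watches a liveness
    timestamp updated by the main thread, and deliberately goes silent
    once the main thread has been unresponsive too long -- letting the
    FC eventually declare the node unhealthy and fail it over."""
    def __init__(self, send_heartbeat, unresponsive_limit):
        self.send_heartbeat = send_heartbeat      # callable that pings the FC
        self.unresponsive_limit = unresponsive_limit  # seconds
        self.last_main_progress = time.monotonic()

    def main_thread_tick(self):
        # Called by the main loop whenever it makes progress.
        self.last_main_progress = time.monotonic()

    def maybe_heartbeat(self):
        # Called periodically on the heartbeat thread.
        stalled = time.monotonic() - self.last_main_progress
        if stalled > self.unresponsive_limit:
            return False   # stop heartbeating; allow failover to proceed
        self.send_heartbeat()
        return True
```

The key design point is that the heartbeat thread keeps the node alive through commands lasting single-digit seconds, but does not mask a truly wedged main thread.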
> It's possible to do, and won't impact the rest of the design

100% correct, it's an implementation decision.

> but is it always desirable?

I think so, we've had quite a bit of issue reliably determining the state of the execution thread in Redis since it is synchronous.

> I'd argue that after a certain threshold we do want to failover a primary with a non-responsive main thread.

This is mostly what I'm expecting. I suppose we would have two thresholds.
- No heartbeat received after X seconds: The process is bricked for some reason and we should declare it unhealthy so failover can happen.
- Too many heartbeats of the same type received for Y seconds: Something is stuck and not making progress fast enough; maybe a sync flush was executed. We should still failover eventually, but give some more room here.
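The two-threshold scheme above can be expressed as a small classifier. A minimal Python sketch, assuming illustrative names (`classify_health`, `last_hb_age`, `stalled_duration`) and that the heartbeat carries a progress indicator so the FC can measure how long the node has reported the same state:

```python
def classify_health(last_hb_age, stalled_duration, x_seconds, y_seconds):
    """Two-threshold failure detection as discussed:
    - no heartbeat for X seconds  -> process is bricked, fail over now
    - heartbeats arriving but reporting no progress for Y seconds
      -> stuck (e.g. a sync flush); fail over eventually (Y > X gives
         long-running commands more room)."""
    if last_hb_age >= x_seconds:
        return "unhealthy"   # no heartbeat at all
    if stalled_duration >= y_seconds:
        return "stuck"       # heartbeating but not making progress
    return "healthy"

assert classify_health(12, 0, x_seconds=10, y_seconds=30) == "unhealthy"
assert classify_health(1, 45, x_seconds=10, y_seconds=30) == "stuck"
```

Keeping Y larger than X encodes the intuition in the thread: a silent process is worse news than a slow one.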
Conceptually speaking, failure detection is orthogonal to the cluster V2 spec discussion, so I agree it can wait or be explored in parallel.

I think the key question in the domain of failure detection is "progress", i.e., what do we mean by "progress", especially when a script is involved? It is not a trivial problem IMO to 100% reliably determine if an arbitrary command is running in an infinite loop, or in other words, making "progress", or not. Therefore, I think there should always be an upper bound on how long a command can run.

Whether or not we need a separate thread to pump h/b is a secondary question to me, as I think the main thread is the best place to implement the logic to determine "progress". The separate thread is more like the IO threads, which can pick up the "progress" breadcrumb left by the main thread and report it with the h/b. It is just there to help reduce the work on the main thread, but the main thread could just do it on its own as well.
* Describe how ClusterV2 will be implemented based on that.
-->
#### Potential Enhancements |
I vaguely thought we discussed this at some point, but what is the authentication between data nodes and the TD/FC?
Why not merge this spec in the redis-doc repo? That's where we have the legacy cluster spec and other specs, and it'd show up on the redis.io website.
We actually have https://github.com/redis/redis-specifications for this purpose.
Yeah, I don't think we should merge this document into main redis repository. |
While Flotilla can be implemented using any sort of strongly consistent system, the reference implementation we propose
uses Redis to host the Topology Director & Failover Coordinator logic. This allows for different deployment modes depending on
the needs and orchestrating abilities of the administrator:
- Full Blown: Dedicated nodes holding the Topology Directors and multiple Failover Coordinators dividing the data nodes among them.
For Full Blown mode, do we have an option where one TD can support multiple clusters? Thus we could divide the system into two layers: computing and storage. @ushachar
Apologies for the (very) late reply - missed this comment.
It's possible to enhance the spec to support this mode, but not something that's actually planned at this time due to the additional complexity it introduces.
(In any case, it does not allow dividing the compute/storage requirement - it only reduces the control plane overhead when orchestrating lots of small DBs)
#### Failover Coordinator Tick
In order to avoid relying on multiple clocks when making failover decisions, the node relies on a monotonically increasing |
@ushachar I'm not sure if I understand the tick correctly here. Suppose we have a shard with three nodes A, B, and C.
A, B, and C send heartbeat packets to the FC every K milliseconds, including:
- Replication offset (for replicas)
- Last known cluster topology epoch
- Last known shard topology epoch
- Role (primary/replica)
The FC will record a map node->tick, such as {A:4, B:4, C:1}, assuming the tick at this time is 4, but C has not reported a heartbeat for a long time.
Then the FC adds the {A:4, B:4, C:1} field to the reply of the heartbeat packet to let the other nodes know that C has not been updated for K*(4-1) = K*3 ms.
This means the tick is only incremented on the FC and assigned to the different nodes.
It's an implementation detail, but the tick gets saved as part of the heartbeat, and not in a separate map.
If A & B both send a heartbeat within the relevant period, but C hasn't updated in a while, then yes - the heartbeat response will contain the information that A&B were heard from in tick 4, and C was heard from in tick 1.
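The tick bookkeeping described above is easy to model. A toy Python sketch (not the implementation; `FailoverCoordinator`, `advance_tick`, and `on_heartbeat` are illustrative names, and tick stamping is shown as a map only for clarity, since per the comment above it would actually be saved with each heartbeat):

```python
class FailoverCoordinator:
    """Toy model: the FC owns a single monotonically increasing tick and
    stamps each node's latest heartbeat with it, so replicas can reason
    about staleness without comparing wall clocks across machines."""
    def __init__(self):
        self.tick = 0
        self.last_seen = {}   # node -> tick of its most recent heartbeat

    def advance_tick(self):
        self.tick += 1        # driven by the FC's own timer, every K ms

    def on_heartbeat(self, node):
        self.last_seen[node] = self.tick
        # The heartbeat response carries everyone's last-seen ticks.
        return dict(self.last_seen)

fc = FailoverCoordinator()
fc.advance_tick()            # tick = 1
fc.on_heartbeat("C")         # C last seen at tick 1
for _ in range(3):
    fc.advance_tick()        # tick = 4
fc.on_heartbeat("A")
reply = fc.on_heartbeat("B")
assert reply == {"A": 4, "B": 4, "C": 1}   # matches the example above
```

Because only the FC's tick counter advances the clock, a node that sees `C:1` in a tick-4 response knows C has been silent for roughly K*3 ms, regardless of C's local clock.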
This PR reworks the clustering code to support multiple clustering implementations: specifically, the current "legacy" clustering implementation or, although not part of this PR, Flotilla (see #10875). Which implementation is used could be a compile-time flag (will be added later). Legacy clustering functionality remains unchanged.

The basic idea is as follows. The header cluster.h now contains function declarations that define the "Cluster API." These are the contract and interface between any clustering implementation and the rest of the Redis source code. Some of the function definitions are shared between all clustering implementations; these functions are in cluster.c. The functions and data structures specific to legacy clustering are in cluster-legacy.c/h. One consequence of this is that the structs clusterNode and clusterState, which were previously "public" to the rest of Redis, are now hidden behind the Cluster API.

The PR is divided up into commits, each with a commit message explaining the changes. Some are just mass renames or moving code between files (may not require close inspection / review); others are manual changes. One other, related change is:

- The "failover" command is now plugged into the Cluster API so that the clustering implementation can (a) enable/disable the command to begin with and, if enabled, (b) perform the actual failover. The "failover" command remains disabled for legacy clustering.
Uri Shachar seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
The new approach (Fully outlined in CLUSTER.md) has been reviewed by the core Redis team and is now ready for a wider community review.
It is proposed in order to:
entire clustering functionality at compile time.
This new approach is:
administrative commands (SETSLOT, FORGET, MEET,...)
And it is intended to fully replace the current clustering implementation after a deprecation period in which the two
implementations will live side-by-side with a compile-time flag to choose between the two.
Currently drawings are placed under /images (both source and rendered jpgs).