Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Improved multinode proxy #249

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

fix: Improved multinode proxy #249

wants to merge 6 commits into from

Conversation

LexLuthr
Copy link
Contributor

@LexLuthr LexLuthr commented Oct 4, 2024

Fixes #243 and #200
This PR bring in the changes from filecoin-project/lotus#11470 along with addressing some comments.

This PR also removes the forced check to ensure at least 2 chain nodes are reachable. We don't require this check as there is no network segregation or split-brain issues here. Any node will work as long as it is in sync. This check also has some unintended affects of rendering Curio cluster useless if a node goes down out of 2.

Copy link
Collaborator

@magik6k magik6k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be really good to have a way to expose backing node health in the webui somehow, even just the ones on the node serving the webui

@LexLuthr
Copy link
Contributor Author

LexLuthr commented Oct 4, 2024

Would be really good to have a way to expose backing node health in the webui somehow, even just the ones on the node serving the webui

That would be a good idea. Let me do that.

@LexLuthr LexLuthr linked an issue Oct 4, 2024 that may be closed by this pull request
@LexLuthr LexLuthr requested a review from magik6k October 4, 2024 12:08
Copy link
Collaborator

@magik6k magik6k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ErrorIsIn is a little sketchy, though I remember writing / reviewing it very deeply, so it's prooobably fine, but would be amazing to confirm.

deps/apiinfo.go Outdated Show resolved Hide resolved
deps/apiinfo.go Show resolved Hide resolved
@johansealstorage
Copy link

The multi node switching is working now.
Testing with 2 nodes:

  1. Switched off one node, curio switches to using the other healthy node.
  2. Start up the node that was switched off, and then switched off the second node. Curio continues to work fine by picking up the healthy node.
  3. Out of sync node, curio still operates with the one healthy node.

Testing with 3 nodes showed the same results as with 2 nodes. As long as there is at least one healthy node, curio continues to work fine.

Some log where 2 of the 3 nodes were stopped shows that curio still continued working after successfully switching to the last healthy chain node.

  |   | 2024-10-15 11:28:51.341 | 2024-10-15T09:28:51.139Z	WARN	webrpc	webrpc/sync_state.go:140	error creating jsonrpc client: cannot dial address ws://10.4.8.201:9001/rpc/v1 for dial tcp 10.4.8.201:9001: connect: no route to host: dial tcp 10.4.8.201:9001: connect: no route to host |  
  |   | 2024-10-15 11:28:52.093 | 2024-10-15T09:28:52.012Z	INFO	harmonytask	harmonytask/task_type_handler.go:182	Beginning work on Task	{"id": 3213004, "from": "poller", "name": "WinInclCheck"} |  
  |   | 2024-10-15 11:28:52.093 | 2024-10-15T09:28:52.034Z	WARN	webrpc	webrpc/sync_state.go:140	error creating jsonrpc client: cannot dial address ws://10.4.10.201:9001/rpc/v1 for dial tcp 10.4.10.201:9001: connect: no route to host: dial tcp 10.4.10.201:9001: connect: no route to host |  
  |   | 2024-10-15 11:28:52.117 | 2024-10-15T09:28:51.870Z	ERROR	curio/chain	deps/apiinfo.go:277	Failed after 5 attempts, last error: RPCConnectionError |  
  |   | 2024-10-15 11:28:52.117 | 2024-10-15T09:28:51.870Z	ERROR	harmonytask	harmonytask/task_type_handler.go:229	Do() returned error	{"type": "WinInclCheck", "id": "3213004", "error": "RPCConnectionError"} |  
  |   | 2024-10-15 11:28:52.117 | 2024-10-15T09:28:52.013Z	INFO	harmonytask	harmonytask/task_type_handler.go:157	did not accept task	{"task_id": "3213004", "reason": "already Taken", "name": "WinInclCheck"} |  
  |   | 2024-10-15 11:28:52.344 | 2024-10-15T09:28:52.138Z	DEBUG	harmonytask	harmonytask/harmonytask.go:360	skipped scheduling WinInclCheck type tasks on due to Did not accept WinInclCheck task: at max already |  
  |   | 2024-10-15 11:29:01.366 | 2024-10-15T09:29:01.256Z	INFO	harmonytask	harmonytask/task_type_handler.go:157	did not accept task	{"task_id": "3213805", "reason": "already Taken", "name": "WinPost"}

@johansealstorage
Copy link

Tested with the latest UI fix and all Chain nodes are now showing as expected on the UI. The health of each of then also follows the out of sync and the reachability. Curio switches nicely between the available ones.

One situation that might be confusing is if a new API is added to the base layer. The UI starts tracking its health, but curio in the background will not use that node until curio is restarted. That might be a bit of an out of sync situation of what the user sees, vs what curio is actually using. After restart of curio, everything comes into sync again and the UI agrees with what curio uses.

@LexLuthr LexLuthr enabled auto-merge (squash) October 15, 2024 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multiple Nodes support in Curio Chain Node failover not working
4 participants