Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not attempt to acquire missing data from peer network in #4458

Merged
merged 1 commit into from
Mar 15, 2023

Conversation

mtrippled
Copy link
Collaborator

@mtrippled mtrippled commented Mar 9, 2023

reporting mode.

High Level Overview of Change

This should fix a problem of dumping core when Cassandra is down for reporting mode servers.

Context of Change

Type of Change

  • [X ] Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (non-breaking change that only restructures code)
  • Tests (You added tests for code that already exists, or your new feature included in this PR)
  • Documentation Updates
  • Release

Copy link
Collaborator

@thejohnfreeman thejohnfreeman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the key part is the conditional throw added at line 73 in src/ripple/shamap/impl/NodeFamily.cpp. Can we just make that a std::runtime_error with the information available, without adding any new parameters, enumerations, or constructors?

src/ripple/shamap/impl/NodeFamily.cpp Outdated Show resolved Hide resolved
src/ripple/shamap/impl/NodeFamily.cpp Outdated Show resolved Hide resolved
src/ripple/shamap/SHAMapMissingNode.h Outdated Show resolved Hide resolved
src/ripple/shamap/Family.h Outdated Show resolved Hide resolved
@mtrippled
Copy link
Collaborator Author

@thejohnfreeman Unfortunately, I discarded the original stack trace I was looking at when I made this patch, so i forget the exact code path being traversed. But assumedly it had to do with an RPC request based on a particular ledger, which then goes into the cassandra-backed nodestore.

@mtrippled
Copy link
Collaborator Author

mtrippled commented Mar 13, 2023

@thejohnfreeman
I found a stack trace that looks a lot like what I saw before writing this patch, it goes like this for the crashing thread:

Thread 1 (Thread 0x7f403ffff700 (LWP 1016441)):
#0  0x0000000001b139a9 in ripple::PeerSetImpl::addPeers(unsigned long, std::function<bool (std::shared_ptr<ripple::Peer> const&)>, std::function<void (std::shared_ptr<ripple::Peer> const&)>) (this=this@entry=0x7f3f1a9b9af0, limit=5, hasItem=..., onPeerAdded=...) at ../../src/ripple/overlay/impl/PeerSet.cpp:74
        overlay = <error reading variable>
        pairs = {<std::_Vector_base<std::pair<int, std::shared_ptr<ripple::Peer> >, std::allocator<std::pair<int, std::shared_ptr<ripple::Peer> > > >> = {_M_impl = {<std::allocator<std::pair<int, std::shared_ptr<ripple::Peer> > >> = {<__gnu_cxx::new_allocator<std::pair<int, std::shared_ptr<ripple::Peer> > >> = {<No data fields>}, <No data fields>}, _M_start = 0x0, _M_finish = 0x0, _M_end_of_storage = 0x0}}, <No data fields>}
        accepted = <optimized out>
#1  0x000000000163472c in ripple::InboundLedger::addPeers (this=<optimized out>) at /usr/include/c++/8/new:169
No locals.
#2  0x0000000001637754 in ripple::InboundLedger::init (this=this@entry=0x7f4030155c90, collectionLock=...) at ../../src/ripple/app/ledger/impl/InboundLedger.cpp:151
        sl = {_M_device = 0x7f4030155ca8, _M_owns = true}
#3  0x0000000001644b18 in ripple::InboundLedgersImp::acquire (this=0x52b5180, hash=..., seq=78311747, reason=<optimized out>) at ../../src/ripple/app/ledger/impl/InboundLedgers.cpp:108
        sl = {_M_device = 0x52b51d8, _M_owns = false}
        it = <optimized out>
        isNew = true
        inbound = {<std::__shared_ptr<ripple::InboundLedger, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<ripple::InboundLedger, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7f4030155c90, _M_refcount = {_M_pi = 0x7f4030155c80}}, <No data fields>}
#4  0x0000000001d1aced in ripple::NodeFamily::acquire (this=0x518a990, hash=..., seq=78311747) at ../../src/ripple/shamap/impl/NodeFamily.cpp:104
No locals.
#5  0x0000000001d1af38 in ripple::NodeFamily::missingNode (this=0x518a990, seq=78311747) at ../../src/ripple/shamap/impl/NodeFamily.cpp:85
        lock = {_M_device = 0x518a9d8, _M_owns = false}
#6  0x0000000001d21836 in ripple::SHAMap::finishFetch (this=0x7f4088147230, hash=..., object=...) at ../../src/ripple/shamap/impl/SHAMap.cpp:181
        node = {<std::__shared_ptr<ripple::SHAMapTreeNode, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<ripple::SHAMapTreeNode, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7f4030028640, _M_refcount = {_M_pi = 0x7f4030028630}}, <No data fields>}
#7  0x0000000001d218a7 in ripple::SHAMap::fetchNodeFromDB (this=0x7f4088147230, hash=...) at ../../src/ripple/shamap/impl/SHAMap.cpp:167
        obj = {<std::__shared_ptr<ripple::NodeObject, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<ripple::NodeObject, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x0, _M_refcount = {_M_pi = 0x0}}, <No data fields>}
#8  0x0000000001d21954 in ripple::SHAMap::fetchNodeNT (this=0x7f4088147230, hash=...) at ../../src/ripple/shamap/impl/SHAMap.cpp:265
        node = {<std::__shared_ptr<ripple::SHAMapTreeNode, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<ripple::SHAMapTreeNode, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x0, _M_refcount = {_M_pi = 0x0}}, <No data fields>}
#9  0x0000000001d219b6 in ripple::SHAMap::fetchNode (this=0x7f4088147230, hash=...) at ../../src/ripple/shamap/impl/SHAMap.cpp:274
        node = {<std::__shared_ptr<ripple::SHAMapTreeNode, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<ripple::SHAMapTreeNode, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x0, _M_refcount = {_M_pi = 0x0}}, <No data fields>}
#10 0x0000000001d21a95 in ripple::SHAMap::descend (this=0x7f4088147230, parent=..., branch=5) at ../../src/ripple/shamap/impl/SHAMap.cpp:329
        node = {<std::__shared_ptr<ripple::SHAMapTreeNode, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<ripple::SHAMapTreeNode, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x0, _M_refcount = {_M_pi = 0x0}}, <No data fields>}
#11 0x0000000001d21bba in ripple::SHAMap::descendThrow (this=0x7f4088147230, parent=..., branch=5) at ../../src/ripple/shamap/impl/SHAMap.cpp:297
        ret = {<std::__shared_ptr<ripple::SHAMapTreeNode, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<ripple::SHAMapTreeNode, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x0, _M_refcount = {_M_pi = 0x0}}, <No data fields>}
#12 0x0000000001d21d8a in ripple::SHAMap::walkTowardsKey (this=0x7f4088147230, id=..., stack=0x7f3f236f9700) at ../../src/ripple/shamap/impl/SHAMap.cpp:144
        inner = {<std::__shared_ptr<ripple::SHAMapInnerNode, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<ripple::SHAMapInnerNode, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7f4030028640, _M_refcount = {_M_pi = 0x7f4030028630}}, <No data fields>}
        branch = 5
        inNode = {<std::__shared_ptr<ripple::SHAMapTreeNode, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<ripple::SHAMapTreeNode, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7f4030028640, _M_refcount = {_M_pi = 0x7f4030028630}}, <No data fields>}
        nodeID = {<ripple::CountedObject<ripple::SHAMapNodeID>> = {<No data fields>}, id_ = {static WIDTH = 8, data_ = {_M_elems = {3513813340, 0, 0, 0, 0, 0, 0, 0}}, static bytes = 32}, depth_ = 8}
#13 0x0000000001d22cd6 in ripple::SHAMap::upper_bound (this=0x7f4088147230, id=...) at ../../src/ripple/shamap/impl/SHAMap.cpp:612
        stack = {c = {<std::_Deque_base<std::pair<std::shared_ptr<ripple::SHAMapTreeNode>, ripple::SHAMapNodeID>, std::allocator<std::pair<std::shared_ptr<ripple::SHAMapTreeNode>, ripple::SHAMapNodeID> > >> = {_M_impl = {<std::allocator<std::pair<std::shared_ptr<ripple::SHAMapTreeNode>, ripple::SHAMapNodeID> >> = {<__gnu_cxx::new_allocator<std::pair<std::shared_ptr<ripple::SHAMapTreeNode>, ripple::SHAMapNodeID> >> = {<No data fields>}, <No data fields>}, _M_map = 0x7f403002df50, _M_map_size = 8, _M_start = {_M_cur = 0x7f40302b0400, _M_first = 0x7f40302b0400, _M_last = 0x7f40302b05f8, _M_node = 0x7f403002df68}, _M_finish = {_M_cur = 0x7f4030583e80, _M_first = 0x7f4030583e80, _M_last = 0x7f4030584078, _M_node = 0x7f403002df70}}}, <No data fields>}}
#14 0x000000000161bafa in ripple::Ledger::succ (this=0x7f408871d870, key=..., last=...) at ../../src/ripple/app/ledger/Ledger.cpp:423

So the way I implemented this was when I saw that SHAMap::fetchNode throws SHAMapMissingNode if it can't find something, I then decided to make the lower-level call of missingNode throw the same thing. This is in hopes that as it the exception bubbles up, it gets handled sensibly.

But I didn't look at all the possible ways it bubbles up, because there are lots of callers for this code path.

@intelliot intelliot added this to the 1.10.1 milestone Mar 14, 2023
@mtrippled
Copy link
Collaborator Author

@thejohnfreeman I found a good place to catch this error, and simplified things so all your concerns should be addressed.

@thejohnfreeman
Copy link
Collaborator

We're getting closer, and I appreciate the change. Now notice this: We've changed the signature of missingNodeAcquireBySeq to accept a uint256 nodeHash, that is only used in the error message that is conditionally thrown at the top of the function, which is only called from one place, SHAMap::finishFetch. The fact it is only called in one place is what lets us easily prove that every std::exception it throws is caught. It would be nice if we could revert all the non-documentation changes to the missingNode overload set and just handle the error in SHAMap::finishFetch before it calls missingNode, but unfortunately Application is not available there to obtain the Config.

@mtrippled
Copy link
Collaborator Author

We're getting closer, and I appreciate the change. Now notice this: We've changed the signature of missingNodeAcquireBySeq to accept a uint256 nodeHash, that is only used in the error message that is conditionally thrown at the top of the function, which is only called from one place, SHAMap::finishFetch. The fact it is only called in one place is what lets us easily prove that every std::exception it throws is caught. It would be nice if we could revert all the non-documentation changes to the missingNode overload set and just handle the error in SHAMap::finishFetch before it calls missingNode, but unfortunately Application is not available there to obtain the Config.

Yeah, it's all down to that.

In general, I think singletons like Application, Config, Journal, should be globals. But that's for another patch.

@mtrippled mtrippled added the Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. label Mar 15, 2023
@intelliot intelliot merged commit f7b3ddd into XRPLF:develop Mar 15, 2023
ximinez added a commit to ximinez/rippled that referenced this pull request Mar 15, 2023
* upstream/develop:
  Rectify the import paths of boost/iterator: (XRPLF#4293)
  Allow port numbers be be specified with a colon: (XRPLF#4328)
  Use <=> operator for base_uint, Issue, and Book: (XRPLF#4411)
  Reporting Mode: Do not attempt to acquire missing data from peer network (XRPLF#4458)
ximinez added a commit to ximinez/rippled that referenced this pull request Mar 15, 2023
* upstream/develop:
  Rectify the import paths of boost/iterator: (XRPLF#4293)
  Allow port numbers be be specified with a colon: (XRPLF#4328)
  Use <=> operator for base_uint, Issue, and Book: (XRPLF#4411)
  Reporting Mode: Do not attempt to acquire missing data from peer network (XRPLF#4458)
ximinez added a commit to ximinez/rippled that referenced this pull request Mar 15, 2023
* upstream/develop:
  Rectify the import paths of boost/iterator: (XRPLF#4293)
  Allow port numbers be be specified with a colon: (XRPLF#4328)
  Use <=> operator for base_uint, Issue, and Book: (XRPLF#4411)
  Reporting Mode: Do not attempt to acquire missing data from peer network (XRPLF#4458)
ximinez added a commit to ximinez/rippled that referenced this pull request Mar 15, 2023
* upstream/develop:
  Rectify the import paths of boost/iterator: (XRPLF#4293)
  Allow port numbers be be specified with a colon: (XRPLF#4328)
  Use <=> operator for base_uint, Issue, and Book: (XRPLF#4411)
  Reporting Mode: Do not attempt to acquire missing data from peer network (XRPLF#4458)
ximinez added a commit to ximinez/rippled that referenced this pull request Mar 15, 2023
* upstream/develop:
  Rectify the import paths of boost/iterator: (XRPLF#4293)
  Allow port numbers be be specified with a colon: (XRPLF#4328)
  Use <=> operator for base_uint, Issue, and Book: (XRPLF#4411)
  Reporting Mode: Do not attempt to acquire missing data from peer network (XRPLF#4458)
ximinez added a commit to ximinez/rippled that referenced this pull request Mar 15, 2023
…ctionality

* upstream/develop:
  Rectify the import paths of boost/iterator: (XRPLF#4293)
  Allow port numbers be be specified with a colon: (XRPLF#4328)
  Use <=> operator for base_uint, Issue, and Book: (XRPLF#4411)
  Reporting Mode: Do not attempt to acquire missing data from peer network (XRPLF#4458)
ximinez added a commit to ximinez/rippled that referenced this pull request Mar 15, 2023
…tpage

* upstream/develop:
  Rectify the import paths of boost/iterator: (XRPLF#4293)
  Allow port numbers be be specified with a colon: (XRPLF#4328)
  Use <=> operator for base_uint, Issue, and Book: (XRPLF#4411)
  Reporting Mode: Do not attempt to acquire missing data from peer network (XRPLF#4458)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants