Specification

The nodes domain is an integration of all the other domains; it is what enables Polykey to be a distributed and decentralised application. Right now its design is causing its tests and its code to be quite brittle as other domains are iterated on, and this is slowing down our development process. At the same time, it has a number of functionalities that take time, such as pinging; these are now causing very slow tests, up to 60 or 80 seconds for a single test to just do pinging.

The nodes domain requires significant refactoring to ensure stability, robustness and performance going into the future.

After a small review of the nodes codebase, here is the newly proposed structure.
NodeManager
- requires NodeConnectionManager
- encapsulates NodeGraph (optional dependency)
- provides high-level node management such as claimNode

NodeConnectionManager
- manages the lifecycle of each NodeConnection in a pool of node connections
- provides a higher-order function withConnection to run an asynchronous callback that uses a NodeConnection
  - this means that all domains that currently rely on NodeManager's "wrapper" functions that perform NodeConnection operations would instead create their own internal functions (within their own domains) that use this higher-order function withConnection to perform their connection functionality; NodeManager should not have any of this anymore
- applies a TTL to each NodeConnection and closes the connection after timeout
- resets the TTL on every connection use
- uses a lock to ensure connections cannot be timed out while they are in use

NodeConnection
- represents a connection to the agent service of another keynode
- a wrapper around GRPCClientAgent that provides some lifecycle features specific to the node connection
- implemented as a class extension of GRPCClientAgent, an ES6 proxy, or a wrapper class
  - if a wrapper class, then it must provide access to the client property
- used by other domains to call the agent service

NodeGraph
- a data structure implementation of Kademlia
- stores the NodeId -> NodeAddress mappings in a keynode
- persistent: uses the DB and may require transactions
Both NodeManager and NodeConnectionManager should be exposed at the high level in PolykeyAgent and provided to createClientService.
Furthermore, the following lists the current usage of NodeManager for NodeConnection operations (i.e. the places where NodeConnectionManager would instead need to be injected):
- Discovery
- NodeManager and NodeGraph - this is a special case that will need some potential refactoring, see #225
- NotificationsManager
- VaultManager
NodeManager

The NodeManager manages nodes, but does not manage the node connections.
It is expected that NodeManager requires NodeConnectionManager, which is shared as a singleton with other domains.
Operations relating to manipulating nodes, such as claimNode and related operations, should be on NodeManager.
Most of the time these will be called by GRPC handlers, but some domains like discovery might also use the nodes domain. NodeManager is also what encapsulates access and mutation of the NodeGraph.
NodeConnectionManager
The NodeConnectionManager manages a pool of node connections. This is what is being shared as a singleton to all the domains that need to contact a node and call their handlers. For example VaultManager would not use NodeManager but only NodeConnectionManager. This also includes the notifications domain.
The connections are shared between all users of the node connections. Notice the lock being used to ensure that connection establishment and connection closure are serialised. Refer to https://gist.github.com/CMCDragonkai/f58f08e7eaab0430ed4467ca35527a42 for an example; this pattern is also used in the vault manager.

A draft mock of the class:

```ts
class NodeConnectionManager {
  connections: Map<NodeId, {
    conn?: NodeConnection;
    lock: MutexInterface;
  }>;

  public constructor(nodeGraph) {}

  // lots of connection-related requests need this (get from KeyManager)
  public getNodeId();

  // refreshes the timeout at the end
  public withConnection(targetNodeId, async function());

  // these are used by `withConnection`
  // reusing terms from network domain:
  // connConnectTime - timeout for opening the connection
  // connTimeoutTime - TTL for the connection itself
  protected createConnection();
  protected destroyConnection();
}
```
The withConnection call is the only way to acquire a connection to be used by other domains. If this is to be similar to the VaultManager, that means that VaultManager.withVault would need to exist.
The locking side effects could be done in NodeConnectionManager or inside the NodeConnection. However, since this problem also exists in the VaultManager, I believe we want to do the side effects as much as possible in the NodeConnectionManager. If locking during creation and destruction needs to occur, then it has to occur via callbacks that are passed into the NodeConnection.
NodeConnection
```ts
class NodeConnection extends GRPCClientAgent {
  // timeout acts like a TTL
  // the timeout must be refreshed every time it is used
  // and while `withConnection` is in use, we can use a lazy lock for this
  public constructor(timeout) {}

  // every call is available here
  // but this is a public API
}
```
There may be a NodeConnectionInternal to have the actual methods, while NodeConnection may be a public interface, similar to Vault and VaultInternal.
Usage
Any call to another keynode has to go through the NodeConnection. It is expected that callers will perform something like this:
Such usage would occur in both NodeManager and other domains such as VaultManager.
Note that the conn.client.doSomething() may be conn.doSomething() instead. This depends on how NodeConnection is written.
Either way it is expected that doSomething is the same gRPC call available in GRPCClientAgent.
This means it is up to the domains to manage and handle gRPC call behaviour which includes the usage of async generators, metadata and exceptions. Refer to #249 for details on this.
The reason we are doing this is that abstractions on the gRPC call side are domain-specific behaviour. We wouldn't want to centralise all of the domains' gRPC call abstractions into the nodes domain; that would be quite brittle.
This means the vaults domain could, if it wanted, create its own wrapper over the connection object, such as:
```ts
this.nodeConnManager.withConn(nodeId, async (conn: NodeConnection) => {
  // vaultClient adds vault-specific abstractions on the gRPC calls
  const client = vaultClient(conn.client);
  await client.doSomethingWithVaults();
});
```
Right now we shouldn't do any gRPC abstractions until #249 is addressed.
Lifecycle
The withConn method provides a "resource context". It's a higher order function that takes the asynchronous action that is to be performed with the connection.
It has a similar pattern to our poll, retryAuth, transact, transaction, _transaction wrapper functions.
The advantage of this pattern is that the node connection will not be closed (by the TTL) while the action is being executed.
The withConn will need to use the RWLock pattern as provided in src/utils.ts. It has to acquire a read lock when executing the action.
This enables multiple withConn to be executing concurrently without blocking each other.
At the end of each withConn, the TTL should be reset.
When the TTL has expired, it may attempt to close the connection. To do this, a write lock is attempted. If the lock is already locked, then we skip. If we acquire the write lock, we can close the connection.
If the client were to be closed or terminated due to other reasons, then an exception should be thrown when accessing the client or its methods.
This would be ErrorNodeConnectionNotRunning or ErrorGRPCClientAgentNotRunning or ErrorGRPCClientNotRunning.
Other domains should not keep around references to NodeConnection; it should only be used inside the withConn context.
Hole Punching Relay
Hole punching is done between the ForwardProxy and ReverseProxy. Hole punching is not done explicitly in the NodeConnectionManager or NodeConnection.
Instead, the hole punching relay is what is done by NodeConnectionManager. This means relaying a hole punching message across the network so that the target node will perform reverse hole punching in case we need it. This process should be done at the same time as a connection is established.
This means NodeConnectionManager requires the NodeGraph, the same NodeGraph that is used by NodeManager.
The graph is needed to do optimal routing of the hole punching relay message.
This means NodeConnectionManager would need to be used by createAgentService as the current keynode may be in the middle of a chain of relaying operations.
Public Interface
The public interface of NodeConnection can be a limited version of the internal class. This can hide methods like destroy from the users of NodeConnection. This ensures that only NodeConnectionManager can destroy the connection.
Dealing with Termination/Errors
The underlying GRPCClientAgent may close/terminate for various reasons. In order to capture this behaviour, you must hook into the state changes in the Client object. This will require changes to src/grpc/GRPCClient.ts, as it currently does not expose the Client object. Discussions on how to do this are in: #224 (comment)
NodeGraph
NodeGraph is a complex data structure. It appears to be needed by both NodeManager and NodeConnectionManager, but it also needs to perform operations on NodeManager. Whenever there is mutual recursion, it can be factored out as another common dependency: "abstract the commonality". However, if this is not suitable, then it can be done by passing NodeManager as this into NodeGraph.
However if NodeConnectionManager needs NodeGraph as well, then this is not possible unless NodeConnectionManager was also produced by NodeManager.
Further investigation is needed into the NodeGraph to understand how we can turn it into a data structure. Perhaps we can use a pattern of hooks where NodeGraph exposes a number of hook points which can be configured by NodeManager and thus allow NodeGraph to "callback" NodeManager and NodeConnectionManager.
If NodeGraph can be made purely a data structure/database, then:
The getClosestLocalNodes and getClosestGlobalNodes could be part of the NodeManager, and not the NodeGraph.
The seed nodes marshalling could occur in the NodeManager, and instead you just plug it with actual values.
There are only 3 places where NodeGraph is calling out to the world. All 3 can be done in NodeManager instead. That means NodeGraph no longer needs NodeConnectionManager.
Testing
For NodeConnectionManager and NodeConnection, it's essential that we don't duplicate tests for GRPCClientAgent, as that is handled by tests/agent. The tests must focus on just the connection-related functionality. This is where mocks should be useful, since we want to delay integration testing until the very end.
Function mocks - as used in tests/bin/utils.retryAuth.test.ts, function mocks can be a very useful way of providing a dummy function that we can later instrument. This can be useful if we pursue a hook pattern with NodeGraph.
Module/class mocks - as used in tests/bin/utils.retryAuth.test.ts (jest.mock('prompts')), this is high-level mocking that should only be used when no other method works. This is more complicated, but you can mock anything this way.
DI mocking - all of the dependencies discussed here should be DI-able; this should be used first before making use of module/class mocks.
The tests for the nodes domain should be robust, reproducible, not flaky, and complete within 10-30 seconds.
Larger integration tests should be done at a higher level, which will help us achieve #264.
This should ensure that the nodes domain is more stable and does not break every time we make changes in other domains.
Tasks
- Factor out the commonality to prevent mutual recursion between NodeGraph, NodeManager and NodeConnectionManager.
- Introduce timer mocking to nodes testing.
- Implement mocking for NodeConnection and NodeConnectionManager testing. Remove any tests that test agent service handling or forward proxy and reverse proxy usage; rely on the other domain tests to handle this.
- Fix up all domains that use NodeConnection to acquire their connections from NodeConnectionManager.