Specification

The nodes domain is an integration of all the other domains; it is what enables Polykey to be a distributed and decentralised application. Right now its design is causing its tests and its code to be quite brittle as other domains are iterated on, and this is slowing down our development process. At the same time, it has a number of functionalities that take time, such as pinging; these are now causing very slow tests, up to 60 or 80 seconds for a single test to just do pinging.

The nodes domain requires significant refactoring to ensure stability, robustness and performance going into the future.

After a small review of the nodes codebase, here is the newly proposed structure.
NodeManager
- requires NodeConnectionManager
- encapsulates NodeGraph (optional dependency)
- provides high-level node management such as claimNode

NodeConnectionManager
- manages the lifecycle of each NodeConnection in a pool of node connections
- provides a higher-order function withConnection to run an asynchronous callback that uses a NodeConnection
  - this means that all domains that currently rely on NodeManager's "wrapper" functions that perform NodeConnection operations would instead create their own internal functions (within their own domains) that use this higher-order function withConnection to perform their connection functionality; NodeManager should not have any of this anymore
- applies a TTL to each NodeConnection and closes the connection after timeout
- resets the TTL on every connection use
- uses a lock to ensure connections cannot be timed out while they are in use

NodeConnection
- represents a connection to the agent service of another keynode
- a wrapper around GRPCClientAgent that provides some lifecycle features specific to the node connection
- implemented as a class extension of GRPCClientAgent, an ES6 proxy, or a wrapper class
  - if a wrapper class, then it must provide access to the client property
- used by other domains to call the agent service

NodeGraph
- a data structure implementation of Kademlia
- stores the NodeId -> NodeAddress mappings in a keynode
- persistent: uses the DB and may require transactions
Both NodeManager and NodeConnectionManager should be exposed at the high level in PolykeyAgent and provided to createClientService.
Furthermore, the following lists the current usage of NodeManager for NodeConnection operations (i.e. the places where NodeConnectionManager would instead need to be injected):
- Discovery
- NodeManager and NodeGraph - this is a special case that will need some potential refactoring, see #225
- NotificationsManager
- VaultManager
NodeManager

The NodeManager manages nodes, but does not manage the node connections.
It is expected that NodeManager requires NodeConnectionManager, which is shared as a singleton with other domains.
Operations relating to manipulating nodes, such as claimNode and related operations, should be on NodeManager.
Most of the time these will be called by GRPC handlers, but some domains like discovery might also use the nodes domain. NodeManager is also what encapsulates access and mutation of the NodeGraph.
NodeConnectionManager
The NodeConnectionManager manages a pool of node connections. This is what is being shared as a singleton to all the domains that need to contact a node and call their handlers. For example VaultManager would not use NodeManager but only NodeConnectionManager. This also includes the notifications domain.
The connections are shared between all users of the node connections. Notice the lock being used to ensure that connection establishment and connection closure are serialised. Refer to https://gist.github.com/CMCDragonkai/f58f08e7eaab0430ed4467ca35527a42 for an example; this pattern is also used in the vault manager.

A draft mock of the class:

```ts
class NodeConnectionManager {
  connections: Map<NodeId, {
    conn?: NodeConnection;
    lock: MutexInterface;
  }>;

  public constructor(nodeGraph) {}

  // lots of connection-related requests need this (get from KeyManager)
  public getNodeId();

  // refreshes the timeout at the end
  public withConnection(targetNodeId, async function());

  // these are used by `withConnection`
  // reusing terms from network domain:
  // connConnectTime - timeout for opening the connection
  // connTimeoutTime - TTL for the connection itself
  protected createConnection();
  protected destroyConnection();
}
```
The withConnection call is the only way to acquire a connection to be used by other domains. If this is to be similar to the VaultManager, that means that VaultManager.withVault would need to exist.
The locking side effects could be done in NodeConnectionManager or inside the NodeConnection. However, since this problem also exists in the VaultManager, I believe we want to do the side effects as much as possible in the NodeConnectionManager. If locking during creation and destruction needs to occur, then it has to occur via callbacks that are passed into the NodeConnection.
NodeConnection
```ts
class NodeConnection extends GRPCClientAgent {
  // timeout acts like a TTL
  // the timeout must be refreshed every time it is used
  // and while `withConnection` is in use, we can use a lazy lock for this
  public constructor(timeout) {}

  // every call is available here
  // but this is a public API
}
```
There may be a NodeConnectionInternal to have the actual methods, while NodeConnection may be a public interface, similar to Vault and VaultInternal.
Usage
Any call to another keynode has to go through the NodeConnection. It is expected that callers will perform something like this:
Such usage would occur in both NodeManager and other domains such as VaultManager.
Note that the conn.client.doSomething() may be conn.doSomething() instead. This depends on how NodeConnection is written.
Either way it is expected that doSomething is the same gRPC call available in GRPCClientAgent.
This means it is up to the domains to manage and handle gRPC call behaviour which includes the usage of async generators, metadata and exceptions. Refer to #249 for details on this.
The reason we are doing this is that abstractions on the gRPC call side are domain-specific behaviour. We wouldn't want to centralise all of the domains' gRPC call abstractions into the nodes domain; that would be quite brittle.
This means the vaults domain could, if it wanted, create its own wrapper over the connection object, such as:
```ts
this.nodeConnManager.withConn(nodeId, async (conn: NodeConnection) => {
  // vaultClient adds vault-specific abstractions on the gRPC calls
  const client = vaultClient(conn.client);
  await client.doSomethingWithVaults();
});
```
Right now we shouldn't do any gRPC abstractions until #249 is addressed.
Lifecycle
The withConn method provides a "resource context". It's a higher order function that takes the asynchronous action that is to be performed with the connection.
It has a similar pattern to our poll, retryAuth, transact, transaction, _transaction wrapper functions.
The advantage of this pattern is that the node connection will not be closed (by the TTL) while the action is being executed.
The withConn will need to use the RWLock pattern as provided in src/utils.ts. It has to acquire a read lock when executing the action.
This enables multiple withConn to be executing concurrently without blocking each other.
At the end of each withConn, the TTL should be reset.
When the TTL has expired, it may attempt to close the connection. To do this, a write lock is attempted. If the lock is already locked, then we skip. If we acquire the write lock, we can close the connection.
If the client were to be closed or terminated due to other reasons, then an exception should be thrown when accessing the client or its methods.
This would be ErrorNodeConnectionNotRunning or ErrorGRPCClientAgentNotRunning or ErrorGRPCClientNotRunning.
Other domains should not keep around references to NodeConnection; it should only be used inside the withConn context.
Hole Punching Relay
Hole punching is done between the ForwardProxy and ReverseProxy. Hole punching is not done explicitly in the NodeConnectionManager or NodeConnection.
Instead, the hole punching relay is what is done by NodeConnectionManager. This means relaying a hole punching message across the network so that the target node will perform reverse hole punching in case we need it. This process should be done at the same time as a connection is established.
This means NodeConnectionManager requires the NodeGraph, the same NodeGraph that is used by NodeManager.
The graph is needed to do optimal routing of the hole punching relay message.
This means NodeConnectionManager would need to be used by createAgentService as the current keynode may be in the middle of a chain of relaying operations.
Public Interface
The public interface of NodeConnection can be a limited version of the internal class. This can hide methods like destroy from the users of NodeConnection. This ensures that only NodeConnectionManager can destroy the connection.
Dealing with Termination/Errors
The underlying GRPCClientAgent may close/terminate for various reasons. In order to capture this behaviour, you must hook into the state changes in the Client object. This will require changes to src/grpc/GRPCClient.ts, as it currently does not expose the Client object. Discussions on how to do this are in: #224 (comment)
NodeGraph
NodeGraph is a complex data structure. It appears to be needed by both NodeManager and NodeConnectionManager, but it also needs to perform operations on NodeManager. Whenever there is mutual recursion, it can be factored out as another common dependency: "abstract the commonality". However, if this is not suitable, then it can be done by passing NodeManager as this into NodeGraph.
However if NodeConnectionManager needs NodeGraph as well, then this is not possible unless NodeConnectionManager was also produced by NodeManager.
Further investigation is needed into the NodeGraph to understand how we can turn it into a data structure. Perhaps we can use a pattern of hooks where NodeGraph exposes a number of hook points which can be configured by NodeManager and thus allow NodeGraph to "callback" NodeManager and NodeConnectionManager.
If NodeGraph can be made purely a data structure/database, then:
The getClosestLocalNodes and getClosestGlobalNodes could be part of the NodeManager, and not the NodeGraph.
The seed nodes marshalling could occur in the NodeManager, and instead you just plug it with actual values.
There are only 3 places where NodeGraph is calling out to the world. All 3 can be done in NodeManager instead. That means NodeGraph no longer needs NodeConnectionManager.
Testing
For NodeConnectionManager and NodeConnection, it's essential that we don't duplicate tests for GRPCClientAgent, as that is handled by tests/agent. The tests must focus on just the connection-related functionality. This is where mocks should be useful, since we want to delay integration testing until the very end.
Function mocks - as used in tests/bin/utils.retryAuth.test.ts, function mocks can be a very useful way of providing a dummy function that we can later instrument. This can be useful if we pursue a hook pattern with NodeGraph.
Module/class mocks - as used in tests/bin/utils.retryAuth.test.ts (jest.mock('prompts')), this is high-level mocking that should only be used when no other method works. This is more complicated, but you can mock anything this way.
DI mocking - all of the dependencies discussed here should be DI-able; this should be used first before making use of module/class mocks.
The tests for the nodes domain should be robust, reproducible, not flaky, and complete within 10-30 seconds.
Larger integration tests should be done at a higher level, which will help us achieve #264.
This should ensure that the nodes domain is more stable and does not break every time we make changes in other domains.
Tasks
- Factor out the commonality to prevent mutual recursion between NodeGraph, NodeManager and NodeConnectionManager.
- Introduce timer mocking to nodes testing.
- Implement mocking for NodeConnection and NodeConnectionManager testing. Remove any tests that test agent service handling or forward proxy and reverse proxy usage; rely on the other domain tests to handle this.
- Fix up all domains that use NodeConnection to acquire their connections from NodeConnectionManager.