bug: MCP stdio child processes silently fail TLS handshake without explicit CA env propagation

### Agent Diagnostic

- Loaded `debug-openshell-cluster` skill.
- Investigated the MCP spawn path. `@modelcontextprotocol/sdk`'s `StdioClientTransport` constructs the child process via `child_process.spawn()` and passes only the env block declared in `mcp.servers.<name>.env` — parent process env is not inherited by default.
- OpenShell sandboxes route outbound HTTPS through the proxy at `10.200.0.1:3128`, which re-signs traffic with an internal CA mounted into pods at `/etc/openshell-tls/openshell-ca.pem`.
- Without `NODE_EXTRA_CA_CERTS=/etc/openshell-tls/openshell-ca.pem` in the MCP child's env block, Node's `https.request` (used by most Node-based MCP servers and by any internal MSAL/OAuth refresh) silently fails the TLS handshake.
- Failure mode is asymmetric: initial child spawn appears to succeed; first tool call works if the server's cached access token is still valid; later token refresh hits `network_error` and silently cascades to `Not connected` in the gateway log with no surfaced error explaining the actual cause.

### Description

Any Node-based MCP stdio server that performs HTTPS calls or token refresh inside an OpenShell sandbox requires a hand-crafted ~10-line env block to bridge the sandbox's MITM TLS proxy CA model into the child process. The required env block is essentially the same across every server we've deployed:

```
NODE_EXTRA_CA_CERTS=/etc/openshell-tls/openshell-ca.pem
SSL_CERT_FILE=/etc/openshell-tls/ca-bundle.pem
REQUESTS_CA_BUNDLE=/etc/openshell-tls/ca-bundle.pem
CURL_CA_BUNDLE=/etc/openshell-tls/ca-bundle.pem
HTTPS_PROXY=http://10.200.0.1:3128
HTTP_PROXY=http://10.200.0.1:3128
NO_PROXY=localhost,127.0.0.1
```

Operators discover this only after a server fails silently in production: the first tool call succeeds with cached credentials, then any later token refresh hits the bare `Not connected` failure with no error message pointing at the missing CA. Diagnosis took us hours the first time.

### Reproduction Steps

1. In a sandbox configured with the standard CONNECT proxy + MITM TLS rewrite, install any Node-based MCP server that calls an HTTPS endpoint requiring periodic token refresh (OAuth / device code / MSAL).
2. Add the server to `mcp.servers.<name>` with only the server's documented env block — no OpenShell-specific additions.
3. Trigger a tool call once with a freshly minted token. **Observed:** succeeds.
4. Wait for the token to expire (or invalidate the cache), then trigger another tool call.
5. **Observed:** `Not connected` in the gateway log; no actionable error explaining why the refresh failed.
6. **Expected:** Either a clear error message that points at the CA / proxy mismatch, or sandbox-side env propagation that makes this work transparently.

### Environment

- OpenShell: v0.0.31 host CLI, v0.0.31 supervisor
- OS: Ubuntu 24.04
- Docker: 27.x
- Cluster image: `ghcr.io/nvidia/openshell/cluster:0.0.16`
- MCP runtime: Node 22, `@modelcontextprotocol/sdk` stdio transport

### Proposed direction

Two complementary options. Either alone improves the situation; both together close the gap entirely.

1. **Inject a baseline env block by default** into every MCP stdio child spawned inside a sandbox: `NODE_EXTRA_CA_CERTS`, `SSL_CERT_FILE`, `REQUESTS_CA_BUNDLE`, `CURL_CA_BUNDLE`, `HTTPS_PROXY`, `HTTP_PROXY`, `NO_PROXY` should all be set to the sandbox-correct values automatically. The user's `env` block then *overrides* on top. Makes the common case work with zero operator knowledge.

2. **Surface the failure clearly.** Today the proxy-side handshake denial is invisible — even a one-line stderr in the supervisor log saying "TLS handshake from PID X failed; CA bundle mismatch suspected" would cut time-to-diagnose from hours to minutes.

We have a workaround in production today (the manual env block above), but every new MCP server deployed requires re-discovering this — and the failure isn't loud enough for the next operator to figure out without help. Happy to convert this to a PR if a maintainer signals direction.

### Agent-First Checklist

- [x] I pointed my agent at the repo and had it investigate this issue
- [x] I loaded relevant skills (`debug-openshell-cluster`)
- [x] My agent could not resolve this — the diagnostic above explains the root cause; resolution needs an upstream code change.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: MCP stdio child processes silently fail TLS handshake without explicit CA env propagation #886

Agent Diagnostic

Description

Reproduction Steps

Environment

Proposed direction

Agent-First Checklist

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

bug: MCP stdio child processes silently fail TLS handshake without explicit CA env propagation #886

Description

Agent Diagnostic

Description

Reproduction Steps

Environment

Proposed direction

Agent-First Checklist

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions