Agent Diagnostic
- Loaded
debug-openshell-cluster skill.
- Investigated the MCP spawn path.
@modelcontextprotocol/sdk's StdioClientTransport constructs the child process via child_process.spawn() and passes only the env block declared in mcp.servers.<name>.env — parent process env is not inherited by default.
- OpenShell sandboxes route outbound HTTPS through the proxy at
10.200.0.1:3128, which re-signs traffic with an internal CA mounted into pods at /etc/openshell-tls/openshell-ca.pem.
- Without
NODE_EXTRA_CA_CERTS=/etc/openshell-tls/openshell-ca.pem in the MCP child's env block, Node's https.request (used by most Node-based MCP servers and by any internal MSAL/OAuth refresh) silently fails the TLS handshake.
- Failure mode is asymmetric: initial child spawn appears to succeed; first tool call works if the server's cached access token is still valid; later token refresh hits
network_error and silently cascades to Not connected in the gateway log with no surfaced error explaining the actual cause.
Description
Any Node-based MCP stdio server that performs HTTPS calls or token refresh inside an OpenShell sandbox requires a hand-crafted ~10-line env block to bridge the sandbox's MITM TLS proxy CA model into the child process. The required env block is essentially the same across every server we've deployed:
NODE_EXTRA_CA_CERTS=/etc/openshell-tls/openshell-ca.pem
SSL_CERT_FILE=/etc/openshell-tls/ca-bundle.pem
REQUESTS_CA_BUNDLE=/etc/openshell-tls/ca-bundle.pem
CURL_CA_BUNDLE=/etc/openshell-tls/ca-bundle.pem
HTTPS_PROXY=http://10.200.0.1:3128
HTTP_PROXY=http://10.200.0.1:3128
NO_PROXY=localhost,127.0.0.1
Operators discover this only after a server fails silently in production: the first tool call succeeds with cached credentials, then any later token refresh hits the bare Not connected failure with no error message pointing at the missing CA. Diagnosis took us hours the first time.
Reproduction Steps
- In a sandbox configured with the standard CONNECT proxy + MITM TLS rewrite, install any Node-based MCP server that calls an HTTPS endpoint requiring periodic token refresh (OAuth / device code / MSAL).
- Add the server to
mcp.servers.<name> with only the server's documented env block — no OpenShell-specific additions.
- Trigger a tool call once with a freshly minted token. Observed: succeeds.
- Wait for the token to expire (or invalidate the cache), then trigger another tool call.
- Observed:
Not connected in the gateway log; no actionable error explaining why the refresh failed.
- Expected: Either a clear error message that points at the CA / proxy mismatch, or sandbox-side env propagation that makes this work transparently.
Environment
- OpenShell: v0.0.31 host CLI, v0.0.31 supervisor
- OS: Ubuntu 24.04
- Docker: 27.x
- Cluster image:
ghcr.io/nvidia/openshell/cluster:0.0.16
- MCP runtime: Node 22,
@modelcontextprotocol/sdk stdio transport
Proposed direction
Two complementary options. Either alone improves the situation; both together close the gap entirely.
-
Inject a baseline env block by default into every MCP stdio child spawned inside a sandbox: NODE_EXTRA_CA_CERTS, SSL_CERT_FILE, REQUESTS_CA_BUNDLE, CURL_CA_BUNDLE, HTTPS_PROXY, HTTP_PROXY, NO_PROXY should all be set to the sandbox-correct values automatically. The user's env block then overrides on top. Makes the common case work with zero operator knowledge.
-
Surface the failure clearly. Today the proxy-side handshake denial is invisible — even a one-line stderr in the supervisor log saying "TLS handshake from PID X failed; CA bundle mismatch suspected" would cut time-to-diagnose from hours to minutes.
We have a workaround in production today (the manual env block above), but every new MCP server deployed requires re-discovering this — and the failure isn't loud enough for the next operator to figure out without help. Happy to convert this to a PR if a maintainer signals direction.
Agent-First Checklist
Agent Diagnostic
debug-openshell-clusterskill.@modelcontextprotocol/sdk'sStdioClientTransportconstructs the child process viachild_process.spawn()and passes only the env block declared inmcp.servers.<name>.env— parent process env is not inherited by default.10.200.0.1:3128, which re-signs traffic with an internal CA mounted into pods at/etc/openshell-tls/openshell-ca.pem.NODE_EXTRA_CA_CERTS=/etc/openshell-tls/openshell-ca.pemin the MCP child's env block, Node'shttps.request(used by most Node-based MCP servers and by any internal MSAL/OAuth refresh) silently fails the TLS handshake.network_errorand silently cascades toNot connectedin the gateway log with no surfaced error explaining the actual cause.Description
Any Node-based MCP stdio server that performs HTTPS calls or token refresh inside an OpenShell sandbox requires a hand-crafted ~10-line env block to bridge the sandbox's MITM TLS proxy CA model into the child process. The required env block is essentially the same across every server we've deployed:
Operators discover this only after a server fails silently in production: the first tool call succeeds with cached credentials, then any later token refresh hits the bare
Not connectedfailure with no error message pointing at the missing CA. Diagnosis took us hours the first time.Reproduction Steps
mcp.servers.<name>with only the server's documented env block — no OpenShell-specific additions.Not connectedin the gateway log; no actionable error explaining why the refresh failed.Environment
ghcr.io/nvidia/openshell/cluster:0.0.16@modelcontextprotocol/sdkstdio transportProposed direction
Two complementary options. Either alone improves the situation; both together close the gap entirely.
Inject a baseline env block by default into every MCP stdio child spawned inside a sandbox:
NODE_EXTRA_CA_CERTS,SSL_CERT_FILE,REQUESTS_CA_BUNDLE,CURL_CA_BUNDLE,HTTPS_PROXY,HTTP_PROXY,NO_PROXYshould all be set to the sandbox-correct values automatically. The user'senvblock then overrides on top. Makes the common case work with zero operator knowledge.Surface the failure clearly. Today the proxy-side handshake denial is invisible — even a one-line stderr in the supervisor log saying "TLS handshake from PID X failed; CA bundle mismatch suspected" would cut time-to-diagnose from hours to minutes.
We have a workaround in production today (the manual env block above), but every new MCP server deployed requires re-discovering this — and the failure isn't loud enough for the next operator to figure out without help. Happy to convert this to a PR if a maintainer signals direction.
Agent-First Checklist
debug-openshell-cluster)