Skip to content

bug: MCP stdio child processes silently fail TLS handshake without explicit CA env propagation #886

@mjamiv

Description

@mjamiv

Agent Diagnostic

  • Loaded debug-openshell-cluster skill.
  • Investigated the MCP spawn path. @modelcontextprotocol/sdk's StdioClientTransport constructs the child process via child_process.spawn() and passes only the env block declared in mcp.servers.<name>.env — parent process env is not inherited by default.
  • OpenShell sandboxes route outbound HTTPS through the proxy at 10.200.0.1:3128, which re-signs traffic with an internal CA mounted into pods at /etc/openshell-tls/openshell-ca.pem.
  • Without NODE_EXTRA_CA_CERTS=/etc/openshell-tls/openshell-ca.pem in the MCP child's env block, Node's https.request (used by most Node-based MCP servers and by any internal MSAL/OAuth refresh) silently fails the TLS handshake.
  • Failure mode is asymmetric: initial child spawn appears to succeed; first tool call works if the server's cached access token is still valid; later token refresh hits network_error and silently cascades to Not connected in the gateway log with no surfaced error explaining the actual cause.

Description

Any Node-based MCP stdio server that performs HTTPS calls or token refresh inside an OpenShell sandbox requires a hand-crafted ~10-line env block to bridge the sandbox's MITM TLS proxy CA model into the child process. The required env block is essentially the same across every server we've deployed:

NODE_EXTRA_CA_CERTS=/etc/openshell-tls/openshell-ca.pem
SSL_CERT_FILE=/etc/openshell-tls/ca-bundle.pem
REQUESTS_CA_BUNDLE=/etc/openshell-tls/ca-bundle.pem
CURL_CA_BUNDLE=/etc/openshell-tls/ca-bundle.pem
HTTPS_PROXY=http://10.200.0.1:3128
HTTP_PROXY=http://10.200.0.1:3128
NO_PROXY=localhost,127.0.0.1

Operators discover this only after a server fails silently in production: the first tool call succeeds with cached credentials, then any later token refresh hits the bare Not connected failure with no error message pointing at the missing CA. Diagnosis took us hours the first time.

Reproduction Steps

  1. In a sandbox configured with the standard CONNECT proxy + MITM TLS rewrite, install any Node-based MCP server that calls an HTTPS endpoint requiring periodic token refresh (OAuth / device code / MSAL).
  2. Add the server to mcp.servers.<name> with only the server's documented env block — no OpenShell-specific additions.
  3. Trigger a tool call once with a freshly minted token. Observed: succeeds.
  4. Wait for the token to expire (or invalidate the cache), then trigger another tool call.
  5. Observed: Not connected in the gateway log; no actionable error explaining why the refresh failed.
  6. Expected: Either a clear error message that points at the CA / proxy mismatch, or sandbox-side env propagation that makes this work transparently.

Environment

  • OpenShell: v0.0.31 host CLI, v0.0.31 supervisor
  • OS: Ubuntu 24.04
  • Docker: 27.x
  • Cluster image: ghcr.io/nvidia/openshell/cluster:0.0.16
  • MCP runtime: Node 22, @modelcontextprotocol/sdk stdio transport

Proposed direction

Two complementary options. Either alone improves the situation; both together close the gap entirely.

  1. Inject a baseline env block by default into every MCP stdio child spawned inside a sandbox: NODE_EXTRA_CA_CERTS, SSL_CERT_FILE, REQUESTS_CA_BUNDLE, CURL_CA_BUNDLE, HTTPS_PROXY, HTTP_PROXY, NO_PROXY should all be set to the sandbox-correct values automatically. The user's env block then overrides on top. Makes the common case work with zero operator knowledge.

  2. Surface the failure clearly. Today the proxy-side handshake denial is invisible — even a one-line stderr in the supervisor log saying "TLS handshake from PID X failed; CA bundle mismatch suspected" would cut time-to-diagnose from hours to minutes.

We have a workaround in production today (the manual env block above), but every new MCP server deployed requires re-discovering this — and the failure isn't loud enough for the next operator to figure out without help. Happy to convert this to a PR if a maintainer signals direction.

Agent-First Checklist

  • I pointed my agent at the repo and had it investigate this issue
  • I loaded relevant skills (debug-openshell-cluster)
  • My agent could not resolve this — the diagnostic above explains the root cause; resolution needs an upstream code change.

Metadata

Metadata

Assignees

No one assigned

    Labels

    state:triage-neededOpened without agent diagnostics and needs triage

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions