L7 proxy placeholder-token rewriting doesn't cover WSS payloads — Discord gateway fails with 4004 #913

@asp12x

Agent Diagnostic

agent-diagnostic-output.txt

Description

Environment

  • OpenShell: 0.0.26
  • NemoClaw: 2026.4.2 (stack consumer, using blueprint policies/presets/discord.yaml)
  • Host: DGX Spark (ARM64), Kubernetes-deployed OpenShell gateway

Summary

OpenShell's L7 proxy rewrites placeholder tokens (openshell:resolve:env:*) at egress, but only for TLS-terminated REST traffic. For gateway.discord.gg, the NemoClaw blueprint policy sets tls: skip (per #544, pass-through is required to keep long-lived WSS sessions working), so the proxy never sees the decrypted WSS payload. As a result, the placeholder travels unchanged inside the WSS IDENTIFY payload, Discord closes the connection with close code 4004 (authentication failed), and the bot never connects.
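To make the failure concrete, here is a minimal sketch of what the client ends up sending. It assumes the gateway's JSON encoding and the documented IDENTIFY shape (op 2 with d.token, d.intents, d.properties). The helper names, the intents value, and the properties strings are illustrative only and are not taken from OpenClaw.

```python
import json

PLACEHOLDER_PREFIX = "openshell:resolve:env:"  # prefix quoted in this report


def build_identify(token: str) -> str:
    """Serialize a minimal Discord IDENTIFY frame (op 2) carrying `token`."""
    payload = {
        "op": 2,
        "d": {
            "token": token,
            "intents": 0,  # illustrative value only
            "properties": {"os": "linux", "browser": "sketch", "device": "sketch"},
        },
    }
    return json.dumps(payload)


def carries_placeholder(frame: str) -> bool:
    """Return True if the frame is an IDENTIFY whose d.token is still a placeholder."""
    msg = json.loads(frame)
    token = msg.get("d", {}).get("token", "")
    return (
        msg.get("op") == 2
        and isinstance(token, str)
        and token.startswith(PLACEHOLDER_PREFIX)
    )


# The sandbox env holds the placeholder, not the secret, so the client
# serializes the placeholder verbatim into the frame.
env_value = "openshell:resolve:env:DISCORD_BOT_TOKEN"
frame = build_identify(env_value)
print(carries_placeholder(frame))  # prints True
```

With tls: skip nothing between the client and Discord can inspect or change this frame, so Discord receives the placeholder string as the token.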

Reproduce

  1. nemoclaw onboard --non-interactive with a valid DISCORD_BOT_TOKEN
  2. Provider <sandbox>-discord-bridge is created and attached to the sandbox; sandbox env has DISCORD_BOT_TOKEN=openshell:resolve:env:DISCORD_BOT_TOKEN
  3. OpenClaw attempts to connect to wss://gateway.discord.gg
  4. Gateway closes immediately with close code 4004 (see attached gateway.log, search for 4004)
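For anyone triaging from the attached log, a small sketch of the "search for 4004" step. It makes no assumption about the log's line format and only looks for 4004 as a standalone number; the function name is hypothetical.

```python
import re
from typing import Iterable, List, Tuple


def find_close_code(lines: Iterable[str], code: int = 4004) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line mentioning `code` as a whole number."""
    pattern = re.compile(rf"\b{code}\b")
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]
```

Passing an open file handle for gateway.log as `lines` yields the matching line numbers, which makes it easy to compare the entries before and after the manual workaround.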

Confirmation it's a payload-rewrite gap, not a policy/network problem

Writing the real Discord bot token directly into /sandbox/.openclaw/openclaw.json (field channels.discord.accounts.default.token), bypassing the placeholder system for this field, produces a successful IDENTIFY and the bot connects. No policy changes required; the only variable is whether the literal placeholder string or the real token arrives in the WSS IDENTIFY payload.

Proposed directions

  1. Add WSS MITM + JSON-payload-aware rewriting for known channel protocols (Discord IDENTIFY op 2, d.token field), so tls: skip can be removed for gateway.discord.gg.
  2. OR: expose an in-sandbox secret-resolution gRPC endpoint (e.g. reachable via OPENSHELL_ENDPOINT) that clients can call to resolve openshell:resolve:env:* explicitly. OpenClaw (and other consumers) could then resolve at config-read time instead of relying on egress rewriting.
  3. OR: when a provider is attached to a sandbox and the target channel is known to use WSS, let OpenShell inject the real credential value into the child env var directly at sandbox start (documenting the security trade-off that the credential is then at-rest in the sandbox env rather than only in the provider store).
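A rough sketch of the payload-aware rewrite in direction 1, assuming the proxy already has the decrypted WebSocket text frame in hand and that the client uses the gateway's JSON encoding (not ETF). The `resolve` callback stands in for whatever lookup OpenShell uses internally. It and the function name are hypothetical, not existing OpenShell APIs.

```python
import json
from typing import Callable, Optional

PLACEHOLDER_PREFIX = "openshell:resolve:env:"  # prefix quoted in this report


def rewrite_identify(frame: str, resolve: Callable[[str], Optional[str]]) -> str:
    """Rewrite d.token in a Discord IDENTIFY (op 2) text frame.

    Any frame that is not a JSON IDENTIFY with a placeholder token is
    returned unchanged.
    """
    try:
        msg = json.loads(frame)
    except ValueError:
        return frame  # not JSON: leave untouched
    if not isinstance(msg, dict) or msg.get("op") != 2:
        return frame  # only IDENTIFY is rewritten
    data = msg.get("d")
    if not isinstance(data, dict):
        return frame
    token = data.get("token")
    if not isinstance(token, str) or not token.startswith(PLACEHOLDER_PREFIX):
        return frame  # already a literal value, nothing to resolve
    secret = resolve(token[len(PLACEHOLDER_PREFIX):])
    if secret is None:
        return frame  # unknown name: forward unchanged rather than guess
    data["token"] = secret
    return json.dumps(msg)
```

For example, `rewrite_identify(frame, {"DISCORD_BOT_TOKEN": "example"}.get)` swaps the placeholder for the mapped value and passes heartbeats and other opcodes through untouched. Restricting the rewrite to op 2 and the d.token field keeps the proxy from touching message content that merely happens to contain the placeholder string.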

Related

Attachments

  • gateway.log — 344-line log from inside the sandbox (micky pod, /tmp/gateway.log). Shows the 4004 pattern before the manual workaround and the quiet "awaiting gateway readiness" (implicit READY) after.
  • openshell-status.txt: output of openshell status
  • openshell-doctor-check.txt: output of openshell doctor check
  • openshell-doctor-logs.txt: output of openshell doctor logs --lines 200

openshell-issue-bundle.zip

Reproduction Steps

  1. Deploy a NemoClaw stack (v2026.4.2) with OpenShell 0.0.26 on ARM64 (DGX Spark, k3s).
    Blueprint uses policies/presets/discord.yaml which pins gateway.discord.gg to tls: skip
    (per feat(sandbox): auto-detect TLS and terminate unconditionally for credential injection #544 — required to keep long-lived WSS sessions alive).

  2. Create a Discord bot credential:
    nemoclaw credentials set DISCORD_BOT_TOKEN

  3. Onboard the sandbox (this creates provider <sandbox>-discord-bridge and attaches it):
    nemoclaw onboard --non-interactive

  4. Confirm the sandbox env contains the placeholder, not the real token:
    kubectl exec -n nemoclaw deploy/micky -- printenv DISCORD_BOT_TOKEN

    => openshell:resolve:env:DISCORD_BOT_TOKEN

  5. Start OpenClaw inside the sandbox so it connects to Discord:
    kubectl exec -n nemoclaw deploy/micky -- openclaw start

  6. Observe gateway.log: OpenClaw opens wss://gateway.discord.gg and sends IDENTIFY (op 2)
    with d.token set to the literal string "openshell:resolve:env:DISCORD_BOT_TOKEN";
    Discord then closes the socket with close code 4004 (Authentication failed).

Expected: IDENTIFY carries the resolved bot token; gateway sends READY; bot comes online.
Actual: IDENTIFY carries the literal placeholder string; gateway closes with 4004.

Workaround (confirms payload-rewrite gap, not a policy/network problem):
Edit /sandbox/.openclaw/openclaw.json inside the pod, set
channels.discord.accounts.default.token to the real token value from
~/.nemoclaw/credentials.json. OpenClaw hot-reloads, IDENTIFY now carries the real
token, Discord sends READY, bot connects. No policy changes required.
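The manual edit can be scripted. The sketch below assumes openclaw.json is strict JSON (no comments or trailing commas) and uses the field path given above. The function names are hypothetical. Like the manual workaround, it leaves the real credential at rest inside the sandbox.

```python
import copy
import json
from pathlib import Path

# Field path from this report: channels.discord.accounts.default.token
TOKEN_PATH = ("channels", "discord", "accounts", "default")


def with_discord_token(config: dict, token: str) -> dict:
    """Return a copy of `config` with the Discord account token set to `token`."""
    patched = copy.deepcopy(config)
    node = patched
    for key in TOKEN_PATH:
        node = node.setdefault(key, {})  # create missing levels on the way down
    node["token"] = token
    return patched


def patch_config_file(path: Path, token: str) -> None:
    """Rewrite the config file in place; assumes the file is strict JSON."""
    config = json.loads(path.read_text(encoding="utf-8"))
    patched = with_discord_token(config, token)
    path.write_text(json.dumps(patched, indent=2) + "\n", encoding="utf-8")
```

Calling `patch_config_file(Path("/sandbox/.openclaw/openclaw.json"), real_token)` inside the pod mirrors the manual steps. Other keys in the file are preserved, though key formatting is normalized by the re-serialization.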

Environment

OpenShell: 0.0.26
NemoClaw: 2026.4.2 (blueprint: policies/presets/discord.yaml)
OpenClaw: 2026.4.9
Host: NVIDIA DGX Spark (GB10 Grace Blackwell, ARM64 / aarch64)
OS: Ubuntu 24.04 LTS (kernel 6.11, CUDA 13.0)
Runtime: k3s (single-node), containerd
Sandbox pod: micky (namespace: nemoclaw)
Storage: local-path PVC workspace-micky (2Gi) mounted at /sandbox
Client: Discord gateway via ws npm library (raw tls.connect, ignores HTTPS_PROXY)
Policy: gateway.discord.gg → tls: skip (L4 CONNECT pass-through)
Network: OpenShell L7 proxy at 10.200.0.1:3128 (CONNECT + TLS-MITM for REST egress)

Logs

Agent-First Checklist

  • I pointed my agent at the repo and had it investigate this issue
  • I loaded relevant skills (e.g., debug-openshell-cluster, debug-inference, openshell-cli)
  • My agent could not resolve this — the diagnostic above explains why
