SiluPanda/llm-dedup
llm-dedup

Coalesce identical in-flight LLM requests into a single upstream call.

Description

llm-dedup is an in-flight request coalescing layer for LLM APIs. When multiple callers fire the same request concurrently, only one upstream call is made -- all callers receive an independent deep copy of the same result. This eliminates the thundering herd problem (also known as cache stampede or dogpile effect) that occurs when many identical requests arrive before any response exists in a cache.

The pattern is analogous to Go's singleflight.Do: the first caller for a given key becomes the owner and executes the actual LLM call; subsequent callers with the same key become subscribers and wait for the owner's result. When the owner's Promise settles, all subscribers are resolved (or rejected) with a structuredClone of the result, and the registry entry is removed immediately. The registry is entirely ephemeral -- entries exist only for the duration of an in-flight request (typically 1--30 seconds) and are never persisted.
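The owner/subscriber mechanism can be pictured in a few lines. This is an illustrative sketch of the pattern only, not the library's actual source -- llm-dedup layers timeouts, subscriber limits, cancellation, and statistics on top of this core idea:

```typescript
// Minimal singleflight sketch: one Map of in-flight Promises keyed by request.
// Illustrative only -- not the llm-dedup implementation.
const inflight = new Map<string, Promise<unknown>>();

async function execute<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const existing = inflight.get(key);
  if (existing) {
    // Subscriber: wait for the owner's result, then return a deep copy.
    return structuredClone(await existing) as T;
  }
  // Owner: run the upstream call; remove the entry as soon as it settles,
  // so the registry stays ephemeral. A rejection propagates to everyone.
  const p = fn().finally(() => inflight.delete(key));
  inflight.set(key, p);
  return structuredClone(await p) as T;
}
```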

Key characteristics:

  • Zero runtime dependencies.
  • Sub-millisecond overhead per dedup check (SHA-256 hash + Map lookup).
  • Deep cloning of responses prevents mutation across subscribers.
  • Configurable timeout, cancellation, and overflow behaviors.
  • Background sweep automatically cancels abandoned entries.
  • Built-in statistics for monitoring coalescing effectiveness.

Installation

npm install llm-dedup

Requires Node.js 18 or later. No external runtime dependencies.


Quick Start

import { createDedup, canonicalizeKey } from 'llm-dedup';

const dedup = createDedup();

async function chat(params: Record<string, unknown>) {
  const key = canonicalizeKey(params);
  return dedup.execute(key, () => openai.chat.completions.create(params));
}

// Two concurrent calls with the same parameters produce one upstream call.
const [r1, r2] = await Promise.all([
  chat({ model: 'gpt-4o', messages: [{ role: 'user', content: 'Hello' }] }),
  chat({ model: 'gpt-4o', messages: [{ role: 'user', content: 'Hello' }] }),
]);
// r1 and r2 are deep-equal but reference-distinct copies of the same result.

// Always close when shutting down to clear background timers.
await dedup.close();

Features

  • Request coalescing -- Concurrent identical requests share a single upstream call. The first caller executes; all others subscribe.
  • Canonical key hashing -- canonicalizeKey extracts only semantically relevant LLM parameters (messages, model, temperature, top_p, max_tokens, frequency_penalty, presence_penalty, seed, tools, tool_choice, response_format, stop, system), sorts object keys recursively, and returns a deterministic SHA-256 hex digest. Fields like stream, user, and api_key are excluded.
  • Deep clone isolation -- Each subscriber receives a structuredClone of the owner's result, preventing cross-caller mutation.
  • Subscriber timeout -- Subscribers that wait longer than maxWaitMs are detached. In reject mode, the subscriber's Promise rejects with DedupTimeoutError. In fallthrough mode, the subscriber re-executes fn() independently.
  • Per-call timeout override -- Each execute call can override the instance-level maxWaitMs via ExecuteOptions.
  • Subscriber overflow -- When an in-flight entry reaches maxSubscribers, additional callers bypass coalescing and execute their own upstream call.
  • Cancellation -- Cancel a specific in-flight entry or all entries. Subscribers receive DedupCancelError.
  • Abandon sweep -- A background interval automatically cancels entries that exceed abandonTimeoutMs, preventing leaked entries from hung upstream calls.
  • Statistics -- Track total requests, unique executions, coalesced count, coalesced rate, timeouts, errors, current in-flight count, and peak in-flight count.
  • Custom normalizer -- Apply a text transformation to the canonical JSON string before hashing, enabling custom matching logic (case folding, whitespace collapsing, etc.).
  • TypeScript-first -- Full type declarations shipped in the package.

API Reference

createDedup(options?): LLMDedup

Factory function that creates and returns an LLMDedup instance.

import { createDedup } from 'llm-dedup';

const dedup = createDedup({
  maxWaitMs: 30000,
  timeoutBehavior: 'reject',
  maxSubscribers: 100,
  abandonTimeoutMs: 120000,
});

Options (LLMDedupOptions)

  • maxWaitMs (number, default 30000) -- Maximum milliseconds a subscriber waits before timing out. Set to 0 to disable subscriber timeouts.
  • timeoutBehavior ('reject' | 'fallthrough', default 'reject') -- On timeout: reject with DedupTimeoutError, or re-execute fn() as an independent call.
  • maxSubscribers (number, default 100) -- Maximum coalesced callers per in-flight entry. Excess callers become independent owners.
  • abandonTimeoutMs (number, default 120000) -- Entries older than this are automatically cancelled by the background sweep.
  • tokenEstimator ((text: string) => number, optional) -- Custom token estimation function.
  • normalizer ((text: string) => string, optional) -- Applied to the canonical JSON string before SHA-256 hashing.
  • logger ({ warn: (m: string) => void; debug?: (m: string) => void }, optional) -- Logger for warnings and debug messages.

Return Value (LLMDedup)

The returned object exposes the following methods:

  • execute: <T>(key: string, fn: () => Promise<T>, options?: ExecuteOptions) => Promise<T> -- Run fn() or join an existing in-flight call for key.
  • getInflight: () => InflightInfo[] -- Return metadata for all currently in-flight entries.
  • stats: () => DedupStats -- Return aggregate dedup statistics.
  • resetStats: () => void -- Reset all cumulative statistics to zero.
  • cancelInflight: (key: string) => void -- Cancel a specific in-flight entry, rejecting all subscribers.
  • cancelAll: () => void -- Cancel all in-flight entries.
  • close: () => Promise<void> -- Cancel all entries, stop background timers, and mark the instance as closed.

execute<T>(key, fn, options?)

The core method. If no in-flight entry exists for key, fn() is called and the caller becomes the owner. If an in-flight entry already exists, the caller subscribes and receives a structuredClone of the owner's result when it settles.

const result = await dedup.execute('my-key', async () => {
  return await llmClient.chat.completions.create(params);
});

ExecuteOptions

  • signal (AbortSignal, optional) -- An AbortSignal for caller-side cancellation.
  • maxWaitMs (number, default: instance setting) -- Override the instance-level subscriber timeout for this call.

canonicalizeKey(params, normalizer?): string

Derive a deterministic SHA-256 hex key from LLM request parameters.

The function extracts only output-affecting fields (messages, model, temperature, top_p, max_tokens, frequency_penalty, presence_penalty, seed, tools, tool_choice, response_format, stop, system) from the input object, recursively sorts all object keys, and computes a SHA-256 hash. Fields like stream, user, api_key, and request_id are excluded, so requests that differ only in non-semantic fields produce the same key.

If params is not a plain object (e.g., a raw string), it is wrapped under a _raw key before hashing.

import { canonicalizeKey } from 'llm-dedup';

const key = canonicalizeKey({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello' }],
  temperature: 0.7,
  stream: true,  // ignored
  user: 'alice', // ignored
});

Parameters

  • params (unknown) -- The LLM request parameters object (or any value).
  • normalizer ((text: string) => string, optional) -- Applied to the canonical JSON string before hashing.

Returns

A 64-character lowercase hexadecimal SHA-256 digest string.
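Conceptually, key derivation composes field extraction, key-sorted serialization, and SHA-256 hashing. The sketch below follows the field list and behavior described above, but it is a simplified illustration -- the function name canonicalKeySketch is hypothetical and the real implementation may differ in detail:

```typescript
import { createHash } from 'node:crypto';

// Output-affecting fields, as listed in the docs above.
const SEMANTIC_FIELDS = [
  'messages', 'model', 'temperature', 'top_p', 'max_tokens',
  'frequency_penalty', 'presence_penalty', 'seed', 'tools',
  'tool_choice', 'response_format', 'stop', 'system',
];

// Recursive key-sorted JSON serialization (arrays keep element order).
function sortedStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return '[' + value.map(sortedStringify).join(',') + ']';
  }
  if (value !== null && typeof value === 'object') {
    return '{' + Object.keys(value).sort()
      .map((k) => JSON.stringify(k) + ':' +
        sortedStringify((value as Record<string, unknown>)[k]))
      .join(',') + '}';
  }
  return JSON.stringify(value) ?? 'null';
}

// Hypothetical sketch of the derivation pipeline, not the package's source.
function canonicalKeySketch(
  params: unknown,
  normalizer?: (text: string) => string,
): string {
  // Non-plain-object inputs are wrapped under _raw, per the docs.
  const obj = (params !== null && typeof params === 'object' && !Array.isArray(params))
    ? Object.fromEntries(SEMANTIC_FIELDS
        .filter((f) => f in (params as Record<string, unknown>))
        .map((f) => [f, (params as Record<string, unknown>)[f]]))
    : { _raw: params };
  const canonical = sortedStringify(obj);
  const text = normalizer ? normalizer(canonical) : canonical;
  return createHash('sha256').update(text).digest('hex');
}
```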


sortedStringify(value): string

Recursively serialize a value to a JSON string with all object keys sorted alphabetically. Arrays preserve their element order. This is the serialization step used internally by canonicalizeKey.

import { sortedStringify } from 'llm-dedup';

sortedStringify({ z: 1, a: 2, m: 3 });
// '{"a":2,"m":3,"z":1}'

sortedStringify({ b: { y: 1, x: 2 }, a: 0 });
// '{"a":0,"b":{"x":2,"y":1}}'

hashString(text): string

Compute a SHA-256 hash of the input string and return it as a 64-character lowercase hex digest.

import { hashString } from 'llm-dedup';

hashString('test');
// '9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08'

DedupTimeoutError

Thrown (or used to reject a subscriber's Promise) when a subscriber exceeds the configured maxWaitMs and timeoutBehavior is 'reject'.

  • name (string) -- 'DedupTimeoutError'
  • code (string) -- 'DEDUP_TIMEOUT'
  • key (string) -- The dedup key that timed out.
  • waitedMs (number) -- How long the subscriber waited before timing out.
  • message (string) -- 'Dedup timeout after {waitedMs}ms for key: {key}'

DedupCancelError

Thrown (or used to reject a subscriber's Promise) when an in-flight entry is cancelled via cancelInflight, cancelAll, close, or the abandon sweep.

  • name (string) -- 'DedupCancelError'
  • code (string) -- 'DEDUP_CANCEL'
  • key (string) -- The dedup key that was cancelled.
  • message (string) -- 'Dedup request cancelled for key: {key}'

DedupStats

Returned by dedup.stats().

  • total (number) -- Total number of execute calls.
  • unique (number) -- Number of calls that became owners (executed fn).
  • coalesced (number) -- Number of calls that subscribed to an existing in-flight entry.
  • coalescedRate (number) -- coalesced / total (0 when total is 0).
  • timeouts (number) -- Number of subscriber timeouts.
  • errors (number) -- Number of owner fn rejections.
  • currentInflight (number) -- Number of currently in-flight entries.
  • peakInflight (number) -- Highest number of concurrent in-flight entries observed.

InflightInfo

Returned by dedup.getInflight() for each in-flight entry.

  • key (string) -- The dedup key.
  • subscriberCount (number) -- Number of subscribers currently waiting.
  • elapsedMs (number) -- Milliseconds since the owner registered.
  • createdAt (number) -- Unix timestamp (ms) when the owner registered.

Configuration

Timeout Behavior

When a subscriber waits longer than maxWaitMs, the behavior depends on the timeoutBehavior setting:

'reject' (default) -- The subscriber's Promise rejects with a DedupTimeoutError. The caller is responsible for handling the error (e.g., retrying).

const dedup = createDedup({ maxWaitMs: 5000, timeoutBehavior: 'reject' });

'fallthrough' -- The subscriber detaches and re-executes fn() independently. This is useful when you want a best-effort dedup that never blocks callers beyond the timeout window.

const dedup = createDedup({ maxWaitMs: 5000, timeoutBehavior: 'fallthrough' });
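One way to picture fallthrough is a race between the shared in-flight promise and a timer; on timeout the subscriber detaches and runs fn() itself. The helper name withFallthrough below is hypothetical -- this is a sketch of the idea, not the package's internals:

```typescript
// Sketch: wait up to maxWaitMs for a shared in-flight promise;
// on timeout, fall through to an independent call. Illustrative only.
async function withFallthrough<T>(
  shared: Promise<T>,
  fn: () => Promise<T>,
  maxWaitMs: number,
): Promise<T> {
  const timedOut = Symbol('timeout');
  let timer!: ReturnType<typeof setTimeout>;
  const timeout = new Promise<typeof timedOut>((resolve) => {
    timer = setTimeout(() => resolve(timedOut), maxWaitMs);
  });
  const winner = await Promise.race([shared, timeout]);
  clearTimeout(timer);
  if (winner === timedOut) {
    shared.catch(() => {}); // detached: ignore the abandoned owner's outcome
    return fn();            // re-execute independently
  }
  return winner as T;
}
```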

Subscriber Overflow

When the number of subscribers on a single in-flight entry reaches maxSubscribers, additional callers bypass coalescing and execute fn() as independent owners. This prevents a single slow request from accumulating an unbounded number of waiting callers.

const dedup = createDedup({ maxSubscribers: 50 });

Abandon Sweep

A background setInterval runs at half the abandonTimeoutMs interval (capped at 60 seconds). Entries older than abandonTimeoutMs are automatically cancelled, and their subscribers receive DedupCancelError. The timer is unreffed so it does not prevent Node.js process exit.

const dedup = createDedup({ abandonTimeoutMs: 60000 });
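The sweep cadence described above reduces to a single line of arithmetic. The helper name sweepIntervalMs is hypothetical; the sketch assumes the interval is min(abandonTimeoutMs / 2, 60s), as stated above:

```typescript
// Sweep interval: half the abandon timeout, capped at 60 seconds.
function sweepIntervalMs(abandonTimeoutMs: number): number {
  return Math.min(abandonTimeoutMs / 2, 60_000);
}

// The timer is unreffed so it never keeps the process alive, e.g.:
//   const timer = setInterval(sweep, sweepIntervalMs(abandonTimeoutMs));
//   timer.unref();
```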

Custom Normalizer

Apply a transformation to the canonical JSON string before hashing. This can increase match rates for requests that differ only in non-semantic ways (e.g., case differences in model names).

const dedup = createDedup({
  normalizer: (text) => text.toLowerCase(),
});

The normalizer can also be passed directly to canonicalizeKey:

const key = canonicalizeKey(params, (text) => text.toLowerCase());

Error Handling

Owner errors propagate to all subscribers

If the owner's fn() rejects, the same error is propagated to every subscriber. Each subscriber's Promise rejects with the original error object.

try {
  await dedup.execute(key, fn);
} catch (err) {
  // This error is the same whether this caller was the owner or a subscriber.
}

Typed error classes

Use instanceof checks to distinguish dedup-specific errors from upstream LLM errors:

import { DedupTimeoutError, DedupCancelError } from 'llm-dedup';

try {
  await dedup.execute(key, fn);
} catch (err) {
  if (err instanceof DedupTimeoutError) {
    console.log(`Timed out after ${err.waitedMs}ms for key: ${err.key}`);
    console.log(`Error code: ${err.code}`); // 'DEDUP_TIMEOUT'
  } else if (err instanceof DedupCancelError) {
    console.log(`Cancelled for key: ${err.key}`);
    console.log(`Error code: ${err.code}`); // 'DEDUP_CANCEL'
  } else {
    // Upstream LLM error from the owner's fn()
    throw err;
  }
}

Closed instance

Calling execute after close() immediately rejects with Error('LLMDedup is closed').


Advanced Usage

Layering with a response cache

llm-dedup is complementary to response caching. The cache handles sequential duplicates (the same request repeated after a response has been cached); llm-dedup handles concurrent duplicates (the same request arriving again while the first is still in flight, before anything exists in the cache). Layer dedup on top of the cache so that concurrent cache misses for the same prompt result in a single upstream call:

import { createDedup, canonicalizeKey } from 'llm-dedup';

const dedup = createDedup();

async function chat(params: Record<string, unknown>) {
  const key = canonicalizeKey(params);

  // Check cache first
  const cached = await cache.get(key);
  if (cached) return cached;

  // Dedup concurrent cache misses
  const result = await dedup.execute(key, () =>
    openai.chat.completions.create(params)
  );

  // Store in cache for future sequential duplicates
  await cache.set(key, result);
  return result;
}

Monitoring coalescing effectiveness

Use stats() to track how well dedup is performing:

setInterval(() => {
  const s = dedup.stats();
  console.log(
    `Dedup: ${s.total} total, ${s.unique} unique, ${s.coalesced} coalesced ` +
    `(${(s.coalescedRate * 100).toFixed(1)}%), ${s.currentInflight} in-flight, ` +
    `peak ${s.peakInflight}`
  );
}, 60000);

Inspecting in-flight entries

Use getInflight() to see what is currently being deduplicated:

const inflight = dedup.getInflight();
for (const entry of inflight) {
  console.log(
    `Key: ${entry.key}, subscribers: ${entry.subscriberCount}, ` +
    `elapsed: ${entry.elapsedMs}ms`
  );
}

Per-call timeout override

Override the instance-level timeout for a specific call:

const result = await dedup.execute(key, fn, { maxWaitMs: 5000 });

Graceful shutdown

Always call close() when shutting down to stop background timers and reject pending subscribers:

process.on('SIGTERM', async () => {
  await dedup.close();
  process.exit(0);
});

Using with Anthropic

The same pattern works with any LLM provider. canonicalizeKey recognizes Anthropic's system field:

import { createDedup, canonicalizeKey } from 'llm-dedup';
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const dedup = createDedup();

async function chat(params: Anthropic.MessageCreateParams) {
  const key = canonicalizeKey(params);
  return dedup.execute(key, () => client.messages.create(params));
}

TypeScript

llm-dedup is written in TypeScript and ships type declarations (dist/index.d.ts). All public interfaces and error classes are exported:

import {
  createDedup,
  canonicalizeKey,
  sortedStringify,
  hashString,
  DedupTimeoutError,
  DedupCancelError,
} from 'llm-dedup';

import type {
  LLMDedupOptions,
  LLMDedup,
  ExecuteOptions,
  InflightInfo,
  DedupStats,
} from 'llm-dedup';

The execute method is generic. The return type is inferred from the fn parameter:

// result is inferred as OpenAI.ChatCompletion
const result = await dedup.execute(key, () =>
  openai.chat.completions.create(params)
);

License

MIT
