From 57f731c2f0ef1a49c6061706c11e13d8566aff54 Mon Sep 17 00:00:00 2001
From: Rustem Shaydullin
Date: Sat, 18 Apr 2026 04:17:49 +0500
Subject: [PATCH] Add Documentation

---
 README.md                    |  10 +-
 doc/architecture.md          | 147 +++++++++++
 doc/domain-and-party.md      | 210 ++++++++++++++++
 doc/index.md                 |  88 ++++++-
 doc/limits-and-accounting.md | 233 +++++++++++++++++
 doc/provider-proxy.md        | 207 +++++++++++++++
 doc/risk-and-repair.md       | 163 ++++++++++++
 doc/route_pins.md            | 146 +++++++++--
 doc/routing.md               | 175 +++++++++++++
 doc/state-machines.md        | 472 +++++++++++++++++++++++++++++++++++
 10 files changed, 1823 insertions(+), 28 deletions(-)
 create mode 100644 doc/architecture.md
 create mode 100644 doc/domain-and-party.md
 create mode 100644 doc/limits-and-accounting.md
 create mode 100644 doc/provider-proxy.md
 create mode 100644 doc/risk-and-repair.md
 create mode 100644 doc/routing.md
 create mode 100644 doc/state-machines.md

diff --git a/README.md b/README.md
index 42ef3625..b07eb10e 100644
--- a/README.md
+++ b/README.md
@@ -44,6 +44,14 @@ $ make wdeps-test

 ## Documentation

-@TODO Please write a couple of words about what your project does and how it does it.
+Hellgate is the core payment-processing service of the platform. It orchestrates
+the full lifecycle of invoices, payments, refunds, chargebacks and recurrent
+paytools on top of a set of hierarchical, event-sourced state machines, and
+drives external payment providers through the `proxy-provider` Thrift protocol.
+
+The documentation for business logic and internal mechanics lives under
+[`doc/`](doc/index.md) — start at [`doc/index.md`](doc/index.md) for the full
+table of contents covering architecture, state machines, routing, limits and
+accounting, provider integration, risk/repair, and domain/party resolution.
[1]: http://erlang.org/doc/man/shell.html diff --git a/doc/architecture.md b/doc/architecture.md new file mode 100644 index 00000000..cbc2d3ff --- /dev/null +++ b/doc/architecture.md @@ -0,0 +1,147 @@ +# Architecture overview + +## What Hellgate is + +Hellgate (sometimes referred to as *payment processing* or the *processing +core*) is the service that owns the business-logic view of every invoice and +payment in the platform. Everything that happens to money — creating an +invoice, authorising a card, capturing a hold, paying out after settlement, +refunding, charging back, re-trying against a different provider — is a +transition on a Hellgate state machine. + +Hellgate is intentionally *not* an API gateway: it is invoked by the +customer-facing API (capi/capi-pcidss and similar) over Woody/Thrift, and it +consumes a set of backend services in turn. It is the source of truth for the +*state* of each invoice and payment; balances live in the accounter (shumway) +and provider-side data lives on the providers. + +## OTP applications + +The release is composed of five OTP applications under +[`apps/`](../apps): + +| Application | Responsibility | +| ------------------ | -------------- | +| `hellgate` | All business logic: state machines, sessions, routing hooks, limits, accounting, risk, repair, invoice templates. | +| `hg_proto` | Thrift service definitions, Woody service wrapper, protocol helpers. This is the module that mounts the Woody servers and marshals/unmarshals Thrift terms. | +| `hg_client` | Woody client for the public invoicing and invoice-templating APIs. Used from tests and ad-hoc tooling. | +| `hg_progressor` | Progressor (the newer event-sourced automaton backend) integration: wraps Progressor RPC, encodes/decodes events, propagates OpenTelemetry context, and exposes a `Processor` callback that Progressor invokes to run Hellgate machines. 
| +| `routing` | Routing logic as a standalone app: candidate gathering, scoring, rejection tracking, route explanations. The `hellgate` app calls into it but keeps no routing state itself. | + +## External services + +Hellgate is one piece of a wider microservice ecosystem. The dependencies it +consumes are shown below with the Hellgate module that wraps each one: + +| Service | Purpose | Wrapper module | +| ----------------- | --------------------------------------------------------- | -------------- | +| DMT (`dmt_client`) | Versioned domain configuration (providers, terminals, proxies, payment institutions, routing rules, fees, limits, categories, currencies). Every domain lookup in Hellgate goes through `hg_domain`. | [hg_domain.erl](../apps/hellgate/src/hg_domain.erl) | +| party-management | Party/shop configuration, operability, contracts. | [hg_party.erl](../apps/hellgate/src/hg_party.erl) | +| limiter / liminator | Turnover limit enforcement (`Get`, `Hold`, `Commit`, `Rollback`). | [hg_limiter.erl](../apps/hellgate/src/hg_limiter.erl), [hg_limiter_client.erl](../apps/hellgate/src/hg_limiter_client.erl) | +| shumway (accounter)| Double-entry accounting. Hellgate submits and commits posting plans. | [hg_accounting.erl](../apps/hellgate/src/hg_accounting.erl) | +| bender | Deterministic ID generation. Hellgate uses Bender-style IDs for invoices, payments, refunds, chargebacks. | Called through `hg_client`/party-management; no dedicated wrapper module. | +| cubasty (customer)| Storage for saved/recurrent payment resources. | [hg_customer_client.erl](../apps/hellgate/src/hg_customer_client.erl) | +| fault-detector | Rolling provider availability and conversion statistics. Used to mark dead adapters as unrouteable. | [hg_fault_detector_client.erl](../apps/hellgate/src/hg_fault_detector_client.erl) | +| proxy-provider | One Woody endpoint per provider adapter; implements `ProcessPayment`, `HandlePaymentCallback`, `GenerateToken`. 
| [hg_proxy_provider.erl](../apps/hellgate/src/hg_proxy_provider.erl), [hg_session.erl](../apps/hellgate/src/hg_session.erl) |
+| proxy-inspector | Risk scoring and card-token blacklists. | [hg_inspector.erl](../apps/hellgate/src/hg_inspector.erl) |
+| machinegun | Legacy event-sourced automaton backend. | Abstracted behind `hg_machine`. |
+| progressor | Current event-sourced automaton backend (set in production via `config/sys.config`; the code-level default is `machinegun`). | [hg_progressor.erl](../apps/hg_progressor/src/hg_progressor.erl) |
+
+## Backends: Machinegun, Progressor, Hybrid
+
+All persistent state in Hellgate lives in an event-sourced automaton. The
+backend selector lives in [hg_machine.erl:230](../apps/hellgate/src/hg_machine.erl):
+
+```erlang
+call_automaton(Function, Args) ->
+    call_automaton(Function, Args,
+        application:get_env(hellgate, backend, machinegun)).
+```
+
+- `machinegun` — legacy backend using Thrift automaton RPC.
+- `progressor` — newer native backend ([`config/sys.config`](../config/sys.config)
+  sets this in production).
+- `hybrid` — route some namespaces to Machinegun and others to Progressor via
+  [`hg_hybrid.erl`](../apps/hg_progressor/src/hg_hybrid.erl). This is the
+  migration mode.
+
+Regardless of backend, Hellgate is the *processor*: the backend tells it
+"here is a machine's current history and the incoming signal/call", Hellgate
+returns `{events, action, auxst}`, and the backend persists the new events.
+ +## End-to-end flow of a payment + +A simplified trace of `CreateInvoice → StartPayment → captured`: + +```mermaid +sequenceDiagram + autonumber + participant C as Client (capi) + participant W as hg_woody_service_wrapper + participant I as hg_invoice + participant P as hg_invoice_payment + participant R as hg_routing + participant L as hg_limiter + participant CF as hg_cashflow / hg_accounting + participant S as hg_session + participant Pr as proxy-provider + participant A as automaton backend + + C->>W: Thrift: CreateInvoice + W->>I: start invoice machine + I->>A: append ?invoice_created + C->>W: Thrift: StartPayment + W->>I: call + I->>P: delegate + P->>L: check + hold shop / payment limits + P->>R: gather_routes (+ fault detector, blacklist, pins) + R-->>P: chosen route + P->>CF: finalize cashflow + plan in shumway + P->>S: create session, ProcessPayment + S->>Pr: ProcessPayment + Pr-->>S: intent = finish | sleep | suspend + Note over S,Pr: async callbacks go to
ProviderProxyHost:ProcessPaymentCallback + S-->>P: session result + P->>L: commit payment limits + P->>CF: commit posting plan + P->>A: append events + W-->>C: response +``` + +On provider failure the same payment can cascade to the next candidate route +(see [Routing](routing.md) and [State machines](state-machines.md#cascade-and-retries)), +so a single business-level payment may correspond to several sessions. + +> [!IMPORTANT] +> Hellgate pins a domain revision at the start of the call and passes it +> through routing, term resolution and accounting. A config change landing +> mid-payment will not affect the decision — the payment stays on its +> original view of the world. + +## Thrift service surface + +Hellgate *exposes* these services (see [`hg_proto.erl`](../apps/hg_proto/src/hg_proto.erl)): + +| Path | Interface | Purpose | +| ----------------------------------------- | ------------------------------------------------- | ------- | +| `/v1/processing/invoicing` | `dmsl_payproc_thrift:Invoicing` | Invoice / payment / refund / chargeback operations. | +| `/v1/processing/invoice_templating` | `dmsl_payproc_thrift:InvoiceTemplating` | Invoice template lifecycle + term computation. | +| `/v1/stateproc/` | `mg_proto_state_processing_thrift:Processor` | Machine processor callback invoked by Machinegun. | +| `/v1/proxyhost/provider` | `dmsl_proxy_provider_thrift:ProviderProxyHost` | Provider-facing host callback API (`ProcessPaymentCallback`, `GetPayment`, session updates). | + +The Progressor backend replaces the `/v1/stateproc/...` Machinegun callback +with a Progressor-native `Processor` callback served by +[`hg_progressor_handler.erl`](../apps/hg_progressor/src/hg_progressor_handler.erl). + +## Namespaces + +Each kind of machine has a dedicated namespace (the `namespace/0` callback of +the `hg_machine` behaviour). 
The most important ones are: + +- `invoice` — an invoice and its nested payments, refunds, chargebacks +- `invoice_template` — reusable invoice templates +- `recurrent_paytools` — tokenised payment methods for recurrent billing + +Callbacks from providers are routed to `invoice` machines through a +tag-to-machine binding stored in +[`hg_machine_tag`](../apps/hellgate/src/hg_machine_tag.erl). diff --git a/doc/domain-and-party.md b/doc/domain-and-party.md new file mode 100644 index 00000000..6ec48a20 --- /dev/null +++ b/doc/domain-and-party.md @@ -0,0 +1,210 @@ +# Domain, party and varset + +Hellgate is stateless with respect to configuration: every decision that +depends on merchant settings, provider terms, fees, limits, routing rules, +available payment methods or acceptable currencies is resolved against the +**domain**, a versioned configuration store owned by DMT. This page +documents how that lookup works and which hooks feed it. + +## Domain (DMT) + +Module: [hg_domain.erl](../apps/hellgate/src/hg_domain.erl). + +All domain access flows through `hg_domain:get/2`: + +```erlang +get(Revision, Ref) -> + try extract_data(dmt_client:checkout_object(Revision, Ref)) + catch throw:#domain_conf_v2_ObjectNotFound{} -> + error({object_not_found, {Revision, Ref}}) + end. +``` + +- `Revision` is either the symbolic `latest` or a concrete integer version. + Hellgate *pins* a revision at the beginning of a payment flow and passes + it through the whole call chain — routing, term evaluation and + accounting all use the same revision so a config change mid-payment + cannot corrupt the outcome. +- `Ref` is one of the domain reference tuples: `{party_config, …}`, + `{shop_config, …}`, `{provider, ProviderRef}`, `{terminal, TerminalRef}`, + `{proxy, ProxyRef}`, `{limit_config, …}`, `{category, …}`, + `{currency, …}`, `{payment_institution, …}`, `{inspector, …}`, and so on. 
+- `dmt_client` is the shared DMT RPC client; Hellgate does not cache + outside of its short-lived per-request context. + +Every event that records a decision tied to the domain (route selection, +cash flow, limits) also records the revision used, so the entire decision +can be reconstructed deterministically. + +## Party and shop + +Module: [hg_party.erl](../apps/hellgate/src/hg_party.erl). + +A **party** owns one or more **shops**. Both are addressed by +`party_config_ref()` and `shop_config_ref()` respectively and stored as +domain objects: + +```erlang +get_party(PartyConfigRef) -> + checkout(PartyConfigRef, get_party_revision()). + +get_shop(ShopConfigRef, PartyConfigRef, Revision) -> + try dmt_client:checkout_object(Revision, {shop_config, ShopConfigRef}) of + #domain_conf_v2_VersionedObject{ + object = {shop_config, #domain_ShopConfigObject{ + data = #domain_ShopConfig{party_ref = PartyConfigRef} = ShopConfig + }} + } -> + {ShopConfigRef, ShopConfig}; + _ -> undefined + catch throw:#domain_conf_v2_ObjectNotFound{} -> undefined + end. +``` + +Notice that `get_shop/3` validates that the shop belongs to the given +party — this is the main cross-check that keeps one party from touching +another's shop by guessing its ID. + +Party objects carry: + +- Owner metadata and contact details +- A list of shops +- Contract terms and KYC status +- Suspension and activation state +- Blocking status (fraud, AML, etc.) + +Shops carry their own set of turnover limits, category, currency, +accepted payment tools and account references. Most of the per-merchant +behaviour a payment will see is ultimately sourced from the shop config. + +### Operability checks + +Before doing anything that mutates money, Hellgate asserts via +[`hg_invoice_utils`](../apps/hellgate/src/hg_invoice_utils.erl) that the +party and shop are *operable* — not blocked, not suspended, contract +active. A failing check aborts the operation with a clear error instead +of creating a dangling machine. 
+ +## Varset + +Module: [hg_varset.erl](../apps/hellgate/src/hg_varset.erl). + +The varset is a small map of the variables the domain uses to reduce +selectors. Think of it as the "question" we're asking the domain. It is +assembled as the payment progresses, with later stages adding more keys: + +```erlang +-type varset() :: #{ + category => dmsl_domain_thrift:'CategoryRef'(), + currency => dmsl_domain_thrift:'CurrencyRef'(), + cost => dmsl_domain_thrift:'Cash'(), + payment_tool => dmsl_domain_thrift:'PaymentTool'(), + party_config_ref => dmsl_domain_thrift:'PartyConfigRef'(), + shop_id => dmsl_base_thrift:'ID'(), + risk_score => hg_inspector:risk_score(), + flow => instant | {hold, dmsl_domain_thrift:'HoldLifetime'()}, + wallet_id => dmsl_base_thrift:'ID'() +}. +``` + +When it is handed to DMT, `prepare_varset/1` converts it into the Thrift +`#payproc_Varset{}` struct DMT selectors evaluate against. + +### Where the varset drives behaviour + +- **Routing** (`hg_routing:gather_routes/5`): filters routing rules and + prohibitions, producing the candidate list. +- **Term resolution**: fees, 3DS requirements, allowed payment methods, + hold lifetimes and other per-operation rules are selected from the + party/shop/provider terms against the varset. +- **Payment institution resolution** + (`hg_payment_institution:compute_payment_institution/3`): picks system + and external accounts by currency and varset. +- **Inspector**: the inspector is selected from the domain using the same + varset, so a shop can use different risk engines for different + categories or payment tools. + +The varset is the single bottleneck through which every "what does +config say here?" question in Hellgate has to pass. This is the reason a +design change that adds, say, a new routing dimension starts with a new +varset key. + +```mermaid +flowchart LR + I[Invoice + Payer] --> V[varset] + R[(domain revision)] --> V + RS[risk_score] --> V + V --> PI[payment institution
reduction]
+    V --> T[term selectors<br/>fees, 3DS, limits]
+    V --> RT[routing rules<br/>+ prohibitions]
+    V --> ACC[external account
selection] +``` + +> [!IMPORTANT] +> The varset is cumulative: later stages add keys. Earlier stages must +> not depend on keys that are only filled in later (e.g. routing has a +> `risk_score` because the inspector runs first; it does **not** have a +> `provider_ref` because routing is what sets it). + +## Payment institution + +Module: [hg_payment_institution.erl](../apps/hellgate/src/hg_payment_institution.erl). + +A payment institution is the top-level config blob for "a way of +accepting payments" — typically one per legal entity / licence / scheme. +It owns: + +- Routing rules (policies + prohibitions) +- Default cash flow postings +- System account references per currency +- External account sets (selected by varset) +- Inspector and proxy references + +`compute_payment_institution/3` reduces the referenced payment +institution against the varset and returns the concrete struct used by +routing, term resolution and accounting. Any per-request domain +variability lives inside that reduction; downstream code just sees the +resolved values. + +## Payment tools + +Module: [hg_payment_tool.erl](../apps/hellgate/src/hg_payment_tool.erl). + +Thin helper to extract the `PaymentTool` from a `Payer` variant (direct +card, recurrent token, payment terminal, digital wallet, crypto, etc.). +The payment tool is what enters the varset under `payment_tool` and what +the provider adapter ultimately consumes. + +## Request context + +Module: [hg_context.erl](../apps/hellgate/src/hg_context.erl). + +Per-request auxiliary data (Woody deadline, trace id, party client, +domain revision, current log scope) is stashed in a small record kept in +the process dictionary via `save/1`, `load/0`, `cleanup/0`. Long-running +call chains (especially repair and the Progressor processor callback) +save-and-cleanup around the handler to keep request scopes from leaking. + +## Putting it together + +A concrete example of how party + DMT + varset come together on +`CreatePayment`: + +1. 
The handler resolves the party and shop from the invoice + (`hg_party`) and asserts they are operable. +2. It builds an initial varset from the invoice and the payer. +3. It pins the current domain revision. +4. It calls the inspector (`hg_inspector`) to get a risk score; the + score goes into the varset. +5. It resolves the payment institution + (`hg_payment_institution:compute_payment_institution/3`) and reduces + routing rules against the varset. +6. Routing (`hg_routing`) produces a candidate list; cash flow + (`hg_cashflow`) is reduced against the same varset once a candidate + is chosen. +7. Terms (fees, limits) are reduced against the same varset before + limits are held and the provider call is issued. + +The same revision + varset pair is threaded through every subsequent +state transition, so replaying a payment's history is deterministic even +if the domain has moved on. diff --git a/doc/index.md b/doc/index.md index c6bbec82..a68f30b5 100644 --- a/doc/index.md +++ b/doc/index.md @@ -1,3 +1,87 @@ -# Документация +# Hellgate documentation -1. [Работа пинов](route_pins.md) +Hellgate is the core payment processing service of the platform. It implements +the authoritative state machine for invoices, payments, refunds and +chargebacks, selects a payment route through the configured providers and +terminals, enforces merchant/provider turnover limits, drives provider adapters +over Woody/Thrift, and writes the resulting postings into the double-entry +accounter (shumway). + +Almost everything Hellgate does is expressed as deterministic event-sourced +transitions over the [`hg_machine`](../apps/hellgate/src/hg_machine.erl) +abstraction, backed by either [Progressor](../apps/hg_progressor) (the +production backend, pinned in [`config/sys.config`](../config/sys.config)) or +Machinegun (the code-level default and legacy backend). 
+ +```mermaid +flowchart LR + Client([capi / capi-pcidss]) + subgraph Hellgate + direction TB + Invoicing[hg_invoice_handler] + Invoice[hg_invoice] + Payment[hg_invoice_payment] + Session[hg_session] + Routing[hg_routing] + Limits[hg_limiter] + Cashflow[hg_cashflow] + Invoicing --> Invoice --> Payment --> Session + Payment --> Routing + Payment --> Limits + Payment --> Cashflow + end + DMT[(DMT)] + PM[(party-management)] + Limiter[(limiter)] + Shumway[(shumway)] + FD[(fault-detector)] + Inspector[(proxy-inspector)] + Provider[(proxy-provider)] + Progressor[(Progressor / Machinegun)] + + Client -- Thrift --> Invoicing + Invoice <-- events --> Progressor + Routing --> DMT + Routing --> FD + Routing --> Inspector + Payment --> PM + Limits --> Limiter + Cashflow --> Shumway + Session --> Provider + Provider -. async callback .-> Invoice +``` + +> [!NOTE] +> Docs last verified against `fb3cabd4`. Claims that reference specific +> Erlang type names or module lines may drift after refactors — when in +> doubt, follow the linked source. + +## Business-domain documentation + +1. [Architecture overview](architecture.md) — what Hellgate is, which OTP + applications live here, which external services it depends on, and how a + single API call flows through the system. +2. [State machines](state-machines.md) — the `hg_machine` behaviour, invoice / + payment / refund / chargeback / session lifecycles, retries, cascades, + recurrent paytools. +3. [Routing](routing.md) — how route candidates (provider + terminal pairs) are + gathered from the domain, filtered by prohibitions, fault detector and + blacklist, scored and chosen. +4. [Route pins](route_pins.md) — how a payer is pinned to a specific candidate + within an equal-priority group. +5. [Limits and accounting](limits-and-accounting.md) — turnover limits + (hold/commit/rollback), cash flow computation, allocation, and shumway + posting plans. +6. 
[Providers, sessions and callbacks](provider-proxy.md) — sessions, the + provider proxy protocol, async callbacks via tags, timeout behaviour, token + generation for recurrent paytools. +7. [Risk, repair and operations](risk-and-repair.md) — inspector integration, + risk scores, blacklists, the repair API for stuck machines. +8. [Domain, party and varset](domain-and-party.md) — how configuration from + party-management and DMT is resolved at each step through the varset. + +## Erlang/OTP and build docs + +The top-level [README](../README.md) and [`CLAUDE.md`](../CLAUDE.md) document +the build and test commands, Erlang coding conventions, and the Docker / +Docker-Compose workflow for running the full dependency stack. diff --git a/doc/limits-and-accounting.md b/doc/limits-and-accounting.md new file mode 100644 index 00000000..61bf6787 --- /dev/null +++ b/doc/limits-and-accounting.md @@ -0,0 +1,233 @@ +# Limits and accounting + +Hellgate handles two orthogonal financial concerns on every payment: + +- **Turnover limits** — policy controls that restrict how much money can flow + through a given dimension (shop, provider, terminal, card, etc.) over a + rolling window. +- **Accounting** — double-entry postings against the accounter service + (shumway) that reflect what actually moved. + +Both subsystems are designed around a three-phase pattern (`hold` / `commit` +/ `rollback`) so that Hellgate can reserve capacity and posting intent *before* +the provider call and finalise it *after* we know the outcome. + +## Turnover limits + +Module: [hg_limiter.erl](../apps/hellgate/src/hg_limiter.erl) (plus the +Woody client wrapper in +[hg_limiter_client.erl](../apps/hellgate/src/hg_limiter_client.erl)). 
+ +### Where limits come from + +Turnover limits are declared in the domain and attached to two kinds of +objects: + +- **Shops** — `#domain_ShopConfig{turnover_limits = [...]}` +- **Providers / Provision terms** — + `#domain_PaymentsProvisionTerms{turnover_limits = {value, [...]}}` + +Each reference points at a `#domain_LimitConfig{}` that the limiter service +owns. Hellgate is only responsible for knowing *which* limits apply to a +given operation at a given revision — the limiter enforces the numeric +ceiling. + +### The operation identity + +Holds are idempotent on an operation ID. Hellgate derives the operation ID +from stable properties of the payment: + +- Payment-level limits: `[provider_id, terminal_id, invoice_id, payment_id, iteration]` +- Shop-level limits: `[party_id, shop_id, invoice_id, payment_id]` +- Refund/chargeback flows: analogous lists that include the refund or + chargeback ID. + +The `iteration` component is the cascade attempt counter, which lets the +limiter distinguish "same payment, retried on a different route" from "new +payment" without double-counting. + +### Payment limits — hold / commit / rollback + +```erlang +check_limits([turnover_limit()], Invoice, Payment, Session | undefined, Route, Iter) + -> {ok, [turnover_limit_value()]} + | {error, {limit_overflow, [binary()], [turnover_limit_value()]}}. + +hold_payment_limits(Limits, Invoice, Payment, Session, Route, Iter). +commit_payment_limits(Limits, Invoice, Payment, Session, Route, Iter, BinaryOperationId | undefined). +rollback_payment_limits(Limits, Invoice, Payment, Session, Route, Iter, BinaryOperationId | undefined). +``` + +- `check_limits/6` is a *dry-run* — it returns the current limit values so the + payment can fail fast (the routing step calls this to reject overflowing + candidates). +- `hold_payment_limits/6` is the real reservation. It is called once the + route is chosen and the cash flow has been built, right before the provider + call. 
+- `commit_payment_limits/7` finalises the hold on capture success. +- `rollback_payment_limits/7` releases the hold on cascade / retry / final + failure so that the reserved capacity becomes available again. + +### Shop and refund limits + +Shop limits (`check_shop_limits/5`, `hold_shop_limits/5`, etc.) and refund +limits (`hold_refund_limits/5`, `commit_refund_limits/5`, +`rollback_refund_limits/5`) follow exactly the same three-phase contract. + +Refunds reverse capture holds: a refund hold effectively releases the +corresponding capture hold on the same limit bucket, so a fully refunded +payment becomes invisible to turnover limits (as intended). + +## Cash flow + +Module: [hg_cashflow.erl](../apps/hellgate/src/hg_cashflow.erl), plus +helpers in [hg_cashflow_utils.erl](../apps/hellgate/src/hg_cashflow_utils.erl). + +### The model + +A *cash flow* is a list of postings: + +```erlang +-type posting() :: #domain_CashFlowPosting{ + source = account(), % merchant | provider | system | external + destination = account(), + volume = cash_volume(), % fixed | share | product + details = binary() % human-readable description +}. +``` + +Volumes are computed, not static: + +```erlang +?fixed(Cash) % literal amount +?share(P, Q, Of, Rounding) % P/Q of another amount +?product([Op, V1, V2, ...]) % composition +``` + +`Of` is a reference to another amount in the same flow (usually the payment +amount) so that commission-style postings stay correct when the base changes. + +### Finalisation + +`hg_cashflow:finalize/3` takes the abstract template, a context containing +the monetary parameters, and an `AccountMap` that resolves each abstract +account to a concrete account ID: + +```erlang +compute_postings(CF, Context, AccountMap) -> + [ + ?final_posting( + construct_final_account(Source, AccountMap), + construct_final_account(Destination, AccountMap), + compute_volume(Volume, Context), + Details + ) + || ?posting(Source, Destination, Volume, Details) <- CF + ]. 
+```
+
+The account map is built by
+[`hg_accounting:collect_account_map/1`](../apps/hellgate/src/hg_accounting.erl)
+and has four concrete parts:
+
+- **Merchant accounts** — settlement and guarantee from the shop config.
+- **Provider accounts** — the chosen provider's settlement account for the
+  payment currency.
+- **System accounts** — settlement and subagent accounts from the payment
+  institution for the payment currency.
+- **External accounts** — income/outcome accounts selected from the payment
+  institution's external account sets via the varset.
+
+### Reversal
+
+Refunds and chargeback steps reverse a flow by swapping source and
+destination on every posting:
+
+```erlang
+revert(CF) ->
+    [?final_posting(Destination, Source, Volume, revert_details(Details))
+     || ?final_posting(Source, Destination, Volume, Details) <- CF].
+```
+
+## Accounting (shumway)
+
+Module: [hg_accounting.erl](../apps/hellgate/src/hg_accounting.erl).
+
+The accounter is a straightforward double-entry ledger. Hellgate drives it
+with *posting plans*:
+
+- `plan(CashFlow, Context)` — submit staged postings. The accounter computes
+  the effect on each account's balance but does not make it visible yet.
+- `commit(PlanLog, PlanIDs)` — materialise the staged batches.
+- `rollback(PlanLog, PlanIDs)` — discard them.
+
+Each payment can produce several plan IDs over its life (authorisation,
+capture, refund, chargeback stages), and the payment state machine carries
+the plan log forward so that commits and rollbacks target the right batches.
+
+Accounting follows the state machine, not the other way around: postings are
+staged when the corresponding activity starts (e.g.
+`processing_accounter`) and committed when the activity resolves
+successfully (`finalizing_accounter`). This keeps the ledger consistent
+with what the state machine believes and lets us recover from a crash
+between stage and commit by replaying events.
+ +## Allocation + +Module: [hg_allocation.erl](../apps/hellgate/src/hg_allocation.erl). + +Allocation splits a payment across multiple recipients (think marketplace +sub-merchants). The domain types (`AllocationPrototype`, `Allocation`, +`AllocationTransaction`) and the arithmetic (`sub/2`, etc.) are in place, +but the feature is currently turned off: + +```erlang +calculate(_Prototype, _Party, _Shop, _Cost, _Terms) -> + {error, allocation_not_allowed}. +``` + +When enabled, allocation interleaves with cash flow: each allocation +transaction produces an additional sub-flow that is accounted for in +shumway and can be refunded independently. + +## Putting it together + +On a happy-path payment the finance subsystems execute in this order: + +```mermaid +sequenceDiagram + autonumber + participant P as hg_invoice_payment + participant L as limiter + participant CF as hg_cashflow + participant S as shumway + participant Pr as provider + + Note over P,L: routing stage + P->>L: check_limits (dry-run) + P->>L: hold_shop_limits + P->>L: hold_payment_limits + P->>CF: finalize cash flow + P->>S: plan (stage postings) + P->>Pr: ProcessPayment + alt success + Pr-->>P: success + trx + P->>L: commit_shop_limits + commit_payment_limits + P->>S: commit(plan) + else failure + Pr-->>P: failure + P->>S: rollback(plan) + P->>L: rollback_payment_limits + Note over P: maybe cascade + end +``` + +Refunds and chargebacks run the same pattern on their own plans and +inverted cash flows, so every terminal state leaves the ledger and the +limiter in a consistent place. + +> [!WARNING] +> The operation ID passed to the limiter must remain stable across retries +> of the *same* attempt and must change across cascades. `iter` exists +> specifically so that a cascade to a new route is a new operation, not a +> double-count of the old one. 
diff --git a/doc/provider-proxy.md b/doc/provider-proxy.md new file mode 100644 index 00000000..48d279b8 --- /dev/null +++ b/doc/provider-proxy.md @@ -0,0 +1,207 @@ +# Providers, sessions and callbacks + +Hellgate never talks directly to an acquirer, a 3-D Secure directory server +or an alternative-payment-method back end. Every external provider sits +behind a *provider proxy* — a separate Woody service that implements the +`proxy-provider` Thrift protocol and translates between the generic +Hellgate model and the provider's own API. + +This page describes how Hellgate invokes those adapters, how it receives +their callbacks, and how sessions keep everything consistent. + +## The adapter protocol + +Three RPCs on the `proxy-provider` interface are invoked from Hellgate: + +```erlang +process_payment(ProxyContext, Route). +handle_payment_callback(Payload, ProxyContext, Route). +generate_token(ProxyContext, Route). +``` + +Implementation lives in +[`hg_proxy_provider.erl`](../apps/hellgate/src/hg_proxy_provider.erl). + +Each call carries a `ProxyContext` — a self-contained snapshot of payment +info, previous session state (opaque to Hellgate), and merged +`ProxyOptions`. The options are collected by +`hg_proxy_provider:collect_proxy_options/1` which merges three layers, +terminal-specific first, provider-additional next, and proxy-definition +defaults last: + +```erlang +lists:foldl(fun(undefined, M) -> M; (M1, M) -> maps:merge(M1, M) end, #{}, [ + Terminal#domain_Terminal.options, + Proxy#domain_Proxy.additional, + ProxyDef#domain_ProxyDefinition.options +]). +``` + +This layering lets the same provider be reused across terminals while +still allowing per-terminal overrides. + +The adapter reply is a `provider intent`: + +- `{finish, FinishIntent}` — the adapter is done. Outcomes include + success (with a transaction reference to record), failure (propagated + back into the payment's failure channel), or a user-visible reason. 
+- `{sleep, SleepIntent}` — poll again after a timer. Hellgate sets a + machine timer; when it fires, the session resumes. +- `{suspend, SuspendIntent}` — register a tag and wait for an async + callback. + +See the `hg_session` module for how intents are decoded into activity +transitions. + +## Sessions + +Module: [hg_session.erl](../apps/hellgate/src/hg_session.erl). + +A session is one conversation with one adapter about one target state. The +struct is kept small on purpose because it is persisted as part of the +payment's event history: + +- `target` — the final status we want to drive this conversation towards + (`processed`, `captured`, `cancelled`, `refunded`). +- `status` — `active | suspended | finished`. +- `tags` — the currently registered callback tags for this session. +- `route` — the `(provider, terminal)` we're talking to. +- `proxy_state` — an opaque binary the adapter asks us to pass back on + every RPC. Hellgate never inspects it. +- `interaction` — a structured description of any UI the user sees (3DS + redirect, OTP page, QR code, …); also feeds into cascade logic. +- `ui_occurred` — a latching boolean set the first time the user + interacts. Cascade will not retry after this flips. +- `timings` — timestamps captured for diagnostics and SLAs. +- `repair_scenario` — optional manual override injected via the repair API. + +Lifecycle: + +1. `create/0` + `set_payment_info/2` — blank session, payment info attached. +2. `process/1` — first RPC to the adapter. Hellgate interprets the intent + and decides whether to finish, sleep or suspend. +3. `apply_event/3` — on each subsequent event (timer fire, callback arrival, + manual repair), advance the session state. +4. `deduce_activity/1` — derive the next payment activity (`flow_waiting`, + `processing_capture`, `finalizing_session`, …) from the session's + current status. 
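As a rough illustration of how the three intents could map onto session status, here is a minimal sketch. It assumes simplified tuples instead of the real Thrift records, and the action names (`set_timer`, `register_tag`) are placeholders chosen for this example rather than the actual `hg_session` internals.

```erlang
-module(session_intent_sketch).
-export([apply_intent/1]).

%% Simplified stand-ins for the Thrift intent records (assumption).
-type intent() :: {finish, term()} | {sleep, non_neg_integer()} | {suspend, binary()}.
-type status() :: active | suspended | finished.
-type action() :: none | {set_timer, non_neg_integer()} | {register_tag, binary()}.

%% Map an adapter intent to the next session status and a follow-up action.
-spec apply_intent(intent()) -> {status(), action()}.
apply_intent({finish, _Outcome}) ->
    %% The adapter is done: no timer and no tag are needed.
    {finished, none};
apply_intent({sleep, Seconds}) ->
    %% Poll again later: the session stays active and a timer is requested.
    {active, {set_timer, Seconds}};
apply_intent({suspend, Tag}) ->
    %% Wait for an async callback identified by the tag.
    {suspended, {register_tag, Tag}}.
```

Keeping the mapping a pure function of the intent is one way to make replay from the event history deterministic.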
+ +## Tags and async callbacks + +Async delivery is necessary because many provider flows are inherently +out-of-band (3DS redirects, bank push notifications, offline bank transfer +confirmation). Hellgate uses a *tag* as the rendezvous point between a +suspended session and an incoming callback. + +```mermaid +sequenceDiagram + autonumber + participant P as hg_invoice_payment + participant S as hg_session + participant Tag as hg_machine_tag + participant Pr as proxy-provider (adapter) + participant U as Payer / upstream + participant Host as hg_proxy_host_provider + + P->>S: process + S->>Pr: ProcessPayment + Pr-->>S: intent = suspend, tag = T + S->>Tag: put_binding(invoice, T, PaymentID, InvoiceID) + Note over S: session.status = suspended + U-->>Pr: completes 3DS / OTP / offline step + Pr->>Host: ProcessPaymentCallback(T, payload) + Host->>Tag: get_binding(invoice, T) + Tag-->>Host: (InvoiceID, PaymentID) + Host->>P: process_callback + P->>S: apply event, resume + S-->>P: session.status = finished (or suspended again) +``` + +Module: [hg_machine_tag.erl](../apps/hellgate/src/hg_machine_tag.erl). + +At session creation (or when the adapter returns a `suspend` intent) a tag +is registered: + +```erlang +put_binding(<<"invoice">>, Tag, PaymentID, InvoiceID). +``` + +The tag is handed to the adapter, which embeds it in whatever +user-facing URL the payer hits or passes it to the upstream system that +will later confirm the operation. When the adapter calls back into +Hellgate it invokes `ProcessPaymentCallback` on the +[`ProviderProxyHost`](../apps/hg_proto/src/hg_proto.erl) Thrift service +(mounted at `/v1/proxyhost/provider` by `hg_proto:get_service_spec/2`), +passing the tag **inside the Thrift payload** — the URL path itself is +fixed. The handler +[`hg_proxy_host_provider`](../apps/hellgate/src/hg_proxy_host_provider.erl) +then: + +1. Resolves `tag → (InvoiceID, PaymentID)` via + [`hg_machine_tag:get_binding/2`](../apps/hellgate/src/hg_machine_tag.erl). +2. 
Calls `hg_invoice:process_callback/2` on the invoice machine. +3. The invoice routes the callback into the correct payment/session. +4. The session either finishes (success/failure), sleeps again, or stays + suspended under a new tag. + +This layer is the reason the callback endpoint is *host-side* — the +adapter is a client of Hellgate for the callback, not the other way +round. + +## Timeout behaviour + +Each session declares a `timeout_behaviour()` from the domain. In broad +strokes: + +- Immediate — the adapter promised to respond synchronously; no polling + needed. +- Polling — the adapter is slow but pollable; Hellgate sets a timer and + calls `process/1` again when it fires. +- Callback — the adapter will drive completion via an async callback; the + timer is used as a fail-safe if the callback never arrives. + +Timers are implemented by the `set_timer` action returned from the +machine's `process_signal/2` and are honoured by the automaton backend. + +## Fault detector integration + +Every adapter RPC is reported to the fault detector. The client module +[`hg_fault_detector_client`](../apps/hellgate/src/hg_fault_detector_client.erl) +registers operations on start and finish and queries rolling statistics +at routing time. The statistics feed two decisions: + +- The route's availability (`alive`/`dead`) — a `dead` route is rejected + from the candidate list. +- The route's expected conversion rate — used as part of the scoring tuple + when two candidates tie on priority. + +Because statistics are reported per operation, a provider that is healthy +for authorisations but broken for refunds will be marked dead only for the +broken flow. + +> [!CAUTION] +> The `proxy_state` binary returned by an adapter is opaque to Hellgate and +> is persisted verbatim into the session event. Adapters must treat it as +> their own forward-compatible serialisation format — a non-backwards- +> compatible change will break in-flight sessions on replay. 
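One way an adapter might keep its `proxy_state` forward-compatible is to prefix the payload with a version and upgrade old versions on read. The sketch below is an assumption about adapter-side code, not part of Hellgate; the module name, the version numbers and the `attempt` field are all hypothetical.

```erlang
-module(proxy_state_sketch).
-export([encode/1, decode/1]).

-define(VERSION, 2).

%% Serialise the adapter state together with its format version.
-spec encode(map()) -> binary().
encode(State) when is_map(State) ->
    term_to_binary({?VERSION, State}).

%% Decode any known version; older payloads are upgraded, never rejected.
-spec decode(binary()) -> map().
decode(Bin) when is_binary(Bin) ->
    case binary_to_term(Bin, [safe]) of
        {2, State} -> State;
        {1, Old} -> upgrade_v1(Old)
    end.

%% Hypothetical upgrade: version 2 added an `attempt` counter.
upgrade_v1(Old) ->
    maps:merge(#{attempt => 0}, Old).
```

With this shape, a session that was suspended under version 1 can still be resumed after the adapter is redeployed with version 2, which is the failure mode the caution above warns about.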
+ +## Generating recurrent tokens + +For recurrent paytools the flow is similar, but the RPC is +`generate_token/2`. The response becomes the paytool's permanent payment +resource — subsequent payments against the same paytool skip the +cardholder-interactive stages entirely and drive the adapter through its +"use a previously-tokenised card" path. + +## Provider-side diagnostics + +Two modules exist purely to make provider conversations observable: + +- [`hg_proxy.erl`](../apps/hellgate/src/hg_proxy.erl) — low-level call + options helper shared across proxy types (provider and inspector). +- [`hg_profiler.erl`](../apps/hellgate/src/hg_profiler.erl) and the + `hg_timings` helper record per-session timings that are later attached + to events and surfaced in payment state. + +Together with the fault detector and the `interaction` field on sessions +this gives a fairly granular operational picture of every provider call. diff --git a/doc/risk-and-repair.md b/doc/risk-and-repair.md new file mode 100644 index 00000000..41a5302e --- /dev/null +++ b/doc/risk-and-repair.md @@ -0,0 +1,163 @@ +# Risk, repair and operations + +Two adjacent subsystems sit around the payment state machine to keep the +system safe and recoverable: + +- [Risk inspection](#risk-inspection) — assigns every payment a coarse + risk level and screens against card-level blacklists before routing. +- [Repair](#repair) — a controlled backdoor that lets operators nudge a + stuck payment into a known-good state. + +## Risk inspection + +Module: [hg_inspector.erl](../apps/hellgate/src/hg_inspector.erl). + +Risk inspection is a Woody call to the `proxy-inspector` service. 
Hellgate +picks the inspector definition from the domain (`#domain_Inspector{}`), +merges proxy options the same way it does for providers, and calls +`InspectPayment`: + +```erlang +inspect(Shop, Invoice, Payment, Inspector) -> + Context = #proxy_inspector_Context{ + payment = get_payment_info(Shop, Invoice, Payment), + options = maps:merge(ProxyDef#domain_ProxyDefinition.options, + Proxy#domain_Proxy.additional) + }, + {ok, RiskScore} = issue_call('InspectPayment', {Context}, + hg_proxy:get_call_options(Proxy, Revision), + FallBackRiskScore, Deadline), + RiskScore. +``` + +Notable properties: + +- **Risk scores are coarse.** The inspector returns `low | medium | high` + — a bucket, not a number. Downstream code uses the bucket as part of the + varset for routing and for term resolution. +- **Fallback is explicit.** If the inspector times out, returns `undefined` + or errors out, Hellgate returns the configured `fallback_risk_score`. + Payments never get stuck waiting for the inspector. +- **Deadlines are honoured.** The call runs under the Woody deadline from + the current request, so slow inspectors degrade to their fallback rather + than holding the machine hostage. + +```mermaid +flowchart TD + A[risk_scoring activity] --> B[resolve inspector from domain] + B --> C[call proxy-inspector:InspectPayment
under request deadline] + C --> D{response within
deadline?} + D -- yes, score --> R[add score to varset] + D -- yes, undefined --> F[use fallback_risk_score] + D -- timeout / error --> F + F --> R + R --> N[routing] +``` + +### Blacklist checks + +The inspector also serves the per-route blacklist that the routing layer +consults. `hg_inspector:check_blacklist/1` builds a context of + +```erlang +#proxy_inspector_BlackListContext{ + first_id = ProviderID, + second_id = TerminalID, + field_name = <<"CARD_TOKEN">>, + value = Token +} +``` + +and calls `IsBlacklisted`. A `true` return knocks the route out of the +candidate list with reason `in_blacklist` (see +[Routing → Stage 3](routing.md#stage-3--blacklist-filtering)). Payments +without a token (e.g. alternative payment methods) skip the check. + +### How the score is used + +The risk score flows into the payment pipeline at two points: + +- It is added to the **varset** (`hg_varset.erl`). Routing rules, term + selectors and fee selectors can therefore branch on risk — for + example, routing only low-risk payments to a cheap provider, or + charging a premium on high-risk ones. +- It gates 3-D Secure / step-up selection through the domain-level + payment method conditions. + +The inspector is *not* an allow/deny gate in itself — Hellgate relies on +domain configuration (routing + terms) to turn the score into a policy. + +## Repair + +Module: [hg_invoice_repair.erl](../apps/hellgate/src/hg_invoice_repair.erl). + +Every state machine exposes a `repair/3` entry point (via the `hg_machine` +behaviour). For invoices the repair surface is designed specifically to +unstick payments that have ended up in an inconsistent state — usually +because a provider vanished, a session timed out in an unusual way, or +a manual intervention is required to reconcile with an acquirer. 
+ +### Repair scenarios + +Operators submit a `#payproc_InvoiceRepairScenario{}`, which is one of: + +| Scenario | Effect | +| ------------------------------ | ------ | +| `fail_pre_processing` | Force the payment into `failed` with a supplied `#domain_Failure{}` *before* any side-effect (no route chosen, no limit held, no provider called). | +| `skip_inspector` | Substitute a supplied `risk_score` for the inspector call. Useful when the inspector itself is misbehaving. | +| `fail_session` | Inject a session *failure* into the in-flight session: the payment sees the session as if the provider had returned the given failure and trx info. | +| `fulfill_session` | Inject a session *success* into the in-flight session: the payment sees the session as if the provider had returned success and the given trx info. | +| `complex` | A list of scenarios to try in order; the first one whose activity matches fires. | + +### Safety checks + +`hg_invoice_repair:check_activity_compatibility/2` enforces that a scenario +only runs when the payment is in a compatible activity. For example, +`fail_pre_processing` is rejected once the payment is past routing; +`fail_session` and `fulfill_session` are rejected if there is no session to +repair. This keeps repair from silently skipping state (and money) it should +not touch. + +The repair path also requires an explicit revision input — operators must +state the domain revision they are repairing against — which prevents a +drifted config from bleeding into a repair that was prepared against an +older view of the world. + +> [!CAUTION] +> Repair is a privileged operation. `fulfill_session` in particular writes +> a *success* event for a payment the provider never acknowledged — it +> must only be used when a real reconciliation with the acquirer has +> confirmed the outcome. Getting this wrong means booking money that +> didn't move.
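The "first matching scenario fires" rule of the `complex` scenario can be sketched as a simple list walk. This is an illustration only: `pick_scenario/3` is a hypothetical name, and the `Compatible` callback stands in for whatever compatibility check the real code applies.

```erlang
-module(repair_complex_sketch).
-export([pick_scenario/3]).

%% Walk the scenarios in order and return the first one that the
%% supplied predicate accepts for the current activity.
-spec pick_scenario([S], A, fun((S, A) -> boolean())) ->
    {ok, S} | {error, no_matching_scenario}.
pick_scenario([], _Activity, _Compatible) ->
    {error, no_matching_scenario};
pick_scenario([Scenario | Rest], Activity, Compatible) ->
    case Compatible(Scenario, Activity) of
        true -> {ok, Scenario};
        false -> pick_scenario(Rest, Activity, Compatible)
    end.
```

Returning an explicit error when nothing matches mirrors the intent of the safety checks: a repair that fits no activity should be rejected rather than silently applied.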
+ +### Typical uses + +- A provider adapter is permanently offline: use `fail_session` with an + appropriate failure so the payment fails properly and limits roll back. +- An acquirer confirmed out-of-band that a transaction succeeded but the + provider's callback never arrived: use `fulfill_session` with the real + trx info. +- A misconfigured inspector keeps returning `undefined` for a specific + payment shape: use `skip_inspector` with an appropriate fallback. +- A known-bad invoice has to be terminated before it has any side effects: + use `fail_pre_processing`. + +Every repair operation goes through the same event-sourcing pipeline as a +normal call, so the repair itself is fully auditable — the emitted events +indicate that the new state came from a repair scenario rather than a +provider or customer action. + +## Operations summary + +Taken together, risk inspection and repair make the state machine robust +against two common failure modes in payment processing: + +- *Up-front uncertainty* — the inspector gives a cheap, bounded screen + before committing the payment to an expensive route, and degrades + gracefully if the inspector is unreachable. +- *Tail-end stuckness* — if a payment has wandered off the happy path but + the correct resolution is known, repair lets an operator apply it + without bypassing event sourcing, accounting or limits. + +Everything else — routing failures, provider timeouts, transient retry +loops — is meant to resolve itself via cascade and retry without human +intervention. diff --git a/doc/route_pins.md b/doc/route_pins.md index b62040b5..6132d582 100644 --- a/doc/route_pins.md +++ b/doc/route_pins.md @@ -1,35 +1,131 @@ -# Route pins +# Route pins -## The task +> This document replaces the original Russian design note.
The mechanism +described here is implemented in +[`hg_route`](../apps/routing/src/hg_route.erl) and +[`hg_routing`](../apps/routing/src/hg_routing.erl) (see +`gather_pin_info/2`, `select_better_route/2`, +`select_better_pinned_route/2`). -We have two or more routes with the same priority -and some split by weight. For example, 3 routes with weights 33:33:33. +## The problem -A payer comes to us. They pay for some service -and this payment goes through a specific terminal of a specific provider. -Simply put, they picked one of the candidates (routes) from the list with the same priority. +Consider a payment institution with three equal-priority routes sharing the +same weight — `33 : 33 : 33`. A payer arrives, pays for some service, and the +payment ends up going through, say, the second route. We would like every +*subsequent* payment from the same payer to reach the same route, without +pinning the *other* payers to it. -Now we want this payer to go through the same route in the future. +Naïvely re-randomising the weight on every payment would move returning +payers around between routes. Assigning a sticky route globally per merchant +would destroy the load split the merchant chose. Pins solve the problem at a +finer granularity: the *payer* (identified by a configurable set of features) +is stuck to whichever equal-priority candidate it first ended up on. -We identify the payer in some way. +## The mechanism -## The solution +Each route candidate in the domain can declare a +`#domain_RoutingPin{features = [...]}` — a set of payer characteristics the +candidate considers "identifying". The features currently recognised by +Hellgate are: -In each route candidate we can specify -[a list of characteristics](https://github.com/valitydev/damsel/blob/master/proto/domain.thrift#L2850-L2856) -by which we determine which payer has come to us.
+| Feature | Source | +| -------------- | ------------------------------------------------- | +| `currency` | payment currency | +| `payment_tool` | the full payment tool (card BIN, wallet, etc.) | +| `client_ip` | payer's IP address (may be `undefined`) | +| `email` | payer's email (may be `undefined`) | +| `card_token` | tokenised card identifier (may be `undefined`) | -When a request to process a payment arrives, we collect all the characteristics specified -in the particular candidate and compute a hash of those characteristics. -This hash is taken into account when sorting routes by desirability. +At routing time `hg_routing:gather_pin_info/2` walks the declared features +for the candidate and extracts their current values from the routing +context. The result is a plain map `#{feature => value}` which travels +along with the candidate as its `pin` (see +[`hg_route:pin/1`](../apps/routing/src/hg_route.erl)). -If, as in the example above, we have 3 route candidates with the same weight -and the list of characteristics (say we look only at email) matches, -then we lock the route to this characteristic value. -All subsequent payments with these values will go through the route that was used -in the first operation. Accordingly, the weight within one priority becomes 100:0:0. +During scoring, Hellgate sorts equal-priority candidates by +`#domain_PaymentRouteScores{}`. The critical hook is in +[`hg_routing:select_better_route/2`](../apps/routing/src/hg_routing.erl): -If one of these routes has a different set of characteristics, for example email and the client's IP address, then it takes part -in pin locking with routes that have the same set of characteristics. In this example, since it is the only one, the distribution -becomes 66:0:33.
If there were one more route with the same priority and the email-and-IP characteristic set, -the distribution would be 50:0:50:0 +```erlang +case {LeftPin, RightPin} of + _ when LeftPin /= ?ZERO, RightPin /= ?ZERO, RightPin == LeftPin -> + select_better_pinned_route(Left, Right); + _ -> + select_better_regular_route(Left, Right) +end +``` + +When two candidates carry the same pin value and that value is not the +zero sentinel (meaning they saw an identical payer fingerprint), the +regular weight-based random tie-break is replaced with a deterministic +one: + +```erlang +route_pin = erlang:phash2({Pin, hg_route:provider_ref(Route), + hg_route:terminal_ref(Route)}) +``` + +Because the pin value is shared, the only thing that differs between the +two hashes is `(provider_ref, terminal_ref)`. The `phash2` output is +stable across time and processes, so the *same* payer always picks the +*same* winner — exactly what the requirement asks for. + +Candidates whose feature sets do not overlap participate in their own +pinning group (their pin values differ, so the `RightPin == LeftPin` +clause never fires and they fall back to the normal random tie-break). + +## Worked example + +Three candidates A, B, C at the same priority with weights `33 : 33 : 33`. + +- If all three declare the feature set `{email}` and a payer with a fixed + email arrives, all three end up with identical `Pin` values. The + deterministic `phash2({Pin, P, T})` picks exactly one winner for that + email, so the effective distribution for that payer is + `100 : 0 : 0` — but a different email (or an unknown email) will pin + to a potentially different candidate. The aggregate split across the + population is still close to `33 : 33 : 33`. + +- If A and B declare `{email}` while C declares `{email, client_ip}`, the + `Pin` of C differs from that of A and B (unless by coincidence the IP + is absent for all payers, which is rare). A and B form one pin group; + C forms its own.
For a given payer the A/B group collapses onto a single + deterministic winner, while C is decided independently and keeps its own + share. If the hash picks A, the effective distribution for that payer is + `66 : 0 : 33`; if it picks B, it is `0 : 66 : 33`. + +## Flow + +```mermaid +flowchart LR + Route[Candidate with
#domain_RoutingPin{features=...}] --> GP[gather_pin_info/2] + VS[Routing context
currency, email, ip, tool, token] --> GP + GP --> P[pin :: #{feature => value}] + P --> SCORE[score_routes] + SCORE --> SEL{pins equal
and non-zero?} + SEL -- yes --> DET[select_better_pinned_route
phash2 Pin,P,T] + SEL -- no --> REG[select_better_regular_route
weight-based random] + DET --> WIN([winner]) + REG --> WIN +``` + +## Practical notes + +> [!NOTE] +> A feature that is `undefined` in the routing context still contributes +> to the pin value — two candidates that both declare `{email, ip}` with +> `ip = undefined` are considered equivalent for pinning purposes. This +> is intentional: the absence of a signal is itself a signal. + +> [!IMPORTANT] +> Pins only break ties *within* an equal-priority, equal-weight group. +> They do not override fault-detector rejection, blacklist rejection, +> limit overflow or priority ordering. A dead provider is still dead even +> if the payer is "pinned" to it. + +> [!TIP] +> The explanation rendered by +> [`hg_routing_explanation`](../apps/routing/src/hg_routing_explanation.erl) +> surfaces a `"Pin wasn't the same as in chosen route"` message when pins +> participated in the decision — useful when debugging why an expected +> route was not taken. diff --git a/doc/routing.md b/doc/routing.md new file mode 100644 index 00000000..cba45546 --- /dev/null +++ b/doc/routing.md @@ -0,0 +1,175 @@ +# Routing + +Routing is the process of turning an abstract payment into a concrete +`(Provider, Terminal)` pair that will actually talk to a real acquirer. The +code lives in the `routing` OTP application under +[`apps/routing/src/`](../apps/routing/src) — `hellgate` calls into it but +keeps no routing state of its own. + +The main entry point is +[`hg_routing:gather_routes/5`](../apps/routing/src/hg_routing.erl); the +routing *context* (`hg_routing_ctx`) is the value threaded through every +stage, accumulating candidates, rejections and score information. + +```mermaid +flowchart LR + A[Payment institution
routing rules + varset] --> B[1. Gather candidates
policies − prohibitions] + B --> C[2. Fault detector
reject 'dead' providers] + C --> D[3. Blacklist
reject card tokens] + D --> E[4. Limits
reject overflowing] + E --> F[5. Score & choose
priority · weight · pins · fail rate] + F --> G{winner?} + G -- yes --> R[chosen route
+ choice context] + G -- no --> X[no_route_found] +``` + +## The shape of a route + +A route is the `routing` app's view of a `(Provider, Terminal)` candidate +plus everything needed to filter and score it: + +- `provider_ref` / `terminal_ref` +- `priority` — integer, lower is preferred +- `weight` — used for tie-breaking within a priority band +- `pins` — payer-identifying characteristics (see [route_pins.md](route_pins.md)) +- `domain_revision` — the DMT revision the candidate was resolved at + +## Stage 1 — gathering candidates + +```erlang +gather_routes(Predestination, PaymentInstitution, Varset, Revision, GatherContext) -> + case PaymentInstitution#domain_PaymentInstitution.payment_routing_rules of + undefined -> hg_routing_ctx:new([]); + #domain_RoutingRules{policies = Policies, prohibitions = Prohibitions} -> + Candidates = get_candidates(Policies, Varset, Revision), + {Accepted, RejectedRoutes} = filter_routes( + collect_routes(Candidates, Predestination, Revision), + get_table_prohibitions(Prohibitions, Varset, Revision) + ), + hg_routing_ctx:new(Accepted) + end. +``` + +The steps are: + +1. Look up the payment institution for the invoice's shop. +2. Walk its `payment_routing_rules` — policies and prohibitions. Both are + DMT selector trees evaluated against the **varset** (category, currency, + cost, payment tool, risk score, party config ref, shop id, flow, wallet + id, etc.; see [Domain, party and varset](domain-and-party.md)). +3. Expand the matching policies into concrete `(Provider, Terminal)` + candidates. +4. Filter out any candidate that matches a prohibition. + +At this point `hg_routing_ctx` holds the *accepted* candidates and the +reason each rejected candidate dropped out. + +## Stage 2 — fault-detector filtering + +Even if the domain allows a candidate, the fault detector may have flagged +its provider as effectively dead. 
`hg_routing:filter_by_critical_provider_status/1` +pulls statistics from [`hg_fault_detector_client`](../apps/hellgate/src/hg_fault_detector_client.erl), +scores every candidate (availability + conversion rate) and rejects anything +whose availability status is `dead`: + +```erlang +{Route, {{dead, _} = AvailabilityStatus, _ConversionStatus}} -> + hg_route:to_rejected_route(Route, {'ProviderDead', AvailabilityStatus}) +``` + +Scores (and the `dead`/`alive` split) are cached into the context so that +later stages can re-use them for final ranking without a second RPC. + +## Stage 3 — blacklist filtering + +[`hg_routing:filter_by_blacklist/2`](../apps/routing/src/hg_routing.erl) runs +every remaining candidate against the inspector's blacklist: + +```erlang +check_routes([Route|Rest], BlCtx) -> + % For each (provider, terminal, card_token) call proxy-inspector:IsBlacklisted +``` + +Blacklists are keyed on `(provider_id, terminal_id, CARD_TOKEN, token)`, +where `token` is the card token value being checked. +Anything blacklisted is rejected with reason `in_blacklist`. + +Token-less candidates (e.g. when we do not yet have a payment tool, or the +payment is cash-on-delivery style) skip this check. + +## Stage 4 — limits + +Turnover limits can also eliminate candidates. Before selecting a route, +Hellgate evaluates the per-provider / per-terminal turnover limits and marks +overflowing candidates as rejected with reason `limit_overflow`. See +[Limits and accounting](limits-and-accounting.md) for the limiter model. + +## Stage 5 — scoring and choice + +Whatever survives is ranked. The individual route score is assembled from: + +- Priority (primary key — lower wins) +- Weight (secondary; interacts with pins) +- Fault-detector availability + conversion rate +- Blacklist status + +`hg_routing:choose_rated_route/1` sorts by the tuple +`(availability, priority, weight-driven pin score, fail rate)` and picks the +head.
It returns both the chosen route and a *choice context* that records +which alternative would otherwise have won and why each loser lost — this is +the payload consumed by +[`hg_routing_explanation`](../apps/routing/src/hg_routing_explanation.erl) to +produce human-readable "why did we pick this route" output surfaced via the +payproc API. + +## Route pins + +Full details in [route_pins.md](route_pins.md). In short: within a group +of candidates with the same priority and weight, each candidate can +declare a list of payer *features* (e.g. `[email]` or `[email, client_ip]`) +via `#domain_RoutingPin{}`. At routing time the values of those features +are pulled from the routing context and attached to the candidate; when +two candidates in the same priority group share the same pin value, the +weight-based random tie-break is replaced by a deterministic +`phash2({pin, provider, terminal})` comparison. + +The upshot is that the effective weight distribution collapses from, say, +`33:33:33` to `100:0:0` for a given payer if all three candidates share +the same pin set, keeping the payer "pinned" to the route they first +used. Candidates with different pin sets participate in their own, +independent pinning group, so the overall distribution still makes sense +across the population. + +## Cascade: re-routing on failure + +Cascade is the routing-level response to a failed session. See +[State machines → Cascade and retries](state-machines.md#cascade-and-retries) +for the trigger logic. Mechanically: + +1. The payment records its current route and increments `iter`. +2. The routing context still holds the remaining candidates and their + scores (they were stashed during stage 2). +3. The next candidate is picked with the same scoring rules, and a brand-new + session is created against that route. 
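The three mechanical steps above can be sketched as a pure state update. The map-based state and the `tried` key are assumptions made for this example; the real payment state is a record inside `hg_invoice_payment`, and the candidates are assumed to be already filtered and sorted by score.

```erlang
-module(cascade_sketch).
-export([next_attempt/1]).

%% Move to the next candidate route, or report that none is left.
%% State keys (all hypothetical): iter, route, candidates, tried.
-spec next_attempt(map()) -> {ok, map()} | {error, no_route_found}.
next_attempt(#{candidates := []}) ->
    {error, no_route_found};
next_attempt(#{iter := Iter, route := Current, candidates := [Next | Rest], tried := Tried} = St) ->
    {ok, St#{
        iter := Iter + 1,            % a cascade is a new attempt
        route := Next,               % best remaining candidate
        candidates := Rest,
        tried := [Current | Tried]   % remember the route that failed
    }}.
```

A new session would then be opened against the `route` in the returned state.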
+ +Because the fault detector's view can change between iterations, +`filter_by_critical_provider_status/1` is re-run at each cascade attempt — +a provider that was alive at attempt 1 may be dead by attempt 3. + +## Rejections as first-class data + +`hg_routing_ctx` keeps rejected candidates grouped by reason (not just a +flat drop list). The reasons it surfaces are: + +- `adapter_unavailable` — marked dead by the fault detector +- `in_blacklist` — inspector rejected the token +- `forbidden` — matched a prohibition rule +- `limit_overflow` — would push a turnover limit over its ceiling + +All four reasons are preserved in the final `ChoiceContext` so that operators +can see exactly why an expected route was not picked. + +> [!NOTE] +> Stages 2 (fault detector) and 3 (blacklist) are each one RPC per +> candidate. If routing shows up as a latency hotspot, the candidate set is +> usually the thing to trim — either by tightening domain prohibitions or +> by narrowing the varset that selects policies. diff --git a/doc/state-machines.md b/doc/state-machines.md new file mode 100644 index 00000000..f585f024 --- /dev/null +++ b/doc/state-machines.md @@ -0,0 +1,472 @@ +# State machines + +Every durable entity in Hellgate — invoices, payments, refunds, chargebacks, +invoice templates, recurrent paytools — is an event-sourced state machine +implemented as a module that satisfies the +[`hg_machine`](../apps/hellgate/src/hg_machine.erl) behaviour. + +## The `hg_machine` behaviour + +From [hg_machine.erl](../apps/hellgate/src/hg_machine.erl): + +```erlang +-type machine() :: #{ + id := id(), + history := history(), + aux_state := auxst() +}. + +-type result() :: #{ + events => [event_payload()], + action => hg_machine_action:t(), + auxst => auxst() +}. + +-callback namespace() -> ns(). +-callback init(args(), machine()) -> result(). +-callback process_signal(signal(), machine()) -> result(). +-callback process_call(call(), machine()) -> {response(), result()}. 
+-callback process_repair(args(), machine()) -> result(). +``` + +The backend hands Hellgate the full `history()` (list of past events) plus +`aux_state()` (an opaque cache used to skip replay of the whole history), and +Hellgate returns *new* events plus an optional `action`: + +- `set_timer` — schedule a deadline (e.g. invoice expiration, session poll) +- `remove` — ask the backend to delete the machine +- `notify` — outgoing notifications + +Calls come in three shapes: + +- `hg_machine:start/3` — `init/2` builds the initial event list. +- `hg_machine:call/3` / `hg_machine:thrift_call/5` — synchronous, returns a + response. Handled by `process_call/2`. +- `hg_machine:repair/3` — manual intervention, see [Repair](risk-and-repair.md#repair). + +The top-level backend selector lives in `hg_machine:call_automaton/3` and +picks between Machinegun, Progressor and the hybrid router based on the +`hellgate` `backend` env var. + +## Invoice machine + +Module: [hg_invoice.erl](../apps/hellgate/src/hg_invoice.erl). +Namespace: `invoice`. + +The invoice is the outer state machine. It owns: + +- The immutable `#domain_Invoice{}` (shop, cost, due date, metadata, cart or + product, optional mutations) +- Zero or more *nested* payment state machines keyed by `PaymentID` +- A cached reference to the party/shop at creation revision + +Its internal state (`#st{}` in `hg_invoice.erl`) tracks which sub-entity is +currently "active": + +```erlang +-type activity() :: + invoice % waiting on payment creation or expiration + | {payment, payment_id()}. 
+``` + +Key event types (from `payment_events.hrl`): + +- `?invoice_created(Invoice)` +- `?invoice_status_changed(Status)` +- `?payment_ev(PaymentID, PaymentEvent)` — wraps nested payment events + +Status transitions (from the domain): + +```mermaid +stateDiagram-v2 + [*] --> unpaid + unpaid --> paid: all payments captured + unpaid --> cancelled: explicit cancellation + unpaid --> expired: due date reached + paid --> [*] + cancelled --> [*] + expired --> [*] +``` + +The invoice handles the following calls (selected): + +- `start_payment` — creates the nested payment machine +- `capture_payment` / `cancel_payment` / `refund_payment` — delegate to the + relevant payment sub-machine +- `create_chargeback` / `accept_chargeback` / `reject_chargeback` / `reopen_chargeback` +- `process_callback` — async provider callback routed by tag + +`timeout` signals are used for expiration and for session polling of nested +payments. + +## Payment machine + +Module: [hg_invoice_payment.erl](../apps/hellgate/src/hg_invoice_payment.erl) +(around 4k lines — the largest module in the project). + +Payments are *sub-machines* of invoices: they do not have their own Machinegun +namespace; their events are wrapped in `?payment_ev(PaymentID, ...)` and +appended to the invoice's history. `hg_invoice_payment` provides the +apply/process functions that `hg_invoice` calls into. 
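To make the wrapping concrete, here is a minimal sketch of pulling one payment's events out of an invoice history. It assumes each wrapped event is a plain `{payment_ev, PaymentID, Event}` tuple. The real `?payment_ev` macro expands to a Thrift-derived term whose exact shape is not shown here, so treat the tuple layout as hypothetical.

```erlang
-module(payment_history_sketch).
-export([payment_events/2]).

%% Return, in order, the inner events that belong to PaymentID.
%% Entries that are not wrapped payment events (for example the
%% invoice's own events) are skipped by the generator pattern.
-spec payment_events(term(), [term()]) -> [term()].
payment_events(PaymentID, History) ->
    [Event || {payment_ev, ID, Event} <- History, ID =:= PaymentID].
```

Folding the result through the payment's apply function is then enough to rebuild that payment's state without a separate machine.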
+ +Payment status (from the domain): + +```mermaid +stateDiagram-v2 + [*] --> pending + pending --> processed + pending --> cancelled + pending --> failed + processed --> captured + processed --> failed + captured --> refunded + captured --> charged_back + refunded --> [*] + captured --> [*] + cancelled --> [*] + failed --> [*] + charged_back --> [*] +``` + +The payment struct (simplified) holds: + +- The immutable `#domain_InvoicePayment{}`, the parent invoice and party refs +- `activity` — which internal step is in flight (see below) +- `target` — the desired terminal session outcome (`processed`, `captured`, + `cancelled`, `refunded`) +- `route` — the current provider+terminal choice +- `iter` — cascade attempt counter +- `sessions` — the stack of provider interactions for this payment +- `cash_flow`, `allocation`, `limits` — financial state +- Nested `refunds` and `chargebacks` keyed by their IDs +- `failure`, `retry_attempts`, `repair_scenario` — recovery state + +### Payment steps + +The `activity()` type in +[`hg_invoice_payment.erl`](../apps/hellgate/src/hg_invoice_payment.erl) is a +tagged union — `{payment, Step}`, `{refund, RefundID}`, +`{chargeback, ChargebackID, ChargebackActivity}`, +`{adjustment_new | adjustment_pending, AdjustmentID}`, or `idle`. The +payment branch wraps a `payment_step()`, and it is the steps (not the outer +`activity()` atoms) that encode the moving pieces of a payment flow. The +full list of steps, each corresponding to a concrete side-effect: + +| `payment_step()` | What it does | +| --------------------------- | ------------------------------------------------------------------- | +| `new` | Freshly created, not yet validated. | +| `shop_limit_initializing` | Shop-level turnover hold via `hg_limiter`. | +| `shop_limit_failure` | Shop limits exceeded — payment will fail. | +| `shop_limit_finalizing` | Commit/rollback shop-level hold at the end of the flow. 
| +| `risk_scoring` | Calls the inspector (`hg_inspector`) for a risk score. | +| `routing` | Gathers and ranks candidate routes (`hg_routing`). | +| `routing_failure` | All candidates rejected — transition to `failed`. | +| `cash_flow_building` | Computes the final postings with `hg_cashflow:finalize/3`. | +| `processing_session` | Calls the provider adapter through `hg_session` / `hg_proxy_provider`. | +| `processing_accounter` | Submits a posting plan to shumway (`plan/2`). | +| `processing_capture` | Executes a separate capture session for two-step flows. | +| `processing_failure` | Decides cascade, retry or fail. | +| `updating_accounter` | Commits/rollbacks the posting plan. | +| `flow_waiting` | Waiting for an async provider callback. | +| `finalizing_session` | Cleans up transient session state after a session result. | +| `finalizing_accounter` | Final accounting commit after capture. | + +The *target* of a session encodes what we want to achieve next: +`processed` (authorise), `captured` (settle), `cancelled` (void), +`refunded` (reverse). + +### Cascade and retries + +There are two complementary mechanisms for dealing with provider failures: + +- **Cascade** — try the *next* route candidate. +- **Retry** — try the *same* route again, optionally after a sleep, driven by + a `hg_retry` policy. + +Cascade is controlled by domain config +([`#domain_CascadeBehaviour{}`](https://github.com/valitydev/damsel)) and +implemented in [`hg_cascade.erl`](../apps/hellgate/src/hg_cascade.erl): + +```erlang +is_triggered(Behaviour, OperationFailure, Route, Sessions) -> boolean(). +``` + +Cascade fires when: + +1. The operation failure matches one of the configured mapped error signatures + (prefix match over the error notation path — e.g. `preauthorization_failed` + covers all its sub-codes), **and** +2. The user did not interact during the session (no 3DS UI, no OTP step, etc. + — see `is_user_interaction_triggered/3`). 
Replaying a route is pointless + after the cardholder made a choice, so UI interactions block cascade. + +```mermaid +flowchart TD + F[session failed] --> E{error matches<br/>mapped signature?} + E -- no --> TERM[mark payment failed] + E -- yes --> UI{ui_occurred<br/>during session?} + UI -- yes --> TERM + UI -- no --> NEXT{another candidate<br/>available?} + NEXT -- no --> TERM + NEXT -- yes --> C[bump iter,<br/>rollback limits,<br/>
new session on next route] + C --> S[(processing_session)] +``` + +If both hold, the payment picks the next candidate from the routing context, +bumps `iter`, and starts a fresh session. Otherwise the failure is terminal. + +> [!TIP] +> A payment that has already shown the user a 3DS redirect will not cascade, +> because the cardholder has effectively made a choice. If you need to retry +> in that case it has to go through the cancel/create loop, not cascade. + +Retries use [`hg_retry.erl`](../apps/hellgate/src/hg_retry.erl)'s policy +algebra: + +```erlang +-type policy_spec() :: + {linear, retries_num() | {max_total_timeout, pos_integer()}, pos_integer()} + | {exponential, retries_num() | {max_total_timeout, pos_integer()}, number(), pos_integer()} + | {exponential, retries_num() | {max_total_timeout, pos_integer()}, number(), pos_integer(), timeout()} + | {intervals, [pos_integer(), ...]} + | {timecap, timeout(), policy_spec()} + | no_retry. +``` + +Retries are used for session polling, refund reprocessing, and async wait +loops. + +### Allocation and cash flow on payments + +When a payment progresses past routing the final cash flow is computed from +the domain's posting templates and the selected provider/terminal: + +- Merchant settlement and guarantee accounts (from the shop config) +- Provider settlement account (from the chosen provider, by currency) +- System settlement and subagent accounts (from the payment institution) +- External income/outcome accounts (selected by varset) + +Allocations (split payments) are implemented in +[`hg_allocation.erl`](../apps/hellgate/src/hg_allocation.erl) but are +currently disabled — `calculate/5` returns `{error, allocation_not_allowed}`. +The plumbing is in place for future re-enablement. + +## Refund machine + +Module: [hg_invoice_payment_refund.erl](../apps/hellgate/src/hg_invoice_payment_refund.erl). + +Refunds are sub-machines of a captured payment. 
A refund holds: + +- Its own `#domain_InvoicePaymentRefund{}` +- A reversed cash flow (source/destination swapped, details marked as + reversal) +- The sessions used to execute the refund on the provider +- The same route as the original payment (providers require the original + transaction) +- A `status()`: `pending | succeeded | failed` + +Refund activities are narrower than payment activities: + +```erlang +-type activity() :: + new + | session % provider interaction in flight + | failure % decide retry or give up + | accounter % stage the reversal posting plan + | finished. +``` + +```mermaid +stateDiagram-v2 + [*] --> new + new --> session: start refund on provider + session --> accounter: provider ack + session --> failure: provider error + failure --> session: retry policy allows + failure --> finished: terminal failure + accounter --> finished + finished --> [*] +``` + +Typical flow: + +1. Build the reversed cash flow with [`hg_cashflow:revert/1`](../apps/hellgate/src/hg_cashflow.erl). +2. Hold the refund's turnover limits (the inverse of the capture hold). +3. Create a refund session bound to the original route and call + `proxy-provider:ProcessRefund` (or a callback-driven flow). +4. On success, commit the reversal posting plan to shumway; on failure, + roll it back and either retry or mark the refund as failed. + +## Chargeback machine + +Module: [hg_invoice_payment_chargeback.erl](../apps/hellgate/src/hg_invoice_payment_chargeback.erl). + +Chargebacks model disputes initiated by the acquirer or card scheme and have +their own three-stage lifecycle: + +```erlang +-type stage() :: 'chargeback' | 'pre_arbitration' | 'arbitration'. +-type status() :: 'pending' | 'accepted' | 'rejected' | 'cancelled'. 
+``` + +```mermaid +stateDiagram-v2 + direction LR + [*] --> chargeback + chargeback --> pre_arbitration: reopen + pre_arbitration --> arbitration: reopen + chargeback --> Terminal + pre_arbitration --> Terminal + arbitration --> Terminal + state Terminal { + [*] --> accepted + [*] --> rejected + [*] --> cancelled + } +``` + +Each stage can have its own cash-flow plan, kept in the chargeback struct: + +```erlang +#chargeback_st{ + cash_flow_plans = #{ + ?chargeback_stage_chargeback() => [], + ?chargeback_stage_pre_arbitration() => [], + ?chargeback_stage_arbitration() => [] + } +} +``` + +Operations: + +- `create/2` — open a dispute at the `chargeback` stage. +- `accept/3` — merchant accepts the dispute; apply the stage's posting plan. +- `reject/3` — merchant disputes the claim. +- `reopen/3` — move to the next stage (chargeback → pre-arbitration → + arbitration). +- `cancel/3` — drop the chargeback entirely. + +Each stage transition can produce a new cash-flow plan so that dispute +liability is accounted for on every step, not just at the terminal outcome. + +## Sessions + +Module: [hg_session.erl](../apps/hellgate/src/hg_session.erl). + +A *session* is one interaction with a provider adapter. A payment can have +multiple sessions: one per cascade attempt, plus separate sessions for +capture, void and refund. + +The session struct is roughly: + +```erlang +-type t() :: #{ + target := target(), % desired terminal status + status := active | suspended | finished, + trx := 'maybe'(trx_info()), + tags := [tag()], % callback tags + timeout_behaviour := timeout_behaviour(), + context := tag_context(), % invoice/payment id + route := route(), + payment_info := payment_info(), + result => session_result(), + proxy_state => binary(), % opaque provider state + interaction => interaction(), % 3DS / redirect / OTP + ui_occurred => boolean(), + timings => timings(), + repair_scenario => repair_scenario() +}. +``` + +Session lifecycle: + +1. 
`create/0` — blank session with defaults. +2. `set_payment_info/2` — attach the data that will be sent to the adapter. +3. `process/1` — call `proxy-provider:ProcessPayment` (or the relevant op) + and interpret the returned intent: + - `{finish, FinishIntent}` — adapter is done, extract success/failure. + - `{sleep, SleepIntent}` — poll again after a timer. + - `{suspend, SuspendIntent}` — suspend and wait for an async callback. +4. `apply_event/3` — apply a provider callback or a local timeout. +5. `deduce_activity/1` — derive the next payment activity from the session + state. + +```mermaid +stateDiagram-v2 + [*] --> active: create + active --> active: sleep intent (set_timer, re-process) + active --> suspended: suspend intent (tag registered) + suspended --> active: callback arrives + active --> finished: finish intent + suspended --> finished: timeout fallback + finished --> [*] +``` + +Asynchronous callbacks are dispatched by tag. A tag is registered in +[`hg_machine_tag`](../apps/hellgate/src/hg_machine_tag.erl) at session +creation time, mapping the tag to `(invoice_id, payment_id)`. When a provider +`POST`s to the host endpoint at +`/v1/proxyhost/provider/callback/`, `hg_proxy_host_provider` looks up the +binding and forwards the callback into the invoice machine, which in turn +applies it to the session and the payment. + +`timeout_behaviour()` encodes what to do when a session times out — +immediate, polling or callback — and drives the `set_timer` actions that +Hellgate emits. + +## Recurrent paytools + +Recurrent paytools are a separate machine type (`recurrent_paytools` +namespace). A recurrent paytool represents a tokenised payment method +obtained via `proxy-provider:GenerateToken` and reused for subsequent +payments without a fresh cardholder interaction. 
+ +The token lifecycle runs through its own sessions against the provider, and +completed tokens are then consumed by payments whose invoice has +`make_recurrent = true` (see +[`hg_invoice_registered_payment.erl`](../apps/hellgate/src/hg_invoice_registered_payment.erl) +for the adjacent "registered" payment path used on the merchant side). + +## Invoice templates + +Module: [hg_invoice_template.erl](../apps/hellgate/src/hg_invoice_template.erl). + +Templates are reusable blueprints that produce an invoice when paired with a +price and (optionally) mutations. They live in their own namespace +(`invoice_template`) and expose CRUD plus `ComputeTerms`, which evaluates the +domain terms applicable to the template's shop to surface fees, available +payment methods, limits and similar. Templates also support invoice +mutations (see below). + +## Invoice mutations + +Module: [hg_invoice_mutation.erl](../apps/hellgate/src/hg_invoice_mutation.erl). + +Mutations are deterministic transformations of invoice data applied at invoice +creation. The only implemented mutation today is `amount` randomisation: + +```erlang +{amount, {randomization, #domain_RandomizationMutationParams{ + multiplicity = M, % only mutate amounts where amount rem M == 0 + min_amount = Min, + max_amount = Max, + direction = upward | downward | both +}}} +``` + +The mutation records both the `original` and the `mutated` amount so that the +invoice remains auditable. Once applied at creation time, mutations are +immutable for the life of the invoice. + +## Events and auxiliary state + +The `aux_state` field in the machine is an opaque cache (msgpack-encoded). +Each machine module populates it with whatever derived state is expensive to +recompute by replaying history (e.g. the most recent `#st{}`). The event +history remains the source of truth: on a cold start, `apply_event/3` can +rebuild the state from scratch. 
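The relationship between the cache and the history can be sketched as follows. This is a hedged illustration, not Hellgate's implementation: the module name, the `#{state, applied}` shape of the cached value and the explicit `Apply` and `Init` arguments are assumptions, whereas the real `aux_state` is opaque and msgpack-encoded.

```erlang
-module(aux_state_sketch).
-export([restore/4]).

%% restore(AuxState, History, Apply, Init) -> State
%% Apply is fun((Event, State) -> State), as accepted by lists:foldl/3.
%% The cache shape #{state, applied} is hypothetical.
restore(#{state := Cached, applied := N}, History, Apply, _Init) when
    is_integer(N), N >= 0, N =< length(History)
->
    %% Warm path: skip the N events already folded into the cached state.
    lists:foldl(Apply, Cached, lists:nthtail(N, History));
restore(_MissingOrStale, History, Apply, Init) ->
    %% Cold path: the history is the source of truth, so replay all of it.
    lists:foldl(Apply, Init, History).
```

When the cache is consistent with the history, both clauses produce the same state; the warm path only saves the work of re-applying events that the cached value already reflects.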
+ +Event marshalling is handled by +[`mg_msgpack_marshalling`](../apps/hellgate/src/mg_msgpack_marshalling.erl) and, +on the Progressor side, by the `unmarshal_events/1` helpers in +[`hg_progressor.erl`](../apps/hg_progressor/src/hg_progressor.erl).