ecluse
Safe Haskell: None
Language: GHC2021

Ecluse.Telemetry.Resolve

Description

Telemetry configuration resolution and export-failure routing — the boot-time substrate that sits between the operator's environment and the OpenTelemetry SDK.

Écluse's maintainer runs Datadog, but the project is vendor-neutral, so an operator may describe the same telemetry identity in either dialect: a Datadog shop sets the DD_* variables, a plain OpenTelemetry shop sets the OTEL_* ones. This module is the self-aligning resolver that collapses both into one answer, so logs and traces share a single identity whichever dialect was provided.

The resolver

resolveTelemetry is a bounded precedence table over exactly four fields — service.name, deployment.environment, service.version, and the OTLP export endpoint — each resolved Datadog-value-wins → vanilla OpenTelemetry → default. It is deliberately not a general per-variable merge: only these four cross between the dialects, and only their fixed precedence is encoded. The DD_API_KEY / DD_SITE agentless-SaaS credentials are never read — Écluse exports to an operator-declared, node-local collector/Agent, never directly to a vendor's cloud, so there is no path by which a key in the environment turns into off-cluster egress. The endpoint itself is a declared destination (like the mirror queue), not an attack surface, so it is normalised and used as given, not classified or gated.

The resolved ResolvedTelemetry is the single source of truth for both halves of the telemetry stack: otelEnvironmentOverrides projects it back to the canonical OTEL_* variables the env-driven SDK reads (so a DD_*-only deployment still configures the exporter), and the same record feeds the dd log object that stitches a log line to its trace.

Export-failure routing

Telemetry failures must stay off the request path and out of raw stderr. The SDK's batch exporter runs asynchronously, so an unreachable collector never touches a served request. This module owns the shared throttle those failures coalesce through: an ExportFailureSink carries one throttle plus a katip target, and routeExportFailure surfaces the first failure plainly, then a periodic heartbeat carrying the suppressed count, so a persistently unreachable endpoint is one visible warning and a heartbeat, not a per-flush flood. The exporter wrappers (Ecluse.Telemetry) feed the sink through observeExportResult; installExportErrorHandler routes the SDK's own diagnostic stream through the same sink.

The configuration model and the export-failure mechanism are described in docs/architecture/observability.md.

The resolved telemetry identity

data ResolvedTelemetry

The telemetry identity resolved from the environment: the single source of truth for both the SDK configuration and the dd log object. rtEnvironment and rtVersion are Nothing when the operator set the value in neither dialect; they are genuinely optional resource attributes, not defaulted to a placeholder.

Constructors

ResolvedTelemetry 

Fields

data TelemetryEndpoint

A resolved OTLP export endpoint and the source it was resolved from.

Constructors

TelemetryEndpoint 

Fields

data EndpointSource

Where a resolved OTLP endpoint came from, so the boot path can distinguish a deliberately-configured target from the silent default and warn on the latter.

Constructors

FromDdAgentHost

Derived from DD_AGENT_HOST (as http://{host}:4318).

FromOtelEndpoint

Taken verbatim from OTEL_EXPORTER_OTLP_ENDPOINT.

DefaultedEndpoint

No endpoint was configured; the http://localhost:4318 default applies.
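
The three constructors above map onto a simple precedence walk. The following is a minimal sketch of that walk, not the module's implementation: `lookupNonBlank` and `resolveEndpointSketch` are hypothetical names, the `EndpointSource` type is a local stand-in that mirrors the documented constructors, treating whitespace-only values as blank is an assumption, and the normalisation step the module applies is left out.

```haskell
import Data.Char (isSpace)

-- Local stand-in mirroring the documented constructors; the real type lives
-- in Ecluse.Telemetry.Resolve.
data EndpointSource = FromDdAgentHost | FromOtelEndpoint | DefaultedEndpoint
  deriving (Show, Eq)

-- Hypothetical helper: a variable counts as set only when it is non-blank
-- (assumption: whitespace-only values are treated as blank).
lookupNonBlank :: String -> [(String, String)] -> Maybe String
lookupNonBlank key env =
  case lookup key env of
    Just v | not (all isSpace v) -> Just v
    _ -> Nothing

-- Resolve (url, source) in the documented order:
-- DD_AGENT_HOST, then OTEL_EXPORTER_OTLP_ENDPOINT, then the default.
resolveEndpointSketch :: [(String, String)] -> (String, EndpointSource)
resolveEndpointSketch env =
  case lookupNonBlank "DD_AGENT_HOST" env of
    Just host -> ("http://" ++ host ++ ":4318", FromDdAgentHost)
    Nothing ->
      case lookupNonBlank "OTEL_EXPORTER_OTLP_ENDPOINT" env of
        Just url -> (url, FromOtelEndpoint)
        Nothing -> ("http://localhost:4318", DefaultedEndpoint)
```

Carrying the source alongside the URL is what lets a boot path tell a deliberate `http://localhost:4318` apart from the silent default.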

resolveTelemetry :: [(String, String)] -> ResolvedTelemetry

Resolve the telemetry identity from an environment list, each field Datadog-value-wins → vanilla OpenTelemetry → default. service.name falls through DD_SERVICE → OTEL_SERVICE_NAME → service.name in OTEL_RESOURCE_ATTRIBUTES → ecluse; deployment.environment and service.version fall through DD_ENV/DD_VERSION → the matching OTEL_RESOURCE_ATTRIBUTES key → unset; the endpoint is DD_AGENT_HOST (as http://{host}:4318) → OTEL_EXPORTER_OTLP_ENDPOINT → http://localhost:4318.

A value present but blank is treated as unset, so an empty DD_ENV= does not stamp an empty environment onto every signal. DD_API_KEY and DD_SITE are never consulted.
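
The identity fields follow the same first-non-blank-wins chain. The sketch below illustrates it for service.name and deployment.environment. It is an independent illustration under stated assumptions, not the module's code: every function name is hypothetical, OTEL_RESOURCE_ATTRIBUTES is parsed as a plain comma-separated `key=value` list without trimming or percent-decoding, and whitespace-only values are assumed to count as blank.

```haskell
import Control.Applicative ((<|>))
import Data.Char (isSpace)
import Data.Maybe (fromMaybe)

-- A blank value is treated as unset.
nonBlank :: String -> Maybe String
nonBlank v = if all isSpace v then Nothing else Just v

envVar :: String -> [(String, String)] -> Maybe String
envVar key env = lookup key env >>= nonBlank

-- Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (chunk, []) -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn sep rest

-- Parse "k1=v1,k2=v2" into pairs, skipping entries that have no '='.
parseAttrs :: String -> [(String, String)]
parseAttrs raw =
  [(k, v) | entry <- splitOn ',' raw, (k, '=' : v) <- [break (== '=') entry]]

-- Look up one key inside OTEL_RESOURCE_ATTRIBUTES.
resourceAttr :: String -> [(String, String)] -> Maybe String
resourceAttr key env =
  envVar "OTEL_RESOURCE_ATTRIBUTES" env >>= lookup key . parseAttrs >>= nonBlank

-- DD_SERVICE, then OTEL_SERVICE_NAME, then the resource attribute, then "ecluse".
serviceNameSketch :: [(String, String)] -> String
serviceNameSketch env =
  fromMaybe "ecluse" $
    envVar "DD_SERVICE" env
      <|> envVar "OTEL_SERVICE_NAME" env
      <|> resourceAttr "service.name" env

-- DD_ENV, then the resource attribute, otherwise unset.
environmentSketch :: [(String, String)] -> Maybe String
environmentSketch env =
  envVar "DD_ENV" env <|> resourceAttr "deployment.environment" env
```

Because `(<|>)` on `Maybe` returns the first `Just`, the order of the operands is the precedence table, and a blank `DD_ENV=` simply falls through to the next source.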

>>> rtServiceName (resolveTelemetry [("DD_SERVICE", "api"), ("OTEL_SERVICE_NAME", "ignored")])
"api"
>>> teUrl (rtEndpoint (resolveTelemetry []))
"http://localhost:4318"

Canonical OTEL_* projection

otelEnvironmentOverrides :: [(String, String)] -> [(String, String)]

Project the resolved identity back to the canonical OTEL_* variables the env-driven SDK reads, so a DD_*-only deployment still configures the exporter. The overrides set OTEL_SERVICE_NAME, the OTLP endpoint, the http/protobuf protocol (the only transport built; gRPC is behind a disabled cabal flag), and an OTEL_RESOURCE_ATTRIBUTES whose service.name/deployment.environment/service.version keys are overlaid by the resolution while any other operator-set attributes are preserved.

Applied with setEnv before the SDK initialises (see prepareTelemetry); idempotent for a vanilla deployment that already set the same OTEL_* values.
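
The overlay of OTEL_RESOURCE_ATTRIBUTES can be pictured as follows. This is a minimal sketch over already-parsed key/value pairs. `overlayAttrs` and `renderAttrs` are hypothetical names, and placing the resolved keys first in the rendered string is an assumption, since the documentation does not state an ordering.

```haskell
import Data.List (intercalate)

-- Overlay resolved keys onto operator-set attributes: a resolved key replaces
-- any existing entry of the same name; all other attributes are preserved.
overlayAttrs :: [(String, String)] -> [(String, String)] -> [(String, String)]
overlayAttrs resolved existing =
  resolved ++ [kv | kv@(k, _) <- existing, k `notElem` map fst resolved]

-- Render pairs back to the "k1=v1,k2=v2" form used by OTEL_RESOURCE_ATTRIBUTES.
renderAttrs :: [(String, String)] -> String
renderAttrs attrs = intercalate "," [k ++ "=" ++ v | (k, v) <- attrs]
```

Overlaying the same resolved keys a second time leaves the result unchanged, which is the property that makes the projection idempotent for a deployment that already set the same values.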

Export-failure throttle (pure core)

data ThrottleState

The throttle state for SDK export-error routing: when an error was last logged, and how many have been suppressed since. Exposed so the throttle decision is unit-tested without wall-clock timing.

Constructors

ThrottleState 

Fields

Instances

Show ThrottleState
Eq ThrottleState

data ThrottleEmit

What throttleStep decided to do with an export error.

Constructors

EmitFirst

The first error: surface it plainly.

EmitHeartbeat Int

The throttle window elapsed: surface a heartbeat carrying the count of errors since the last surfaced one (this one included).

EmitSuppress

Within the window: suppress and count.

Instances

Show ThrottleEmit
Eq ThrottleEmit

initialThrottle :: ThrottleState

The initial throttle state: nothing logged, nothing suppressed.

throttleInterval :: NominalDiffTime

How long export errors are coalesced between surfaced heartbeats.

throttleStep :: NominalDiffTime -> UTCTime -> ThrottleState -> (ThrottleState, ThrottleEmit)

Advance the throttle for one export error at now: surface the first error, surface a heartbeat once the throttleInterval has elapsed since the last surfaced one (resetting the suppressed count), and otherwise suppress while counting. Pure, so a sequence of (time, decision) steps is asserted directly.
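
The decision described above fits in a small pure state machine. The sketch below is one way to write it, assuming a state of "last surfaced time plus suppressed count". The types and field names (`ThrottleSketch`, `lastSurfaced`, `suppressed`, `Emit`) are hypothetical stand-ins, since the real record fields are not shown on this page, and the 60-second interval in the usage note is an illustrative value, not the actual throttleInterval.

```haskell
import Data.List (mapAccumL)
import Data.Time.Calendar (fromGregorian)
import Data.Time.Clock (NominalDiffTime, UTCTime (..), addUTCTime, diffUTCTime)

-- Hypothetical stand-ins for ThrottleState and ThrottleEmit.
data ThrottleSketch = ThrottleSketch
  { lastSurfaced :: Maybe UTCTime -- when an error was last surfaced
  , suppressed :: Int -- errors suppressed since then
  }
  deriving (Show, Eq)

data Emit = First | Heartbeat Int | Suppress
  deriving (Show, Eq)

initialSketch :: ThrottleSketch
initialSketch = ThrottleSketch Nothing 0

-- One export error at `now`: surface the first, surface a heartbeat once the
-- interval has elapsed (count includes this error), otherwise suppress and count.
stepSketch :: NominalDiffTime -> UTCTime -> ThrottleSketch -> (ThrottleSketch, Emit)
stepSketch interval now st =
  case lastSurfaced st of
    Nothing -> (ThrottleSketch (Just now) 0, First)
    Just prev
      | diffUTCTime now prev >= interval ->
          (ThrottleSketch (Just now) 0, Heartbeat (suppressed st + 1))
      | otherwise -> (st {suppressed = suppressed st + 1}, Suppress)

-- Feed error arrival offsets (seconds after an arbitrary epoch) through the
-- step function and collect the decisions, with no wall clock involved.
decisionsAt :: NominalDiffTime -> [NominalDiffTime] -> [Emit]
decisionsAt interval offsets =
  snd (mapAccumL (\st off -> stepSketch interval (addUTCTime off epoch) st) initialSketch offsets)
  where
    epoch = UTCTime (fromGregorian 2024 1 1) 0
```

With an assumed 60-second interval, `decisionsAt 60 [0, 10, 30, 61, 70, 125]` evaluates to `[First, Suppress, Suppress, Heartbeat 3, Suppress, Heartbeat 2]`: the heartbeat at 61 s counts the two suppressed errors plus itself.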

Export-failure routing

data ExportFailureSink

The shared export-failure sink: a single throttle plus the katip target that every export failure feeds — the span exporter, the metric exporter, and the SDK's own diagnostic stream — so a persistently unreachable collector is one coalesced stream (the first failure plainly, then a periodic heartbeat) rather than several independent floods.

The clock and the surfacing action are injected so the throttle decision is unit-tested without wall-clock timing or a live katip scribe (mirroring the pure throttleStep tests); exportFailureSink wires the production clock and katip target.

newExportFailureSink :: IO UTCTime -> (Severity -> Text -> IO ()) -> IO ExportFailureSink

Build an export-failure sink over an injected clock and surfacing action.

exportFailureSink :: LogEnv -> IO ExportFailureSink

The production sink: the wall clock and the composition-root LogEnv as the katip target, tagged with this module (the plain-IO katip path the boot phase uses).

routeExportFailure :: ExportFailureSink -> Text -> IO ()

Route one export-failure diagnostic through the shared throttle into katip: the first surfaced plainly, a heartbeat carrying the suppressed count once throttleInterval has elapsed since the last surfaced one, otherwise suppressed and counted.
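
The value of injecting the clock and the surfacing action is easiest to see in a test-shaped sketch. The code below is a simplified stand-in, not the module's sink: all names are hypothetical, `String` replaces `Text`, the katip `Severity` argument is dropped, the heartbeat wording is invented, and the 60-second interval is an assumed value.

```haskell
import Data.IORef (IORef, atomicModifyIORef', modifyIORef', newIORef, readIORef, writeIORef)
import Data.Time.Calendar (fromGregorian)
import Data.Time.Clock (NominalDiffTime, UTCTime (..), addUTCTime, diffUTCTime)

-- Hypothetical stand-in for the sink: shared throttle state plus the injected
-- clock and surfacing action.
data SinkSketch = SinkSketch
  { sinkState :: IORef (Maybe UTCTime, Int) -- (last surfaced, suppressed count)
  , sinkClock :: IO UTCTime
  , sinkSurface :: String -> IO ()
  }

newSinkSketch :: IO UTCTime -> (String -> IO ()) -> IO SinkSketch
newSinkSketch clock surface = do
  ref <- newIORef (Nothing, 0)
  pure (SinkSketch ref clock surface)

-- Assumed interval for illustration only.
intervalSketch :: NominalDiffTime
intervalSketch = 60

-- Decide atomically so that several feeders share one count, then run the
-- chosen surfacing action outside the state update.
routeSketch :: SinkSketch -> String -> IO ()
routeSketch sink msg = do
  now <- sinkClock sink
  action <- atomicModifyIORef' (sinkState sink) (decide now)
  action
  where
    decide now (Nothing, _) = ((Just now, 0), sinkSurface sink msg)
    decide now (Just prev, n)
      | diffUTCTime now prev >= intervalSketch =
          ((Just now, 0), sinkSurface sink (msg ++ " (" ++ show (n + 1) ++ " since last report)"))
      | otherwise = ((Just prev, n + 1), pure ())

-- Drive the sink with a fake clock (offsets in seconds from an arbitrary
-- epoch) and collect what was surfaced.
captureSketch :: [NominalDiffTime] -> IO [String]
captureSketch offsets = do
  clockRef <- newIORef epoch
  outRef <- newIORef []
  sink <- newSinkSketch (readIORef clockRef) (\m -> modifyIORef' outRef (m :))
  mapM_ (\off -> writeIORef clockRef (addUTCTime off epoch) >> routeSketch sink "export failed") offsets
  reverse <$> readIORef outRef
  where
    epoch = UTCTime (fromGregorian 2024 1 1) 0
```

Under these assumptions, `captureSketch [0, 10, 30, 61, 70]` yields `["export failed", "export failed (3 since last report)"]`: five failures become two surfaced lines, and the test needs neither a real clock nor a log scribe.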

observeExportResult :: ExportFailureSink -> Text -> ExportResult -> IO ()

Observe one exporter's ExportResult, routing a Failure through the sink and ignoring a Success. This only observes the failure: the inner result is the caller's to return unchanged, so export semantics are untouched (a failed export stays off the request path). The Text argument (signal) names the failing exporter (span or metric).
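
The observe-then-return-unchanged pattern can be sketched with stand-in types. Here `ResultSketch` is a hypothetical simplification of the SDK's ExportResult, the routing function is passed in directly instead of going through an ExportFailureSink, and the message wording is invented for illustration.

```haskell
-- Hypothetical stand-in for the SDK's ExportResult.
data ResultSketch = SuccessSketch | FailureSketch String
  deriving (Show, Eq)

-- Route a failure, tagged with the signal name; ignore a success.
observeSketch :: (String -> IO ()) -> String -> ResultSketch -> IO ()
observeSketch route signal result =
  case result of
    SuccessSketch -> pure ()
    FailureSketch detail -> route (signal ++ " export failed: " ++ detail)

-- Wrap an export action: observe its result, then hand the same result back
-- so the exporter's own semantics are untouched.
wrapExportSketch :: (String -> IO ()) -> String -> IO ResultSketch -> IO ResultSketch
wrapExportSketch route signal export = do
  result <- export
  observeSketch route signal result
  pure result
```

Because the wrapper returns exactly what the inner exporter returned, adding or removing the observation cannot change retry or drop behaviour.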

installExportErrorHandler :: ExportFailureSink -> IO ()

Install a process-global handler for the SDK's own diagnostic stream, routed through the shared sink so it coalesces with the exporter-failure feed. In hs-opentelemetry 1.0.0.0 the only caller of this handler is the SDK's internal logging — a failed OTLP export is dropped there rather than routed here — so the export-failure feed comes from the exporter wrappers (observeExportResult); this handler is kept for the SDK-internal diagnostics it still serves.

The forwarded diagnostic String is the SDK's own text and is trusted not to carry secrets: this module never reads the credential-bearing telemetry inputs (OTEL_EXPORTER_OTLP_HEADERS, DD_API_KEY, DD_SITE), so the only residual channel is whatever the SDK itself chooses to log, which the upstream exporter keeps to endpoint/status diagnostics.

Boot wiring

prepareTelemetry :: LogEnv -> [(String, String)] -> IO ()

Prepare the telemetry substrate at boot, before the SDK initialises: resolve the identity and normalise the canonical OTEL_* environment the env-driven SDK reads (so a DD_*-only deployment still configures the exporter). The export-failure observation itself is wired when the substrate stands up (Ecluse.Telemetry.withTelemetry), which builds the shared sink and installs the exporter wrappers and the SDK error handler.

A defaulted endpoint — neither DD_AGENT_HOST nor OTEL_EXPORTER_OTLP_ENDPOINT set — is surfaced through katip as one boot warning and falls back to http://localhost:4318; it is never a failure. The OTLP endpoint is an operator-declared destination (like the mirror queue), so it is normalised and used as given, not classified or gated.
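
The boot step above amounts to two actions in a fixed order: warn once if the endpoint was defaulted, then export the overrides into the process environment. A rough sketch follows, assuming the overrides have already been computed and that the warning is delivered through a caller-supplied action. `prepareSketch` and its warning text are hypothetical, and an empty value is treated as unset here.

```haskell
import Control.Monad (when)
import System.Environment

-- Hypothetical boot step: warn once when neither endpoint variable is set,
-- then write the canonical variables so an env-driven SDK sees them when it
-- initialises afterwards.
prepareSketch :: (String -> IO ()) -> [(String, String)] -> [(String, String)] -> IO ()
prepareSketch warn env overrides = do
  let unset key = maybe True null (lookup key env)
  when (unset "DD_AGENT_HOST" && unset "OTEL_EXPORTER_OTLP_ENDPOINT") $
    warn "no OTLP endpoint configured; defaulting to http://localhost:4318"
  mapM_ (uncurry setEnv) overrides
```

The ordering matters only relative to SDK initialisation: anything that reads OTEL_* from the environment must run after this step.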