An OpenNMS Horizon plugin that pushes performance data to any Prometheus-compatible Remote Write endpoint — Prometheus, Cortex, Grafana Mimir, VictoriaMetrics, Thanos Receive — and surfaces OpenNMS resource context as native Prometheus labels.

Incubation project — community support only

This is an incubation project of the OpenNMS Community. Support is available at opennms.discourse.group. No commercial support is available yet.

Overview

What this plugin does

prometheus-remote-writer implements the OpenNMS TimeSeriesStorage SPI from opennms-integration-api v2.0. When OpenNMS Horizon is configured with org.opennms.timeseries.strategy = integration and this plugin is the only TSS implementation registered, every collected sample flows through the plugin to a Prometheus-compatible Remote Write endpoint of your choice.

The plugin pushes OpenNMS resource context — node identity, foreign-source qualification, surveillance categories, interface descriptors, optional metadata — to the backend as native Prometheus labels. Operators query the resulting time series with PromQL directly, from Grafana’s native Prometheus data source. No OpenNMS-side query plugin and no round-trip to the OpenNMS REST API are required at query time.
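For example, once enrichment labels are flowing (see Labels and enrichment), interface throughput can be filtered and aggregated with plain PromQL. The metric name below is illustrative — actual names depend on your collections:

```promql
# Per-interface inbound throughput for one foreign source and location.
rate(ifHCInOctets{foreign_source="Routers", location="us-east"}[5m])

# Sum across all interfaces of one node, identified by its fs:fid pair.
sum by (node) (rate(ifHCInOctets{node="Routers:core-01"}[5m]))
```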

What this plugin does NOT do

This plugin runs on OpenNMS Horizon Core only. It is not yet supported on OpenNMS Sentinel.

The KAR will install cleanly into /opt/sentinel/deploy/ and the plugin’s TimeSeriesStorage OSGi service will register, but no samples will reach it: OpenNMS upstream has no Core → Sentinel sample-dispatch path for OIA TSS plugins, and the Sentinel-side streaming-telemetry adapter pipeline (architecturally compatible in principle) has not been verified end-to-end with this plugin.

If you need to offload sample persistence to a Sentinel container today, this is not the right plugin yet. Install it on Horizon Core and use the rest of this guide.

Where it fits

                       store(samples)
                            │
                            ▼
                ┌──────────────────────┐
   OpenNMS ────▶│  TimeSeriesStorage   │
   collectors   │   (this plugin)      │
                └──────────────────────┘
                            │
                            ▼  Remote Write v1 / v2
                ┌──────────────────────┐
                │  Prometheus / Mimir  │
                │  VictoriaMetrics /   │
                │  Cortex / Thanos     │
                └──────────────────────┘
                            ▲
                            │  PromQL via Prom HTTP API
                ┌──────────────────────┐
                │  Grafana (native     │
                │  Prometheus DS)      │
                └──────────────────────┘

Quick comparison vs. the legacy Cortex TSS plugin

OpenNMS ships a Prometheus integration today via opennms-cortex-tss-plugin. That plugin writes numeric samples to a Prometheus backend but keeps OpenNMS resource context (node label, foreign source, categories, asset record, interface descriptors) in a separate OpenNMS key-value store. To turn opaque resourceId labels back into human-readable resources, you need the OpenNMS Plugin for Grafana, which round-trips to the OpenNMS REST API at query time.

This plugin fixes that at the write path. Resource context is pushed to the backend as first-class Prometheus labels, so PromQL — on any vanilla Grafana Prometheus data source — works end-to-end with no OpenNMS query-time dependency.

Supported backends at a glance

Backend                Remote Write v1   Remote Write v2       Notes

Prometheus 3.x         ✅                ✅                    v2 receiver default-enabled
Prometheus 2.55+       ✅                ✅                    Receiver must be enabled with --web.enable-remote-write-receiver
Prometheus 2.50–2.54   ✅                ⚠ silently drops     Stay on v1 — see Wire protocols (v1 and v2)
Grafana Mimir 2.10+    ✅                ✅
VictoriaMetrics        ✅                ✅ (with v2 ingest)
Cortex                 ✅
Thanos Receive         ✅
Grafana Cloud          ✅                ✅

Reference specifications

The plugin is a clean-room implementation written from the public Remote Write specifications.

The normative requirement set lives in the project’s tss-plugin spec under openspec/specs/tss-plugin/spec.md.

Installation

Compatibility

Component         Requirement

OpenNMS Horizon   35+
JVM               Temurin / OpenJDK 17 (matches the Horizon container)
Apache Karaf      4.4.10
Integration API   opennms-integration-api v2.0

The plugin’s OSGi bundle is compiled to Java 17 bytecode to match Horizon’s runtime; running on an older JVM fails feature resolution at install time.

Install on OpenNMS Horizon Core

The plugin ships as a Karaf KAR (prometheus-remote-writer-kar-X.Y.Z.kar). Drop it into Karaf’s deploy directory and Karaf hot-installs it:

# Download the KAR from the GitHub Release matching your installed version.
# Example for v0.3.0:
curl -L -o /opt/opennms/deploy/prometheus-remote-writer.kar \
    {project-repo-url}/releases/download/v{revnumber}/{project-artifact}-kar-{revnumber}.kar

Confirm from the Karaf shell:

ssh -p 8101 admin@localhost
karaf@root()> bundle:list -s | grep prometheus-remote-writer

Activate the plugin as the active TSS

Set the time-series strategy in etc/opennms.properties.d/timeseries.properties:

org.opennms.timeseries.strategy = integration

Restart Core for the strategy switch to take effect. The next collector flush sends samples through the plugin.

Drop in the minimal metatag config before you restart

By default, OpenNMS does not attach node, foreign-source, location, or interface tags to samples — they arrive at the plugin carrying only name and resourceId, and your Prometheus series end up with bare {resourceId="…"} labels regardless of any plugin-side config. Add the four-line snippet in Minimal metatag config alongside the strategy switch above so the first samples already carry useful labels (node, instance, node_label, foreign_source, foreign_id, location).

Minimum configuration

Create the bare-minimum configuration at etc/org.opennms.plugins.tss.prometheus-remote-writer.cfg:

write.url = https://mimir.example.com/api/v1/push
read.url  = https://mimir.example.com/prometheus

read.url is the backend’s Prometheus-compatible root. The plugin appends /api/v1/series and /api/v1/query_range itself — do not include /api/v1 in the configured value. See Configuration reference for the full list of knobs and backend-specific URL shapes.
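As a sketch of what those URL shapes look like for other common backends (verify against your backend's documentation — paths vary by product and deployment mode):

```properties
# Prometheus (receiver enabled)
write.url = https://prom.example.com/api/v1/write
read.url  = https://prom.example.com

# VictoriaMetrics (single-node)
write.url = https://vm.example.com/api/v1/write
read.url  = https://vm.example.com/prometheus

# Thanos Receive (writes) + Thanos Query (reads)
write.url = https://receive.example.com/api/v1/receive
read.url  = https://query.example.com
```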

Verify

From the Karaf shell:

karaf@root()> bundle:list -s | grep prometheus-remote-writer
karaf@root()> opennms:prometheus-writer-stats

opennms:prometheus-writer-stats prints all plugin counters and gauges. Watch samples_written_total tick up as OpenNMS pushes its first samples through the plugin.

Default labels need OpenNMS metatag config to be useful

By default, samples arriving at the plugin carry only name and resourceId — OpenNMS does not attach node/interface/asset metatags unless you configure them. See Minimal metatag config for the four-line metatags config that turns those tags on.

Local sandbox (e2e)

The repository’s e2e/ directory contains a Docker Compose stack that stands up an OpenNMS Horizon core, Grafana, and one Prometheus-compatible backend of your choice. It is the quickest way to see the plugin work end-to-end on a laptop.

Prerequisites: Docker 24+ with Compose v2.

The fastest way to prove the plugin works is make smoke:

make smoke                               # all backends, sequential
make smoke BACKENDS=prometheus           # single backend
make smoke TIMEOUT=900 BACKENDS=mimir    # bump the per-backend deadline

For interactive exploration, see e2e/README.md in the source tree.

Configuration reference

All knobs live in etc/org.opennms.plugins.tss.prometheus-remote-writer.cfg (Karaf ConfigAdmin PID org.opennms.plugins.tss.prometheus-remote-writer) and take effect on the next flush cycle — no OpenNMS restart required, except where noted (typically wire-format and validator changes that require bundle restart).

Endpoint and authentication

Key Default Purpose

write.url

(required)

Remote Write v1/v2 ingest URL.

read.url

(required for read path)

Prometheus-compatible HTTP API root. The plugin appends /api/v1/series and /api/v1/query_range — do not include /api/v1 here.

instance.id

(unset)

Stamps every sample with onms_instance_id="<value>". Required for multi-instance deployments sharing a backend.

job.name

(unset)

Override the per-sample job derivation with a fleet-wide constant.

tenant.org-id

(unset)

Sets the X-Scope-OrgID header. Cortex/Mimir/VictoriaMetrics-cluster/Thanos partition by this.

auth.basic.username

(unset)

Basic auth — username.

auth.basic.password

(unset)

Basic auth — password.

auth.bearer.token

(unset)

Bearer auth. Mutually exclusive with Basic.

tls.ca-file

(unset)

PEM bundle to trust in addition to / in place of the JDK truststore.

tls.insecure-skip-verify

false

Disables hostname and certificate verification. Logs WARN on startup and every hour.

The plugin refuses to start if both auth.basic.* and auth.bearer.token are configured.

Wire format

Key Default Purpose

wire.protocol-version

1

1 = prometheus.WriteRequest (v1); 2 = io.prometheus.write.v2.Request with string interning. See Wire protocols (v1 and v2).

Write pipeline (in-memory)

Used when wal.enabled=false (the default). When the WAL is enabled, queue.capacity is ignored.

Key Default Purpose

queue.capacity

10000

Bounded in-memory queue. Overflow throws StorageException — OpenNMS sees the failure.

batch.size

1000

Maximum samples per Remote Write POST.

flush.interval-ms

1000

Flush whenever batch fills OR this interval elapses.

retry.max-attempts

5

5xx retry budget per batch.

retry.initial-backoff-ms

250

First backoff after a 5xx.

retry.max-backoff-ms

10000

Backoff ceiling.

shutdown.grace-period-ms

10000

Bounds the graceful-shutdown wait. With WAL enabled, bounds the in-flight HTTP wait, not a drain window.
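Assuming the common doubling strategy (the multiplier is not specified in this table), the three retry knobs combine into a bounded exponential backoff schedule. A sketch:

```python
def backoff_schedule(initial_ms: int, max_ms: int, attempts: int) -> list[int]:
    """Exponential backoff with a ceiling: initial, 2x, 4x, ... capped at max_ms."""
    return [min(initial_ms * (2 ** i), max_ms) for i in range(attempts)]

# Defaults: retry.initial-backoff-ms=250, retry.max-backoff-ms=10000, retry.max-attempts=5
print(backoff_schedule(250, 10_000, 5))  # → [250, 500, 1000, 2000, 4000]
```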

HTTP client

Key Default Purpose

http.connect-timeout-ms

5000

Socket connect timeout.

http.read-timeout-ms

30000

Response read timeout.

http.write-timeout-ms

30000

Request write timeout.

http.max-connections

16

Per-route OkHttp connection pool size.

Read path

Key Default Purpose

max-series-lookback-seconds

7776000 (90 d)

findMetrics lookback when no time window is supplied.

Label policy

Key Default Purpose

labels.include

(empty)

Glob list of source-tag keys to surface as labels in addition to the default allowlist. Snake-cased on the wire. See Labels and enrichment.

labels.exclude

(empty)

Default labels to drop. Comma-separated label names.

labels.rename

(empty)

Rename a label. Comma-separated from → to pairs.

labels.copy

(empty)

Add a second name for a label. Comma-separated from → to pairs.

metric.prefix

(empty)

Prefix added to every metric name on the wire (sanitized).

Pipeline order is defaults → exclude → include → copy → rename → metadata. See Labels and enrichment for the full mental model and worked recipes.
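A minimal sketch of the exclude → copy → rename stages of that pipeline (illustrative only — glob matching, the include stage, and metadata handling are omitted, and rule parsing is simplified to pre-split pairs):

```python
def apply_label_policy(labels, exclude=(), copy=(), rename=()):
    """defaults -> exclude -> copy -> rename, on a plain dict of labels.
    copy/rename entries are (from, to) pairs, mirroring 'from -> to' config."""
    out = {k: v for k, v in labels.items() if k not in set(exclude)}
    for src, dst in copy:      # add a second name, keep the original
        if src in out:
            out[dst] = out[src]
    for src, dst in rename:    # move the value to the new name
        if src in out:
            out[dst] = out.pop(src)
    return out

defaults = {"node": "Routers:core-01", "foreign_source": "Routers", "node_label": "core-01"}
print(apply_label_policy(defaults,
                         exclude=["node_label"],
                         copy=[("foreign_source", "tenant")]))
# → {'node': 'Routers:core-01', 'foreign_source': 'Routers', 'tenant': 'Routers'}
```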

Metadata passthrough

OpenNMS metadata is opt-in only. The built-in denylist always applies.

Key Default Purpose

metadata.enabled

false

Master switch.

metadata.include

(empty)

Glob list of <context>:<key> patterns to surface.

metadata.exclude

(empty)

Glob list to subtract from metadata.include.

metadata.label-prefix

onms_meta_

Prefix applied to emitted metadata labels.

metadata.case

preserve

preserve / lower / upper.

Metadata is an open KV store; operators put credentials in there (requisition:snmp-community, jdbc:password, API tokens). The built-in denylist (password, secret, token, key, snmp-community) is always applied — even when metadata.include would match a denied key. Leave metadata disabled unless you have an explicit use case.
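A sketch of that gating logic (fnmatch-style globs and substring denylist matching are assumptions here — the plugin's actual matcher may differ):

```python
from fnmatch import fnmatch

DENYLIST = ("password", "secret", "token", "key", "snmp-community")

def metadata_allowed(context_key: str, include: list[str], exclude: list[str]) -> bool:
    """True when a '<context>:<key>' metadata entry may surface as a label.
    The built-in denylist wins over any include pattern."""
    key = context_key.split(":", 1)[-1].lower()
    if any(bad in key for bad in DENYLIST):
        return False  # denylist always applies, even if include matches
    if not any(fnmatch(context_key, pat) for pat in include):
        return False
    return not any(fnmatch(context_key, pat) for pat in exclude)

print(metadata_allowed("requisition:snmp-community", ["requisition:*"], []))  # → False
print(metadata_allowed("requisition:site", ["requisition:*"], []))            # → True
```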

Write-Ahead Log (durable buffering)

Opt-in via wal.enabled=true. See Write-Ahead Log for the full operator model.

Key Default Purpose

wal.enabled

false

Opt-in.

wal.path

(empty → ${karaf.data}/prometheus-remote-writer/wal)

Set explicitly to redirect to a mounted volume.

wal.max-size-bytes

536870912 (512 MB)

Total disk-footprint cap.

wal.segment-size-bytes

67108864 (64 MB)

Per-segment rotation threshold.

wal.fsync

batch

always / batch / never. See the wal section.

wal.overflow

backpressure

backpressure / drop-oldest.

Worked example — fully configured

# Endpoint
write.url                     = https://mimir.example.com/api/v1/push
read.url                      = https://mimir.example.com/prometheus

# Source identity
instance.id                   = opennms-us-east
# job.name unset — job is derived per-sample from resourceId shape

# Authentication
auth.bearer.token             = ${env:MIMIR_TOKEN}
tenant.org-id                 = fleet-prod

# Wire format
wire.protocol-version         = 2

# Write pipeline (in-memory, used because wal.enabled=false)
queue.capacity                = 50000
batch.size                    = 2000
flush.interval-ms             = 1000

# HTTP
http.max-connections          = 32

# Label policy
labels.include                = sysDescription, assetRegion
labels.copy                   = foreign_source -> tenant
metadata.enabled              = false

The full set of normative scenarios lives in openspec/specs/tss-plugin/spec.md in the source tree — that file is the authoritative source of truth for parser behavior, defaults, and edge cases.

Wire protocols (v1 and v2)

The plugin supports both Prometheus Remote Write protocol versions. Operator-selectable per deployment via wire.protocol-version.

Selection

Value Wire format Headers Backend requirement

1 (default)

Snappy-compressed prometheus.WriteRequest

Content-Type: application/x-protobuf
X-Prometheus-Remote-Write-Version: 0.1.0
Content-Encoding: snappy

Any backend that accepts Remote Write v1 — Prometheus, Mimir, VictoriaMetrics, Cortex, Thanos Receive.

2

Snappy-compressed io.prometheus.write.v2.Request (string interning)

Content-Type: application/x-protobuf;proto=io.prometheus.write.v2.Request
X-Prometheus-Remote-Write-Version: 2.0.0
Content-Encoding: snappy

Prometheus 3.0+ recommended (default-enabled receiver). Earlier versions: 2.55+ stable but receiver must be enabled with --web.enable-remote-write-receiver; 2.50–2.54 ship an experimental v2 receiver that silently drops v2 payloads under documented edge cases. Mimir 2.10+, VictoriaMetrics with v2 ingest enabled, Grafana Cloud, or equivalent are also supported.

When to flip to v2

  • Forward capacity. Native histograms, exemplars, per-series metadata, and created-timestamps are first-class in the v2 schema. The plugin does not populate them today (OpenNMS doesn’t produce them), but enabling v2 unblocks future features without another wire-format pivot.

  • Wire bandwidth. v2’s string interning eliminates per-sample repetition of label names and values. For typical OpenNMS batches (every series carries the same dozen-or-so default labels: name, node, job, instance, …​), this is a real pre-snappy reduction; the magnitude depends on batch size and label-sharing. After snappy the savings are smaller — measure your own deployment before flipping for bandwidth reasons alone.
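The interning idea can be illustrated with a toy symbol table (this is not the actual v2 protobuf layout, just the shape of the saving):

```python
def intern_labels(series_labels: list[dict]):
    """Toy version of Remote Write v2 interning: every distinct label string is
    stored once in a symbol table; each series references strings by index."""
    symbols: list = []
    index: dict = {}

    def ref(s: str) -> int:
        if s not in index:
            index[s] = len(symbols)
            symbols.append(s)
        return index[s]

    refs = [[(ref(k), ref(v)) for k, v in labels.items()] for labels in series_labels]
    return symbols, refs

# Two series sharing node/job labels: shared strings are stored only once.
symbols, refs = intern_labels([
    {"__name__": "ifHCInOctets", "node": "Routers:core-01", "job": "snmp"},
    {"__name__": "ifHCOutOctets", "node": "Routers:core-01", "job": "snmp"},
])
print(len(symbols))  # → 7  (not 12: five of the second series' strings are shared)
```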

When to leave it on v1

  • Backend compatibility is uncertain or includes older Prometheus / Mimir / VictoriaMetrics versions.

  • Wire bandwidth isn’t a constraint for your deployment volume.

Operational notes

Prometheus 2.50–2.54 silently drop v2 payloads

The receiver was experimental in that range — payloads can return 2xx (or 204) yet the samples never appear via /api/v1/series or /api/v1/query. This is the worst possible failure mode: the plugin’s samples_written_total ticks up but the backend has nothing to serve.

Pin Prometheus 3.0+ for v2. The receiver is default-enabled and stable in 3.0+. 2.55+ works if you enable --web.enable-remote-write-receiver explicitly; older releases should stay on wire.protocol-version=1.

No auto-fallback

If wire.protocol-version=2 is set but the backend is v1-only, the backend returns 4xx and the batch is dropped permanently (matches the existing 4xx semantics; visible via samples_dropped_4xx_total). Verify backend compatibility before flipping. The plugin emits a one-shot startup WARN naming the supported backend versions when wire.protocol-version=2 is set.

WAL is wire-version-agnostic

The on-disk WAL stores MappedSample (pre-wire), so flipping wire.protocol-version while the WAL holds pending samples is safe — the next flush emits according to the new version, no WAL migration needed.

No effect on the read path

The plugin’s read side (findMetrics, getTimeSeriesData) uses the standard Prometheus HTTP query API, which is independent of the remote-write version.

What v2 does NOT add (yet)

  • Native histograms — OpenNMS doesn’t produce them.

  • Exemplars — no trace-ID source on the OpenNMS side.

  • Per-series metadata — no help/unit source today.

  • Created-timestamp counter-reset hint.

The v2 wire layer in this release leaves these fields empty. A future change can populate any of them without breaking the wire layer.

Write-Ahead Log

Opt-in via wal.enabled=true. When enabled, every store() sample is appended to an on-disk Write-Ahead Log before the call returns, and the WAL replaces the in-memory queue.capacity buffer as source of truth.

What the WAL gives you

  • Restart preservation. Samples queued before a graceful shutdown replay to the endpoint on the next process start. Under the default batch fsync, a kill -9 may lose the last fsync window’s worth of samples; everything before that is durable.

  • Extended outage buffering. 5xx and transport failures never advance the WAL checkpoint. The plugin retries from the same offset on every flush cycle for as long as the endpoint stays down, up to wal.max-size-bytes total footprint.

Configuration knobs (recap)

Key Default Purpose

wal.enabled

false

Opt-in. When false, behavior matches the in-memory queue.capacity path exactly.

wal.path

""

Empty resolves to ${karaf.data}/prometheus-remote-writer/wal. Set explicitly to redirect to a mounted volume or faster disk.

wal.max-size-bytes

536870912 (512 MB)

Total disk-footprint cap. Overflow policy fires when reached.

wal.segment-size-bytes

67108864 (64 MB)

Per-segment rotation threshold. Must be ≤ wal.max-size-bytes.

wal.fsync

batch

always = fsync every append (tightest RPO); batch = fsync at each flush.interval-ms boundary (loses at most ~1 s on kill -9); never = OS page cache only (ephemeral deployments).

wal.overflow

backpressure

backpressure = store() throws StorageException when cap reached; drop-oldest = evict the oldest segment to make room for new samples.

Knobs that change meaning when wal.enabled=true
  • queue.capacity is ignored — a WARN is logged at startup if explicitly set.

  • shutdown.grace-period-ms no longer bounds a drain window — the WAL is durable, so unflushed samples persist across restart. Instead the knob bounds the in-flight HTTP request wait at shutdown.

How it works, briefly

  store(samples)
      │
      ▼
  LabelMapper.map()
      │
      ▼
  WAL.append(MappedSample)   ◀── durable on disk
      │
      ▼
  WalFlusher.pollBatch()  ──▶  HTTP POST  ──2xx──▶  Checkpoint.advance(offset)
      │                                                 │
      │ 4xx: advance checkpoint (matches pre-WAL drop)   ▼
      │                                                 segments past the
      │ 5xx exhausted / transport: leave checkpoint;    checkpoint become
      │ re-read same batch next cycle                   eligible for deletion

Segment files are named by start offset (00000000000000000000.seg, 00000000000067108864.seg, …). Each segment has a companion .idx jsonl summary. checkpoint.json at the WAL root tracks the last offset confirmed shipped; it is written atomically (tmp + fsync + rename) on every advance.
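The naming and atomic-advance steps above can be sketched like this (illustrative, not the plugin's source):

```python
import json
import os
import tempfile

def segment_name(start_offset: int) -> str:
    """Segment files are named by 20-digit zero-padded start offset."""
    return f"{start_offset:020d}.seg"  # e.g. 00000000000067108864.seg

def advance_checkpoint(wal_dir: str, last_sent_offset: int) -> None:
    """Atomically persist checkpoint.json: write tmp, fsync, rename.
    A crash at any point leaves either the old or the new file, never a torn one."""
    fd, tmp = tempfile.mkstemp(dir=wal_dir, prefix="checkpoint.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"last_sent_offset": last_sent_offset}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, os.path.join(wal_dir, "checkpoint.json"))  # atomic rename
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)
```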

Recovery on startup

When wal.enabled=true, the plugin scans wal.path at startup and replays any samples whose offset is greater than or equal to checkpoint.json.last_sent_offset. Recovery tolerates torn tails — an incomplete frame at the end of the most recent segment, typical of a process killed mid-append — by logging a WARN, truncating the segment to the last good frame, and resuming.
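Torn-tail detection can be sketched on a length-prefixed segment (the real frame format is not specified here; a 4-byte big-endian length prefix is assumed purely for illustration):

```python
import struct

def last_good_frame_end(segment: bytes) -> int:
    """Scan length-prefixed frames; return the byte offset just past the last
    complete frame. Everything beyond it is a torn tail to truncate."""
    pos = 0
    while pos + 4 <= len(segment):
        (frame_len,) = struct.unpack_from(">I", segment, pos)
        if pos + 4 + frame_len > len(segment):
            break  # incomplete frame: process was killed mid-append
        pos += 4 + frame_len
    return pos

good = struct.pack(">I", 3) + b"abc" + struct.pack(">I", 2) + b"xy"
torn = good + struct.pack(">I", 100) + b"only-part"
print(last_good_frame_end(torn))  # → 13  (truncate here, drop the torn tail)
```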

If any recovery step fails (unwritable path, corrupt checkpoint.json, unreadable segment), the plugin refuses to start with an actionable error message naming the file and the failure reason. Operators reset by removing the WAL directory while the plugin is stopped:

# while the plugin is stopped
rm -rf ${wal.path}

The plugin recreates the directory on next start.

Wire-version interaction

The WAL stores MappedSample (pre-wire). Flipping wire.protocol-version while the WAL holds pending samples is safe — the next flush emits in the new version with no WAL migration step. See Wire protocols (v1 and v2).

Operational guidance

Choice When

wal.fsync=always

You can justify the ~10× throughput hit for tighter-than-1s RPO. Rare.

wal.fsync=batch (default)

The right choice for almost everyone. Loses at most one flush.interval-ms window on kill -9 / power loss; matches typical Remote Write RPO expectations.

wal.fsync=never

Ephemeral deployments that accept losing in-flight data on kernel panic / power loss.

wal.overflow=backpressure (default)

Alerting-driven pipelines that want operators to see the failure when the backend is unreachable longer than the WAL cap allows.

wal.overflow=drop-oldest

Dashboards showing "the last N hours" — recency over history; tolerates silent eviction of the oldest samples better than a store() exception.

Containerised Karaf

If ${karaf.data} is ephemeral (e.g., a container with no mounted volume), the default wal.path evaporates across restart and the WAL is worse than useless. Set wal.path explicitly to a mounted volume.

When to leave the WAL off

  • Single-node OpenNMS with a highly-available local backend on the same box: disk I/O for samples that will definitely deliver is overhead.

  • Short uptime / test environments where restart preservation is not a requirement.

The default stays wal.enabled=false. A later release may flip the default once the feature has soaked in real deployments.

Self-metrics

The WAL adds these counters and gauges (visible via opennms:prometheus-writer-stats):

  • wal_replay_samples_total

  • samples_dropped_wal_full_total

  • wal_batches_dropped_4xx_total

  • wal_segments (gauge)

  • wal_disk_bytes (gauge)

The general write counters (samples_written_total, samples_dropped_4xx_total, samples_dropped_5xx_total, samples_dropped_transport_total) are unchanged and continue to apply.

Labels and enrichment

Default label set

For every sample, the plugin emits the following Prometheus labels when the corresponding source data is available:

Label Source Notes

onms_instance_id

config instance.id

Only emitted when instance.id is set.

job

derived from resourceId shape, or config job.name override

"snmp" for SNMP-collected data, "jmx" for slash-FS jmx-* / opennms-jvm groups, "opennms" catch-all for unparseable shapes.

name

intrinsic name

Sanitized to Prom’s metric-name grammar.

resourceId

intrinsic resourceId

Raw, lossless.

node

derived

"<foreignSource>:<foreignId>" when both are set; numeric dbId otherwise.

instance

same value as node

Prom-idiomatic subject-identity label for mixed-backend filtering. Emitted iff node is emitted.

node_label

external nodeLabel

Human-readable. Mutable — disable via config if churn is a concern.

foreign_source

external

Stable.

foreign_id

external

Stable.

location

external

OpenNMS monitoring location.

resource_type

parsed from resourceId

e.g. interfaceSnmp, hrStorageIndex, nodeSnmp.

resource_instance

parsed from resourceId

e.g. en0, 1, or empty for node-level.

if_name

external ifName

if_descr

external ifDescr

if_speed

derived

Bits-per-second: ifHighSpeed × 1_000_000 when non-zero, else ifSpeed.

onms_cat_<name>

surveillance categories

One label per category, value "true".

mtype

meta tag MetaTagNames.mtype

Metric type (gauge, counter, count, rate, timestamp). Set by the OpenNMS writer on every Sample. Load-bearing for graph rendering — see mtype round-trip and the read-time fallback.

onms_attr_<key>

plain-key Sample meta tags

MATE-scope tags (${node:…} / ${asset:…} / etc.) and any other plain-key meta tag. One label per plain-key meta tag — see Resource string attributes (onms_attr_* and onms_extattr_*).

onms_extattr_<key>

plain-key Sample external tags

Resource string attributes attached by collectors (JMX bean Name properties, JDBC datname, …) — the values that ${name} / ${datname} / ${spcname} placeholders resolve against. One label per plain-key external tag — see Resource string attributes (onms_attr_* and onms_extattr_*).

Deliberately excluded by default — available via labels.include if you want them: if_alias (user-editable, churns), sys_descr, sys_object_id, asset-record fields, OpenNMS metadata (see Configuration reference for the metadata gating rules).
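The node and if_speed derivations from the table above can be sketched as follows (illustrative):

```python
def derive_node(foreign_source, foreign_id, db_id: int) -> str:
    """'<foreignSource>:<foreignId>' when both are set; numeric dbId otherwise."""
    if foreign_source and foreign_id:
        return f"{foreign_source}:{foreign_id}"
    return str(db_id)

def derive_if_speed(if_high_speed: int, if_speed: int) -> int:
    """Bits per second: ifHighSpeed (Mb/s) x 1_000_000 when non-zero, else ifSpeed."""
    return if_high_speed * 1_000_000 if if_high_speed else if_speed

print(derive_node("Routers", "core-01", 42))  # → Routers:core-01
print(derive_node(None, None, 42))            # → 42
print(derive_if_speed(10_000, 0))             # → 10000000000  (10 Gb/s)
```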

mtype round-trip and the read-time fallback

OpenNMS’s read-side graph renderer (NewtsConverterUtils.dataPointToRow) unconditionally dereferences MetaTagNames.mtype on every Metric the plugin returns. A Metric reaching that code without an mtype meta tag trips a NullPointerException and the graph fetch returns HTTP 500.

To keep the round-trip working, the plugin:

  • Emits mtype as a default label on write — sourced from the Sample’s MetaTagNames.mtype meta tag (the OpenNMS writer sets it on every Sample). Reserved against labels.rename collisions like the rest of the default allowlist.

  • Synthesizes mtype="gauge" on read when a Prometheus response for a series lacks the label. This covers data already on disk from before the fix landed. Counter metrics in legacy data render as cumulative values rather than rates — visibly less informative but never wrong; new writes preserve the original mtype, so post-fix counter rendering is correct.

Operators who explicitly exclude mtype via labels.exclude = mtype will break graph rendering for new writes — the synthesis fallback still recovers those reads, but counter graphs degrade to gauges. The exclude path is intended only for non-OpenNMS consumers of the same Prometheus stack.

The samples_synthesized_mtype_total counter (visible via opennms:prometheus-writer-stats) tracks every fallback synthesis. The counter ticks once per Metric reconstruction — per matched series in findMetrics, once per fetch in getTimeSeriesData — not once per Sample. Watch it climb until Prometheus retention has aged out the pre-fix data, then drop to flat: at that point every rendered graph is using authentic mtype values from the writer.

If the counter rises indefinitely instead of plateauing, the most likely cause is labels.exclude = mtype in your config — that’s a supported operator override (intended for non-OpenNMS consumers of the same Prometheus stack), but it means the read path falls back to synthesis on every fetch indefinitely. Either remove the exclude rule or treat the rising counter as expected for your deployment.

Resource string attributes (onms_attr_* and onms_extattr_*)

OpenNMS resource-graph templates substitute shell-style placeholders like ${name}, ${datname}, and ${spcname} against string attributes attached to a resource. On the integration-API write path, those attributes arrive partitioned on the Metric: meta tags carry MATE-scope values, external tags carry collector-emitted resource properties. The motivating case for the round-trip is that the resource string attribute named name (the Eventd Processing Stats row, the JDBC datasource label, etc.) collides with the intrinsic name (metric-name) tag the plugin emits as name — and OpenNMS core’s TimeseriesResourceStorageDao.getStringAttributes() reads only from Metric.getExternalTags() for placeholder substitution, so partition fidelity is required end-to-end.

The plugin makes the round-trip work via two reserved label prefixes, one per partition:

  • onms_attr_<key> carries the META partition (MATE-scope tags, mtype aside). Read side strips and deposits on Metric.getMetaTags().

  • onms_extattr_<key> carries the EXTERNAL partition (collector-emitted resource string attributes — the values placeholder substitution actually reads). Read side strips and deposits on Metric.getExternalTags().

Concretely:

  • Write — for each non-intrinsic partition, a Sample tag is emitted as <prefix><sanitized_key>=<sanitized_value> when its key: is non-empty; contains no : (context tags use onms_meta_ instead — they’re owned by the metadata processor regardless of partition); is not mtype; is not blocked by the built-in plain-key secret denylist (password, secret, token, snmp-community, all case-insensitive); and is not already represented under a canonical default-label name (this last check applies to the external-partition pass — the meta pass uses an empty consumed-keys set to preserve v0.4.0 behavior). The walks read the partition lists directly off the source Metric, bypassing the intrinsic-wins shadow merge that otherwise drops collisions.

    The plain-key denylist is deliberately narrower than the context-tag form (whose built-in denylist also includes key). Resource string attributes commonly shaped like primary_key, partition_key, or foreign_key are exactly the attributes that resource graphs substitute via ${…} placeholders, so the plain-key path lets them through. Only credential-shaped names (password / secret / token / snmp-community) are blocked from the onms_attr_ and onms_extattr_ namespaces.

  • Read — labels matching onms_attr_<key> reconstruct as a meta tag with key <key>; labels matching onms_extattr_<key> reconstruct as an external tag with key <key>. The prefixed forms are not also surfaced under their raw names — single source of truth, per partition.

Sanitization is one-way for non-identifier source keys

The plugin sanitizes the meta-tag key into the Prometheus label-name grammar ([a-zA-Z_][a-zA-Z0-9_]*) before applying the prefix. For identifier-shaped keys like name, datname, spcname (the standard OpenNMS placeholder set), sanitization is a no-op and the key round-trips identically. A meta tag named rack-unit ends up as onms_attr_rack_unit on the wire and reconstructs as a meta tag with key rack_unit (not rack-unit). Adjust your placeholder references accordingly if you use non-identifier attribute names.
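A sketch of that one-way sanitization (the regex follows the grammar above; the leading-digit handling is an assumption — the plugin's exact rule for digit-initial keys is not specified here):

```python
import re

def sanitize_label_key(key: str) -> str:
    """Map an arbitrary tag key into [a-zA-Z_][a-zA-Z0-9_]*: invalid characters
    become '_', and a leading digit gets an underscore prefix (assumed)."""
    out = re.sub(r"[^a-zA-Z0-9_]", "_", key)
    if out and out[0].isdigit():
        out = "_" + out
    return out

print(sanitize_label_key("datname"))    # → datname   (identifier-shaped: no-op)
print(sanitize_label_key("rack-unit"))  # → rack_unit (one-way: the '-' is gone)
```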

No retroactive synthesis

Unlike the mtype round-trip, there is no read-side fallback for samples written before this fix. Pre-fix data continues to render literal ${name} (or empty) in resource-graph row labels until it ages out of the backend’s retention window. Operators with long retention can either re-collect or accept that legacy data graphs without resolved placeholders.

Operators who want to opt out of either namespace altogether (cost-sensitive backends, no resource-graph use case) can set labels.exclude = onms_attr_* and / or labels.exclude = onms_extattr_*. Excluding onms_extattr_* reverts resource-graph placeholder substitution to literal placeholders; excluding onms_attr_* mostly just drops MATE-scope label duplication. Other graphs are unaffected.

Label enrichment is two-sided

The default label allowlist is the write-side policy: it decides which OpenNMS-attached tags become Prometheus labels. The read-side — OpenNMS attaching those tags to samples in the first place — lives in OpenNMS itself, and it is off by default.

OpenNMS metatags config ─▶ MATE interpolation ─▶ sample tags ─▶ plugin label mapping ─▶ Prometheus labels
   (read-side, you)           (OpenNMS core)       (per Sample)      (this plugin)          (on the wire)

OpenNMS’s MetaTagDataLoader runs each configured value template through the MATE interpolator against the sample’s scope and attaches a tag for every non-empty result. The property names below and the MATE scope syntax (${node:…}, ${interface:…}, ${service:…}, ${asset:…}) are as of OpenNMS Horizon 35; upstream changes may rename properties in future Horizon releases. If you don’t configure any org.opennms.timeseries.tin.metatags.tag.* properties, samples arrive at this plugin carrying only name and resourceId — and your Prometheus series will have bare {resourceId="…"} labels regardless of what this plugin is configured to emit.

Minimal metatag config

Put these four lines in etc/opennms.properties.d/metatags.properties to enable node identity labels:

org.opennms.timeseries.tin.metatags.tag.nodeLabel     = ${node:label}
org.opennms.timeseries.tin.metatags.tag.foreignSource = ${node:foreign-source}
org.opennms.timeseries.tin.metatags.tag.foreignId     = ${node:foreign-id}
org.opennms.timeseries.tin.metatags.tag.location      = ${node:location}

After OpenNMS reloads, samples carry those four tags. The plugin’s default allowlist maps them to node_label, foreign_source, foreign_id, and location, and derives node="<fs>:<fid>" from the pair. For interface descriptors:

org.opennms.timeseries.tin.metatags.tag.ifName  = ${interface:if-name}
org.opennms.timeseries.tin.metatags.tag.ifDescr = ${interface:if-description}

Surveillance categories

Categories are a separate opt-in on the OpenNMS side:

org.opennms.timeseries.tin.metatags.exposeCategories = true

Setting this causes OpenNMS to attach a categories sample tag (comma-separated list of surveillance-category names). The plugin’s default allowlist already expands that single tag into one onms_cat_<sanitized-name> label per value — no additional plugin config needed.

The node record must exist in OpenNMS

MetaTagDataLoader resolves node-scope properties by looking up the node via its foreign-source/foreign-id pair or its numeric dbId. If the node record doesn’t exist — typically a requisition that was deleted, or an fs:fid that never got imported — no interpolation happens and the sample arrives with only name and resourceId. You’ll see empty enrichment labels, not garbage labels: verify the node exists in OpenNMS before debugging the plugin.

Identifying samples from multiple OpenNMS instances

Running more than one OpenNMS instance against the same Prometheus-compatible backend? Two independent knobs exist, and they solve different problems:

instance.id (label onms_instance_id)
    Stamps every sample with a stable per-instance identifier. PromQL can
    filter ({onms_instance_id="…"}) or aggregate (sum by (onms_instance_id) (…))
    across all instances in the shared backend. Works with every
    Prometheus-compatible backend.

tenant.org-id (header X-Scope-OrgID)
    Partitions storage at the backend tier — each tenant’s data is isolated,
    queried separately. Works with Mimir, Cortex, VictoriaMetrics cluster, and
    Thanos Receive. No-op against plain Prometheus and single-tenant
    VictoriaMetrics.

When to use which

Deployment                                                                    instance.id    tenant.org-id
Single OpenNMS → dedicated backend                                            not required   not required
Multiple OpenNMS → shared Prometheus / single-tenant VictoriaMetrics          required       n/a (no-op)
Multiple OpenNMS → Mimir / Cortex / VM cluster, fleet-wide queries            required       optional
Multiple OpenNMS → Mimir / Cortex / VM cluster, strict per-instance isolation optional       required

If you want both fleet-wide PromQL and backend-enforced isolation, set both.

Example

Two OpenNMS instances writing to the same Mimir cluster:

# opennms.properties.d on instance #1
instance.id    = opennms-us-east
tenant.org-id  = fleet-prod

# opennms.properties.d on instance #2
instance.id    = opennms-us-west
tenant.org-id  = fleet-prod

PromQL:

# All nodes, either OpenNMS
up{job="opennms"}

# Per-OpenNMS rollup
sum by (onms_instance_id) (rate(ifHCInOctets[5m]))

# Just the west instance
ifHCInOctets{onms_instance_id="opennms-us-west"}

Label pipeline — rename vs. copy vs. exclude

Both labels.rename and labels.copy produce a label under a new name, but the mental model differs:

  • labels.rename changes a label’s name. The original disappears.

  • labels.copy adds a second name for a label. Both names remain present with the same value.

They run in a fixed pipeline:

defaults  →  exclude  →  include  →  copy  →  rename  →  metadata

Copy is one-pass (sees labels that exist at its stage entry; does not recurse) and operates on pre-rename names. Reserved-target rules apply symmetrically to both: a target name that collides with a default label name, a reserved prefix (onms_cat_*, onms_meta_*), another rename target, or another copy target is rejected at startup with an actionable error.
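The ordering guarantee can be illustrated with a toy model of the copy and rename stages (assumed semantics for illustration, not the plugin's code): copy runs first and leaves the original in place; rename removes it.

```python
# Toy model of the copy → rename stages described above.
# Directives are (from, to) pairs.
def apply_pipeline(labels, copies, renames):
    out = dict(labels)
    for src, dst in copies:           # copy: both names remain afterwards
        if src in out:
            out[dst] = out[src]
    for src, dst in renames:          # rename: the original name disappears
        if src in out:
            out[dst] = out.pop(src)
    return out

# labels.copy = foreign_source -> tenant
print(apply_pipeline({"foreign_source": "NOC"},
                     copies=[("foreign_source", "tenant")], renames=[]))
# labels.rename = foreign_source -> tenant
print(apply_pipeline({"foreign_source": "NOC"},
                     copies=[], renames=[("foreign_source", "tenant")]))
```

The first call keeps both names with the same value; the second leaves only tenant.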

Common labels.copy recipes

# Multi-tenant Mimir — emit `tenant` as a copy of `foreign_source` so
# per-requisition dashboards and the backend's tenant-id convention
# both key off the same value.
labels.copy = foreign_source -> tenant

# Migration-period dual emission — when changing a label name, copy the
# old name onto the new one for a release cycle so dashboards and alert
# rules can migrate gradually. Drop the copy once the rename lands.
labels.copy = node -> old_node_id
labels.copy = node -> instance is now redundant

Pre-0.2 deployments often copied node onto instance. Since 0.2.0 the plugin emits instance as a default (mirror of node), so the directive is redundant — and would be rejected at startup because instance is now a reserved target. If you don’t want the instance label emitted, opt out with labels.exclude = instance.

If you want the value under a new name AND you want to drop the original, use labels.rename — it does both in one directive. A labels.copy source that doesn’t exist at copy time (typo, or a label the plugin never emits on this deployment) produces a single startup WARN naming the unknown source; it does not block startup.

Cross-source filtering with job and instance

Since 0.2.0 the plugin emits job and instance as default labels so dashboards that compose OpenNMS data with node-exporter, OTel, or other Prometheus data sources in the same backend can use the standard idiom:

# All OpenNMS-SNMP interface traffic for a specific node
{job="snmp", instance="NOC:router-42", __name__="ifHCInOctets"}

# Scope across data sources: everything about a host
{instance=~"NOC:router-42|10.0.0.1:9100"}

# Or by data source type
{job=~"snmp|node-exporter"}

The plugin’s instance value carries the OpenNMS-managed device identity (<foreignSource>:<foreignId> when requisitioned, or the numeric dbId), whereas node-exporter emits instance="<host>:<port>" — same label name, different value shapes for the same physical device. Cross-source value correlation (same label value across sources for the same device) requires backend relabel_config; the shared label name alone doesn’t bridge value shapes. job is the primary cross-source scoping filter.

The job value is derived from each sample’s resourceId pattern:

  • bracketed and slash-path SNMP-originated data → snmp

  • snmp/fs/…/jmx-* or opennms-jvm groups (prefix match on the literal jmx-, not a shell glob) → jmx

  • unparseable shapes → opennms catch-all

Set job.name = <constant> in the cfg to override the derivation with a fleet-wide constant value (useful when you want every sample from one plugin instance under the same job, e.g., job.name = opennms-prod).
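A rough sketch of that derivation (the resourceId string shapes and match order here are assumptions for illustration; the plugin's actual grammars are more precise):

```python
# Illustrative job derivation from a resourceId string, per the rules above.
# The string shapes checked here are assumptions, not the plugin's parser.
def derive_job(resource_id: str) -> str:
    if "jmx-" in resource_id or "opennms-jvm" in resource_id:
        return "jmx"       # literal jmx- prefix match per the rule above
    if resource_id.startswith(("node[", "snmp/")):
        return "snmp"      # bracketed and slash-path SNMP shapes
    return "opennms"       # catch-all for unparseable shapes

print(derive_job("snmp/fs/NOC/router-42/mib2-interfaces"))  # snmp
print(derive_job("snmp/fs/NOC/jvm/jmx-kafka"))              # jmx
print(derive_job("something:unexpected"))                   # opennms
```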

The opennms catch-all

Samples whose resourceId isn’t recognised by any of the three grammars fall through to job="opennms" as the default. Samples from distinct upstream sources that share the same metric name and land in this catch-all will collide into a single time series (same {name, job="opennms"} identity, no distinguishing labels). If you see an unexpectedly high proportion of samples with job="opennms" in your backend, treat it as a signal that the parser is missing a real-world resourceId shape — open an issue with the offending string. The samples_unparseable_resource_id_total counter tracks this rate.
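A quick PromQL sketch for watching the catch-all share directly in the backend (assuming the default job derivation is in effect):

```
# Fraction of series currently carrying the catch-all job label
count({job="opennms"}) / count({job=~".+"})
```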

Reserved rename / copy targets

The plugin rejects labels.rename and labels.copy entries whose target would silently clobber an already-emitted label. Reserved targets:

Kind      Value                               Why
Exact     name                                Prometheus metric name.
Exact     resourceId                          OpenNMS resource identifier (raw, lossless).
Exact     node                                Derived FS-qualified or numeric node id.
Exact     foreign_source, foreign_id          Requisition identity.
Exact     node_label                          Node’s human-readable name.
Exact     location                            OpenNMS monitoring location.
Exact     resource_type, resource_instance    Parsed from resourceId.
Exact     if_name, if_descr, if_speed         SNMP interface descriptors.
Exact     job, instance                       Default labels (since 0.2.0).
Exact     onms_instance_id                    Multi-instance origin stamp (reserved even when instance.id is unset).
Prefix    onms_cat_*                          Per-surveillance-category expansion.
Prefix    onms_meta_*                         Default metadata-passthrough prefix.
Prefix    onms_attr_*                         Resource string attributes, meta partition (see Resource string attributes (onms_attr_* and onms_extattr_*)).
Prefix    onms_extattr_*                      Resource string attributes, external partition (see the same section).

Duplicate rename targets (foo → cluster, bar → cluster) and duplicate from keys (a → cluster, a → tenant) are also rejected. When multiple rename or copy entries have errors, the plugin reports all of them in one startup error so you fix once and restart once.

onms_meta_* reservation covers only the default prefix

The onms_meta_* prefix reservation covers only the plugin’s default metadata prefix. If you set metadata.label-prefix to something other than onms_meta_, rename / copy targets that collide with the customized prefix are not caught by startup validation; pick non-colliding targets by inspection. The default prefix is what the vast majority of deployments use.

metadata.label-prefix itself is now collision-checked

While the onms_meta_* reservation is conditional on the default prefix, the operator-supplied value of metadata.label-prefix is checked at startup against every other reserved prefix. A metadata.label-prefix set to onms_attr_, onms_extattr_, onms_cat_, any case variant (e.g. ONMS_ATTR_), or a shorter prefix that subsumes one of them (e.g. bare onms_) is rejected with an actionable message. The default onms_meta_ and any unrelated value (e.g. custom_) are accepted. This stops an operator from silently emitting metadata into another emitter’s namespace and breaking the read-side reconstruction.
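The check reduces to a case-insensitive two-way prefix comparison; a sketch of the assumed logic:

```python
# Assumed model of the metadata.label-prefix startup validation described
# above: reject any value that matches a reserved prefix case-insensitively,
# or that subsumes (or is subsumed by) one of the reserved prefixes.
RESERVED_PREFIXES = ("onms_attr_", "onms_extattr_", "onms_cat_")

def prefix_collides(prefix: str) -> bool:
    p = prefix.lower()
    return any(p.startswith(r) or r.startswith(p) for r in RESERVED_PREFIXES)

print(prefix_collides("ONMS_ATTR_"))  # True: case variant
print(prefix_collides("onms_"))       # True: subsumes all three
print(prefix_collides("onms_meta_"))  # False: the default is accepted
print(prefix_collides("custom_"))     # False: unrelated value
```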

Sanitization rules

The plugin sanitizes every metric name, label name, and label value to conform to the Prometheus text model before serialization:

  • Metric names: characters outside [a-zA-Z0-9_:] replaced with _; a leading digit replaced with _.

  • Label names: characters outside [a-zA-Z0-9_] replaced with _; a leading digit replaced with _.

  • Label values: truncated to the first 2048 bytes if longer.

  • NaN, +Infinity, -Infinity sample values are dropped before serialization (samples_dropped_nonfinite_total increments).

Backend compatibility

The plugin speaks Prometheus Remote Write v1 and v2; every backend listed below accepts at least one of the two. See Wire protocols (v1 and v2) for the wire-format selection knob and the Prometheus 2.50–2.54 caveat.

CI matrix

Reference versions in CI: Prometheus 2.53.2 for v1, Prometheus 3.0.1 for v2.

Prometheus
    Testcontainers integration test against both reference versions.
Grafana Mimir
    e2e smoke (make smoke BACKENDS=mimir); v2 verified with wire.protocol-version=2.
VictoriaMetrics
    e2e smoke (make smoke BACKENDS=victoriametrics); v2 verified with v2 ingest enabled.
Cortex, Thanos Receive, Grafana Cloud
    Compatible (v1 and v2) — not in CI.

read.url shapes per backend

The plugin appends /api/v1/series and /api/v1/query_range itself — do not include /api/v1 in the configured value.

Backend                   read.url
Prometheus                https://prom:9090
Grafana Mimir             https://mimir/prometheus
VictoriaMetrics           https://vm:8428
Cortex                    https://cortex/prometheus
Thanos Receive (Query)    https://thanos-query
Grafana Cloud             https://prometheus-prod-NN-region.grafana.net/api/prom

Tenancy notes

tenant.org-id sets the X-Scope-OrgID header. Behavior per backend:

Backend                    Honors X-Scope-OrgID?    Notes
Prometheus (vanilla)       No (no-op)               Single-tenant; partition with instance.id if multi-OpenNMS.
VictoriaMetrics single     No (no-op)               Single-tenant.
VictoriaMetrics cluster    Yes                      Routes to per-tenant ingester.
Grafana Mimir              Yes                      Required if Mimir is configured with auth.multitenancy_enabled=true.
Cortex                     Yes                      Same model as Mimir.
Thanos Receive             Yes                      Tenancy via the --receive.tenant-header arg (defaults to THANOS-TENANT — set Thanos to use X-Scope-OrgID to match).
Grafana Cloud              Yes                      The Cloud-issued credentials encode the tenant; tenant.org-id is typically not needed alongside Bearer auth.

See Labels and enrichment for the instance.id vs tenant.org-id decision matrix when running multiple OpenNMS instances against the same backend.

Out of scope (current release line)

  • Native histograms and exemplars — the v2 wire layer reserves the fields, but the plugin doesn’t populate them. OpenNMS doesn’t surface histogram data through the TSS SPI today, and there’s no trace-ID source for exemplars on the OpenNMS side. Out-of-scope, not blocked by the wire format.

  • Per-series metadata (help, unit, created_timestamp) — same reasoning as histograms: v2 reserves the fields; no source-side population today.

  • mTLS client certificates — Basic, Bearer, and tenant-id header cover the common deployment shapes. Client-cert auth is a candidate for a future release if demand materializes.

  • Per-tenant routing / multi-destination fan-out — one write.url and one tenant.org-id per plugin instance. For multi-destination, run multiple OpenNMS instances; an in-process fan-out remains a future-release candidate.

  • Migration tooling from opennms-cortex-tss-plugin — not in scope. Recommended migration shape: stand both plugins up, dual-write for a period, switch queries once the new labels are established, uninstall cortex-tss. No in-product tooling.

  • Per-series delete() — Prometheus Remote Write has no delete semantic. delete(Metric) is a no-op that logs a rate-limited WARN. Configure retention at the backend tier (Prometheus --storage.tsdb.retention, Mimir/VictoriaMetrics compactor).

  • Full OpenNMS TSS compliance-suite pass — the compliance suite’s shouldDeleteMetrics and whole-Metric partition-equality assertions conflict with this plugin’s design. PrometheusComplianceIT skips the conflicting tests with documented @Ignore reasons.

Operations

Self-metrics

The plugin exposes internal operational metrics via a Dropwizard registry and prints them through the Karaf shell command opennms:prometheus-writer-stats.

Throughput and drops

Counter                                  What it counts
samples_written_total                    Samples successfully delivered to the endpoint.
samples_dropped_4xx_total                Samples in batches the endpoint rejected with 4xx.
samples_dropped_5xx_total                Samples in batches that exhausted the 5xx retry budget.
samples_dropped_transport_total          IOException / socket failures (distinct from 5xx so you can alert separately).
samples_dropped_queue_full_total         Samples rejected because the in-memory queue (wal.enabled=false) was full.
samples_dropped_wal_full_total           Samples rejected (or evicted under drop-oldest) because the WAL hit wal.max-size-bytes.
samples_dropped_nonfinite_total          Samples whose value was NaN, +Inf, or -Inf.
samples_dropped_duplicate_total          Same-timestamp same-series dedup (last-write-wins).
samples_unparseable_resource_id_total    Samples whose resourceId matched none of the parser grammars (rising rate ⇒ open an issue).
samples_synthesized_mtype_total          Reads where the Prometheus response lacked an mtype label and the plugin
                                         synthesized mtype="gauge" to keep graphs renderable. Ticks once per Metric
                                         reconstruction — per matched series in findMetrics, once per fetch in
                                         getTimeSeriesData. See mtype round-trip and the read-time fallback. Should
                                         drop to flat once Prometheus retention has aged out pre-fix data, or rise
                                         indefinitely if labels.exclude = mtype is configured.

Pipeline state

Gauge             What it shows
queue_depth       Current in-memory queue occupancy (when wal.enabled=false).
wal_segments      WAL segment count on disk (when wal.enabled=true).
wal_disk_bytes    Total WAL footprint on disk.
http_in_flight    Running + queued HTTP requests at the dispatcher.

HTTP

Counter                         What it counts
http_bytes_written_total        Total payload bytes (post-snappy) sent.
http_writes_successful_total    2xx responses.
http_writes_failed_total        Non-2xx + transport failures.

Other

Counter                            What it counts
delete_noop_total                  delete(Metric) calls (Remote Write has no delete semantic).
metadata_denylist_blocked_total    Metadata keys blocked by the built-in credential denylist.
wal_replay_samples_total           Samples replayed from the WAL on startup recovery.

These are starting points — tune to your deployment’s noise floor.

Drop rate jumps
    rate(samples_dropped_5xx_total[5m]) + rate(samples_dropped_transport_total[5m]) > 0 for 10m
WAL filling up
    wal_disk_bytes / on() (wal_max_size_bytes_constant) > 0.8 for 15m
Queue full (no-WAL deployments)
    rate(samples_dropped_queue_full_total[5m]) > 0
job="opennms" cardinality exploded
    rate(samples_unparseable_resource_id_total[1h]) > 0 — parser regression or new resourceId shape
Endpoint persistently unhappy
    rate(http_writes_failed_total[5m]) / rate(http_writes_successful_total[5m]) > 0.1
TLS skip-verify left on in production
    log-based — the plugin emits a WARN every hour

Log levels

Set in etc/org.ops4j.pax.logging.cfg or via the Karaf shell (log:set DEBUG org.opennms.plugins.tss.prometheusremotewriter).

Level    What you see
INFO     Plugin lifecycle (start / stop), config-change diff, WAL startup recovery summary, wire.protocol-version=2 startup WARN message.
WARN     4xx response bodies (truncated), instance.id unset on startup, unknown labels.copy source, WAL torn-tail truncation, tls.insecure-skip-verify=true reminder.
DEBUG    Per-batch flush results, retry timing, label-pipeline diff for the first sample after a config reload. Verbose — use sparingly.

Capacity sizing

These are rough rules-of-thumb, not hard requirements. Measure your deployment.

< 5k samples/sec
    Defaults are plenty: queue.capacity = 10000, wal.max-size-bytes = 512 MB.
5k–25k samples/sec
    queue.capacity = 50000, batch.size = 2000, http.max-connections = 32.
    Expect backpressure if your network RTT is high.
25k+ samples/sec
    Enable the WAL with wal.max-size-bytes = 4 GB+. Single-process Karaf is
    the bottleneck before the wire; consider sharding by source.

WAL footprint scales with outage tolerance: at 10k samples/sec each sample averages roughly 30–80 bytes pre-snappy (the wire format is compressed; the WAL is not), which works out to roughly 0.3–0.8 MB/s of WAL growth. Plan for a few hundred MB of disk headroom if you want to survive a 10-minute outage.
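Spelled out, the arithmetic behind that rule of thumb (using the 30–80 bytes/sample figure from above):

```python
# Back-of-envelope WAL sizing: samples/sec x bytes/sample x outage seconds.
samples_per_sec = 10_000
outage_sec = 10 * 60                                     # a 10-minute outage
low_mb = samples_per_sec * 30 * outage_sec / 1e6
high_mb = samples_per_sec * 80 * outage_sec / 1e6
print(f"{low_mb:.0f}-{high_mb:.0f} MB of WAL growth")    # 180-480 MB
```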

Karaf shell commands

opennms:prometheus-writer-stats
    Print all counters and gauges.
bundle:list -s | grep prometheus
    Confirm the bundle is active and resolved.
config:list "(service.pid=org.opennms.plugins.tss.prometheus-remote-writer)"
    Inspect the current effective config (any value not in the cfg file falls through to default).

Self-monitoring with the plugin itself

The plugin’s metrics are exposed via Dropwizard, which means you can scrape the Karaf JVM with a JMX exporter and feed those self-metrics back into the same Prometheus backend the plugin writes to. That gives you one Grafana dashboard with both OpenNMS data and the plugin’s own operational health.

A future release may add a built-in HTTP scrape endpoint to remove the JMX-exporter step. Not on the v0.4 roadmap.

Troubleshooting

Plugin won’t start

Karaf shell shows the bundle as Failure or unresolved.

  1. Check the Karaf log: tail /opt/opennms/data/log/karaf.log (or wherever your distribution writes it). Configuration validators emit actionable error messages naming the offending key.

  2. Validators that fail at startup (refuse to start):

    labels.rename target 'X' collides with the default label 'X'
        Pick a non-reserved target — see Labels and enrichment.
    labels.copy target 'X' collides …
        Same as above; rules apply symmetrically to copy.
    wire.protocol-version=3 is not a valid value
        Set to 1 or 2.
    Both auth.basic.* and auth.bearer.token are configured
        They’re mutually exclusive — pick one.
    instance.id contains a control character / … exceeds 2048 bytes
        Validation rejects unprintable / oversized values.
    WAL path is not writable: …
        The Karaf user can’t write to wal.path. Check ownership / SELinux / mount options.
    Corrupt checkpoint.json: …
        Recover by stopping the plugin and removing the WAL directory.

  3. Check OSGi feature resolution with feature:list -i | grep prometheus in Karaf. Missing transitive features (older Karaf base) will appear as Unsatisfied.

Backend returns 4xx for every batch

The endpoint is rejecting the format. This is usually one of:

  • Wire-format mismatch: wire.protocol-version=2 against a v1-only or v2-but-buggy backend (the Prometheus 2.50–2.54 trap; see Wire protocols (v1 and v2)). Drop to v1 to confirm.

  • Auth — Bearer token expired, Basic auth incorrect, or tenant.org-id not whitelisted on the backend. The 4xx response body is logged at WARN — read it.

  • Sample rejection — Mimir / Cortex limit on series-per-tenant or label-name length. Look for out of order sample, max-series-per-user, label name too long in the logged response body.

samples_dropped_4xx_total increments by the rejected batch size each time. If you’re seeing a steady rate, the deployment isn’t recovering on its own — fix the underlying cause.

Backend returns 5xx persistently

The plugin retries 5xx responses with exponential backoff up to retry.max-attempts (default 5), then drops. The same batch is re-enqueued only when wal.enabled=true — without the WAL, exhausted 5xx batches are gone.

  • Endpoint overloaded — Mimir ingester hot-spots, Prometheus TSDB compaction storms. Inspect the backend’s own metrics first.

  • Network path saturated: samples_dropped_transport_total will rise alongside 5xx if upstream is the bottleneck. Confirm with iftop / nethogs on the OpenNMS host.

If the outage is short enough to fit in wal.max-size-bytes, enabling the WAL turns 5xx outages into delivery delays instead of drops. See Write-Ahead Log.

Series cardinality exploded

Three usual culprits:

  1. node_label churn — node renames create new series. Drop with labels.exclude = node_label if renames are routine.

  2. if_descr churn — vendor-generated; firmware upgrades change it. Same fix: labels.exclude = if_descr.

  3. Large onms_cat_* fan-out — nodes with many surveillance categories add one label each per series. If your cardinality budget is tight, consider whether all categories need to be promoted to labels.

labels.include = * is rarely correct in production — it surfaces every non-default source tag. The default allowlist is deliberately narrow.

job="opennms" proportion is high

Samples whose resourceId matches none of the parser grammars (bracketed, slash-FS, slash-DB) fall through to job="opennms". The samples_unparseable_resource_id_total counter tracks the rate.

A rising counter is a signal that:

  • a new OpenNMS collector is emitting a resourceId shape the parser doesn’t recognise yet, OR

  • a parser regression has shipped.

Either way, file an issue with example resourceId strings — the parser is the project’s responsibility, not the operator’s.

WAL directory is missing across restart

Symptom: wal.enabled=true, but every restart starts from an empty WAL and you’ve lost the durability guarantee.

Cause: containerised Karaf with no mounted volume. ${karaf.data} is ephemeral — the default wal.path evaporates on container restart.

Fix: set wal.path explicitly to a path that’s mounted from a persistent volume.

Plugin starts but no samples reach the backend

  1. Confirm OpenNMS sees the plugin as the active TSS:

    org.opennms.timeseries.strategy = integration

    in etc/opennms.properties.d/timeseries.properties. OpenNMS restart required for the strategy switch — confirm with the OpenNMS log.

  2. Confirm collectors are running and producing samples — karaf@root()> log:tail, look for collectd activity.

  3. Check samples_written_total and http_writes_successful_total from opennms:prometheus-writer-stats. If both are 0, samples aren’t reaching the plugin (OpenNMS-side issue). If samples_written_total rises but the backend has nothing — check wire.protocol-version, the Prometheus 2.50–2.54 trap (Wire protocols (v1 and v2)), and the samples_dropped_* counters.

TLS skip-verify left on by accident

The plugin emits a WARN on startup and every hour when tls.insecure-skip-verify=true. Search for that log line:

grep 'tls.insecure-skip-verify' /opt/opennms/data/log/karaf.log

Production deployments should always have a valid CA chain. Use tls.ca-file to point at a private bundle if your backend uses an internal CA.
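For example, in the plugin cfg (both keys appear in this guide; the bundle path is illustrative):

```
tls.insecure-skip-verify = false
tls.ca-file = /opt/opennms/etc/internal-ca.pem
```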

Where to ask

This is an incubation project — community channels first; there is no commercial support yet.

Include with any report:

  • Plugin version (opennms:prometheus-writer-stats prints it).

  • Backend type and version.

  • Relevant config (with secrets redacted).

  • The 4xx response body if applicable, or a short Karaf log excerpt for startup failures.

End-to-end sandbox

A self-contained Docker Compose stack for manually exercising OpenNMS Prometheus Remote Writer against real Prometheus-compatible backends. Lives under e2e/ in the repo and ships with a per-backend smoke harness wired into the project Makefile.

Use it to:

  • Try the plugin against a fresh Prometheus / Mimir / VictoriaMetrics before committing to a production install.

  • Iterate on plugin code locally with a real OpenNMS Horizon container in the loop.

  • Reproduce a backend-specific issue someone reported, with a known-good reference stack.

The sandbox is single-core by design — no Minion, no Sentinel, no ActiveMQ/Kafka, no TLS.

Layout

e2e/
├── compose.base.yml               # shared: postgres + core + grafana
├── compose.prometheus.yml         # extends base, adds prometheus + per-backend mounts
├── compose.mimir.yml              # extends base, adds mimir + per-backend mounts
├── compose.victoriametrics.yml    # extends base, adds vm + per-backend mounts
├── opennms/
│   ├── opennms.properties.d/
│   │   └── timeseries.properties  # activates TSS integration strategy
│   ├── prometheus.cfg             # plugin cfg for the prometheus backend
│   ├── mimir.cfg                  # plugin cfg for the mimir backend
│   └── victoriametrics.cfg        # plugin cfg for the vm backend
├── grafana/
│   └── datasources/               # one file per backend; the matching
│       ├── prometheus.yml         # compose.<backend>.yml mounts it to
│       ├── mimir.yml              # Grafana's provisioning dir
│       └── victoriametrics.yml
├── prometheus/
│   └── prometheus.yml             # minimal Prom config with remote-write receiver
└── mimir/
    └── mimir.yaml                 # single-binary Mimir config

Prerequisites

  • Docker 24+ with Compose v2.

  • Build the KAR first so the core container can pick it up from the mounted assembly/kar/target:

    make kar

Running

One compose file per backend — nothing else to match up. The base services (postgres, core, grafana) are defined once in compose.base.yml; each backend file extends: them and appends the backend-specific plugin cfg and Grafana datasource mounts.

# Prometheus
docker compose -f e2e/compose.prometheus.yml      up -d

# Grafana Mimir
docker compose -f e2e/compose.mimir.yml           up -d

# VictoriaMetrics
docker compose -f e2e/compose.victoriametrics.yml up -d

First boot of the core container can take several minutes while OpenNMS creates the database and loads features. Watch for:

Starting Karaf...

Endpoints

Service               URL / access                            Default credentials
OpenNMS Web UI        http://localhost:8980/opennms/          admin / admin
OpenNMS Karaf SSH     ssh -p 8101 admin@localhost             admin / admin
Grafana               http://localhost:3000/                  admin / admin (anonymous Viewer enabled)
Prometheus UI         http://localhost:9090/ (when active)
Mimir UI              http://localhost:9009/ (when active)    tenant e2e
VictoriaMetrics UI    http://localhost:8428/ (when active)

Grafana auto-provisions a datasource pointing at whichever backend is active (selected by the compose file you brought up). Open Explore → OpenNMS (<backend>) to run PromQL against the data OpenNMS just wrote.

Smoke test (automated)

The smoke harness lives entirely in the project Makefile. Per project convention, CI invokes make smoke directly.

make smoke                          # default backends: prometheus, mimir, victoriametrics
make smoke BACKENDS=prometheus      # single backend
make smoke BACKENDS="mimir victoriametrics"
make smoke-prometheus               # convenience wrapper, equivalent to BACKENDS=prometheus
make smoke SMOKE_TIMEOUT=300        # tighter deadline (default 600s per backend)
make smoke SMOKE_POLL=5             # tighter poll interval (default 15s)

Each backend is brought up, polled for > 0 ingested series, and torn down. The target depends on kar, so a fresh KAR is built first. Pass/fail summary is printed at the end; on a timeout, the last 40 lines of the relevant container’s karaf.log are dumped before teardown.

Verifying the plugin is active

# Karaf shell (default admin/admin)
ssh -p 8101 admin@localhost

Inside Karaf:

karaf@root()> feature:list | grep prometheus-remote-writer
karaf@root()> bundle:list | grep prometheus-remote-writer
karaf@root()> opennms:prometheus-writer-stats

For what opennms:prometheus-writer-stats reports, see Operations.

Querying the backend

Once OpenNMS has collected a few samples (default interval: 5 minutes on a fresh provisioning), query the backend directly.

Prometheus:

curl 'http://localhost:9090/api/v1/series?match%5B%5D={__name__=~".%2B"}' | jq .
curl 'http://localhost:9090/api/v1/query?query=up' | jq .

Mimir (requires the X-Scope-OrgID header; tenant is e2e per the cfg):

curl -H 'X-Scope-OrgID: e2e' \
  'http://localhost:9009/prometheus/api/v1/series?match%5B%5D={__name__=~".%2B"}' | jq .

VictoriaMetrics:

curl 'http://localhost:8428/api/v1/series?match%5B%5D={__name__=~".%2B"}' | jq .

Tear down

Use the same -f you brought the stack up with:

docker compose -f e2e/compose.prometheus.yml down -v --remove-orphans

-v removes the named data volumes (postgres, opennms, prometheus, mimir, vm). Drop -v if you want to keep state across restarts.

Iterating on the plugin

The assembly/kar/target directory is mounted read-only into /opt/opennms/deploy/. A rebuild of the KAR (make kar from the repo root) does not auto-reload the plugin — Karaf’s hot-deploy watches file timestamps, but the container sees the mount at a point in time. To reload a freshly built KAR:

# From the repo root
make kar

# Restart only the core container (use whichever compose file is active)
docker compose -f e2e/compose.prometheus.yml restart core

Or, inside the Karaf shell, feature:uninstall + feature:install cycles the plugin without restarting the container.

What’s NOT exercised here

  • Minion / remote pollers — this is a single-core sandbox.

  • ActiveMQ / Kafka messaging — not needed for the local TSS path.

  • TLS / auth to the backend — all cleartext on the compose network.

  • Multi-tenant routing beyond Mimir’s default e2e tenant.

  • Dashboards — Grafana is provisioned with a datasource only; build dashboards on top in Explore or by dropping JSON under a grafana/dashboards/ provisioning directory.