
Isolation Forest Processor

Status: Available in contrib · Maintainers: @atoulme · Source: opentelemetry-collector-contrib

Supported Telemetry

Logs, Metrics, Traces

Overview

✨ Key Features

  • Realtime Isolation Forest – Builds an ensemble of random trees over a sliding window of recent data and assigns a 0–1 anomaly score on ingestion (≈ O(log n) per point).
  • Multi-signal support – Can be inserted into traces, metrics, and logs pipelines; one config powers all three.
  • Per-entity modelling – The features config lets you maintain a separate model per unique combination of resource / attribute keys (e.g. per-pod, per-service).
  • Adaptive window sizing – Automatically adjusts window size based on traffic patterns, memory usage, and model stability for optimal performance and resource utilization.
  • Flexible output – Add an attribute iforest.is_anomaly=true, emit a gauge metric iforest.anomaly_score, or drop anomalous telemetry entirely.
  • Config-driven – Tune tree count, subsample size, contamination rate, sliding-window length, retraining interval, target metrics, and more, all in collector.yml.
  • Zero external deps – Pure Go implementation; runs wherever the Collector does (edge, gateway, or backend).

βš™οΈ How it Works

  1. Training window – The processor keeps up to window_size of the most recent data points for every feature-group.
  2. Periodic (re-)training – Every training_interval, it draws subsample_size points from that window and grows forest_size random isolation trees.
  3. Scoring – Each new point is pushed through the forest. Shorter average path length ⇒ higher anomaly score.
  4. Adaptive sizing – When enabled, window size automatically adjusts based on traffic velocity, memory usage, and model stability.
  5. Post-processing –
    • If add_anomaly_score: true, a gauge metric iforest.anomaly_score is emitted with identical attributes/timestamp.
    • If the score ≥ anomaly_threshold, the original span/metric/log is flagged with iforest.is_anomaly=true.
    • If drop_anomalous_data: true, flagged items are removed from the batch instead of being forwarded.
Contamination rate – instead of hard‑coding anomaly_threshold, you can supply contamination_rate (expected % of outliers). The processor then auto‑derives a dynamic threshold equal to the (1 – contamination_rate) quantile of recent scores.
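The quantile-based threshold derivation can be sketched as follows (a simplified nearest-rank version; the processor's internal estimator may interpolate or decay old scores):

```go
package main

import (
	"fmt"
	"sort"
)

// dynamicThreshold returns the (1 - contaminationRate) quantile of the
// recent score window, so roughly contaminationRate of points score above it.
func dynamicThreshold(scores []float64, contaminationRate float64) float64 {
	sorted := append([]float64(nil), scores...) // copy; don't disturb the window
	sort.Float64s(sorted)
	// nearest-rank quantile index
	idx := int(float64(len(sorted)-1) * (1 - contaminationRate))
	return sorted[idx]
}

func main() {
	recent := []float64{0.31, 0.42, 0.38, 0.55, 0.47, 0.92, 0.35, 0.44, 0.40, 0.88}
	// contamination_rate: 0.2 -> the top ~20% of scores land above the cut-off
	fmt.Println(dynamicThreshold(recent, 0.2)) // 0.55; only 0.88 and 0.92 exceed it
}
```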
Performance is linear in forest_size and logarithmic in subsample_size; the defaults of 100 trees and a 1,000-point window easily sustain 10–50 k points/s on a single vCPU.

🔧 Configuration

  • forest_size (int, default 100) – Number of trees in the ensemble. Higher → smoother scores, more CPU.
  • subsample_size (int, default 256) – Rows sampled to build each tree. Must be ≤ window_size.
  • window_size (int, default 1000) – Sliding window of recent data maintained per feature-group.
  • contamination_rate (float 0–1, default 0.10) – Fraction of points expected to be outliers; used to auto-tune the threshold.
  • anomaly_threshold (float 0–1, default derived) – Manual override; score ≥ this ⇒ anomaly. Ignored if contamination_rate is set.
  • training_interval (duration, default 5m) – Model is retrained no sooner than this interval.
  • features ([]string, default []) – Resource/attribute keys that define grouping. Blank ⇒ single global model.
  • metrics_to_analyze ([]string, default []) – Only these metric names are scored (metrics pipeline only). Blank ⇒ all.
  • add_anomaly_score (bool, default false) – Emit the iforest.anomaly_score metric.
  • drop_anomalous_data (bool, default false) – Remove anomalous items from the batch instead of forwarding them.
  • adaptive_window (object, default null) – Enables adaptive window sizing (see the Adaptive Window section below).
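Putting those fields together, a minimal pipeline entry might look like the sketch below (values are the documented defaults except features and add_anomaly_score; note that the fuller sample later on this page uses a somewhat different field set):

```yaml
processors:
  isolationforest:
    forest_size: 100                          # ensemble size
    subsample_size: 256                       # rows per tree
    window_size: 1000                         # sliding window per feature-group
    contamination_rate: 0.10                  # auto-derives the anomaly threshold
    training_interval: 5m
    features: [service.name, k8s.pod.name]    # one model per service/pod pair
    add_anomaly_score: true                   # emit iforest.anomaly_score gauge
```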

🔄 Adaptive Window Configuration

When enabled, the processor automatically adjusts window size based on traffic patterns and resource constraints:
  • enabled (bool, default false) – Enable adaptive window sizing.
  • min_window_size (int, default 1000) – Minimum window size (safety bound).
  • max_window_size (int, default 100000) – Maximum window size (memory protection).
  • memory_limit_mb (int, default 256) – Shrink the window when memory usage exceeds this limit.
  • adaptation_rate (float, default 0.1) – Rate of window-size changes (0.0–1.0).
  • velocity_threshold (float, default 50.0) – Samples/sec threshold for triggering window growth.
  • stability_check_interval (duration, default 5m) – How often to evaluate model stability for expansion.
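The grow/shrink/clamp behaviour these fields control can be sketched roughly as below (an illustrative model only; field names mirror the table, but the processor's actual adaptation logic may differ):

```go
package main

import (
	"fmt"
	"math"
)

// adaptiveWindow mirrors the adaptive_window configuration fields above.
type adaptiveWindow struct {
	Size              int     // current window size
	MinWindowSize     int     // safety bound
	MaxWindowSize     int     // memory protection
	MemoryLimitMB     int     // shrink when exceeded
	AdaptationRate    float64 // step size for each change
	VelocityThreshold float64 // samples/sec that triggers growth
}

// adjust shrinks the window under memory pressure (which takes priority),
// grows it when ingest velocity is high, and always clamps to the bounds.
func (w *adaptiveWindow) adjust(samplesPerSec float64, memoryMB int) {
	switch {
	case memoryMB > w.MemoryLimitMB:
		w.Size = int(math.Round(float64(w.Size) * (1 - w.AdaptationRate)))
	case samplesPerSec > w.VelocityThreshold:
		w.Size = int(math.Round(float64(w.Size) * (1 + w.AdaptationRate)))
	}
	if w.Size < w.MinWindowSize {
		w.Size = w.MinWindowSize
	}
	if w.Size > w.MaxWindowSize {
		w.Size = w.MaxWindowSize
	}
}

func main() {
	w := &adaptiveWindow{Size: 10000, MinWindowSize: 1000, MaxWindowSize: 100000,
		MemoryLimitMB: 256, AdaptationRate: 0.1, VelocityThreshold: 50.0}
	w.adjust(120.0, 100) // high traffic, memory fine -> grow
	fmt.Println(w.Size)  // 11000
	w.adjust(120.0, 300) // over the memory limit -> shrink wins
	fmt.Println(w.Size)  // 9900
}
```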
See the sample below for context.

📄 Sample config.yml

receivers:
  otlp:
    protocols:
      grpc:            # → listen on 0.0.0.0:4317

processors:
  isolationforest:
    # ─── core algorithm parameters ────────────────────────────────
    forest_size:        150          # trees per forest
    subsample_size:     512          # rows per tree
    contamination_rate: 0.05         # 5 % expected outliers
    threshold:          0.0          # 0 ⇒ let contamination_rate drive the cut-off
    mode:               both         # enrich + filter (see docstring)
    training_window:    24h          # window of data kept for training
    update_frequency:   5m           # retrain every 5 minutes
    min_samples:        1000         # wait until this many points seen

    # ─── where to write results on each data point ───────────────
    score_attribute:          anomaly.isolation_score   # float 0–1
    classification_attribute: anomaly.is_anomaly        # bool

    # ─── which numeric features the model should look at ─────────
    features:
      traces:  [duration]           # span duration (µs / ns)
      metrics: [value]              # the sample's numeric value
      logs:    [severity_number]    # log severity enum

    # ─── performance guard-rails (optional) ──────────────────────
    performance:
      max_memory_mb:     512
      batch_size:        1000
      parallel_workers:  4

exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"   # Prom-server will scrape /metrics here
    send_timestamps: true      # (field is valid in the standard exporter)

service:
  pipelines:
    metrics:
      receivers:  [otlp]
      processors: [isolationforest]
      exporters:  [prometheus]

Note: Use the routing connector to segregate the different kinds of spans (db, messaging, etc.) and send them to separate isolationforest processor instances, so that anomaly detection pertains to the respective category of signals.
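A sketch of that setup is shown below. The signal.category attribute is hypothetical (substitute whatever attribute distinguishes your span categories), and the routing connector's matching syntax and supported contexts vary by Collector version; check its README for the exact OTTL forms.

```yaml
connectors:
  routing:
    default_pipelines: [traces/other]
    table:
      - statement: route() where attributes["signal.category"] == "db"
        pipelines: [traces/db]
      - statement: route() where attributes["signal.category"] == "messaging"
        pipelines: [traces/messaging]

service:
  pipelines:
    traces/in:          # receives everything, fans out via the connector
      receivers: [otlp]
      exporters: [routing]
    traces/db:          # dedicated isolationforest instance per category
      receivers: [routing]
      processors: [isolationforest/db]
      exporters: [otlp]
    traces/messaging:
      receivers: [routing]
      processors: [isolationforest/messaging]
      exporters: [otlp]
    traces/other:
      receivers: [routing]
      exporters: [otlp]
```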

What the example does

  • Traces – Scores span duration (ns), grouped by service.name and k8s.pod.name; output is the iforest.is_anomaly attribute plus optional drop. Use a span/trace exporter to route anomalies.
  • Metrics – Scores only system.cpu.utilization and system.memory.utilization, same grouping; output is the attribute plus the score metric, which appears as the iforest.anomaly_score gauge.
  • Logs – Scores the size of the log payload (bytes) by default, same grouping; output is the attribute flag. You can expose a numeric log attribute and configure the processor to use that via code changes.

🚀 Best Practices

  • Tune forest_size vs. latency – start with 100 trees; raise to 200–300 if scores look noisy.
  • Use per-entity models – add features (service, pod, host) to avoid global comparisons across very different series.
  • Let contamination drive the threshold – set contamination_rate to the fraction of traffic you're comfortable labelling as outliers; avoid hand-tuning anomaly_threshold.
  • Use adaptive window sizing – enable it for dynamic workloads; the processor will automatically grow windows during high traffic and shrink them under memory pressure.
  • Route anomalies – keep drop_anomalous_data=false and add a simple routing processor downstream to ship anomalies to a dedicated exporter or topic.
  • Monitor model health – the emitted iforest.anomaly_score metric is ideal for a Grafana panel; watch its distribution and adapt the window / contamination accordingly.

πŸ—οΈ Internals (High‑Level)

               ┌───────────────────────────────────────────────────┐
               │ IsolationForestProcessor (per Collector instance) │
               │ ───────────────────────────────────────────────── │
               │  • Sliding window (per feature-group)             │
               │  • Forest of N trees (per feature-group)          │
Telemetry ───▶ │  • Score calculator & anomaly decision            │ ───▶  Next processor/exporter
               │  • Adaptive window sizing (optional)              │
               └───────────────────────────────────────────────────┘
Training cost: O(current_window_size × forest_size × log subsample_size) every training_interval.
Scoring cost: O(forest_size × log subsample_size) per item.
Note: With adaptive window sizing enabled, current_window_size dynamically adjusts between min_window_size and max_window_size based on traffic patterns and memory constraints, making training costs adaptive to workload conditions.
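Plugging in the defaults (100 trees, 256-row subsamples, a 1,000-point window) gives a feel for these bounds; a back-of-the-envelope sketch (ignoring constant factors, which the big-O notation hides):

```go
package main

import (
	"fmt"
	"math"
)

// scoringCost estimates tree-node visits per scored item:
// each of the forest's trees is traversed to roughly log2(subsample) depth.
func scoringCost(forestSize, subsampleSize float64) float64 {
	return forestSize * math.Log2(subsampleSize)
}

// trainingCost estimates work per retrain over the full window.
func trainingCost(windowSize, forestSize, subsampleSize float64) float64 {
	return windowSize * forestSize * math.Log2(subsampleSize)
}

func main() {
	fmt.Println(scoringCost(100, 256))        // 800 node visits per item
	fmt.Println(trainingCost(1000, 100, 256)) // 800000 per retrain (every training_interval)
}
```

At 800 node visits per point, the quoted 10–50 k points/s on a single vCPU is plausible, and the periodic 8×10^5-unit retrain amortises to noise over a 5-minute interval.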

🤝 Contributing

  • Bugs / Questions – please open an issue in the fork first.
  • Recently added: Adaptive window sizing for dynamic traffic patterns.
  • Planned enhancements
    • Multivariate scoring (multiple numeric attributes per point).
    • Expose Prometheus counters for training time / CPU cost.
PRs welcome – please include unit tests and doc updates.

Configuration

Example Configuration

processors:
  isolationforest:
    forest_size: 50
    mode: "enrich"
    threshold: 0.75
    features:
      traces: ["duration", "error"]

Last generated: 2026-04-13