Healthcheckv2 Extension
Overview
This is an experimental extension intended to replace the existing health check extension. Because its stability level is currently development, users who want to experiment with it must build a custom collector binary using the OpenTelemetry Collector Builder. Health check extension V2 adds new functionality that users can opt in to, and it also supports the original healthcheck extension functionality, with the exception of the check_collector_pipeline feature. See the warning below.
⚠️⚠️⚠️ Warning ⚠️⚠️⚠️
The check_collector_pipeline feature of this extension was not working as expected and has been
removed. The config remains for backwards compatibility, but it too will be removed in the future.
Users wishing to monitor pipeline health should use the v2 functionality described below and
opt-in to component health as described in
component health configuration.
V1
Health Check Extension V1 enables an HTTP URL that can be probed to check the status of the OpenTelemetry Collector. This extension can be used as a liveness and/or readiness probe on Kubernetes.
The following settings are required:
- endpoint (default = localhost:13133): Address to publish the health check status. You can review the full list of ServerConfig settings. See our security best practices doc to understand how to set the endpoint in different environments.
- path (default = "/"): Specifies the path to be configured for the health check server.
- response_body (default = ""): Specifies a static body that overrides the default response returned by the health check service.
- check_collector_pipeline (deprecated and ignored): Settings of the collector pipeline health check.
  - enabled (default = false): Whether to enable the collector pipeline check.
  - interval (default = "5m"): Time interval over which to check the number of failures.
  - exporter_failure_threshold (default = 5): The failure number threshold to mark containers as healthy.
V2
Health Check Extension V2 provides HTTP and gRPC healthcheck services. The services can be used separately or together depending on your needs. The source of health for both services is component status reporting, a collector feature that allows individual components to report their health via StatusEvents. The health check extension aggregates the component StatusEvents into overall
collector health and pipeline health and exposes this data through its services.
Below is a table enumerating component statuses and their meanings. These will be mapped to
appropriate status codes for the protocol.
| Status | Meaning |
|---|---|
| Starting | The component is starting. |
| OK | The component is running without issue. |
| RecoverableError | The component has experienced a transient error and may recover. |
| PermanentError | The component has detected a condition at runtime that will need human intervention to fix. The collector will continue to run in a degraded mode. |
| FatalError | The collector has experienced a fatal runtime error and will shut down. |
| Stopping | The component is in the process of shutting down. |
| Stopped | The component has completed shutdown. |
Configuration
Below is sample configuration for both the HTTP and gRPC services with component health opt-in. Note that the use_v2: true setting is necessary during the interim while V1 functionality is
incrementally phased out.
Component Health Config
By default, the Health Check Extension will not consider component error statuses as unhealthy. That is, an error status will not be reflected in the response code of the health check, but it will be available in the response body regardless of configuration. This behavior can be changed by opting in to include recoverable and/or permanent errors.
include_permanent_errors
To opt in to permanent errors, set include_permanent_errors: true. When true, a permanent error
will result in a non-ok return status. By definition, this is a permanent state, and one that will
require human intervention to fix. The collector is running, albeit in a degraded state, and
restarting is unlikely to fix the problem. Thus, caution should be used when enabling this setting
while using the extension as a liveness or readiness probe in k8s.
include_recoverable_errors and recovery_duration
To opt in to recoverable errors, set include_recoverable_errors: true. This setting works in tandem
with the recovery_duration option. When true, the Health Check Extension will consider a
recoverable error to be healthy until the recovery duration elapses, and unhealthy afterwards.
During the recovery duration an ok status will be returned. If the collector does not recover in
that time, a non-ok status will be returned. If the collector subsequently recovers, it will resume
reporting an ok status.
HTTP Service
Status Endpoint
The HTTP service provides a status endpoint that can be probed for overall collector status and per-pipeline status. The endpoint is located at /status by default, but can be configured using
the http.status.path setting. Requests to /status will return the overall collector status. To
probe pipeline status, pass the pipeline name as a query parameter, e.g. /status?pipeline=traces.
The HTTP status code returned maps to the overall collector or pipeline status, with the mapping
described below.
⚠️ Take care not to expose this endpoint on non-localhost ports as it contains the internal state
of the running collector.
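As a rough sketch of how a client might probe the status endpoint, the Go program below builds the probe URL and reads back the HTTP status code. The helper names statusURL and probe are hypothetical, and the test server in main is a stand-in that only imitates the endpoint so the example runs without a collector.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// statusURL builds the probe URL. An empty pipeline probes overall collector
// status; otherwise the pipeline name is passed as a query parameter.
func statusURL(base, path, pipeline string) string {
	u := base + path
	if pipeline != "" {
		u += "?" + url.Values{"pipeline": {pipeline}}.Encode()
	}
	return u
}

// probe returns the HTTP status code of a GET request to target.
func probe(target string) (int, error) {
	resp, err := http.Get(target)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	// Stand-in server (an assumption for this sketch): it reports the
	// "traces" pipeline as unavailable and everything else as OK.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Query().Get("pipeline") == "traces" {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	for _, p := range []string{"", "traces"} {
		code, err := probe(statusURL(srv.URL, "/status", p))
		fmt.Printf("pipeline=%q code=%d err=%v\n", p, code, err)
	}
}
```

Note that url.Values.Encode percent-encodes characters such as the slash in a pipeline name like traces/http; a standard query parser on the server side decodes it back to the same value.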
Mapping of Component Status to HTTP Status
Component statuses are aggregated into overall collector status and overall pipeline status. In each case, you can consider the aggregated status to be the sum of its parts. The mapping from component status to HTTP status is as follows:

| Status | HTTP Status Code |
|---|---|
| Starting | 503 - Service Unavailable |
| OK | 200 - OK |
| RecoverableError | 200 - OK [1] |
| PermanentError | 200 - OK [2] |
| FatalError | 500 - Internal Server Error |
| Stopping | 503 - Service Unavailable |
| Stopped | 503 - Service Unavailable |

1. If include_recoverable_errors: true: 200 when elapsed time < recovery duration; 500 otherwise.
2. If include_permanent_errors: true: 500 - Internal Server Error.
Response Body
The response body contains either a detailed or non-detailed view into collector or pipeline health in JSON format. The level of detail applies to the contents of the response body and is controlled by the verbose query parameter. Component event attributes are included when
http.status.include_attributes is set to true, regardless of whether the response is detailed.
Error Precedence
The response body contains either a partial or complete aggregate status in JSON format. The aggregation process functions similarly to a priority queue, where the most relevant status bubbles to the top. By default, FatalError > PermanentError > RecoverableError. However, the priority of RecoverableError and PermanentError will be reversed if include_permanent_errors is false and
include_recoverable_errors is true as this configuration makes RecoverableErrors more
relevant.
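One way to picture the precedence rule is as a ranking function whose middle two ranks swap depending on configuration. The Go sketch below is a simplified illustration that covers only OK and the three error statuses; the names priority and aggregate are hypothetical, and the real aggregation also has to handle statuses such as Starting and Stopping.

```go
package main

import "fmt"

// Status is a hypothetical stand-in limited to the statuses ranked above.
type Status int

const (
	OK Status = iota
	RecoverableError
	PermanentError
	FatalError
)

var names = map[Status]string{
	OK:               "OK",
	RecoverableError: "RecoverableError",
	PermanentError:   "PermanentError",
	FatalError:       "FatalError",
}

func (s Status) String() string { return names[s] }

// priority returns a ranking function. RecoverableError outranks
// PermanentError only when permanent errors are excluded and recoverable
// errors are included.
func priority(includePermanent, includeRecoverable bool) func(Status) int {
	recoverableFirst := !includePermanent && includeRecoverable
	return func(s Status) int {
		switch s {
		case FatalError:
			return 3
		case PermanentError:
			if recoverableFirst {
				return 1
			}
			return 2
		case RecoverableError:
			if recoverableFirst {
				return 2
			}
			return 1
		default:
			return 0
		}
	}
}

// aggregate returns the highest-ranked status among the inputs.
func aggregate(statuses []Status, rank func(Status) int) Status {
	top := OK
	for _, s := range statuses {
		if rank(s) > rank(top) {
			top = s
		}
	}
	return top
}

func main() {
	in := []Status{OK, RecoverableError, PermanentError}
	fmt.Println(aggregate(in, priority(false, false)))
	fmt.Println(aggregate(in, priority(false, true)))
}
```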
Collector Health
The detailed response body for collector health will include the overall status for the collector, the overall status for each pipeline in the collector, and the statuses for the individual components in each pipeline. The non-detailed response will only contain the overall collector health.
Verbose Example
Assuming the health check extension is configured with http.status.endpoint set to
localhost:13133, a request to http://localhost:13133/status?verbose
will have a response body such as:
- The overall status is StatusRecoverableError, but the status is healthy because include_recoverable_errors is set to false, or it is true and the recovery duration has not yet passed.
- pipeline:metrics/grpc has a matching status, as does exporter:otlp_grpc/staging. This implicates the exporter as the root cause for the pipeline and overall collector status.
- pipeline:traces/http is completely healthy.

Without the verbose query parameter, only the overall
status will be returned. The pipeline and component level statuses will be omitted. If
http.status.include_attributes is enabled, the overall status will also include an attributes
field.
Pipeline Health
The detailed response body for pipeline health is essentially a zoomed-in version of the detailed collector response. It contains the overall status for the pipeline and the statuses of the individual components. The non-detailed response body contains only the overall status for the pipeline.
Verbose Response Example
Assuming the health check extension is configured with http.status.endpoint set to
localhost:13133, a request to
http://localhost:13133/status?pipeline=traces/http&verbose will have a response body such as:
Without the verbose query parameter, only the overall pipeline status
will be returned. The component level statuses will be omitted. If http.status.include_attributes
is enabled, the overall status will also include an attributes field.
Collector Config Endpoint
The HTTP service optionally exposes an endpoint that provides the collector configuration. Note that the configuration returned is unfiltered and may contain sensitive information. As such, the endpoint is disabled by default. Enable it using the http.config.enabled setting. By
default the path will be /config, but it can be changed using the http.config.path setting.
⚠️ Take care not to expose this endpoint on non-localhost ports as it contains the unobfuscated
config of the running collector.
gRPC Service
The health check extension provides an implementation of the grpc_health_v1 service. The service was chosen for compatibility with existing gRPC health checks; however, it does not provide the additional detail available with the HTTP service. Additionally, the gRPC service has a less nuanced view of the world, with only two reportable statuses: HealthCheckResponse_SERVING and
HealthCheckResponse_NOT_SERVING.
Mapping of ComponentStatus to HealthCheckResponse_ServingStatus
The HTTP and gRPC services use the same method of component status aggregation to derive overall collector health and pipeline health from individual status events. The component statuses map to the following HealthCheckResponse_ServingStatuses.
| Status | HealthCheckResponse_ServingStatus |
|---|---|
| Starting | NOT_SERVING |
| OK | SERVING |
| RecoverableError | SERVING [1] |
| PermanentError | SERVING [2] |
| FatalError | NOT_SERVING |
| Stopping | NOT_SERVING |
| Stopped | NOT_SERVING |

1. If include_recoverable_errors: true: SERVING when elapsed time < recovery duration; NOT_SERVING otherwise.
2. If include_permanent_errors: true: NOT_SERVING.
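The serving-status table lends itself to a lookup with two configuration-dependent overrides. The Go sketch below shows that shape under stated assumptions: the ServingStatus type, the base table, and servingStatus are hypothetical names, and statuses are plain strings rather than the generated grpc_health_v1 enum values.

```go
package main

import "fmt"

// ServingStatus is a hypothetical string stand-in for the gRPC enum.
type ServingStatus string

const (
	Serving    ServingStatus = "SERVING"
	NotServing ServingStatus = "NOT_SERVING"
)

// base is the default mapping, before any opt-in settings apply.
var base = map[string]ServingStatus{
	"Starting":         NotServing,
	"OK":               Serving,
	"RecoverableError": Serving,
	"PermanentError":   Serving,
	"FatalError":       NotServing,
	"Stopping":         NotServing,
	"Stopped":          NotServing,
}

// servingStatus applies the two opt-in overrides and otherwise falls back to
// the base table. recoveryElapsed is true once the recovery duration has
// passed. An unrecognized status yields the empty string.
func servingStatus(status string, includePermanent, includeRecoverable, recoveryElapsed bool) ServingStatus {
	switch {
	case status == "PermanentError" && includePermanent:
		return NotServing
	case status == "RecoverableError" && includeRecoverable && recoveryElapsed:
		return NotServing
	}
	return base[status]
}

func main() {
	fmt.Println(servingStatus("RecoverableError", false, true, false))
	fmt.Println(servingStatus("RecoverableError", false, true, true))
	fmt.Println(servingStatus("PermanentError", true, false, false))
}
```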
HealthCheckRequest
The gRPC service exposes two RPCs: Check and Watch (more about those below). Each takes a
HealthCheckRequest argument. The HealthCheckRequest message is defined as:
To query for overall collector health, use the empty string "" as the service name. To query for
pipeline health, use the pipeline name as the service.
Check RPC
The Check RPC is defined as:
If the service is unknown, the RPC will return an error with status NotFound. Otherwise it will
return a HealthCheckResponse with the serving status as mapped in the table above.
Watch Streaming RPC
The Watch RPC is defined as:
The Watch RPC will initiate a stream for the given service. If the service is known at the time
the RPC is made, its current status will be sent and changes in status will be sent thereafter. If
the service is unknown, a response with a status of HealthCheckResponse_SERVICE_UNKNOWN will be
sent. The stream will remain open, and if and when the service starts reporting, its status will
begin streaming.
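The Watch behavior described above (an initial message, then one message per status change, with unknown services kept open) can be modeled with channels. The Go sketch below is only a model of those semantics using the standard library; the hub type and its watch and set methods are hypothetical and do not use the gRPC API.

```go
package main

import (
	"fmt"
	"sync"
)

// hub is a hypothetical in-memory model of per-service status streams.
type hub struct {
	mu       sync.Mutex
	current  map[string]string
	watchers map[string][]chan string
}

func newHub() *hub {
	return &hub{current: map[string]string{}, watchers: map[string][]chan string{}}
}

// send delivers st without blocking; it drops the update if the buffer is
// full, which keeps this sketch free of deadlocks.
func send(ch chan string, st string) {
	select {
	case ch <- st:
	default:
	}
}

// watch opens a stream for service. The first message is the current status,
// or SERVICE_UNKNOWN if the service has not reported yet.
func (h *hub) watch(service string) <-chan string {
	h.mu.Lock()
	defer h.mu.Unlock()
	ch := make(chan string, 16)
	if st, ok := h.current[service]; ok {
		send(ch, st)
	} else {
		send(ch, "SERVICE_UNKNOWN")
	}
	h.watchers[service] = append(h.watchers[service], ch)
	return ch
}

// set records a status and forwards it to watchers only when it changes.
func (h *hub) set(service, status string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.current[service] == status {
		return
	}
	h.current[service] = status
	for _, ch := range h.watchers[service] {
		send(ch, status)
	}
}

func main() {
	h := newHub()
	ch := h.watch("traces") // pipeline name used as the service
	fmt.Println(<-ch)
	h.set("traces", "SERVING")
	fmt.Println(<-ch)
	h.set("traces", "NOT_SERVING")
	fmt.Println(<-ch)
}
```

The stream for an unknown service stays registered, so the first call to set for that service is delivered to the existing watcher, mirroring the "if and when the service starts reporting" behavior.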
Future
There are plans to provide the ability to export status events as OTLP logs adhering to the event semantic conventions.
Last generated: 2026-04-13