Introduction
Trace proxy metrics provide insight into the performance, behavior, and health of a system’s distributed tracing component. This document describes the trace metrics monitored and reported by our tracing system.
Trace Proxy Metrics
Metric | Description |
---|---|
trace_operations_latency | Span latency in microseconds (µs) by service, operation, and app. A Prometheus histogram with buckets {100, 200, 300, ..., 1000} (see the histogram sketch after this table). |
trace_root_operation_latency | Root span latency in microseconds (µs) by service, operation, and app. A Prometheus histogram with buckets {100, 200, 300, ..., 1000}. |
trace_accepted | Indicates that a new trace has been added to the collector’s cache. |
trace_operations_latency_ms | Difference between the start and end times of a span, in milliseconds, for each trace operation. |
trace_operations_failed | Number of error events in spans for each trace operation. |
trace_operations_succeeded | Number of successful events in spans for each trace operation. |
trace_spans_count_total | Count of total spans. |
trace_root_operation_latency_ms | Difference between the start and end times of a root span, in milliseconds, for each trace operation. |
trace_root_span | Number of root spans in an operation. |
trace_spans_count | Count of total spans for each operation. |
trace_root_operations_failed | Number of error events in root spans for each trace operation. |
trace_operation_error | Ratio of trace_operations_failed to trace_spans_count. |
trace_response_http_status | Total count of requests based on HTTP status code. |
trace_response_grpc_status | Total count of requests based on gRPC status code. |
trace_apdex_latency | Creates buckets based on the configured Apdex threshold (latency in ms). Traces are ingested and assigned to buckets using trace_apdex_latency_bucket{le="<latency>"}. |
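For reference, here is a minimal sketch of how a latency histogram shaped like trace_operations_latency could be declared and observed with the Prometheus Go client. The label names, package name, and helper function are assumptions for illustration; they are not taken from the trace proxy’s source.

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical histogram mirroring trace_operations_latency: latency in
// microseconds with buckets {100, 200, ..., 1000}, labeled by service,
// operation, and app (label names assumed for illustration).
var traceOperationsLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "trace_operations_latency",
		Help:    "Span latency in microseconds by service, operation, and app.",
		Buckets: prometheus.LinearBuckets(100, 100, 10), // {100, 200, ..., 1000}
	},
	[]string{"service", "operation", "app"},
)

func init() {
	prometheus.MustRegister(traceOperationsLatency)
}

// recordSpanLatency observes one span's duration in microseconds.
func recordSpanLatency(service, operation, app string, d time.Duration) {
	traceOperationsLatency.
		WithLabelValues(service, operation, app).
		Observe(float64(d.Microseconds()))
}
```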
Trace Metrics
Metric | Description |
---|---|
trace_duration_ms | Processing time spent by a span in the trace proxy. |
trace_send_dropped | Number of traces dropped by the sampler. In dry run mode, this remains 0, indicating all traces are sent to OpsRamp. |
trace_send_kept | Number of traces sent after applying the sampling rule. In dry run mode, this increments while trace_send_dropped remains 0. |
trace_send_ejected_full | Traces sent because the collector’s cache exceeded its capacity and they were ejected. |
trace_send_ejected_memsize | Traces that could not be kept because of memory size limits; they are placed into a new cache and then sent. |
trace_send_expired | Traces sent because their trace timeout expired. |
trace_send_got_root | Traces sent because their root span was received. |
trace_send_has_root | Count of spans that are root. |
trace_send_no_root | Count of spans that are not root. |
trace_sent_cache_hit | Incremented when the trace proxy receives a span belonging to a trace that has already been sent; the trace’s earlier sampling decision determines whether the span is sent or dropped. |
Collector Metrics
Metric | Description |
---|---|
collector_cache_buffer_overrun (Metric Type: Counter) | This value should remain zero; a positive value may indicate that you need to grow the size of the collector’s circular buffer. (The size of the circular buffer is set via the configuration field CacheCapacity.) Note that an increasing collector_cache_buffer_overrun does not necessarily mean the cache is full: you may see this value increase while collector_cache_entries remains low compared to collector_cache_capacity. This is due to the circular nature of the buffer and can occur when traces stay unfinished for a long time under high-throughput traffic. Whenever a trace persists for longer than the time it takes to accept the same number of traces as collector_cache_capacity (that is, to make a full circle around the ring), a cache buffer overrun is triggered. Setting CacheCapacity therefore depends not only on trace throughput but also on trace duration, both of which are tracked via other metrics. A cache buffer overrun means a trace has been sent to OpsRamp before it has been completed; depending on your tracing strategy, this may result in an incorrect sampling decision for the trace. If all of the fields your sampling rules depend on have already been received, the decision may still be correct; if some of those fields have not arrived yet, it may not be. A simplified sketch of this ring-buffer behavior follows this table. |
collector_cache_capacity (Metric Type: Gauge) | Equivalent to the value set in your configuration for CacheCapacity. Use this value in conjunction with collector_cache_entries to see how full the cache is over time. |
collector_cache_entries (Metric Type: Histogram) | Records avg, max, min, p50, p95, and p99 values, indicating how full the cache is over time. |
collector_cache_size (Metric Type: Gauge) | Length of a circular buffer of currently stored traces. |
collector_incoming_queue (Metric Type: Histogram) | Records avg, max, min, p50, p95, and p99 values for the queue of spans received from outside the trace proxy that are awaiting processing, indicating how full that queue is over time. |
collector_peer_queue (Metric Type: Histogram) | Records avg, max, min, p50, p95, and p99 values for the queue of spans received from other trace proxy peers that are awaiting processing, indicating how full that queue is over time. |
collector_metrics_labels_series (Metric Type: Gauge) | Represents the number of series in each metric. |
collector_metrics_push_latency_ms | Measures the time taken by the OpenTelemetry Collector to complete a metrics push request, recorded in milliseconds from initiation to successful completion. |
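The ring-buffer behavior behind collector_cache_buffer_overrun can be illustrated with a small sketch. The types, field names, and sendEarly callback below are assumptions for illustration rather than the trace proxy’s actual implementation; the point is only that a trace which outlives a full trip around the ring is ejected early and counted as an overrun.

```go
package collector

// trace is a stand-in for a cached trace (illustrative only).
type trace struct {
	ID       string
	Finished bool
}

// ringCache models a circular buffer of CacheCapacity traces.
type ringCache struct {
	buf      []*trace // length equals CacheCapacity
	next     int      // slot the next trace will occupy
	overruns int      // corresponds to collector_cache_buffer_overrun
}

func newRingCache(capacity int) *ringCache {
	return &ringCache{buf: make([]*trace, capacity)}
}

// add places a new trace into the next slot of the ring. If that slot still
// holds an unfinished trace -- meaning the old trace persisted longer than it
// took to accept CacheCapacity newer traces -- the old trace is ejected and
// sent before it is complete, and the overrun counter increments.
func (c *ringCache) add(t *trace, sendEarly func(*trace)) {
	if old := c.buf[c.next]; old != nil && !old.Finished {
		c.overruns++   // collector_cache_buffer_overrun
		sendEarly(old) // sampling decision may be made on incomplete fields
	}
	c.buf[c.next] = t
	c.next = (c.next + 1) % len(c.buf)
}
```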
Routing Metrics
Metric | Description |
---|---|
incoming_router_batch | Increments when the trace proxy’s batch event processing endpoint is hit by incoming (non-peer) traffic. |
peer_router_batch | Increments when the trace proxy’s batch event processing endpoint is hit by traffic from peer trace proxies. |
incoming_router_dropped | Increments when the trace proxy fails to add new spans from incoming traffic to a receive buffer while processing new events. Monitor this closely, as it indicates dropped spans. |
peer_router_dropped | Increments when the trace proxy fails to add new spans from peer traffic to a receive buffer while processing new events. Monitor this closely, as it indicates dropped spans. |
incoming_router_event | Increments when the trace proxy’s single event processing endpoint is hit by incoming (non-peer) traffic. |
peer_router_event | Increments when the trace proxy’s single event processing endpoint is hit by traffic from peer trace proxies. |
incoming_router_nonspan | Increments when the trace proxy accepts incoming non-span events that are not part of a trace. |
peer_router_nonspan | Increments when the trace proxy accepts non-span events from peers that are not part of a trace. |
incoming_router_peer | Count of traces routed in from the incoming traces generator. |
peer_router_peer | Count of traces routed in from a peer traces generator. |
incoming_router_proxied | Count of traces routed in from the incoming traces generator that reached the proxy. |
peer_router_proxied | Count of traces routed in from a peer traces generator that reached the proxy. |
incoming_router_span | Increments when the trace proxy accepts incoming events that are part of a trace, also known as spans. |
peer_router_span | Increments when the trace proxy accepts events from peers that are part of a trace, also known as spans. |
Transmission Metrics
Metric | Description |
---|---|
upstream_enqueue_errors | Count of spans that encountered errors while dispatching the event to OpsRamp. |
peer_enqueue_errors | Count of spans that encountered errors while dispatching the event to a peer. |
upstream_response_errors | Count of spans that received an error response or had a StatusCode greater than 202 when hitting upstream addresses (see the status-code sketch after this table). |
peer_response_errors | Count of spans that received an error response or had a StatusCode greater than 202 when hitting peer addresses. |
upstream_response_20x | Count of spans that had no error response and received a StatusCode less than 203 when hitting upstream addresses. |
peer_response_20x | Count of spans that had no error response and received a StatusCode less than 203 when hitting peer addresses. |
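The status-code rule shared by these transmission metrics amounts to a simple threshold, sketched below. The function name is hypothetical; it only restates the rule that an error or a status code above 202 counts toward *_response_errors, while an error-free response below 203 counts toward *_response_20x.

```go
package transmission

// classifyResponse reports whether a response should count as an error
// (upstream_response_errors / peer_response_errors) or as a success
// (upstream_response_20x / peer_response_20x).
func classifyResponse(statusCode int, err error) (isError bool) {
	if err != nil || statusCode > 202 {
		return true // would increment *_response_errors
	}
	return false // status code < 203 with no error: *_response_20x
}
```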
Sampling Metrics
Metric | Description |
---|---|
dynsampler_num_dropped | Count of traces dropped due to dynamic sampling. |
rulessampler_num_dropped | Count of traces dropped due to rules-based sampling. |
dynsampler_num_kept | Count of traces that are not dropped due to dynamic sampling. |
rulessampler_num_kept | Count of traces that are not dropped due to rules-based sampling. |
dynsampler_sample_rate | Records avg, max, min, p50, p95, and p99 of the sample rate reported by the configured sampler. |
rulessampler_sample_rate | Sample rate specified in the config section of the rules-based sampler. |
Cuckoo Cache Metrics
The cuckoo cache wraps a cuckoo filter implementation in a way that lets it run forever without filling up. A cuckoo filter cannot be emptied (you can delete individual items if you know what they are, but you cannot enumerate them from the filter). Consequently, the cache keeps two filters, current and future. The current filter is the one checked against, and every added item goes into both. The future filter is started after the current one, so that when the current filter gets too full it can be discarded and replaced by the future filter, and a new, empty future filter is started. This is why the future filter is nil until the current filter reaches a load factor of 0.5. A sketch of this rotation follows the table below.
Metric | Description |
---|---|
cuckoo_current_capacity | Capacity of the cuckoo cache of dropped traces, as specified in the configuration section. |
cuckoo_future_load_factor | Fraction of slots occupied in the future filter. |
cuckoo_current_load_factor | Fraction of slots occupied in the current filter. |
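Here is a minimal Go sketch of that current/future rotation, assuming a hypothetical cuckooFilter interface. The 0.5 threshold comes from the description above; the “too full” threshold and all names are illustrative assumptions rather than the trace proxy’s actual code.

```go
package cache

// cuckooFilter is a stand-in for a real cuckoo filter implementation.
type cuckooFilter interface {
	Add(id string)
	Contains(id string) bool
	LoadFactor() float64 // fraction of slots occupied (see the metrics above)
}

// rotatingCuckoo keeps a current and a future filter so it can run forever.
type rotatingCuckoo struct {
	current   cuckooFilter
	future    cuckooFilter // nil until current reaches a load factor of 0.5
	newFilter func() cuckooFilter
}

// Check consults only the current filter.
func (r *rotatingCuckoo) Check(id string) bool {
	return r.current.Contains(id)
}

// Add records an id in both filters and rotates them when needed.
func (r *rotatingCuckoo) Add(id string) {
	r.current.Add(id)

	// Start the future filter once the current one is half full, so the
	// future filter always contains the most recent items.
	if r.future == nil && r.current.LoadFactor() >= 0.5 {
		r.future = r.newFilter()
	}
	if r.future != nil {
		r.future.Add(id)

		// When the current filter gets too full, discard it, promote the
		// future filter, and let a fresh future filter start on a later Add.
		if r.current.LoadFactor() >= 0.9 { // illustrative "too full" threshold
			r.current = r.future
			r.future = nil
		}
	}
}
```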
Note
- There are additional process and Go runtime metrics that are used to monitor the health of the trace proxy.
- These metrics are prefixed with process_ and go_, respectively.