Introduction

Trace proxy metrics provide insight into the performance, behavior, and health of a system's distributed tracing component. This document describes the trace metrics monitored and reported by our tracing system.

Trace Proxy Metrics

trace_operations_latency: Span latency in microseconds (µs) by service, operation, and app. Exposed as a Prometheus histogram with buckets {100, 200, 300, ..., 1000}; see the example queries after this table.
trace_root_operation_latency: Root span latency in microseconds (µs) by service, operation, and app. Exposed as a Prometheus histogram with buckets {100, 200, 300, ..., 1000}.
trace_accepted: Incremented when a new trace is added to the collector's cache.
trace_operations_latency_ms: Difference between the start and end time of a span, for each trace operation.
trace_operations_failed: Number of error events in spans, for each trace operation.
trace_operations_succeeded: Number of successful events in spans, for each trace operation.
trace_spans_count_total: Total count of spans.
trace_root_operation_latency_ms: Difference between the start and end time of a root span, for each trace operation.
trace_root_span: Number of root spans in an operation.
trace_spans_count: Total count of spans for each operation.
trace_root_operations_failed: Number of error events in root spans, for each trace operation.
trace_operation_error: Ratio of trace_operations_failed to trace_spans_count.
trace_response_http_status: Total count of requests by HTTP status code.
trace_response_grpc_status: Total count of requests by gRPC status code.
trace_apdex_latency: Histogram whose buckets are derived from the configured Apdex threshold (latency in ms). Ingested traces are assigned to buckets as trace_apdex_latency_bucket{le="<latency>"}.
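
Because trace_operations_latency and trace_apdex_latency are exposed as Prometheus histograms, they are usually queried through their per-bucket counters. A minimal PromQL sketch, assuming the metrics are scraped by Prometheus with the conventional _bucket/_count suffixes and a service label (the exact label names and the <threshold> placeholder depend on your deployment):

  # Approximate p95 span latency (µs) per service over the last 5 minutes
  histogram_quantile(0.95, sum by (service, le) (rate(trace_operations_latency_bucket[5m])))

  # Share of spans at or below the configured Apdex threshold
  # (replace <threshold> with the configured latency value in ms)
  sum(rate(trace_apdex_latency_bucket{le="<threshold>"}[5m]))
    /
  sum(rate(trace_apdex_latency_count[5m]))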

Trace Metrics

trace_duration_ms: Processing time spent by the span in the trace proxy.
trace_send_dropped: Number of traces dropped by the sampler. In dry run mode this stays at 0, indicating that all traces are sent to OpsRamp.
trace_send_kept: Number of traces sent after the sampling rules are applied. In dry run mode this increments while trace_send_dropped stays at 0; see the example query after this table.
trace_send_ejected_full: Traces sent because they were ejected when the cache exceeded its capacity.
trace_send_ejected_memsize: Traces sent because they were ejected when the cache exceeded its memory size limit.
trace_send_expired: Traces sent because their trace timeout elapsed.
trace_send_got_root: Traces sent because their root span arrived.
trace_send_has_root: Count of spans that are root spans.
trace_send_no_root: Count of spans that are not root spans.
trace_sent_cache_hit: Incremented when the trace proxy receives a span for a trace that has already been sent; the stored sampling decision determines whether the late span is sent or dropped.
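
Taken together, trace_send_kept and trace_send_dropped give the effective keep rate of the sampler. A hedged PromQL sketch, assuming both are exported as Prometheus counters under these names:

  # Fraction of traces kept by sampling over the last 10 minutes
  sum(rate(trace_send_kept[10m]))
    /
  (sum(rate(trace_send_kept[10m])) + sum(rate(trace_send_dropped[10m])))

In dry run mode this expression should stay at 1, since trace_send_dropped remains 0.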

Collector Metrics

collector_cache_buffer_overrun (Metric Type: Counter): This value should remain zero; a positive value may indicate that the collector's circular buffer needs to grow. (The size of the circular buffer is set via the configuration field CacheCapacity.) An increasing collector_cache_buffer_overrun does not necessarily mean the cache is full: you may see it increase while collector_cache_entries stays low compared to collector_cache_capacity. Because of the circular nature of the buffer, this can happen when traces stay unfinished for a long time under high-throughput traffic. Any trace that persists for longer than the time it takes to accept CacheCapacity new traces (that is, to make a full circle around the ring) triggers a cache buffer overrun. Setting CacheCapacity therefore depends not only on trace throughput but also on trace duration (both of which are tracked via other metrics). A cache buffer overrun means that a trace was sent to OpsRamp before it was complete. Depending on your tracing strategy, this can lead to an incorrect sampling decision for that trace: if all the fields your sampling rules depend on had already been received, the decision will still be correct; if some of them had not yet arrived, the decision may be wrong. See the alert expression after this table.
collector_cache_capacity (Metric Type: Gauge): Equal to the value of CacheCapacity in your configuration. Use it together with collector_cache_entries to see how full the cache is over time.
collector_cache_entries (Metric Type: Histogram): Records avg, max, min, p50, p95, and p99 values, indicating how full the cache is over time.
collector_cache_size (Metric Type: Gauge): Length of the circular buffer of currently stored traces.
collector_incoming_queue (Metric Type: Histogram): Records avg, max, min, p50, p95, and p99 values, indicating how full the queue of spans received from outside the trace proxy and awaiting processing is.
collector_peer_queue (Metric Type: Histogram): Records avg, max, min, p50, p95, and p99 values, indicating how full the queue of spans received from other trace proxy peers and awaiting processing is.
collector_metrics_labels_series (Metric Type: Gauge): Number of series in each metric.
collector_metrics_push_latency_ms: Time taken by the OpenTelemetry Collector to complete a metrics push request, typically recorded in milliseconds, from initiation to successful completion.
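
Since collector_cache_buffer_overrun should stay at zero, it is a natural candidate for an alert. A minimal PromQL sketch, assuming the counter is scraped under this name (the 5-minute window is an arbitrary choice):

  # Fires if any cache buffer overrun occurred in the last 5 minutes
  increase(collector_cache_buffer_overrun[5m]) > 0

If this fires regularly, compare collector_cache_entries with collector_cache_capacity and consider raising CacheCapacity.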

Routing Metrics

incoming_router_batch: Incremented when the batch event processing endpoint on the incoming (application-facing) listener is hit.
peer_router_batch: Incremented when the batch event processing endpoint on the peer listener is hit.
incoming_router_dropped: Incremented when the trace proxy fails to add a new span to the incoming receive buffer while processing new events. Indicates dropped spans and should be monitored closely; see the alert expression after this table.
peer_router_dropped: Incremented when the trace proxy fails to add a new span to the peer receive buffer while processing new events. Indicates dropped spans and should be monitored closely.
incoming_router_event: Incremented when the single event processing endpoint on the incoming listener is hit.
peer_router_event: Incremented when the single event processing endpoint on the peer listener is hit.
incoming_router_nonspan: Incremented when the trace proxy accepts non-span events (events that are not part of a trace) on the incoming listener.
peer_router_nonspan: Incremented when the trace proxy accepts non-span events (events that are not part of a trace) on the peer listener.
incoming_router_peer: Count of traces received from the trace generator on the incoming listener and routed to a peer.
peer_router_peer: Count of traces received from a peer and routed on to another peer.
incoming_router_proxied: Count of traces received from the trace generator on the incoming listener and proxied onward.
peer_router_proxied: Count of traces received from a peer and proxied onward.
incoming_router_span: Incremented when the trace proxy accepts events that are part of a trace (spans) on the incoming listener.
peer_router_span: Incremented when the trace proxy accepts events that are part of a trace (spans) on the peer listener.
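
The two _router_dropped counters are the ones to watch, since they represent spans lost before processing. A hedged PromQL sketch, assuming both counters are scraped under these names:

  # Spans dropped at the receive buffers; should stay at zero
  sum(rate(incoming_router_dropped[5m])) + sum(rate(peer_router_dropped[5m])) > 0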

Transmission Metrics

upstream_enqueue_errors: Count of spans that encountered errors while being enqueued for dispatch to OpsRamp.
peer_enqueue_errors: Count of spans that encountered errors while being enqueued for dispatch to a peer.
upstream_response_errors: Count of spans that received an error response, or a status code greater than 202, from upstream addresses.
peer_response_errors: Count of spans that received an error response, or a status code greater than 202, from peer addresses.
upstream_response_20x: Count of spans that received no error response and a status code below 203 (i.e., a 20x) from upstream addresses.
peer_response_20x: Count of spans that received no error response and a status code below 203 (i.e., a 20x) from peer addresses.
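
These counters can be combined into an error ratio for each transmission path. A minimal PromQL sketch, assuming the counters are scraped under these names:

  # Fraction of upstream requests that failed (error or status code above 202)
  sum(rate(upstream_response_errors[5m]))
    /
  (sum(rate(upstream_response_errors[5m])) + sum(rate(upstream_response_20x[5m])))

The same expression with the peer_ metrics gives the error ratio for peer-to-peer traffic.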

Sampling Metrics

dynsampler_num_dropped: Count of traces dropped by dynamic sampling.
rulessampler_num_dropped: Count of traces dropped by rules-based sampling.
dynsampler_num_kept: Count of traces kept (not dropped) by dynamic sampling.
rulessampler_num_kept: Count of traces kept (not dropped) by rules-based sampling.
dynsampler_sample_rate: Records avg, max, min, p50, p95, and p99 of the sample rate reported by the configured dynamic sampler.
rulessampler_sample_rate: Sample rate specified in the configuration section of the rules-based sampler.
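
The kept/dropped pairs show how aggressively each sampler is cutting traffic. A hedged PromQL sketch, assuming the counters are scraped under these names:

  # Share of traces dropped by the dynamic sampler over the last 10 minutes
  sum(rate(dynsampler_num_dropped[10m]))
    /
  (sum(rate(dynsampler_num_dropped[10m])) + sum(rate(dynsampler_num_kept[10m])))

Substituting the rulessampler_ metrics gives the same view for the rules-based sampler.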

Cuckoo Cache Metrics

This wraps a cuckoo filter implementation so that it can keep running indefinitely without filling up. A cuckoo filter cannot be emptied (you can delete individual items if you know what they are, but you cannot enumerate them from the filter). We therefore keep two filters, current and future. The current filter is the one checked against, and every add goes to both. Because the future filter is started later than the current one, when the current filter gets too full we can discard it, promote future to current, and start a new, empty future filter. This is why the future filter is nil until the current filter reaches a load factor of 0.5.

cuckoo_current_capacity: Size of the cuckoo cache of dropped traces, as specified in the configuration section.
cuckoo_future_load_factor: Fraction of slots occupied in the future filter.
cuckoo_current_load_factor: Fraction of slots occupied in the current filter.
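
The rotation described above is visible in these gauges. A minimal PromQL sketch, assuming the gauges are scraped under these names:

  # The current filter fills toward its rotation point; the future filter only
  # starts reporting a load factor once the current filter passes 0.5
  cuckoo_current_load_factor
  cuckoo_future_load_factor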