Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.streamnative.io/llms.txt

Use this file to discover all available pages before exploring further.

Grafana Dashboard

A pre-built Grafana dashboard is available as CompactionScheduler.json in the apache-pulsar-grafana-dashboard repository. Import it into your Grafana instance for comprehensive monitoring.

How to Import

  1. Download CompactionScheduler.json from the repository.
  2. Open Grafana -> Dashboards -> Import.
  3. Upload CompactionScheduler.json or paste the JSON content.
  4. Select your Prometheus data source.
  5. Click Import.

Dashboard Overview

The dashboard is organized into the following sections:
SectionDescription
OverviewTopic count, task count, publish/compact/commit failed tasks, commit batch size
Compaction WriteCompaction lag, task publish lag, task stats, non-committable tasks, throughput (bytes/messages), latencies for compaction duration, WAL read, Parquet write, task commit, lakehouse commit, end-to-end pipeline
Persistent APIRead throughput, read latencies (index+data, message, Oxia index, Oxia metadata)
WALRead cache eviction/loading rate, WAL read latency, S3 cache loading latency
S3S3 read throughput, request rate, S3 read latency
Compaction ReadLakehouse read bytes/messages, read latency
Compaction Write DetailsLakehouse write/encode/before-write/write-record latencies, Parquet write-record/write-metadata latencies
DLQ TasksDead Letter Queue task statistics

Key Alerts

These metrics should be monitored with alerting rules:
MetricAlert ConditionSeverity
pulsar_storage_compact_lagCompaction lag exceeds threshold per topicWarning
compaction_cluster_leaders_ratioSum across cluster is not exactly 1Critical
pulsar_storage_compact_quarantined_topics_countGreater than 0Warning
pulsar_storage_compact_topics_in_dlqGreater than 0Critical
pulsar_storage_compact_tasks_in_dlqGreater than 0Critical
pulsar_storage_compact_publish_task_failed_count_totalIncreasingWarning
pulsar_storage_compact_failed_task_count_totalIncreasingWarning
pulsar_storage_compact_task_commit_duration_seconds_count{pulsar_response_status="failed"}IncreasingCritical
pulsar_subscription_back_logBacklog exceeds thresholdWarning

Compaction Service Metrics

The compaction service has three stages: task publishing (leader), WAL-to-Parquet conversion (worker), and commit to lakehouse (leader).

Task Lifecycle

MetricTypeDescription
pulsar_storage_compact_ongoing_topic_countGaugeNumber of topics currently undergoing compaction
pulsar_storage_compact_ongoing_task_countGaugeNumber of active compaction tasks in progress
pulsar_storage_compact_tasks_in_init_stateGaugeTasks in initialization state
pulsar_storage_compact_tasks_in_compacted_stateGaugeTasks in compacted state
pulsar_storage_compact_tasks_in_prepared_commit_stateGaugeTasks in prepared commit state
pulsar_storage_compact_tasks_in_committed_stateGaugeTasks in committed state

Throughput

MetricTypeDescription
pulsar_storage_compact_bytes_totalCounterTotal bytes processed during compaction
pulsar_storage_compact_messages_totalCounterTotal messages processed during compaction
pulsar_storage_compact_published_task_bytesGaugeSize in bytes of messages batched in one compaction task
pulsar_storage_compact_committed_parquet_file_bytesGaugeSize in bytes of committed Parquet files
pulsar_storage_compact_commit_task_batch_sizeGaugeNumber of Parquet files in a single commit batch

Offset Tracking

MetricTypeDescription
pulsar_storage_compact_latest_message_offsetGaugeLatest message offset for each topic
pulsar_storage_compact_latest_published_offsetGaugeLatest published task’s message offset
pulsar_storage_compact_last_compacted_offsetGaugeLatest offset confirmed as fully committed to lakehouse
pulsar_storage_compact_lagGaugeDifference between latest message offset and last compacted offset

Latency

MetricTypeDescription
pulsar_storage_compact_duration_seconds_bucketHistogramTotal latency of a compaction task
pulsar_storage_compact_read_messages_duration_seconds_bucketHistogramLatency for reading messages from WAL files
pulsar_storage_compact_write_messages_duration_seconds_bucketHistogramLatency for decoding, converting, and writing to Parquet
pulsar_storage_compact_task_commit_duration_seconds_bucketHistogramLatency for committing a task (includes Oxia index + catalog snapshot)
pulsar_storage_compact_commit_to_lakehouse_duration_seconds_bucketHistogramLatency for committing snapshot to catalog service only
pulsar_storage_compact_message_from_ursa_to_parquet_duration_seconds_bucketHistogramEnd-to-end latency: message write to Parquet file write
pulsar_storage_compact_message_end_to_end_duration_seconds_bucketHistogramEnd-to-end latency: message write to lakehouse commit

Failures

MetricTypeDescription
pulsar_storage_compact_publish_task_failed_count_totalCounterTotal failed task publications
pulsar_storage_compact_failed_task_count_totalCounterTotal failed WAL-to-Parquet conversions
pulsar_storage_compact_quarantined_topics_countGaugeTopics quarantined due to compaction failures
pulsar_storage_compact_topics_in_dlqGaugeTopics in Dead Letter Queue
pulsar_storage_compact_tasks_in_dlqGaugeTasks in Dead Letter Queue
pulsar_storage_compact_non_committable_task_countCounterNon-committable tasks exceeding threshold
pulsar_storage_compact_non_committable_task_histogram_bytes_bucketHistogramSize distribution of non-committable tasks

WAL Storage Metrics

MetricTypeDescription
pulsar_storage_wal_putEntry_count_totalCounterTotal entries written to WAL
pulsar_storage_wal_putEntry_rejected_count_totalCounterTotal entries rejected during WAL write
pulsar_storage_wal_putEntry_duration_seconds_bucketHistogramWAL write latency
pulsar_storage_wal_putEntry_pending_duration_seconds_bucketHistogramTime entries wait in WAL buffer
pulsar_storage_wal_putEntry_cache_duration_seconds_bucketHistogramWrite cache write latency
pulsar_storage_wal_getEntries_duration_seconds_bucketHistogramBatch read latency (cache or backend)
pulsar_storage_wal_getEntry_duration_seconds_bucketHistogramSingle entry read latency
pulsar_storage_wal_writeCache_flush_duration_seconds_bucketHistogramWrite cache flush latency
pulsar_storage_wal_readCache_loading_count_totalCounterRead cache loads from backend
pulsar_storage_wal_readCache_eviction_count_totalCounterRead cache evictions
pulsar_storage_wal_readCache_loading_duration_seconds_bucketHistogramCache loading latency
pulsar_storage_wal_read_cache_missed_totalCounterRead cache misses
pulsar_storage_wal_putEntry_pending_countGaugeEntries queued in WAL pending buffer
pulsar_storage_wal_writeCache_flushCallback_pending_countGaugePending flush acknowledgments
pulsar_storage_wal_readCache_size_bytesGaugeCurrent read cache size

Write Cache Metrics

MetricTypeDescription
pulsar_storage_wal_writeCache_used_bytesGaugeWrite cache utilization
pulsar_storage_wal_writeCache_bufferSegment_usedGaugeBuffer segments in use
pulsar_storage_wal_writeCache_cacheSegment_usedGaugeCache segments in use
pulsar_storage_wal_writeCache_segment_countGaugeTotal allocated segments
pulsar_storage_wal_writeCache_capacity_bytesGaugeMax capacity per segment

File Storage Metrics

MetricTypeDescription
pulsar_storage_backend_storage_request_totalCounterTotal backend storage operations
pulsar_storage_backend_write_duration_seconds_bucketHistogramBackend write latency
pulsar_storage_backend_read_duration_seconds_bucketHistogramBackend read latency
pulsar_storage_backend_metadata_read_duration_seconds_bucketHistogramMetadata read latency
pulsar_storage_backend_crc_duration_seconds_bucketHistogramCRC calculation latency
pulsar_storage_backend_delete_duration_seconds_bucketHistogramObject deletion latency
pulsar_storage_backend_write_bytes_count_bytes_totalCounterTotal bytes written to backend
pulsar_storage_backend_read_bytes_count_bytes_totalCounterTotal bytes read from backend

Lakehouse Read Metrics

MetricTypeDescription
pulsar_storage_lakehouse_read_messages_totalCounterTotal messages read from lakehouse (Parquet files)
pulsar_storage_lakehouse_read_bytes_bytes_totalCounterTotal bytes read from lakehouse
pulsar_storage_lakehouse_read_request_totalCounterTotal read requests processed
pulsar_storage_lakehouse_read_cache_hit_totalCounterParquet prefetch cache hits
pulsar_storage_lakehouse_read_cache_miss_totalCounterParquet prefetch cache misses
pulsar_storage_lakehouse_read_latency_seconds_bucketHistogramRead latency
pulsar_storage_lakehouse_read_request_queued_latency_seconds_bucketHistogramQueue wait time before processing

Lakehouse Writer Metrics

MetricTypeDescription
pulsar_storage_lakehouse_writer_before_write_durationHistogramPre-write operation latency
pulsar_storage_lakehouse_writer_write_all_durationHistogramBatch write latency
pulsar_storage_lakehouse_writer_write_record_durationHistogramIndividual record write latency
pulsar_storage_lakehouse_writer_encode_durationHistogramRecord encoding latency

Lakehouse Reader Metrics

MetricTypeDescription
pulsar_storage_lakehouse_reader_seek_durationHistogramSeek operation latency
pulsar_storage_lakehouse_reader_read_all_durationHistogramBatch read latency
pulsar_storage_lakehouse_reader_read_record_durationHistogramIndividual record read latency
pulsar_storage_lakehouse_reader_decode_durationHistogramRecord decoding latency

Parquet File Metrics

Writer

MetricTypeDescription
pulsar_storage_lakehouse_parquet_write_record_durationHistogramParquet record write latency
pulsar_storage_lakehouse_parquet_write_metadata_durationHistogramParquet metadata write latency

Reader

MetricTypeDescription
pulsar_storage_lakehouse_parquet_read_record_durationHistogramParquet record read latency
pulsar_storage_lakehouse_parquet_read_metadata_durationHistogramParquet metadata read latency
pulsar_storage_lakehouse_parquet_seek_by_offset_durationHistogramSeek by offset latency
pulsar_storage_lakehouse_parquet_seek_by_secondary_index_durationHistogramSeek by secondary index latency