Prerequisites
Before you can visualize Lakehouse metrics in your own Grafana instance, enable Metrics Remote Write on your Cloud Environment to forward StreamNative Cloud metrics to your Prometheus-compatible monitoring system or Datadog.
Grafana Dashboard
A pre-built Grafana dashboard is available as CompactionScheduler.json in the apache-pulsar-grafana-dashboard repository. Import it into your Grafana instance for comprehensive monitoring.
How to Import
- Download CompactionScheduler.json from the repository.
- Open Grafana -> Dashboards -> Import.
- Upload CompactionScheduler.json or paste the JSON content.
- Select your Prometheus data source.
- Click Import.
Dashboard Overview
The dashboard is organized into the following sections:
| Section | Description |
|---|---|
| Overview | Topic count, task count, publish/compact/commit failed tasks, commit batch size |
| Compaction Write | Compaction lag, task publish lag, task stats, non-committable tasks, throughput (bytes/messages), latencies for compaction duration, WAL read, Parquet write, task commit, lakehouse commit, end-to-end pipeline |
| Persistent API | Read throughput, read latencies (index+data, message, Oxia index, Oxia metadata) |
| WAL | Read cache eviction/loading rate, WAL read latency, S3 cache loading latency |
| S3 | S3 read throughput, request rate, S3 read latency |
| Compaction Read | Lakehouse read bytes/messages, read latency |
| Compaction Write Details | Lakehouse write/encode/before-write/write-record latencies, Parquet write-record/write-metadata latencies |
| DLQ Tasks | Dead Letter Queue task statistics |
Key Alerts
Set up alerting rules for the following metrics:
| Metric | Alert Condition | Severity |
|---|---|---|
| pulsar_storage_compact_lag | Compaction lag exceeds threshold per topic | Warning |
| compaction_cluster_leaders_ratio | Sum across cluster is not exactly 1 | Critical |
| pulsar_storage_compact_quarantined_topics_count | Greater than 0 | Warning |
| pulsar_storage_compact_topics_in_dlq | Greater than 0 | Critical |
| pulsar_storage_compact_tasks_in_dlq | Greater than 0 | Critical |
| pulsar_storage_compact_publish_task_failed_count_total | Increasing | Warning |
| pulsar_storage_compact_failed_task_count_total | Increasing | Warning |
| pulsar_storage_compact_task_commit_duration_seconds_count{pulsar_response_status="failed"} | Increasing | Critical |
| pulsar_subscription_back_log | Backlog exceeds threshold | Warning |
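
The conditions in the table above can be written as Prometheus alerting rules. The following is a minimal sketch, not an official rule set: the alert names, the 10-minute window, and both thresholds are hypothetical placeholders, and "Increasing" is interpreted here as `increase(...) > 0` over that window. Only the metric names and severities come from the table.

```python
import json

# Hypothetical defaults; tune them for your workload.
WINDOW = "10m"                     # lookback used for the "Increasing" conditions
COMPACT_LAG_THRESHOLD = 100_000    # assumed value, not a recommended one
BACKLOG_THRESHOLD = 100_000        # assumed value, not a recommended one


def increasing(selector, window=WINDOW):
    """Return PromQL that is true when a counter grew within the window."""
    return f"increase({selector}[{window}]) > 0"


def build_alert_rules():
    """Return one alerting rule per row of the Key Alerts table, in table order."""
    rows = [
        ("CompactionLagHigh",
         f"pulsar_storage_compact_lag > {COMPACT_LAG_THRESHOLD}", "warning"),
        ("CompactionLeaderCountInvalid",
         "sum(compaction_cluster_leaders_ratio) != 1", "critical"),
        ("CompactionTopicsQuarantined",
         "pulsar_storage_compact_quarantined_topics_count > 0", "warning"),
        ("CompactionTopicsInDlq",
         "pulsar_storage_compact_topics_in_dlq > 0", "critical"),
        ("CompactionTasksInDlq",
         "pulsar_storage_compact_tasks_in_dlq > 0", "critical"),
        ("CompactionPublishTaskFailures",
         increasing("pulsar_storage_compact_publish_task_failed_count_total"), "warning"),
        ("CompactionTaskFailures",
         increasing("pulsar_storage_compact_failed_task_count_total"), "warning"),
        ("CompactionCommitFailures",
         increasing('pulsar_storage_compact_task_commit_duration_seconds_count'
                    '{pulsar_response_status="failed"}'), "critical"),
        ("SubscriptionBacklogHigh",
         f"pulsar_subscription_back_log > {BACKLOG_THRESHOLD}", "warning"),
    ]
    return [{"alert": name, "expr": expr, "labels": {"severity": severity}}
            for name, expr, severity in rows]


if __name__ == "__main__":
    # Print a rule group in the Prometheus rule-file layout for review.
    group = {"groups": [{"name": "lakehouse-compaction", "rules": build_alert_rules()}]}
    print(json.dumps(group, indent=2))
```

Review the printed expressions against your own label set before loading them into your monitoring system. In particular, the lag and backlog thresholds depend on topic volume.
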
Compaction Service Metrics
The compaction service has three stages: task publishing (leader), WAL-to-Parquet conversion (worker), and commit to lakehouse (leader).
Task Lifecycle
| Metric | Type | Description |
|---|---|---|
| pulsar_storage_compact_ongoing_topic_count | Gauge | Number of topics currently undergoing compaction |
| pulsar_storage_compact_ongoing_task_count | Gauge | Number of active compaction tasks in progress |
| pulsar_storage_compact_tasks_in_init_state | Gauge | Tasks in initialization state |
| pulsar_storage_compact_tasks_in_compacted_state | Gauge | Tasks in compacted state |
| pulsar_storage_compact_tasks_in_prepared_commit_state | Gauge | Tasks in prepared commit state |
| pulsar_storage_compact_tasks_in_committed_state | Gauge | Tasks in committed state |
Throughput
| Metric | Type | Description |
|---|---|---|
| pulsar_storage_compact_bytes_total | Counter | Total bytes processed during compaction |
| pulsar_storage_compact_messages_total | Counter | Total messages processed during compaction |
| pulsar_storage_compact_published_task_bytes | Gauge | Size in bytes of messages batched in one compaction task |
| pulsar_storage_compact_committed_parquet_file_bytes | Gauge | Size in bytes of committed Parquet files |
| pulsar_storage_compact_commit_task_batch_size | Gauge | Number of Parquet files in a single commit batch |
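
The counters above only ever grow, so throughput is read as a rate between two scrapes. The sketch below shows that arithmetic for a pair of samples. The function name and the sample values are illustrative, and treating a drop in value as a counter reset is an assumption that mirrors common Prometheus practice rather than anything specific to this service.

```python
def counter_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Per-second rate between two samples of a monotonically increasing counter.

    Timestamps are in seconds. If the counter went down, it is assumed to have
    been reset to zero, so the current value is taken as the increase.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        delta = curr_value  # assumed counter reset between the two samples
    return delta / elapsed


# Two hypothetical samples of pulsar_storage_compact_bytes_total, 60 s apart.
bytes_per_second = counter_rate(1_000_000, 0, 4_000_000, 60)
print(bytes_per_second)  # 50000.0
```

In a Prometheus-compatible system the same calculation is normally done with `rate()` over a range vector instead of by hand.
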
Offset Tracking
| Metric | Type | Description |
|---|---|---|
| pulsar_storage_compact_latest_message_offset | Gauge | Latest message offset for each topic |
| pulsar_storage_compact_latest_published_offset | Gauge | Latest published task’s message offset |
| pulsar_storage_compact_last_compacted_offset | Gauge | Latest offset confirmed as fully committed to lakehouse |
| pulsar_storage_compact_lag | Gauge | Difference between latest message offset and last compacted offset |
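
The lag definition above is a simple subtraction, and the published offset lets you see where that lag sits. The sketch below assumes the three offsets for a topic are directly comparable. The function name, the two-way split, and the sample values are illustrative and not part of the product.

```python
def lag_breakdown(latest_message_offset, latest_published_offset, last_compacted_offset):
    """Split compaction lag into messages not yet published as a task
    and messages published but not yet committed to the lakehouse."""
    awaiting_publish = latest_message_offset - latest_published_offset
    awaiting_commit = latest_published_offset - last_compacted_offset
    return {
        # Same definition as pulsar_storage_compact_lag in the table above.
        "lag": latest_message_offset - last_compacted_offset,
        "awaiting_publish": awaiting_publish,
        "awaiting_commit": awaiting_commit,
    }


# Hypothetical offsets for one topic.
print(lag_breakdown(1500, 1200, 900))
# {'lag': 600, 'awaiting_publish': 300, 'awaiting_commit': 300}
```

A large `awaiting_publish` share points at task publishing, while a large `awaiting_commit` share points at conversion or commit.
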
Latency
| Metric | Type | Description |
|---|---|---|
| pulsar_storage_compact_duration_seconds_bucket | Histogram | Total latency of a compaction task |
| pulsar_storage_compact_read_messages_duration_seconds_bucket | Histogram | Latency for reading messages from WAL files |
| pulsar_storage_compact_write_messages_duration_seconds_bucket | Histogram | Latency for decoding, converting, and writing to Parquet |
| pulsar_storage_compact_task_commit_duration_seconds_bucket | Histogram | Latency for committing a task (includes Oxia index + catalog snapshot) |
| pulsar_storage_compact_commit_to_lakehouse_duration_seconds_bucket | Histogram | Latency for committing snapshot to catalog service only |
| pulsar_storage_compact_message_from_ursa_to_parquet_duration_seconds_bucket | Histogram | End-to-end latency: message write to Parquet file write |
| pulsar_storage_compact_message_end_to_end_duration_seconds_bucket | Histogram | End-to-end latency: message write to lakehouse commit |
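
The `_bucket` series above are cumulative histogram buckets, so percentiles such as p95 are estimated rather than read directly. The sketch below illustrates the usual linear-interpolation estimate that dashboards typically obtain from PromQL's `histogram_quantile`. It is a simplified stand-in, and the bucket bounds and counts are made up for the example.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound_seconds, cumulative_count) pairs sorted
    by upper bound, ending with the +Inf bucket. Values are assumed to be
    spread evenly within each bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("the last bucket must have an upper bound of +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Above the highest finite bound, or nothing to interpolate.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical buckets: 10 tasks <= 0.1 s, 60 <= 0.5 s, 90 <= 1 s, 100 in total.
sample = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
print(estimate_quantile(0.5, sample))   # roughly 0.42 seconds
print(estimate_quantile(0.95, sample))  # falls in the +Inf bucket, so 1.0
```

The second call shows a limit of bucketed data: when the requested percentile lands in the +Inf bucket, the estimate is capped at the highest finite bound.
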
Failures
| Metric | Type | Description |
|---|---|---|
| pulsar_storage_compact_publish_task_failed_count_total | Counter | Total failed task publications |
| pulsar_storage_compact_failed_task_count_total | Counter | Total failed WAL-to-Parquet conversions |
| pulsar_storage_compact_quarantined_topics_count | Gauge | Topics quarantined due to compaction failures |
| pulsar_storage_compact_topics_in_dlq | Gauge | Topics in Dead Letter Queue |
| pulsar_storage_compact_tasks_in_dlq | Gauge | Tasks in Dead Letter Queue |
| pulsar_storage_compact_non_committable_task_count | Counter | Non-committable tasks exceeding threshold |
| pulsar_storage_compact_non_committable_task_histogram_bytes_bucket | Histogram | Size distribution of non-committable tasks |
WAL Storage Metrics
| Metric | Type | Description |
|---|---|---|
| pulsar_storage_wal_putEntry_count_total | Counter | Total entries written to WAL |
| pulsar_storage_wal_putEntry_rejected_count_total | Counter | Total entries rejected during WAL write |
| pulsar_storage_wal_putEntry_duration_seconds_bucket | Histogram | WAL write latency |
| pulsar_storage_wal_putEntry_pending_duration_seconds_bucket | Histogram | Time entries wait in WAL buffer |
| pulsar_storage_wal_putEntry_cache_duration_seconds_bucket | Histogram | Write cache write latency |
| pulsar_storage_wal_getEntries_duration_seconds_bucket | Histogram | Batch read latency (cache or backend) |
| pulsar_storage_wal_getEntry_duration_seconds_bucket | Histogram | Single entry read latency |
| pulsar_storage_wal_writeCache_flush_duration_seconds_bucket | Histogram | Write cache flush latency |
| pulsar_storage_wal_readCache_loading_count_total | Counter | Read cache loads from backend |
| pulsar_storage_wal_readCache_eviction_count_total | Counter | Read cache evictions |
| pulsar_storage_wal_readCache_loading_duration_seconds_bucket | Histogram | Cache loading latency |
| pulsar_storage_wal_read_cache_missed_total | Counter | Read cache misses |
| pulsar_storage_wal_putEntry_pending_count | Gauge | Entries queued in WAL pending buffer |
| pulsar_storage_wal_writeCache_flushCallback_pending_count | Gauge | Pending flush acknowledgments |
| pulsar_storage_wal_readCache_size_bytes | Gauge | Current read cache size |
Write Cache Metrics
| Metric | Type | Description |
|---|---|---|
| pulsar_storage_wal_writeCache_used_bytes | Gauge | Write cache utilization |
| pulsar_storage_wal_writeCache_bufferSegment_used | Gauge | Buffer segments in use |
| pulsar_storage_wal_writeCache_cacheSegment_used | Gauge | Cache segments in use |
| pulsar_storage_wal_writeCache_segment_count | Gauge | Total allocated segments |
| pulsar_storage_wal_writeCache_capacity_bytes | Gauge | Max capacity per segment |
File Storage Metrics
| Metric | Type | Description |
|---|---|---|
| pulsar_storage_backend_storage_request_total | Counter | Total backend storage operations |
| pulsar_storage_backend_write_duration_seconds_bucket | Histogram | Backend write latency |
| pulsar_storage_backend_read_duration_seconds_bucket | Histogram | Backend read latency |
| pulsar_storage_backend_metadata_read_duration_seconds_bucket | Histogram | Metadata read latency |
| pulsar_storage_backend_crc_duration_seconds_bucket | Histogram | CRC calculation latency |
| pulsar_storage_backend_delete_duration_seconds_bucket | Histogram | Object deletion latency |
| pulsar_storage_backend_write_bytes_count_bytes_total | Counter | Total bytes written to backend |
| pulsar_storage_backend_read_bytes_count_bytes_total | Counter | Total bytes read from backend |
Lakehouse Read Metrics
| Metric | Type | Description |
|---|---|---|
| pulsar_storage_lakehouse_read_messages_total | Counter | Total messages read from lakehouse (Parquet files) |
| pulsar_storage_lakehouse_read_bytes_bytes_total | Counter | Total bytes read from lakehouse |
| pulsar_storage_lakehouse_read_request_total | Counter | Total read requests processed |
| pulsar_storage_lakehouse_read_cache_hit_total | Counter | Parquet prefetch cache hits |
| pulsar_storage_lakehouse_read_cache_miss_total | Counter | Parquet prefetch cache misses |
| pulsar_storage_lakehouse_read_latency_seconds_bucket | Histogram | Read latency |
| pulsar_storage_lakehouse_read_request_queued_latency_seconds_bucket | Histogram | Queue wait time before processing |
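
The cache hit and miss counters above are most useful when combined into a hit ratio over a time window. The sketch below assumes you have already taken the increase of each counter over the same window. The function name and the sample numbers are illustrative.

```python
def cache_hit_ratio(hits, misses):
    """Fraction of Parquet prefetch cache lookups served from cache.

    hits and misses are the increases of the hit and miss counters over the
    same window. Returns None when there were no lookups in that window.
    """
    total = hits + misses
    if total == 0:
        return None
    return hits / total


# Hypothetical increases over a 5-minute window.
print(cache_hit_ratio(900, 100))  # 0.9
print(cache_hit_ratio(0, 0))      # None
```

Returning `None` for an idle window avoids reporting a misleading 0% or 100% ratio when no reads took place.
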
Lakehouse Writer Metrics
| Metric | Type | Description |
|---|---|---|
| pulsar_storage_lakehouse_writer_before_write_duration | Histogram | Pre-write operation latency |
| pulsar_storage_lakehouse_writer_write_all_duration | Histogram | Batch write latency |
| pulsar_storage_lakehouse_writer_write_record_duration | Histogram | Individual record write latency |
| pulsar_storage_lakehouse_writer_encode_duration | Histogram | Record encoding latency |
Lakehouse Reader Metrics
| Metric | Type | Description |
|---|---|---|
| pulsar_storage_lakehouse_reader_seek_duration | Histogram | Seek operation latency |
| pulsar_storage_lakehouse_reader_read_all_duration | Histogram | Batch read latency |
| pulsar_storage_lakehouse_reader_read_record_duration | Histogram | Individual record read latency |
| pulsar_storage_lakehouse_reader_decode_duration | Histogram | Record decoding latency |
Parquet File Metrics
Writer
| Metric | Type | Description |
|---|---|---|
| pulsar_storage_lakehouse_parquet_write_record_duration | Histogram | Parquet record write latency |
| pulsar_storage_lakehouse_parquet_write_metadata_duration | Histogram | Parquet metadata write latency |
Reader
| Metric | Type | Description |
|---|---|---|
| pulsar_storage_lakehouse_parquet_read_record_duration | Histogram | Parquet record read latency |
| pulsar_storage_lakehouse_parquet_read_metadata_duration | Histogram | Parquet metadata read latency |
| pulsar_storage_lakehouse_parquet_seek_by_offset_duration | Histogram | Seek by offset latency |
| pulsar_storage_lakehouse_parquet_seek_by_secondary_index_duration | Histogram | Seek by secondary index latency |