# Lakehouse Observability

## Prerequisites

Before you can visualize Lakehouse metrics in your own Grafana instance, [enable Metrics Remote Write](/cloud/log-and-monitor/advanced-observability#metrics-remote-write-integration) on your Cloud Environment to forward StreamNative Cloud metrics to your Prometheus-compatible monitoring system or Datadog.

## Grafana Dashboard

A pre-built Grafana dashboard, [`CompactionScheduler.json`](https://github.com/streamnative/apache-pulsar-grafana-dashboard/tree/master/dashboards.kubernetes), is available in the `dashboards.kubernetes` directory of the [apache-pulsar-grafana-dashboard](https://github.com/streamnative/apache-pulsar-grafana-dashboard) repository. Import it into your Grafana instance to visualize the compaction, WAL, storage, and lakehouse metrics described on this page.

### How to Import

1. Download [`CompactionScheduler.json`](https://github.com/streamnative/apache-pulsar-grafana-dashboard/tree/master/dashboards.kubernetes) from the repository.
2. In Grafana, go to **Dashboards** -> **Import**.
3. Upload `CompactionScheduler.json` or paste the JSON content.
4. Select your Prometheus data source.
5. Click **Import**.
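
If you prefer to script the import instead of using the UI, the steps above can be approximated with Grafana's HTTP API (`POST /api/dashboards/db`). The following is a minimal sketch, not an official procedure: the Grafana URL, the API token, and the local file path are placeholders you supply, and the function names are hypothetical. It also assumes the dashboard JSON already points at a valid data source; if the file contains data source placeholders, the UI import in the steps above is the safer route.

```python
import json
import urllib.request


def build_import_payload(dashboard: dict, overwrite: bool = True) -> dict:
    """Wrap a dashboard definition in the body expected by POST /api/dashboards/db."""
    dashboard = dict(dashboard)  # shallow copy so the caller's dict is left untouched
    dashboard["id"] = None       # let the target Grafana instance assign its own id
    return {"dashboard": dashboard, "overwrite": overwrite}


def import_dashboard(grafana_url: str, token: str, path: str) -> dict:
    """Upload the dashboard file at `path` and return Grafana's JSON response."""
    with open(path, encoding="utf-8") as f:
        payload = build_import_payload(json.load(f))
    request = urllib.request.Request(
        f"{grafana_url.rstrip('/')}/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call such as `import_dashboard("https://grafana.example.com", token, "CompactionScheduler.json")` would upload the downloaded file; the host name here is purely illustrative.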

### Dashboard Overview

The dashboard is organized into the following sections:

| Section                      | Description                                                                                                                                                                                                      |
| ---------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Overview**                 | Topic count, task count, publish/compact/commit failed tasks, commit batch size                                                                                                                                  |
| **Compaction Write**         | Compaction lag, task publish lag, task stats, non-committable tasks, throughput (bytes/messages), latencies for compaction duration, WAL read, Parquet write, task commit, lakehouse commit, end-to-end pipeline |
| **Persistent API**           | Read throughput, read latencies (index+data, message, Oxia index, Oxia metadata)                                                                                                                                 |
| **WAL**                      | Read cache eviction/loading rate, WAL read latency, S3 cache loading latency                                                                                                                                     |
| **S3**                       | S3 read throughput, request rate, S3 read latency                                                                                                                                                                |
| **Compaction Read**          | Lakehouse read bytes/messages, read latency                                                                                                                                                                      |
| **Compaction Write Details** | Lakehouse write/encode/before-write/write-record latencies, Parquet write-record/write-metadata latencies                                                                                                        |
| **DLQ Tasks**                | Dead Letter Queue task statistics                                                                                                                                                                                |

***

## Key Alerts

Configure alerting rules for the following metrics:

| Metric                                                                                       | Alert Condition                            | Severity |
| -------------------------------------------------------------------------------------------- | ------------------------------------------ | -------- |
| `pulsar_storage_compact_lag`                                                                 | Compaction lag exceeds threshold per topic | Warning  |
| `compaction_cluster_leaders_ratio`                                                           | Sum across cluster is not exactly 1        | Critical |
| `pulsar_storage_compact_quarantined_topics_count`                                            | Greater than 0                             | Warning  |
| `pulsar_storage_compact_topics_in_dlq`                                                       | Greater than 0                             | Critical |
| `pulsar_storage_compact_tasks_in_dlq`                                                        | Greater than 0                             | Critical |
| `pulsar_storage_compact_publish_task_failed_count_total`                                     | Increasing                                 | Warning  |
| `pulsar_storage_compact_failed_task_count_total`                                             | Increasing                                 | Warning  |
| `pulsar_storage_compact_task_commit_duration_seconds_count{pulsar_response_status="failed"}` | Increasing                                 | Critical |
| `pulsar_subscription_back_log`                                                               | Backlog exceeds threshold                  | Warning  |
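
The conditions in the table can be turned into Prometheus alerting rules. The sketch below generates one PromQL expression per row and wraps them in a rule-group structure. It is an illustration under several assumptions: the alert names, the 5-minute window, and both threshold values are hypothetical placeholders, and "Increasing" is interpreted as `increase(...) > 0` over that window. Tune all of these to your workload.

```python
import json

# Hypothetical placeholders; the source does not prescribe thresholds or windows.
LAG_THRESHOLD = 100_000
BACKLOG_THRESHOLD = 100_000
WINDOW = "5m"

COMMIT_FAILED = (
    "pulsar_storage_compact_task_commit_duration_seconds_count"
    '{pulsar_response_status="failed"}'
)


def above(metric: str, threshold: int) -> str:
    """PromQL that matches every series whose value exceeds `threshold`."""
    return f"{metric} > {threshold}"


def increasing(metric: str, window: str = WINDOW) -> str:
    """PromQL that matches when a counter grew during `window`."""
    return f"increase({metric}[{window}]) > 0"


# (alert name, expression, severity); the alert names are illustrative only.
ALERTS = [
    ("CompactionLagHigh", above("pulsar_storage_compact_lag", LAG_THRESHOLD), "warning"),
    ("CompactionLeaderCountInvalid", "sum(compaction_cluster_leaders_ratio) != 1", "critical"),
    ("CompactionTopicsQuarantined", above("pulsar_storage_compact_quarantined_topics_count", 0), "warning"),
    ("CompactionTopicsInDlq", above("pulsar_storage_compact_topics_in_dlq", 0), "critical"),
    ("CompactionTasksInDlq", above("pulsar_storage_compact_tasks_in_dlq", 0), "critical"),
    ("CompactionPublishFailures", increasing("pulsar_storage_compact_publish_task_failed_count_total"), "warning"),
    ("CompactionTaskFailures", increasing("pulsar_storage_compact_failed_task_count_total"), "warning"),
    ("CompactionCommitFailures", increasing(COMMIT_FAILED), "critical"),
    ("SubscriptionBacklogHigh", above("pulsar_subscription_back_log", BACKLOG_THRESHOLD), "warning"),
]


def to_rule_group(name: str = "lakehouse-compaction") -> dict:
    """Build a structure that mirrors the layout of a Prometheus rule file."""
    rules = [
        {"alert": alert, "expr": expr, "labels": {"severity": severity}}
        for alert, expr, severity in ALERTS
    ]
    return {"groups": [{"name": name, "rules": rules}]}
```

Calling `print(json.dumps(to_rule_group(), indent=2))` emits the group as JSON, which is also valid YAML, so the output can serve as a starting point for a rule file.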

***

## Compaction Service Metrics

The compaction service runs in three stages, each handled by the role shown in parentheses: task publishing (leader), WAL-to-Parquet conversion (worker), and commit to the lakehouse (leader).

### Task Lifecycle

| Metric                                                  | Type  | Description                                      |
| ------------------------------------------------------- | ----- | ------------------------------------------------ |
| `pulsar_storage_compact_ongoing_topic_count`            | Gauge | Number of topics currently undergoing compaction |
| `pulsar_storage_compact_ongoing_task_count`             | Gauge | Number of active compaction tasks in progress    |
| `pulsar_storage_compact_tasks_in_init_state`            | Gauge | Tasks in initialization state                    |
| `pulsar_storage_compact_tasks_in_compacted_state`       | Gauge | Tasks in compacted state                         |
| `pulsar_storage_compact_tasks_in_prepared_commit_state` | Gauge | Tasks in prepared commit state                   |
| `pulsar_storage_compact_tasks_in_committed_state`       | Gauge | Tasks in committed state                         |

### Throughput

| Metric                                                | Type    | Description                                              |
| ----------------------------------------------------- | ------- | -------------------------------------------------------- |
| `pulsar_storage_compact_bytes_total`                  | Counter | Total bytes processed during compaction                  |
| `pulsar_storage_compact_messages_total`               | Counter | Total messages processed during compaction               |
| `pulsar_storage_compact_published_task_bytes`         | Gauge   | Size in bytes of messages batched in one compaction task |
| `pulsar_storage_compact_committed_parquet_file_bytes` | Gauge   | Size in bytes of committed Parquet files                 |
| `pulsar_storage_compact_commit_task_batch_size`       | Gauge   | Number of Parquet files in a single commit batch         |

### Offset Tracking

| Metric                                           | Type  | Description                                                        |
| ------------------------------------------------ | ----- | ------------------------------------------------------------------ |
| `pulsar_storage_compact_latest_message_offset`   | Gauge | Latest message offset for each topic                               |
| `pulsar_storage_compact_latest_published_offset` | Gauge | Latest published task's message offset                             |
| `pulsar_storage_compact_last_compacted_offset`   | Gauge | Latest offset confirmed as fully committed to lakehouse            |
| `pulsar_storage_compact_lag`                     | Gauge | Difference between latest message offset and last compacted offset |
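
As a worked example of how these offsets relate, the sketch below computes the lag as defined in the table and splits it into two parts. The split is an interpretation rather than something the metrics expose directly: it assumes the three offsets are on the same scale for a given topic, and the names `unpublished` and `in_flight` are hypothetical labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicOffsets:
    latest_message: int    # pulsar_storage_compact_latest_message_offset
    latest_published: int  # pulsar_storage_compact_latest_published_offset
    last_compacted: int    # pulsar_storage_compact_last_compacted_offset


def lag_breakdown(offsets: TopicOffsets) -> dict:
    """Return the total lag and how it splits around the published offset."""
    return {
        # Matches the table: latest message offset minus last compacted offset.
        "lag": offsets.latest_message - offsets.last_compacted,
        # Messages not yet covered by any published task (assumed meaning).
        "unpublished": offsets.latest_message - offsets.latest_published,
        # Messages in published tasks that are not yet committed (assumed meaning).
        "in_flight": offsets.latest_published - offsets.last_compacted,
    }
```

Under these assumptions, a growing `unpublished` share would point at task publishing, while a growing `in_flight` share would point at conversion or commit.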

### Latency

| Metric                                                                        | Type      | Description                                                            |
| ----------------------------------------------------------------------------- | --------- | ---------------------------------------------------------------------- |
| `pulsar_storage_compact_duration_seconds_bucket`                              | Histogram | Total latency of a compaction task                                     |
| `pulsar_storage_compact_read_messages_duration_seconds_bucket`                | Histogram | Latency for reading messages from WAL files                            |
| `pulsar_storage_compact_write_messages_duration_seconds_bucket`               | Histogram | Latency for decoding, converting, and writing to Parquet               |
| `pulsar_storage_compact_task_commit_duration_seconds_bucket`                  | Histogram | Latency for committing a task (includes Oxia index + catalog snapshot) |
| `pulsar_storage_compact_commit_to_lakehouse_duration_seconds_bucket`          | Histogram | Latency for committing snapshot to catalog service only                |
| `pulsar_storage_compact_message_from_ursa_to_parquet_duration_seconds_bucket` | Histogram | End-to-end latency: message write to Parquet file write                |
| `pulsar_storage_compact_message_end_to_end_duration_seconds_bucket`           | Histogram | End-to-end latency: message write to lakehouse commit                  |
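
The `_bucket` series are cumulative histogram buckets, so percentiles are estimated rather than read directly. In PromQL this is typically written as `histogram_quantile(0.99, sum by (le) (rate(pulsar_storage_compact_duration_seconds_bucket[5m])))`, where the 5-minute window is an arbitrary choice. The sketch below shows the same linear-interpolation idea in plain Python; the function is a simplified illustration, not Prometheus's exact implementation, and the sample bucket values are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative buckets.

    `buckets` holds (upper bound, cumulative count) pairs sorted by upper
    bound, ending with the +Inf bucket, as in a Prometheus histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up sample: 100 compaction tasks, bucket bounds in seconds.
sample = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
p50 = histogram_quantile(0.5, sample)   # about 0.42 seconds
p99 = histogram_quantile(0.99, sample)  # 1.0, capped at the highest finite bound
```

Because the estimate interpolates within a bucket, its precision depends on how finely the bucket boundaries cover the latency range you care about.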

### Failures

| Metric                                                               | Type      | Description                                   |
| -------------------------------------------------------------------- | --------- | --------------------------------------------- |
| `pulsar_storage_compact_publish_task_failed_count_total`             | Counter   | Total failed task publications                |
| `pulsar_storage_compact_failed_task_count_total`                     | Counter   | Total failed WAL-to-Parquet conversions       |
| `pulsar_storage_compact_quarantined_topics_count`                    | Gauge     | Topics quarantined due to compaction failures |
| `pulsar_storage_compact_topics_in_dlq`                               | Gauge     | Topics in Dead Letter Queue                   |
| `pulsar_storage_compact_tasks_in_dlq`                                | Gauge     | Tasks in Dead Letter Queue                    |
| `pulsar_storage_compact_non_committable_task_count`                  | Counter   | Non-committable tasks exceeding threshold     |
| `pulsar_storage_compact_non_committable_task_histogram_bytes_bucket` | Histogram | Size distribution of non-committable tasks    |

***

## WAL Storage Metrics

| Metric                                                         | Type      | Description                             |
| -------------------------------------------------------------- | --------- | --------------------------------------- |
| `pulsar_storage_wal_putEntry_count_total`                      | Counter   | Total entries written to WAL            |
| `pulsar_storage_wal_putEntry_rejected_count_total`             | Counter   | Total entries rejected during WAL write |
| `pulsar_storage_wal_putEntry_duration_seconds_bucket`          | Histogram | WAL write latency                       |
| `pulsar_storage_wal_putEntry_pending_duration_seconds_bucket`  | Histogram | Time entries wait in WAL buffer         |
| `pulsar_storage_wal_putEntry_cache_duration_seconds_bucket`    | Histogram | Write cache write latency               |
| `pulsar_storage_wal_getEntries_duration_seconds_bucket`        | Histogram | Batch read latency (cache or backend)   |
| `pulsar_storage_wal_getEntry_duration_seconds_bucket`          | Histogram | Single entry read latency               |
| `pulsar_storage_wal_writeCache_flush_duration_seconds_bucket`  | Histogram | Write cache flush latency               |
| `pulsar_storage_wal_readCache_loading_count_total`             | Counter   | Read cache loads from backend           |
| `pulsar_storage_wal_readCache_eviction_count_total`            | Counter   | Read cache evictions                    |
| `pulsar_storage_wal_readCache_loading_duration_seconds_bucket` | Histogram | Cache loading latency                   |
| `pulsar_storage_wal_read_cache_missed_total`                   | Counter   | Read cache misses                       |
| `pulsar_storage_wal_putEntry_pending_count`                    | Gauge     | Entries queued in WAL pending buffer    |
| `pulsar_storage_wal_writeCache_flushCallback_pending_count`    | Gauge     | Pending flush acknowledgments           |
| `pulsar_storage_wal_readCache_size_bytes`                      | Gauge     | Current read cache size                 |

### Write Cache Metrics

| Metric                                             | Type  | Description              |
| -------------------------------------------------- | ----- | ------------------------ |
| `pulsar_storage_wal_writeCache_used_bytes`         | Gauge | Write cache utilization  |
| `pulsar_storage_wal_writeCache_bufferSegment_used` | Gauge | Buffer segments in use   |
| `pulsar_storage_wal_writeCache_cacheSegment_used`  | Gauge | Cache segments in use    |
| `pulsar_storage_wal_writeCache_segment_count`      | Gauge | Total allocated segments |
| `pulsar_storage_wal_writeCache_capacity_bytes`     | Gauge | Max capacity per segment |

***

## File Storage Metrics

| Metric                                                         | Type      | Description                      |
| -------------------------------------------------------------- | --------- | -------------------------------- |
| `pulsar_storage_backend_storage_request_total`                 | Counter   | Total backend storage operations |
| `pulsar_storage_backend_write_duration_seconds_bucket`         | Histogram | Backend write latency            |
| `pulsar_storage_backend_read_duration_seconds_bucket`          | Histogram | Backend read latency             |
| `pulsar_storage_backend_metadata_read_duration_seconds_bucket` | Histogram | Metadata read latency            |
| `pulsar_storage_backend_crc_duration_seconds_bucket`           | Histogram | CRC calculation latency          |
| `pulsar_storage_backend_delete_duration_seconds_bucket`        | Histogram | Object deletion latency          |
| `pulsar_storage_backend_write_bytes_count_bytes_total`         | Counter   | Total bytes written to backend   |
| `pulsar_storage_backend_read_bytes_count_bytes_total`          | Counter   | Total bytes read from backend    |
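
The byte and request counters only ever increase, so throughput is derived from the difference between two samples. The following sketch shows that calculation, including the usual handling of a counter reset (for example after a process restart). The function name and the sample values are illustrative; in PromQL the equivalent is `rate(pulsar_storage_backend_write_bytes_count_bytes_total[5m])`, with the window being your choice.

```python
def counter_rate(prev_value: float, prev_ts: float,
                 curr_value: float, curr_ts: float) -> float:
    """Per-second rate between two counter samples (timestamps in seconds)."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter went backwards, so it was reset; count growth from zero.
        delta = curr_value
    return delta / elapsed


# 3,000 bytes written over 60 seconds -> 50 bytes per second.
write_rate = counter_rate(1_000, 0, 4_000, 60)
```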

***

## Lakehouse Read Metrics

| Metric                                                                | Type      | Description                                        |
| --------------------------------------------------------------------- | --------- | -------------------------------------------------- |
| `pulsar_storage_lakehouse_read_messages_total`                        | Counter   | Total messages read from lakehouse (Parquet files) |
| `pulsar_storage_lakehouse_read_bytes_bytes_total`                     | Counter   | Total bytes read from lakehouse                    |
| `pulsar_storage_lakehouse_read_request_total`                         | Counter   | Total read requests processed                      |
| `pulsar_storage_lakehouse_read_cache_hit_total`                       | Counter   | Parquet prefetch cache hits                        |
| `pulsar_storage_lakehouse_read_cache_miss_total`                      | Counter   | Parquet prefetch cache misses                      |
| `pulsar_storage_lakehouse_read_latency_seconds_bucket`                | Histogram | Read latency                                       |
| `pulsar_storage_lakehouse_read_request_queued_latency_seconds_bucket` | Histogram | Queue wait time before processing                  |
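
The hit and miss counters can be combined into a prefetch cache hit ratio. The sketch below is one way to compute it, assuming you pass the increase of each counter over the same time window rather than the raw cumulative values; the function name is hypothetical.

```python
from typing import Optional


def cache_hit_ratio(hits: float, misses: float) -> Optional[float]:
    """Fraction of lookups served from the cache, or None if there were none.

    `hits` is the increase of pulsar_storage_lakehouse_read_cache_hit_total and
    `misses` the increase of pulsar_storage_lakehouse_read_cache_miss_total,
    both measured over the same window.
    """
    total = hits + misses
    if total == 0:
        return None  # no lookups in the window, so the ratio is undefined
    return hits / total
```

Returning `None` for an idle window avoids reporting a misleading 0% or 100% when no reads occurred.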

***

## Lakehouse Writer Metrics

| Metric                                                  | Type      | Description                     |
| ------------------------------------------------------- | --------- | ------------------------------- |
| `pulsar_storage_lakehouse_writer_before_write_duration` | Histogram | Pre-write operation latency     |
| `pulsar_storage_lakehouse_writer_write_all_duration`    | Histogram | Batch write latency             |
| `pulsar_storage_lakehouse_writer_write_record_duration` | Histogram | Individual record write latency |
| `pulsar_storage_lakehouse_writer_encode_duration`       | Histogram | Record encoding latency         |

## Lakehouse Reader Metrics

| Metric                                                 | Type      | Description                    |
| ------------------------------------------------------ | --------- | ------------------------------ |
| `pulsar_storage_lakehouse_reader_seek_duration`        | Histogram | Seek operation latency         |
| `pulsar_storage_lakehouse_reader_read_all_duration`    | Histogram | Batch read latency             |
| `pulsar_storage_lakehouse_reader_read_record_duration` | Histogram | Individual record read latency |
| `pulsar_storage_lakehouse_reader_decode_duration`      | Histogram | Record decoding latency        |

***

## Parquet File Metrics

### Writer

| Metric                                                     | Type      | Description                    |
| ---------------------------------------------------------- | --------- | ------------------------------ |
| `pulsar_storage_lakehouse_parquet_write_record_duration`   | Histogram | Parquet record write latency   |
| `pulsar_storage_lakehouse_parquet_write_metadata_duration` | Histogram | Parquet metadata write latency |

### Reader

| Metric                                                              | Type      | Description                     |
| ------------------------------------------------------------------- | --------- | ------------------------------- |
| `pulsar_storage_lakehouse_parquet_read_record_duration`             | Histogram | Parquet record read latency     |
| `pulsar_storage_lakehouse_parquet_read_metadata_duration`           | Histogram | Parquet metadata read latency   |
| `pulsar_storage_lakehouse_parquet_seek_by_offset_duration`          | Histogram | Seek by offset latency          |
| `pulsar_storage_lakehouse_parquet_seek_by_secondary_index_duration` | Histogram | Seek by secondary index latency |
