App Metrics for Tanzu 2.2

Monitor App Metrics

Last Updated October 22, 2024

You can monitor the health of the App Metrics service using the logs, metrics, and Key Performance Indicators (KPIs) emitted by Tanzu Application Service and the App Metrics application.

For more information about monitoring TAS for VMs, see Monitoring TAS for VMs in the Tanzu Application Service documentation.

HealthWatch

The primary way to monitor App Metrics is to use HealthWatch. After HealthWatch is installed, navigate to the JobHealth dashboard to view the App Metrics deployment, which is named appMetrics.

HealthWatch also supports alerting based on VM persistent disk percentage (system.disk.persistent.percent) and VM health (system.healthy).

App Metrics dashboard

The App Metrics dashboard for the appmetrics application also displays its platform indicators as custom metrics. App Metrics also supports alerting based on dashboard indicators.

Key performance indicators

Key Performance Indicators (KPIs) for App Metrics are the metrics that you might find most useful for monitoring the App Metrics service. KPIs are high-signal metrics that can indicate emerging issues.

App Metrics for VMware Tanzu provides the following KPIs as general alerting and response guidance for typical App Metrics installations. You can continue to fine-tune the alert measures for your installation by observing historical trends.

You can expand beyond this guidance and create new, installation-specific monitoring metrics, thresholds, and alerts based on what you learn from your own installations.

BOSH metrics

All BOSH deployed components generate the following metrics. You can monitor them to verify they are not consuming excess resources.


Log Store VMs (log-store-vms)

Metric: disk_persistent_percent
Description: Percentage of VM persistent disk used for Log Store.

Use: Make sure that the disks of the data services do not fill up, because full disks cause data loss and performance degradation.

Type: percent
PromQL used: avg(avg_over_time(system_disk_persistent_percent{deployment=~'log-store-prod',job='log-store',source_id='bosh-system-metrics-forwarder'}[60s])) by (index)
Recommended alert thresholds:
Yellow warning: > 70%
Red critical: > 85%
Recommended response: Scale Log Store disks up vertically as needed to prevent data loss. Scaling horizontally results in data loss.

PostgreSQL VM (db-and-errand-runner)

Metric: disk_persistent_percent
Description: Percentage of VM persistent disk used for PostgreSQL.

Use: The PostgreSQL VM stores custom indicator files, configured monitors, and triggered alerts. When the disk fills up, you cannot further customize dashboards and monitors, and new alert triggers are not displayed on metrics graphs.

PromQL used: avg(avg_over_time(system_disk_persistent_percent{deployment=~'appMetrics-.*',job='db-and-errand-runner',source_id='bosh-system-metrics-forwarder'}[60s]))
Type: percent
Recommended alert thresholds:
Yellow warning: > 90%
Red critical: > 95%
Recommended response: Scale up disk as appropriate. Further customization is not available while scaling occurs.
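
The recommended thresholds for the two BOSH KPIs above can be captured in a small lookup table. The following Python sketch is illustrative only: the names `DISK_THRESHOLDS` and `classify_disk_usage` are hypothetical and are not part of App Metrics or HealthWatch. Only the job names and threshold values are taken from this page.

```python
# Illustrative sketch; all names here are hypothetical.
# Maps each BOSH job to its (yellow, red) persistent-disk thresholds in percent,
# using the recommended values listed on this page.
DISK_THRESHOLDS = {
    "log-store": (70.0, 85.0),
    "db-and-errand-runner": (90.0, 95.0),
}


def classify_disk_usage(job: str, percent_used: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a disk_persistent_percent value."""
    yellow, red = DISK_THRESHOLDS[job]
    # The thresholds are strict ("> 70%"), so a value equal to a threshold
    # stays in the lower band.
    if percent_used > red:
        return "critical"
    if percent_used > yellow:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for job, value in [("log-store", 72.5), ("db-and-errand-runner", 72.5)]:
        print(job, value, classify_disk_usage(job, value))
```

The same disk percentage can warrant a warning on a Log Store VM while remaining acceptable on the PostgreSQL VM, because the two jobs have different recommended thresholds.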

Application metrics

All applications pushed using Cloud Foundry automatically emit the following application metrics. App Metrics is a single application and can be monitored by App Metrics or by another application monitoring service. The following KPIs can indicate problems with App Metrics and are useful for monitoring any application. Non-routed applications return no data or all zeros for Latency, Errors, and Traffic metrics.


Latency

Description: The amount of time to service a request.

Use: Slow feedback is a symptom of degraded performance.

PromQL used: (sum(rate(http_duration_seconds_sum{source_id="$sourceId"}[60s])) by (process_type, source_id) / sum(rate(http_duration_seconds_count{source_id="$sourceId"}[60s])) by (process_type, source_id) * 1000)
Type: milliseconds
Recommended response: Scale up as appropriate.
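
The latency query above divides the rate of accumulated request time by the rate of completed requests, then converts seconds to milliseconds. The arithmetic can be sketched in Python as follows. This is a simplified illustration, not how the query engine is implemented: it ignores counter resets and the extrapolation that PromQL `rate()` applies, and the function name `average_latency_ms` is hypothetical.

```python
from typing import Optional


def average_latency_ms(
    sum_start: float, sum_end: float, count_start: float, count_end: float
) -> Optional[float]:
    """Average request latency in milliseconds over one window.

    sum_*   : http_duration_seconds_sum at the start and end of the window
    count_* : http_duration_seconds_count at the start and end of the window
    """
    requests = count_end - count_start
    if requests <= 0:
        # No requests completed in the window. PromQL would yield NaN (0 / 0);
        # returning None is a choice made for this sketch.
        return None
    seconds_spent = sum_end - sum_start
    # Both rates share the same window length, so it cancels out of the ratio.
    return seconds_spent / requests * 1000


if __name__ == "__main__":
    # 1.5 seconds of request time spread over 30 requests.
    print(average_latency_ms(12.0, 13.5, 100, 130))
```

Because both counters are measured over the same 60-second window, the window length cancels out and the result is the mean time per request.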

Traffic

Description: The amount of time to service a request.

Use: Slow feedback is a symptom of degraded performance.

PromQL used: (sum(rate(http_duration_seconds_sum{source_id="$sourceId"}[60s])) by (process_type, source_id) / sum(rate(http_duration_seconds_count{source_id="$sourceId"}[60s])) by (process_type, source_id) * 1000)
Type: milliseconds
Recommended response: Scale up as appropriate.

Errors

Description: The rate of failed requests, for example, the number of 500 status responses.

Use: Any number of failures indicates a problem with the application or underlying infrastructure.

PromQL used: sum((rate(http_total{source_id="$sourceId",status_code="500"}[60s:30s])) * 60) by (process_type, source_id)
Type: count
Recommended response: Investigate the application metrics and logs, and check the metrics.sys.DOMAIN/integration-status endpoint.
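
The errors query above takes the per-second rate of the `http_total` counter for 500 responses and multiplies it by 60 to express it as a count per minute. The following Python sketch shows that conversion on a handful of counter samples. It is a simplified illustration under stated assumptions: it does not handle counter resets or the extrapolation that PromQL `rate()` performs, and the names `Sample` and `errors_per_minute` are hypothetical.

```python
from typing import Sequence, Tuple

# (timestamp in seconds, cumulative counter value)
Sample = Tuple[float, float]


def errors_per_minute(samples: Sequence[Sample]) -> float:
    """Approximate rate(counter) * 60 from the first and last sample in a window."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    elapsed = t_last - t_first
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    per_second = (v_last - v_first) / elapsed
    return per_second * 60


if __name__ == "__main__":
    # Three samples taken 30 seconds apart, as in a [60s:30s] subquery.
    window = [(0.0, 4.0), (30.0, 5.0), (60.0, 7.0)]
    print(errors_per_minute(window))
```

With a 60-second window, the result is approximately the number of 500 responses that occurred during that window.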

Saturation

Description: The amount of resources being used by the application.

Use: Saturation is made up of CPU, memory, and disk usage. Performance might degrade as resource usage approaches the saturation point.

CPU PromQL used: avg(avg_over_time(cpu{source_id="$sourceId"}[60s])) by (process_type, source_id)
CPU Type: percent
Memory PromQL used: avg(memory{source_id="$sourceId"} / memory_quota{source_id="$sourceId"}) by (process_type, source_id) * 100
Memory Type: percent
Disk PromQL used: avg(disk{source_id="$sourceId"} / disk_quota{source_id="$sourceId"}) by (process_type, source_id) * 100
Disk Type: percent
Recommended alert thresholds for App Metrics:
Yellow warning: > 80%
Red critical: > 90%
Recommended response: Scale up memory and disk quota on the app as appropriate and turn off the push-apps errand on the tile.
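
The memory and disk queries above express usage as a percentage of the quota. A minimal Python sketch of that calculation, combined with the recommended thresholds, might look like the following. The function names are hypothetical, and reporting the most saturated resource is a design choice of this sketch rather than documented App Metrics behavior.

```python
from typing import Dict, Tuple


def saturation_percent(used: float, quota: float) -> float:
    """Usage as a percentage of quota, mirroring `memory / memory_quota * 100`."""
    if quota <= 0:
        raise ValueError("quota must be positive")
    return used / quota * 100


def classify_saturation(percent: float) -> str:
    """Apply the recommended thresholds: yellow above 80%, red above 90%."""
    if percent > 90:
        return "critical"
    if percent > 80:
        return "warning"
    return "ok"


def worst_resource(percents: Dict[str, float]) -> Tuple[str, str]:
    """Return the most saturated resource and its status."""
    name = max(percents, key=lambda key: percents[key])
    return name, classify_saturation(percents[name])


if __name__ == "__main__":
    usage = {
        "cpu": 35.0,  # the cpu metric is already a percentage
        "memory": saturation_percent(850, 1000),
        "disk": saturation_percent(100, 1000),
    }
    print(worst_resource(usage))
```

Checking the most saturated resource first helps decide whether to scale memory or disk quota.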