Identify dashboards and monitoring tools that can be used to resolve cluster issues

virtualramblings

Sep 17, 2019

Advanced Pages

Access each page at http://[CVM IP or DNS name]:[Port], using the ports listed below.

2009: Stargate page to monitor backend storage
- Check QoS Queue and OpLog QoS Queue (admitted/outstanding IOs)
- Cache hit rates should be 80-90%+ if the workload is read heavy, for best possible performance
- Check Avg. Latency, Avg. Op Size, Avg. Outstanding
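As a quick way to apply the 80% guideline above, the sketch below turns a pair of hit and miss counts into a percentage and compares it with that threshold. The function names and the idea of feeding it raw counts are assumptions for illustration; the 2009 page reports hit rates directly, so this only helps if you collect counts yourself.

```shell
# cache_hit_pct HITS MISSES -> integer hit percentage, rounded down
cache_hit_pct() {
  hits="$1"
  misses="$2"
  total=$(( hits + misses ))
  if [ "$total" -eq 0 ]; then
    echo 0            # no lookups yet, report 0 rather than divide by zero
    return 0
  fi
  echo $(( 100 * hits / total ))
}

# check_cache HITS MISSES -> "OK n%" at or above 80%, otherwise "LOW n%"
check_cache() {
  pct=$(cache_hit_pct "$1" "$2")
  if [ "$pct" -ge 80 ]; then
    echo "OK ${pct}%"
  else
    echo "LOW ${pct}%"
  fi
}
```

For example, `check_cache 900 100` reports a 90% hit rate as OK, while `check_cache 700 300` flags 70% as LOW.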

Nutanix Advanced Pages — Image credit: https://nutanixbible.com

2009/latency: Stargate page backend latency
2009/vdisk_stats: Stargate page histograms of I/O, latency, writes
- If read latency is high, check the read source for the vDisk to see where I/Os are served from (more than likely, reads are coming from HDD)
- Random/smaller IO (<64K) is written directly to the OpLog. Larger/sequential IO bypasses the OpLog and is written to the Extent Store (Estore)
- Ops and Randomness section shows whether IO is random or sequential
- Working Set Size shows the last 2 minutes and 1 hour
- Read Source shows where IOs are being served from
- Write Destination shows where IOs are going
- Extent Group Up-Migration shows data up-migrated in the last 300, 3600, and 86400 seconds
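The write-path rule in the list above can be expressed as a tiny classifier. This is a deliberately simplified sketch that looks only at the 64K size threshold quoted in these notes; the function name is hypothetical, and the real Stargate decision also depends on how random or sequential the IO is.

```shell
OPLOG_MAX_BYTES=65536   # 64K threshold quoted in the notes (assumed to mean 64 KiB)

# write_destination SIZE_BYTES -> "OpLog" for small IO, "Estore" otherwise
write_destination() {
  if [ "$1" -lt "$OPLOG_MAX_BYTES" ]; then
    echo "OpLog"
  else
    echo "Estore"
  fi
}
```

Comparing the result for a workload's typical op size against the Write Destination section of 2009/vdisk_stats is one way to sanity-check what you expect to see there.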

2009/h/traces: Stargate page to monitor activity traces for operations
2009/h/vars: Stargate page to monitor various counters
2010: Curator page for Curator runs
- Must be viewed on the Curator master
- Partial scan every 60 minutes
- Full scan every 6 hours
- Partial scans have a single MapReduce job; full scans have four
- Scans are triggered by:
  - Periodic (normal state)
  - Disk/Node/Block failure
  - ILM Imbalance
  - Disk/Tier Imbalance

2010/master/control: Curator page for manually starting jobs
2011: Chronos page monitors jobs/tasks scheduled by Curator
2020: Cerebro page monitors protection domains (PDs), replication status, and DR
2020/h/traces: Cerebro page to monitor activity traces for PD/replication
2030: Main Acropolis page for details about hosts/tasks/networking
2030/sched: Acropolis page with info about VM/resource scheduling/placement decisions
2030/tasks: Acropolis page shows info about Acropolis tasks/state
2030/vms: Information about Acropolis VMs
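Because every page above follows the same CVM address, port, and optional path pattern, a small helper can build the URLs for you. This is a minimal sketch: the function name and the service keywords are made up for illustration, and the port numbers are simply the ones listed in these notes.

```shell
# advanced_page_url CVM SERVICE [SUBPATH] -> URL of the matching advanced page
advanced_page_url() {
  cvm="$1"
  service="$2"
  subpath="${3:-}"
  case "$service" in
    stargate)  port=2009 ;;
    curator)   port=2010 ;;
    chronos)   port=2011 ;;
    cerebro)   port=2020 ;;
    acropolis) port=2030 ;;
    *) echo "unknown service: $service" >&2; return 1 ;;
  esac
  if [ -n "$subpath" ]; then
    echo "http://${cvm}:${port}/${subpath}"
  else
    echo "http://${cvm}:${port}"
  fi
}
```

For example, `advanced_page_url 10.0.0.5 stargate vdisk_stats` prints `http://10.0.0.5:2009/vdisk_stats`, which can be opened in a browser or fetched with `curl`, assuming the page is reachable from where you run it. The address 10.0.0.5 is a placeholder.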

Cluster Commands

Cluster Status:

cluster status

Local CVM Status:

genesis status

Check Upgrade Status:

upgrade_status

Stop Cluster Service:

cluster stop [Service Name]

Start Stopped Cluster Services:

cluster start #NOTE: This will start all stopped services

Start Single Service:

cluster start [Service Name]

Restart Local Service:

genesis stop [Service Name]
cluster start #NOTE: Run after the stop to bring the service back up
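A local restart can be wrapped in a small function so the stop is always followed by a start. This sketch assumes the stopped service is brought back with `cluster start`, which these notes describe as starting all stopped services. The function name and the `RUN` dry-run variable are hypothetical conveniences, not part of any Nutanix tooling.

```shell
# restart_local_service SERVICE
# Stops the named service on the local CVM, then starts stopped services again.
# Set RUN=echo to print the commands instead of executing them.
restart_local_service() {
  svc="${1:-}"
  if [ -z "$svc" ]; then
    echo "usage: restart_local_service SERVICE" >&2
    return 2
  fi
  # ${RUN:-} expands to nothing by default, so the real commands run
  ${RUN:-} genesis stop "$svc" && ${RUN:-} cluster start
}
```

Running `RUN=echo restart_local_service stargate` first is a cheap way to confirm what would be executed before touching a live CVM.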

Find Cluster ID:

zeus_config_printer | grep cluster_id

Find AOS Version:

allssh "cat /etc/nutanix/release_version"

Find CVM Version:

allssh "cat /etc/nutanix/svm-version"

Run NCC Health Checks:

ncc health_checks run_all

Logs

Acropolis logs (all CVMs):

allssh "cat ~/data/logs/Acropolis.log"

Error logs:

allssh "cat ~/data/logs/[COMPONENT].ERROR"

Fatal logs:

allssh "cat ~/data/logs/[COMPONENT].FATAL"
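Before reading an individual FATAL log, it can help to see which components have one with content in it. The sketch below lists them for a single log directory. The function name is hypothetical, and it assumes FATAL logs are files ending in `.FATAL` under `~/data/logs`, as in the command above.

```shell
# fatal_components [LOG_DIR]
# Prints one component name per non-empty *.FATAL file in LOG_DIR
# (defaults to ~/data/logs on the local CVM).
fatal_components() {
  dir="${1:-$HOME/data/logs}"
  for f in "$dir"/*.FATAL; do
    # Skips empty files, and the unexpanded glob when nothing matches
    [ -s "$f" ] || continue
    basename "$f" .FATAL
  done
}
```

Each name it prints can then be substituted for [COMPONENT] in the commands above.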

Storage Layers and Monitoring

VM Layer

Metrics reported by the hypervisor for the VM
Represent the performance the VM is seeing
Indicative of the I/O the VM is experiencing
Usage: troubleshooting at the VM level

Hypervisor Layer

Metrics reported by the hypervisor
Represent the performance the hypervisor is seeing
Usage: when more detailed or valuable metrics are needed

Controller Layer

Metrics from the Nutanix Controllers (CVMs)
What the front end is seeing from NFS/SMB/iSCSI, plus backend operations (balancing, ILM)
Should normally match the hypervisor layer
Usage: show how much backend operation is taking place
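Since controller-layer numbers should normally match the hypervisor layer, a gap between the two is one rough hint at how much backend work is going on. The sketch below just subtracts one IOPS reading from the other. The function name and both inputs are hypothetical values you would read off the two layers yourself, and the result is an estimate, not a metric Nutanix reports.

```shell
# backend_iops_estimate CONTROLLER_IOPS HYPERVISOR_IOPS
# Prints the controller-layer surplus over the hypervisor layer, floored at 0.
backend_iops_estimate() {
  diff=$(( $1 - $2 ))
  if [ "$diff" -lt 0 ]; then
    diff=0            # treat a negative gap as "no visible backend work"
  fi
  echo "$diff"
}
```

For example, a controller reading of 1200 against a hypervisor reading of 1000 gives an estimate of 200.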

Disk Layer

Metrics reported by the disk devices
Pulled from the physical disks
Represent data hitting the OpLog or Extent Store
Usage: see how many ops are served from cache or disk