Overview

We use a set of services to monitor the state of NDIP and trigger alerts when problems occur. Monitored items include system services, Docker containers, disk space, system tests, and tool tests, among others.

The following services provide monitoring:

Service	Host	Configuration
Node Exporters	on each VM we deploy	https://github.com/neutrons/post_processing_agent
Prometheus Stack	prometheus_push_gateway	Ansible playbook
Slack	slack.com	Ansible playbook

Further details about each service are provided in the corresponding subsections.

What is monitored

Metric	Source
Systemd services	Node Exporter
Disk space	Node Exporter
Docker response time & number of running containers	Push Gateway, Docker metrics
Web services	Blackbox Exporter
System tests	Push Gateway
Tool tests	Push Gateway

See the Alert rules for more details.
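
The Push Gateway rows in the table above (system tests, tool tests) depend on a job pushing its result after it runs. The following is a minimal sketch of such a push using only the Python standard library and the Pushgateway HTTP API. The metric name `ndip_test_passed`, the job and instance names, and the URL `http://prometheus_push_gateway:9091` (host name taken from the services table, 9091 being the Pushgateway default port) are assumptions for illustration, not the names used by the actual NDIP tests.

```python
import urllib.request
from urllib.parse import quote


def pushgateway_url(base_url: str, job: str, instance: str) -> str:
    """Build the Pushgateway endpoint for one job/instance grouping key.

    Assumes job and instance contain no '/' characters.
    """
    return (
        f"{base_url.rstrip('/')}/metrics/job/{quote(job, safe='')}"
        f"/instance/{quote(instance, safe='')}"
    )


def render_metric(name: str, value: float, help_text: str) -> bytes:
    """Render a single gauge in the Prometheus text exposition format."""
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} gauge",
        f"{name} {value}",
        "",  # joining with this empty entry ends the payload with a newline
    ]
    return "\n".join(lines).encode("utf-8")


def push_test_result(base_url: str, job: str, instance: str,
                     passed: bool, timeout: float = 10.0) -> int:
    """Push 1 (passed) or 0 (failed) and return the HTTP status code."""
    # Hypothetical metric name; the real tests may use a different one.
    payload = render_metric(
        "ndip_test_passed", 1 if passed else 0,
        "1 if the last test run passed, 0 otherwise",
    )
    # PUT replaces every metric previously pushed under this grouping key.
    request = urllib.request.Request(
        pushgateway_url(base_url, job, instance), data=payload, method="PUT"
    )
    request.add_header("Content-Type", "text/plain; version=0.0.4")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A test runner could call `push_test_result("http://prometheus_push_gateway:9091", "system_tests", "vm1", passed=True)` as its last step. Prometheus would then scrape the value from the Push Gateway, and an alert rule could fire when the value is 0 or when the metric has not been refreshed for too long.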