Overview
We use a set of services to monitor the state of NDIP and trigger alerts when problems occur. Monitoring covers systemd services, Docker containers, disk space, web services, system tests, and tool tests.
The following services provide the monitoring:
| Service | Host | Configuration |
|---|---|---|
| Node Exporters | on each VM we deploy | https://github.com/neutrons/post_processing_agent |
| Prometheus Stack | prometheus_push_gateway | Ansible playbook |
| Slack | slack.com | Ansible playbook |
Further details about each service are provided in the corresponding subsections.
What is monitored
| Metric | Source |
|---|---|
| Systemd services | Node Exporter |
| Disk space | Node Exporter |
| Docker response time & number of running containers | Push Gateway, Docker metrics |
| Web services | Blackbox Exporter |
| System tests | Push Gateway |
| Tool tests | Push Gateway |
See the Alert rules for more details.
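
As an illustration of how a test result could reach Prometheus through the Push Gateway, here is a minimal sketch that uses only the Python standard library. It is not the actual NDIP test code: the metric names (`ndip_test_success`, `ndip_test_duration_seconds`), the `test` label, and the job and instance names are hypothetical. The only parts taken from the Pushgateway itself are the `/metrics/job/<job>/<label>/<value>` URL scheme and the Prometheus text exposition format. The sketch assumes that label values contain no `/`, quotes, or backslashes.

```python
import urllib.request
from urllib.parse import quote


def build_push_url(gateway: str, job: str, instance: str) -> str:
    """Return the Pushgateway URL for one job/instance grouping key.

    Assumption: job and instance contain no '/' characters.
    """
    base = gateway.rstrip("/")
    return f"{base}/metrics/job/{quote(job, safe='')}/instance/{quote(instance, safe='')}"


def build_payload(test_name: str, passed: bool, duration_seconds: float) -> bytes:
    """Render two gauges in the Prometheus text exposition format.

    The metric and label names are hypothetical, chosen for this example.
    """
    lines = [
        "# TYPE ndip_test_success gauge",
        f'ndip_test_success{{test="{test_name}"}} {1 if passed else 0}',
        "# TYPE ndip_test_duration_seconds gauge",
        f'ndip_test_duration_seconds{{test="{test_name}"}} {duration_seconds}',
    ]
    # The exposition format requires a trailing newline.
    return ("\n".join(lines) + "\n").encode("utf-8")


def push_result(gateway: str, job: str, instance: str, test_name: str,
                passed: bool, duration_seconds: float, timeout: float = 10.0) -> int:
    """Push one test result and return the HTTP status code.

    PUT replaces every metric previously pushed under the same grouping key.
    """
    request = urllib.request.Request(
        build_push_url(gateway, job, instance),
        data=build_payload(test_name, passed, duration_seconds),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A test runner could then call, for example, `push_result("http://prometheus_push_gateway:9091", "system_tests", "vm1", "login", True, 1.5)`. The host name comes from the services table above; port 9091 is the Pushgateway default and is assumed here, as are the job, instance, and test names. An alert rule can then fire when the success gauge is 0 or has not been refreshed for too long.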