
Make sure whatever information you provide is actionable.

For example, a CPU metric on its own is only good for alerting. When it exceeds a threshold, make sure you can also see which process or container was using how much CPU at that moment. Bonus points if you can link to the logs from that process or container for that time.
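As a rough illustration of the difference, here is a minimal Python sketch that turns a threshold breach into a message naming the top consumers. The `ProcSample` type, the `explain_cpu_alert` function, and the default threshold of 80% are all hypothetical names and values chosen for this example; how the per-process samples get collected (an agent, cgroup stats, and so on) is left out.

```python
from dataclasses import dataclass


@dataclass
class ProcSample:
    """One per-process reading taken at alert time (hypothetical shape)."""
    pid: int
    name: str
    cpu_percent: float  # share of total CPU used by this process


def explain_cpu_alert(samples, threshold=80.0, top_n=3):
    """Return None at or below the threshold, else a message naming the top consumers."""
    total = sum(s.cpu_percent for s in samples)
    if total <= threshold:
        return None
    # Rank processes so the alert says where to look next.
    top = sorted(samples, key=lambda s: s.cpu_percent, reverse=True)[:top_n]
    lines = [f"CPU at {total:.1f}% (threshold {threshold:.1f}%). Top consumers:"]
    for s in top:
        lines.append(f"  pid={s.pid} name={s.name} cpu={s.cpu_percent:.1f}%")
    return "\n".join(lines)
```

The same message could carry a link to the logs of each listed process for that time window, assuming your log system supports querying by process or container and timestamp.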

For disks, show which directories are large and which file types are taking up the most space.
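A minimal sketch of that kind of breakdown, using only the Python standard library. The function name `disk_breakdown` is made up for this example, and it sums apparent file sizes rather than allocated disk blocks, so the numbers can differ from what `du` reports.

```python
import os
from collections import Counter


def disk_breakdown(root):
    """Return (bytes per top-level directory, bytes per file extension) under root."""
    by_dir = Counter()
    by_ext = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # Attribute every file to the first directory level below root.
        top = "." if rel == "." else rel.split(os.sep)[0]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            ext = os.path.splitext(name)[1].lower() or "(none)"
            by_dir[top] += size
            by_ext[ext] += size
    return by_dir, by_ext


if __name__ == "__main__":
    dirs, exts = disk_breakdown(".")
    for name, size in dirs.most_common(5):
        print(f"{size:>12} bytes  {name}")
    for ext, size in exts.most_common(5):
        print(f"{size:>12} bytes  {ext}")
```

Showing the top few entries of each counter next to the disk-usage graph answers the "what do I look at next" question directly.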

Pretty graphs that don't tell you what to look for next are worth nothing.