When it comes to "what" to monitor, many of the usual suspects have already been posted in this thread, so in an attempt not to repeat what's there, I'll mention just the following (somewhat assuming Linux/systemd):
- systemd unit failures - I install a global OnFailure hook that applies to all units, triggering an alert via the mechanism of choice for a given system,
- restarts of key services - you typically don't want to miss those, but if they are silent, you quite likely will,
- netfilter reconfigurations - the nftables CLI has a useful `monitor` subcommand for this,
- unexpected ingress or egress connection attempts,
- connections from unknown/unexpected networks (if you can't outright block them for some reason).
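For the global OnFailure hook, a minimal sketch using a type-level drop-in (supported by reasonably recent systemd; paths and the `send-alert` command are placeholders for whatever alerting mechanism you use):

```ini
# /etc/systemd/system/service.d/10-onfailure.conf
# type-level drop-in: applies to every .service unit on the system
[Unit]
OnFailure=failure-alert@%n.service

# /etc/systemd/system/failure-alert@.service
[Unit]
Description=Alert on failure of %i

[Service]
Type=oneshot
# %i carries the failing unit's name; send-alert is a placeholder script
ExecStart=/usr/local/bin/send-alert "unit %i failed"
```

Run `systemctl daemon-reload` after dropping these in; `%n` in the hook expands to the failing unit's full name and is passed through as the template instance.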
Can I bother you for a rough "how" wrt unexpected ingress/egress and unknown connections?
I'm not aware of any tooling that'd enable such monitoring without massively impacting performance - but I'm not particularly knowledgeable in this field either.
Just asking as someone who'd be interested in improving the monitoring on my homelab server.
You are right, there is a performance and resource aspect to this.
When I gave those two monitoring examples, I should have put more emphasis on the word "unexpected", which by its nature keeps the day-to-day cost close to zero. A problem may arise when something wrong is not only actually happening, but happening on a massive scale, in which case paying a price for a short moment hopefully makes sense. The cost-benefit ratio may vary depending on the specifics, sure.
Just to illustrate the point, staying in the context of the transport layer: let's say I don't expect a particular DB server inside the internal network to make egress connections to anything other than:
- a local HTTP proxy server to port x1, to fetch OS updates and push DB backups externally (yes, this proxy then needs source/target policy rules and logging for this HTTP traffic, but that's the same idea, just at a higher layer),
- a local backup server to port x2 to push internal backups,
- a local time server to port x3 for time sync,
- and a local monitoring server to port x4 for logs forwarding.
Depending on the specifics, I may not even need outgoing DNS traffic.
For ingress, I may expect the following connections only:
- to port y1 for db replication, but only from the server running authoritative hot standby,
- to port y2 for SSH access, but only from a set of local bastion hosts,
- to port y3 for metrics polling, but only from local metric servers.
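The egress side of the above can be sketched as an nftables chain with a default drop policy and explicit accepts; all addresses and ports below are made up for illustration (they stand in for the x1..x4 placeholders), and the real set depends on your network:

```
# example ruleset fragment, e.g. loaded via `nft -f`
table inet egress_policy {
  chain output {
    type filter hook output priority 0; policy drop;

    oif "lo" accept
    ct state established,related accept

    # illustrative addresses/ports only
    ip daddr 10.0.1.10 tcp dport 3128 accept comment "http proxy: os updates, external backups"
    ip daddr 10.0.1.20 tcp dport 2222 accept comment "backup server"
    ip daddr 10.0.1.30 udp dport 123 accept comment "time sync"
    ip daddr 10.0.1.40 tcp dport 514 accept comment "log forwarding"
  }
}
```

The ingress side is the mirror image: an `input` chain with `policy drop` and accepts keyed on source address plus destination port.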
In a case like the above, I would log any other egress or ingress attempts to this host. I can do that with some sanity, because those would be, if anything, low-frequency events stemming from some misconfiguration on my part (a VERY good way of detecting those), some software exhibiting behavior that is unexpected (to me) by design, an important misconception in my mental model of the network as a whole, or an actual intentional unauthorized attempt (hopefully never!). In all of those cases I want to know and intervene, if only to update the whitelisting setup, or maybe to reduce the logging severity of some events if they are rare and interesting enough that I can still accept logging them.
On the other hand, if I were to directly expose something to the public internet, then as a rule of thumb I would not log every connection attempt to every port, as those would be more or less constant and more than "expected".
As for the tooling, I believe anything that you use for traffic policing will do, as under those particular assumptions we don't need to chase any unusual performance characteristics.
For example, in the context of Linux and netfilter, you can put a logging-only rule at the end of the relevant chain (with the default drop policy set) and include some useful semantics in the message contents, so that log monitoring is easier to configure to pick those events up, categorize them (direction, severity, class, ...) and act upon them (alerting).
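A minimal sketch of such a chain tail; the prefix text is just one possible convention for encoding direction/class:

```
chain input {
  type filter hook input priority 0; policy drop;

  # ... accept rules for expected traffic go here ...

  # last rule before the default drop kicks in:
  # count and log everything that fell through, with a parseable prefix
  counter log prefix "fw,in,drop: "
}
```

The messages land in the kernel log, so something like `journalctl -k -g "fw,in,drop"` (if your journalctl build supports `--grep`) or a plain `dmesg` filter can feed the alerting side.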
And when it comes to monitoring and logging relatively high-volume, "expected" network traffic (I'm guessing you were thinking about something like port 22 on your homelab perimeter), you either don't do it at all, or it's crucial enough that you have (1) the compute resources to do it and, equally important, (2) a good idea of what to actually do with that data. At that point you probably enter the space of network IDS, IPS, SIEM, WAF and similar acronyms, inspecting application-layer traffic etc., but I don't have enough experience to recommend anything particular here.