Architecture of a Typical Monitoring & Security Stack (Logs, Metrics, Infra Health)

When people say “monitoring stack,” they often mean three very different problems:

  • Logs — human-readable events: “who logged in,” “what failed,” “what changed,” “what got blocked.”
  • Metrics — numeric time series: CPU, memory, disk IO, latency, throughput.
  • Infrastructure health — availability and state: uptime, services up/down, SNMP counters, discovery, maintenance windows.

If you mix these into one tool “because it can do everything,” you usually end up with a fragile system that’s hard to operate. The more reliable approach is to let each layer do what it’s best at:

  • Logs: rsyslog → Wazuh → OpenSearch
  • Metrics: exporter → Prometheus → Grafana
  • Infra health: agents/SNMP checks → Zabbix

This article explains how these pieces fit together in a practical, Linux-first way, and how to decide between agent-based and agentless collection.


1) Start With the Inputs: What Are You Monitoring?

In most environments, your “inputs” fall into three groups:

Servers (Linux/Windows)

  • Logs: auth, sudo, SSH, systemd/journald, kernel messages, web server logs (nginx/apache), mail logs (postfix/dovecot), database logs, application logs.
  • Metrics: CPU/RAM, disk and filesystem usage, IOPS, network throughput, process counts, service health.
  • Security signals: brute force attempts, privilege escalation, file integrity changes (FIM), audit events, suspicious binaries.

Network Devices (routers, firewalls, switches, Wi-Fi)

  • Logs: syslog events (VPN up/down, admin logins, config changes, firewall denies).
  • Metrics: SNMP counters (interfaces, errors, CPU load, memory).

Applications (web apps, APIs, mail systems)

  • Logs: access logs, error logs, audit logs (“who did what”), and application-specific events.
  • Metrics: request rate, error rate, latency (RED metrics), DB query time, queue depth.

Rule of thumb: logs answer “what happened,” metrics answer “how bad is it and when did it change,” and infra health answers “is it up and should we page someone.”


2) Logs Layer: rsyslog → Wazuh → OpenSearch

The logs pipeline should be treated like a delivery system: it must be boring, durable, and predictable.

Why rsyslog first?

rsyslog is a proven workhorse for:

  • Receiving syslog from servers and network devices (UDP/TCP, with TLS where possible).
  • Routing logs by facility/severity, program name, hostname, or custom rules.
  • Buffering/spooling to disk when downstream systems are slow or temporarily down.
  • Archiving raw logs for retention and later forensics.

Think of rsyslog as your log switch — it keeps the pipeline stable even if the “analytics” side is being upgraded or restarted.
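
A minimal receiving-and-archiving sketch in modern rsyslog (RainerScript) syntax; the file paths and listener ports are example values you would adapt:

  # /etc/rsyslog.d/10-central.conf -- example central receiver
  module(load="imudp")            # UDP 514 for devices that only speak UDP
  module(load="imtcp")            # TCP 514 for servers (add TLS via the gtls netstream driver if required)
  input(type="imudp" port="514")
  input(type="imtcp" port="514")

  # Keep a raw per-host archive for retention and later forensics
  template(name="PerHostFile" type="string"
           string="/var/log/remote/%hostname%/%programname%.log")
  if ($fromhost-ip != "127.0.0.1") then {
      action(type="omfile" dynaFile="PerHostFile")
  }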

Where Wazuh fits

Wazuh is not “just another log viewer.” It focuses on security analytics and host-based signals:

  • Detection rules & correlation (security events, suspicious behavior patterns).
  • File Integrity Monitoring (FIM) for critical paths like /etc, web roots, and configs.
  • Vulnerability and compliance checks (depending on setup and platform).
  • Agent-based collection for deep endpoint visibility.

Wazuh turns raw events into structured security alerts. Those alerts (and often parsed events) go into a search/index backend.
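
The FIM piece, for example, is configured in the agent's ossec.conf. A minimal sketch, where the watched paths and scan frequency are example values:

  <!-- Fragment of /var/ossec/etc/ossec.conf on a monitored host -->
  <syscheck>
    <!-- Watch critical paths; report_changes records diffs for text files -->
    <directories check_all="yes" report_changes="yes" realtime="yes">/etc</directories>
    <directories check_all="yes">/var/www</directories>
    <frequency>43200</frequency> <!-- full scan every 12 hours -->
  </syscheck>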

Why OpenSearch (Indexer) is separate

OpenSearch is built for fast search and dashboards over large volumes of structured data. You want it because:

  • You can search by fields (host, user, rule ID, IP, program, severity).
  • Dashboards become responsive and useful.
  • Retention and index lifecycle policies can be managed cleanly.
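
As an illustration of the last point, retention can be expressed as an Index State Management (ISM) policy. A minimal sketch, assuming the alert indices follow the usual wazuh-alerts-* pattern and a 90-day retention target (both are example values):

  PUT _plugins/_ism/policies/delete-old-alerts
  {
    "policy": {
      "description": "Delete alert indices after 90 days (example value)",
      "default_state": "hot",
      "states": [
        {
          "name": "hot",
          "actions": [],
          "transitions": [
            { "state_name": "delete", "conditions": { "min_index_age": "90d" } }
          ]
        },
        { "name": "delete", "actions": [ { "delete": {} } ], "transitions": [] }
      ],
      "ism_template": { "index_patterns": ["wazuh-alerts-*"], "priority": 100 }
    }
  }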

Suggested log flow (practical)

Sources (servers, routers, apps)
        |
        | syslog (TCP/TLS preferred, UDP acceptable for low-risk devices)
        v
   rsyslog (routing + disk spool + archive)
        |
        | forwarded stream (or file input on the Wazuh side)
        v
   Wazuh Manager (decode + correlate + alert)
        |
        v
 OpenSearch (index + search + dashboards)

Operational benefit: if OpenSearch is down or Wazuh is restarting, rsyslog can keep accepting logs and spooling them to disk. This is how you avoid “we lost the logs during the incident.”
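
That spooling behavior comes from an action queue attached to the forwarding rule. A minimal sketch (the target hostname and queue sizes are example values):

  # Forward to the analytics side, buffering to disk when it is unreachable
  action(type="omfwd" target="wazuh.example.internal" port="514" protocol="tcp"
         queue.type="LinkedList"           # in-memory queue that can spill to disk
         queue.filename="fwd_wazuh"        # enables disk-assisted spooling in the work directory
         queue.maxdiskspace="1g"
         queue.saveonshutdown="on"
         action.resumeretrycount="-1")     # retry forever instead of dropping messages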


3) Metrics Layer: exporters → Prometheus → Grafana

Logs are verbose and expensive to query for performance trends. Metrics are the opposite: lightweight, structured, and perfect for graphs and alert thresholds.

Exporters: where metrics come from

Prometheus usually collects metrics via HTTP endpoints exposed by exporters:

  • node_exporter — Linux host metrics (CPU, memory, disks, filesystem, network).
  • windows_exporter — Windows metrics.
  • blackbox_exporter — external checks (HTTP/TCP/ICMP, TLS expiry, latency).
  • Service exporters: nginx, PostgreSQL, MySQL/MariaDB, Redis, etc.

Prometheus: the time-series engine

Prometheus scrapes these endpoints on a schedule (a “pull” model), stores the resulting time series efficiently, and supports powerful queries through its query language, PromQL.
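
A minimal prometheus.yml sketch that scrapes node_exporter on two hosts (the hostnames are placeholders; 9100 is node_exporter's default port):

  global:
    scrape_interval: 15s        # how often every target is scraped by default

  scrape_configs:
    - job_name: "node"
      static_configs:
        - targets:
            - "web01.example.internal:9100"
            - "db01.example.internal:9100"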

Grafana: dashboards and alerting

Grafana sits on top and answers the question: “What does normal look like, and when did it stop being normal?” You’ll build dashboards for:

  • Host capacity and resource pressure
  • Service latency and error rates
  • Traffic patterns and saturation
  • Golden signals (latency, traffic, errors, saturation)
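
The queries behind these panels are plain PromQL. A sketch of an error-rate query, assuming the application or its exporter exposes a counter named http_requests_total with a status label (adjust to whatever your exporter actually provides):

  # Share of requests answered with a 5xx status over the last 5 minutes
  sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
  sum(rate(http_requests_total[5m]))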

Suggested metrics flow

Exporters on hosts/services
        |
        | HTTP /metrics
        v
   Prometheus (scrape + store)
        |
        v
   Grafana (dashboards + alerting)

Design note: keep metrics retention realistic. For long-term storage at scale, you typically add remote storage later. In a small lab or SMB environment, local retention is often enough.
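
Local retention is a single flag on the Prometheus process; for example, to keep roughly 30 days (an example value):

  prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.retention.time=30d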


4) Infrastructure Health Layer: agents/SNMP checks → Zabbix

Zabbix excels at classic infrastructure operations:

  • Availability monitoring: ping, port checks, HTTP checks, service state.
  • SNMP monitoring: network device interfaces, errors, utilization, CPU, memory.
  • Discovery and inventory: find devices, templates, auto-registration.
  • Maintenance windows: planned downtime without noisy alerts.

Zabbix can do metrics too, but in a modern “split stack” approach it’s common to use:

  • Prometheus/Grafana for performance engineering and deep time-series visualization.
  • Zabbix for uptime, SNMP, and operational state.

5) Agent vs Agentless: What to Choose (and Why)

This is where most stacks either become elegant… or become a management nightmare.

Choose an agent when you need depth and trust

Use an agent if you need:

  • Host-level visibility (processes, journald, file changes, local audit events).
  • Security telemetry (FIM, audit integration, detailed login events).
  • Reliable data collection even when the host is busy (agent queues locally).

Examples:

  • Wazuh Agent on Linux servers for FIM, auditd, and security detections.
  • node_exporter on Linux for accurate OS metrics.
  • Zabbix Agent for custom checks that require local context.
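
The last item is where Zabbix agents earn their keep. A hypothetical custom check, defined as a UserParameter in the agent configuration (the key name, path, and command are made up for illustration):

  # /etc/zabbix/zabbix_agentd.d/userparameter_backups.conf
  # Exposes the number of files in a local backup directory as item key backup.count
  UserParameter=backup.count,ls -1 /var/backups 2>/dev/null | wc -l

On the Zabbix server you then create an item with key backup.count on that host (or in a template) and build a trigger on it.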

Choose agentless when the target can’t run an agent (or shouldn’t)

Agentless is ideal when:

  • The device cannot run agents (routers, switches, firewalls).
  • You want simple availability checks without installing anything.
  • You prefer centralized polling (SNMP, ICMP, external HTTP checks).

Examples:

  • Network devices send syslog to rsyslog and expose SNMP to Zabbix.
  • blackbox_exporter checks external HTTP/TCP/ICMP without any agent on the target.
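
A sketch of the blackbox_exporter pattern, assuming the exporter runs on a monitoring host (blackbox.example.internal is a placeholder) and listens on its default port 9115:

  # blackbox.yml (on the exporter host) -- define an HTTP probe module
  modules:
    http_2xx:
      prober: http
      timeout: 5s

  # prometheus.yml fragment -- probe each URL through the exporter
  scrape_configs:
    - job_name: "blackbox-http"
      metrics_path: /probe
      params:
        module: [http_2xx]
      static_configs:
        - targets:
            - "https://example.com"
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target        # pass the URL as ?target=
        - source_labels: [__param_target]
          target_label: instance              # keep the URL as the instance label
        - target_label: __address__
          replacement: "blackbox.example.internal:9115"   # scrape the exporter itself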

Decision table (fast cheat sheet)

  Target                     | Logs                                   | Metrics                        | Infra health                       | Agent?
  ---------------------------|----------------------------------------|--------------------------------|------------------------------------|----------------
  Linux server               | Wazuh Agent and/or syslog to rsyslog   | node_exporter → Prometheus     | Zabbix Agent (optional) + checks   | Usually yes
  Windows server             | Wazuh Agent (or event forwarding)      | windows_exporter → Prometheus  | Zabbix Agent (optional)            | Usually yes
  Firewall/router/switch     | syslog → rsyslog                       | rarely via exporter            | SNMP → Zabbix                      | No (agentless)
  External service / website | provider logs (if available)           | synthetics                     | blackbox checks                    | No (agentless)

6) Avoid Duplication: Define “Source of Truth” Per Layer

It’s tempting to monitor the same thing in multiple tools “just in case.” In practice that creates confusion and alert fatigue.

A clean division of responsibilities looks like this:

  • Security detections & compliance signals: Wazuh
  • Performance metrics & dashboards: Prometheus + Grafana
  • Uptime, discovery, SNMP, maintenance windows: Zabbix
  • Log routing, buffering, and archival: rsyslog

Tip: If you page on something, decide which tool owns the alert. Everyone on-call should know where to look first.


7) Real-World Pitfalls (and How to Avoid Them)

Pitfall: losing logs during outages

Fix: use rsyslog disk spooling and archive raw logs locally before forwarding downstream.

Pitfall: “agents everywhere” becoming unmanageable

Fix: use agents where they add unique value (security depth, host context), and use agentless for network devices and synthetics.

Pitfall: dashboards that lie

Fix: standardize labels/hostnames/tags (env, role, app, location). Bad naming breaks observability faster than any outage.
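
One way to make the convention concrete is to attach the same label set to every Prometheus target (a prometheus.yml fragment; the values are examples):

  - job_name: "node"
    static_configs:
      - targets: ["web01.fra1.example.internal:9100"]
        labels:
          env: "prod"
          role: "web"
          app: "shop"
          location: "fra1"

Reusing the same names (env, role, app, location) as host tags in Zabbix and as fields in the log pipeline makes it much easier to pivot between tools during an incident.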

Pitfall: alert fatigue

Fix: start with a small set of high-signal alerts (disk full risk, service down, VPN down, repeated auth failures, TLS expiry). Expand slowly.
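
Two of those starter alerts, written as Prometheus alerting rules; thresholds and durations are example values, and the metrics assume node_exporter and blackbox_exporter are already in place:

  # alerts.yml -- two high-signal starter rules
  groups:
    - name: starter-alerts
      rules:
        - alert: DiskAlmostFull
          # node_exporter filesystem metrics; ignore tmpfs/overlay mounts
          expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes < 0.10
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "Less than 10% space left on {{ $labels.instance }} {{ $labels.mountpoint }}"

        - alert: TLSCertExpiringSoon
          # blackbox_exporter reports certificate expiry as a Unix timestamp
          expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "TLS certificate for {{ $labels.instance }} expires in under 14 days"

Load the file via rule_files: in prometheus.yml and route notifications through Alertmanager (or Grafana alerting, if you prefer to keep alerts there).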


8) A Practical “Reference Stack” for Small Environments

If you’re building a cost-effective lab or a small business monitoring setup, this architecture scales nicely:

LOGS
Devices/Servers → rsyslog (buffer + archive) → Wazuh → OpenSearch

METRICS
Hosts/Services exporters → Prometheus → Grafana

INFRA HEALTH
SNMP + checks (and optional agents) → Zabbix

It’s modular, upgrade-friendly, and resilient: you can restart analytics components without breaking collection, and you can scale each layer independently.


Conclusion

A “typical stack” is not a single product — it’s a set of well-chosen roles:

  • rsyslog makes log transport reliable.
  • Wazuh turns logs and endpoint signals into security detections.
  • OpenSearch makes searching and dashboards fast.
  • Prometheus + Grafana gives you clean, powerful performance monitoring.
  • Zabbix gives you operational health, discovery, and SNMP visibility.

If you take only one idea from this lesson: use agents where you need deep host context and security telemetry, and stay agentless for devices that don’t support agents or don’t need that depth.
