Context:
I run a small virtualized on-prem environment across eight Linux VMs. Three are Ubuntu and five are Rocky. The environment hosts shared services, development workloads, and experiments. When something breaks, the first thing I need is visibility across the entire set, not eight separate SSH sessions and a handful of guesses.
Problem & Constraints:
Troubleshooting across multiple machines is slow when logs are scattered. You end up logging into one box, checking journald, then logging into the next box, and repeating that until you find the needle. That approach does not scale and it does not support quick response.
The constraints were mixed runtimes and mixed distributions: Rocky uses Podman and Ubuntu uses Docker. The system still needs one coherent logging pipeline that behaves the same from the query side.
Solution & Architecture:
The solution is a standard Loki stack. Grafana is the interface. Loki is the log store. Promtail runs on each VM and ships logs into Loki. Once logs land in Loki with consistent labels, Grafana can query and filter streams quickly. Grafana is accessible at http://192.168.*.*:3003.
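To make "logs land in Loki with consistent labels" concrete, here is a minimal sketch of the JSON body that Loki's push endpoint (/loki/api/v1/push) accepts: a list of streams, each with a label set and a list of [nanosecond timestamp, line] pairs. Promtail builds and sends this itself, so the code is illustrative only. The host name rocky-01, the label values, and the log lines are hypothetical, and nothing is sent over the network.

```python
import json
import time


def build_push_payload(labels, lines, ts_ns=None):
    """Build a Loki push-API body: one stream, one entry per log line.

    Timestamps are strings of Unix epoch nanoseconds, as the API expects.
    """
    if ts_ns is None:
        ts_ns = time.time_ns()
    # Offset each line by 1 ns so entries keep their order within the stream.
    values = [[str(ts_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


# Hypothetical labels and log lines, with a fixed timestamp for repeatability.
payload = build_push_payload(
    {"host": "rocky-01", "job": "podman"},
    ["service started", "listening on :8080"],
    ts_ns=1_700_000_000_000_000_000,
)
print(json.dumps(payload, indent=2))
```

Every entry in a stream shares the same label set, which is why agreeing on the labels up front matters more than anything else in the pipeline.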
The rollout started small, on the control node and a single Rocky VM, to validate the end-to-end flow. Once the pipeline was proven, I automated it with Ansible so new VMs can be enrolled quickly. I maintained one playbook for Ubuntu and one for Rocky to account for Docker versus Podman.
The core of the system is speed and consistency. A host label and a job label make it possible to pivot from broad visibility to targeted triage in seconds.
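The pivot from broad to targeted can be sketched as two LogQL stream selectors, one filtering on job alone and one adding host plus a line filter. The helper below also builds a URL for Loki's /loki/api/v1/query_range endpoint. The base URL, the host name rocky-01, and the label values are assumptions for illustration; in practice I run these queries from Grafana rather than from a script.

```python
from urllib.parse import urlencode


def logql_selector(**labels):
    """Build a LogQL stream selector such as {host="a", job="b"}."""
    parts = []
    for name, value in sorted(labels.items()):
        # Escape backslashes and double quotes inside the label value.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{name}="{escaped}"')
    return "{" + ", ".join(parts) + "}"


def query_range_url(base_url, query, start_ns, end_ns, limit=100):
    """Build a query_range URL; start and end are Unix epoch nanoseconds."""
    params = urlencode({
        "query": query,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
        "direction": "backward",  # newest entries first
    })
    return f"{base_url}/loki/api/v1/query_range?{params}"


# Broad view: every stream for one job (hypothetical label value).
broad = logql_selector(job="podman")

# Targeted triage: one host, same job, only lines containing "error".
targeted = logql_selector(job="podman", host="rocky-01") + ' |= "error"'

# Hypothetical Loki address; substitute the real one.
url = query_range_url(
    "http://loki.example:3100",
    targeted,
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_000_600_000_000_000,
)
print(broad)
print(targeted)
print(url)
```

Because both queries differ only in the selector, narrowing from the whole fleet to one machine is a matter of adding a single label matcher.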
Proof & Outcome:
Logs from all eight VMs are visible and queryable in one place. I can filter by host and by service and see the actual log lines in context without leaving Grafana. That eliminates the slow loop of logging into multiple machines just to reconstruct a timeline.
The outcome is faster troubleshooting and faster response. The next step is alerting, so that failures become actionable.
Next Steps:
Standardize labels across Ubuntu and Rocky so system logs and container logs are consistently separated. Add alerting for common failure patterns and service health events and route notifications to email and Telegram. Expand the automation so enrolling a new VM is a single Ansible run with a predictable label set.
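One way to get a predictable label set is to derive it from a few inputs instead of writing it by hand per host. The sketch below is an assumption about how that could look, not the playbooks I currently run: it builds a Promtail-style scrape config entry (job_name, static_configs, targets, labels, __path__) in which the job label separates system logs from container logs. The distro and runtime label names, the host names, and the path globs are hypothetical. The output is JSON, which is also valid YAML.

```python
import json

# Container runtime per distribution, as described in this write-up.
RUNTIME_BY_DISTRO = {"ubuntu": "docker", "rocky": "podman"}


def standard_labels(host, distro, source):
    """Return the label set for one log source on one host.

    source is "system" or "containers"; it becomes the job label so the
    two kinds of logs are always separated the same way.
    """
    if distro not in RUNTIME_BY_DISTRO:
        raise ValueError(f"unknown distro: {distro}")
    if source not in ("system", "containers"):
        raise ValueError(f"unknown source: {source}")
    labels = {"host": host, "distro": distro, "job": source}
    if source == "containers":
        labels["runtime"] = RUNTIME_BY_DISTRO[distro]
    return labels


def scrape_config(host, distro, source, path_glob):
    """Build one Promtail scrape_configs entry using the standard labels."""
    labels = standard_labels(host, distro, source)
    labels["__path__"] = path_glob  # files Promtail should tail
    return {
        "job_name": source,
        "static_configs": [{"targets": ["localhost"], "labels": labels}],
    }


# Hypothetical hosts and path globs; adjust to the real layout.
configs = [
    scrape_config("rocky-01", "rocky", "system", "/var/log/*.log"),
    scrape_config("ubuntu-01", "ubuntu", "containers",
                  "/var/lib/docker/containers/*/*-json.log"),
]
print(json.dumps({"scrape_configs": configs}, indent=2))
```

An Ansible template could render the same structure per host, so that enrolling a new VM produces the same labels as every existing one.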