  • 45+ servers monitored
  • 99.9% uptime achieved
  • 15 min average response time
  • 85% of issues caught early

Project Profile

Industry: Manufacturing

Environment: Hybrid (on-premise + cloud)

Tools: OMD, Webmin, Custom Dashboards

Servers: Windows, Linux, Network Devices

The Challenge

This manufacturing company had grown their IT infrastructure organically over the years. What started as a few servers had become a sprawling environment of 45+ systems across multiple locations: file servers, database servers, application servers, domain controllers, backup systems, and various specialized equipment running everything from Windows Server to Linux to embedded systems.

The problem was visibility. The IT team found out about issues the same way everyone else did—when something stopped working and someone complained. A disk filling up would go unnoticed until an application crashed. A failing backup job might not be discovered for days. Memory leaks would slowly degrade performance until a server needed to be rebooted during business hours.

The reactive approach was burning out the IT team and frustrating users. Every incident was an emergency. There was no warning, no time to plan, no ability to address problems during maintenance windows instead of peak business hours.

The company needed a monitoring solution that would:

  • Provide real-time visibility into all servers and critical services
  • Alert the right people before problems became outages
  • Track trends to enable capacity planning
  • Work across their mixed Windows/Linux environment
  • Not require expensive licensing or complex infrastructure

The Solution

We implemented a comprehensive monitoring stack built on proven open-source tools, customized for the client's specific environment and integrated with their existing workflows.

OMD: The Monitoring Core

At the heart of the solution is OMD (Open Monitoring Distribution), which packages Nagios and Check_MK together with their supporting add-ons as a single, pre-integrated installation. OMD provides:

  • Service monitoring - Checks for hundreds of service types out of the box, from basic ping tests to detailed application health checks
  • Agent-based monitoring - Lightweight agents on each server report detailed metrics: CPU, memory, disk, network, running processes, Windows services, and more
  • SNMP monitoring - Network switches, routers, UPS systems, and other infrastructure devices monitored via SNMP
  • Historical data - Performance metrics stored for trend analysis and capacity planning
  • Flexible alerting - Configurable notification rules based on severity, time of day, and escalation paths

We configured monitoring for all 45+ servers with checks appropriate to each system's role. Database servers get query performance monitoring. File servers get share availability and quota checks. Domain controllers get replication health monitoring. Each server type has a tailored set of checks that matter for that workload.
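One way to picture the role-based approach is a simple mapping from server role to check set. The sketch below is illustrative only: the role names and check labels are hypothetical stand-ins, not actual OMD or Check_MK configuration syntax.

```python
# Hypothetical role-to-check mapping. The names are illustrative labels,
# not real OMD/Check_MK check identifiers.
BASELINE_CHECKS = ["cpu", "memory", "disk", "network"]

ROLE_CHECKS = {
    "database": ["query_performance"],
    "file_server": ["share_availability", "quota"],
    "domain_controller": ["replication_health"],
}


def checks_for(role: str) -> list[str]:
    """Return the baseline checks plus any role-specific checks."""
    return BASELINE_CHECKS + ROLE_CHECKS.get(role, [])
```

A server with an unrecognized role still receives the baseline checks, so nothing is left entirely unmonitored while its role-specific checks are being defined.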

Webmin: Server Management Interface

For the Linux servers in the environment, we deployed Webmin to provide a web-based management interface. Webmin gives the IT team:

  • System administration - User management, package updates, service control, and configuration editing through a browser
  • Log viewing - Centralized access to system logs without SSH access to each server
  • Scheduled tasks - Cron job management with a visual interface
  • Resource monitoring - Real-time CPU, memory, and disk graphs for quick health checks

Webmin complements OMD by providing the management capabilities to actually fix issues once monitoring identifies them. The IT team can respond to an alert and resolve many issues without leaving their browser.

Custom Dashboards

While OMD's built-in interface is powerful, we created custom dashboards tailored to different audiences:

Operations Dashboard: A wall-mounted display in the IT area showing real-time status of all critical systems. Green/yellow/red indicators make it immediately obvious when something needs attention. This dashboard auto-refreshes and is designed to be readable from across the room.

Executive Dashboard: A high-level view for management showing uptime percentages, incident trends, and capacity utilization. No technical details—just the metrics that matter for business decisions.

Service-Specific Dashboards: Focused views for specific systems like the ERP environment, showing database performance, application server health, and integration service status on a single screen.

Email Alert System

Monitoring is only useful if the right people find out about problems. We implemented a tiered alerting system:

Warning alerts go to the IT team's shared inbox. These are issues that need attention but aren't emergencies—disk space at 80%, a backup that took longer than usual, elevated but not critical CPU usage.

Critical alerts go directly to on-call staff via email and SMS. These are issues that need immediate response—a server down, a critical service stopped, disk space at 95%.

Escalation rules ensure that if a critical alert isn't acknowledged within 15 minutes, it escalates to additional team members and management.
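The routing and escalation rules above can be sketched in a few lines of Python. This is a simplified model, not the notification configuration used in the project; the recipient group names are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Escalation window taken from the rule described above.
ESCALATION_WINDOW = timedelta(minutes=15)


@dataclass
class Alert:
    severity: str                              # "warning" or "critical"
    raised_at: datetime
    acknowledged_at: Optional[datetime] = None


def recipients(alert: Alert, now: datetime) -> list[str]:
    """Return the (hypothetical) recipient groups for an alert at time `now`."""
    if alert.severity == "warning":
        return ["it-shared-inbox"]
    # Critical alerts always reach on-call staff by email and SMS.
    targets = ["on-call-email", "on-call-sms"]
    # Escalate only if nobody has acknowledged the alert within the window.
    unacknowledged = alert.acknowledged_at is None
    if unacknowledged and now - alert.raised_at >= ESCALATION_WINDOW:
        targets += ["additional-team-members", "management"]
    return targets
```

Keeping acknowledgement as the only thing that stops escalation means a critical alert cannot quietly expire in someone's inbox.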

Alert fatigue is a real problem with monitoring systems, so we spent considerable time tuning thresholds and suppressing noise. The goal is that every alert represents something that actually needs human attention. When people start ignoring alerts because most are false alarms, the monitoring system has failed.
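One common noise-suppression technique is to notify only after several consecutive failed checks, similar in spirit to the retry behavior Nagios offers through its max_check_attempts setting. The sketch below is a generic illustration of that idea; the function name and the default of three failures are assumptions, not values from this project.

```python
def should_notify(results: list[bool], required_failures: int = 3) -> bool:
    """Decide whether to notify, given check results ordered oldest to newest.

    Each entry is True when the check passed and False when it failed.
    A notification fires only when the most recent `required_failures`
    results all failed, which filters out one-off blips.
    """
    if len(results) < required_failures:
        return False
    return not any(results[-required_failures:])
```

A single failed check followed by a recovery never pages anyone, while a sustained failure still produces an alert after a short, predictable delay.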

The Implementation

Week 1: Deployed the OMD server and configured basic monitoring for critical systems, giving the team immediate visibility into the most important servers.

Week 2: Rolled out monitoring agents to all servers. Configured service-specific checks for databases, applications, and infrastructure.

Week 3: Deployed Webmin to Linux servers. Configured email alerting with initial thresholds.

Week 4: Built custom dashboards. Tuned alert thresholds based on baseline data. Trained IT staff on the tools.

Ongoing: Continuous refinement of thresholds, addition of new checks as needs are identified, and regular review of alert patterns to reduce noise.

The Results

99.9% uptime achieved. With early warning of issues, the team can address problems before they cause outages. Planned maintenance replaces emergency firefighting.
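To put the uptime figure in perspective, the arithmetic below converts an availability percentage into an allowed-downtime budget. The 30-day and 365-day periods are assumptions chosen for illustration; the measurement window used for this project is not specified here.

```python
def downtime_budget_minutes(uptime_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted by an uptime target over a period."""
    minutes_in_period = period_days * 24 * 60
    return (1 - uptime_percent / 100) * minutes_in_period


# 99.9% over a 30-day month allows about 43.2 minutes of downtime.
monthly_budget = downtime_budget_minutes(99.9, 30)

# 99.9% over a 365-day year allows about 525.6 minutes (roughly 8.76 hours).
yearly_budget = downtime_budget_minutes(99.9, 365)
```

A budget this small leaves little room for unplanned outages, which is why catching problems before they cause downtime matters more than reacting quickly afterward.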

85% of issues caught before user impact. Disk space warnings, memory pressure, failing services—these are now addressed proactively rather than reactively.

15-minute average response time. When something does go wrong, the right people know immediately. No more waiting for a user to report a problem.

Capacity planning enabled. Historical data shows trends over time. The team can see that a database server's storage is growing 5% per month and plan accordingly, rather than being surprised when it fills up.
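As a rough illustration of how a growth trend becomes a planning date, the sketch below estimates the months remaining before a volume fills. It assumes growth compounds at a steady monthly rate; the 600 GB and 1000 GB figures are hypothetical, and only the 5% rate comes from the example above.

```python
import math


def months_until_full(used_gb: float, capacity_gb: float,
                      monthly_growth: float = 0.05) -> float:
    """Months until usage reaches capacity, assuming compound monthly growth.

    Solves used * (1 + g) ** n = capacity for n.
    """
    if used_gb <= 0 or monthly_growth <= 0:
        raise ValueError("used_gb and monthly_growth must be positive")
    if used_gb >= capacity_gb:
        return 0.0
    return math.log(capacity_gb / used_gb) / math.log(1 + monthly_growth)


# Hypothetical volume: 600 GB used of 1000 GB, growing 5% per month.
remaining = months_until_full(600, 1000)  # roughly 10.5 months
```

Real storage growth is rarely this smooth, so an estimate like this is best treated as a prompt to schedule the expansion, not as a precise deadline.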

Reduced after-hours emergencies. Issues that would have become 2 AM emergencies are now caught during business hours when they're easier to address.

Better vendor accountability. When a cloud service or ISP has issues, the monitoring system provides documentation. "Our monitoring shows your service was unavailable from 2:15 to 2:47" is more effective than "it seemed slow yesterday."
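Statements like the one quoted above come from turning raw check results into outage windows. The following is a minimal sketch of that conversion, assuming a time-sorted list of (timestamp, is_up) samples; the data format is an assumption for illustration, not the way OMD stores its history.

```python
from datetime import datetime
from typing import Optional

Sample = tuple[datetime, bool]
Window = tuple[datetime, Optional[datetime]]


def outage_windows(samples: list[Sample]) -> list[Window]:
    """Collapse time-sorted (timestamp, is_up) samples into outage windows.

    Each window runs from the first failed sample to the next successful
    one. An outage still open at the end of the data has an end of None.
    """
    windows: list[Window] = []
    start: Optional[datetime] = None
    for timestamp, is_up in samples:
        if not is_up and start is None:
            start = timestamp                  # outage begins
        elif is_up and start is not None:
            windows.append((start, timestamp))  # outage ends
            start = None
    if start is not None:
        windows.append((start, None))
    return windows
```

The precision of each window is limited by the check interval: with checks every minute, a reported start or end time can be off by up to a minute.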

What We Monitor

The monitoring implementation covers:

Server Health

  • CPU utilization and load
  • Memory usage and swap
  • Disk space and I/O
  • Network throughput
  • System uptime
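For a sense of what an individual health check involves, here is a minimal disk-space check written in the style of a Nagios plugin, which reports status through exit codes 0 (OK), 1 (WARNING), and 2 (CRITICAL). It is a sketch, not one of the checks deployed in this project; the 80% and 95% defaults mirror the alert levels described earlier, and the function names are hypothetical.

```python
import shutil

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify_disk(used_percent: float, warn: float = 80.0,
                  crit: float = 95.0) -> int:
    """Map a disk usage percentage to a Nagios-style status code."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path: str = "/") -> tuple[int, str]:
    """Check one filesystem and return (status code, one-line summary)."""
    usage = shutil.disk_usage(path)
    used_percent = usage.used / usage.total * 100
    status = classify_disk(used_percent)
    return status, f"DISK {LABELS[status]} - {path} is {used_percent:.1f}% full"
```

A wrapper script would print the summary line and exit with the status code, which is the contract a Nagios-compatible scheduler expects from a plugin.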

Services & Applications

  • Windows services status
  • Linux daemon status
  • Database connectivity and performance
  • Web application response times
  • Backup job completion

Network Infrastructure

  • Switch port status
  • Router interface throughput
  • UPS battery and load
  • Internet connectivity
  • VPN tunnel status

Security & Compliance

  • Failed login attempts
  • Certificate expiration
  • Antivirus definition age
  • Patch compliance status
  • Unauthorized service detection
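Certificate expiration shows how a simple check turns a surprise outage into a scheduled task. The sketch below assumes the certificate's notAfter string has already been retrieved, for example from the dictionary returned by Python's ssl.SSLSocket.getpeercert(); the function names and the 30-day threshold are hypothetical.

```python
import math
import ssl
from datetime import datetime


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days from `now` until the certificate's notAfter time.

    `not_after` uses the certificate text format, such as
    "Jan  5 09:34:43 2018 GMT". `now` should be timezone-aware, because
    a naive datetime would be interpreted as local time. The result is
    negative once the certificate has expired.
    """
    expires_ts = ssl.cert_time_to_seconds(not_after)
    return math.floor((expires_ts - now.timestamp()) / 86400)


def needs_renewal(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True when the certificate expires within `warn_days` days."""
    return days_until_expiry(not_after, now) <= warn_days
```

Rounding down makes the check err on the side of warning early, which is the safer direction for renewals.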

Key Takeaways

Open source doesn't mean amateur. OMD, Nagios, and Check_MK power monitoring at organizations of all sizes, including major enterprises. The tools are mature, well-documented, and actively maintained. The savings on licensing can be invested in proper implementation and customization.

Monitoring is a process, not a project. The initial deployment is just the beginning. Thresholds need tuning, new systems need to be added, and alert patterns need regular review. Budget ongoing time for monitoring maintenance.

Alert fatigue kills monitoring effectiveness. Every alert should require action. If the team learns to ignore alerts because most are noise, they'll ignore the important ones too. Ruthlessly tune thresholds and suppress alerts that don't indicate real problems.

Dashboards should match the audience. Technical staff need details. Management needs summaries. A single dashboard rarely serves both well. Build views appropriate to each audience's needs and decisions.


