A manufacturer with 45+ servers across a mixed Windows/Linux environment went from purely reactive IT -- finding out about problems when users complained -- to proactive monitoring with automated alerts. 85% of issues are now resolved before users are affected. Uptime hit 99.9%.
45+ Servers Monitored
99.9% Uptime Achieved
85% Issues Caught Early
15 min Avg Response Time
Industry: Manufacturing
Environment: Hybrid (on-premise + cloud)
Tools: OMD/Check_MK, Webmin, Custom Dashboards
Infrastructure: Windows, Linux, network devices
The Challenge
The IT team found out about server problems the same way everyone else did -- when something stopped working and someone complained. Disks filled up until applications crashed. Failed backups went unnoticed for days. Memory leaks degraded performance until mid-day reboots were required.
Every incident was an emergency. No warning, no time to plan, no ability to address issues during maintenance windows. The reactive approach was burning out IT staff and frustrating every department that depended on the systems.
The Solution
OMD Monitoring Core
We deployed OMD (Open Monitoring Distribution) with Nagios/Check_MK as the monitoring foundation. Lightweight agents on each server report CPU, memory, disk, network, processes, and service status. Network devices are monitored via SNMP. Each server type gets tailored checks -- database servers get query performance monitoring, domain controllers get replication health checks, and file servers get share availability and quota checks.
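To illustrate what a tailored check can look like: Check_MK agents can run "local checks", small scripts that print one line per service in the form state, service name, metrics, and detail text. The sketch below reports disk usage for one mount point. The service name `Disk_root`, the mount path, and the 80%/95% thresholds (borrowed from the alert tiers in this case study) are illustrative assumptions, not the configuration used in this deployment.

```python
import shutil

# Check_MK local check states: 0 = OK, 1 = WARN, 2 = CRIT
OK, WARN, CRIT = 0, 1, 2


def disk_state(used_pct: float, warn: float = 80.0, crit: float = 95.0) -> int:
    """Map a usage percentage to a check state (thresholds are assumptions)."""
    if used_pct >= crit:
        return CRIT
    if used_pct >= warn:
        return WARN
    return OK


def local_check_line(service: str, used_pct: float,
                     warn: float = 80.0, crit: float = 95.0) -> str:
    """Build one local-check output line: state, name, metric, detail text."""
    state = disk_state(used_pct, warn, crit)
    metric = f"used_pct={used_pct:.1f};{warn};{crit}"
    return f"{state} {service} {metric} {used_pct:.1f}% used"


def disk_used_pct(path: str) -> float:
    """Return the used share of the filesystem containing `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    # Hypothetical service name and mount point
    print(local_check_line("Disk_root", disk_used_pct("/")))
```

Keeping the threshold logic in a separate function from the filesystem call makes it possible to test the logic without touching a real disk.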
Tiered Alert System
Warnings (disk at 80%, slow backup, elevated CPU) go to the IT team's shared inbox. Critical alerts (server down, critical service stopped, disk at 95%) go directly to on-call staff via email and SMS. Escalation rules push unacknowledged criticals to additional staff and management after 15 minutes.
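The routing rules above can be sketched in a few lines of Python. This is a simplified model, not the notification configuration from the deployment: the `Alert` class and the recipient labels are hypothetical names chosen for illustration. Only the 15-minute escalation delay and the warning/critical split come from the rules described here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# From the case study: unacknowledged criticals escalate after 15 minutes
ESCALATION_DELAY = timedelta(minutes=15)


@dataclass
class Alert:
    severity: str                       # "warning" or "critical"
    raised_at: datetime
    acknowledged_at: Optional[datetime] = None


def recipients(alert: Alert, now: datetime) -> List[str]:
    """Return who should be notified for `alert` at time `now`."""
    if alert.severity == "warning":
        # Warnings go only to the shared inbox
        return ["shared-inbox"]

    # Criticals go straight to on-call staff by email and SMS
    targets = ["on-call-email", "on-call-sms"]

    # Escalate only if nobody has acknowledged within the delay
    unacknowledged = alert.acknowledged_at is None
    if unacknowledged and now - alert.raised_at >= ESCALATION_DELAY:
        targets += ["additional-staff", "management"]
    return targets
```

An acknowledged critical never escalates in this model, which rewards on-call staff for acknowledging quickly even when the fix takes longer.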
We spent significant time tuning thresholds and suppressing noise. Every alert must represent something requiring human action. Alert fatigue is the fastest way to make a monitoring system useless.
Custom Dashboards
Operations: Wall-mounted display in IT area with green/yellow/red status for all critical systems. Readable from across the room, auto-refreshing.
Executive: Uptime percentages, incident trends, capacity utilization. Business metrics, no technical detail.
Service-Specific: Focused views for ERP environment, database performance, and integration service health.
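A green/yellow/red wallboard usually reduces many individual checks to one color per system. The sketch below shows one way to do that with a "worst state wins" roll-up over Nagios-style states. The severity ranking, the choice to show UNKNOWN as yellow, and the system names in the test data are assumptions made for illustration, not a description of how these dashboards were built.

```python
from typing import Dict, List

# Nagios-style service states
OK, WARN, CRIT, UNKNOWN = 0, 1, 2, 3

# Assumed ranking for "worst state wins": CRIT > UNKNOWN > WARN > OK
SEVERITY_RANK = {OK: 0, WARN: 1, UNKNOWN: 2, CRIT: 3}

# Assumed color mapping; UNKNOWN is displayed as yellow
COLOR = {OK: "green", WARN: "yellow", UNKNOWN: "yellow", CRIT: "red"}


def system_color(states: List[int]) -> str:
    """Reduce all check states of one system to a single display color."""
    if not states:
        # A system with no check data should not look healthy
        return COLOR[UNKNOWN]
    worst = max(states, key=lambda s: SEVERITY_RANK[s])
    return COLOR[worst]


def wallboard(checks: Dict[str, List[int]]) -> Dict[str, str]:
    """Map each system name to the color shown on the wall display."""
    return {system: system_color(states) for system, states in checks.items()}
```

Treating a system with no data as yellow rather than green avoids the failure mode where a dead agent looks like a healthy server.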
Linux Server Management
Webmin was deployed to all Linux servers for browser-based administration -- user management, package updates, service control, log viewing, and cron job management. IT staff can respond to alerts and resolve issues without leaving the browser.
Implementation
Week 1: OMD server deployed, critical systems monitored immediately.
Week 2: Agents rolled out to all 45+ servers. Service-specific checks configured.
Week 3: Webmin deployed. Email alerting configured with initial thresholds.
Week 4: Custom dashboards built. Thresholds tuned against baseline data. IT staff trained.
Ongoing: Continuous threshold refinement, new checks as needs emerge, regular alert pattern review.
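The timeline above does not spell out how thresholds were tuned against baseline data. One common approach, shown here only as a sketch, is to set the warning level near the 95th percentile of baseline samples and the critical level near the 99th, so that alerts fire on values that are unusual for that specific server. The percentile choices and the function name `thresholds_from_baseline` are assumptions.

```python
import statistics
from typing import List, Tuple


def thresholds_from_baseline(samples: List[float]) -> Tuple[float, float]:
    """Derive (warn, crit) levels from baseline samples.

    Assumption: warn = 95th percentile, crit = 99th percentile.
    """
    if len(samples) < 2:
        raise ValueError("need at least two baseline samples")
    # n=100 yields 99 cut points; index 94 is p95, index 98 is p99.
    # method="inclusive" keeps results within the observed min/max.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[94], cuts[98]


if __name__ == "__main__":
    # Hypothetical baseline: CPU utilization samples in percent
    baseline = [12, 15, 14, 22, 35, 18, 17, 40, 16, 13, 55, 19]
    warn, crit = thresholds_from_baseline(baseline)
    print(f"warn at {warn:.1f}%, crit at {crit:.1f}%")
```

Percentile-based levels are a starting point only; they still need review against what actually requires human action.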
The Results
99.9% uptime. Early warning lets the team address problems before outages. Planned maintenance replaces emergency firefighting.
85% of issues caught before user impact. Disk warnings, memory pressure, failing services resolved proactively instead of reactively.
15-minute average response time. The right people know immediately when something goes wrong. No waiting for user reports.
Capacity planning enabled. Historical trends show a database server growing 5% per month. The team plans ahead instead of being surprised.
Fewer after-hours emergencies. Issues caught during business hours when they are easier and cheaper to address.
Vendor accountability. Documented outage windows give IT leverage with ISPs and cloud providers.
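Two of the figures above lend themselves to quick arithmetic: how much downtime 99.9% uptime actually allows, and how long a resource growing 5% per month lasts before it hits a limit. The sketch below works through both. The 500 GB and 1,000 GB figures are hypothetical, and the growth model assumes the 5% compounds monthly on disk usage, which the case study does not state explicitly.

```python
def allowed_downtime_minutes(uptime_pct: float, days: int) -> float:
    """Downtime budget, in minutes, for an uptime target over `days` days."""
    total_minutes = days * 24 * 60
    return (1 - uptime_pct / 100) * total_minutes


def months_until(current: float, limit: float, monthly_growth: float) -> int:
    """Smallest whole number of months until `current` reaches `limit`,
    assuming compound growth of `monthly_growth` (0.05 means 5%)."""
    if current <= 0 or monthly_growth <= 0:
        raise ValueError("current and monthly_growth must be positive")
    months = 0
    value = current
    while value < limit:
        value *= 1 + monthly_growth
        months += 1
    return months


if __name__ == "__main__":
    # 99.9% uptime leaves roughly 43 minutes of downtime in a 30-day month
    print(round(allowed_downtime_minutes(99.9, 30), 1))
    # Hypothetical: 500 GB used on a 1,000 GB volume, growing 5% per month
    print(months_until(500, 800, 0.05))    # months until the 80% warning level
    print(months_until(500, 1000, 0.05))   # months until the volume is full
```

Under these assumptions the volume crosses the 80% warning level months before it fills, which is the planning window that trend data provides.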
What We Monitor
Server Health: CPU, memory, disk space/I/O, network, uptime
Services: Windows services, Linux daemons, database connectivity, web app response times, backup completion
Network: Switch ports, router throughput, UPS battery/load, internet, VPN tunnels
Security: Failed logins, certificate expiration, AV definitions, patch compliance
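As an example of one item from the list above, certificate expiration can be checked with the Python standard library alone. The sketch below fetches a server certificate's `notAfter` date and classifies the remaining lifetime. The 30-day and 7-day thresholds and the function names are assumptions for illustration; the monitoring stack described here would normally use its own built-in certificate checks.

```python
import socket
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the certificate's notAfter string, e.g. 'Jan 01 00:00:00 2030 GMT'."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (epoch seconds, default: current time) and expiry."""
    if now is None:
        now = time.time()
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / SECONDS_PER_DAY


def expiry_state(days_left: float, warn_days: float = 30.0,
                 crit_days: float = 7.0) -> str:
    """Classify remaining lifetime (threshold values are assumptions)."""
    if days_left < crit_days:
        return "critical"
    if days_left < warn_days:
        return "warning"
    return "ok"
```

A typical call chains the three functions: `expiry_state(days_until_expiry(fetch_not_after("example.com")))`, where the hostname is a placeholder.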
Finding out about server problems from your users?
We will assess your infrastructure and show you how proactive monitoring can eliminate surprise outages.
Book a Free Workflow Audit
Related Case Studies
6 Systems Connected for Manufacturer
40% admin time reduction by integrating ERP, CRM, shipping, phone, and BI.
Read Case Study
5 Companies Unified into One ERP & CRM
Post-acquisition data consolidation: 180K+ records processed, 34% duplicates eliminated.
Read Case Study