Hardware Monitoring & Troubleshooting Basics - Knowledgebase

This article provides essential information on monitoring your dedicated server's hardware health and basic troubleshooting steps. Proactive monitoring can help prevent costly downtime.

Introduction
- Unlike with a VPS, on a dedicated server you are responsible for watching the physical hardware. Monitoring hardware health lets you detect problems before they lead to outages.
- Key components to monitor: CPU, RAM, Storage (HDDs/SSDs), Fans, Temperatures, Power Supplies.
Utilizing IPMI/iLO/DRAC for Hardware Monitoring
- Your server's out-of-band (OOB) management interface is the primary tool for hardware monitoring, because it operates independently of the server's OS and stays reachable even when the OS is down.
- Steps:
  1. Log in to your IPMI/iLO/DRAC interface (as described in "Accessing Your Dedicated Server Control Panel").
  2. Navigate to sections like:
    - System Health: (iLO) Provides an overall health summary.
    - Sensor Readings / System Information: (IPMI/iLO/DRAC) Displays real-time data for CPU temperature, fan speeds, voltage levels, etc. Look for any values outside normal ranges (often highlighted in yellow/red).
    - Event Log: (IPMI/iLO/DRAC) Named the System Event Log (SEL) on IPMI and DRAC and the Integrated Management Log (IML) on iLO. It records hardware events, errors, and warnings (e.g., drive failures, memory errors, power supply issues). Check this log regularly.
    - Storage / RAID Controller: (DRAC/iLO) Check the status of your hard drives and RAID arrays. Look for "Degraded" or "Failed" states.
- Alerting: Some OOB interfaces allow configuration of email alerts for critical hardware events.
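The same sensor data can often be read from a shell, which is handy for scripted checks. The following is a minimal sketch, assuming ipmitool is installed and the BMC is reachable from the running OS; the function name flag_abnormal_sensors is hypothetical, and the three-column "name | reading | status" layout is what ipmitool sdr list typically prints, so vendor tools may differ.

```shell
# Read `ipmitool sdr list` output on stdin and print only the sensors
# whose status column is something other than "ok" or "ns" (no reading).
flag_abnormal_sensors() {
    awk -F'|' '
        NF >= 3 {
            status = $3
            gsub(/^[ \t]+|[ \t]+$/, "", status)   # trim surrounding whitespace
            if (status != "ok" && status != "ns") print
        }'
}

# Typical use on a live server (needs root and a working BMC driver):
#   sudo ipmitool sdr list | flag_abnormal_sensors
#   sudo ipmitool sel elist        # dump the hardware event log
```

No output means no sensor reported a warning or critical state. Anything printed is worth comparing against the web interface before contacting support.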
OS-Level Monitoring Tools (Complementary)
- The OOB interface remains the primary source for hardware health; OS tools add what the operating system itself has observed.
- Linux:
  - dmesg: Displays kernel ring buffer messages, which can show hardware errors detected by the OS.
  - smartctl (from smartmontools): Monitor S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) data for hard drives to predict failures.
      sudo apt install smartmontools   # Debian/Ubuntu
      sudo yum install smartmontools   # CentOS
      sudo smartctl -a /dev/sda        # Replace /dev/sda with your disk
  - lm_sensors: Monitor CPU/motherboard temperatures and fan speeds. Run sensors-detect once to find the available sensor chips, then run sensors to read them.
  - RAID Monitoring: If you have a hardware RAID card, install its specific monitoring utility (e.g., storcli or the older MegaCli for Broadcom/LSI, perccli for Dell PERC).
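A quick pass/fail check across several disks is often more practical than reading the full smartctl -a report. The sketch below assumes smartmontools is installed; the function name smart_health_ok is hypothetical. It matches the two health lines smartctl -H usually prints: one for ATA/NVMe drives and one for SAS/SCSI drives.

```shell
# Read `smartctl -H` output on stdin.
# Exit status 0 only if the drive reports a healthy overall status.
smart_health_ok() {
    grep -Eq 'test result: PASSED|SMART Health Status: OK'
}

# Example loop over SATA/SAS disks (commented out: needs root and real disks):
# for disk in /dev/sd?; do
#     sudo smartctl -H "$disk" | smart_health_ok \
#         || echo "WARNING: $disk did not report a healthy SMART status"
# done
```

Drives behind a hardware RAID controller are usually not visible as plain /dev/sdX devices, so use the controller's own utility for those.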
- Windows Server:
  - Event Viewer: Check "System" logs for hardware-related errors and warnings.
  - Task Manager / Resource Monitor: Show CPU and memory usage, but not hardware health itself.
  - Disk Management: Check disk status (Healthy, Failed).
  - Manufacturer Tools: Install specific monitoring software from Dell (OpenManage), HP (Insight Management Agents) for detailed hardware health.
Basic Hardware Troubleshooting Steps
- Sudden Shutdown/No Power:
  - Check power cables (if physical access).
  - Check power supply status lights on the server chassis.
  - Review IPMI/iLO/DRAC power log for unexpected shutdowns.
  - Contact ServerHood.com if the power supply shows no lights, or if the server stays unresponsive after a power cycle through the OOB interface.
- Server Lagging/Overheating:
  - Check CPU temperature and fan speeds via OOB or lm_sensors. High temps indicate cooling issues or excessive load.
  - Ensure physical air flow is not obstructed (if in your own rack).
  - Contact support if fans are failing or temperatures are consistently high.
- Hard Drive/RAID Issues:
  - Check RAID controller status via OOB or OS-level utility.
  - Look for "Degraded" or "Failed" status.
  - A failed drive is usually signaled by a fault LED on the drive bay or an audible alarm from the RAID controller. Notify ServerHood.com immediately so the drive can be replaced.
  - If no RAID, check smartctl output for failing attributes.
- Memory Errors:
  - Often appear as kernel panics (Linux) or Blue Screens of Death (Windows).
  - Event logs (OOB, OS) usually record memory errors, for example ECC errors attributed to a specific DIMM slot.
  - A module confirmed as faulty requires replacement. Contact ServerHood.com.
- No Video Output / No Boot:
  - Connect to the remote console via IPMI/iLO/DRAC to see whether any boot errors appear on the screen.
  - A server that does not boot often indicates OS corruption or a critical hardware failure (e.g., CPU, RAM, motherboard).
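For the drive and memory checks above, two small filters can narrow long tool output down to the lines that matter. This is a sketch under stated assumptions: the function names are hypothetical, the attribute names apply to ATA drives as printed by smartctl -A (raw value in the tenth column), and the kernel log patterns cover common messages but are not exhaustive.

```shell
# Read `smartctl -A` output on stdin and print "NAME RAW_VALUE" for
# sector-health attributes whose raw value is above zero.
nonzero_smart_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ \
         && $10 + 0 > 0 { print $2, $10 }'
}

# Read kernel log text on stdin and keep lines that commonly accompany
# memory, CPU, or disk faults.
scan_kernel_log() {
    grep -Ei 'hardware error|machine check|edac|i/o error'
}

# Typical use on a live Linux server (needs root):
#   sudo smartctl -A /dev/sda | nonzero_smart_attrs
#   sudo dmesg | scan_kernel_log
```

A non-zero count on any of these attributes, or repeated EDAC or machine-check lines, is useful evidence to attach to a support request.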
When to Contact ServerHood.com Support:
- Any detected hardware failure (disk, RAM, PSU, motherboard, fan).
- Inability to access IPMI/iLO/DRAC or if it shows a critical error.
- Persistent crashes or unresponsiveness that cannot be resolved by software reboots.
- Suspicion of power supply issues or other critical component failure.
- When you open a ticket, include any specific error messages, log entries, and diagnostic tool output you have collected.
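Gathering that information into one directory before opening a ticket can save a round trip with support. The sketch below is one possible approach, not a ServerHood.com tool: the function names, the file names, and the default disk /dev/sda are all assumptions, and tools that are not installed are simply noted in the output.

```shell
# capture OUTFILE COMMAND [ARGS...]
# Run a command and save its stdout and stderr to OUTFILE.
# If the tool is missing, record that fact instead of failing.
capture() {
    outfile=$1
    shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >"$outfile" 2>&1
    else
        echo "$1 not installed" >"$outfile"
    fi
    return 0
}

# collect_diagnostics DIR [DISK]
# Gather common hardware diagnostics into DIR (hypothetical file layout).
collect_diagnostics() {
    dir=$1
    disk=${2:-/dev/sda}
    mkdir -p "$dir" || return 1
    capture "$dir/dmesg.txt"        dmesg
    capture "$dir/smart.txt"        smartctl -a "$disk"
    capture "$dir/ipmi-sel.txt"     ipmitool sel elist
    capture "$dir/ipmi-sensors.txt" ipmitool sdr list
}
```

Run as root, for example collect_diagnostics /tmp/hw-diag /dev/sda, then review the files for anything sensitive before attaching them to the ticket.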
Conclusion
- Proactive hardware monitoring is key to a stable dedicated server. Regularly checking your OOB management interface and OS-level logs can help you identify and address hardware issues before they become critical.
