Maintaining Server Health

The new IPMI standard defines interfaces for monitoring a server’s health and comes with many other management capabilities

Good server management is the key to reducing downtime and lost productivity. One way to tell whether a server is running normally is platform instrumentation, that is, monitoring the server’s physical characteristics such as temperatures, voltages, fans, and power supplies. IPMI, the Intelligent Platform Management Interface, is a specification that defines common interfaces to the hardware that monitors these physical health characteristics in a server.

Although many servers on the market today have some built-in monitoring capabilities, they are not interoperable. This lack of interoperability becomes a problem in heterogeneous and multi-server computing environments. IPMI enables interoperability between baseboards and servers, between baseboards and server-management software, and even between servers.

Intel, Hewlett-Packard, NEC, and Dell are the promoters of IPMI, which is currently at version 1.5. Besides defining commands, data structures, and formats for the hardware, IPMI also specifies common management functions: how the System Event Log and Sensor Data Records are managed and accessed, how the system interfaces work, how sensors operate, how control functions such as system power on/off and reset are initiated, and how the IPMI host-system watchdog timer operates.

IPMI also includes an Intelligent Platform Management Bus (IPMB) specification. This defines an internal management expansion bus that links chassis-management features with the motherboard’s management subsystem. There’s also an Intelligent Chassis Management Bus (ICMB) specification, which defines a dedicated ‘inter-chassis’ management bus for interconnecting IPMI management between multiple host systems and peripheral chassis.

Servers based on IPMI use hardware that keeps working even when the processor is down, so platform management information and control capabilities are always accessible. A system can therefore be managed in every phase, including power-down, pre-boot, and OS load. Management access can come through all the major communication interfaces: LAN, serial/modem, local management software, third-party emergency management add-on cards, and other IPMI-enabled servers.

Other key features of IPMI include automatic alerting, which notifies a remote destination of a problem, and recovery actions. Recovery actions include remotely commanding a system to power on or off, power cycle, reset, or trigger a diagnostic interrupt. IPMI can also set ‘boot options’ that apply to a remote startup. For example, when remotely resetting a server, you can have it boot from a partition other than the main OS partition.
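
As a rough illustration of how these recovery actions might look as a command, the Python sketch below packs each action into a three-byte request. The numeric codes follow one reading of the IPMI Chassis Control command and should be checked against the specification before use; the names ChassisAction and build_chassis_control are hypothetical, and the transport framing (LAN, serial, or IPMB) is left out entirely.

```python
from enum import IntEnum

# Network function and command number assumed for the "Chassis Control" request.
NETFN_CHASSIS = 0x00
CMD_CHASSIS_CONTROL = 0x02


class ChassisAction(IntEnum):
    """Recovery actions named in the article, with assumed control codes."""
    POWER_OFF = 0x00
    POWER_ON = 0x01
    POWER_CYCLE = 0x02
    RESET = 0x03
    DIAGNOSTIC_INTERRUPT = 0x04


def build_chassis_control(action: ChassisAction) -> bytes:
    """Return the (netfn, command, data) bytes for one recovery action.

    Hypothetical helper: session setup, checksums, and transport framing
    are omitted, so this is not a complete IPMI message.
    """
    return bytes([NETFN_CHASSIS, CMD_CHASSIS_CONTROL, int(action)])


if __name__ == "__main__":
    # Show the request body for a remote power cycle as hex.
    print(build_chassis_control(ChassisAction.POWER_CYCLE).hex())
```

In practice a management console would wrap these bytes in whatever framing its chosen interface requires and send them to the server’s management controller.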

The core of IPMI is a management microcontroller called the BMC (Baseboard Management Controller). It runs on standby power and checks system health at fixed intervals. If any component malfunctions, the BMC can log the event, generate alerts, and even perform automatic recovery actions such as powering down or resetting the system. The BMC has access to non-volatile storage that holds the Sensor Data Records, the System Event Log, and Field Replaceable Unit information such as unique hardware serial numbers.
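
The check, log, alert, and recover cycle described above can be pictured with a small Python model. This is only a sketch under stated assumptions, not how any real BMC firmware is written: the SensorRecord fields, the Bmc class, the sensor names, and the power-down policy are all hypothetical, and plain lists stand in for the System Event Log and the alert channel.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SensorRecord:
    """Simplified stand-in for a Sensor Data Record (hypothetical fields)."""
    name: str
    lower_critical: float
    upper_critical: float
    power_down_on_fault: bool = False  # assumed recovery policy flag


@dataclass
class Bmc:
    """Toy model of the BMC's check / log / alert / recover cycle."""
    records: List[SensorRecord]
    read_sensor: Callable[[str], float]  # returns a reading for a sensor name
    event_log: List[str] = field(default_factory=list)  # stands in for the SEL
    alerts: List[str] = field(default_factory=list)     # stands in for alerting
    powered_on: bool = True

    def poll_once(self) -> None:
        """Run one health check over every sensor record."""
        for rec in self.records:
            value = self.read_sensor(rec.name)
            if rec.lower_critical <= value <= rec.upper_critical:
                continue  # reading is within limits, nothing to do
            event = f"{rec.name} out of range: {value}"
            self.event_log.append(event)  # log the event
            self.alerts.append(event)     # notify a remote destination
            if rec.power_down_on_fault:
                self.powered_on = False   # automatic recovery action


if __name__ == "__main__":
    readings = {"cpu_temp": 95.0, "fan1_rpm": 4200.0}  # made-up sample values
    bmc = Bmc(
        records=[
            SensorRecord("cpu_temp", 5.0, 85.0, power_down_on_fault=True),
            SensorRecord("fan1_rpm", 1000.0, 10000.0),
        ],
        read_sensor=readings.__getitem__,
    )
    bmc.poll_once()
    print(bmc.event_log, bmc.powered_on)
```

A real controller would call something like poll_once on a timer, which corresponds to the fixed-interval checks mentioned above, and would read its limits from the stored Sensor Data Records rather than from values hard-coded in the program.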

IPMI has been adopted by vendors such as Sun and National Semiconductor, as well as by the companies that started the initiative.

Kunal Dua