I am making steady progress towards moving the Computers Are Bad enterprise
cloud to its new home, here in New Mexico. One of the steps in this process is,
of course, purchasing a new server... the current Big Iron is getting rather
old (probably about a decade!) and here in town I'll have the rack space for
more machines anyway.
In our modern, cloud-centric industry, it is rare that I find myself comparing
the specifications of a Dell PowerEdge against an HP ProLiant. Because the
non-hyperscale server market has increasingly consolidated around Intel
specifications and reference designs, it is even rarer that there is much of a
difference between the major options.
This brings back to mind one of those ancient questions that comes up among
computer novices and becomes a writing prompt for technology bloggers. What
is a server? Is it just, like, a big computer? Or is it actually special?
There's a lot of industrial history wrapped up in that question, and the answer
is often very context-specific. But there are some generalizations we can make
about the history of the server: client-server computing originated mostly as
an evolution of time-sharing computing using multiple terminals connected to a
single computer. There was no expectation that terminals had a similar
architecture to computers (and indeed they were usually vastly simpler
machines), and that attitude carried over to client-server systems. The PC
revolution instilled a WinTel monoculture in much of client-side computing by
the mid-'90s, but it remained common into the '00s for servers to run entirely
different operating systems and architectures.
The SPARC and Solaris combination was very common for servers, as were IBM's
minicomputer architectures and their numerous operating systems. Indeed, one of
the key commercial contributions of Java was the way it allowed enterprise
applications to be written for a Solaris/SPARC backend while enabling code
reuse for clients that ran on either stalwarts like Unix/RISC or "modern"
business computing environments like Windows/x86. This model was sometimes
referred to as client-server computing with "thick clients." It preserved the
differentiation between "server" and "client" as classes of machines, and the
universal adherence of serious business software to this model led to an
association between server platforms and "enterprise computing."
Over time, things have changed, as they always do. Architectures that had been
relegated to servers became increasingly niche and struggled to compete with
the PC architecture on cost and performance. The general architecture of server
software shifted away from vertical scaling and high-uptime systems to
horizontal scaling with relaxed reliability requirements, taking away much of
the advantage of enterprise-class computers. For the most part, today, a server
is just a big computer. There are some distinguishing features: servers are far
more likely to be SMP or NUMA, with multiple processor sockets. While the days
of SAS and hardware RAID are increasingly behind us, servers continue to have
more complex storage controllers and topologies than clients. And servers,
almost by definition, offer some sort of out-of-band management.
Out-of-band management, sometimes also called lights-out management, refers to
a capability that is almost unheard of in clients. A separate, smaller
management computer allows for remote access to a server even when it is, say,
powered off. The terms out-of-band and in-band in this context emerge from
their customary uses in networking and telecom, meaning that out-of-band
management is performed without the use of the standard (we might say "data
plane") network connection to a machine. But in practice they have drifted in
meaning, and it is probably better to think of out-of-band management as
meaning that the operating system and general-purpose components are not
required. This might be made clearer by comparison: a very standard example of
in-band management would be SSH, a service provided by the software on a
computer that allows you to interact with it. Out-of-band management, by
contrast, is provided by a dedicated hardware and software stack and does not
require the operating system or, traditionally, even the CPU to cooperate.
You can imagine that this is a useful capability. Today, out-of-band management
is probably best exemplified by the remote console that most servers offer.
It's basically an embedded IP KVM, allowing you to interact with the machine as
if you were at a locally connected monitor and keyboard. A lot of OOB
management products also offer "virtual media," where you can upload an ISO
file to the management interface and then have it appear to the computer proper
as if it were a physical device. This is extremely useful for installing
operating systems.
OOB management is an interesting little corner of computer history. It's not a
new idea at all; in fact, similar capabilities can be found through pretty much
the entire history of business computing. If anything, it's gotten simpler and
more boring over time. A few evenings ago I was watching a clabretro
video about an IBM p5 he's gotten
working. As is the case in most of his videos about servers, he has to give a
brief explanation of the multiple layers of lower-level management systems
present in the p5 and their various textmode and web interfaces.
If we constrain our discussion of "servers" to relatively modern machines,
starting say in the late '80s or early '90s, there are some common features:
- Some sort of local operator interface (this term itself being a very old
one), like an LCD matrix display or grid of LED indicators, providing low-level
information on hardware health.
- A serial console with access to the early bootloader and a persistent
low-level management system.
- A higher-level management system, with a variable position in the stack
depending on architecture, for remote management of the machine workload.
A lot of this stuff still hangs around today. Most servers can tell you on the
front panel if a redundant component like a fan or power supply has failed,
although the number of components that are redundant and can be replaced online
has dwindled with time from "everything up to and including CPUs" on '90s
prestige architectures to sometimes little more than fans. Serial management is
still pretty common, mostly as a holdover of being a popular way to do OS
installation and maintenance on headless machines [1].
But for the most part, OOB management has consolidated in the exact same way as
processor architecture: onto Intel IPMI.
IPMI is confusing to some people for a couple of reasons. First, IPMI is a
specification, not an implementation. Most major vendors have their own
implementation of IPMI, often with features above and beyond the core IPMI
spec, and they call them weird acronyms like HP iLO and Dell DRAC. These
vendor-specific implementations often predate IPMI, too, so it's never quite
right to say they are "just IPMI." They're independent systems with IPMI
characteristics. On the other hand, more upstart manufacturers are more likely
to just call it IPMI, in which case it may just be the standard offering from
their firmware vendor.
Further confusing matters is a fair amount of terminological overlap. The IPMI
software runs on a processor conventionally called the baseboard management
controller or BMC, and the terms IPMI and BMC are sometimes used
interchangeably. Lights-out management or LOM is mostly an obsolete term but
sticks around because HP(E) is a fan of it and continues to call their IPMI
implementation Integrated Lights-Out. The BMC should not be confused with the
System Management Controller or SMC, which is one of a few terms used for a
component present in client computers to handle tasks like fan speed control.
These have an interrelated history and, indeed, the BMC handles those functions
in most servers.
IPMI also specifies two interfaces: an out-of-band interface available over the
network or a serial connection, and an in-band interface available to the
operating system via a driver (and, in practice, I believe communication
between the CPU and the baseboard management controller via the low-pin-count
or LPC bus, which is a weird little holdover of ISA present in most modern
computers). The result is that you can interact with the IPMI from a tool
running in the operating system, like ipmitool on Linux. That makes it a little
confusing what exactly is going on, if you don't understand that the IPMI is a
completely independent system that has a local interface to the running
operating system for convenience.
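To make the two paths concrete: the same ipmitool subcommands work whether you run them in-band (locally, over the driver to the BMC) or out-of-band (over the network, using ipmitool's lanplus interface for IPMI v2.0 over LAN). Here's a little sketch of a helper that builds either invocation; the host, username, and password are placeholders, and this is just an illustration of the flags, not anyone's production tooling.

```python
def ipmitool_cmd(args, host=None, user=None, password=None):
    """Build an ipmitool invocation.

    With no host, ipmitool talks to the local BMC through the in-band
    system interface (the OS driver). With a host, it speaks
    IPMI-over-LAN directly to the BMC, using the lanplus (IPMI v2.0)
    interface.
    """
    cmd = ["ipmitool"]
    if host is not None:
        cmd += ["-I", "lanplus", "-H", host, "-U", user, "-P", password]
    return cmd + list(args)

# In-band, from the running operating system:
ipmitool_cmd(["sensor", "list"])
# Out-of-band, straight to the BMC over the network:
ipmitool_cmd(["power", "status"],
             host="10.0.0.10", user="admin", password="secret")
```

Either way you end up talking to the same independent system; the only difference is which door you came in through.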
What does the IPMI actually do? Well, like most things, it's mostly become a
webapp. Web interfaces are just too convenient to turn down, so while a lot of
IPMI products do have dedicated client software, they're porting all the
features into an embedded web application. The quality of these web interfaces
varies widely but is mostly not very good. That raises a question, of course,
of how you get to the IPMI web interface.
Most servers on the market have a dedicated ethernet interface for the IPMI,
often labelled "IPMI" or "management" or something like that. Most people would
agree that the best way to use IPMI is to put the management network interface
onto a dedicated physical network, for reasons of both security and reliability
(IPMI should remain accessible even in case of performance or reliability
problems with your main network). A dedicated physical network costs time,
space, and money, though, so there are compromises. For one, your "management
network" is very likely to be a VLAN on your normal network equipment. That's
sort of like what AT&T calls a common-control switching arrangement, meaning
that it behaves like an independent, private network but shares all of the
actual equipment with everything else, the isolation being implemented in
software. That was a weird comparison to make and I probably just need to write
a whole article on CCSAs like I've been meaning to.
Even that approach requires extra cabling, though, so IPMI offers "sideband"
networking. With sideband management, the BMC communicates directly with the
same NIC that the operating system uses. The implementation is a little bit
weird: the NIC will pretend to be two different interfaces, mixing IPMI traffic
into the same packet stream as host traffic but using a different MAC
address. This way, it appears to other network equipment as if there are two
different network interfaces in use, as usual. I will leave judgment as to how
good of an idea this is to you, but there are obvious security considerations
around reducing the segregation between IPMI and application traffic.
And yes, it should be said, a lot of IPMI implementations have proven to be
security nightmares. They should never be accessible to any untrusted person.
Details of network features vary between IPMI implementations, but there is a
standard interface on UDP 623 that can be used for discovery and basic
commands. There's often SSH and a web interface, and VNC is pretty common for
remote console.
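That UDP 623 traffic is RMCP, and the standard discovery probe is the ASF "presence ping," a fixed 12-byte datagram. As a sketch of what's actually on the wire (based on my reading of the ASF/RMCP layout, so treat the byte-level details as my reconstruction rather than gospel):

```python
import socket
import struct

def rmcp_presence_ping(tag=0x01):
    """Build an RMCP/ASF Presence Ping, the standard UDP 623 discovery probe."""
    # RMCP header: version 6, reserved, sequence number (0xFF = no ACK
    # requested), message class 6 (ASF).
    header = struct.pack("BBBB", 0x06, 0x00, 0xFF, 0x06)
    # ASF message: IANA enterprise number 4542 (ASF), message type 0x80
    # (Presence Ping), a tag to match the reply, reserved, zero data length.
    body = struct.pack(">IBBBB", 4542, 0x80, tag, 0x00, 0x00)
    return header + body

def ping_bmc(host, timeout=2.0):
    """Send the ping to UDP 623; an RMCP-speaking BMC answers with a Presence Pong."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.sendto(rmcp_presence_ping(), (host, 623))
        data, _ = s.recvfrom(1024)
        return data  # Presence Pong, advertising supported entities
    except socket.timeout:
        return None
    finally:
        s.close()
```

This is roughly what network scanners do to find BMCs on a subnet, which is also exactly why you don't want your management network reachable by anyone untrusted.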
There are some neat basic functions you can perform with the IPMI, either over
the network or locally using an in-band IPMI client. A useful one, if you are
forgetful and keep poor records like I do, is listing the hardware modules
making up the machine at an FRU or vendor part number level. You can also
interact with basic hardware functions like sensors, power state, fans, etc.
IPMI offers a standard watchdog timer, which can be combined with software
running on the operating system to ensure that the server will be reset if
the application gets into an unhealthy state. You should set a long enough
timeout to allow the system to boot and for you to connect and disable the
watchdog timer, ask me how I know.
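The pattern is: arm the timer, then have something on the host "pet" it periodically. If the host wedges badly enough that the petting stops, the countdown expires and the BMC takes its configured action, typically a hard reset. A minimal sketch of the petting side, assuming ipmitool is available in-band (the runner parameter is just there so the loop can be exercised with a stub instead of real hardware):

```python
import subprocess
import time

def pet_watchdog(interval=60, iterations=None, runner=subprocess.run):
    """Periodically reset the BMC watchdog timer.

    'ipmitool mc watchdog reset' restarts the BMC's countdown. If this
    loop ever stops running (kernel hang, crashed daemon), the countdown
    expires and the BMC resets the machine.
    """
    count = 0
    while iterations is None or count < iterations:
        runner(["ipmitool", "mc", "watchdog", "reset"], check=True)
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval)
```

In real life you'd run something like this as a service, and you'd tie the reset to an actual health check rather than petting unconditionally, since a watchdog that always gets petted protects you from nothing.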
One of the reasons I thought to write about IPMI is its strange relationship to
the world of everyday client computers. IPMI is very common in enterprise
servers but very rare elsewhere, much to the consternation of people like me
that don't have the space or noise tolerance for a 1U pizzabox in their homes.
If you are trying to stick to compact or low-power computers, you'll pretty
much have to go without.
But then, there's kind of a weird exception. What about Intel ME and AMD ST?
These are essentially OOB management controllers that are present in virtually
all Intel and AMD processors. This is kind of an odd story. Intel ME, the
Management Engine, is an enabling component of Intel Active Management
Technology (Intel AMT). AMT was pretty much an attempt at popularizing OOB
management for client machines, and offers most of the same capabilities as
IPMI. It has been considerably less successful. Most of that is probably due to
pricing: Intel has limited almost all AMT features to use with their very
costly enterprise management platforms. Perhaps there is some industry in which
these sell well, but I am apparently not in it. There are open-source AMT
clients, but the next problem you will run into is finding a machine where AMT
is actually usable.
The fact that Intel AMT has sideband management capability, and that therefore
the Intel ME component on which AMT runs has sideband management capability,
was the topic of quite some consternation in the security community. Here is a
mitigating factor: sideband management is only possible if the processor,
motherboard chipset, and NIC are all AMT-capable. Options for all three devices
are limited to Intel products with the vPro badge. The unpopularity of Intel
NICs in consumer devices alone means that sideband access is rarely possible.
vPro is also limited to relatively high-end processors and chipsets. The bad
news is that you will have a hard time using AMT in your homelab, although some
people certainly do. The upside is that the widely-reported "fact" that Intel
ME is accessible via sideband networking on consumer devices is typically
untrue, and for reasons beyond Intel software licensing.
That leaves an odd question around Intel ME itself, though, which is certainly
OOB management-like but doesn't really have any OOB management features without
AMT. So why do nearly all processors have it? Well, this is somewhat
speculative, but the impression I get is that Intel ME exists mostly as a
convenient way to host and manage trusted execution components that are used
for things like Secure Boot and DRM. These features all run on the same
processor as ME and share some common technology stack. The "management"
portion of Intel ME is thus largely vestigial, and it's part of the secure
computing infrastructure.
This is not to make excuses for Intel ME, which is entirely unauditable by
third parties and has harbored significant security vulnerabilities in the
past. But, remember, we all use one processor architecture from one of two
vendors, so Intel doesn't have a whole lot of motivation to do better. Lest
you respond that ARM is the way, remember that modern ARM SOCs used in
consumer devices have pretty much identical capabilities.
It is what it is.
[1] The definition of "headless" is sticky and we have to not get stuck on it
too much. People tend to say "headless" to mean no monitor and keyboard
attached, but keep in mind that slide-out rack consoles and IP KVMs have been
common for a long time and so in non-hyperscale environments truly headless
machines are rarer than you would think. Part of this is because using a serial
console is a monumental pain in the ass, so your typical computer operator will
do a lot to avoid dealing with it. Before LCD displays, this meant a CRT and
keyboard on an Anthro cart with wheels, but now that we are an enlightened
society, you can cram a whole monitor and keyboard into 1U and get a KVM
switching fabric that can cover the whole rack. Or swap cables. Mostly swap
cables.