In grad school we built a lot of logic boards from scratch. They were used for submersible robots, and we had a 350,000 gallon water tank that we kept heated to 88 degrees. This was three stories above the ground in a metal building. You can't really air condition that, so in the summer it got quite hot.
It was not uncommon to return from lunch to find that an embedded computer board that had been working when you left wasn't any more. One way to debug them was to put them in the refrigerator for a while. If they then worked, you knew you had a bad solder joint or an IC that was on the verge of failing.
I have at least two recurring cases where a full power-off was required to resolve the issue.
The first is a Dell Latitude laptop with a fingerprint reader: randomly, after a few days of operation, the fingerprint reader stops responding and the login screen freezes for a minute until it times out a few times.
A reboot does not solve it, nor does suspending the machine; it needs to be powered off and on again (hibernation to disk also works).
The second case is my PC with an ASRock Creator X570: after keeping it suspended for a long time, the WiFi card stopped functioning and just threw some errors in dmesg on driver initialization. Here even powering off and on did not help, but flipping the switch on the power supply for a few seconds resolved the issue.
I’ve actually had some strange anomalies like this happen on a couple of laptops I have. Rebooting didn’t work, and neither did holding the power button long enough to do what the manufacturer describes as some kind of CMOS or hard reset. I had to open up the bottom cover, unplug the battery completely, then plug it back in, and everything went back to operational condition.
Not everyone has the skills or knowledge to disassemble their laptop. I haven’t had a removable, easily replaceable battery since, I think, 2006-ish. My current one requires removing 8 security screws on the bottom and a bracket, and even I had some issues when I did a swap earlier this year.
The integrated wifi/bt on my AM5 board was so bad I had to disable it and use a PCIe card.
For obvious reasons AMD boards don’t tend to ship with Intel wifi, but in my experience anything else sucks. The Intel 6E cards are amazing and dirt cheap.
> For obvious reasons AMD boards don’t tend to ship with Intel wifi
Funnily enough, Threadripper boards (at least WRX90, and at least ASRock's) come with an Intel dual 10Gb LAN card. Probably because none of the alternatives are good enough for a pro board.
Was it the 9560 by chance? (The original AC / wifi5 one) Those were terrible. Our house isn’t practical to wire, so I had a lot of them. All swapped to AX210 cards (6E) and those work phenomenally.
I also dual boot, in addition to being an incurable distro hopper, and these AX210 cards worked out of the box in basically everything.
Yeah I’ve got a Lenovo Legion laptop that I dual-boot Windows and Linux. I haven’t tried in a while but for at least a year it was impossible to soft-reboot to switch OSes if you wanted wifi to work. My best theory was that Windows and Linux had different firmware that they loaded into it at boot and they weren’t reloading that after a soft reboot (just using whatever was already running on the card).
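If anyone wants to test that theory, the driver logs which firmware blob it loaded, so comparing the version after a cold boot versus after a soft reboot out of Windows should show a mismatch. A sketch, assuming an Intel card on the iwlwifi driver (other drivers log similar lines):

    dmesg | grep -i 'iwlwifi.*firmware'
    # look for a "loaded firmware version ..." line and compare across boots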
A BIOS can forget to reset some devices.
A physical device might have a design flaw where it forgets to reset some registers on reset.
A BIOS (including device firmware) can forget to zero some RAM/initialise a structure and get lucky.
The Linux kernel supports rebooting using a number of different strategies[1]. Some PCs need a different one than the default in order to make sure everything is properly reset.
Linux now uses exactly the same reboot strategy as Windows does, so no PC should "need" a different one - it may be the case that driver code leaves the hardware in a state the system vendor didn't test, and using a different reboot approach may work around that, but it's not fundamentally the reboot method that's causing the problem there (https://mjg59.dreamwidth.org/3561.html goes into some more detail on how all this actually works)
Yes, I didn't mean to imply that Linux was doing anything wrong, just that some hardware seems to work better with other approaches, for the reasons you state.
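For anyone who wants to experiment, the strategy can be overridden from the kernel command line. A sketch, with value names taken from Documentation/admin-guide/kernel-parameters.txt (the exact set varies by kernel version):

    # append to the kernel command line, e.g. in your GRUB config:
    reboot=acpi        # alternatives include: bios, kbd, triple, efi, pci
    # verify it took effect after booting:
    cat /proc/cmdline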
Whenever I power cycle something that doesn't go right the first time, I leave it off for at least 30 seconds so all the caps can discharge and any saved state can reset. Especially true of routers etc.
You can further be sure of this by pressing the On button while the power supply is disconnected. Ofc make sure it's always off when you connect or disconnect the power supply.
Indeed, this used to be my "secret trick" for laptops that wouldn't power on: Disconnect the battery and power supply, hold the power button for 30 seconds, then power it back up. Worked every time.
I had a desktop that would do something similar occasionally ~15 years ago. I am impatient, so rather than leaving it off for a while I would unplug it, hit the power button, plug it back in, and turn it on. Usually the fans would even spin for a fraction of a second, there was so much residual power in the caps.
I ran MECleaner once, and removed power from a desktop, waited ten seconds, plugged it back in, and the test for the presence of the ME was still positive.
I unplugged it and left it overnight, and the next day, the ME was gone.
This was the ARC version, but it can remain operational for some time after power is removed.
IIRC, on most modern intel cpus removing/blanking the ME will reboot the machine every 20 minutes or so. It is unfortunately an irremovable OEM hardware RAT on most modern systems.
That being said, there are some versions of BIOS that do allow turning the ME off, but most motherboard and laptop manufacturers will not allow general consumers to install that version of the firmware. There are some groups that have figured out how to sign a patched fully feature-unlocked BIOS on a per machine basis (disabling ME is a simple Y/N flag), but YMMV given these tools are nearly impossible to get working.
AMD should end the clown show of RATs, and eat the remaining Intel market. =3
> IIRC, on most modern intel cpus removing/blanking the ME will reboot the machine every 20 minutes or so. It is unfortunately an irremovable OEM hardware RAT on most modern systems.
Yes, if ME detects a problem when initializing it grants you a 20 minute window as a grace period, presumably to allow users to attempt to fix it.
> There are some groups that have figured out how to sign a patched fully feature-unlocked BIOS on a per machine basis (disabling ME is a simple Y/N flag), but YMMV given these tools are nearly impossible to get working.
You can also just flip the HAP bit[0], I'd assume that's what those advanced (usually leaked dev build) BIOS firmwares do anyway.
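For reference, me_cleaner can set that bit on a dump of the flash. A hedged sketch (the -S flag is from memory, and the ch341a programmer is just an example; verify everything against your own hardware before flashing):

    flashrom -p ch341a_spi -r dump.bin                 # read the SPI flash with an external programmer
    python me_cleaner.py -S -O patched.bin dump.bin    # -S: only set the HAP/AltMeDisable bit
    flashrom -p ch341a_spi -w patched.bin              # write the patched image back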
> AMD should end the clown show of RATs, and eat the remaining Intel market. =3
AMD has PSP[1], which is functionally equivalent (though with a significantly smaller attack surface, when left enabled)
I personally am of the belief that both technologies are likely backdoored. There's so much pointing against them[2], that the simplest explanation is they're more likely than not a mandated backdoor that chipmakers eventually expanded for other purposes (such as recent versions of ME handling suspend-related power management)
Computrace was replaced by the Absolute BIOS module, so yes... 100% RAT features have been active for some time. Whatever legitimate asset-recovery and remote drive-deletion features it offers are outweighed by the potential for backdoors on the refurbished PC market.
I am unaware of the PSL, but I know AMD PSP is the equivalent of ME for most AMD chips [0].
Some motherboards allow you to disable it, and it doesn't do as much as ME in the first place (no network modules or built-in remote access purpose like ME)
OP, what Linux is this? I'm really curious, I don't recognize that trace format and I can't find the code to print exception traces with the eight bangs on the first line like that anywhere in the upstream git history. I think they're actually from the BIOS?
!!!! X64 Exception Type - 12(#MC - Machine-Check) CPU Apic ID - 00000000 !!!!
My story: I had an Intel NUC running Linux back in the day, which would get stuck in standby such that I had to remove and replace the CMOS battery to get it to boot again! I never figured that one out...
This is a trace from the BIOS, it is not uncommon to have them printed over the serial console. Potentially the BIOS is based on EDK2 source code, in which case you can take a look here for the implementation of the trace printing logic: https://github.com/tianocore/edk2/blob/9e6537469d4700d9d793e...
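If you want to find the exact code, the format string is easy to grep for in the tree; a quick sketch (last I looked it lives under UefiCpuPkg's CpuExceptionHandlerLib):

    git clone --depth 1 https://github.com/tianocore/edk2
    grep -rn '!!!! X64 Exception Type' edk2/UefiCpuPkg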
Some Dells have a "feature" where something, somewhere, in their mess of a UEFI/iDRAC stack will get corrupted and will stay wrong through power cycles until you physically unplug the servers from power and hold down the power button to discharge a capacitor and clear out the NVRAM where the corrupted value is.
Most recently this impacted a PowerEdge R7525 server we have, where the iDRAC was enforcing a power cap of ~300 watts, leaving the system less than 1/10th as performant as it should have been. Manually setting a new power cap did nothing except update the values displayed in the UI. Multiple six-minute (because of their mess of a UEFI/iDRAC stack) reboots of both the server and the iDRAC did nothing.
Dell was less than useful except for the fact that they hosted the answer. After raging against their CSA script/LLM auto-reply bullshit for days an aggrieved user with the same issue looking for help in their forums finally posted that he did the cap drain trick and it worked.
Saved me tons of wasted time. Thanks, anonymous fellow frustrated dell customer!
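For anyone hitting the same thing: the cap the BMC is actually enforcing can be read out-of-band via DCMI, which is handy for checking whether the UI value is lying. A sketch (host and user are placeholders, and not every BMC implements the DCMI power commands):

    ipmitool -I lanplus -H <idrac-ip> -U root dcmi power reading    # instantaneous draw
    ipmitool -I lanplus -H <idrac-ip> -U root dcmi power get_limit  # the active cap, if any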
I usually prefer the 'reset' option (such as in IPMI). After all, this is the as-designed way to politely ask all devices to re-initialize.
Yes, power-cycling is less ambiguous, but AFAICT the example here is purely that power cycling really needs a noticeable off-period so that all devices can fully come down. Otherwise, there's no real standard for what should happen; this or that component might stay up or retain state.
The other reason I like 'reset' is that lots of devices (fans, disks, probably all power systems, definitely including PSUs) have lifetime limits in power cycles. Mostly this is minor, unless you do something like reboot cluster nodes after a job (conceivably a paranoid security requirement), or some automation gets in a loop and continually zaps a server.
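In ipmitool terms the distinction looks like this (a sketch; note that 'cycle' only guarantees a short off interval, so it's weaker than a manual 30-second drain):

    ipmitool chassis power reset    # warm reset: politely asks the board to re-initialize
    ipmitool chassis power cycle    # off, brief pause, on: a minimal power cycle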
I've seen similar behavior when trying out a fork bomb in the terminal on both Linux and Windows. My guess is that on Windows the fork bomb made it into virtual memory, got written to disk, and wasn't cleaned out completely during boot.
It took 3 reboots to clear up the errors. Generally, on the Linux system one extra reboot was necessary about half of the time.
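(For reference, the canonical bash fork bomb is the one-liner below; don't run it on a machine you care about.)

    :(){ :|:& };:    # defines a function ':' that pipes itself into itself and backgrounds the result, then calls it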
I’ve come across Acer laptops that’d always bluescreen on restart after a PROCHOT shutdown. The fix is to pull out the battery for a few seconds and then plug it back in; that magically clears the bluescreen.
Something like this happened to me once. Lost power in a lightning storm and when it came back my computer could no longer shut off.
Like, at all. Would just hang when you tried. Couldn’t exit from BIOS after changing settings, couldn’t suspend to RAM. Had to yoink the cord whenever I needed to restart. Wild stuff.
Perhaps like Frankenstein the lightning was a breath of life, and with its new sentience my PC was trying to preserve its existence. At any rate I reflashed the BIOS after a few months and it never happened again.
I've experienced a similar problem with a Thunderbolt port on a machine. Nothing that plugs into the machine would be recognized. Not even a simple USB device. Power cycling multiple times didn't fix it. But powering off and leaving the machine off for a few minutes fixed it.
Given the problem occurred only once, I didn't do any more investigation on why.
I have had several laptops over the years like this. Full shutdown and power on does not reset some problems, like missing audio, missing wifi, etc. For Lenovo devices, I have to go as far as using the 'recovery' button. This goes for DP Alt Mode as well. Kinda annoying, but at least there's a solution.
I think it was the 1970's when I first heard of the "remove power, wait a good while, try again" strategy.
The subject was a cheap little black & white TV set that my folks had. Dad was an amateur radio operator, who mostly built his own equipment. He could have disassembled it, traced circuits, and calculated the wait time if he'd cared to.