In grad school we built a lot of logic boards from scratch. They were used for submersible robots, and we had a 350,000 gallon water tank that we kept heated to 88 degrees. This was three stories above the ground in a metal building. You can't really air condition that, so in the summer it got quite hot.
It was not uncommon to return from lunch to find that an embedded computer board that had been working when you left wasn't any more. One way to debug them was to put them in the refrigerator for a while. If they then worked, you knew you had a bad solder joint or an IC that was on the verge of failing.
I have at least two recurring cases where a full power-off was required to resolve the issue.
The first is a Dell Latitude laptop with a fingerprint reader: randomly, after a few days of operation, the fingerprint reader stops responding and the login screen freezes for a minute until it times out a few times.
A reboot does not solve it, nor does suspending the machine; it needs to be powered off and on again (hibernation to disk also works).
The second case is my PC with an ASRock Creator X570: after keeping it suspended for a long time, the WiFi card stopped functioning and just threw some errors in dmesg on driver initialization. Here even powering off and on did not help, but flipping the switch on the power supply for a few seconds resolved the issue.
I’ve actually had some strange anomalies like this happen on a couple of laptops I have. Rebooting didn’t work, and neither did holding the power button long enough to do what the manufacturer describes as some kind of CMOS or hard reset. I had to open up the bottom cover, unplug the battery completely, then plug it back in, and everything went back to operational condition.
Not everyone has the skills or knowledge to disassemble their laptop. I haven’t had a removable, easily replaceable battery since, I think, 2006-ish. My current one requires removing 8 security screws on the bottom and a bracket, and even I had some issues when I did a swap earlier this year.
The integrated wifi/bt on my AM5 board was so bad I had to disable it and use a PCIe card.
For obvious reasons AMD boards don’t tend to ship with Intel wifi, but in my experience anything else sucks. The Intel 6E cards are amazing and dirt cheap.
> For obvious reasons AMD boards don’t tend to ship with Intel wifi
Funnily enough, Threadripper boards (at least WRX90, and at least ASRock's) come with an Intel dual 10Gb LAN card. Probably because none of the alternatives are good enough for a pro board.
Was it the 9560 by chance? (The original AC / wifi5 one) Those were terrible. Our house isn’t practical to wire, so I had a lot of them. All swapped to AX210 cards (6E) and those work phenomenally.
I also dual boot, in addition to being an incurable distro hopper, and these AX210 cards worked out of the box in basically everything.
Yeah I’ve got a Lenovo Legion laptop that I dual-boot Windows and Linux. I haven’t tried in a while but for at least a year it was impossible to soft-reboot to switch OSes if you wanted wifi to work. My best theory was that Windows and Linux had different firmware that they loaded into it at boot and they weren’t reloading that after a soft reboot (just using whatever was already running on the card).
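If anyone wants to test that theory, the driver logs which firmware blob it loaded, so comparing the version after a cold boot versus after a soft reboot out of Windows should show a mismatch. A sketch, assuming an Intel card on the iwlwifi driver (other drivers log similar lines):

    dmesg | grep -i 'iwlwifi.*firmware'
    # look for a "loaded firmware version ..." line and compare across boots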
A BIOS can forget to reset some devices.
A physical device might have a design flaw where it forgets to reset some registers on reset.
A BIOS (including device firmware) can forget to zero some RAM/initialise a structure and get lucky.
The Linux kernel supports rebooting using a number of different strategies[1]. Some PCs need a different one than the default in order to make sure everything is properly reset.
Linux now uses exactly the same reboot strategy as Windows does, so no PC should "need" a different one - it may be the case that driver code leaves the hardware in a state the system vendor didn't test, and using a different reboot approach may work around that, but it's not fundamentally the reboot method that's causing the problem there (https://mjg59.dreamwidth.org/3561.html goes into some more detail on how all this actually works)
Yes, I didn't mean to imply that Linux was doing anything wrong, just that some hardware seems to work better with other approaches, for the reasons you state.
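For anyone who wants to experiment, the strategy can be overridden from the kernel command line. A sketch, with value names taken from Documentation/admin-guide/kernel-parameters.txt (the exact set varies by kernel version):

    # append to the kernel command line, e.g. in your GRUB config:
    reboot=acpi        # alternatives include: bios, kbd, triple, efi, pci
    # verify it took effect after booting:
    cat /proc/cmdline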
Whenever I power cycle something that doesn't go right the first time, I leave it off for at least 30 seconds so all the caps can discharge and any saved state can reset. Especially true of routers etc.
You can further be sure of this by pressing the On button while the power supply is disconnected. Ofc make sure it's always off when you connect or disconnect the power supply.
Indeed, this used to be my "secret trick" for laptops that wouldn't power on: Disconnect the battery and power supply, hold the power button for 30 seconds, then power it back up. Worked every time.
I had a desktop that would do something similar occasionally ~15 years ago. I am impatient, so rather than leaving it off for a while I would unplug it, hit the power button, plug it back in, and turn it on. Usually the fans would even spin for a fraction of a second, there was so much residual power in the caps.
I ran MECleaner once, and removed power from a desktop, waited ten seconds, plugged it back in, and the test for the presence of the ME was still positive.
I unplugged it and left it overnight, and the next day, the ME was gone.
This was the ARC version, but it can remain operational for some time after power is removed.
IIRC, on most modern intel cpus removing/blanking the ME will reboot the machine every 20 minutes or so. It is unfortunately an irremovable OEM hardware RAT on most modern systems.
That being said, there are some versions of BIOS that do allow turning the ME off, but most motherboard and laptop manufacturers will not allow general consumers to install that version of the firmware. There are some groups that have figured out how to sign a patched fully feature-unlocked BIOS on a per machine basis (disabling ME is a simple Y/N flag), but YMMV given these tools are nearly impossible to get working.
AMD should end the clown show of RATs, and eat the remaining Intel market. =3
> IIRC, on most modern intel cpus removing/blanking the ME will reboot the machine every 20 minutes or so. It is unfortunately an irremovable OEM hardware RAT on most modern systems.
Yes, if ME detects a problem when initializing it grants you a 20 minute window as a grace period, presumably to allow users to attempt to fix it.
> There are some groups that have figured out how to sign a patched fully feature-unlocked BIOS on a per machine basis (disabling ME is a simple Y/N flag), but YMMV given these tools are nearly impossible to get working.
You can also just flip the HAP bit[0], I'd assume that's what those advanced (usually leaked dev build) BIOS firmwares do anyway.
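For reference, me_cleaner can set that bit on a dump of the flash. A hedged sketch (the -S flag is from memory, and the ch341a programmer is just an example; verify everything against your own hardware before flashing):

    flashrom -p ch341a_spi -r dump.bin                 # read the SPI flash with an external programmer
    python me_cleaner.py -S -O patched.bin dump.bin    # -S: only set the HAP/AltMeDisable bit
    flashrom -p ch341a_spi -w patched.bin              # write the patched image back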
> AMD should end the clown show of RATs, and eat the remaining Intel market. =3
AMD has PSP[1], which is functionally equivalent (though with a significantly smaller attack surface, when left enabled)
I personally am of the belief that both technologies are likely backdoored. There's so much pointing against them[2], that the simplest explanation is they're more likely than not a mandated backdoor that chipmakers eventually expanded for other purposes (such as recent versions of ME handling suspend-related power management)
Computrace was replaced by the Absolute BIOS module, so yes... 100% RAT features have been active for some time. Whatever legitimate asset-recovery and remote drive-deletion features it offers are outweighed by the potential for backdoors on the refurbished PC market.
I am unaware of the PSL, but I know AMD PSP is the equivalent of ME for most AMD chips [0].
Some motherboards allow you to disable it, and it doesn't do as much as ME in the first place (no network modules or built-in remote access purpose like ME)
OP, what Linux is this? I'm really curious, I don't recognize that trace format and I can't find the code to print exception traces with the eight bangs on the first line like that anywhere in the upstream git history. I think they're actually from the BIOS?
!!!! X64 Exception Type - 12(#MC - Machine-Check) CPU Apic ID - 00000000 !!!!
My story: I had an Intel NUC running Linux back in the day, which would get stuck in standby such that I had to remove and replace the CMOS battery to get it to boot again! I never figured that one out...
This is a trace from the BIOS, it is not uncommon to have them printed over the serial console. Potentially the BIOS is based on EDK2 source code, in which case you can take a look here for the implementation of the trace printing logic: https://github.com/tianocore/edk2/blob/9e6537469d4700d9d793e...
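If you want to find the exact code, the format string is easy to grep for in the tree; a quick sketch (last I looked it lives under UefiCpuPkg's CpuExceptionHandlerLib):

    git clone --depth 1 https://github.com/tianocore/edk2
    grep -rn '!!!! X64 Exception Type' edk2/UefiCpuPkg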
Some Dells have a "feature" where something, somewhere, in their mess of a UEFI/iDRAC stack will get corrupted and will stay wrong through power cycles until you physically unplug the servers from power and hold down the power button to discharge a capacitor and clear out the NVRAM where the corrupted value is.
Most recently this impacted a PowerEdge R7525 server we have, where the iDRAC was enforcing a power cap of ~300 watts, leaving the system less than 1/10th as performant as it should have been. Manually setting a new power cap did nothing except update the values displayed in the UI. Multiple six-minute (because of their mess of a UEFI/iDRAC stack) reboots of both the server and the iDRAC did nothing.
Dell was less than useful except for the fact that they hosted the answer. After raging against their CSA script/LLM auto-reply bullshit for days an aggrieved user with the same issue looking for help in their forums finally posted that he did the cap drain trick and it worked.
Saved me tons of wasted time. Thanks, anonymous fellow frustrated dell customer!
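For anyone hitting the same thing: the cap the BMC is actually enforcing can be read out-of-band via DCMI, which is handy for checking whether the UI value is lying. A sketch (host and user are placeholders, and not every BMC implements the DCMI power commands):

    ipmitool -I lanplus -H <idrac-ip> -U root dcmi power reading    # instantaneous draw
    ipmitool -I lanplus -H <idrac-ip> -U root dcmi power get_limit  # the active cap, if any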
I usually prefer the 'reset' option (such as in IPMI). After all, this is the as-designed way to politely ask all devices to re-initialize.
Yes, power-cycling is less ambiguous, but AFAICT the example here is purely that power cycling really needs a noticeable off-period so that all devices can fully come down. Otherwise, there's no real standard for what should happen; this or that component might stay up or retain state.
The other reason I like 'reset' is that lots of devices (fans, disks, probably all power systems, definitely including PSUs) have lifetime limits in power cycles. Mostly this is minor, unless you do something like reboot cluster nodes after a job (conceivably a paranoid security requirement), or some automation gets in a loop and continually zaps a server.
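In ipmitool terms the distinction looks like this (a sketch; note that 'cycle' only guarantees a short off interval, so it's weaker than a manual 30-second drain):

    ipmitool chassis power reset    # warm reset: politely asks the board to re-initialize
    ipmitool chassis power cycle    # off, brief pause, on: a minimal power cycle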
I've seen similar behavior when trying out a fork bomb in the terminal on both Linux and Windows. My guess is that on Windows the fork bomb made it into virtual memory, got written to disk, and wasn't cleaned out completely during boot.
It took 3 reboots to clear up the errors. Generally, on the Linux system one extra reboot was necessary about half of the time.
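(For reference, the canonical bash fork bomb is the one-liner below; don't run it on a machine you care about.)

    :(){ :|:& };:    # defines a function ':' that pipes itself into itself and backgrounds the result, then calls it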
I’ve come across Acer laptops that’d always bluescreen on restart after a PROCHOT shutdown. The fix is to pull out the battery for a few seconds and then plug it back in; that magically clears the bluescreen.
Something like this happened to me once. Lost power in a lightning storm and when it came back my computer could no longer shut off.
Like, at all. Would just hang when you tried. Couldn’t exit from BIOS after changing settings, couldn’t suspend to RAM. Had to yoink the cord whenever I needed to restart. Wild stuff.
Perhaps like Frankenstein the lightning was a breath of life, and with its new sentience my PC was trying to preserve its existence. At any rate I reflashed the BIOS after a few months and it never happened again.
I've experienced a similar problem with a Thunderbolt port on a machine. Nothing that plugs into the machine would be recognized. Not even a simple USB device. Power cycling multiple times didn't fix it. But powering off and leaving the machine off for a few minutes fixed it.
Given the problem occurred only once, I didn't do any more investigation on why.
I have had several laptops over the years like this. Full shutdown and power on does not reset some problems, like missing audio, missing wifi, etc. For Lenovo devices, I have to go as far as using the 'recovery' button. This goes for DP Alt Mode as well. Kinda annoying, but at least there's a solution.
I think it was the 1970's when I first heard of the "remove power, wait a good while, try again" strategy.
The subject was a cheap little black & white TV set that my folks had. Dad was an amateur radio operator, who mostly built his own equipment. He could have disassembled it, traced circuits, and calculated the wait time if he'd cared to.