523. Random Reboots -- troubleshooting. Diagnosed: incompatible motherboard.

Update 8 Jan 2014: I've been putting the FX8350 through its paces together with the other mobo and it's completely stable. The FX8150 box is also stable. Note that I thought I had a crash a few days after making the swap, but I have not had any issues whatsoever since, in spite of running very heavy jobs. Either way, it should remind me to check whether a mobo is compatible with a CPU before making my purchase in the future.

Update 18 Nov 2013: I swapped the CPUs between two boxes, so that I was now using a mobo that officially supported FX8350. Only the CPU moved, nothing else.

Update 5 Nov 2013: Note that the motherboard doesn't support the CPU and this leads to spontaneous reboots under certain conditions. Make sure to look at the list of supported CPUs for the motherboard you use (in retrospect, obvious -- but as a linux person you get used to ignoring those things since everything's for just OSX or Win).

See here for the troubleshooting thread:
 http://verahill.blogspot.com.au/2013/10/523-random-reboots-troubleshooting-in.html

Also see this thread: http://www.techpowerup.com/forums/showthread.php?t=184061
I'll need to read up on...stuff...but the bottom line seems to be that one would expect issues with this board/cpu combo:


Still only a 4+1 phase board the FX chips pull a bit more power than that can put out comfortably and stable. [..] Those would be your three best to choose from all are the better 8+2 phase designs...

and

my opinion is to stay away from the asus FX ive seen many people asking why their boards are throttling at full load, vrm protection causes voltages to drop at full load when vrms hit a certain temp.


and it seemed that low (CPU) voltages precipitated crashes.

Update 4 Nov 2013: swapped CPUs with a different box. Will test in a couple of days.
Update 4 Nov 2013: Changing the multiplier back to 20 from 17 (but keeping voltage stable) caused a crash -- this time in a record 13 minutes.
Update 4 Nov 2013: System stable with new voltage/multiplier settings.
Update 27 Oct 2013: I'm currently looking at BIOS voltage.
---

I recently built a new node (http://verahill.blogspot.com.au/2013/10/520-new-node-amd-fx-835032-gb-ram990-fx.html). While that's always exciting, it quickly left a sour taste due to random reboots when running long (days) computational jobs.

Note that the motherboard (asrock 990 fx extreme3) does not officially support FX8350, which is something that I shouldn't have ignored. I might eventually move my fx 8350 to my gigabyte 990 fxa and put my 8150 on my asrock instead.

Short description
* Both Gaussian 09 and NWChem 6.3 cause the reboots.
* I've set up a cron job that logs a lot of data every minute and there's nothing odd in there. No overheating, the wattage seems ok etc.
* Running only smaller jobs (even though they are running non-stop) which take less than a day, the node has stayed up for 11 days now.
* I have never seen it reboot, so I don't know if there's any beeping etc.
* There's nothing in the logs, and nothing in the output from tailing dmesg using a cronjob.
* The only real output is in last:

reboot system boot 3.11.5 Fri Oct 18 14:08 - 11:57 (2+21:48)
reboot system boot 3.8.10 Fri Oct 18 13:23 - 14:07 (00:44)
reboot system boot 3.8.10 Tue Oct 8 10:46 - 13:18 (10+02:31)
me tty1 Mon Oct 7 13:25 - crash (21:21)
me pts/0 beryllium Mon Oct 7 12:29 - crash (22:17)
reboot system boot 3.8.10 Mon Oct 7 12:27 - 13:18 (11+00:51)
me pts/0 beryllium Sat Oct 5 20:59 - crash (1+14:27)
reboot system boot 3.8.10 Sat Oct 5 20:58 - 13:18 (12+15:19)
reboot system boot 3.8.10 Tue Oct 1 14:09 - 11:54 (19+20:45)
me pts/0 beryllium Sun Sep 29 11:39 - crash (2+02:29)
reboot system boot 3.8.10 Sun Sep 29 11:39 - 11:54 (21+23:14)
me pts/0 beryllium Mon Sep 23 11:09 - crash (6+00:30)
reboot system boot 3.8.10 Mon Sep 23 11:07 - 11:54 (27+23:46)
me pts/0 beryllium Fri Sep 20 12:59 - crash (2+22:08)
reboot system boot 3.8.0 Fri Sep 20 12:50 - 11:54 (30+22:04)
reboot system boot 3.8.0 Fri Sep 20 12:49 - 12:49 (00:00)
reboot system boot 3.2.0-4-amd64 Fri Sep 20 11:52 - 12:48 (00:56)
reboot system boot 3.2.0-4-amd64 Fri Sep 20 06:29 - 08:08 (01:38)
me pts/0 beryllium Wed Sep 18 14:51 - crash (1+15:38)
reboot system boot 3.2.0-4-amd64 Wed Sep 18 14:40 - 08:08 (1+17:27)
me pts/8 beryllium Wed Sep 18 09:02 - crash (05:38)
reboot system boot 3.2.0-4-amd64 Wed Sep 18 01:51 - 08:08 (2+06:17)
me pts/0 beryllium Tue Sep 17 18:11 - crash (07:40)
reboot system boot 3.2.0-4-amd64 Tue Sep 17 18:08 - 08:08 (2+14:00)
reboot system boot 3.2.0-4-amd64 Tue Sep 17 17:55 - 17:56 (00:01)
me pts/0 beryllium Tue Sep 17 13:12 - crash (04:43)
reboot system boot 3.2.0-4-amd64 Tue Sep 17 12:23 - 17:56 (05:33)
reboot system boot 3.2.0-4-amd64 Mon Sep 16 20:05 - 12:17 (16:12)
me pts/0 beryllium Mon Sep 16 16:03 - crash (04:02)
reboot system boot 3.2.0-4-amd64 Mon Sep 16 15:31 - 12:17 (20:46)
reboot system boot 3.2.0-4-amd64 Mon Sep 16 15:20 - 15:30 (00:09)

Looking at the output it does seem that the crashes are happening less frequently. Part of the reason for this is probably a change in how I use the node, but I don't think that explains everything, and I don't like the idea of a piece of electronic hardware 'fixing' itself.
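To put a number on 'less frequently', the sessions that last marks as ending in 'crash' can be pulled out together with how long they lasted. What follows is only a minimal Python sketch and was not part of the original troubleshooting: the function names (duration_minutes, summarize_last) are made up for illustration, and the parsing assumes the exact line layout shown in the listing above.

```python
import re
from collections import Counter

# Matches the trailing "(HH:MM)" or "(D+HH:MM)" duration that `last` prints.
DURATION = re.compile(r"\((?:(\d+)\+)?(\d{2}):(\d{2})\)\s*$")


def duration_minutes(line):
    """Return the duration at the end of a `last` line in minutes, or None."""
    match = DURATION.search(line)
    if match is None:
        return None
    days, hours, minutes = match.groups()
    return (int(days or 0) * 24 + int(hours)) * 60 + int(minutes)


def summarize_last(text):
    """Count boots per kernel and collect the length of sessions ending in 'crash'."""
    boots = Counter()
    crashed = []
    for line in text.splitlines():
        fields = line.split()
        if fields[:3] == ["reboot", "system", "boot"] and len(fields) >= 4:
            boots[fields[3]] += 1  # fourth field is the kernel version
        elif "- crash" in line:
            minutes = duration_minutes(line)
            if minutes is not None:
                crashed.append(minutes)
    return boots, crashed


# A few lines copied from the listing above.
SAMPLE = """\
reboot system boot 3.8.10 Tue Oct 8 10:46 - 13:18 (10+02:31)
me tty1 Mon Oct 7 13:25 - crash (21:21)
me pts/0 beryllium Mon Oct 7 12:29 - crash (22:17)
reboot system boot 3.8.10 Mon Oct 7 12:27 - 13:18 (11+00:51)
me pts/0 beryllium Sat Oct 5 20:59 - crash (1+14:27)
"""

boots, crashed = summarize_last(SAMPLE)
print("boots per kernel:", dict(boots))
print("hours logged in before each crash:", [round(m / 60, 1) for m in crashed])
```

Feeding it the full output of last instead of SAMPLE gives one duration per crashed session. Note that several logins can belong to the same crash, so the list is an upper bound on the number of crashes rather than an exact count.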

Another thing that puzzles me is the repeating numbers -- e.g. 08:08, 11:54 and 13:18 -- in the output. There's no cronjob or anything like that running at any of those times.
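One way to look at the repeating numbers is to group the 'reboot' records by their end time. A likely explanation (an assumption based on how last reads wtmp, not something that was checked on this node) is that last closes a boot record at the next clean shutdown it finds, so every boot that ended in a crash inherits the end time of the first clean shutdown after it. The Python sketch below only does the grouping; the name group_by_end_time is hypothetical and the regular expression assumes the line layout shown in the listing above.

```python
import re
from collections import defaultdict

# reboot system boot <kernel> <Day Mon D HH:MM> - <HH:MM> (<duration>)
BOOT = re.compile(
    r"^reboot\s+system boot\s+(\S+)\s+(\w{3} \w{3}\s+\d+ \d{2}:\d{2}) - (\d{2}:\d{2})"
)


def group_by_end_time(text):
    """Map each end time (HH:MM) to the start times of the boots that share it."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = BOOT.match(line.strip())
        if match:
            _kernel, start, end = match.groups()
            groups[end].append(start)
    return dict(groups)


# A few 'reboot' lines copied from the listing above.
SAMPLE = """\
reboot system boot 3.11.5 Fri Oct 18 14:08 - 11:57 (2+21:48)
reboot system boot 3.8.10 Tue Oct 8 10:46 - 13:18 (10+02:31)
reboot system boot 3.8.10 Mon Oct 7 12:27 - 13:18 (11+00:51)
reboot system boot 3.8.10 Sat Oct 5 20:58 - 13:18 (12+15:19)
"""

for end, starts in group_by_end_time(SAMPLE).items():
    print(end, "<-", starts)
```

If the assumption holds, each group corresponds to a run of crashes followed by one orderly shutdown, which would fit the 13:18 group ending shortly before the kernel change on Oct 18.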

Other things that have changed are the kernel versions and that I removed the UPS around the 1st of October  (the UPS died, which is a bad sign, power-wise. I should probably also look into the warranty on it).

The chief challenge here is that I can't reliably trigger the reboots, which makes it difficult to see whether I've solved the issue or not.

On an older node I could trigger errors by compiling the kernel, but not using any other technique. On that node the RAM was faulty: http://verahill.blogspot.com.au/2013/04/401-amd-fx-8150issues-building-kernel.html

==> <== indicates what I'm currently doing.


0.  RAM
The most common reason for unstable nodes is faulty RAM, so if your computer is behaving strangely and randomly crashes, always suspect the RAM first. It's a more likely culprit than software, and the most likely of the hardware components to be at fault.

I ran a full cycle of memtest86+ which took some 4-5 hours if I remember correctly. No errors shown. Note that if memtest86+ does not show any errors it is no guarantee that the RAM is fine. However, the likelihood that it is indeed corrupt goes down.

1. Overheating
The second thing to investigate when something like this happens, in particular if it's associated with prolonged and heavy use, is the possibility of overheating. You can install lm-sensors and configure it to track various temperatures. Note that these readings aren't always correct.

At any rate, I've logged the output from sensors every minute and there's nothing indicating that the temperature is rising prior to a crash.
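For reference, per-minute logging of this kind can be sketched in a few lines of Python. This is not the script that was actually used here; the function names and the idea of calling it from cron once a minute are assumptions for illustration. The only external command it relies on is sensors from lm-sensors. Each line is timestamped and the file is synced to disk after every write, so that the last record has a better chance of surviving a sudden reboot.

```python
import os
import subprocess
from datetime import datetime


def read_sensors(command=("sensors",)):
    """Run the lm-sensors CLI and return whatever it printed (may be empty)."""
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout


def format_entry(when, text):
    """Prefix every non-empty line with a timestamp so the log can be grepped later."""
    stamp = when.strftime("%Y-%m-%d %H:%M:%S")
    return "".join(f"{stamp} {line}\n" for line in text.splitlines() if line.strip())


def append_entry(path, when, text):
    """Append one timestamped record and force it to disk before returning."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(format_entry(when, text))
        handle.flush()
        os.fsync(handle.fileno())  # do not leave the last minute in the page cache
```

A script run from cron every minute would then call something like append_entry(logfile, datetime.now(), read_sensors()), where logfile is whatever (hypothetical) path the log should live at. After a crash, the tail of that file shows the last readings before the node went down.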

--------------------------------------------------
Intermission -- trying to trigger a reboot

* It's stable while compiling a kernel (in my case 3.11.5). Not surprising as it is intense, but short.

* Prime95

Number of torture test threads to run (8):
Choose a type of torture test to run.
1 = Small FFTs (maximum FPU stress, data fits in L2 cache, RAM not tested much).
2 = In-place large FFTs (maximum heat and power consumption, some RAM tested).
3 = Blend (tests some of everything, lots of RAM tested).
11,12,13 = Allows you to fine tune the above three selections.
Blend is the default. NOTE: if you fail the blend test, but can pass the small FFT test then your problem is likely bad memory or a bad memory controller.
Type of torture test to run (3): 2

Accept the answers above? (Y): y

I ran this for three days and the node was stable.
I then ran test type 3 for 30 hours and it too was stable.

I accidentally ran the tests above without mounting ~/oxygen to the head node using NFS. Shouldn't matter, but in order to troubleshoot it's better to keep everything as constant as possible.

* PES scan
I think I saw reboots triggered using all sorts of jobs, but due to their long run times, I saw it more consistently with PES scans.

So I ran a long PES scan in nwchem 6.3, and lo, it crashed after just under two days running this job (and having been up for 6 days and 30 minutes). It's not quite the quick, efficient way of crashing the computer that I was looking for, but it will do.

Note that this crash didn't lead to a reboot, but simply to the computer locking up and becoming unresponsive. No screen, no network, no harddrive activity.

The only errors I can spot in the dmesg are two warnings about 'perf samples too long (2545>2500)' at 30 minutes and at 14 hours of uptime, i.e. well before I started the PES job.


me pts/1 beryllium Fri Oct 25 08:59 still logged in
me pts/0 beryllium Fri Oct 25 08:58 still logged in
me pts/0 beryllium Fri Oct 25 08:57 - 08:58 (00:01)
me tty1 Fri Oct 25 08:52 still logged in
reboot system boot 3.11.5 Fri Oct 25 08:52 - 08:59 (00:07)
me pts/4 beryllium Sun Oct 20 17:32 - 17:32 (00:00)



--------------------------------------------------

3. BIOS version
The BIOS version (1.5) at the time of purchase of the motherboard was the same version as the BIOS available at the motherboard manufacturer's site. Since then an update has been released, as pointed out by a commentator. Nothing in the description of the update indicates that it would fix the issue I'm having, but upgrading the bios is just one of those things that should be tried.

mkdir ~/tmp/bios -p
cd ~/tmp/bios
wget 'ftp://download.asrock.com/bios/AM3+/990FX%20Extreme3(1.70)ROM.zip'
unzip 990FX\ Extreme3\(1.70\)ROM.zip


Archive: 990FX Extreme3(1.70)ROM.zip
inflating: 990EX31.70


I copied the file to a USB stick formatted with FAT32 since my guess is that the UEFI might not recognise extX. Booting with the USB stick plugged in and hitting F6 ('Instant flash') led to the UEFI finding the flash file. Now click on the file name -- don't click on the buttons (e.g. 'Configuration' and 'Refresh device'). During the BIOS update the usual advice applies: don't power off during the update, and make sure that your USB stick isn't old and damaged.

Once the update is done you get a message saying 'Programming success, press Enter to reboot system'.

I reran the PES scan, and I had a crash after less than two days (ca 40 hours). This crash caused a reboot.
me tty1 Sun Oct 27 12:00 still logged in
me pts/1 beryllium Sun Oct 27 11:55 still logged in
me pts/0 beryllium Sun Oct 27 11:54 still logged in
reboot system boot 3.11.5 Sun Oct 27 11:52 - 12:14 (00:22)
me tty1 Fri Oct 25 09:36 - crash (2+02:16)
me pts/0 beryllium Fri Oct 25 09:17 - crash (2+02:35)
me pts/2 beryllium Fri Oct 25 09:17 - crash (2+02:35)
reboot system boot 3.11.5 Fri Oct 25 09:17 - 12:14 (2+02:57)


4. BIOS settings -- voltage
The older I get, the more comfortable I become with admitting when I don't really know what I'm doing. This  -- the tweaking of voltage settings -- is one of those areas where I definitely lack expertise.

Luckily, I've got some advice from a commentator: http://verahill.blogspot.com.au/2013/09/517-very-briefly-prime95-on-linux.html?showComment=1381459311645#c1080803985593788821

Anyway, I've always treated electricity a bit like magic (I always had issues with electrochemistry as a youngster, which is why I'm forcing myself to teach it these days), and the older I get the more I wish I had done chemical engineering rather than chemistry. Perhaps we want to benefit society more with age, rather than just benefit from it?

Anyway, here are some literal screenshots -- taken with my trusty old phone:

Main
 I don't see anything odd here.

First half of OC Tweaker
The alternative to 'Manual' in the OC Mode is 'CPU OC Mode', which sounds like something I want to avoid. Anyway, what bothers me is that there's no 'OC OFF' button. I don't know if the BIOS is doing something odd.

More OC Tweaker

HW Monitor. The Vcore and +12V lines fluctuate by about 5-10 mV.
Changes: 
* turn off Cool'n'Quiet.
* Change Multiplier/Voltage Change from Automatic to Manual
** Set CPU Freq multiplier to 17.0x (3400 MHz) instead of 20x (4000 MHz) under OC Tweaker
** Set CPU voltage to 1.35 V instead of the stock 1.3750 V (OC Tweaker/CPU Voltage)
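To make the size of these changes explicit: the core clock is the multiplier times the reference clock, and the figures above (17.0x giving 3400 MHz, 20x giving 4.0 GHz) imply a 200 MHz reference clock. The Python sketch below just does that arithmetic; the names are made up for illustration and the 200 MHz value is inferred from the numbers above rather than read off the board.

```python
# Reference clock implied by the settings above: 3400 MHz / 17.0 = 200 MHz.
REFERENCE_CLOCK_MHZ = 200


def core_frequency_mhz(multiplier, reference_clock_mhz=REFERENCE_CLOCK_MHZ):
    """Core clock in MHz for a given CPU multiplier."""
    return multiplier * reference_clock_mhz


def relative_change(new, old):
    """Fractional change going from old to new (negative means a reduction)."""
    return (new - old) / old


stock = core_frequency_mhz(20)      # stock multiplier
reduced = core_frequency_mhz(17.0)  # multiplier used for the stable runs

print(f"clock:   {stock:.0f} MHz -> {reduced:.0f} MHz "
      f"({relative_change(reduced, stock):+.1%})")
print(f"voltage: 1.3750 V -> 1.35 V ({relative_change(1.35, 1.375):+.1%})")
```

In other words, the clock was cut by a much larger fraction than the voltage, so these settings leave more voltage headroom per MHz than stock does.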

I've managed to run a full PES scan -- which hasn't worked before -- and re-ran it without issue. Looks like the issue was solved.

I then set the multiplier back to 20 (4000 MHz) while keeping the CPU voltage at 1.35 V, and relaunched the PES scan. Almost immediate crash:
me pts/0 beryllium Mon Nov 4 13:48 still logged in
reboot system boot 3.11.5 Mon Nov 4 13:48 - 14:30 (00:42)
me pts/0 beryllium Mon Nov 4 13:35 - crash (00:13)
me tty1 Mon Nov 4 13:34 - 13:34 (00:00)
reboot system boot 3.11.5 Mon Nov 4 13:34 - 14:30 (00:56)


Not sure what to do now. Did it crash faster because the CPU voltage is low while the multiplier is high? Can it be solved by increasing -- rather than decreasing -- the voltage?

Re-running the PES job again gave a more interesting result -- the job crashed, but not the node (note that the job is exactly the same each time, so it's not a matter of the input):

 Grid integrated density:     191.999968853836
Requested integration accuracy: 0.10E-06
d= 0,ls=0.0,diis 13 -1092.4755719318 -1.63D-05 8.49D-06 2.73D-05 719.5
Grid integrated density: 191.999968846937
Requested integration accuracy: 0.10E-06
Singularity in Pulay matrix. Error and Fock matrices removed.
PeIGS error from dstebz 4 ...trying dsterf
error from dsterf 516
error from dsterf 516
Error in pstein5. me = 0 argument 10 has an illegal value.
Error in pstein5. me = 1 argument 10 has an illegal value.
Error in pstein5. me = 2 argument 10 has an illegal value.
ME = 2 Exiting via
Error in pstein5. me = 4 argument 10 has an illegal value.
ME = 4 Exiting via
Error in pstein5. me = 5 argument 10 has an illegal value.
ME = 5 Exiting via
5:5: peigs error: mxpend:: -1
Error in pstein5. me = 6 argument 10 has an illegal value.
ME = 6 Exiting via
Error in pstein5. me = 7 argument 10 has an illegal value.
ME = 7 Exiting via
7:7: peigs error: mxpend:: -1
(rank:7 hostname:oxygen pid:3741):ARMCI DASSERT fail. ../../ga-5-2/armci/src/common/armci.c:ARMCI_Error():208 cond:0

ME = 0

And, while the frequency affects the thermal output, a thermal issue shouldn't lead to garbled stuff. This is looking more and more like what the poster wazoo42 recounted: http://verahill.blogspot.com.au/2013/09/517-very-briefly-prime95-on-linux.html?showComment=1381459311645#c1080803985593788821

5. Swapping the CPU to an approved MOBO.
This is a bit of a cop-out. Most people don't have multiple computers that happen to have compatible hardware. On the other hand, I have a job to do.

To be able to continue to follow this post you'll need to know this:
There are two nodes (cpu, mobo, psu):
* oxygen: FX 8350, asrock 990fx extreme3, corsair GS700.
* neon: FX 8150, gigabyte GA-990FXA-D3, corsair GS800.
Both nodes have otherwise similar hardware: 32 GB RAM, GT210 nvidia, one PCI network card. Both motherboards support 8150, but only GA-990FXA-D3 supports 8350.

So we'll move FX8350 to neon, and FX 8150 to oxygen.

I first set the Multiplier/Voltage Change back to Automatic on oxygen in preparation for the FX 8150.
I then shut down the two nodes and unplugged them. And here's where it's not funny anymore: I tried to gently remove the heatsink but the heatsink together with the CPU popped off in spite of the lever being locked. On both nodes.

The CPU was solidly glued to the heatsink in both cases. I managed to get the FX 8350 off its heatsink by gently scraping off the excess thermal paste (dry and solid), but the FX 8150 was a real struggle. In the end I used the back of a knife as a lever (gently). Not ideal.

Anyway, cleaned the fx 8350 heatsink and cpu, applied new thermal paste and installed on the gigabyte GA-990FXA-D3 motherboard. Turned on -- no lights on the mobo. Fans etc all working. Dammit. Googled and saw that bios only supports 8350 from version 9 (incorrect -- looked at wrong mobo, but didn't discover that until later).

So now I have a CPU that'd work, but which I can't install since it's stuck to the heatsink, and one which I can install but not use, since the bios is wrong.

When I put the old CPU (fx8150) back in neon it wouldn't boot either -- the fans were spinning but no motherboard lights went on (e.g. LAN). PCI cards lit up, but nothing on Mobo. Took out the CPU, put back in, took out, put back in.

Put the FX8350 back in the oxygen case. Didn't work either, although here the LAN mobo light was on. Still didn't work though -- no video output and couldn't connect via LAN. Great. Killed two working nodes in one afternoon.

Finally, somehow, I managed to get neon working again. Popped a USB stick in. Set up a USB stick with a 1 GB W95 partition as shown in this post: http://verahill.blogspot.com.au/2013/04/401-amd-fx-8150issues-building-kernel.html

Downloaded bios etc. Couldn't install -- BIOS check error. Googled again -- dammit. BIOS for wrong motherboard. And the bios that's installed actually supports 8350.

OK, installed all the CPUs, and now they booted up. I must have installed the CPUs badly -- which doesn't speak well of my attention to detail.

For some reason the card that used to be eth0 now gets assigned as eth2 on oxygen. Checked udev -- doesn't make sense. Turned everything off and checked that the pci card (eth0) was seated properly. Booted -- now ok.

Not sure if I have to recompile all the computational code but I did anyway -- the only difference, according to the acml cpuid.exe util, is that 8350 supports FMA3 while 8150 doesn't. Both support SSE, SSE2, SSE3, AVX, FMA4.
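The same comparison can be made without the acml utility by reading the flags line of /proc/cpuinfo on each node. The Python sketch below is an illustration only; the function names are made up. It relies on the flag names the Linux kernel uses, where FMA3 shows up as 'fma', FMA4 as 'fma4' and SSE3 as 'pni'.

```python
# Feature name as used in the text -> flag name in /proc/cpuinfo.
FEATURES = {
    "SSE": "sse",
    "SSE2": "sse2",
    "SSE3": "pni",   # the kernel reports SSE3 as 'pni'
    "AVX": "avx",
    "FMA3": "fma",   # 'fma' is the three-operand FMA3
    "FMA4": "fma4",
}


def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def feature_report(cpuinfo_text):
    """Map each feature of interest to True/False for this CPU."""
    flags = cpu_flags(cpuinfo_text)
    return {name: flag in flags for name, flag in FEATURES.items()}


# Shortened, made-up flags line for illustration (not copied from either node).
SAMPLE = "processor\t: 0\nflags\t\t: fpu sse sse2 pni avx fma4\n"
print(feature_report(SAMPLE))
```

On a node, pass the contents of /proc/cpuinfo instead of SAMPLE and compare the two reports. Code built with FMA3 instructions for the 8350 would not run on the 8150, which is one reason to rebuild after swapping CPUs.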





[edit]
After ca two months both boxes are stable in spite of being subjected to heavy work loads. The original crashes/reboots must have been due to the incompatible mobo/CPU combination.