[Discuss] Issues with new Linux computer mostly addressed
Alan W. Irwin
Alan.W.Irwin1234 at gmail.com
Sun Nov 11 16:15:08 PST 2018
On 2018-11-11 09:29-0800 Peter Willis wrote:
> What, exactly, is your hardware configuration? Include the case, PSU, and cooling.
> What you’re describing could be any number of issues external to Ryzen or bios.
>
> Top candidates:
>
> -Bad noisy power supply
> -Cheap or mismatched RAM in system
> -Failing HD on power side (recently had one SSD burst into flames by itself at work. User complained of smoke)
> -solder whiskers or cold flow on MoBo (Defective MoBo components or assembly)
> -brass standoffs holding MoBo where plastic ones should be used
> -Your household wiring has ground noise problems
> -Bad network device is acting as noisy current sink on CAT5
> -USB current sink issues from cheap USB devices
> -Bad power bar creating noise
> -Bad monitor acting as current sink
> -Overheating of CPU or other components due to mal-designed case airflow or defective coolant flow
> -LED indicator leads plugged into MoBo with incorrect polarity
> -Ribbon cabled Floppy Disk Drives on modern MoBo (in General, get one that’s USB or SATA instead)
> -Your Ryzen was previously RMA’d due to overheat and resold again by your vendor
> -Your RAM needs cooling
> -You have an old ATI video card running generic or custom kernel build driver that is mismatched (try a different brand)
> -Believing in Overclocking and actually using it
Hi Peter:
The above part of your response is more relevant
to the thread with the subject line "Issues with new Linux computer
mostly addressed" so I have changed the subject line appropriately.
For the record, my new system was assembled using the following components:
MB: ASUS Prime B350 Plus (with integrated Realtek RTL8111/8168/8411 Gigabit Ethernet)
CPU: AMD Ryzen 7 1700 (3.0GHz, 8 physical cores)
GPU: AMD RX 550 (4GB)
RAM: Kingston KVR (DDR4-2666, 4x16GB)
SSD: Samsung 960 Pro (0.5 TB)
HD: Western Digital "Gold" (2TB)
Optical: Generic ASUS DVDRW
Extra Ethernet: Intel 82574L Gigabit Ethernet
PSU: EVGA 850 Watt B3 80+ Bronze
The sensors command reports reasonable temperatures. So some of the
possibilities you have mentioned above are ruled out, but others are
not completely ruled out for this fairly high-end box.
Note that the new box behaves perfectly under rsync workloads. For
example, I transferred 1 TB of files from my old system to the new
one via two rsync runs (one from the old box to an external drive and
one from the external drive to the HDD in the new box). After each
initial rsync I ran rsync again with the --checksum option, which
effectively verifies (provided it turns out that no files need to be
changed as a result of this option) that there were no bit flips in
any of the 8 trillion (!) bits transferred. All of those checks were
perfect. During the transfer I also checked which CPUs rsync was
using, and that turned out to be all of them, and all memory was in
use as well. So rsync --checksum not only gives peace of mind about
the transfer itself, but is also a pretty good check of the
fundamental health of the power supply, motherboard, CPUs, memory,
and internal HDD under high load for however long the transfer takes
(roughly an hour in this case). Ever since I got the box I have also
periodically used rsync --checksum to check the health of all the
above components plus the SSD whenever I do my normal SSD + HDD
backups to an external drive.
Furthermore, git repositories can be checked for bit errors, and all
Debian installs include checksum checks. All is well on those fronts
too.
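For anyone who wants to repeat this kind of verification without
rsync, here is a minimal Python sketch of the same idea: hash every
file in two directory trees and report any path whose contents differ
or that exists on only one side. It is not how rsync implements
--checksum internally; the function names and the choice of SHA-256
are my own assumptions for illustration, and it does not handle
broken symlinks or special files.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digests[os.path.relpath(full, root)] = file_digest(full)
    return digests


def compare_trees(src, dst):
    """Return sorted relative paths that are missing or differ in content."""
    a, b = tree_digests(src), tree_digests(dst)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))


# 1 TB (decimal) expressed in bits, matching the "8 trillion" figure above.
BITS_TRANSFERRED = 10**12 * 8
```

An empty list from compare_trees(old_dir, new_dir) corresponds to the
"no files need to be changed" outcome described above. Like rsync
--checksum, this reads every byte on both sides, so it is slow on
large trees.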
However, such repeated rsync workloads do not check the health of the
RX 550 GPU or the two network interfaces, so hardware issues or, more
likely, driver issues with some or all of these three components
*could* be the source of the system lockups. In fact, a small subset
of the NMI (non-maskable interrupt, usually a sign of hardware or
driver trouble) messages I see in the logs specifically mention the
e1000e kernel module that drives the Intel 82574L card. But the rest
of the lockups either generate no NMI messages at all, generate ASCII
null error messages, or generate NMI messages which do not mention
any hardware component other than the MB. So these appeared to be
fairly random lockups which could not be ascribed to anything
specific, *AND* virtually all of them occur when the system is idle.
All this was quite frustrating until I finally got the help I
mentioned in my prior post, which indicated that random lockups are
the norm for idle Linux Ryzen systems (which I verified with a
subsequent Google search) because of yet another hardware fault by
AMD (beyond the well-known Spectre issues). There are only two ways
to work around these "idle" hardware issues: (1) recompiling a custom
kernel and using a special kernel run-time option that is enabled by
that recompilation, or (2) updating the BIOS and using a special
non-default BIOS option that is implemented by that update.
Currently I am working on (1), and I plan to report back on this
thread whether or not that works. But if I need to go to (2), then I
will discuss that further on the other thread, "Risks of BIOS
(actually UEFI) updates".
Alan
__________________________
Alan W. Irwin
Programming affiliations with the FreeEOS equation-of-state
implementation for stellar interiors (freeeos.sf.net); the Time
Ephemerides project (timeephem.sf.net); PLplot scientific plotting
software package (plplot.sf.net); the libLASi project
(unifont.org/lasi); the Loads of Linux Links project (loll.sf.net);
and the Linux Brochure Project (lbproject.sf.net).
__________________________
Linux-powered Science
__________________________