Another Ryzen-based server

See below for our first Ryzen-based servers.

In the meantime, AMD has released Ryzen Pro CPUs, supposedly with official ECC support, but there are only a few offerings (and no Ryzen Pro 3xxx ones) at the dealers we buy from. So we again went for the unofficial ECC support of the Ryzen 3900X.

Components

  1 x ASUS TUF B450M-Plus Gaming (90MB0YQ0-M0EAY0)
  1 x AMD Ryzen 9 3900X, 12x 3.80GHz, boxed (100-100000023BOX)
  2 x Kingston Server Premier DIMM 16GB, DDR4-2666, CL19-19-19, ECC (KSM26ED8/16ME)
  1 x Samsung SSD PM883 480GB, SATA (MZ7LH480HAHQ-00005)
  1 x Micron 5200 ECO 480GB, SATA (MTFDDAK480TDC-1AT1ZABYY)
  1 x Intel Gigabit CT Desktop, RJ-45, PCIe 1.1 x1, bulk (EXPI9301CTBLK)
  1 x Sapphire Radeon HD 6450, 1GB DDR3, VGA, DVI, HDMI, lite retail (11190-02-20G)
  1 x LC-Power 2014MB schwarz (LC-2014MB-ON)
  1 x Xilence Performance A+ Serie 530W ATX 2.4 (XP530R8/XN061)

After several months we upgraded the RAM by replacing both 16GB DIMMs shown above with

  4 x Samsung DIMM 32GB, DDR4-2666, CL19-19-19, ECC (M391A4G43MB1-CTD)

for a total of 128GB RAM. You can get links and more information for the 128GB configuration here.

We chose the ASUS TUF B450M-Plus Gaming mainboard because of its ECC support (and to buy something different after an Asrock board died in the last iteration). Unfortunately, this board exceeds the normal power limits of the CPU: the CPU has a 105W TDP, and that should limit power to ~140W as long as the cooling is sufficient. Yet we saw a 190W difference in mains power between idle and full (CPU-only) load. Given the results below, apparently only the CPU temperature limits the power consumption, and if we had used a more powerful cooler (we used the boxed cooler), we might have seen even higher power consumption; now combine that with a 3950X... This makes me wonder how long the board's voltage regulators can survive this load.

The board's BIOS does not offer a way to limit power directly. You can limit the CPU temperature; limiting it to 70°C resulted in the same 190W difference at the start, but in our build it fell to a 135W difference after several minutes, with 5% lower performance. Limiting the temperature to 45°C cut the power consumption drastically (~65W difference), but also roughly halved the performance.

The manual of the board is not satisfactory: it lacks proper descriptions of the connectors.

The CPU comes with a cooler and fan with a nice (though completely unnecessary for our purposes) light show, and we used this cooler.

We chose two SSDs with power-loss protection from different manufacturers (to reduce common-mode failures).

The case is very light and probably not particularly sturdy, but good enough for our purposes. It has nice rubber feet.

Upgrading the RAM was interesting. We first put in all 4 DIMMs. 1 long, 2 short beeps (RAM problem). Then we took out two DIMMs, and 32GB were recognized. Reduced to 1 DIMM, RAM problem. 2 DIMMs in different slots, 64GB; installed the third DIMM, 96GB; installed the fourth DIMM, 128GB. My guess is that the DIMMs were not sitting correctly at first (the feeling while putting the DIMMs in did not give any indication, though). Note that the board first checks the RAM for a while before giving any sound or video.

ECC operation

This time around we did not go to the lengths of last time, but just checked that the Linux kernel reports ECC in dmesg output:

  [    8.658747] EDAC amd64: Node 0: DRAM ECC enabled.

The Debian 10 (buster) kernel (4.19) does not report that, but the 5.4 kernel from buster-backports does.
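
Besides the dmesg line, the kernel's EDAC driver also exposes error counters per memory controller in sysfs. The following is only an illustrative sketch and not part of the check described above; it assumes the usual EDAC sysfs layout (mc*/ce_count and mc*/ue_count below /sys/devices/system/edac/mc), and the function name edac_counts is made up for this example.

```shell
# Sum the corrected (ce_count) and uncorrected (ue_count) error counters
# over all memory controllers known to EDAC.
# Optional argument: the EDAC directory (defaults to the usual sysfs path).
edac_counts() {
  edac_dir=${1:-/sys/devices/system/edac/mc}
  ce=0
  ue=0
  for mc in "$edac_dir"/mc[0-9]*; do
    # Skip if the glob matched nothing or the counters are missing.
    if [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ]; then
      ce=$((ce + $(cat "$mc/ce_count")))
      ue=$((ue + $(cat "$mc/ue_count")))
    fi
  done
  echo "corrected=$ce uncorrected=$ue"
}
```

Calling edac_counts without an argument reads the live counters. If no memory controller directory exists (for example because the EDAC driver is not loaded), it prints zeros, so zeros alone do not prove that ECC is active.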

Power usage

It consumes 40W when idle (with 32GB RAM), and up to 230W when the CPU is loaded by multiplying 2000x2000 matrices in 12 threads with libopenblas (using 24 threads increased neither the power nor the performance); this runs the cores at ~4000MHz. With the CPU temperature limit at 70°C, it starts out the same, but after heating up settles at 175W power consumption at 3800MHz.
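
These absolute numbers match the differences quoted in the mainboard discussion above. As a small worked check (the helper name power_delta is made up for this example, the wattages are the ones from the text):

```shell
# Difference between load and idle power at the wall, in watts.
# Usage: power_delta IDLE_W LOAD_W
power_delta() { echo $(( $2 - $1 )); }

idle=40   # idle power with 32GB RAM, from the text
echo "no temperature limit:      $(power_delta "$idle" 230)W above idle"
echo "limit 70, after heating up: $(power_delta "$idle" 175)W above idle"
```

This gives 190W and 135W above idle, respectively.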

Performance

To be measured.

Experience

We ran the system for about 5 months, 4 of them in production, before upgrading it from 32GB to 128GB, and the system was rock stable during this time. We used Debian 10 and Docker containers with GitLab, Jitsi Meet, and other software.

A Ryzen-based server

AMD has introduced EPYC for servers that need a lot of cores and/or a lot of RAM or RAM bandwidth, but currently (July 2017) nothing that officially competes with the Xeon E3 (i.e., something similar to the desktop CPUs, but with ECC enabled). However, Ryzen CPUs contain ECC logic that is not disabled, although it is not officially supported (and if you are unlucky, you might get one where it does not work), so if you don't need official certification etc., you can build a server based on Ryzen. That's what we did, and here I report on that.

Note that the currently available motherboards are not designed for servers, so they may miss features you may be interested in. In our case what we miss is slightly better ECC support and on-board graphics; we decided to live with the limited ECC support, and use a discrete graphics card.

Components

The components we use for our server are (we built two similar ones, with the components of the smaller one in parentheses, if they differ):

CPU: Ryzen 7 1800X (Ryzen 5 1600X)
Motherboard: Asrock A320M Pro4
RAM: 4 Kingston ValueRAM Server Premier DIMM 16GB, DDR4-2400 (2 of these DIMMs)
Cooler: Thermalright AXP-200R ROG
Case and PSU: LC-Power 2002MB, 300W ATX 2.2
Graphics: Sapphire Radeon R5 230
Ethernet (2nd port): Intel Gigabit CT Desktop Adapter
Mass Storage: Intel SSD DC S3520 480GB, Seagate Nytro XF1230 480GB
              (Western Digital WD Purple 2TB, Seagate SkyHawk 2TB)

Some comments on the components:

Motherboard: You may wonder about the low-end A320-based board, but it has all that we need (apart from a second Ethernet port) and is therefore sufficient for our needs. If you want to overclock, you need a B350-based board, though; but who wants to overclock a server? We chose an Asrock board, because Asrock and ASUS are reported to have the best ECC support for AM4 boards.

Cooler: We decided on a relatively small case, which eliminated powerful tower coolers from the selection. This cooler fits in this configuration only with the ends of the heat pipes oriented towards the back of the case (I/O Panel). The part that holds the cooler to the CPU is designed to lock with the stabilizing wires on the cooler, but that does not fit (the heat pipes collide with the mounting frame), so we went without this locking (should not hurt given that we don't move our servers a lot). Also, at first it looked as if the supplied back plate collides with some stuff on the board, but once we put the plastic washers in the right place between the board and the back plate, this proved not to be a problem.

Mass Storage: For SSDs we decided on SATA models with power loss protection, and for both SSDs and hard disks, we chose two models per server from two different manufacturers (to be used in a RAID1; we have experienced that drives from the same manufacturer failed at the same time). While PCIe M.2 SSDs are the rage, and the board has space for two M.2 SSDs (but apparently only one of them PCIe), we chose SATA so that we can also access them on our legacy machines if necessary.

The components cost a little shy of EUR3000 (including 20% VAT) in July 2017, with the components for the big box being about EUR1900, and the components for the smaller box a little over EUR1000.

ECC testing

Given that neither AMD nor Asrock gives any guarantee regarding ECC functionality, we wanted to check ourselves whether it works, and followed the example set by Hardware Canucks in testing it.

For the smaller machine (2 DIMMs), we did all that they did in Linux, and got pretty much the same results, with a few differences: We changed the timings to DDR4-2400 13-13-13-13-21 in order to see correctable errors, and then it soon crashed.

For the bigger machine (4 DIMMs), we saw the EDAC entries reporting ECC, but I had a hard time finding timings that would run yet produce errors reported by EDAC. Eventually I found that changing the first two parameters can easily cross the border into Crashland (in one case we needed to take out the CMOS battery to get to sane BIOS settings again), while varying the third and/or fourth parameters (Trcdwr, Trp) resulted in a setting that was stable enough to run, yet also produced (correctable) ECC errors; the setting I used was 14-14-11-10-21. I first tested with "stress -m50" and (on a RAM-disk) with "stress -m 50 -d 50 --hdd-bytes 100M"; this produced reports of correctable ECC errors. In order to test whether the correction actually works correctly, we then ran "memtester 60G" (as root); this produced correctable error reports at a slower rate than stress (often with 5 minutes between reports), but in >1h of memtesting (with over 10 errors corrected), no error was reported by memtester, so it looks like the correction is working.
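
The logic of this test (errors get corrected during the run, yet the memory tester sees none) can be sketched as a wrapper that compares the EDAC corrected-error counters before and after a command. This is an illustration, not the procedure we actually used (we read the EDAC reports from the kernel log); it assumes the usual EDAC sysfs layout with mc*/ce_count files, and the names ce_total and with_ce_delta are made up for this example.

```shell
# Sum the corrected-error counters of all memory controllers.
ce_total() {
  total=0
  for f in "$1"/mc[0-9]*/ce_count; do
    if [ -r "$f" ]; then
      total=$((total + $(cat "$f")))
    fi
  done
  echo "$total"
}

# Run a command and report its exit status together with the number of
# corrected errors that EDAC counted while it was running.
# Usage: with_ce_delta EDAC_DIR COMMAND [ARGS...]
with_ce_delta() {
  edac_dir=$1
  shift
  before=$(ce_total "$edac_dir")
  status=0
  "$@" || status=$?
  after=$(ce_total "$edac_dir")
  echo "exit=$status corrected_during_run=$((after - before))"
}

# Example (as root), one memtester loop over 60G:
#   with_ce_delta /sys/devices/system/edac/mc memtester 60G 1
```

The outcome that corresponds to our result is an exit status of 0 together with a corrected count greater than 0: ECC corrected errors, and the tester never saw bad data.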

Power usage

The 1600X box (including PSU, i.e., we measured the power coming through the leads) consumes 43W idle and up to 155W with an integer load: 1 instance of memtester (where the different phases seem to have a measurably different power consumption) and 11 instances of "yes >/dev/null".

Proper measurements for the 1800X box have not been done yet, but the first impression is 38W idle, i.e., a little less than the 1600X box (thanks to the SSDs, and obviously AMD now implements power-gating of idle cores well), and quite a bit more when loaded (I have seen 180W, which makes me wonder whether the CPU stays within its TDP).

Stability

These machines worked fine for half a year, then started hanging about once a week. We tried a number of measures to make them stable (e.g., disabling deep sleep and switching power supplies), but they did not help permanently. Eventually we (temporarily) disabled SMT, and we have not seen crashes since then (for 133 days as of this writing). However, after a power outage on one of the boxes we did not disable SMT again, and yet it has not hung since then; so maybe the measure that worked was not SMT-disabling, but something else we did at the same time.

Anyway, if you have this problem and want to disable SMT, you can do so at the BIOS level; but alternatively, you can ask the Linux kernel to use only one logical thread per core. For our Ryzen 1xxx CPUs, you can do it like this (as root with bash):

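  # Take every odd-numbered logical CPU offline, which leaves one
  # thread per core with the numbering used for our Ryzen 1xxx CPUs.
  for i in /sys/devices/system/cpu/cpu[0-9]*; do
    if test $(( ${i##*cpu} % 2 )) = 1; then
      echo 0 >"$i/online"
    fi
  done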

(Note that the logical cores are numbered differently on Intel CPUs, so you need to adapt this loop for them.)
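
A variant that does not depend on the numbering scheme can be sketched by asking the kernel which logical CPUs share a core. This is an untested illustration rather than what we ran; it assumes that each online CPU has a topology/thread_siblings_list file in sysfs (containing something like "0-1" or "0,12"), and the function name secondary_threads is made up for this example. Kernels that provide /sys/devices/system/cpu/smt/control also allow switching SMT off with a single write of "off" to that file.

```shell
# Print the numbers of all logical CPUs that are not the first
# (lowest-numbered) hardware thread of their core.
# Optional argument: the cpu directory (defaults to the usual sysfs path).
secondary_threads() {
  cpu_dir=${1:-/sys/devices/system/cpu}
  for c in "$cpu_dir"/cpu[0-9]*; do
    f=$c/topology/thread_siblings_list
    # CPUs that are already offline have no topology information.
    [ -r "$f" ] || continue
    n=${c##*cpu}
    # The list starts with the lowest-numbered sibling.
    first=$(sed 's/[,-].*//' "$f")
    if [ "$n" != "$first" ]; then
      echo "$n"
    fi
  done
}

# Example (as root): take all secondary threads offline.
#   for n in $(secondary_threads); do
#     echo 0 >/sys/devices/system/cpu/cpu$n/online
#   done
```

The function only prints CPU numbers, so you can inspect the list before writing anything to sysfs.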

In June 2019 (after 23 months), the mainboard of the 1800X box died. We replaced it with an ASUS Prime X370-A, but had to use a bigger case for that.

Performance

Compared to fast Intel CPUs I measured:

Single-thread performance

On our LaTeX benchmark (the numbers are the user time in seconds):

- Ryzen 5 1600X, 4000MHz, 8MB L3, Debian 9 (64-bit)               0.287
- Core i7-4790K, 4400MHz (Turbo), 8MB L3, Debian Jessie (64-bit)  0.204
- Core i7-6700K, 4200MHz (Turbo), 8MB L3, Debian Jessie (64-bit)  0.200

On the Gforth benchmarks (again, user time in seconds):

sieve bubble matrix fib   fft  release; CPU; gcc
0.093 0.099  0.042 0.104 0.030 2017-07-05; AMD Ryzen 1600X 4GHz; gcc-6.3
0.076 0.104  0.040 0.076 0.032 2016-05-03; Intel Core i7-4790K 4.4GHz; gcc-4.9
0.076 0.112  0.040 0.080 0.028 2015-12-26; Intel Core i7-6700K 4.0GHz; gcc-4.9

Multi-thread performance

To be measured.

Anton Ertl