Whenever you build a new system that's going to run 24/7, I highly recommend running it through its paces early on. You need to figure out if any hardware is DOA or a lemon, and you need to learn how things will run: the speeds you should expect, how hot things will get, etc. In this part of JayP-NAS 2.0 - A Real Home Server, we'll walk through the major steps of benchmarking the new server I built. Many of these steps carry over to other hardware, software platforms, etc.
The first thing to do is run a memory test, and the usual recommendation is memtest86+. The main reason I recommend running it first is that it runs outside of an OS, usually off a USB boot drive, so you might as well use the time it's running to plan the rest of your setup process. I use the latest version, 5.01. There are a lot of ways to get this set up on a USB drive, so I'll leave that to you. Download the latest version (available here) and set it up however you'd like.
When you boot it up you'll likely want to make sure it runs in multi-processor mode. You'll be prompted at startup to allow SMP support; allow it, as this speeds up the testing by quite a bit. It's pretty straightforward from here, just let it do its thing. A single pass can take hours, and it's often recommended to do at least 10 passes.
There's been some debate over the years on how necessary it is to do as many passes as possible. For a brand new system I'm on the side of "a couple passes is probably fine." Generally speaking, if fresh out of the box RAM is bad, it'll show errors pretty quickly. If it does, return it. If not, you're probably fine. If you want to be completely certain everything is solid, give it at least 10 passes.
Setting up a Testing Environment
I recommend using a temporary benchmarking setup that you can wipe out and start fresh when you're ready to get going for real. This can be a Live CD environment or directly installing your OS knowing you'll be reinstalling it shortly after. This allows you to be a little messy on the software side of things without worrying about cleaning up after yourself. You can set things up close to the real deal without getting invested in decisions you might want to change a day or two later. The other advantage is you can go through certain automated tasks and see what they'll do. If you don't like it you can change things later.
I went ahead and used a USB drive to install Ubuntu Server 16.04 onto my M.2 NVMe SSD, letting it partition itself as it wanted. I set up the basics and just hit go. Once done, I remoted in and took quick stock of things: since I have 64GB of RAM, the installer made a 64GB swap partition, which is ridiculous for my needs. The system should never hit swap with this much RAM, so having that much is a waste of OS drive space. I made a note to fix that for the real go. Otherwise everything looked good and all the hardware was detected properly, so I moved forward with testing.
While doing your stress testing you should probably monitor what's happening. This is obvious, right? Well, for my setup there are a few tools I want. For watching what programs are running and what resources they're using I like htop over top for a bit more data (although I find top is usually more accurate for a quick "wtf is using so many CPU cycles" check). For monitoring system temperatures you'll need lm_sensors and hddtemp.
sudo apt install htop lm-sensors hddtemp
Just installing these isn't quite enough; you'll need to do a bit of setup. First let's get lm_sensors configured:
sudo sensors-detect
This starts an interactive detection process that probes for all known sensors. My general rule of thumb: just answer yes to everything. It will stop at certain points, prompting you to enter Yes or No. If you just hit enter it'll assume the default answer, which is capitalized like this: (YES/No) or (Yes/NO). Most default to Yes, but a few default to No, so keep an eye out for those and enter Yes. The final step writes the necessary modules to a file so they're automatically loaded at boot. You can reboot now, or simply enter the following command to load those modules so we can make sure everything's working.
sudo service kmod start
Now that it's all loaded up you can check what temperatures look like with this command:
sensors
You'll see output like this:
acpitz-virtual-0
Adapter: Virtual device
temp1:         +42.0°C  (crit = +119.0°C)
temp2:         +42.0°C  (crit = +119.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 0:         +40.0°C  (high = +80.0°C, crit = +100.0°C)
Core 1:         +41.0°C  (high = +80.0°C, crit = +100.0°C)
Core 2:         +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 3:         +41.0°C  (high = +80.0°C, crit = +100.0°C)

power_meter-acpi-0
Adapter: ACPI interface
power1:         4.29 MW  (interval = 4294967.29 s)
There you have it. Things should be quite cool at this point; the system's been running long enough to settle at idle temperatures and you haven't done anything to stress it yet.
Luckily, hddtemp requires little configuration, you just need to run the command. All you really need to know is what devices you'll want it monitoring. The easiest way to figure that out is running this command:
sudo lsblk -f
This will show you the device names and a bit of basic information about each drive the system sees. Next we just need to pass each drive to hddtemp to see its temperature. Each device needs to be passed to hddtemp by its location, so just add /dev/ to the beginning of each name. We only care about the drives that will be part of our drive pool, so use the data from lsblk to figure out which those are. For me it looks like this:
sudo hddtemp /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh
In my case, I see the following:
/dev/sda: ST6000VN0041-2EL11C: 45°C
/dev/sdb: ST6000VN0041-2EL11C: 44°C
/dev/sdc: ST6000VN0041-2EL11C: 47°C
/dev/sdd: ST6000VN0041-2EL11C: 47°C
/dev/sde: ST6000VN0041-2EL11C: 43°C
/dev/sdf: ST6000VN0041-2EL11C: 46°C
/dev/sdg: ST6000VN0041-2EL11C: 45°C
/dev/sdh: ST6000VN0041-2EL11C: 43°C
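As an aside, if you'd rather not type that device list by hand, you can build it straight from lsblk. This is just a sketch of one way to do it; note that it grabs every whole disk the system sees, including your OS drive, so trim the output as needed:

```shell
# List whole disks only (-d skips partitions, -n skips the header),
# keep only rows whose TYPE is "disk", prefix each name with /dev/,
# and hand the whole lot to hddtemp.
# Caution: this includes ALL disks, OS drive included - edit to taste.
sudo hddtemp $(lsblk -dn -o NAME,TYPE | awk '$2 == "disk" {print "/dev/" $1}')
```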
NOTE: At this point you'll probably want to open a few different ssh sessions so you can monitor each test while doing other work. You'll probably need at least 3 sessions open.
Let's go ahead and pop open an ssh session to load up htop. It will show you a graph for each CPU thread so you can see how things are running, along with a list of active processes, memory utilization, etc. It's really easy to start, simply:
htop
Now let's open another session to keep an eye on temperatures. To make things easy I like to write a quick little shell script (by quick I mean it runs 2 commands, that's it) so I can see both sensors and hddtemp data at once. Run this command to create the file:
nano ~/temps
Nano is a command line text editor that's really easy to use. If you have a preference for something else, go for it. Either way, enter the following into your text editor:
sudo sensors sudo hddtemp /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh
Now save and close it. In Nano that's as easy as holding Control and hitting X, then Y to confirm the save. Next, make the file executable and run it to make sure it works.
chmod +x ~/temps
sh ~/temps
Now whenever you run ~/temps you'll see all the temperatures. To make it easy to just let it run in the background, run this instead:
watch -n5 sh ~/temps
This effectively runs the commands every 5 seconds for you. You can change that -n5 to -n<whatever> number of seconds you want.
With an ssh session dedicated to that watch command for temps and another for htop, you've got an eye on things. Pop open (or switch over to) a third ssh session; this will be our "working" environment now that we're monitoring things elsewhere.
CPU Stress Test
Stress testing the CPU does two things for us: First, it lets us know the CPU is solid. Problems aren't common with Xeon processors, but you never really know if you got a dud until you've created a situation where problems will arise. Second, it stresses the CPU to such a degree that you'll learn the maximum temperature the CPU will hit in your system. It also gives you an idea of how fast it'll drop back to normal temperatures.
There are a lot of ways to go about this, but for my purposes I decided to go with one of the easier routes: cpuburn. Install it with:
sudo apt install cpuburn
The cpuburn program is actually made up of a few commands, with specific versions for specific CPU architectures. This lets you pick the right burn-in test for your system; you can read more about this here. In my case I need the burnP6 command. Running it once spins off a single burnP6 process, but we're on a multi-core system, so we need to spin off a copy of burnP6 for each thread the CPU supports. Notice I said thread, not core: the idea is to fully max out the CPU, and Intel's Hyper-Threading can "cheat" the system if we don't create 8 copies of burnP6. The easiest way to do this is to run burnP6 in the background 8 times; there's no need to run more copies than the number of threads you have. I also recommend starting a timer of some kind when doing this. However you want to, be it just writing down the time or a literal stopwatch, keep track of how long it's been since you started this command. We'll come back to this later. The easiest way to spin off 8 copies of cpuburn for me is simply:
burnP6 & burnP6 & burnP6 & burnP6 & burnP6 & burnP6 & burnP6 & burnP6 &
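If typing burnP6 eight times feels error-prone, a loop does the same thing and adapts to whatever CPU you have. A sketch; nproc reports the number of hardware threads (not cores), which is exactly what we want here:

```shell
# Spin off one copy of burnP6 per hardware thread.
# nproc counts threads, so Hyper-Threaded CPUs get the full load.
for i in $(seq "$(nproc)"); do
  burnP6 &
done
```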
With that running on its own you'll see all the htop graphs at 100%, and chances are your sensors report is showing the CPU temperature rising. You can keep moving through the guide, but I want to note a couple of things about our burnP6 stress test. What I like to do is kill this first round of the test after about an hour and a half, then watch the CPU temperature drop. To stop the cpuburn processes we simply issue this command:
killall burnP6
This will give you an idea of how quickly you should expect the CPU to return to idle temperatures. A good cooling solution will see the temps drop back to idle quickly. My system got back to idle temps in under 20 seconds, mostly thanks to the beefy Noctua CPU cooler. If your CPU drops back to idle slowly, there are probably some "dead air pockets" in your system: the cooling isn't able to move all the air around, or there are areas where an air vortex is forming, meaning hot air can't get out. I'd recommend addressing this ASAP, as it will create problems in the rest of benchmarking. In the real world your system should rarely max out for such long periods, so in practice the temperature should be far more managed.
Once you get that figured out, start the burnP6 command again and let it run for at least another 4 hours. In my case, since I saw the temperature drop back to idle so fast, I just ran it again immediately for about 2 and a half hours. If you had to shut down the system to address cooling issues, give it a good run of 2-4 hours to check on it. This is the actual stress portion, where we're letting the CPU get hot and stay hot. Ideally the CPU will hit a temperature and hold steady there, hopefully below the CPU's thermal throttle point. My setup hit ~68C within a few minutes, then over the next hour crept up to 71C and stayed there for the remainder of my testing.
If you're happy with that, you're done with the CPU testing... for now. I recommend doing a good 12 to 24 hour "heat soak." This gives you a better idea of how well your system's cooling can handle things. Short stress tests of just an hour or two won't let heat build in all areas of the computer, so even if the CPU drops to idle temps quickly, you may have a component that isn't getting air moved over it and not know about it. A long heat soak will bring these problems to light. I'd recommend waiting for the other tests to complete, then coming back to set up the long cpuburn to run overnight.
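For that overnight heat soak, you don't have to be awake to kill the processes yourself. One way to sketch it, using the timeout command from coreutils (adjust the 12h to however long you want the soak to run):

```shell
# One burnP6 per hardware thread; each copy kills itself after 12 hours,
# so the heat soak ends on its own while you sleep.
for i in $(seq "$(nproc)"); do
  timeout 12h burnP6 &
done
```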
Storage Stress Test
From here we'll want to start gathering information on our drives. We'll use a nice guide from qwertymodo over on the FreeNAS forums as a starting place. We can convert it almost verbatim for Linux, though there's at least one step we'll skip. First, let's make sure the programs we need are installed.
sudo apt install smartmontools e2fsprogs
Next, we'll start the SMART tests. You'll want to run these in order for each drive. Replace the X in /dev/sdX with each individual drive.
sudo smartctl -t short /dev/sdX
Run that for each individual drive. The command runs only on the drive given, so you can start them in sequence and let them run; the tests happen in the background on the drive itself. Each short test takes about 5 minutes, and the command reports an estimated time remaining when you run it. Since a given drive can only run one test at a time, let each test finish before starting the next type.
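Rather than typing the command once per drive, a loop can kick them all off. A sketch, assuming your pool drives are /dev/sda through /dev/sdh like mine; adjust the glob to match your own lsblk output:

```shell
# Start the short SMART test on every pool drive, one after another.
# The tests themselves run on the drives in the background.
for d in /dev/sd[a-h]; do
  sudo smartctl -t short "$d"
done
```

The same loop works for the conveyance and long tests below; just swap the word after -t.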
sudo smartctl -t conveyance /dev/sdX
The conveyance test usually won't take as long as the short test. Do note, not every drive supports conveyance tests, so if it fails just move on.
sudo smartctl -t long /dev/sdX
The long test takes quite a while. You can check the estimated time and come back later, or you can continue on while it's doing this long test. Generally speaking it's best to let the long test happen first.
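To see whether a drive's self-test has finished (and whether it passed), smartctl can print the drive's self-test log. A quick sketch, with the same /dev/sda-h assumption as before; "Completed without error" is what you want to see, and an in-progress test shows the percent remaining:

```shell
# Print each drive's SMART self-test log.
for d in /dev/sd[a-h]; do
  echo "== $d =="
  sudo smartctl -l selftest "$d"
done
```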
Once that's all done we'll stress out the drives a bit. The goal here is to see if any drives are duds out of the box, have serious performance problems, or in my case see if the LSI HBA is overheating. LSI HBA's are known to have serious performance problems when they start to overheat, and they can overheat pretty easily.
Now we want to start up a tmux session. tmux will let us run a bunch of commands that can't be backgrounded at the same time, in the same window:
tmux
Next, we'll start our first instance of badblocks. Similar to the smartctl commands, replace X with the appropriate drive letter. Note: if you're using drives <2TB, you can drop the -b 4096 portion of this command.
sudo badblocks -b 4096 -ws /dev/sdX
Note: you can change the -ws to -ns if you want the test to run non-destructively. If these drives are fresh out of the box and there's no data on them, it doesn't matter.
Once the test starts, hit Ctrl+B then " (that's the double quote character, not a single quote twice). You'll see a line separating the screen in half. You'll enter the same badblocks command as above, but go to the next drive letter. If the first was /dev/sda, now enter /dev/sdb. Repeat the Ctrl+B then " step and the screen will be split in thirds. Again, next drive's badblocks command. Repeat until badblocks is running for each drive.
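If splitting panes by hand gets tedious, tmux can script it for you. A sketch, run from inside the tmux session, again assuming drives sda through sdh; split-window's -d flag keeps focus in your current pane while each new one starts:

```shell
# Open one pane per remaining drive, each running its own badblocks.
# select-layout tiled keeps the panes evenly sized as they're added.
for d in /dev/sd[b-h]; do
  tmux split-window -d "sudo badblocks -b 4096 -ws $d"
  tmux select-layout tiled
done
sudo badblocks -b 4096 -ws /dev/sda    # first drive runs in the original pane
```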
If for whatever reason your session gets disconnected, you can run this command to reconnect to that tmux session:
tmux attach
This badblocks test is the core of the drive stress test. Having it run on all of your drives at once speeds up the process versus doing them one by one (it still takes a long time with large drives), but it also generates a TON of heat. Don't be surprised if your drives reach the "christ that's hot" point during this. Since we're only doing this test the one time, keep an eye on it but don't be too concerned about it ruining the drives; they'd have to sit at those temps for days on end before taking real damage. My drives hit 65C but didn't get much worse. That's crazy hot for drives, which is why I started pursuing ways to get them cooler (and ended up on an all new chassis). In reality the drives will never be stressed like this for such long periods, so you probably don't need to be overly concerned. If they get as hot as mine did you should still pursue getting them cooler, though.
Once those badblocks tests are done, you'll want to run the long SMART test again:
sudo smartctl -t long /dev/sdX
We're running this again so SMART has a chance to check on things after the extreme badblocks test. Once those tests are done we can check the status of each drive:
sudo smartctl -A /dev/sdX
You'll see a bunch of different stats SMART keeps data on. What you're looking for here are the Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable lines. If the drive is fine, the raw value for each will be 0. If any aren't 0, return/RMA the drive.
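With eight drives, eyeballing the full smartctl -A output gets old fast. A quick sketch that pulls just those three attributes for every pool drive (same /dev/sda-h assumption as earlier):

```shell
# Print only the three SMART attributes we care about, per drive.
# Any non-zero raw value (last column) means return/RMA that drive.
for d in /dev/sd[a-h]; do
  echo "== $d =="
  sudo smartctl -A "$d" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
done
```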
I mentioned the LSI HBA overheating earlier. I noticed it because 2 out of 8 drives were running their badblocks test significantly slower than the other 6. Before wrapping up the storage tests with the final long SMART test, I gave the drives a while to cool down and tried again, and the same thing happened, but this time on a different pair of drives. The HBA was just working super hard and, for whatever reason, randomly picking two drives to cut performance to. In the real world, on a home server, you'll never hit the drives as hard as a badblocks test does, and the HBA will rarely if ever be that stressed. That said, these cards are designed for "real" servers, not a home environment, so they do a terrible job cooling themselves. Most people attach a 40mm fan to the HBA's heatsink, or find some way to get air moving across it. I ended up 3D printing a 120mm fan mount that fits in the PCI slots and mounted it just below the card, so there's a lot more air moving across the card, which helps.
Putting it All Together
So now you've got all the information you need to decide on a few things. Are your parts all good, or do you need to get some stuff replaced? Is your cooling setup solid or does it need work? You probably also now know how loud your system is. Is it too loud, or quiet enough? Here are some guidelines I've stood by:
- Memory: If you got any errors, try to figure out which stick of RAM it is. Pull out all the other sticks of RAM and run memtest on that stick alone to ensure it's the one with a problem. Pull it out, put in the others, and run it again to make sure all the other RAM is definitely OK. Replace the bad stick. If there's more than one bad stick, I'd return it all and see about trying to get some from a different vendor, or possibly even a different manufacturer. Chances are it's just a bad batch of RAM, but if it's all bad it's difficult to trust that manufacturer. Do the tests again with your replacements to ensure you're solid. If it's still bad... You could have a dud motherboard. That's a hassle to replace but better to do it now than later.
- CPU temperatures: The lower the idle temperature the better. Every CPU is different, but anything modern should typically idle in the mid 30Cs to low 40Cs. Any idle temp above 45C is a bit high for Intel but not terrible; 50C+ is hot for an Intel chip at idle. AMD CPUs run hotter than Intel, in my experience by about 5-10C (part of the reason I'm stepping away from AMD again). That said, the latest AMD parts like Ryzen/Threadripper/EPYC are a different ball game and I don't know what to expect from them. Load temperatures, again, are different for every CPU. The real goal is to look up the thermal throttle temperature for your CPU. Most CPUs intended for workstations or servers, like mine, will throttle around 85C. If your CPU is maxed out but below that number, you're solid. If it's near it, see if you can drop it a bit to be safer. If you're at the throttle temperature, it's not the worst thing in the world; the CPU will throttle itself so it doesn't cook itself, but that can have drastic effects on performance. Basically, to get the best out of it, keep things below the throttle temperature as much as possible. All that said, if you're hitting the throttle temperature fast and seeing performance dip significantly, you've got a serious cooling problem.
- Storage temperatures: The old rule of thumb here is that 35-40C is normal, lower is better, but too low is bad. Higher speed drives run hotter, and the more densely packed your drives are, the more they'll heat each other up. The problem is there are a lot of differing opinions from reliable sources on this topic. Some say drives are more likely to fail at such and such temperature while others say there's no real correlation. The only thing it seems like everyone agrees on is that too cold is seriously bad. Here's my opinion: each drive has operating temperatures listed in its specs. As long as you're above the low temperature, and decently below the high, you're fine. My drives idle in the mid to high 40s. Their low is 30C, the high is 70C. Hotter than the generally recommended 40C but not by a lot, and well within the manufacturer's operating temp range. I've switched to a different chassis and drive cage, changed fans, used a couple different manners of fan control... and nothing's made a big impact. I did what I reasonably could and got the temps from the low 50s to the mid 40s, which is better than nothing. I'm fine with it.
- Noise: I can't tell you what's too loud for you, and outside of buying expensive sound meters there's no way to objectively track it anyways. I've got a cheap little decibel meter but you can't trust these things... The point is, you know what's too loud and what isn't. If you built a server anything like mine, there are 2 major sources of noise: The fans and the drives. Another smaller source to consider is vibrations, which is a hell of a thing to try and solve.
- Fans: When it comes to case fans, bigger is generally better. The larger the fan, the more air it moves at a given speed. The size of a fan and the speed it's running at directly contributes to how loud it is, but there are ways to design the blades and shell of a fan to reduce noise and increase air flow. Depending on your chassis you'll be limited in certain regards. In my Rosewill 4U chassis I can't just slap a 120mm fan in the rear to act as exhaust like I can in any off the shelf standard tower these days. You can try to go on manufacturer specs but sometimes those won't be the whole truth. I initially picked up some Noctua 12cm fans but switched them out for Nanoxia 12cm fans. The Nanoxia's are marketed as having a higher CFM (more air is moved) and lower dB (quieter) than the Noctuas, but when I installed them they were both louder and made my drives hotter... so I returned them and put the Noctuas back in. Your only real option here is experimenting.
- Drives: If you have enough drives, chances are they're what's actually loud in your system. There's also not a lot you can do about it, depending on the drives you got. Some are just loud. Depending on the case or chassis you have, you can try installing some sound damping material, but that can add heat to the system. I picked up some Silverstone 10mm sound damping material to put on the chassis lid. It did very little. I can tell it's made a change, but it's so insignificant it doesn't matter. I've got another sheet of the stuff I'm thinking I'll try on the other sides of the chassis, but I don't expect much.
- Vibrations: Between the drives and fans, there's a lot of movement running through the system. That movement can travel through the metal of your chassis, creating a resonance noise, and in some cases the vibration can make it to whatever your chassis is sitting on or in. There are tons of ways to try and address it: rubber washers around screws, anti-vibration mounts for fans and drives, rubber feet for the chassis, silicone seals (like for windows) around the chassis, Dynamat, etc. You figure out what you want to do, but I suggest actually testing each potential trouble spot first by simply holding a finger or hand against anything you suspect of vibrating. If you hear the noise drop a bit, you've found a spot worth dampening. Alternatively, for difficult to reach spots, find an all-metal screwdriver or allen wrench, whatever will reach the spot. Holding it gingerly, rest it against the suspect area; you should feel the vibrations through the tool you're carefully holding and hear them as well.
- For all three factors, do what you can but at the end of the day the best bet is simple: out of sight, out of mind. If you can put the server somewhere that the noise won't bother you, and it won't cook itself, do it.
One final note for you: every piece of hardware is different. Not just between manufacturers and models; even two items of the same exact model can differ a bit. You can't expect your hardware to be perfect out of the box or match up with someone else's flawlessly. The point of such thorough benchmarking is to ensure the hardware you do have will be reliable and perform to your expectations. As long as you're comfortable with what you see during the benchmarks, or you correct things until you are, you'll be fine.