Better, faster, stronger: Upgrading server CPUs
In my first homelab post I mentioned that after setting up my server, even quad E7-4807s were a little weak. I couldn’t resist the urge to immediately upgrade my processors – 48 threads just doesn’t feel too impressive for quad CPUs.
I went on eBay and snagged 4x E7-4870s for ~$15, and in just a few days they were at my front door waiting for me. Compared to the E7-4807s, these are a solid upgrade.
Spec | E7-4807 | E7-4870 |
---|---|---|
Cores/Threads | 6/12 | 10/20 |
TDP | 95W | 130W |
Clock | 1.86 GHz | 2.4 GHz |
Turbo | No | 2.8 GHz |
Cache | 18M | 30M |
Bus Speed | 4.6 GT/s | 6.4 GT/s |
Max memory speed | 800 MHz | 1066 MHz |
T_Case | 70 C | 69 C (Nice) |
So with 4 of these, my total thread count goes from 48 to 80, which is a little more satisfying! Additionally, the 4870s boost substantially higher than the 4807s (which don’t boost) and have a significantly larger cache, so I expect much better performance from them.
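The thread totals are just sockets times threads per CPU. A quick sketch of that arithmetic, using the thread counts from the spec table above (the names `SOCKETS`, `THREADS_PER_CPU`, and `total_threads` are only illustrative):

```python
# Threads per CPU come from the spec table above; the server has 4 sockets.
SOCKETS = 4
THREADS_PER_CPU = {"E7-4807": 12, "E7-4870": 20}


def total_threads(model: str, sockets: int = SOCKETS) -> int:
    """Total hardware threads across all sockets for a given CPU model."""
    return sockets * THREADS_PER_CPU[model]


for model in THREADS_PER_CPU:
    print(f"{SOCKETS}x {model}: {total_threads(model)} threads")
```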
**Side note:** Whoever decided on these naming conventions should be punished. 4807 and 4870? Seriously? :(
CPU Benchmarking
As a computational scientist, I can happily say that few things get my rocks off like a good benchmark. Now that I have a few conditions to compare, I can meaningfully benchmark performance across three setups and get an idea of what kind of performance I can actually expect from the server:
- My desktop
- The server, with the old 4807s
- The server, with the new 4870s
I did my benchmarking with a Docker sysbench image. In hindsight, I could’ve just run it natively, but the container should give me a really consistent benchmark.
All benchmarks were performed with
docker run --rm ljishen/sysbench output.prof \
--cpu-max-prime=20000 --num-threads=X --test=cpu run
where X is the number of threads I’m benchmarking for.
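Running the command by hand for each value of X gets tedious, so here is a minimal sketch of how the sweep could be scripted. It only wraps the exact command shown above; the helper names (`sysbench_command`, `run_sweep`) are made up for illustration, and the thread counts are the ones used in the results table. This is not necessarily how the original runs were launched.

```python
import subprocess

IMAGE = "ljishen/sysbench"        # image name taken from the command above
THREAD_COUNTS = [1, 16, 48, 80]   # thread counts used in the results table


def sysbench_command(num_threads: int) -> list[str]:
    """Build the docker invocation shown above for one thread count."""
    return [
        "docker", "run", "--rm", IMAGE, "output.prof",
        "--cpu-max-prime=20000",
        f"--num-threads={num_threads}",
        "--test=cpu", "run",
    ]


def run_sweep(thread_counts=THREAD_COUNTS) -> None:
    """Run the benchmark once per thread count (needs Docker installed)."""
    for n in thread_counts:
        print(f"--- {n} thread(s) ---")
        subprocess.run(sysbench_command(n), check=True)


# Call run_sweep() on a machine with Docker to launch the runs.
```

Building the command as a list (rather than one shell string) avoids any quoting issues when it is passed to `subprocess.run`.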
System | Runtime (1 thread) | Runtime (16 threads) | Runtime (48 threads) | Runtime (80 threads) |
---|---|---|---|---|
Desktop (8c/16t) | 12.54 | 1.56 | 1.58 | 1.56 |
Server - 4807 (4x 6c/12t) | 43.36 | 2.95 | 1.28 | 1.28 |
Server - 4870 (4x 10c/20t) | 32.11 | 2.73 | 1.03 | 0.70 |
This is pretty cool to see, because it’s exactly what we’d expect! There are two questions we can ask from this data:
- Which processor is the fastest, on a single core?
- Which processor is the fastest, over all cores?
In the single-threaded benchmark, the desktop of course blows the others out of the water. The 4807s are about 3.5x slower, and even the 4870s are still about 2.6x slower.
At 16 threads, each benchmark thread can run in parallel on the desktop CPU, and the other two are still underutilized. The ranking is the same as in the single-thread case, but the gap narrows: the 4807s are now about 1.9x slower than the desktop and the 4870s about 1.75x slower. The desktop only speeds up about 8x going from 1 to 16 threads, likely because it has just 8 physical cores, while the server chips can still give every thread its own core. Even so, we’re not yet taking advantage of the extra cores on the server chips.
You’ll notice in each row, after we saturate the processor’s threads, we stop getting any performance improvement as we scale out to more threads. We’re already parallelizing as much as we possibly can, so those extra threads have nowhere to go, and we don’t get any improved parallelism.
Past 16 threads, we’re in territory where the server chips excel, and where we can actually take advantage of all that extra parallelism. When we’re at 48 threads, the 4807 is now slightly faster than the desktop (about 1.2x), but the relative performance of the 4807 and 4870 stays about the same.
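One way to see the saturation is to compute each machine’s speedup relative to its own single-thread run. The sketch below does that with the runtimes copied from the table above; the dictionary layout and the function name `speedup_over_single` are just illustrative choices.

```python
# Runtimes copied from the results table above, keyed by thread count.
RUNTIMES = {
    "Desktop (8c/16t)": {1: 12.54, 16: 1.56, 48: 1.58, 80: 1.56},
    "Server - 4807 (4x 6c/12t)": {1: 43.36, 16: 2.95, 48: 1.28, 80: 1.28},
    "Server - 4870 (4x 10c/20t)": {1: 32.11, 16: 2.73, 48: 1.03, 80: 0.70},
}


def speedup_over_single(runtimes: dict[int, float]) -> dict[int, float]:
    """Speedup at each thread count relative to the same machine's 1-thread run."""
    base = runtimes[1]
    return {n: base / t for n, t in runtimes.items()}


for name, runtimes in RUNTIMES.items():
    row = ", ".join(
        f"{n}t: {s:.1f}x" for n, s in speedup_over_single(runtimes).items()
    )
    print(f"{name}: {row}")
```

Once a machine runs out of hardware threads, its speedup stops growing: the desktop plateaus at roughly 8x from 16 threads onward, and the 4807s show the same speedup at 48 and 80 threads.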
At 80 threads, only the 4870 is able to fully parallelize the workload and continue scaling. Out here, it’s now over twice as fast as the desktop!
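To make the comparison against the desktop concrete, the following sketch divides the desktop runtime by each server runtime at the same thread count, and finds the first benchmarked thread count where the server wins. The numbers are copied from the table above; `relative_to_desktop` and `first_win` are hypothetical helper names.

```python
from typing import Optional

# Runtimes copied from the results table above, keyed by thread count.
DESKTOP = {1: 12.54, 16: 1.56, 48: 1.58, 80: 1.56}
SERVERS = {
    "4807": {1: 43.36, 16: 2.95, 48: 1.28, 80: 1.28},
    "4870": {1: 32.11, 16: 2.73, 48: 1.03, 80: 0.70},
}


def relative_to_desktop(server: dict[int, float]) -> dict[int, float]:
    """Desktop runtime divided by server runtime; above 1 means the server is faster."""
    return {n: DESKTOP[n] / t for n, t in server.items()}


def first_win(server: dict[int, float]) -> Optional[int]:
    """Smallest benchmarked thread count at which the server beats the desktop."""
    for n in sorted(server):
        if server[n] < DESKTOP[n]:
            return n
    return None


for name, runtimes in SERVERS.items():
    ratios = relative_to_desktop(runtimes)
    row = ", ".join(f"{n}t: {r:.2f}x" for n, r in ratios.items())
    print(f"{name}: {row} (first win at {first_win(runtimes)} threads)")
```

Only four thread counts were measured, so the real crossover sits somewhere between 16 and 48 threads; this just reports the first measured point where the server is ahead.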
Discussion
(Lol, this section heading. Can you tell the science writing instincts kicked in?)
This is exactly what I was hoping to get. The 4807s didn’t really provide enough threads to get a meaningful speedup over my desktop CPU in this benchmark. But with the 4870s, I’m now running twice as fast as on my desktop.
Of course the usual caveats about benchmarking (and parallel code) come in here – this doesn’t mean I get a 2x speedup by using the server, it means this particular benchmark gave me one. And that only applies, as we can see, on a parallelizable workload.
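The "parallelizable workload" caveat is usually summarized by Amdahl’s law: if only a fraction of the work can run in parallel, the serial remainder caps the overall speedup. Here is a rough sketch of that idealized model. The parallel fractions below are hypothetical values chosen for illustration, not measurements from this benchmark.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Ideal speedup under Amdahl's law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical parallel fractions, evaluated for 80 workers.
for p in (0.5, 0.9, 0.99, 1.0):
    print(f"parallel fraction {p}: {amdahl_speedup(p, 80):.1f}x")
```

Under this model, a workload that is 90% parallel tops out at around 9x on 80 workers, which is why the thread count alone doesn’t predict real-world gains.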
However, in my research I spend a lot of time parallelizing my analysis. I can certainly take advantage of 80 parallel workers! Very excited to see what sort of performance boosts I can get on actual workloads with this.