Saturday, March 25, 2017

AMDGPU-Pro 16.60 on Ubuntu kernel 4.10.5 with ROCM-smi

Although AMDGPU-Pro 16.40 with kernel 4.8 has been working fine for me, I decided to try 16.60 with kernel 4.10.  After my problems with 16.60 on 4.8, I read a few reports claiming it works well with kernel 4.10.

I started with a fresh Ubuntu desktop 16.04.2 install, and then installed 4.10.5 from the Ubuntu ppa.  Although the process is not very complicated, I wrote a small script which downloads the files and installs them.  After rebooting, I downloaded and installed the AMDGPU-Pro 16.60 drivers according to the instructions.  Finally, I installed ROC-smi, a utility which simplifies clock control using the sysfs interface.  To test the install, run "rocm-smi -a" which will show all info for any amdgpu cards installed.

Unfortunately, the new drivers no longer work with my ethminer fork, but sgminer-gm 5.5.5 works as was well as it did with 4.8/16.40.  On GCN3 and newer cards like Tonga and Polaris, the optimal core clock for mining ETH is often between 55% and 56% of the memory clock.  On my Sapphire Rx470 I have the memory overclocked to 2100Mhz, so dpm 6 at 1169Mhz is a perfect fit:
./rocm-smi -d 0 --setsclk 6

Once sgminer was running for a couple minutes, the speed settled at about 29.1Mh/s.  Note that the clock setting is only temporary for the next opencl program to run.  Just run the rocm-smi command each time.

Tuesday, March 14, 2017

Riser Recycling

If you build multi-GPU servers, you'll likely encounter flaky or bad risers.  I've had a bad riser where I could see a burned trace on the PCB, and I've had flaky risers that appeared to be caused by poor soldering of the ribbon cable.  While the problem risers may not work with a GPU, chances are the power connectors are still good.  The riser shown above has a 6-pin PCI-e and a 4-pin molex connector, both of which I tested for continuity with a multi-meter.  With some fresh flux I was able to desolder the ribbon cable, so I could re-use the riser as a PCI-e to molex power adapter.  If you are wondering what I would use it for, look at the photo below.

Heat has caused the yellow 12V line to turn brown.  The cable was plugged into the motherboard's supplemental PCI-e power which is used when more than two GPUs are plugged in.  Each GPU will usually draw between 50 and 75 watts over the PCI-e bus, which is pushing the 18AWG (or even 20AWG on some power supplies) cable well beyond it's recommended rating.  By plugging the next molex connector in the chain into the riser, and by providing power to the 6-pin connector on the same riser, current will flow into the motherboard molex connector from both directions.

With the current through the brown wire cut in half, the power dissipated (and therefore the heat generated) is reduced by 75%, since P = I^2 * R.

Supplemental mod

Bitcointalk user BChydro questioned the current-carrying ability of the riser PCB, which turns out to be rather poor for the 12V trace.  The solder mask over the 12V trace was starting to turn brown after only a couple days of use, and a thermal image shows the trace getting hot.

To solve the problem I added a 18AWG jumper wire between the 12V pins:

Sunday, March 5, 2017

AMDGPU-Pro on Ubuntu

It's been almost a year since the first AMDGPU-Pro driver release.  There are now two main release versions; 16.40 and 16.60.  Although both versions supposedly support Ubuntu 16.04, version 16.40 with Ubuntu Desktop 16.04.2 is the only combination that works without a kernel update.

Ubuntu 16.04.2 is the first 16.04 release to use kernel version 4.8 instead of version 4.4.  Using 16.40 with kernel version 4.4 would sometimes lead to problems such as kernel message log floods or powerplay problems.  The typical powerplay problem was that the card would not switch to the full system and memory clock when running OpenCL programs.

Before a fresh Ubuntu install, I suggest disabling safeboot, since the AMDGPU-Pro drivers are not signed and therefore do not work with safeboot.  If safeboot is already set up on your system, the driver install script will prompt you to disable it.  Unlike the fglrx drivers, I have found the AMDGPU-Pro drivers will work along with the Intel i915 drivers.  In a multi-GPU system, I like to leave a monitor connected to the on-board video for a system console.  GPUs can easily be swapped in and out without having to move the monitor connection.

Before installing the driver, make sure your card is detected by running, "lspci | grep VGA".  The installation instructions are straightforward, and don't forget to update the video group as mentioned at the end of the instructions.  Otherwise OpenCL programs will not detect the GPU.  Note that there is a bug in clinfo (/opt/amdgpu-pro/bin/clinfo) that causes it to display 14 for "Max compute units" instead of the actual number of GPU compute units.  This bug is fixed in 16.60, which requires kernel 4.10 to work properly.

To test your GPU and the driver, you could try my ethminer fork.  Although I built and tested it on Ubuntu 14.04/fglrx, it works perfectly on Ubuntu 16.04.2 with AMDGPU-Pro 16.40.  Once you've started ethminer (or any other OpenCL program), you can check the core and memory clocks with the following commands:
 cat /sys/class/drm/card0/device/pp_dpm_sclk
 cat /sys/class/drm/card0/device/pp_dpm_mclk

The driver does not come with a tool like aticonfig for custom clock control.  The driver does expose ways of controlling the clocks and voltage, and some developers have written custom programs using information from the kernel headers.  Although nobody seems to have released a utility, the sgminer-gm sysfs code could likely be used as a template to create a stand-alone utility.