Friday, June 23, 2017

Server PSU interlock

On my multi-GPU rigs, I use server PSUs like the Dell N750P to provide the 12V power to the PCI-E connectors.  These PSUs do not have power switches, so initially I would just pull the power cord out when I wanted to power them down.  After experimenting with the PSU control pins, I realized they have an active low "power on" pin.  Instead of using a jumper to connect it to ground, I decided to use an electronic switch to power the server PSU when the motherboard powers up.

The switch I used is a common, cheap model 817 optocoupler (pdf datasheet).  When current flows from pin 1 to 2, the optocoupler is turned on, creating a short from pin 4 to pin 3.  For my small circuit shown above, pin 4 is connected to the PS_ON signal, and pin 3 is connected to ground on the server PSU.  Pin 1 is connected to 12V (from the 4-pin 3.5" floppy drive power connector), and pin 2 is connected to ground.  On the back of the board is a 1K current-limiting resistor in series with the red LED which is a power on indicator.
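
As a rough check on the indicator current, here is a minimal sketch of the series-resistor arithmetic.  It assumes the 1K resistor, the red LED, and the optocoupler's input LED are all in series across the 12V supply.  The forward voltages (about 2.0V for the red LED and about 1.2V for the 817's input LED) are assumed typical values, not measurements from this board.

```python
def series_led_current_ma(supply_v, resistor_ohms, forward_drops_v):
    """Current in mA through a resistor in series with one or more LEDs."""
    headroom_v = supply_v - sum(forward_drops_v)
    if headroom_v <= 0:
        return 0.0  # supply too low to forward-bias the LEDs
    return headroom_v / resistor_ohms * 1000.0


# Assumed typical forward voltages: red LED ~2.0V, optocoupler input LED ~1.2V.
current_ma = series_led_current_ma(12.0, 1000.0, [2.0, 1.2])
print(f"{current_ma:.1f} mA")
```

Under those assumptions the current lands in the same range as the roughly 10mA a motherboard power LED header typically supplies.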

I also made an even simpler interlock using only an optocoupler with the pins straightened and 0.1" header pins:
I connect pins 1 and 2 to the motherboard's power LED pins, which would normally light up an LED when the motherboard powers up.  The motherboard already has a current-limiting resistor for the power LED, which typically limits the current to around 10mA.

Friday, May 12, 2017

Dummy plugs for headless GPU rigs

I've read about people claiming they needed to plug a monitor (or dummy plug) into one GPU card or else they couldn't use the card.  I had never encountered any problems with either fglrx or AMDGPU-Pro drivers until recently.  I moved a 4GB R9 380 card from an Ubuntu 14.04/fglrx rig to an Ubuntu 16.04/AMDGPU-Pro rig.  The remaining cards are 2GB R7 370 cards, and I started getting memory allocation errors for the primary card.  After checking with "ethminer --list-devices", I noticed the first card had about half the maximum memory allocation limit of the others:
Genoil's ethminer 0.9.41-genoil-1.2.0nr
Forked from
CUDA kernel ported from Tim Hughes' OpenCL kernel
With contributions from nicehash, nerdralph, RoBiK and sp_

Please consider a donation to:
ETH: 0xeb9310b185455f863f526dab3d245809f6854b4d

Listing OpenCL devices.
FORMAT: [deviceID] deviceName
[0] Pitcairn
        CL_DEVICE_GLOBAL_MEM_SIZE: 1920991232
        CL_DEVICE_MAX_MEM_ALLOC_SIZE: 970981376
[1] Pitcairn
        CL_DEVICE_GLOBAL_MEM_SIZE: 2095054848
        CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1868562432

I have an old VGA LCD monitor that I connected using an HDMI-VGA adapter.  After connecting the monitor, nearly the full amount became available:
Genoil's ethminer 0.9.41-genoil-1.2.0nr
Forked from
CUDA kernel ported from Tim Hughes' OpenCL kernel
With contributions from nicehash, nerdralph, RoBiK and sp_

Please consider a donation to:
ETH: 0xeb9310b185455f863f526dab3d245809f6854b4d

Listing OpenCL devices.
FORMAT: [deviceID] deviceName
[0] Pitcairn
        CL_DEVICE_GLOBAL_MEM_SIZE: 1969225728
        CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1750073344
[1] Pitcairn
        CL_DEVICE_GLOBAL_MEM_SIZE: 1968177152
        CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1750073344

I also found the monitor doesn't have to be plugged in, just the HDMI-VGA adapter.  While there might be a way to configure fglrx so that the full memory is available without the adapter, I'm more interested in learning more about AMDGPU-Pro.
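
To spot a card with a reduced allocation limit without reading the numbers by eye, the listing can be parsed and the ratio of CL_DEVICE_MAX_MEM_ALLOC_SIZE to CL_DEVICE_GLOBAL_MEM_SIZE compared per device.  This is only a sketch: the function name is made up for illustration, and it assumes the listing has exactly the layout shown in the output above.

```python
import re


def alloc_ratios(listing):
    """Return {device_id: max_alloc / global_mem} parsed from the listing text."""
    ratios = {}
    device_id = None
    global_mem = None
    for line in listing.splitlines():
        line = line.strip()
        header = re.match(r"\[(\d+)\]", line)
        if header:
            device_id = int(header.group(1))
            continue
        if device_id is None:
            continue  # still in the banner before the first device
        name, _, value = line.partition(":")
        if name == "CL_DEVICE_GLOBAL_MEM_SIZE":
            global_mem = int(value)
        elif name == "CL_DEVICE_MAX_MEM_ALLOC_SIZE" and global_mem:
            ratios[device_id] = int(value) / global_mem
    return ratios


# The two devices from the first listing (no monitor or adapter attached).
SAMPLE = """\
[0] Pitcairn
        CL_DEVICE_GLOBAL_MEM_SIZE: 1920991232
        CL_DEVICE_MAX_MEM_ALLOC_SIZE: 970981376
[1] Pitcairn
        CL_DEVICE_GLOBAL_MEM_SIZE: 2095054848
        CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1868562432
"""

for dev, ratio in alloc_ratios(SAMPLE).items():
    print(f"device {dev}: {ratio:.0%} of global memory allocatable")
```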

Wednesday, May 10, 2017

GDDR5 memory timing details

In my Advanced Tonga BIOS editing post, I discussed some basic memory timing information, but did not get into the details.  GDDR5 memory is much more complex than the asynchronous DRAM of 20 years ago.  There are many sources of information on SDRAM, while GDDR information is harder to come by.  Although a thorough description of GDDR5 can be found in the spec published by JEDEC, neither nVIDIA nor AMD share information on how their memory controllers are programmed with memory timing information.  By analyzing the AMD video driver source, and with help from people contributing to a discussion on bitcointalk, I have come to understand most of the workings of AMD BIOS timing straps.

When a modern (R9 series and Rx series) AMD GPU card boots up, memory timing information (the straps) is copied from the BIOS to registers in the memory controller.  Some timing information such as refresh frequency is not dependent on the memory speed and therefore is not contained in the memory strap table, but much of the important timing information is.  The memory controller registers are 32 bits wide, so the 48-byte memory straps map to 12 different memory controller registers.  The shift masks in the Linux driver source are therefore non-functional, and can only be taken as hints as to the meaning of the individual bits.  Due to an apparently bureaucratic process for releasing open-source code, AMD engineers are generally reluctant to update such code.
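
The 48-byte to 12-register mapping can be sketched in a few lines.  This is an illustration only: it assumes the strap is handled as a 96-character hex string and that each register is stored little-endian, and the function name and the dummy byte pattern are made up (it is not a real strap).

```python
import struct


def strap_to_registers(strap_hex):
    """Split a 48-byte strap (hex string) into twelve 32-bit register values."""
    raw = bytes.fromhex(strap_hex)
    if len(raw) != 48:
        raise ValueError(f"expected 48 bytes, got {len(raw)}")
    # "<12I" = twelve unsigned 32-bit integers, little-endian (assumed byte order).
    return list(struct.unpack("<12I", raw))


# Dummy pattern 00 01 02 ... 2f, used only to show the byte grouping.
dummy_strap = bytes(range(48)).hex()
registers = strap_to_registers(dummy_strap)
for index, value in enumerate(registers):
    print(f"register {index:2d}: 0x{value:08x}")
```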

Jumping right to the code, here's a C structure definition for the Rx memory straps:
uint32_t SEQ_MISC1;
uint32_t SEQ_MISC3;
uint32_t SEQ_MISC8;

Looking at the RAS timing, it consists of 6 fields: RCDW, RCDWA, RCDR, RCDRA, RRD, and RC.  The full field definitions can be found in my fork of Kristy-Leigh's code.  Many of the "pad" fields are likely the high bits of the preceding field that are not currently used.  I tested a couple pad fields already (MISC RP_RDA & RP), confirming that the pad bits were actually the high bits of the fields.

For GDDR5, some timing values have both Long and Short versions that apply for access within a bank group or to different bank groups.  The RRD field of RAS timing is likely RRDL, because the values typically seen for this field are 5 and 6.  If RRDS were 5, this would mean at most one page could be opened every five cycles, limiting 32-byte random read performance to 2/5 or 40% of the maximum interface speed.  From my work with Ethereum mining, I know that RRDS can be no more than 4.  In addition, performance tests with RRD timing reduced to 5 from 6 are consistent with it being RRDL.  The actual value of RRDS used by the memory controller does not seem to be contained in the timing strap.  The default 1750MHz strap for Samsung K4G4 memory has a value of 10 for FAW, which can be no more than 4 * RRDS.  Therefore RRDS is most likely less than 4, and possibly as low as 2.
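
The 2/5 figure can be reproduced with a small calculation.  The sketch assumes, as that figure implies, that one 32-byte read occupies the data bus for 2 command-clock cycles and that every read needs a newly opened page; the function name is made up for illustration.

```python
from fractions import Fraction


def random_read_fraction(rrd_cycles, burst_cycles=2):
    """Fraction of peak bandwidth when each burst needs a newly opened page.

    rrd_cycles: minimum cycles between page activations (RRD).
    burst_cycles: cycles one 32-byte burst keeps the data bus busy (assumed 2).
    """
    return min(Fraction(1), Fraction(burst_cycles, rrd_cycles))


for rrd in (6, 5, 4, 2):
    print(f"RRD={rrd}: {float(random_read_fraction(rrd)):.0%} of peak")
```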

To simplify the process of modifying memory straps for improved performance, I wrote strapmod.  I also wrote a cgi wrapper for the program, which you can run from my server.  For example, this is the output with the 1750MHz strap for Samsung K4G4 memory:
Rx strap detected
Old, new RRD: 6 , 5
Old, new FAW: A , 0
Old, new 32AW: 7 , 0
Old, new ACTRD: 19 , 0x10

Saturday, March 25, 2017

AMDGPU-Pro 16.60 on Ubuntu kernel 4.10.5 with ROCM-smi

Although AMDGPU-Pro 16.40 with kernel 4.8 has been working fine for me, I decided to try 16.60 with kernel 4.10.  After my problems with 16.60 on 4.8, I read a few reports claiming it works well with kernel 4.10.

I started with a fresh Ubuntu desktop 16.04.2 install, and then installed 4.10.5 from the Ubuntu ppa.  Although the process is not very complicated, I wrote a small script which downloads the files and installs them.  After rebooting, I downloaded and installed the AMDGPU-Pro 16.60 drivers according to the instructions.  Finally, I installed ROC-smi, a utility which simplifies clock control using the sysfs interface.  To test the install, run "rocm-smi -a" which will show all info for any amdgpu cards installed.

Unfortunately, the new drivers no longer work with my ethminer fork, but sgminer-gm 5.5.5 works as well as it did with 4.8/16.40.  On GCN3 and newer cards like Tonga and Polaris, the optimal core clock for mining ETH is often between 55% and 56% of the memory clock.  On my Sapphire Rx470 I have the memory overclocked to 2100MHz, so dpm 6 at 1169MHz is a perfect fit:
./rocm-smi -d 0 --setsclk 6

Once sgminer was running for a couple minutes, the speed settled at about 29.1MH/s.  Note that the clock setting is temporary and applies only to the next OpenCL program that runs, so run the rocm-smi command again each time.
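
The 55% to 56% rule of thumb can be turned into a small helper that picks a DPM state for a given memory clock.  This is a sketch under stated assumptions: apart from state 6 at 1169MHz, the clock table below is made up for illustration and will differ per card and BIOS, and the function name is hypothetical.

```python
def pick_sclk_state(mem_mhz, sclk_states, low=0.55, high=0.56):
    """Return the DPM index whose core clock is closest to the middle of the band."""
    target = mem_mhz * (low + high) / 2
    return min(sclk_states, key=lambda idx: abs(sclk_states[idx] - target))


# Hypothetical DPM table in MHz; only state 6 (1169) comes from the text above.
example_states = {0: 300, 1: 603, 2: 838, 3: 1022, 4: 1077, 5: 1121, 6: 1169, 7: 1206}

state = pick_sclk_state(2100, example_states)
print(f"rocm-smi -d 0 --setsclk {state}")
```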

Update 2017-04-08

Kernel 4.10.9 was uploaded to the Ubuntu ppa today, so I would recommend it instead of 4.10.5.

Tuesday, March 14, 2017

Riser Recycling

If you build multi-GPU servers, you'll likely encounter flaky or bad risers.  I've had a bad riser where I could see a burned trace on the PCB, and I've had flaky risers that appeared to be caused by poor soldering of the ribbon cable.  While the problem risers may not work with a GPU, chances are the power connectors are still good.  The riser shown above has a 6-pin PCI-e and a 4-pin molex connector, both of which I tested for continuity with a multimeter.  With some fresh flux I was able to desolder the ribbon cable, so I could re-use the riser as a PCI-e to molex power adapter.  If you are wondering what I would use it for, look at the photo below.

Heat has caused the yellow 12V line to turn brown.  The cable was plugged into the motherboard's supplemental PCI-e power which is used when more than two GPUs are plugged in.  Each GPU will usually draw between 50 and 75 watts over the PCI-e bus, which is pushing the 18AWG (or even 20AWG on some power supplies) cable well beyond its recommended rating.  By plugging the next molex connector in the chain into the riser, and by providing power to the 6-pin connector on the same riser, current will flow into the motherboard molex connector from both directions.

With the current through the brown wire cut in half, the power dissipated (and therefore the heat generated) is reduced by 75%, since P = I^2 * R.
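
A quick numeric check of the 75% figure.  The load and wire resistance are assumptions chosen only to make the arithmetic concrete (a single 75W draw at 12V and a hypothetical 0.02 ohm wire); the percentage reduction does not depend on either value.

```python
def wire_loss_watts(current_a, resistance_ohms):
    """Power dissipated in a wire: P = I^2 * R."""
    return current_a ** 2 * resistance_ohms


WIRE_OHMS = 0.02          # hypothetical resistance of the 12V wire
full_current = 75 / 12    # amps for an assumed 75W load at 12V
half_current = full_current / 2

full_loss = wire_loss_watts(full_current, WIRE_OHMS)
half_loss = wire_loss_watts(half_current, WIRE_OHMS)
reduction = 1 - half_loss / full_loss
print(f"{full_loss:.2f}W -> {half_loss:.2f}W, a {reduction:.0%} reduction")
```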

Supplemental mod

Bitcointalk user BChydro questioned the current-carrying ability of the riser PCB, which turns out to be rather poor for the 12V trace.  The solder mask over the 12V trace was starting to turn brown after only a couple days of use, and a thermal image shows the trace getting hot.

To solve the problem I added an 18AWG jumper wire between the 12V pins:

Sunday, March 5, 2017

AMDGPU-Pro on Ubuntu

It's been almost a year since the first AMDGPU-Pro driver release.  There are now two main release versions: 16.40 and 16.60.  Although both versions supposedly support Ubuntu 16.04, version 16.40 with Ubuntu Desktop 16.04.2 is the only combination that works without a kernel update.

Ubuntu 16.04.2 is the first 16.04 release to use kernel version 4.8 instead of version 4.4.  Using 16.40 with kernel version 4.4 would sometimes lead to problems such as kernel message log floods or powerplay problems.  The typical powerplay problem was that the card would not switch to the full system and memory clock when running OpenCL programs.

Before a fresh Ubuntu install, I suggest disabling Secure Boot, since the AMDGPU-Pro drivers are not signed and therefore do not work with Secure Boot.  If Secure Boot is already set up on your system, the driver install script will prompt you to disable it.  Unlike the fglrx drivers, I have found the AMDGPU-Pro drivers will work along with the Intel i915 drivers.  In a multi-GPU system, I like to leave a monitor connected to the on-board video for a system console.  GPUs can easily be swapped in and out without having to move the monitor connection.

Before installing the driver, make sure your card is detected by running "lspci | grep VGA".  The installation instructions are straightforward.  Don't forget to update the video group as mentioned at the end of the instructions; otherwise OpenCL programs will not detect the GPU.  Note that there is a bug in clinfo (/opt/amdgpu-pro/bin/clinfo) that causes it to display 14 for "Max compute units" instead of the actual number of GPU compute units.  This bug is fixed in 16.60, which requires kernel 4.10 to work properly.

To test your GPU and the driver, you could try my ethminer fork.  Although I built and tested it on Ubuntu 14.04/fglrx, it works perfectly on Ubuntu 16.04.2 with AMDGPU-Pro 16.40.  Once you've started ethminer (or any other OpenCL program), you can check the core and memory clocks with the following commands:
 cat /sys/class/drm/card0/device/pp_dpm_sclk
 cat /sys/class/drm/card0/device/pp_dpm_mclk
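
If you want to read those files from a script, something like the following sketch can pick out the active state.  It assumes the files list one state per line in the form "index: clockMhz" with a trailing asterisk on the active state; the sample text is made up for illustration, so check the output on your own card first.

```python
import re


def parse_dpm(text):
    """Return ({index: clock_mhz}, active_index) from pp_dpm_sclk-style text."""
    states = {}
    active = None
    for line in text.splitlines():
        match = re.match(r"\s*(\d+):\s*(\d+)\s*Mhz\s*(\*?)", line, re.IGNORECASE)
        if not match:
            continue  # skip anything that is not a state line
        index = int(match.group(1))
        states[index] = int(match.group(2))
        if match.group(3):
            active = index
    return states, active


# Made-up sample in the assumed format; the asterisk marks the active state.
sample = "0: 300Mhz\n1: 608Mhz\n2: 1169Mhz *\n"
states, active = parse_dpm(sample)
print(f"active state {active} at {states[active]}MHz")

# On a real system (path from the commands above):
# with open("/sys/class/drm/card0/device/pp_dpm_sclk") as f:
#     states, active = parse_dpm(f.read())
```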

The driver does not come with a tool like aticonfig for custom clock control.  The driver does expose ways of controlling the clocks and voltage, and some developers have written custom programs using information from the kernel headers.  Although nobody seems to have released a utility, the sgminer-gm sysfs code could likely be used as a template to create a stand-alone utility.

Monday, February 20, 2017

Inside AMD GCN code execution

AMD's Graphics Core Next architecture was introduced over five years ago.  Although there have been many documents written to help developers understand the architecture, and thereby write better code, I have yet to find one that is clear and concise.  AMD's best GCN documentation is often cluttered with unnecessary details on the old VLIW architecture, when the GCN architecture is already complicated enough on its own.  I intend to summarize my research on GCN, and what that means for OpenCL and GCN assembler kernel developers.

As shown in the top diagram (GCN Compute Unit), the GPU consists of groups of four compute units.  Each CU has four SIMD units, each of which can perform 16 simultaneous 32-bit operations.  Each of these 16 SIMD "lanes" is also called a shading unit, so the R9 380 with 28 CUs has 28 * 4 * 16 = 1792 shading units.

AMD's documentation makes frequent reference to "wavefronts".  A wavefront is a group of 64 operations that executes on a single SIMD.  The SIMD operations take a minimum of four clock cycles to complete; however, SIMD pipelines allow a new operation to be started every clock.  "The compute unit selects a single SIMD to decode and issue each cycle, using round-robin arbitration." (AMD GCN whitepaper pg 5, para 3).  So four cycles after SIMD0 has been issued an instruction, the CU is ready to issue it another.
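
The numbers above fit together as simple arithmetic, sketched here with made-up constant and function names: 16 lanes per SIMD and 4 SIMDs per CU give the shading unit count, and a 64-wide wavefront on 16 lanes accounts for the four-cycle issue interval.

```python
LANES_PER_SIMD = 16   # simultaneous 32-bit operations per SIMD
SIMDS_PER_CU = 4
WAVEFRONT_SIZE = 64   # operations per wavefront


def shading_units(compute_units):
    """Total SIMD lanes (shading units) for a GPU with the given CU count."""
    return compute_units * SIMDS_PER_CU * LANES_PER_SIMD


def cycles_per_wavefront_op():
    """Cycles for one SIMD to push a full wavefront through its lanes."""
    return WAVEFRONT_SIZE // LANES_PER_SIMD


print(shading_units(28))          # R9 380 with 28 CUs
print(cycles_per_wavefront_op())  # matches the round-robin over 4 SIMDs
```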

In OpenCL, when the local work size is 64, the 64 work-items will be executed on a single SIMD.  Since a maximum of four SIMD units can access the same local memory (LDS), AMD GCN devices support a maximum local work size of 256.  When the local work size is 64, the OpenCL compiler can leave out barrier instructions, so performance will often (but not always) be better than using a local work size of 128, 192, or 256.
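
As a minimal sketch of that relationship (the function names are made up, and the single-wavefront barrier rule simply restates the paragraph above), the local work size maps to wavefronts like this:

```python
def wavefronts_per_group(local_size, wavefront_size=64, max_local_size=256):
    """Number of wavefronts needed to run one work-group."""
    if not 1 <= local_size <= max_local_size:
        raise ValueError(f"local work size must be 1..{max_local_size}")
    return -(-local_size // wavefront_size)  # ceiling division


def needs_barrier_instructions(local_size):
    """A work-group that fits in one wavefront needs no barrier instructions."""
    return wavefronts_per_group(local_size) > 1


for size in (64, 128, 192, 256):
    print(size, wavefronts_per_group(size), needs_barrier_instructions(size))
```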

The SIMD units only perform vector operations such as multiply, add, xor, etc.  Branching for loops or function calls is performed by the scalar unit, which is shared by all four SIMD units.  This means that when a kernel executes a branch instruction, it is executed by the scalar unit, leaving a SIMD unit available to perform a vector operation.  The two operations (scalar and vector) must come from different waves, so to ensure the SIMD units are fully utilized, the kernel must allow for 2 simultaneous wavefronts to execute.  For information on how resource usage such as registers and LDS impacts the number of simultaneous wavefronts that can execute, I suggest reading AMD's OpenCL Optimization Guide.  Note that some sources state that full SIMD occupancy requires four waves, when it is technically possible with just one wave using only vector instructions.  Most kernels will require some scalar instructions, so two waves is the practical minimum.

Monday, January 9, 2017

Hot Video Cards

When I read discussions about video card temperatures, the vast majority are about the GPU core temperature.  With older GPUs like the R9 290, temperature-based throttling when the GPU core temperature hits 94C can be a problem.  With newer GPUs like the R9 380 and especially with the Rx series cards, there are rarely issues with GPU core temperatures, even with low-end cooling systems.  While the GPU core is always cooled with a heatsink and fans, often the RAM is not.  The infrared image above shows how much of a difference that can make in RAM temperatures.

The image was taken of a 4GB MSI R9 380 card with the memory clocked at 1600MHz while running ethminer-nr.  The memory chips above the GPU are connected to the heatsink through a thermal pad, but the chips to the left of the GPU are not.  Using an infrared thermometer I measured temperatures between 95 and 100C on the back side of the PCB from the RAM, so the RAM die temperatures are likely well in excess of 100C.

Keeping RAM cool can make a material difference in the clock speeds that can be achieved.  Instead of 1600MHz, I have found that 1500MHz-rated GDDR5 can reach stable speeds of 1700MHz when connected to a basic heat spreader.  The brand of the memory (Elpida, Hynix, or Samsung) makes little difference in performance when compared to cooling.

While manufacturers will rarely provide enough detail in their specifications or product images to determine if the RAM is cooled, card tear-down reviews will often show the connection between the heatsink and RAM.  Of the cards I have used, only an MSI R9 380 Gaming card had all the RAM cooled.  Neither MSI Armor2X cards nor Gigabyte Windforce cards have all the RAM chips cooled with a heatsink or heat spreader.  I also own an Asus Rx 470 Strix card, and that also lacks active cooling for some of the memory chips.