Friday, May 12, 2017

Dummy plugs for headless GPU rigs


I've read about people claiming they needed to plug a monitor (or dummy plug) into a GPU card or else they couldn't use the card.  I had never encountered any such problems with either the fglrx or AMDGPU-Pro drivers until recently.  I moved a 4GB R9 380 card from an Ubuntu 14.04/fglrx rig to an Ubuntu 16.04/AMDGPU-Pro rig.  The remaining cards in the fglrx rig are 2GB R7 370s, and I started getting memory allocation errors on the primary card.  After checking with "ethminer --list-devices", I noticed the first card had about half the maximum memory allocation limit of the others:
Genoil's ethminer 0.9.41-genoil-1.2.0nr
=====================================================================
Forked from github.com/ethereum/cpp-ethereum
CUDA kernel ported from Tim Hughes' OpenCL kernel
With contributions from nicehash, nerdralph, RoBiK and sp_

Please consider a donation to:
ETH: 0xeb9310b185455f863f526dab3d245809f6854b4d

[OPENCL]:
Listing OpenCL devices.
FORMAT: [deviceID] deviceName
[0] Pitcairn
        CL_DEVICE_TYPE: GPU
        CL_DEVICE_GLOBAL_MEM_SIZE: 1920991232
        CL_DEVICE_MAX_MEM_ALLOC_SIZE: 970981376
        CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
[1] Pitcairn
        CL_DEVICE_TYPE: GPU
        CL_DEVICE_GLOBAL_MEM_SIZE: 2095054848
        CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1868562432
        CL_DEVICE_MAX_WORK_GROUP_SIZE: 256

I have an old VGA LCD monitor that I connected using an HDMI-VGA adapter.  After connecting the monitor, nearly the full amount of memory became available:
Genoil's ethminer 0.9.41-genoil-1.2.0nr
=====================================================================
Forked from github.com/ethereum/cpp-ethereum
CUDA kernel ported from Tim Hughes' OpenCL kernel
With contributions from nicehash, nerdralph, RoBiK and sp_

Please consider a donation to:
ETH: 0xeb9310b185455f863f526dab3d245809f6854b4d

[OPENCL]:
Listing OpenCL devices.
FORMAT: [deviceID] deviceName
[0] Pitcairn
        CL_DEVICE_TYPE: GPU
        CL_DEVICE_GLOBAL_MEM_SIZE: 1969225728
        CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1750073344
        CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
[1] Pitcairn
        CL_DEVICE_TYPE: GPU
        CL_DEVICE_GLOBAL_MEM_SIZE: 1968177152
        CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1750073344
        CL_DEVICE_MAX_WORK_GROUP_SIZE: 256

I also found that the monitor doesn't actually have to be plugged in; just the HDMI-VGA adapter is enough.  While there might be a way to configure fglrx so that the full memory is available without the adapter, I'm more interested in learning more about AMDGPU-Pro.
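
If you'd rather check the limits without installing a miner, the numbers ethminer prints come straight from the OpenCL API.  Below is a minimal sketch that queries CL_DEVICE_GLOBAL_MEM_SIZE and CL_DEVICE_MAX_MEM_ALLOC_SIZE for the GPUs on the first platform; it assumes a single OpenCL platform and skips error checking, so treat it as a starting point rather than a finished tool.  Build with something like "gcc clmem.c -lOpenCL":
/* clmem.c - print OpenCL memory limits, similar to "ethminer --list-devices".
 * Minimal sketch: assumes one OpenCL platform, no error checking. */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id devices[16];
    cl_uint ndev = 0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 16, devices, &ndev);

    for (cl_uint i = 0; i < ndev; i++) {
        char name[128];
        cl_ulong global_mem = 0, max_alloc = 0;

        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        clGetDeviceInfo(devices[i], CL_DEVICE_GLOBAL_MEM_SIZE,
                        sizeof(global_mem), &global_mem, NULL);
        clGetDeviceInfo(devices[i], CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                        sizeof(max_alloc), &max_alloc, NULL);

        printf("[%u] %s\n", i, name);
        printf("        CL_DEVICE_GLOBAL_MEM_SIZE: %llu\n",
               (unsigned long long)global_mem);
        printf("        CL_DEVICE_MAX_MEM_ALLOC_SIZE: %llu\n",
               (unsigned long long)max_alloc);
    }
    return 0;
}

Running it with and without the adapter attached should show the same jump in CL_DEVICE_MAX_MEM_ALLOC_SIZE as the ethminer listings above.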

Wednesday, May 10, 2017

GDDR5 memory timing details



In my Advanced Tonga BIOS editing post, I discussed some basic memory timing information, but did not get into the details.  GDDR5 memory is much more complex than the asynchronous DRAM of 20 years ago.  There are many sources of information on SDRAM, while GDDR5 information is harder to come by.  Although a thorough description of GDDR5 can be found in the spec published by JEDEC, neither nVIDIA nor AMD shares details of how their memory controllers are programmed with timing information.  By analyzing the AMD video driver source, and with help from people contributing to a discussion on bitcointalk, I have come to understand most of the workings of AMD BIOS timing straps.

When a modern (R9 and Rx series) AMD GPU card boots up, memory timing information (straps) is copied from the BIOS to registers in the memory controller.  Some timing information, such as the refresh frequency, does not depend on the memory speed and therefore is not contained in the memory strap table, but much of the important timing information is.  The memory controller registers are 32 bits wide, so the 48-byte memory straps map to 12 different memory controller registers.  The shift masks in the Linux driver source are therefore non-functional, and can only be taken as hints as to the meaning of the individual bits.  Due to an apparently bureaucratic process for releasing open-source code, AMD engineers are generally reluctant to update such code.

Jumping right to the code, here's a C structure definition for the Rx memory straps, with one member per 32-bit memory controller register (the *_FORMAT types are bit-field structs defined in the code linked below):
struct rx_memory_strap {
    SEQ_WR_CTL_D1_FORMAT    SEQ_WR_CTL_D1;
    SEQ_WR_CTL_2_FORMAT     SEQ_WR_CTL_2;
    SEQ_PMG_TIMING_FORMAT   SEQ_PMG_TIMING;
    SEQ_RAS_TIMING_FORMAT   SEQ_RAS_TIMING;
    SEQ_CAS_TIMING_FORMAT   SEQ_CAS_TIMING;
    SEQ_MISC_TIMING_FORMAT  SEQ_MISC_TIMING;
    SEQ_MISC_TIMING2_FORMAT SEQ_MISC_TIMING2;
    uint32_t                SEQ_MISC1;
    uint32_t                SEQ_MISC3;
    uint32_t                SEQ_MISC8;
    ARB_DRAM_TIMING_FORMAT  ARB_DRAM_TIMING;
    ARB_DRAM_TIMING2_FORMAT ARB_DRAM_TIMING2;
};
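
Since each member above is just the image of one 32-bit register, a raw 48-byte strap can be unpacked into twelve register words before decoding any bit fields.  This is only a sketch of that idea (the strap_to_regs helper is mine, and it assumes the little-endian byte order used for the hex straps shown later):
#include <stdint.h>

#define STRAP_BYTES 48
#define STRAP_REGS  (STRAP_BYTES / 4)   /* 12 memory controller registers */

/* Unpack a raw 48-byte strap into 12 little-endian 32-bit register values. */
static void strap_to_regs(const uint8_t strap[STRAP_BYTES],
                          uint32_t regs[STRAP_REGS])
{
    for (int i = 0; i < STRAP_REGS; i++) {
        regs[i] = (uint32_t)strap[4*i]
                | (uint32_t)strap[4*i + 1] << 8
                | (uint32_t)strap[4*i + 2] << 16
                | (uint32_t)strap[4*i + 3] << 24;
    }
}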

The RAS timing register consists of 6 fields: RCDW, RCDWA, RCDR, RCDRA, RRD, and RC.  The full field definitions can be found in my fork of Kristy-Leigh's code.  Many of the "pad" fields are likely the high bits of the preceding field that are not currently used.  I have already tested a couple of pad fields (MISC RP_RDA & RP), confirming that the pad bits were actually the high bits of the fields.
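
One way to test whether a pad is really the upper bits of its neighbour is to write values that only make sense if the two form a single wider field, then check that the card still behaves accordingly.  The helpers below are a generic sketch for reading and writing a bit range in a 32-bit register value; the RRD shift and width in the example are placeholders rather than confirmed layout, so take the real positions from the *_FORMAT definitions:
#include <stdint.h>
#include <stdio.h>

/* Extract a width-bit field starting at bit 'shift' from a register value. */
static uint32_t get_bits(uint32_t reg, unsigned shift, unsigned width)
{
    return (reg >> shift) & ((1u << width) - 1);
}

/* Return 'reg' with the same field replaced by 'value'. */
static uint32_t set_bits(uint32_t reg, unsigned shift, unsigned width,
                         uint32_t value)
{
    uint32_t mask = ((1u << width) - 1) << shift;
    return (reg & ~mask) | ((value << shift) & mask);
}

int main(void)
{
    uint32_t ras = 0x12345678;               /* placeholder register value  */
    unsigned rrd_shift = 20, rrd_width = 4;  /* hypothetical field position */

    printf("RRD = %u\n", get_bits(ras, rrd_shift, rrd_width));
    ras = set_bits(ras, rrd_shift, rrd_width, 5);    /* try RRD = 5 */
    printf("patched register = 0x%08X\n", (unsigned)ras);
    return 0;
}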


For GDDR5, some timing values have both Long and Short versions: the long value applies to accesses within the same bank group, and the short value to accesses to different bank groups.  The RRD field of RAS timing is likely RRDL, because the values typically seen for this field are 5 and 6.  If RRDS were 5, this would mean at most one page could be opened every five cycles, limiting 32-byte random read performance to 2/5 or 40% of the maximum interface speed.  From my work with Ethereum mining, I know that RRDS can be no more than 4.  In addition, performance tests with the RRD timing reduced from 6 to 5 are consistent with it being RRDL.  The actual value of RRDS used by the memory controller does not seem to be contained in the timing strap.  The default 1750MHz strap for Samsung K4G4 memory has a value of 10 for FAW, and FAW is normally no less than 4 * RRDS.  Therefore RRDS is most likely less than 4, and possibly as low as 2.
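
To make the 2/5 figure concrete: a 32-byte access is a single burst occupying two command-clock cycles, and a fully random read needs a new page (activate) each time, so throughput is capped at burst_cycles / max(RRD, burst_cycles).  A quick check of that arithmetic for a few RRD values:
#include <stdio.h>

int main(void)
{
    const int burst_cycles = 2;   /* one 32-byte GDDR5 burst, in command clocks */

    for (int rrd = 2; rrd <= 6; rrd++) {
        int limit = (rrd > burst_cycles) ? rrd : burst_cycles;
        printf("RRD=%d: %.0f%% of peak\n", rrd, 100.0 * burst_cycles / limit);
    }
    return 0;
}

RRD=5 gives the 40% mentioned above; at RRD=4 the limit rises to 50%.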

To simplify the process of modifying memory straps for improved performance, I wrote strapmod.  I also wrote a CGI wrapper for the program, which you can run from my server at http://45.62.227.192/cgi-bin/strapmod.  For example, this is the output for the 1750MHz strap for Samsung K4G4 memory:
Rx strap detected
Old, new RRD: 6 , 5
Old, new FAW: A , 0
Old, new 32AW: 7 , 0
Old, new ACTRD: 19 , 0x10
777000000000000022CC1C0010626C49D0571016B50BD509004AE700140514207A8900A003000000191131399D2C3617
777000000000000022CC1C0010625C49D0571016B50BD50900400700140514207A8900A003000000101131399D2C3617
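
The two hex lines are the 48-byte strap before and after modification.  If you want to double-check exactly which register words a tool changed before flashing a BIOS, diffing the two strings is enough.  This is only an illustrative sketch (it assumes little-endian byte order within each 32-bit word), not part of strapmod:
/* Decode two 96-character hex straps and report which of the twelve
 * 32-bit register words differ. */
#include <stdint.h>
#include <stdio.h>

static uint32_t word_at(const char *hex, int i)
{
    uint32_t w = 0;
    for (int b = 3; b >= 0; b--) {           /* little-endian: last byte is MSB */
        unsigned byte;
        sscanf(hex + 8*i + 2*b, "%2x", &byte);
        w = (w << 8) | byte;
    }
    return w;
}

int main(void)
{
    const char *old_strap = "777000000000000022CC1C0010626C49D0571016B50BD509"
                            "004AE700140514207A8900A003000000191131399D2C3617";
    const char *new_strap = "777000000000000022CC1C0010625C49D0571016B50BD509"
                            "00400700140514207A8900A003000000101131399D2C3617";

    for (int i = 0; i < 12; i++) {
        uint32_t o = word_at(old_strap, i), n = word_at(new_strap, i);
        if (o != n)
            printf("reg %2d: %08X -> %08X\n", i, (unsigned)o, (unsigned)n);
    }
    return 0;
}

For the straps above it reports three changed words, consistent with the four field changes strapmod printed (two of them evidently landing in the same register).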