Monday, April 20, 2015

Adapting an ESP-01 module for breadboard use

While esp8266 ESP-01 modules can easily be programmed after hooking up some dupont jumpers to a USB-TTL module, using them on a breadboard without an adapter or modification is not possible.  The obvious method is to make an adapter with a 2x4-pin female header, some stripboard, and two 1x4 male headers.  I thought of an even simpler way of adapting the ESP-01 for a breadboard without using any extra parts.

Of the 8 pins on the ESP-01, the CH_PD pin should be permanently tied to Vcc, so only 3 of the four middle pins are needed.  If you use my zero-wire reset solution, only the GPIO0 and GPIO2 pins are needed.  To modify the ESP-01 for breadboard use, heat up the CH_PD pin with a soldering iron, then pull it out with a pair of needle-nose pliers once the solder is liquid.  Then solder a short wire from Vcc to the CH_PD pad.  Next heat up the remaining 3 middle pins, and push them until they stick up out of the PCB.  To do this I put my needle-nose pliers under the pins, then pushed down on the module.  If you go too fast and get lumps of solder on the pins, add some flux and reheat the solder to level it out so jumper wires can plug smoothly onto the pins.  If you're not using my reset solution, I still recommend a capacitor on RST as it will reduce or eliminate spurious resets.  The RST line on the esp8266 is very sensitive, at least compared to RST on 8-bit AVR MCUs.  When you are done, your module should look like the one below, and can easily plug into a breadboard.

Wednesday, April 15, 2015

Continuous Integration: the wonderful world of free build servers

While I was contributing to the esp8266/Arduino project, Ivan posted a link to a test build using Appveyor.  After a bit of research, I learned that there is a whole slew of companies offering cloud-based build servers in this space, called continuous integration.  More impressive is that most companies in this space offer it free for open-source projects.

When I first started writing software it was in BASIC and assembler on a Commodore 64.  When writing small programs on fixed-configuration systems like that, the development cycle was reasonably quick, with even "large" assembler programs taking seconds to build.  Deployment testing was simple as well; if it worked on my C64 it would work on everyone else's.  As computers got bigger and more complex, so did the development cycle.  While working on large projects at places like Nortel, full system builds could take several hours or even days.  Being able to get quick feedback on small code changes is very important to software development productivity.  The availability of low-cost, on-demand services like Amazon Web Services has enabled companies in the CI space to offer build services with minimal infrastructure investment.

Who's Who

The esp8266/Arduino project uses Appveyor for Windows builds and Travis for Linux builds.  Other CI companies offering Linux-based build services include Drone, Snap CI, and my favorite, Codeship.  All of these companies offer some level of service for free to open-source developers, so I decided to try all four of the Linux-based CI services.

For my work with embedded systems, I have been writing build scripts for avr-gcc, which I intend to extend to building a gcc cross-compiler for the xtensa lx106 CPU on the esp8266.  Full builds of binutils, avr-gcc, and avr-libc take a few hours on an Intel Core i5, so getting a working build was a slow process.  Having a large build like this also turned out to be helpful differentiating between the different CI services.

One thing all the CI services have in common is they make it easy to set up an account and try their service if you already have a github account.  With Google Code shutting down, everyone in open-source development should have a github account already.  While the CI services support a number of different languages, I was only concerned with C++ using the gcc compiler.  All the servers had gcc >= 4.4 and typical tools like autoconf, flex, and bison.


Travis was the first CI service I tried, and it turned out to be one of the more complicated.  In order to get Travis to set up for your build (set environment variables, download dependencies), you need to make a .travis.yml file in the root of your repository.  The format is similar to a shell script, so it wasn't too hard to figure out.  After a bit of experimenting I was able to get a build started.

From some of the posts I read online, I was concerned whether the build would complete in the allowed time of 50 minutes.  The problem I ended up having was not build time but build output.  After 4MB of log output, Travis terminates the build.  If my build failed I wanted to see where in the build process the problem occurred.  So I turned off minor log output from things like tar extracting dependency libraries, but I still hit the 4MB limit.

Another problem you might have with large amounts of log output relates to how your browser handles it.  Firefox started freezing on me when I tried to view a 4MB log file, but Chrome was OK.

Drone's service was easier to set up, allowing a shell script to be written in their web interface, which would be run to start your build.  Drone has a limit of only 15 minutes on free builds, which turned out to be the showstopper with their service.


I almost missed the boat on Codeship since they don't even list C/C++ in their supported languages.  I guess a gcc installation is taken for granted for Linux-based CI.  Codeship, like Drone, allows you to write a build script in their web interface.  Unlike Travis and Drone, no sudo access is available on the build servers, so installing updated packages is not possible.  Since the servers have a reasonably recent version of gcc (4.8.2-19ubuntu1) and gnu tools, this was not a problem for me.  Their build servers (running on AWS) are nice and fast, with a full build, using make -j4, taking about 13 minutes.

Codeship doesn't seem to have any limit on build time, though they do have a 10MB log output limit.  Fortunately that is just a limit on what is displayed in the web interface, and the build does not stop.  The most impressive thing about Codeship's service is that they give you ssh access to a build server instance for debugging!  Clicking on Open SSH Debug Session gives you the IP and port to ssh into, assuming you've already updated your account with your ssh public key.

On the debug server, your code is already copied to the "clone" directory.  The servers are running Ubuntu 14.04.2 LTS, and seem to have un-throttled GigE ports, as a download of gcc 4.9.2 clocks in at 24MB/s, or roughly 200mbps.  With the debug server I was able to manually run my build, review config.log files, and copy files using scp to my computer for later review.

Snap CI

One way Snap CI differs from the other services is that their servers run CentOS instead of Ubuntu.  I started using RedHat before Debian and Ubuntu existed, and never had a good reason to leave rpm-based distributions, so I like the CentOS support.  The version of gcc on their servers (4.4.7) is rather old, but they enable sudo so you can upgrade that with a newer RPM.  They also have a limited shell interface called snap-shell.  It's not full ssh access like Codeship, but it does make it easy to check the environment by running things like "gcc --version".

Snap CI also uses AWS servers, and build times were very similar to Codeship.  If your builds require downloading a lot of prerequisite files, Snap may take a bit longer than Codeship though, as gcc 4.9.2 took about twice as long to download on Snap.


CI services eliminate the time and cost of setting up and maintaining build servers.  They simplify software testing by having a clean server instantiated for each build.  No more broken or incompatible builds because someone installed a custom library version on the build server that normal users don't have on their machine.  I closed my Drone account, and will probably close my Travis account too.  I'll keep using both Codeship and Snap, to be sure the software I'm working on can build on both Ubuntu and CentOS.  If I continue to support programs like picoboot-avrdude that build under Linux and MinGW, I'll also try out Appveyor.

Thursday, April 9, 2015

Building avr-gcc from source

Although 8-bit AVR MCUs are widely used, it is hard to find recent builds of avr-gcc.  The latest releases of gcc are 4.9.2 and 4.8.4, yet the latest release of the Atmel AVR Toolchain only includes gcc 4.8.1.  For CentOS 6, the most recent RPM I could find is 4.7.1.  To take advantage of the latest improvements to compiler features like link-time optimization, it is often necessary to build gcc from source.  When I first attempted to build gcc for AVR targets, I quickly discovered it's not as simple as downloading the source then running "configure; make install".  Picking away at it over the course of several months, I've figured out how to do it.  This method should work for AVR targets, and with a small change to the build options, for other targets like ARM and MIPS.

Although both the avr-libc and gcc sites have some information on building gcc, I found both fell short of being concise and unambiguous.  The biggest source of problems I encountered was the other libraries that gcc requires for building.  GCC's prerequisites page states, "Several support libraries are necessary to build GCC, some are required, others optional."  I (mis)interpreted the list of libraries starting with GNU Multiple Precision Library as being required only if those features were enabled in the compiler.  In the end I was only able to build avr-gcc 4.9.2 when I included GMP, MPFR, and MPC.  The ISL library was not required.

Required source files

The first thing to download before GCC is gnu binutils, which includes utilities like objdump for disassembling files.  If you already have an earlier version of binutils, it is not necessary to build a new version.  For example, on my server I have Atmel AVR Toolchain 3.4.4, which includes avr-gcc 4.8.1 and binutils 2.24.  In order to build avr-gcc 4.9.2 I don't need to make a new build of binutils.  I did try building binutils 2.25 (the latest), but instead of debugging a compile error I decided to stick with 2.24.  For building binutils, the following configure options were sufficient (though perhaps not necessary):
-v --target=avr --quiet --enable-install-libbfd --with-dwarf2 --disable-werror CFLAGS="-Wno-format-security"

The next thing to download is GCC, followed by GMP, MPFR, and MPC.  I used GMP 5.1.3, MPFR 3.1.2, and MPC 1.0.3.  After extracting all the packages, links in the GCC source directory need to be made, named gmp, mpfr, and mpc respectively, linking to their source trees.  Then gcc can be configured with the following options before running make:
-v --target=avr --disable-nls --enable-languages="c,c++" --disable-libssp --with-dwarf2

Build script

Rather than downloading, extracting, and building gcc manually, I started with a build script made by Rod Moffitt and a couple other contributors.  To use it, first run the download script, which will fetch the source files, then run the build script.  After a long build process, the binaries will be in /usr/local/avr/bin/, which you should then add to your shell PATH variable.


You can download my build for Linux x86_64.  It's dynamically linked, and built on CentOS 6, so you may have to add some symlinks in /lib64 for other Linux distributions.

Monday, April 6, 2015

Zero-wire auto-reset for esp8266/Arduino

A little over a year ago I developed a zero-wire auto-reset solution for Arduino.  After I started using Arduino for the esp8266, I realized I could do the same thing with the ESP-01.

Flashing the esp8266

In order to download code to the esp8266 after reset, GPIO0 and GPIO15 must be low, and GPIO2 must be high.  The ESP-01 has GPIO15 grounded, and GPIO2 is set high after reset.  GPIO0 is pulled up to Vcc after reset, so in order to download code to the flash, it must be pulled low.  Although esptool-ck supports using RTS and DTR for flashing the esp8266, many cheap USB-TTL modules don't break out those lines.  With USB-TTL modules that break out DTR, the DTR line should be connected to GPIO0 in order to pull the line low after reset.  Otherwise GPIO0 needs to be grounded with a jumper or by connecting a push-button switch to ground.
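
The pin requirements above can be written down as a small decision function.  This is only a sketch for reasoning about the strapping pins on a PC; the name esp_boot_mode and the enum are made up for illustration and are not part of any esp8266 SDK.

```c
#include <stdbool.h>

/* Boot modes selected by the pin levels sampled at reset. */
typedef enum {
    BOOT_UART_DOWNLOAD,  /* GPIO15 low, GPIO0 low, GPIO2 high: flash over serial */
    BOOT_FLASH,          /* GPIO15 low, GPIO0 high, GPIO2 high: run code from SPI flash */
    BOOT_OTHER           /* any other combination, not covered in this post */
} boot_mode_t;

/* Hypothetical helper: classify the boot mode from the three pin levels. */
boot_mode_t esp_boot_mode(bool gpio15, bool gpio0, bool gpio2)
{
    if (!gpio15 && gpio2)
        return gpio0 ? BOOT_FLASH : BOOT_UART_DOWNLOAD;
    return BOOT_OTHER;
}
```

On an ESP-01, GPIO15 is grounded and GPIO2 is high after reset, so the only input that changes the result is GPIO0.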

The circuit

The auto-reset circuit I used on the esp8266 is a simplified version of the circuit I used with the pro mini.  It consists of just a 7.5K resistor between Rx and RST, and a 4.7uF capacitor between RST and Vcc.  The values are not critical, as long as the RC constant is between 10ms and 100ms, so if what you have on hand is a 15K resistor and 1uF capacitor, that should work fine.  A serial break signal is 250ms long, which is why I suggest an RC constant of less than 100ms to allow the capacitor to discharge and trigger a reset before the break signal ends.  If the RC constant is less than 10ms, a sequence of zero bytes transmitted to the esp8266 could unintentionally trigger a reset.  At 9600bps, each bit is 104.2us long, so 8 zero bits plus the start bit would last 938us.  Several zero-bytes in a row, even with the high voltage of the stop bit in between, could trigger a reset.  The esptool default upload speed is 115.2kbps, so unwanted resets are quite unlikely.
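
To check whether the parts you have on hand fall in that range, the arithmetic is easy to put in a few lines of C.  This is a sketch to run on a PC, not on the esp8266, and the function names are my own for illustration.

```c
/* RC time constant in milliseconds, for R in ohms and C in farads. */
double rc_constant_ms(double r_ohms, double c_farads)
{
    return r_ohms * c_farads * 1000.0;
}

/* Duration in microseconds of n_bits consecutive bits at the given bit rate. */
double bits_duration_us(unsigned n_bits, double bits_per_second)
{
    return n_bits * 1e6 / bits_per_second;
}

/* The rule of thumb from the text: keep the RC constant between 10ms and 100ms. */
int rc_in_range(double rc_ms)
{
    return rc_ms >= 10.0 && rc_ms <= 100.0;
}
```

The 7.5K and 4.7uF pair gives about 35ms and the 15K and 1uF pair gives 15ms, both inside the range.  Nine zero bits last 937.5us at 9600bps, but only about 78us at 115.2kbps.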

The auto-reset circuit has an added benefit of improving the stability of the esp8266 module.  The RST pin on the esp8266 is extremely sensitive.  Before I added the auto-reset circuit, simply touching a probe from my multimeter to the RST pin would usually reset the module, even when I tried adding a 15K pullup resistor to Vcc.  I would also get intermittent "espcomm_sync failed" messages when trying to upload code.  Since adding the auto-reset circuit, I can probe the RST pin without triggering a reset, and my uploads have been error-free.

Getting the updated esptool-ck

By the time you are reading this, Ivan may have already integrated my patch for esptool-ck.  If the issue is still open, then you can download the updated esptool-ck.  Extract the esptool.exe into hardware\tools\esp8266.  This version also includes support for 921.6kbps uploads, which can be enabled by putting esp01.upload.speed=921600 in hardware\esp8266com\esp8266\boards.txt.

Wednesday, April 1, 2015

A 4mbps shiftOut for esp8266/Arduino

Since I finished writing the fastest possible bit-banged SPI for AVR, I wanted to see how fast the ESP8266 is at bit-banging SPI.  The NodeMCU eLua interpreter I initially tested out on my ESP-01 has little hope of high-performance since it is at best byte-code compiled.  For a simple way to develop C programs for the ESP8266, I decided to use ESP8266/Arduino, using Jeroen's installer for my existing Arduino 1.6.1 installation.  Starting with a basic shiftOut function that worked at around 640kbps, I was able to write an optimized version that is six times faster at almost 4mbps.

I modified the spi_byte AVR C code to use digitalWrite(), and call it twice in loop():
void shiftOut(byte dataPin, byte clkPin, byte data)
{
  byte i = 8;
  do {
      digitalWrite(clkPin, LOW);
      digitalWrite(dataPin, LOW);
      if(data & 0x80) digitalWrite(dataPin, HIGH);
      digitalWrite(clkPin, HIGH);
      data <<= 1;       // next bit, MSB first
  } while(--i);
}

void loop() {
  shiftOut(DATA, CLOCK, 'h');
  shiftOut(DATA, CLOCK, 'i');
}

Since I don't have a datasheet for the esp8266 that provides instruction timing, and am just starting to learn the lx106 assembler code, I used my oscilloscope to measure the timing of the data line:

The time to shift out 8 bits of data is around 12.5us, for a speed of 640kbps.  Looking at the signal in more detail I could see that the time between digitalWrite(dataPin, LOW) and digitalWrite(dataPin, HIGH) was 425ns.  Rather than setting the data pin low, then setting it high if the bit to shift out was a 1, I changed the code to do a single digitalWrite based on the bit being a 0 or a 1:
void shiftOut(byte dataPin, byte clkPin, byte data)
{
  byte i = 8;
  do {
      digitalWrite(clkPin, LOW);
      if(data & 0x80) digitalWrite(dataPin, HIGH);
      else digitalWrite(dataPin, LOW);
      digitalWrite(clkPin, HIGH);
      data <<= 1;       // next bit, MSB first
  } while(--i);
}

This change increased the speed slightly to 770kbps.  Suspecting the overhead of calling digitalWrite as being a large part of the performance limitations, I looked at the source for the digitalWrite function.  If I could get the compiler to inline the digitalWrite function, I figured it would provide a significant speedup.  From my previous investigation of the performance of digitalWrite, I knew gcc's link-time optimization could do this kind of global inlining.  I enabled lto by adding -flto to the compiler options in platform.txt.  Unfortunately, the xtensa-lx106-elf build of gcc 4.8.2 does not yet support lto.

After looking at the source for the digitalWrite function, I could see that I could replace digitalWrite with calls to an esp8266 library function, GPIO_REG_WRITE:
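void shiftOutFast(byte data)
{
  // The W1TS (set) and W1TC (clear) register names are assumed from the esp8266 SDK
  byte i = 8;
  do {
      GPIO_REG_WRITE(GPIO_OUT_W1TC_ADDRESS, 1 << CLOCK);
      if(data & 0x80)
        GPIO_REG_WRITE(GPIO_OUT_W1TS_ADDRESS, 1 << DATA);
      else
        GPIO_REG_WRITE(GPIO_OUT_W1TC_ADDRESS, 1 << DATA);
      GPIO_REG_WRITE(GPIO_OUT_W1TS_ADDRESS, 1 << CLOCK);
      data <<= 1;
  } while(--i);
}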

This modified version was much faster - the oscilloscope screen shot at the beginning of this article shows the performance of shiftOutFast.  One bit time is 262.5ns, for a speed of 3.81mbps.  This would be quite adequate for driving a Nokia 5110 black and white LCD which has a maximum speed of 4mbps.


While 4mbps is fast enough for a low-resolution LCD display or some LEDs controlled by a shift register like the 74595, it's quite slow compared to the 80MHz clock speed of the esp8266.  Each bit, at 262.5ns, is taking 21 clock cycles.  I doubt the esp8266 supports modifying an I/O register in a single cycle like the AVR does, but it should be able to do it in two or three cycles.  While I don't have a proper datasheet for the esp8266, the Xtensa LX data book is a good start.  Combined with disassembling the compiled C, I should be able to further optimize the code, and maybe even figure out how to write the code in lx106 assembler.
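
The speed figures in this post all come from scope measurements, and the conversions are simple enough to sketch in C.  The helper names below are my own for illustration, and the code is meant to run on a PC.

```c
/* Bit rate in kbps, from the time in nanoseconds taken to shift out n_bits. */
double bitrate_kbps(double total_time_ns, unsigned n_bits)
{
    return n_bits * 1e6 / total_time_ns;
}

/* CPU clock cycles spent per bit, from the bit time and the CPU clock in MHz. */
double cycles_per_bit(double bit_time_ns, double cpu_mhz)
{
    return bit_time_ns * cpu_mhz / 1000.0;
}
```

With the measurements above, 8 bits in 12.5us works out to 640kbps, and one bit in 262.5ns works out to about 3810kbps, or 21 cycles per bit at 80MHz.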

Monday, March 30, 2015

ESP8266 SPI flash performance

Despite the popularity of the ESP8266, I have yet to see a detailed datasheet published.  Nava Whiteford, on his blog, has links to a summary datasheet and to the Cadence Tensilica core that the chip is based on.  None of this provides any details on how the memory controller pages in data from the SPI flash, nor the speed of the communications.  About all that is clear from the datasheet and chip markings is that it uses a quad-SPI serial flash chip.

I decided to find out the performance of the SPI flash, as well as get an idea of what the cache line fill size of the chip is.  By looking at the pin-out of the flash chip, I determined that pin 6 is the clock.  After some probing and playing with the settings on my scope, I captured the clock burst shown above.

Based on the 500ns horizontal scale, the clock burst lasts a little more than 2us.  Zooming in shows that the clock is exactly 40MHz, or half of the ESP8266 80MHz clock and half of the maximum 80MHz speed rating of the SPI flash.  Given the burst lasted a little more than 2us, the total number of clock pulses is in the range of 85-90.  Accounting for the overhead of commands to enable quad SPI mode and address setup, it seems the burst corresponds to reading 32 bytes from the flash, and therefore the cache line size is likely 32 bytes.
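
The estimate can be sketched in C.  The function names are my own, and the 2150ns burst length used in the note below is only an assumed reading of "a little more than 2us", not a measured value.

```c
/* Number of clock pulses in a burst, from its length in ns and the clock in MHz. */
double burst_clocks(double burst_ns, double clock_mhz)
{
    return burst_ns * clock_mhz / 1000.0;
}

/* Clocks needed to move n_bytes when 4 bits are transferred per clock (quad SPI). */
unsigned quad_data_clocks(unsigned n_bytes)
{
    return n_bytes * 8 / 4;
}
```

A 2150ns burst at 40MHz is 86 clocks.  Reading 32 bytes in quad mode needs 64 of them, which leaves roughly 21 to 26 clocks of the 85-90 for the command and address overhead.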


The clock signal is clean, and with a rise + fall time of 11.1ns, could be increased to 90MHz without significant distortion or attenuation.  With documentation on the registers to change the clock speed to 160MHz, the ESP8266 could be run at double speed without overclocking the SPI flash.

Sunday, March 29, 2015

Fastest AVR software SPI in the West

Most AVR MCUs like the ATmega328p have a hardware I/O shift register, but only on a fixed set of pins.  Arduino's shiftOut function is horribly slow, so a number of faster implementations have been written.  I'll look at how fast they are, and explain an implementation in AVR assembler that's faster than any C implementation, and I'll even claim that it is the fastest software SPI for the AVR.

Adafruit spiWrite

void spiWrite(uint8_t data)
{
 uint8_t bit;
 for(bit = 0x80; bit; bit >>= 1) {
  SPIPORT &= ~clkpinmask;
  if(data & bit) SPIPORT |= mosipinmask;
  else SPIPORT &= ~mosipinmask;
  SPIPORT |= clkpinmask;
 }
}

This code comes from the Adafruit Nokia 5110 LCD library.  It's a bit odd because it doesn't use a loop counting down from 8 to 0 for the bits to be shifted; instead it shifts a mask bit through the 8 bits in a byte.  While it is much faster than Arduino's shiftOut because it uses direct port manipulation instead of the slow digitalWrite, it's far from an optimal implementation in C.  I compiled the code using avr-gcc 4.8 with -Os, and disassembled the code using avr-objdump -D:
00000000 <spiWrite>:
   0:   28 e0           ldi     r18, 0x08       ; 8
   2:   30 e0           ldi     r19, 0x00       ; 0
   4:   90 e8           ldi     r25, 0x80       ; 128
   6:   2d 98           cbi     0x05, 5 ; 5
   8:   49 2f           mov     r20, r25
   a:   48 23           and     r20, r24
   c:   11 f0           breq    .+4             ; 0x12
   e:   2c 9a           sbi     0x05, 4 ; 5
  10:   01 c0           rjmp    .+2             ; 0x14
  12:   2c 98           cbi     0x05, 4 ; 5
  14:   2d 9a           sbi     0x05, 5 ; 5
  16:   96 95           lsr     r25
  18:   21 50           subi    r18, 0x01       ; 1
  1a:   31 09           sbc     r19, r1
  1c:   21 15           cp      r18, r1
  1e:   31 05           cpc     r19, r1
  20:   91 f7           brne    .-28            ; 0x6
  22:   08 95           ret

The loop for each bit compiles to 14 instructions, and takes 17 clock cycles for a 0 and 18 for a 1.  Although most AVR instructions take a single cycle, the cbi and sbi instructions for clearing and setting a single bit take two cycles.  Branches, when taken, are also two-cycle instructions.
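
The tally can be reproduced by walking the disassembly one instruction at a time.  This is a sketch to run on a PC; the function name is my own, and the 2-cycle cost for rjmp is taken from the AVR instruction set rather than from the text above.

```c
/* Cycle costs for the instructions in the spiWrite loop. */
enum {
    CYC_SIMPLE = 1,            /* mov, and, lsr, subi, sbc, cp, cpc */
    CYC_BIT_IO = 2,            /* cbi, sbi */
    CYC_BRANCH_TAKEN = 2,      /* breq or brne when the branch is taken */
    CYC_BRANCH_NOT_TAKEN = 1,  /* breq or brne when falling through */
    CYC_RJMP = 2               /* rjmp */
};

/* Cycles for one pass of the disassembled spiWrite loop; bit is the value sent. */
unsigned spiwrite_loop_cycles(int bit)
{
    unsigned cycles = 0;
    cycles += CYC_BIT_IO;                /* cbi: clk low */
    cycles += 2 * CYC_SIMPLE;            /* mov, and */
    if (bit) {
        cycles += CYC_BRANCH_NOT_TAKEN;  /* breq falls through */
        cycles += CYC_BIT_IO;            /* sbi: mosi high */
        cycles += CYC_RJMP;              /* rjmp over the cbi */
    } else {
        cycles += CYC_BRANCH_TAKEN;      /* breq taken */
        cycles += CYC_BIT_IO;            /* cbi: mosi low */
    }
    cycles += CYC_BIT_IO;                /* sbi: clk high */
    cycles += 5 * CYC_SIMPLE;            /* lsr, subi, sbc, cp, cpc */
    cycles += CYC_BRANCH_TAKEN;          /* brne back to the top of the loop */
    return cycles;
}
```

The count assumes the final brne is taken, which is true for every bit except the last one in the byte.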

Generic spi_byte

void spi_byte(uint8_t byte){
    uint8_t i = 8;
    do{
        SPIPORT &= ~mosipinmask;
        if(byte & 0x80) SPIPORT |= mosipinmask;
        SPIPORT |= clkpinmask;  // clk hi
        byte <<= 1;
        SPIPORT &=~ clkpinmask; // clk lo
    }while(--i);
}


I've seen variants of this code used not just for AVR, but also for PIC MCUs.  It is faster than the Adafruit code in part because the loop counts down to zero, and as experienced coders know, on almost every CPU, counting up from zero to eight is slower than counting down to zero.  The disassembly shows the code to be 40% faster than the Adafruit code, taking 12 cycles for a 0 and 13 for a 1.
00000024 <spi_byte>:
  24:   98 e0           ldi     r25, 0x08       ; 8
  26:   2c 98           cbi     0x05, 4 ; 5
  28:   87 fd           sbrc    r24, 7
  2a:   2c 9a           sbi     0x05, 4 ; 5
  2c:   2d 9a           sbi     0x05, 5 ; 5
  2e:   88 0f           add     r24, r24
  30:   2d 98           cbi     0x05, 5 ; 5
  32:   91 50           subi    r25, 0x01       ; 1
  34:   c1 f7           brne    .-16            ; 0x26
  36:   08 95           ret

AVR optimized in C

Looking at the assembler code, half of the loop time is taken by the cbi and sbi two-cycle instructions.  The key to further speed optimizations is to write code that will compile to single-cycle out instructions instead.  The mosi and clk pins can be cleared by reading the port state before the loop, then writing the 8 bits of the port with mosi and clk cleared:
    uint8_t portbits = (SPIPORT & ~(mosipinmask | clkpinmask) );
        SPIPORT = portbits;      // clk and data low

This also saves having to clear the clk pin at the end of the loop, for a total savings of 3 cycles.  With this technique, the time per bit can be reduced to 9 cycles.  By using the AVR PIN register, another cycle can be shaved off the loop.  The datasheet does not describe the PIN register in detail, stating little more than, "Writing a logic one to PINxn toggles the value of PORTxn".  What this means, for example, is that writing 0x81 to PINB will toggle the state of bit 0 and bit 7, leaving the rest of the bits unchanged.  Here's the final code:
void spi_byteFast(uint8_t byte){
    uint8_t i = 8;
    uint8_t portbits = (SPIPORT & ~(mosipinmask | clkpinmask) );
    do{
        SPIPORT = portbits;      // clk and data low
        if(byte & 0x80) SPIPIN = mosipinmask;
        SPIPIN = clkpinmask;     // toggle clk
        byte <<= 1;
    }while(--i);
}

The disassembly shows that although the code size has increased, the loop for transmitting a bit takes only 8 cycles.  More speed can be obtained at the cost of code size by having the compiler unroll the loop (enabled with -O3 in gcc).  This would reduce the time per bit to 5 cycles.
00000050 <spi_byteFast>:
  50:   25 b1           in      r18, 0x05       ; 5
  52:   2f 7c           andi    r18, 0xCF       ; 207
  54:   98 e0           ldi     r25, 0x08       ; 8
  56:   40 e1           ldi     r20, 0x10       ; 16
  58:   30 e2           ldi     r19, 0x20       ; 32
  5a:   25 b9           out     0x05, r18       ; 5
  5c:   87 fd           sbrc    r24, 7
  5e:   43 b9           out     0x03, r20       ; 3
  60:   33 b9           out     0x03, r19       ; 3
  62:   88 0f           add     r24, r24
  64:   91 50           subi    r25, 0x01       ; 1
  66:   c9 f7           brne    .-14            ; 0x5a
  68:   08 95           ret
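
The PIN-register trick is easy to get wrong, so here is a small model of it that runs on a PC.  It does not touch real hardware; sim_port_t and the sim_ functions are invented for this sketch, and the receiver is assumed to sample mosi on the rising clk edge.

```c
#include <stdint.h>

/* Host-side model of one AVR I/O port; not real hardware access. */
typedef struct { uint8_t port; } sim_port_t;

/* Writing to PORTx replaces all eight output bits. */
void sim_write_port(sim_port_t *p, uint8_t value) { p->port = value; }

/* Writing ones to PINx toggles the matching PORTx bits and leaves the rest alone. */
void sim_write_pin(sim_port_t *p, uint8_t mask) { p->port ^= mask; }

/* Model of spi_byteFast: returns the byte a receiver would see on the mosi line. */
uint8_t sim_spi_byte_fast(sim_port_t *p, uint8_t byte,
                          uint8_t mosipinmask, uint8_t clkpinmask)
{
    uint8_t received = 0;
    uint8_t i = 8;
    uint8_t portbits = p->port & (uint8_t)~(mosipinmask | clkpinmask);
    do {
        sim_write_port(p, portbits);                     /* clk and data low */
        if (byte & 0x80) sim_write_pin(p, mosipinmask);  /* raise data for a 1 */
        sim_write_pin(p, clkpinmask);                    /* toggle clk high */
        received = (uint8_t)((received << 1) | ((p->port & mosipinmask) ? 1 : 0));
        byte = (uint8_t)(byte << 1);
    } while (--i);
    return received;
}
```

Toggling 0x81 on a port holding 0xC3 leaves 0x42, matching the PINB example above, and the pins outside the two masks keep their state through a whole byte transfer.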


I learned to code in assembler (6502) over thirty years ago, and started to learn C a few years after that.  When gcc was first released in 1987, it generated code that was much larger and slower than assembler.  Although it has improved significantly over the years, what surprises me is that it or any other C compiler still rarely matches hand-optimized assembler code.  You might think that there's nothing left to optimize out of the 7 instructions that make up the loop above, but by making use of the carry flag, I can eliminate the loop counter.  That saves a register and reduces the loop time to 7 cycles from 8:
spiFast:
    in r18, SPIPORT     ; save port state
    andi r18, ~(mosipinmask | clkpinmask)
    ldi r20, mosipinmask
    ldi r19, clkpinmask
    lsl r24
    ori r24, 0x01       ; 9th bit marks end of byte
spiBit:
    out SPIPORT, r18
    brcc zeroBit
    out SPIPORT-2, r20  ; PORT address -2 is PIN
zeroBit:
    lsl r24
    out SPIPORT-2, r19  ; clk hi
    brne spiBit
    ret
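
The carry-flag trick is easier to follow in C.  The sketch below models the register and carry flag on a PC; the function name and the out[] array are my own, and this is not code meant for the AVR.

```c
#include <stdint.h>

/* Model of the carry-flag loop: stores the bits MSB first in out[] and
   returns how many times the loop body ran. */
unsigned sentinel_shift(uint8_t data, uint8_t out[8])
{
    unsigned count = 0;
    unsigned carry = (data >> 7) & 1;          /* lsl r24: MSB goes to carry */
    uint8_t reg = (uint8_t)((data << 1) | 1);  /* ori r24, 0x01: marker bit */
    do {
        if (count < 8) out[count] = (uint8_t)carry;  /* mosi follows carry */
        count++;
        carry = (reg >> 7) & 1;                /* lsl r24: next bit to carry */
        reg = (uint8_t)(reg << 1);
    } while (reg != 0);                        /* brne: stop once the marker is shifted out */
    return count;
}
```

Because the marker bit is the last 1 to leave the register, the loop always runs exactly 8 times, even when the data byte is zero.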

When looking for fast software SPI code, the best I could find was 8 cycles per bit.  I read a couple posts on AVRfreaks claiming 7 cycles is possible, but no code was posted.  Unrolled, the above assembler code is still 5 cycles per bit, the same as the optimized C version.  So to back up my claim about the fastest code and hand-optimized assembler being better than the compiler, I need to reduce the timing to 4 cycles per bit.  I can do it using the AVR's T flag, with the bst and bld instructions that transfer a single bit between the T flag and a register.
spiFast:
    in r25, SPIPORT     ; save port state
    andi r25, ~clkpinmask
    ldi r19, clkpinmask
halfByte:
    bst r24, 7
    bld r25, MOSI
    out SPIPORT, r25    ; clk low + data
    out SPIPORT-2, r19  ; clk hi
    bst r24, 6
    bld r25, MOSI
    out SPIPORT, r25    ; clk low + data
    out SPIPORT-2, r19  ; clk hi
    bst r24, 5
    bld r25, MOSI
    out SPIPORT, r25    ; clk low + data
    out SPIPORT-2, r19  ; clk hi
    bst r24, 4
    bld r25, MOSI
    out SPIPORT, r25    ; clk low + data
    out SPIPORT-2, r19  ; clk hi
    swap r24            ; low nibble moves up for the second pass
    eor r1, r19         ; r1 is zero reg
    brne halfByte
    ret

The loop is half unrolled, doing two loops of 4 bits, with the function using a total of 23 instructions.  Fully unrolled the function would use 35 instructions/cycles (plus return), saving 4 cycles over the half unrolled version.


Including overhead, the spiFast assembler code is just under half the speed of hardware SPI running at full speed (2 cycles per bit).  With the assistance of a hardware timer to generate the clock, and a port dedicated to just the mosi line, it's theoretically possible to output one bit every two cycles using a sequence of lsl and out instructions.  But for a fully software implementation that doesn't modify anything other than the mosi and clk bits, you won't find anything faster than 4 cycles per bit.  Copies of the code are available on my github repo: spi.S and spiWrite.c.