Friday, April 25, 2014

A better digitalWrite for Arduino

Although I do a lot of AVR programming directly using avr-gcc, I sometimes use the Arduino IDE.  Some of the code I write could be useful to other people, so I'll try to write it so it can easily work in the Arduino IDE.  Sometimes I want to use a library that is available for Arduino, but not as a standalone AVR library.

One of the common complaints about the Arduino framework is the poor performance of the digitalWrite function.  Jan Dolinay has done an analysis of digitalWrite which shows just how bad it is.  As Jan points out though, if the pin passed to digitalWrite is not declared constant, the big version of digitalWrite is used.  If the pin number is never changed, then shouldn't there be a way to get the short and fast code, even though the pin number is not declared const?  My first idea was that some template tricks might work, but I couldn't come up with anything.

The breakthrough came when I read about link time optimization, and realized it is smart enough to recognize constant variables that are not declared const.  Another hurdle is that the Arduino IDE uses gcc 4.3, which doesn't support LTO.  That should be out of the way with an upcoming release of version 1.5 of the IDE that will include avr-gcc 4.8.1.  Knowing that, I started thinking about how the ideal digitalWrite function will work.

First off, it will need to be different from the technique used in the Wiring framework.  Wiring uses __builtin_constant_p( pin ) to select the fast version when the pin is constant, and calls the slow version when it is not.  That test depends on the parameter being declared const, not on whether LTO later works out that the parameter is constant in practice.  I couldn't tell this from the gcc documentation, but the results of a test program I wrote were clear.  So my goal is to write a simple digitalWrite that is easy for the compiler to optimize.  Here's what I'm starting with:
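#include <avr/io.h>

/* Placeholder: a single fixed port for now (0x18 in the disassembly below is PORTB on an ATtiny85). */
#define IOPORT PORTB

void digitalWrite(uint8_t pin, uint8_t val)
{
    if (val & 0x1)
        (IOPORT |= (1<<pin));
    else
        (IOPORT &= ~(1<<pin));
}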
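
For contrast, here is a minimal sketch of the __builtin_constant_p dispatch idea described above.  It is not Wiring's actual code: the names digital_write, digital_write_fast, digital_write_slow and fake_port are made up for illustration, and it assumes every pin is on PORTB.  The host fallback is only there so the sketch can be built and checked off-target.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>
#define IOPORT PORTB            /* assumption: every pin is on PORTB */
#else
volatile uint8_t fake_port;     /* host stand-in so the sketch builds off-target */
#define IOPORT fake_port
#endif

/* Slow path: stands in for the large general-purpose library routine. */
void digital_write_slow(uint8_t pin, uint8_t val)
{
    if (val & 0x1)
        IOPORT |= (uint8_t)(1u << pin);
    else
        IOPORT &= (uint8_t)~(1u << pin);
}

/* Fast path: always inlined, so a literal pin number can fold into sbi/cbi. */
static inline __attribute__((always_inline))
void digital_write_fast(uint8_t pin, uint8_t val)
{
    if (val & 0x1)
        IOPORT |= (uint8_t)(1u << pin);
    else
        IOPORT &= (uint8_t)~(1u << pin);
}

/* Pick a path depending on whether the compiler sees a constant
   at this particular call site. */
#define digital_write(pin, val)                     \
    (__builtin_constant_p(pin)                      \
         ? digital_write_fast((pin), (val))         \
         : digital_write_slow((pin), (val)))
```

Both paths have the same effect on the port; the only difference is which one the compiler selects.  The selection is made per call site when that file is compiled, which is why this approach is a poor fit for constants that only become visible at link time.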

My digitalWrite is a simplified version to start with; later I'll add macros for IOPORT similar to those used in the Wiring framework.  I had already found that LTO will optimize a port assignment to a single sbi, so it was no surprise when:
int clkPin = 5;

void main(void)
{
    digitalWrite(clkPin,0);
}
compiled to a single cbi instruction (the value is 0, so the bit is cleared).  While the pin used in a sketch is almost always known at compile time, the value might not be.  One example would be reading a sensor, then writing the bits to a pin attached to an ASK/OOK transmitter.  To ensure the compiler won't know what the value is, I read it from an IO register.  Here's the code:
int main(void)
{
    uint8_t data = GPIOR0;
    digitalWrite(3,data);
    GPIOR1 = data;
}
Which compiles to:
00000030 <main>:
  30:   81 b3           in      r24, 0x11       ; 17
  32:   80 ff           sbrs    r24, 0
  34:   02 c0           rjmp    .+4             ; 0x3a <main+0xa>
  36:   c3 9a           sbi     0x18, 3 ; 24
  38:   01 c0           rjmp    .+2             ; 0x3c <main+0xc>
  3a:   c3 98           cbi     0x18, 3 ; 24
  3c:   82 bb           out     0x12, r24       ; 18
  3e:   08 95           ret

If the optimizer were a bit better, the digitalWrite code could be compiled to 4 instead of 5 instructions - sbrs, cbi, sbrc, sbi.  I played with different ways of writing the if condition, but the compiler always generated 5 instructions.  The code is still good (5 or 6 cycles), but if anyone knows how to get it compiled to 4 instructions, please leave a comment.
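
One workaround, rather than a way to make the compiler do it, is to write the four-instruction sequence by hand.  The sketch below is untested on hardware and makes some assumptions: the function name write_bit0_to_pb3 is made up, the port and bit (PORTB, bit 3) are hard-coded to match the disassembly above, and the non-AVR branch is just a plain C equivalent so the sketch can be built on a PC.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>

/* Copy bit 0 of val to PORTB bit 3 using skip instructions, no jumps. */
static inline void write_bit0_to_pb3(uint8_t val)
{
    __asm__ volatile(
        "sbrs %[v], 0    \n\t"   /* skip the cbi when bit 0 is set   */
        "cbi  %[port], 3 \n\t"
        "sbrc %[v], 0    \n\t"   /* skip the sbi when bit 0 is clear */
        "sbi  %[port], 3 \n\t"
        :
        : [v] "r"(val), [port] "I"(_SFR_IO_ADDR(PORTB)));
}
#else
/* Host fallback with the same effect on a stand-in variable. */
volatile uint8_t fake_portb;

static inline void write_bit0_to_pb3(uint8_t val)
{
    if (val & 0x1)
        fake_portb |= (uint8_t)(1u << 3);
    else
        fake_portb &= (uint8_t)~(1u << 3);
}
#endif
```

The cost is that the compiler can no longer see through the asm block, so this only makes sense for a pin that really is fixed.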

In my next post I'll add the macros for mapping the pin to a port and mask, and see what code is generated when pin number is determined at runtime.

Starting next week (May 2014), the beta 4.8.1 nightly build of the Arduino IDE should include LTO in the build.  Even without changes to the digitalWrite function, most code should be smaller and faster.
http://downloads.arduino.cc/arduino-avr-toolchain-nightly-gcc-4.8.1-windows.zip
http://downloads.arduino.cc/arduino-avr-toolchain-nightly-gcc-4.8.1-linux64.tgz

Saturday, April 19, 2014

gcc link time optimization can fix bad programming practices

I like to write efficient code, but realize that many people relatively new to C programming don't understand some of the simple ways to write fast and efficient code.  Even some popular sites don't demonstrate best practices in their code, such as the blinking LED Arduino tutorial.  I've written a generic AVR C version:
#include <avr/io.h>
#include <util/delay.h>

int LEDPIN = 1;
void main(void)
{
  while (1) {
    PINB |= (1<<LEDPIN);
    _delay_ms(500);
  }
}

I build it with avr-gcc 4.8.0:
avr-gcc -mmcu=attiny85 -DF_CPU=8000000 -Os -Wall -Wno-main blink.c -o blink

Then I check the size:
$ avr-size blink
  text    data     bss     dec     hex filename
   122       2       0     120      7e blink

To show what the problem is, here's part of the disassembled code (avr-objdump -D blink):
0000002a <__do_copy_data>:
  2a:   10 e0           ldi     r17, 0x00       ; 0
  2c:   a0 e6           ldi     r26, 0x60       ; 96
  2e:   b0 e0           ldi     r27, 0x00       ; 0
  30:   e4 e7           ldi     r30, 0x74       ; 116
  32:   f0 e0           ldi     r31, 0x00       ; 0
  34:   02 c0           rjmp    .+4             ; 0x3a <__do_copy_data+0x10>
  36:   05 90           lpm     r0, Z+
  38:   0d 92           st      X+, r0
  3a:   a2 36           cpi     r26, 0x62       ; 98
  3c:   b1 07           cpc     r27, r17
  3e:   d9 f7           brne    .-10            ; 0x36 <__do_copy_data+0xc>
  40:   02 d0           rcall   .+4             ; 0x46 <main>
  42:   16 c0           rjmp    .+44            ; 0x70 <_exit>

00000046 <main>:
  46:   21 e0           ldi     r18, 0x01       ; 1
  48:   30 e0           ldi     r19, 0x00       ; 0
  4a:   46 b3           in      r20, 0x16       ; 22
  4c:   c9 01           movw    r24, r18
  4e:   00 90 60 00     lds     r0, 0x0060
  52:   02 c0           rjmp    .+4             ; 0x58 <main+0x12>
  54:   88 0f           add     r24, r24
  56:   99 1f           adc     r25, r25
  58:   0a 94           dec     r0
  5a:   e2 f7           brpl    .-8             ; 0x54 <main+0xe>
  5c:   48 2b           or      r20, r24
  5e:   46 bb           out     0x16, r20       ; 22
  60:   4f ef           ldi     r20, 0xFF       ; 255
  62:   84 e3           ldi     r24, 0x34       ; 52
  64:   9c e0           ldi     r25, 0x0C       ; 12
  66:   41 50           subi    r20, 0x01       ; 1
  68:   80 40           sbci    r24, 0x00       ; 0
  6a:   90 40           sbci    r25, 0x00       ; 0
  6c:   e1 f7           brne    .-8             ; 0x66 <main+0x20>
  6e:   00 c0           rjmp    .+0             ; 0x70 <main+0x2a>
  70:   00 00           nop
  72:   eb cf           rjmp    .-42            ; 0x4a <main+0x4>

All the code in __do_copy_data is there to copy the initial values of global variables (in this case LEDPIN) from flash to RAM.  The code from address 46 to 5e sets the bit in PINB, based on the value of LEDPIN, which is stored at address 0x0060 in RAM.  This could be done with a single sbi (set bit) instruction, but the compiler doesn't do that, because the code from blink.c could be linked with other code that changes the global variable LEDPIN.  This is true even when the output is a linked ELF file like blink, because other object files can still be added to it.

By making the LEDPIN variable const, the compiler should know the bit to set will always be bit 1.  As expected, with that change, it generates a single sbi instruction, instead of the 12 instructions above:
  46:   b1 9a           sbi     0x16, 1 ; 22

However it still copies the value of LEDPIN to RAM with the __do_copy_data code.  This is because some other code that gets linked in may use the global variable LEDPIN.  By making the variable static, the compiler will know it is not used outside the current file.  So after changing the code as follows:
static const int LEDPIN = 1;
The code is much smaller:
   text    data     bss     dec     hex filename
     74       0       0      74      4a blink-static-const
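
To put the three declarations side by side, here is a small sketch.  The variable and function names are made up for illustration, and the comments describe what the compiler is allowed to assume, not what any particular gcc version is guaranteed to emit.

```c
#include <stdint.h>

int led_plain = 1;                /* writable, visible to other files: may change at run time */
const int led_const = 1;          /* value is fixed, but other files may still refer to the object */
static const int led_static = 1;  /* fixed and private to this file: the object itself can be dropped */

/* Must load led_plain from RAM and shift at run time. */
uint8_t mask_plain(void)  { return (uint8_t)(1u << led_plain); }

/* Value is known here, so the shift can be folded to a constant. */
uint8_t mask_const(void)  { return (uint8_t)(1u << led_const); }

/* Same folding, and nothing outside this file can need the variable. */
uint8_t mask_static(void) { return (uint8_t)(1u << led_static); }
```

All three functions return the same mask as long as nobody writes to led_plain, which is exactly the fact the compiler cannot rely on when it only sees one file.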

But we can't easily make every new C programmer a really good programmer, so it helps to have a compiler that can figure out that, for the blink application, LEDPIN is defined only once and is used only in the main function.  That's one of the things that gcc's link time optimization (LTO) can do.  After adding the -flto compiler flag, the original version now compiles to 74 bytes, the same as when defining LEDPIN as static const:
   text    data     bss     dec     hex filename
     74       0       0      74      4a blink-flto

A code-generation bug was introduced in avr-gcc 4.8.0, so I suggest using 4.7.3 or waiting for 4.8.3 which will contain a fix.

Friday, April 11, 2014

Tuning 433MHz ASK modules

I purchased a transmit/receive pair of modules from DX, but although other people have been able to get 20m range with the same type of modules, I was having no luck beyond 3-4m.

After doing some more research, I decided to take another shot at getting more distance out of the modules. The receiver supposedly has a wide bandwidth, and can be tuned with an adjusting screw (a variable inductor). I soldered some wires to a 3.5mm audio jack to listen to the output of the receiver, and wrote a small program to generate a 3.3kHz waveform:
/* output 3.3kHz tone on PORTB */
#define F_CPU 8000000L

#include <avr/io.h>
#include <util/delay.h>

#define TONEPIN 0     /*connect to transmitter data pin */
void main()
{
  DDRB = (1<<TONEPIN);    /* output mode */
  while (1) {
    PINB = (1<<TONEPIN);    /* writing 1 to a PINB bit toggles the pin */
    _delay_us(150);         /* half period; 300us per cycle is about 3.3kHz */
  }
}
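
The relationship between the delay and the tone is simple arithmetic: each pass through the loop toggles the pin once, and a full cycle needs two toggles.  A small sketch of that calculation, with made-up function names and ignoring the few cycles of loop overhead:

```c
#include <stdint.h>

/* Delay between pin toggles, in microseconds, for a square wave of freq_hz.
   One full period contains two toggles. */
uint32_t toggle_delay_us(uint32_t freq_hz)
{
    return 1000000UL / (2UL * freq_hz);
}

/* Square wave frequency in Hz produced by toggling every delay_us. */
uint32_t tone_hz(uint32_t delay_us)
{
    return 1000000UL / (2UL * delay_us);
}
```

With the 150us delay used above, tone_hz(150) gives 3333, which is where the 3.3kHz figure comes from.
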
I plugged the audio jack into my computer line-in and configured it to play the audio from line-in.  A set of external speakers with a 3.5mm audio jack would work just as well. I adjusted the tuning screw until I could clearly hear the 3.3kHz tone. I then took the transmitter (on a small breadboard with battery power) around the house, and outside. With the transmitter on top of my car in the driveway some 10-12m away, I could still clearly hear the tone (with a bit of static).

At short range these modules worked at 9600bps, so it seems they don't have much of a low-pass filter.  I plan to add one with a cutoff around 2kHz, then do some tests with 1200bps data.
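
As a rough check on those numbers, here is a sketch that assumes a simple first-order RC filter and NRZ data (as a UART would send it).  Neither the filter type nor the component values are decided yet: the 8.2k and 10nF in the usage note are hypothetical, and the function names are made up.

```c
/* Cutoff frequency in Hz of a first-order RC low-pass: 1 / (2*pi*R*C). */
double rc_cutoff_hz(double r_ohms, double c_farads)
{
    const double pi = 3.14159265358979323846;
    return 1.0 / (2.0 * pi * r_ohms * c_farads);
}

/* Highest fundamental frequency in an NRZ bit stream: alternating
   0/1 bits form a square wave at half the bit rate. */
double max_fundamental_hz(double bits_per_second)
{
    return bits_per_second / 2.0;
}
```

For example, rc_cutoff_hz(8200.0, 10e-9) comes out a little over 1.9kHz.  1200bps data has a fundamental of at most 600Hz, comfortably below that, while 9600bps data at 4800Hz would be attenuated.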

The 315MHz version of these modules appears to have a very similar circuit with a tuning coil, so this technique should work for them as well.

Saturday, April 5, 2014

ATtiny85 as a 433MHz transmitter - fail

I bought a 433MHz transmitter and receiver with the intention of using them for battery-powered wireless sensor nodes.  In small quantities they can be purchased for <$1/pair.  They can easily be used with libraries like VirtualWire, or even for serial UART communications.

My intention is to have multiple intermittent transmitters per receiver, so if I purchased 10 pairs, I'd have 9 unused receiver modules.  I thought I might be able to get my ATtiny85s to transmit a 433MHz signal.  I had read about spritesmod's FM transmitter hack; however, 433MHz is well beyond the 85MHz maximum frequency spec for the t85's PLL.  And the maximum square wave frequency output is half the PLL frequency, so even if I could overclock the PLL to 100MHz, I could at best output a 50MHz square wave.

I considered using multiple clock doubler circuits, but after some thinking I remembered it's still possible to generate a 433MHz signal as a harmonic of a lower frequency square wave.  After trying some different multiples, I worked out that if I generated a 39.45MHz square wave, the 11th harmonic would be very close to 433.92MHz.  That would require the PLL to run at 78.9MHz, and the internal RC oscillator to run at 78.9/8 = 9.86MHz instead of the normal 8MHz.
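
The arithmetic behind those numbers can be written down as a short sketch.  The function names are made up; the only inputs are the relationships described above (the square wave is at most half the PLL frequency, and the PLL runs at 8 times the RC oscillator).

```c
/* Fundamental (MHz) whose n-th harmonic lands on target_mhz. */
double fundamental_mhz(double target_mhz, unsigned n)
{
    return target_mhz / (double)n;
}

/* PLL frequency needed: the square wave output is half the PLL clock. */
double pll_mhz(double square_mhz)
{
    return 2.0 * square_mhz;
}

/* RC oscillator frequency: the PLL multiplies it by 8. */
double rc_mhz(double pll)
{
    return pll / 8.0;
}
```

For a 433.92MHz target and n = 11 this gives a fundamental of about 39.447MHz, a PLL frequency of about 78.9MHz and an RC oscillator frequency of about 9.86MHz.  The rounded 39.45MHz figure puts the 11th harmonic at 433.95MHz.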

The RC oscillator frequency on the tiny85 is tuned by changing the value of the OSCCAL register.  Figure 22-42 of the datasheet shows it can be tuned to over 14MHz.
I wrote a small C program which used timer1 to output a square wave, and used a logic analyzer to measure the frequency.  After a few tries, I found that adding 23 to the default OSCCAL value tuned the RC oscillator to approximately 9.86MHz.  Once that was done, I modified the code to enable the PLL, and output a square wave of 1/2 the PLL frequency with an ASK duty cycle of 30ms on and 50ms off.  For an antenna, I cut a 17cm long piece of 24AWG copper wire from some cat5 cable and plugged it into my breadboard connected to pin6 (OC1A) of the t85.  I hooked up my logic analyzer to the 433MHz receiver's rx pin and captured the received signal.
The signal is recognizable, but the transmitter and receiver were only 15cm apart, and the signal wasn't completely clean.  Once I moved the t85 more than 50cm away, I could see only noise.  The 433MHz transmitter that came with the receiver didn't work at the 10m range that some people have been able to get, but it did work well 2-3m from the receiver.

So why didn't it work?  I'm guessing that the ATtiny85 output drivers do not generate a sharp enough square wave to create strong harmonics at 433MHz.  An EDN article I found states, "most digital output waveforms follow a nearly Gaussian profile", meaning the transitions do not have significant high-frequency components.

I think there is still potential in the idea.  I may purchase a 315MHz receiver, or see if I can re-tune the 433MHz receiver to 315MHz (there is a screw on the board that looks like a variable capacitor).  If I output a 35MHz square wave, the 9th harmonic at 315MHz will be much stronger than the 433MHz signal.  Running the PLL at 90MHz would generate a 45MHz square wave, and the 7th harmonic should be stronger than the 9th harmonic at 35MHz.  If that still doesn't work well enough, I might try fast switching MOSFETs or transistors that are rated >500MHz, in order to generate a sharper output waveform.  Comment if you have any other ideas.
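
For a sense of scale, here is a sketch of harmonic strength for an ideal 50% duty cycle square wave, where only odd harmonics are present and the n-th one has 1/n the amplitude of the fundamental.  The function name is made up, and this is the best case: with the slow edges suspected above, the higher harmonics of a real output will fall off faster than this.

```c
/* Amplitude of the n-th harmonic of an ideal 50% duty cycle square wave,
   relative to the fundamental: 1/n for odd n, 0 for even n. */
double harmonic_rel_amplitude(unsigned n)
{
    return (n % 2u) ? 1.0 / (double)n : 0.0;
}
```

In this ideal model the 9th harmonic is only 11/9 (about 1.22 times) the amplitude of the 11th, and the 7th is 9/7 (about 1.29 times) the 9th, so any larger gain from dropping to a lower harmonic would have to come from the edge shape and not from the 1/n rolloff.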