Friday, December 27, 2013

Writing AVR assembler code with the Arduino IDE

Although I have written a lot of code in high-level languages like C++, I enjoy writing assembler the most.  For inserting assembler code into Arduino sketches, you can read a gcc inline assembly guide.  If you already have some assembly code and want to use it, there is an easier way than converting it to inline assembly: you can make it a library.

The Arduino Serial class consumes a lot of resources, and even the tiny cores serial class (TinyDebugSerial) adds overhead to the half duplex software UART code it seems to be based on.  I decided to integrate my implementation of AVR305 with an Arduino sketch.

I started by making a directory called BasicSerial in the libraries directory.  Inside I created a BasicSerial.S file for my assembler code.  For assembler code to be callable from C++, it must follow the avr-gcc register layout and calling convention, and the function name must be marked global.  The TxByte function takes a single char as an argument, which gcc will put in r24.  The Arduino core uses interrupts, which would interfere with the software UART timing, so interrupts are disabled at the start of TxByte and re-enabled at the end.  Here's the code:
#include <avr/io.h>
; correct for avr/io.h 0x20 port offset for io instructions
#define UART_Port (PORTB-0x20)
#define UART_Tx 0

#define bitcnt r18
#define delayArg r19

#if F_CPU == 8000000L
  #warning Using 8MHz CPU timing
  #define TXDELAY 21
#elif F_CPU == 16000000L
  #warning Using 16MHz CPU timing
  #define TXDELAY 44 
#else
  #error unrecognized F_CPU value
#endif

.global TxByte
; transmit byte in r24 - 15 instructions
; calling code must set Tx line to idle state (high) or 1st byte may be lost
; i.e. PORTB |= (1<<UART_Tx)
TxByte:
cli
        sbi UART_Port-1, UART_Tx              ; set Tx line to output
        ldi bitcnt, 10                              ; 1 start + 8 bit + 1 stop
        com r24                                    ; invert and set carry
TxLoop:
        ; 10 cycle loop + delay
        brcc tx1
        cbi UART_Port, UART_Tx                  ; transmit a 0
tx1:
        brcs TxDone
        sbi UART_Port, UART_Tx                  ; transmit a 1
TxDone:
        ldi delayArg, TXDELAY
TxDelay:
; delay (3 cycle * delayArg) -1
        dec delayArg
        brne TxDelay
        lsr r24
        dec bitcnt
        brne TxLoop
reti ; return and enable interrupts
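
To see why a single com followed by repeated lsr instructions is enough to frame a byte, the bit loop can be modelled on a PC.  The following is only a sketch in plain C, not part of the library: tx_level_model is a hypothetical helper name, and it returns the line level (1 = high, 0 = low) that the loop would drive in each of the 10 bit slots.

```c
#include <stdint.h>

/* Host-side model of the TxByte bit loop (hypothetical helper, not in
 * BasicSerial.S).  Returns the Tx line level for bit slot 0..9. */
int tx_level_model(uint8_t data, int slot)
{
    uint8_t r24 = (uint8_t)~data;  /* com r24: invert the data byte ... */
    int carry = 1;                 /* ... and set the carry flag        */
    int level = 1;                 /* line idles high                   */

    for (int bitcnt = 0; bitcnt <= slot && bitcnt < 10; bitcnt++) {
        level = carry ? 0 : 1;     /* carry set -> cbi (0), clear -> sbi (1) */
        carry = r24 & 1;           /* lsr r24: bit 0 moves into carry        */
        r24 >>= 1;                 /* zeros shift in from the top            */
    }
    return level;
}
```

For 'A' (0x41) the model gives the levels 0,1,0,0,0,0,0,1,0,1: a low start bit, the eight data bits least-significant first, and a high stop bit.  The stop bit comes for free because lsr shifts in zeros, which the inverted logic sends as a 1.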

The last thing to do is to create a header file called BasicSerial.h:
extern "C" {
void TxByte(char);
}
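
If the library might also be included from a plain C file, a header like this is often written with the extern "C" block wrapped in a __cplusplus check, since a C compiler does not understand extern "C".  This is a sketch of that common idiom rather than the header shipped in BasicSerial.zip; the BASICSERIAL_H include-guard name is an assumption.

```c
/* BasicSerial.h variant usable from both C and C++ (sketch) */
#ifndef BASICSERIAL_H          /* include-guard name is an assumption */
#define BASICSERIAL_H

#ifdef __cplusplus
extern "C" {                   /* only seen by C++ compilers */
#endif

void TxByte(char c);           /* implemented in BasicSerial.S */

#ifdef __cplusplus
}
#endif

#endif /* BASICSERIAL_H */
```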
If the extern "C" is left out, C++ name mangling will cause a mismatch.  To use the code in the sketch, just include BasicSerial.h, and call TxByte as if it were a C function.  Here's a sample sketch:
#include <BasicSerial.h>

// sketch to test Serial

// change LEDPIN based on your schematic
#define LEDPIN  PINB1

void setup(){
  DDRB |= (1<<LEDPIN);    // set LED pin to output mode
}

void serOut(const char* str)
{
   while (*str) TxByte (*str++);
}

void loop(){
  serOut("Turning on LED\n");
  PORTB |= (1<<LEDPIN);  // turn on LED
  delay(500);            // 0.5 second delay
  PORTB &= ~(1<<LEDPIN); // turn off LED
  delay(1000);           // 1 second delay
}

Download and run the sketch, open the Serial Monitor at 115,200 bps, and you should see "Turning on LED" printed on each pass through loop().

I've posted BasicSerial.zip containing BasicSerial.S and BasicSerial.h.  Have fun!

New year's update:

I've modified the code so the delay timing is calculated by a macro in BasicSerial.h.  Just modify the line:
#define BAUD_RATE 115200
BasicSerialv2.zip
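
As a rough illustration of how such a calculation can work, here is a sketch based only on the cycle counts in the comments of the listing above: a 10-cycle loop plus a delay of (3 * delayArg) - 1 cycles per bit.  It is not the macro from BasicSerialv2.zip; TXDELAY_FOR, cycles_per_bit and actual_baud are hypothetical names, and the rounding scheme is an assumption.

```c
/* Cycles spent per transmitted bit, from the comments in TxLoop:
 * 10-cycle loop + (3 * delayArg - 1) cycles of delay. */
long cycles_per_bit(long txdelay)
{
    return 10 + (3 * txdelay - 1);
}

/* Baud rate actually produced for a given clock and delay value
 * (truncated to a whole number). */
long actual_baud(long f_cpu, long txdelay)
{
    return f_cpu / cycles_per_bit(txdelay);
}

/* Hypothetical macro: round F_CPU/BAUD to whole cycles, subtract the
 * 9 fixed cycles, then divide by 3 with rounding to nearest.
 * The result must fit in an 8-bit register (1..255). */
#define TXDELAY_FOR(f_cpu, baud) \
    ((((f_cpu) + (baud) / 2) / (baud) - 9 + 1) / 3)
```

With these definitions, TXDELAY_FOR(8000000L, 115200L) evaluates to 20 and TXDELAY_FOR(16000000L, 115200L) to 43, one less than the constants hard-coded in the first version.  By the same cycle count, the hard-coded values of 21 and 44 give 72 and 141 cycles per bit, which is a little slower than 115,200 bps.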

Wednesday, December 4, 2013

Trimming the fat from avr-gcc code

Although writing in AVR assembly makes it easy to write programs that fit in a small codespace, writing in C and using AVR Libc is more convenient.  This article outlines how to write C code that avr-gcc will build to a minimal size.  There are a number of other guides for writing small AVR code including AVR 4027, but none of them seem to address the overhead of avr-gcc's start-up library (gcrt1).

Many people still seem to be using avr-gcc 4.3.3, as it usually generates smaller code than 4.5.3 and 4.7.  I recently tried avr-gcc 4.8.2 (linux RPM cross-avr-gcc-4.8.2-3.2), and for the program I use here, it generates even smaller code than 4.3.3.

The test program uses the ATtiny85's internal temperature sensor and flashes the temperature using an LED.  When compiled with avr-gcc 4.3.3 using -Os it results in a 274-byte program:
avr-size temperature
   text    data     bss     dec     hex filename
    274       0       0     274     112 temperature.bu
With avr-gcc 4.8.2 that drops to 240 bytes:
avr-size temperature-4.8
   text    data     bss     dec     hex filename
    240       0       0     240      f0 temperature-4.8

The difference is primarily in the startup files linked to the code.  Disassembling the code with avr-objdump -d shows the reset vector contains a jump to a function called __ctors_end:
   0:   0e c0           rjmp  .+28      ; 0x1e <__ctors_end>
0000001e <__ctors_end>:
  1e:   11 24           eor     r1, r1
  20:   1f be           out     0x3f, r1        ; 63
  22:   cf e5           ldi     r28, 0x5F       ; 95
  24:   d2 e0           ldi     r29, 0x02       ; 2
  26:   de bf           out     0x3e, r29       ; 62
  28:   cd bf           out     0x3d, r28       ; 61

The function __ctors_end falls into __do_copy_data, which falls into __do_clear_bss before an rcall to main followed by an rjmp to _exit.  In total it's about 50 bytes of code before calling main.  With avr-gcc 4.8.2, the only code before main is __ctors_end, or 16 bytes of what would seem to be overhead.

Before trying to cut out __ctors_end, I wanted to make sure the code in __ctors_end is really overhead that can be safely removed.  The first two lines clear r1 and write it to SREG.  Section 8.1 of the ATtinyX5 datasheet states, "During reset, all I/O Registers are set to their initial values, and the program starts execution from the Reset Vector."  The datasheet also indicates SREG's initial value is 0, so writing it again is redundant.  One caveat: avr-gcc treats r1 as a register that always holds zero, so dropping the eor is only safe if the program does not depend on r1 being cleared at startup.  The last 4 lines set the stack pointer (SPL and SPH) to RAMEND, which section 4.6 of the datasheet indicates is their initial value.  So it is safe to get rid of __ctors_end and jump straight to main from the reset vector, for a savings of 16 bytes.
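
The stack pointer claim can be checked with a little arithmetic on the two ldi operands from the disassembly (0x02 for the high byte, 0x5F for the low byte).  The sketch below assumes the ATtiny85 memory map, with 512 bytes of SRAM starting at address 0x0060; the function names are hypothetical.

```c
#include <stdint.h>

/* Stack pointer value built from the two ldi operands in __ctors_end. */
uint16_t sp_from_startup(uint8_t sph, uint8_t spl)
{
    return (uint16_t)((sph << 8) | spl);
}

/* Last SRAM address (RAMEND) for a given SRAM start address and size. */
uint16_t ramend(uint16_t sram_start, uint16_t sram_size)
{
    return (uint16_t)(sram_start + sram_size - 1);
}
```

Both sp_from_startup(0x02, 0x5F) and ramend(0x0060, 512) come out as 0x025F, so the startup code is loading the same value the hardware already holds after reset.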

Another 30 bytes of flash are used for the interrupt vector table (15 vectors of 2 bytes each on the ATtiny85, and even more on the ATmega series MCUs).  Section 9.1 of the datasheet states, "If the program never enables an interrupt source, the Interrupt Vectors are not used, and regular program code can be placed at these locations."  My temperature blinking program doesn't use interrupts, so more space can be saved by getting rid of the interrupt table.

The way to tell avr-gcc not to link in the startup code is -nostartfiles.  If that is all you do with your C code, then avr-gcc will stick the first object file at address 0 (the reset vector).  To ensure the reset vector contains a jump to main, I wrote a small assembly program (crt1.S).  I link this custom startup code instead of the gcrt1 included with the compiler libraries.  The code isn't long, so I'll include it inline:
.org 0x0000
__vectors:
rjmp main

Compile it (avr-gcc -c crt1.S), and link it with your C code.  For compiling temperature.c here's the command line I used, including a couple of extra flags helpful for generating small code:
avr-gcc -mmcu=attiny85 -Os -fno-inline-small-functions -mrelax -nostartfiles crt1.o    temperature.c   -o temperature

The resulting program is 190 bytes, saving 84 bytes vs. avr-gcc 4.3.3 or 50 bytes vs. 4.8.2:
avr-size temperature
   text    data     bss     dec     hex filename
    190       0       0     190      be temperature

Note that many virtual bootloaders for the ATtiny MCUs will cause problems with this technique, as they tend to assume application code doesn't start until after the interrupt vector table.  MCUs with hardware bootloader support (e.g. the ATmega series) will not have problems.  Picoboot, the bootloader I am writing, will only assume the reset vector contains an rjmp to the start of application code and therefore will work with my custom crt1.o.