How to precisely measure microseconds with STM32

I received this apparently simply question from a reader of this blog: how can I delay a fistful of microseconds in STM32? That is, how to measure microseconds precisely in STM32?

The answer is: there are several ways to do this, but some methods are more accurate and other ones are more versatile among different MCUs and clock configuration.

Let's consider one member of the STM32F4 family: STM32F401RE, the MCU that equips the STM32Nucleo-F401RE board. This micro is able to run up to 84Mhz using internal RC clock. This means that ever 1µs, the clock cycles 84 times. So, we need a way to count 84 clock cycles to assert that 1µs is elapsed (I'm assuming that you can tolerate the internal RC clock 1% accuracy).

Sometime is common to find code like this:

void delay1US() {
    #define CLOCK_CYCLES_PER_INSTRUCTION    X
    #define CLOCK_FREQ                      Y  //IN MHZ (e.g. 16 for 16 MHZ)

    volatile int cycleCount = CLOCK_FREQ / CLOCK_CYCLE_PER_INSTRUCTION;

    while (cycleCount--);        
    // 1uS is elapsed :-)
    // Sure?
    // :-/
}

But how to establish how many clock cycles are required to compute one step of the while(cycleCount--) instruction? Unfortunately, it's not simple to give an answer. Let's assume that cycleCount is equal to 1. Doing some tests (I'll explain later how I've done them), with compiler optimizations disabled (option -O0 to GCC), we can see that in this case the whole C instruction requires 24 cycles to execute. How is it possible that? You have to figure out that our C statement is unrolled in several assembly instructions, as we can see if we disassemble the firmware binary file:

...
 while(counter--);
 800183e:       f89d 3003       ldrb.w  r3, [sp, #3]
 8001842:       b2db            uxtb    r3, r3
 8001844:       1e5a            subs    r2, r3, #1
 8001846:       b2d2            uxtb    r2, r2
 8001848:       f88d 2003       strb.w  r2, [sp, #3]
 800184c:       2b00            cmp     r3, #0
 800184e:       d1f6            bne.n   800183e <main+0x3e>

Moreover, another source of latency is related to the fetch from internal MCU flash. So that instruction has a "basic cost" of 24 cycles. How many cycles are required if cycleCount is equal to 2? In this case the MCU requires 33 cycles, that is 9 additional cycles. This means that if we want to spin for 84 cycles cycleCount has to be equal to (84-24)/9, which is about 7.  So, we can write our delay function in a more general way:

void delayUS(uint32_t us) {
	volatile uint32_t counter = 7*us;
	while(counter--);
}

Testing this function with this code:

while(1) {
	delayUS(1);
	GPIOA->ODR = 0x0;
	delayUS(1);
	GPIOA->ODR = 0x20;
}

we can check using a decent oscilloscope, connected to a GPIO configured as GPIO_SPEED_HIGHthat this is what we expect:

[box type="warning" align="" class="" width=""]Hey dude, why are you using that code to drive the GPIO?

I can't use HAL here, because we want to reduce CPU cycles to the minimum, so we need to modify GPIO peripheral register directly! Don't forget that even a simple procedure call, without parameter passing, costs CPU cycles due to branching and invalidation of cache pipeline.[/box]

Is this way to delay 1µs always consistent? The answer is NO. First of all, it works well only when this specific MCU (STM32F401RE) works at full speed (84Mhz). If you decide to use a different clock speed, you need to rearrange it doing tests. Second, it is subject to compiler optimizations.

Let's now enable GCC optimization for "size" (-Os). What results do we obtain? In this case we have that the delayUS() function costs only 72 CPU cycles, that is ~850ns. The scope confirms this:

 

And what happens if we enable the maximum optimization for speed (-O3)? In this case we have only 64 CPU cycles, that is our delayUS() lasts only ~750ns, as confirmed by our scope:

However, this issue can be addressed using specific GCC pragma directive:

#pragma GCC push_options
#pragma GCC optimize ("O0")
void delayUS(uint32_t us) {
	volatile uint32_t counter = 7*us;
	while(counter--);
}
#pragma GCC pop_options

Having said this, the fact remains that if we want use a lower CPU frequency or we want to port our code to a different MCU we need to redo tests again.

[box type="warning" align="" class="" width=""]However, take in account that the lower the CPU frequency is the more difficult is to delay for 1µs precisely, because the number of cycles are fixed for a given instruction, but there is less amount of cycles in the same unit of time.[/box]

So, how can we obtain a precise 1µs delay without doing tests if we change hardware setup? The answer is: we need a hardware timer. And we have several options.

The first one comes from the previous tests. How have I measured CPU cycles? Cortex-M processors can have an optional debug unit that provides watchpoints, data tracing, and system profiling for the processor. One register of this unit is CYCCNT, which counts the number of cycles performed by CPU. So, we can use this special unit available in STM32 to count the number of cycles performed by the MCU during instruction execution.

uint32_t cycles = 0;

/* DWT struct is defined inside the core_cm4.h file */
DWT->CTRL |= 1 ; // enable the counter
DWT->CYCCNT = 0; // reset the counter
delayUS(1);
cycles = DWT->CYCCNT;
cycles--; /* We subtract the cycle used to transfer 
             CYCCNT content to cycles variable */

Using DWT we can build a more generic delayUS() routine in this way:

#pragma GCC push_options
#pragma GCC optimize ("O3")
void delayUS_DWT(uint32_t us) {
	volatile uint32_t cycles = (SystemCoreClock/1000000L)*us;
	volatile uint32_t start = DWT->CYCCNT;
	do  {
	} while(DWT->CYCCNT - start < cycles);
}
#pragma GCC pop_options

How much precise this function is? If you are interested to the best resolution at 1µs, this function isn't the best, as shown by the scope.

The best performance is achieved when the higher compiler optimization level is set. As you can see, for a wanted delay of 1µs the function gives about 1.22µs delay (22% slower). However, if we want to spin for 10µs, we obtain a real delay of 10.5µs (5% slower), which is more close to what we want.

Starting from 100µs delay the error is completely negligible.

Why this function is not so precise? To understand why this function is less precise from the other one, you have to figure out that we are using a series of instructions to check how many cycles are expired since the function is started (the while condition). These instructions cost CPU cycles both to update the internal CPU registers with the content of CYCCNT and to do comparison and branching. However, the advantage of this function is that it automatically detects CPU speed, and it works out of the box especially if we are working on faster processors.

Other solutions could be achieved using hardware timers, like TIMx and SYSTICK timers. However, I've obtained performances similar to delayUS_DWT() function.

If you want full control among compiler optimizations, the best 1µs delay can be reached using this macro fully written in assembler:

#define delayUS_ASM(us) do {\
	asm volatile (	"MOV R0,%[loops]\n\t"\
			"1: \n\t"\
			"SUB R0, #1\n\t"\
			"CMP R0, #0\n\t"\
			"BNE 1b \n\t" : : [loops] "r" (16*us) : "memory"\
		      );\
} while(0)

This is the most optimized way to write the while(counter--) function. Doing tests with the scope, I found that 1µs delay can be obtained when the MCU execute this loop 16 times at 84MHZ. However, this macro has to be rearranged if you processor speed is lower, and keep in mind that being a macro, it is "expanded" every time you use it, causing the increase of firmware size.

These are the best solution to obtain 1µs delay with the STM32 platform. If you think that other solutions can be better than these ones, feel free to comment below this post.


Related posts

Upgrading Grove-UART Wifi V2 to latest ESP-AT firmware

Elegoo Car Kit V4

Correct way to perform re-annotation of designators in Altium

16 comments

Erix September 5, 2015 - 9:39 am
Doesn't the actual time also depend on whether the instructions are cached or not when they are called?  
Carmine Noviello September 6, 2015 - 7:08 am
Yes, it depends also on pipeline cache. I forget to mention that if you disable cache in stm32fxxx_hal_conf.h file:
#define  PREFETCH_ENABLE              0
#define  INSTRUCTION_CACHE_ENABLE     0
#define  DATA_CACHE_ENABLE            0
it affects the duration of delay routines.
tilz0R September 5, 2015 - 9:56 am
Great post, however, if you substract fixed amount of cycles when calculating number of cycles you need with DWT, you will get most accurate delay. You have to substract cycles for function call and for calculating SystemCoreClock to MHz. Better option could be using static inline function where function will be placed directly inline without calling it directly.
Jim November 5, 2015 - 4:52 pm
Thanks Carmine, this is a very useful and informative post! Jim
Kyu Chong November 18, 2015 - 12:13 am
I was able to get 440ns pulses using SYSTICK based delay function with 20ns variance using 168Mz STM32F4-Disc board. I am using HAL_GPIO__WritePin with GCC. I am in a process of testing STM32F7 to improve accuracy as well as smaller size pulses. Using your ASM increases variance for logic 0. I was able to create any size pulses with 20s variance with minimum size of 440ns. My goal is to get it down to 10 ns accuracy with .5ns variance. Any idea on how I might be able to achieve my goal? Thanks in advance.
Carmine Noviello November 24, 2015 - 11:47 am
100Mhz switching signal is really hard to achieve with a STM32, probably even with the STM32F7. You need to manipulate directly the GPIO ODR registry, without using the HAL that adds a lot of overhead. Moreover, you need to enable also the compensation cell and to define the GPIO speed (that is, the GPIO duty cycle) to the fastest value.
Jose Luu January 10, 2016 - 1:19 am
You can checkout the high resolution timer on the F334 where resolution can go down to 217ps.
tiwag February 23, 2016 - 10:22 pm
if you want to generate exactly timed pulses why not using a timer? also the timers interrupt capabilty can be used to derive an exact delay time if necessary. i havent done this with a STM32 till now, but often with Hitachi/Renasas SH8 MCU's in the past.
Carmine Noviello February 24, 2016 - 6:56 am
Handling delays of some microseconds using interrupts would flood the MCU, and it's not that good for precise delays. Even if Cortex-M has deterministic interrupt latency, this can cost up to 16 clock cycles in some Cortex-M (formerly M0+ processors). For a STM32 MCUs running at "low speeds" this is a non-negligible overhead (moreover you have to add the cost of clearing the UIF flag, which costs other 3-5 cycles). This means a delay shift of about 500ns, which is 50% more.
bq1 bringup: take 2 – devlog May 15, 2016 - 8:00 am
[…] No idea why it is not possible to do the same thing in C. I've looked into this, but answer yet. - What this can be useful for later on, is a good microsecond sleep function. There's a nice post on that there: http://www.carminenoviello.com/2015/09/04/precisely-measure-microseconds-stm32/ […]
Jason July 1, 2016 - 9:29 am
Hi Carmine, your articles are always very intersting but I have a question on board timers. I am using TIM3 to generate an interrupt every 1 ms but my interrupt service routine lasts, more or less, 800 us; during this time the timer continues to count or it is freezed? I hope I was cleared.
Carmine Noviello July 1, 2016 - 11:10 am
During that time the timer keeps counting, and when it expires the UIF flag is set: this will cause that the IRQ is set at the NVIC, if enabled. So, if your code lasts more than 1us, you are likely to lose the IRQ sometime. Otherwise it should work.
Skalibur February 6, 2017 - 10:14 pm
Hello. Thanks for Article. I could not run the code with CYCCNT :-( but with this code: /************************************/ #pragma GCC push_options #pragma GCC optimize ("O0") void delayUS(uint32_t us) { volatile uint32_t counter = 3*us; while(counter--); } #pragma GCC pop_options /*********************************/ On STM32F103 at 72MHZ, reach 1.02 Us and 980.0 kHz I hope you can explain more in depth CYCCNT. Thx
Carmine Noviello February 8, 2017 - 8:18 am
Cortex- M3/4/7 processors can have an optional debug unit, named Data Watchpoint and Tracing (DWT), that provides watchpoints, data tracing, and system profiling for the processor. One register of this unit is CYCCNT, which counts the number of cycles performed by CPU. So, we can use this special unit available to count the number of cycles performed by the MCU during instruction execution. Which issue do you have when using CYCCNT?
Teddy July 20, 2017 - 1:20 am
I always thank you for your lecture here. I bought the 'Mastering STM32' as well. As the beginner of STM32 this blog and your book helps me well. As always, I got an issue to measure 'micro second' and search this blog and found this post. It helps well too!. But I still have one issue. According to reference manual, the 'DWT' is generally used for debugging. I am afraid that using of this 'DWT' in my code may affect bebugging. (ex. any Debugger using DWT doesn't work well). Can you please let me know your opinion on this? (Because I am not an experienced SW engineer, expecially STM32 series)
Loc July 20, 2017 - 8:36 am
Ơ#define delayUS_ASM(us) do {\ asm volatile ( "MOV R0,%[loops]\n\t"\ "1: \n\t"\ "SUB R0, #1\n\t"\ "CMP R0, #0\n\t"\ "BNE 1b \n\t" : : [loops] "r" (16*us) : "memory"\ );\ } while(0) how can I convert this code for keil C ARM.? thanks for your andvance.
Add Comment