
STM32 F4 series

High-performance Cortex-M4 MCU

Presentation highlights
The STM32 F4 series brings to the market the world's highest performance Cortex-M microcontrollers: 168 MHz FCPU/210 DMIPS, 363 CoreMark score

The STM32 F4 series extends the STM32 portfolio: 250+ compatible devices already in production, including the F1 series, F2 series and ultra-low-power L1 series

The STM32 F4 series reinforces ST's current leadership in Cortex-M microcontrollers, with 45% world market share by units (in 2010, or cumulated 2007 to Q1/11) according to ARM reporting

STM32 F4 series High-performance digital signal controller


What is Cortex-M4?

MCU: ease of use of C programming, interrupt handling, ultra-low power
DSP: Harvard architecture, single-cycle MAC, barrel shifter
FPU: single precision, ease of use, better code efficiency, faster time to market, eliminates scaling and saturation, easier support for meta-language tools (Matlab)

STM32 F4 Series highlights 1/4


ST is introducing STM32 products based on the Cortex-M4 core.
Over 30 new part numbers, pin-to-pin and software compatible with the existing STM32 F2 Series.
The new DSP and FPU instructions combined with 168 MHz performance open the door to a new level of Digital Signal Controller applications and faster development time.
STM32: Releasing your creativity

STM32 F4 Series highlights 2/4


Advanced technology and process from ST:
Memory accelerator: ART Accelerator
Multi AHB Bus Matrix
90nm process

Outstanding results:
210 DMIPS at 168 MHz.
Execution from Flash equivalent to 0-wait state performance up to 168 MHz thanks to ST ART Accelerator

STM32 F4 Series highlights 3/4


More Memory
Up to 1MB Flash, 192kB SRAM: 128kB on bus matrix + 64kB on data bus dedicated to the CPU usage

Advanced peripherals shared with STM32 F2 Series


USB OTG High speed 480Mbit/s
Ethernet MAC 10/100 with IEEE1588
PWM High speed timers: Now 168 MHz max frequency!
Crypto/hash processor, 32-bit random number generator (RNG)
32-bit RTC with calendar: Now with sub 1 second accuracy, <1uA typ!

STM32 F4 Series highlights 4/4


Further improvements
Low voltage: 1.8V to 3.6V VDD, down to 1.7*V on most packages
Full duplex I2S peripherals
12-bit ADC: 0.41 µs conversion/2.4Msps (7.2Msps in interleaved mode)
High speed USART up to 10.5Mbits/s
High speed SPI up to 37.5Mbits/s
Camera interface up to 54MBytes/s

*external reset circuitry required to support 1.7V



STM32 F4 series applications served


Points of sale/inventory management
Industrial automation and solar panels
Building: Security/fire/HVAC
Test and measurement
Transportation
Consumer
Medical
Communication

STM32 F4 block diagram


Feature highlight
168 MHz Cortex-M4 CPU
Floating point unit (FPU)
ART Accelerator(TM)
Multi-level AHB bus matrix
1-Mbyte Flash, 192-Kbyte SRAM
1.7 to 3.6 V supply
RTC: <1 µA typ, sub second accuracy
2x full duplex I2S
3x 12-bit ADC, 0.41 µs/2.4 MSPS
168 MHz timers

STM32 F4 portfolio

STM32 product series


4 product series

STM32 leading Cortex-M portfolio

The cheapest and quickest way to discover the STM32F4


Everything included for a quick start with the STM32F4 series
Order code: STM32F4DISCOVERY
Available in ST stock from October 2011

In-circuit ST-LINK/V2 debugger/programmer included to debug Discovery kit applications or other target board applications.
Dedicated web site: www.st.com/stm32F4discovery
Large number of examples ready to run
Schematics
Forums and more

STM32F4 Discovery Board


On-board ST-LINK/V2 with selection mode switch to use the kit as stand-alone ST-LINK with SWD connector
Designed to be powered by USB or by external 5V or 3.3V power supply
Can supply target application with 5 Volts or 3 Volts
Two User LEDs (Green and Blue)
Audio codec
MEMS microphone (MP45DT02)
One user Push Button
Extension header for all QFP64 I/Os for quick connection to prototyping board or easy probing

[Board photo callouts: Audio Jack, STM32F407VGT6, ST-LINK/V2, SWD connector, User button]

September : STM32F4 eval board


Eval board: STM3240G-EVAL: 21st of September
For any needs before that date, contact your local ST support

Samples: 21st of September

Package | Part number
LQFP100 | STM32F407VGT6
LQFP144 | STM32F457ZGT6
LQFP176 | STM32F457IGT6
BGA176 | STM32F457IGH6
LQFP64 | STM32F455RGT6

Full production: November 2011

Advanced Information

STM32 F4 key features

STM32 F4 Key features

Real time performance

STM32 F4 series: Cortex M4-based


What is Cortex-M4?

MCU: ease of use of C programming, interrupt handling, ultra-low power
DSP: Harvard architecture, single-cycle MAC, barrel shifter
FPU: single precision, ease of use, better code efficiency, faster time to market, eliminates scaling and saturation, easier support for meta-language tools

STM32F4 versus competitors

STM32 F4: World's #1 in performance

[Chart: Dhrystone benchmark results]

It takes ART to be #1 in performance: it is a combination of core, embedded Flash design, process and acceleration techniques.

ST's ART Accelerator


The adaptive real-time memory accelerator unleashes the Cortex-M4 core's maximum processing performance, equivalent to 0-wait state execution from Flash up to 168 MHz

Real-time performance
32-bit multi-AHB bus matrix
[Diagram: example data flows running in parallel through the bus matrix]
Compressed audio stream (MP3) to 112kByte SRAM block
MP3 decoder code execution by core
Access to the MP3 data for decompression
Decompressed audio stream to 16kByte SRAM
DMA transfer to audio output stage (I2S)
User interface: DMA transfers of the graphical icons from Flash to display


Outstanding power efficiency

Outstanding power efficiency


230 µA/MHz, 38.6 mA at 168 MHz executing CoreMark benchmark from Flash memory (with peripherals off), made possible with:
ST's 90 nm process allowing the CPU core to run at only 1.2 V
ART Accelerator reducing the number of accesses to Flash
Voltage scaling to optimize performance/power consumption

VDD min down to 1.7 V
Low-power modes with backup SRAM and RTC support

[Chart: typical values in VBAT mode]

Low power and real life applications


Low power in real life applications is not just Low-power mode
Need to consider the % of time spent in Low-power mode and in Run mode

[Figure: consumption (µA/MHz) over time for alternating Run and Low-power phases, with the resulting average consumption depending on % Run mode and % Low-power mode]
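
The point of this slide can be written as a time-weighted average. The sketch below is only an illustration: the function names are made up for this example, and any low-power current you feed in is your own figure, since the slide does not give one. The Run-mode figure of 230 µA/MHz at 168 MHz comes from the previous slide.

```c
/* Time-weighted average current, in microamps.
 * run_fraction: share of time spent in Run mode, 0.0 to 1.0.
 * i_run_ua, i_lp_ua: current in Run mode and in Low-power mode.
 * (Hypothetical helper, not an ST API.) */
double average_current_ua(double run_fraction, double i_run_ua, double i_lp_ua)
{
    return run_fraction * i_run_ua + (1.0 - run_fraction) * i_lp_ua;
}

/* Run-mode current from a per-MHz figure, e.g. 230 uA/MHz at 168 MHz. */
double run_current_ua(double ua_per_mhz, double f_mhz)
{
    return ua_per_mhz * f_mhz;
}
```

With 230 µA/MHz at 168 MHz, `run_current_ua` gives 38640 µA, which matches the 38.6 mA quoted earlier. An application that runs 1% of the time is then dominated by whichever term is larger after weighting, which is why both modes matter.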

Superior and innovative peripherals

Superior and innovative peripherals


Audio architecture
HW crypto/hash coprocessor
PWMs @ 168 MHz
Ethernet with IEEE 1588v2
2 USB OTG
2 full duplex I2S
ADC 2.4 MSPS
<1 µA RTC

Digital Camera Interface


Digital Camera interface, up to 54 Mbyte/s
The Camera interface is a universal 8 to 14-bit parallel interface (no industry-standard name). It supports the following data formats:
- 8-bit progressive video: monochrome or raw Bayer
- YCbCr 4:2:2 progressive video
- RGB 565 progressive video
- compressed data (like JPEG)
It also supports the following features:
- continuous mode or snapshot (a single frame) mode
- automatic image cropping
- 8-word FIFO
- AHB slave interface with capability to control the GP-DMA (request/acknowledge) using 1 channel
- various interrupt flags such as End of Line, End of Frame, Vertical Synchronization, Overrun or Error flags

Crypto/Hash Processor and RNG


Encryption/Decryption
DES/TDES (data encryption standard/triple data encryption standard): ECB (electronic codebook) and CBC (cipher block chaining) chaining algorithms, 64-, 128- or 192-bit key
AES (advanced encryption standard): ECB, CBC and CTR (counter mode) chaining algorithms, 128-, 192- or 256-bit key

Universal hash
SHA-1 (secure hash algorithm)
MD5

True random number generator (RNG) that delivers 32-bit random numbers produced by an integrated analog circuit.

Crypto/Hash Processor performance


Parameter | AES | DES | TDES
Key sizes | 128, 192 or 256 bits | 64* bits | 192***, 128** or 64* bits
Block sizes | 128 bits | 64 bits | 64 bits
Time to process one block | 14 HCLK cycles for key = 128 bits; 16 HCLK cycles for key = 192 bits; 18 HCLK cycles for key = 256 bits | 16 HCLK cycles | 48 HCLK cycles
Type | block cipher | block cipher | block cipher
Structure | Substitution-permutation network | Feistel network | Feistel network
First published | 1998 | 1977 (standardized on January 1979) | 1998 (ANS X9.52)

DES: * 8 parity bits
TDES: * 8 parity bits: Keying option 1; ** 16 parity bits: Keying option 2; *** 24 parity bits: Keying option 3
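
The cycle counts above can be turned into a peak throughput figure. The sketch below assumes HCLK runs at the full 168 MHz and ignores the time spent moving data through the FIFOs or DMA, so the result is an upper bound rather than a measured rate. The function name is hypothetical.

```c
#include <stdint.h>

/* Peak throughput in bytes per second for a block cipher engine that
 * needs cycles_per_block HCLK cycles per block of block_bits bits.
 * Data movement (FIFO, DMA) is ignored, so this is an upper bound. */
uint64_t peak_bytes_per_second(uint32_t hclk_hz, uint32_t block_bits,
                               uint32_t cycles_per_block)
{
    return ((uint64_t)hclk_hz * (block_bits / 8u)) / cycles_per_block;
}
```

Under these assumptions AES with a 128-bit key (128-bit block, 14 cycles) peaks at 192 Mbyte/s, DES (64-bit block, 16 cycles) at 84 Mbyte/s and TDES (64-bit block, 48 cycles) at 28 Mbyte/s.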

Crypto/Hash Processor
[Block diagram: CRYPTO processor with input and output FIFOs]
Input side: DMA request for incoming data transfer, Input FIFO, data swapping
Output side: data swapping, Output FIFO, DMA request for outgoing data transfer
AES: ECB, CBC, CTR; key: 128-, 192- and 256-bit
TDES: ECB, CBC; key: 64-, 128- and 192-bit
DES: ECB, CBC; key: 64-bit
Flags: IFEM, IFNF, BUSY, OFFU, OFNE
Interrupts: INRIS, INMIS, INIM (input FIFO) and OUTRIS, OUTMIS, OUTIM (output FIFO), feeding the CRYPTO global interrupt (NVIC)

USB OTG
Up to 2 USB OTG peripherals (on STM32F4x7 devices), compliant with the USB 2.0 specification and with the OTG 1.0 specification.
One is Full Speed (12 Mb/s) only
One is Full Speed or High Speed (480 Mb/s)
Both embed an FS PHY

USB OTG HS
The USB FS/HS peripheral supports both full-speed and high-speed operations.
It features a UTMI low-pin interface (ULPI) to connect an external HS PHY device. The OTG PHY is connected to the microcontroller ULPI port through 12 signals.
Features a dedicated RAM of 4 Kbytes with advanced FIFO control
Dedicated DMA controller

Audio architecture
Two PLLs are available for more flexibility of the system:
The main PLL (PLL), clocked by HSI or HSE, is used to generate the System clock (up to 168 MHz) and the 48 MHz clock for USB OTG FS, SDIO and RNG.
A dedicated PLL (PLLI2S) is used to generate an accurate clock to achieve high-quality audio performance on the I2S interface.

USB OTG peripherals facilitate audio synchronization:


each time a SOF event occurs, a pulse can be output on a pin
dynamic trimming capability of SOF framing period in host mode

2xI2S Full duplex peripherals with:


Less than 0.5% error on sampling frequency
Clock input in case an external high quality audio PLL is needed
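
As an illustration of how one main PLL can deliver both the 168 MHz system clock and the 48 MHz clock, here is a minimal sketch of the divider arithmetic. It assumes the usual structure of an input divider M, a multiplier N and two output dividers P and Q; the 8 MHz input and the values M=8, N=336, P=2, Q=7 are assumptions chosen for the example, not figures from this presentation, and should be checked against the device reference manual.

```c
#include <stdint.h>

/* Output frequencies of a PLL with input divider m, multiplier n and
 * two output dividers: p for the system clock, q for the 48 MHz domain.
 * (Hypothetical helper types and names, not an ST API.) */
typedef struct {
    uint32_t vco_hz;
    uint32_t sysclk_hz;
    uint32_t clk48_hz;
} pll_out_t;

pll_out_t pll_compute(uint32_t in_hz, uint32_t m, uint32_t n,
                      uint32_t p, uint32_t q)
{
    pll_out_t out;
    /* VCO input is in_hz / m; VCO output is that times n. */
    out.vco_hz    = (uint32_t)(((uint64_t)in_hz / m) * n);
    out.sysclk_hz = out.vco_hz / p;
    out.clk48_hz  = out.vco_hz / q;
    return out;
}
```

With the assumed values, the VCO runs at 336 MHz, which divides to 168 MHz for the core and 48 MHz for USB OTG FS, SDIO and RNG. The audio clock is left to the separate PLLI2S so that it can be tuned without disturbing these two.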

More peripherals improvements


Flexible Static Memory Interface for external LCD, SRAM, PSRAM, NOR and NAND Flash, CompactFlash, running at up to 60 MHz to expand memory space or support an external display
3 SPIs running at up to 37.5 Mbit/s
6 USARTs running at up to 10.5 Mbit/s
Analog:
ADCs and DACs work down to VDD min
3x 12-bit ADC, 2.4 MSPS, up to 7.2 MSPS in interleaved mode
Fast GPIO (84 MHz toggling speed)
RTC: sub second accuracy, <1uA typ

Maximum integration
The 1-Mbyte Flash and 192-Kbyte SRAM memories available in the product accommodate advanced software stacks and user data, with no need for external memories
4-Kbyte battery back-up SRAM: used like an EEPROM to save application state, calibration data
In addition, 528 bytes of OTP memory make it possible to store critical user data such as Ethernet MAC addresses or cryptographic keys

Extensive tools and SW

Extensive tools and SW


Evaluation board for full product feature evaluation
Hardware evaluation platform for all interfaces
Possible connection to all I/Os and all peripherals
Discovery kit for cost-effective evaluation and prototyping
Starter kits from 3rd parties available soon
Large choice of development IDE solutions from the STM32 and ARM ecosystem

STM3240G-EVAL: $349
STM32F4DISCOVERY: $14.90

Software Libraries
ST software libraries free at www.st.com/mcu
C source code for easy implementation of all STM32 peripherals in any application
Standard library: source code for implementation of all standard peripherals. Code implemented in demos for STM32 evaluation board
Motor Control library: Sensorless Vector Control for 3-phase brushless motors
DSP library: PID, IIR, FFT, FIR (free with license agreement)
Audio library: MP3/WMA decoder, volume control, equalizer (free with license agreement)

ST engineered, tested, documented and free

Key messages to remember


STM32 F4 series
World's highest performance
Extends the STM32 portfolio to over 250 compatible devices
One in two Cortex-M MCUs shipped worldwide is an STM32
Discovery kits available now

STM32F4DISCOVERY

Thank you

www.st.com/stm32f4

STM32F roadmap

STM32F series short term roadmap


STM32F4 series
Cortex-M4 @ 168 MHz

STM32F2 series
Cortex-M3 @ 120 MHz

STM32F1 series
Cortex-M3 @ 72 MHz

STM32F0 series
Cortex-M0

STM32 Next 2 Major Launch


STM32F4 series
Cortex-M4 @ 168 MHz

STM32F4 Cortex M4 Increasing ST leadership in the performance race PR September 2011

STM32F0 Cortex M0 Expanding Market Reach towards 8-16 bit Early 2012

STM32F0 series
Cortex-M0


STM32 F4 Roadmap
[Chart: Flash size (bytes) versus pin count. Flash size axis: 256 K, 512 K, 1 MB, 2 MB. Pin count axis: 64 pins LQFP/WLCSP, 100 pins LQFP, 144 pins LQFP, 176 pins LQFP/UFBGA, 208 pins UFBGA. Plotted: STM32 F4 1MB Flash Die and STM32 F4 2MB Flash Die.]

STM32 F4 Roadmap
[Chart: Flash size (bytes) versus pin count. Flash size axis: 256 K, 512 K, 1 MB, 2 MB. Pin count axis: 64 pins LQFP/WLCSP, 100 pins LQFP, 144 pins LQFP, 176 pins LQFP/UFBGA, 208 pins UFBGA.]
STM32 F4 2MB Flash Die: samples Q3 2012, production end of 2012
STM32 F4 1MB Flash Die: production now

Backup Slides

STM32 F4 Block diagram


Cortex-M4 w/FPU, 168 MHz

Core: Cortex-M4 CPU + MPU + FPU at 168 MHz; JTAG/SW debug; ETM; nested vectored interrupt controller; 1x SysTick timer
Buses: ARM 32-bit multi-AHB bus matrix; arbiter (max 120 MHz); AHB1 (max 168 MHz); AHB2 (max 168 MHz); bridge to APB1 (max 42 MHz); bridge to APB2 (max 84 MHz)
Memory: Flash I/F; 1MB Flash memory; 192KB SRAM; 4KB backup RAM; external memory interface
DMA: 16 channels
Power and reset: power supply with 1.2V regulator; POR/PDR/PVD; 1.7V-3.6V supply
Clocks: XTAL oscillators 32KHz + 8~25MHz; internal RC oscillators 32KHz + 16MHz; PLL; clock control; RTC/AWU
Connectivity: USB 2.0 OTG FS; USB 2.0 OTG HS; Ethernet MAC 10/100, IEEE1588; 2x CAN 2.0B; 1x SPI; 2x SPI/I2S; 2x USART/LIN; 4x USART/LIN; 3x I2C; 1x SDIO
Timers: 2x 6x 16-bit PWM synchronized AC timers; 3x 16-bit timers; 5x 16-bit timers; 2x 32-bit timers; 2x watchdog (independent & window)
Analog: 3x 12-bit ADC, 24 channels / 2Msps; 2x DAC + 2 timers; temperature sensor
Other: encryption; camera interface; 80/112/140 I/Os; up to 16 external interrupts
Packages: 64 pins to 176 pins

Compared with Mangusta:
pin-to-pin compatible with Mangusta
More SRAM (192KB)
Same IPs as Mangusta
I2S: now full duplex
New RTC with sub-second precision
Faster serial I/F
Faster ADC

STM32 F2 portfolio

STM32 F-2 Series portfolio


Flash size (bytes) versus pin count. Every part has 128 KB RAM unless noted.

Flash | 64 pins LQFP/WLCSP | 100 pins LQFP | 144 pins LQFP | 176 pins LQFP/UFBGA
1 MB (E*) | STM32F205RG | STM32F205VG, STM32F207VG | STM32F205ZG, STM32F207ZG | STM32F207IG
768 K | STM32F205RF | STM32F205VF, STM32F207VF | STM32F205ZF, STM32F207ZF | STM32F207IF
512 K (E*) | STM32F205RE | STM32F205VE, STM32F207VE | STM32F205ZE, STM32F207ZE | STM32F207IE
256 K | STM32F205RC (96 KB RAM) | STM32F205VC (96 KB RAM), STM32F207VC | STM32F205ZC (96 KB RAM), STM32F207ZC | STM32F207IC
128 K | STM32F205RB (64 KB RAM) | STM32F205VB (64 KB RAM) | |

E*: Encryption peripheral on STM32F217 and STM32F215
Feature legend: Ethernet, 2xUSB OTG, camera IF; 1xUSB OTG FS/HS

STM32 F2 and F4 Series coverage


Flash size (bytes) versus pin count, with the "STM32 F2 to F4 Upgrade Zone" overlaid on the STM32 F2 grid. Every part has 128 KB RAM unless noted.

Flash | 64 pins LQFP/WLCSP | 100 pins LQFP | 144 pins LQFP | 176 pins LQFP/UFBGA
1 MB (E*) | STM32F205RG | STM32F205VG, STM32F207VG | STM32F205ZG, STM32F207ZG | STM32F207IG
768 K | STM32F205RF | STM32F205VF, STM32F207VF | STM32F205ZF, STM32F207ZF | STM32F207IF
512 K (E*) | STM32F205RE | STM32F205VE, STM32F207VE | STM32F205ZE, STM32F207ZE | STM32F207IE
256 K | STM32F205RC (96 KB RAM) | STM32F205VC (96 KB RAM), STM32F207VC | STM32F205ZC (96 KB RAM), STM32F207ZC | STM32F207IC
128 K | STM32F205RB (64 KB RAM) | STM32F205VB (64 KB RAM) | |

STM32 F4 Hardware tools

STM32 F4 Discovery kit


Develop your applications easily with everything required for beginners and experienced users to get started quickly.
Based on STM32F407 in LQFP100 package
Includes on-board ST-LINK/V2,

Only $14.90*

*RRP

STM32 F4 Discovery kit


STM32F407VGT6 MCU in LQFP100 package
On-board ST-LINK/V2
2x ST MEMS: motion sensor and microphone
Audio DAC
USB OTG with micro-AB connector
Extension header for all LQFP100 I/Os
Eight LEDs:

STM32 F4 Eval Board from ST


Evaluation board for full product feature evaluation
Hardware evaluation platform for all interfaces
Possible connection to all I/Os and all peripherals

Based on STM32F407 in UFBGA176 package


STM3240G-EVAL $349*

*RRP

Starter kits from 3rd parties


STM32F4 starter kits from IAR and Keil available in Q4 2011
Order codes:
IAR: STM3240G-SK/IAR
KEIL: STM3240G-SK/KEI

High Performance

How to benchmark micros


The core
The first thing to consider is the core: no wait states shall be introduced, as they would decrease the result
As a consequence, the maximum frequency achievable is the maximum frequency of the Flash
The compiler used for the code generation of the benchmark has a significant influence on the result: for the same core, two different compilers can give two different benchmark results

0-ws performance chart


[Chart: DMIPS (0 to 150) versus Fcpu (20 to 100 MHz) at 0 wait state, with one curve each for Competitor A, STM32 F4 and Competitor B, each ending at that device's max flash frequency. Annotations: "Good CPU, fast flash" and "Equivalent CPU, better flash for STM32F4".]

All the results are with the best compiler for each MCU

How to benchmark micros


The flash acceleration
As the Flash access time limits the micro speed, wait states have to be introduced to reach higher frequencies
The influence of wait states is reduced by a flash accelerator, which combines a buffer and/or a cache system taking advantage of a wide access bus to the flash (e.g. 128-bit wide)
The quality and the efficiency of the flash accelerator can be evaluated by looking at the loss of performance on a given benchmark each time a wait state is added
An excellent flash acceleration will result in no penalty each time a wait state is added
A poor flash acceleration will result in a big penalty each time a wait state is added

As a consequence, a fast flash or a powerful CPU does not necessarily mean the best MCU performance
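
The relation between CPU frequency, Flash speed and wait states can be sketched as follows. Both functions are illustrative only: the names are hypothetical, the 30 MHz Flash limit used in the usage note is an assumed figure, and the linear penalty model is a simplification introduced here to express "no penalty" versus "big penalty", not a model given in the presentation.

```c
#include <stdint.h>

/* Wait states needed so that each Flash access lasts at least one
 * Flash access time: ceil(f_cpu / f_flash_max) - 1. */
uint32_t flash_wait_states(uint32_t f_cpu_hz, uint32_t f_flash_max_hz)
{
    if (f_cpu_hz <= f_flash_max_hz) {
        return 0u;
    }
    return (f_cpu_hz + f_flash_max_hz - 1u) / f_flash_max_hz - 1u;
}

/* Fraction of 0-wait-state performance kept after adding wait states,
 * assuming a fixed loss per wait state (0.0 = ideal accelerator).
 * Simplified linear model for illustration only. */
double relative_performance(uint32_t wait_states, double penalty_per_ws)
{
    double perf = 1.0 - penalty_per_ws * (double)wait_states;
    return perf > 0.0 ? perf : 0.0;
}
```

For example, a 168 MHz core in front of a Flash assumed to run at 30 MHz needs 5 wait states. With an ideal accelerator (penalty 0.0) the relative performance stays at 1.0, which is the "0-wait state equivalent" behaviour the presentation claims for the ART Accelerator.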

FPU benefits and performance

FPU benefits in real life applications


High level approach: matrix, mathematical equations
Meta language tools: Matlab, Scilab, etc.
C code generation: floating point numbers (float)

FPU: direct mapping; no code modification; high performance; optimal code efficiency
No FPU, usage of SW lib: no code modification; low performance; medium code efficiency
No FPU, usage of integer based format: code modification; corner case behavior to be checked (saturation, scaling); medium/high performance; medium code efficiency

FPU assembly code generation


C source:

    float function1(float number1, float number2)
    {
        float temp1, temp2;

        temp1 = number1 + number2;
        temp2 = number1/temp1;

        return temp2;
    }

With FPU (1 assembly instruction per operation):

    # float function1(float number1, float number2)
    # {
    #   float temp1, temp2;
    #
    #   temp1 = number1 + number2;
        VADD.F32 S1,S0,S1
    #   temp2 = number1/temp1;
        VDIV.F32 S0,S0,S1
    #
    #   return temp2;
        BX LR
    # }

Without FPU (call to the soft-FPU library):

    # float function1(float number1, float number2)
    # {
        PUSH {R4,LR}
        MOVS R4,R0
        MOVS R0,R1
    #   float temp1, temp2;
    #
    #   temp1 = number1 + number2;
        MOVS R1,R4
        BL __aeabi_fadd
        MOVS R1,R0
    #   temp2 = number1/temp1;
        MOVS R0,R4
        BL __aeabi_fdiv
    #
    #   return temp2;
        POP {R4,PC}
    # }

Floating point benchmark


Execution time comparison for a 29-coefficient FIR on float 32 with and without FPU (CMSIS library)

[Chart: execution time, No FPU versus FPU]

10x improvement
Best compromise: development time vs. performance
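
To make the benchmark concrete, here is a minimal plain-C sketch of a 29-coefficient FIR on 32-bit floats. It is not the CMSIS library routine used for the measurement; the function name and the zero-padding of the input history are choices made for this example. On a core with an FPU each multiply and add in the inner loop maps to a hardware floating-point instruction, while without one each becomes a library call, as in the assembly listing shown earlier.

```c
#include <stddef.h>

#define FIR_TAPS 29  /* number of coefficients, as in the benchmark */

/* Direct-form FIR: y[n] = sum over k of h[k] * x[n - k].
 * Samples before the start of x are treated as zero. */
void fir_f32(const float *x, float *y, size_t n_samples,
             const float h[FIR_TAPS])
{
    for (size_t n = 0; n < n_samples; n++) {
        float acc = 0.0f;
        for (size_t k = 0; k < FIR_TAPS && k <= n; k++) {
            acc += h[k] * x[n - k];
        }
        y[n] = acc;
    }
}
```

Feeding a unit impulse into `fir_f32` returns the coefficients themselves, which is a quick way to check the routine before timing it.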

DSP benefits and performance

Single-cycle multiply-accumulate (MAC)


The multiplier unit allows any MUL or MAC instruction to be executed in a single cycle
Signed/Unsigned Multiply
Signed/Unsigned Multiply-Accumulate
Signed/Unsigned Multiply-Accumulate Long (64-bit)

Benefits: speed improvement vs. Cortex-M3
4x for 16-bit MAC (dual 16-bit MAC)
2x for 32-bit MAC
up to 7x for 64-bit MAC
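
A typical place where the 64-bit multiply-accumulate matters is a dot product with a wide accumulator. The sketch below is plain C with a hypothetical function name; whether a given compiler turns the loop body into a single multiply-accumulate long instruction depends on the toolchain and options, so treat that mapping as an expectation rather than a guarantee.

```c
#include <stdint.h>
#include <stddef.h>

/* Dot product of two 32-bit vectors with a 64-bit accumulator.
 * Each iteration is one 32x32+64 -> 64 multiply-accumulate.
 * The caller must keep n small enough that the sum fits in 64 bits. */
int64_t dot_q31(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int64_t)a[i] * (int64_t)b[i];
    }
    return acc;
}
```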

Saturated arithmetic
Intrinsically prevents overflow of a variable by clipping to min/max boundaries, and removes the CPU burden due to software range checks

Benefits
Audio applications

[Figure: audio waveform on a -1.5 to 1.5 scale, shown without saturation and with saturation]

Control applications

The PID controller's integral term is continuously accumulated over time. The saturation automatically limits its value and saves several CPU cycles per regulator
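
The difference between saturating and wrapping arithmetic can be shown in portable C. This is a software model with hypothetical function names; on the Cortex-M4 the clipping is done by the saturating instructions themselves, which is where the saved cycles come from.

```c
#include <stdint.h>

/* Saturating 16-bit add: clips to [INT16_MIN, INT16_MAX] instead of
 * wrapping around. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Integral term of a PID loop accumulated with saturation, so it can
 * never overflow and change sign however long the error persists. */
int16_t pid_integrate(int16_t integral, int16_t error)
{
    return sat_add16(integral, error);
}
```

For example, 30000 + 10000 saturates to 32767, where a plain 16-bit add would wrap to a negative value and produce the distorted waveform or the runaway regulator that the slide warns about.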

Single-cycle SIMD instructions


Stands for Single Instruction Multiple Data
Allows several operations to be done simultaneously on 8-bit or 16-bit data
Ex: dual 16-bit MAC (Result = 16x16 + 16x16 + 32)
Ex: Quad 8-bit SUB / ADD

Benefits
Parallelizes operations (2x to 4x speed gain)
Minimizes the number of Load/Store instructions for exchanges between memory and register file (2 or 4 data transferred at once), if 32-bit is not necessary
Maximizes register file use (1 register holds 2 or 4 values)
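
The dual 16-bit MAC example (Result = 16x16 + 16x16 + 32) can be modelled in portable C as below. The helper names are hypothetical, and the model assumes the final sum fits in 32 bits. It shows what one such SIMD instruction computes from two packed registers, not how a compiler would emit it.

```c
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word (lo in bits 15:0). */
uint32_t pack16x2(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Sign-extend the low 16 bits of v to a 32-bit integer. */
static int32_t sext16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 65536 : (int32_t)v;
}

/* Software model of a dual 16-bit MAC:
 * result = lo(x)*lo(y) + hi(x)*hi(y) + acc.
 * Assumes the result fits in 32 bits. */
int32_t dual_mac16(uint32_t x, uint32_t y, int32_t acc)
{
    return acc + sext16(x) * sext16(y) + sext16(x >> 16) * sext16(y >> 16);
}
```

Because each register holds two samples, a filter loop built this way loads half as many words and issues half as many multiply-accumulates as the scalar version, which is the source of the 2x gain quoted above.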

DSP performances for filtering applications


FIR filter execution time (CMSIS library)

[Chart: relative execution time (0 to 100) for 32-bit float no FPU, 32-bit float FPU, and 16-bit fixed-point SIMD optimized]

32-bit float with FPU: 10x improvement. Best compromise: development time vs. performance
16-bit fixed-point, SIMD optimized: 17.9x improvement. Best performance. Requires effort for proper data management

DSP performances for control application


Example based on a complex formula used for sensorless motor drive
Gain comes from load operations and SIMD instructions
Total gain on this part is 25 to 35%

Cortex M3 (28-38 c.)

    LDRSH R12,[R4, #+12]
    LDRSH R0,[SP, #+20]
    SXTH  LR,R8
    MUL   R8,LR,R0
    LDR   R1,[R4, #+44]
    SDIV  R0,R1,R7
    LDRSH R2,[R4, #+24]
    LDRSH R3,[R4, #+26]
    LDRSH R10,[R4, #+22]
    SXTH  R6,R6
    MLS   R5,R6,R10,R5
    MLA   R5,R9,R12,R5
    ASR   R6,R8,#+15
    MLA   R5,R6,R3,R5
    SXTH  R0,R0
    MLS   R5,R0,R2,R5
    STR   R5,[SP, #+12]

Cortex M4 (18-28 c.), instructions that replace parts of the sequence above:

    LDR   R10,[R4, #+12]    (1 single 32-bit load replacing two 16-bit loads with sign extension. Gain: 2 cycles)
    SMLSD R5, R10, R6, R5   (1 SIMD instruction replacing two multiply-accumulates. Gain: 3 cycles)
    LDR   R2,[R4, #+22]     (1 single 32-bit load replacing two 16-bit loads with sign extension. Gain: 2 cycles)
    SMLSD R5, R0, R2        (1 SIMD instruction replacing two multiply-accumulates. Gain: 3 cycles)

ARM Cortex M4 in few words

Cortex-M processors
Forget traditional 8/16/32-bit classifications
Seamless architecture across all applications
Every product optimised for ultra low power and ease of use

Cortex-M0: 8/16-bit applications
Cortex-M3: 16/32-bit applications
Cortex-M4: 32-bit/DSC applications

Binary and tool compatible

Cortex-M processors binary compatible

ARM Cortex M4 Core


What is Cortex-M4?

MCU: ease of use of C programming, interrupt handling, ultra-low power
DSP: Harvard architecture, single-cycle MAC, barrel shifter
FPU: single precision, ease of use, better code efficiency, faster time to market, eliminates scaling and saturation, easier support for meta-language tools

Cortex-M4 processor microarchitecture


ARMv7ME Architecture
Thumb-2 Technology
DSP and SIMD extensions
Single cycle MAC (up to 32 x 32 + 64 -> 64)
Optional single precision FPU
Integrated configurable NVIC
Compatible with Cortex-M3

Microarchitecture
3-stage pipeline with branch speculation
3x AHB-Lite Bus Interfaces

Configurable for ultra low power
Deep Sleep Mode, Wakeup Interrupt Controller
Power down features for Floating Point Unit

Flexible configurations for wider applicability
Configurable Interrupt Controller (1-240 Interrupts and Priorities)
Optional Memory Protection Unit
Optional Debug & Trace

Cortex-M feature set comparison


Feature | Cortex-M0 | Cortex-M3 | Cortex-M4
Architecture version | v6M | v7M | v7ME
Instruction set architecture | Thumb, Thumb-2 System Instructions | Thumb + Thumb-2 | Thumb + Thumb-2, DSP, SIMD, FP
DMIPS/MHz | 0.9 | 1.25 | 1.25
Bus interfaces | 1 | 3 | 3
Integrated NVIC | Yes | Yes | Yes
Number of interrupts | 1-32 + NMI | 1-240 + NMI | 1-240 + NMI
Interrupt priorities | 4 | 8-256 | 8-256
Breakpoints, Watchpoints | 4/2/0, 2/1/0 | 8/4/0, 2/1/0 | 8/4/0, 2/1/0
Memory Protection Unit (MPU) | No | Yes (Option) | Yes (Option)
Integrated trace option (ETM) | No | Yes (Option) | Yes (Option)
Fault Robust Interface | No | Yes (Option) | No
Single Cycle Multiply | Yes (Option) | Yes | Yes
Hardware Divide | No | Yes | Yes
WIC Support | Yes | Yes | Yes
Bit banding support | No | Yes | Yes
Single cycle DSP/SIMD | No | No | Yes
Floating point hardware | No | No | Yes
Bus protocol | AHB Lite | AHB Lite, APB | AHB Lite, APB
CMSIS Support | Yes | Yes | Yes

Cortex-M4 extended single cycle MAC


OPERATION | INSTRUCTIONS | CM3 (cycles) | CM4 (cycles)
16x16=32 | SMULBB, SMULBT, SMULTB, SMULTT | n/a | 1
16x16+32=32 | SMLABB, SMLABT, SMLATB, SMLATT | n/a | 1
16x16+64=64 | SMLALBB, SMLALBT, SMLALTB, SMLALTT | n/a | 1
16x32=32 | SMULWB, SMULWT | n/a | 1
(16x32)+32=32 | SMLAWB, SMLAWT | n/a | 1
(16x16)±(16x16)=32 | SMUAD, SMUADX, SMUSD, SMUSDX | n/a | 1
(16x16)±(16x16)+32=32 | SMLAD, SMLADX, SMLSD, SMLSDX | n/a | 1
(16x16)±(16x16)+64=64 | SMLALD, SMLALDX, SMLSLD, SMLSLDX | n/a | 1
32x32=32 | MUL | 1 | 1
32±(32x32)=32 | MLA, MLS | 2 | 1
32x32=64 | SMULL, UMULL | 5-7 | 1
(32x32)+64=64 | SMLAL, UMLAL | 5-7 | 1
(32x32)+32+32=64 | UMAAL | n/a | 1
32±(32x32)=32 (upper) | SMMLA, SMMLAR, SMMLS, SMMLSR | n/a | 1
(32x32)=32 (upper) | SMMUL, SMMULR | n/a | 1

All the above operations are single cycle on the Cortex-M4 processor

Cortex-M4 DSP instructions compared


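Cycle counts

CLASS | INSTRUCTION | Cortex-M3 | Cortex-M4
Arithmetic | ALU operation (not PC) | 1 | 1
Arithmetic | ALU operation to PC | 3 | 3
Arithmetic | CLZ | 1 | 1
Arithmetic | QADD, QDADD, QSUB, QDSUB | n/a | 1
Arithmetic | QADD8, QADD16, QSUB8, QSUB16 | n/a | 1
Arithmetic | QDADD, QDSUB | n/a | 1
Arithmetic | QASX, QSAX, SASX, SSAX | n/a | 1
Arithmetic | SHASX, SHSAX, UHASX, UHSAX | n/a | 1
Arithmetic | SADD8, SADD16, SSUB8, SSUB16 | n/a | 1
Arithmetic | SHADD8, SHADD16, SHSUB8, SHSUB16 | n/a | 1
Arithmetic | UQADD8, UQADD16, UQSUB8, UQSUB16 | n/a | 1
Arithmetic | UHADD8, UHADD16, UHSUB8, UHSUB16 | n/a | 1
Arithmetic | UADD8, UADD16, USUB8, USUB16 | n/a | 1
Arithmetic | UQASX, UQSAX, USAX, UASX | n/a | 1
Arithmetic | UXTAB, UXTAB16, UXTAH | n/a | 1
Arithmetic | USAD8, USADA8 | n/a | 1
Multiplication | MUL, MLA | 1-2 | 1
Multiplication | MULS, MLAS | 1-2 | 1
Multiplication | SMULL, UMULL, SMLAL, UMLAL | 5-7 | 1
Single cycle MAC | SMULBB, SMULBT, SMULTB, SMULTT | n/a | 1
Single cycle MAC | SMLABB, SMLABT, SMLATB, SMLATT | n/a | 1
Single cycle MAC | SMULWB, SMULWT, SMLAWB, SMLAWT | n/a | 1
Single cycle MAC | SMLALBB, SMLALBT, SMLALTB, SMLALTT | n/a | 1
Single cycle MAC | SMLAD, SMLADX, SMLALD, SMLALDX | n/a | 1
Single cycle MAC | SMLSD, SMLSDX | n/a | 1
Single cycle MAC | SMLSLD, SMLSLDX | n/a | 1
Single cycle MAC | SMMLA, SMMLAR, SMMLS, SMMLSR | n/a | 1
Single cycle MAC | SMMUL, SMMULR | n/a | 1
Single cycle MAC | SMUAD, SMUADX, SMUSD, SMUSDX | n/a | 1
Single cycle MAC | UMAAL | n/a | 1
Division | SDIV, UDIV | 2-12 | 2-12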

Cortex-M4 nonDSP instructions


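Cycle counts

CLASS | INSTRUCTION | Cortex-M3 | Cortex-M4
Load/Store | Load single byte to R0-R14 | 1-3 | 1-3
Load/Store | Load single halfword to R0-R14 | 1-3 | 1-3
Load/Store | Load single word to R0-R14 | 1-3 | 1-3
Load/Store | Load to PC | 5 | 5
Load/Store | Load double word | 3 | 3
Load/Store | Store single word | 1-2 | 1-2
Load/Store | Store double word | 3 | 3
Load/Store | Load multiple registers (not PC) | N+1 | N+1
Load/Store | Load multiple registers plus PC | N+5 | N+5
Load/Store | Store multiple registers | N+1 | N+1
Load/Store | Load/store exclusive | 2 | 2
Load/Store | SWP | n/a | n/a
Branch | B, BL, BX, BLX | 2-3 | 2-3
Branch | CBZ, CBNZ | 3 | 3
Branch | TBB, TBH | 5 | 5
Branch | IT | 0-1 | 0-1
Special | MRS | 1 | 1
Special | MSR | 1 | 1
Special | CPS | 1 | 1
Manipulation | BFI, BFC | 1 | 1
Manipulation | RBIT, REV, REV16, REVSH | 1 | 1
Manipulation | SBFX, UBFX | 1 | 1
Manipulation | UXTH, UXTB, SXTH, SXTB | 1 | 1
Manipulation | SSAT, USAT | 1 | 1
Manipulation | SEL | n/a | 1
Manipulation | SXTAB, SXTAB16, SXTAH | n/a | 1
Manipulation | UXTB16, SXTB16 | n/a | 1
Manipulation | SSAT16, USAT16 | n/a | 1
Manipulation | PKHTB, PKHBT | n/a | 1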

16-bit DSP functions compared


Relative cycle counts for DSP tasks running on 16-bit data shown below
Smaller is better on the chart
Cortex-M4 is 30% to 70% better

32-bit DSP functions compared


Relative cycle counts for DSP tasks running on 32-bit data shown below
Smaller is better on the chart
Cortex-M4 is 25% to 60% better

DSP application example: MP3 audio playback

MHz required for MP3 decode (smaller is better !)


DSP Concept

M4 Benefits
50% more performance than M3 for signal processing calculations
25% better than ARM9E at equivalent frequency

50% better than M3 for Audio (MP3 codec)


5% better than ARM9E at equivalent frequency

MMACS
72 MHz: 72 MMACS (32 bits) or 144 MMACS (16 bits)
150 MHz: 150 MMACS (32 bits) or 300 MMACS (16 bits)

Floating point Unit


Graphic acceleration: moves like rotations and so on...
Advanced algorithms: audio (voice recognition, pitch detection) or image processing
Direct Matlab interface: PC tools generate floating point code, directly portable on FPU. A fixed point device will require more care and adaptation.

DSP lib provided for free by ARM


The benefits of software libraries for Cortex-M4
Enables end user to develop applications faster
Keeps end user abstracted from low level programming
Benchmarking vehicle during system development
Clear competitive positioning against incumbent DSP/DSC offerings
Accelerates third party software development

Keeping it easy to access for end user


Minimal entry barrier - very easy to access and use

One standard library no duplicated efforts


ARM channels effort/resources with software partner
Value add through another level of software, e.g. filter config tools

DSP lib function list snapshot


Basic math: vector mathematics
Fast math: sin, cos, sqrt etc.
Interpolation: linear, bilinear
Complex math
Statistics: max, min, RMS etc.
Filtering: IIR, FIR, LMS etc.
Transforms: FFT (real and complex), Cosine transform etc.
Matrix functions
PID Controller
Support functions: copy/fill arrays, data type conversions etc.

STM32 F4 vs. STM32 F2

Differences in Core and System Architecture

Feature | STM32 F2 | STM32 F4
Core | ARM Cortex-M3 (r2p0) | ARM Cortex-M4F* (r0p1)
Floating point calculation | s/w | Single precision h/w
Performance with ART ON | 0ws-like performance thanks to ART Accelerator: 120MHz: 1.65V-3.6V | 0ws-like performance thanks to ART Accelerator: 168MHz: 2.1V-3.6V; 144MHz: 1.8V-2.1V; 128MHz: 1.7V-1.8V
SRAM internal capacity | 128KB of system memory | 192KB (128KB system memory + 64KB dedicated to CPU data)
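
The STM32 F4 voltage ranges above translate directly into a lookup that firmware or a design checklist could use. This is a sketch with a hypothetical function name; the treatment of the exact boundary voltages (for example, counting 2.1 V as part of the 168 MHz range) is an assumption, since the ranges quoted above share their end points.

```c
#include <stdint.h>

/* Maximum CPU frequency in MHz for a supply voltage in millivolts,
 * using the STM32 F4 ranges quoted above.
 * Returns 0 when VDD is outside 1.7 V to 3.6 V. */
uint32_t f4_max_cpu_mhz(uint32_t vdd_mv)
{
    if (vdd_mv < 1700u || vdd_mv > 3600u) return 0u;
    if (vdd_mv >= 2100u) return 168u;  /* 2.1 V to 3.6 V */
    if (vdd_mv >= 1800u) return 144u;  /* 1.8 V to 2.1 V */
    return 128u;                       /* 1.7 V to 1.8 V */
}
```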

Differences in Core and System Architecture

Feature | STM32 F2 | STM32 F4
Internal Regulator Bypass | Available only on WLCSP64 (IRR_OFF pin) and BGA176 (BYPASS_REG pin) packages. On WLCSP64 this functionality cannot be dissociated from BOR OFF | Available only on WLCSP64 and BGA176 (BYPASS_REG pin) packages. BOR OFF and internal regulator bypass are non-exclusive on the above packages
VDD min extension from 1.8V down to 1.65V (requires BOR OFF) on F2, 1.7V (requires BOR OFF) on F4 | Available only on WLCSP64 package (IRR_OFF pin). This functionality cannot be dissociated from Regulator bypass | Available on all packages (PDR_ON pin) except on the LQFP64 package. This functionality can be dissociated from Regulator bypass
Voltage Scaling (Internal regulator output) | None | Performance Optimization (150 MHz max); Power Optimization (120MHz max)

Differences in Peripheral System Architecture

Feature | STM32 F2 | STM32 F4
FSMC (improvements) | Remap capability on bank1-NE1/NE2, but no capability to access other banks while remapped | Remap capability on bank1-NE1/NE2, with access to other FSMC banks while remapped
I2S | 2x I2S Half duplex | 2x I2S Full duplex

New RTC implementation

Feature | STM32 F2 | STM32 F4
Calendar sub seconds access | NO | YES (resolution down to RTC clock)
Calendar resolution | From RTCCLK/2 to RTCCLK/2^20 | From RTCCLK/1 to RTCCLK/2^22
Calendar read and synchronization on the fly | NO | YES
Alarm on calendar | 2 alarms: Sec, Min, Hour, Date/day | 2 alarms: Sec, Min, Hour, Date/day, Sub seconds

New RTC implementation

Feature | STM32 F2 | STM32 F4
Calendar calibration | Calib window: 64 min. Calibration step: negative: -2ppm, positive: +4ppm. Range [-63ppm, +126ppm] | Calib window: 8s/16s/32s. Calibration step: negative or positive: 3.81ppm/1.91ppm/0.95ppm. Range [-480ppm, +480ppm]
Timestamp | YES: Sec, Min, Hour, Date | YES: Sec, Min, Hour, Date, Sub seconds
Tamper | YES (2 pins/1 event). Edge detection only | YES (2 pins/2 events). Level detection with configurable filtering
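
The three STM32 F4 calibration steps line up with a simple interpretation: one RTC clock pulse is added or removed per calibration window. The sketch below works that out assuming a 32.768 kHz RTC clock; both the clock value and this interpretation are assumptions made here, as the presentation only lists the resulting figures. The function name is hypothetical.

```c
/* Calibration step in ppm when one clock pulse is added or removed
 * per calibration window: 1e6 / (window_s * rtc_hz). */
double rtc_calib_step_ppm(double window_s, double rtc_hz)
{
    return 1.0e6 / (window_s * rtc_hz);
}
```

With a 32768 Hz clock this gives about 3.81 ppm for an 8 s window, 1.91 ppm for 16 s and 0.95 ppm for 32 s, matching the steps listed for the STM32 F4.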

Compatible board design for LQFP100-144-176 and BGA 176 packages


F2xx: RFU (reserved for future use) can be connected to VDD/VSS/NC
F4xx: PDR_ON can be connected to VDD or VSS (should be connected to VDD to maintain compatibility with the STM32 family)

[Figure: RFU / PDR_ON pin connected to VDD or VSS]

Compatible board design for WLCSP64+2 package


F2xx: IRR_OFF (Internal Reset and Regulator OFF pin) can be connected to VDD/VSS. The BOR and the Internal Regulator are switched OFF when IRR_OFF is set to VDD.
F4xx: PDR_ON (BOR OFF pin). The BOR is switched OFF when the PDR_ON pin is set to VSS. (Internal regulator is controlled independently using the BYPASS_REG pin)

[Figure: IRR_OFF / PDR_ON pin connected to VDD or VSS]

Thank you

www.st.com/stm32f4

Glossary
ART Accelerator: ST's adaptive real-time accelerator
CMSIS: Cortex microcontroller software interface standard
MCU: microcontroller unit
DSC: digital signal controller
DSP: digital signal processor
FPU: floating point unit
RTC: real-time clock
MPU: memory protection unit
FSMC: flexible static memory controller
