
Challenges in High-Performance Embedded Designs

Alec Bath, Application Engineer, STMicroelectronics


Markus Mayr, Product Marketing Engineer, STMicroelectronics

1
1D Barcode Scanner

• Our current 1D Barcode Scanner is great, but:
  - We need to reduce cost to stay competitive
  - We need to add new features to react to customer requests: USB, RTOS, etc.

2
Original 1D Barcode Scanner

[Block diagram: ARM7TDMI MCU (50MHz, 128KB Flash, 16KB RAM) with a UART / PS2 (GPIO) external interface, a GPIO user interface, and an external fast ADC (connection labeled 12) attached to the CCD scan sensor]

3
“Von-Neumann Bottleneck”

[Block diagram. Labeled elements: ARM7TDMI (master), DMA controller (master), AHB1 slow peripherals (user interface), AHB2 fast peripherals (USB 1.1), SRAM, FLASH (up to 128KB), EMI to the external fast ADC (connection labeled 12), and an IRQ callout]

4
1D Scanner design then and now

[Block diagram. Cortex-M3 at 72MHz (master 1) with I-bus, D-bus and System bus; Flash I/F to 128KB FLASH; 20KB SRAM (slave); GP-DMA (master 2); bus matrix with arbiter; AHB with AHB-APB1 and AHB-APB2 bridges.
APB2 (user interface): GPIOA,B,C,D,E, AFIO, USART1, SPI1, ADC1,2, TIM1, EXTI.
APB1 (USB 1.1): USART2,3, SPI2, I2C1,2, TIM2,3,4, IWDG, WWDG, USB, CAN, BKP, PWR.]

5
Innovative System Architecture
• Harvard architecture + BusMatrix allows concurrent Flash execution and DMA transfer
• Advanced peripherals to further offload the CPU
• Low-latency deterministic interrupt controller in the Cortex-M3 core
• 75% lower power at the same clock speed as ARM7
• 30% better code size via the Thumb-2 instruction set

[Block diagram. Cortex-M3 at 72MHz (master 1) with I-bus, D-bus and System bus; Flash I/F to 128KB FLASH; 20KB SRAM (slave); GP-DMA (master 2); BusMatrix with arbiter; AHB with AHB-APB1 and AHB-APB2 bridges.
APB2: GPIOA,B,C,D,E, AFIO, USART1, SPI1, ADC1,2, TIM1, EXTI.
APB1: USART2,3, SPI2, I2C1,2, TIM2,3,4, IWDG, WWDG, USB, CAN, BKP, PWR.]

6
Cortex-M3 Harvard Architecture

Instructions fetched over I-bus…

[Block diagram: Cortex-M3 (master 1) and GP-DMA (master 2) connected through the multi-layer bus matrix / arbiter to the Flash I/F and FLASH, the SRAM (slave), and the AHB/APBx bridges (slave) serving USART / SPI / I2C / ADC / TIM]

7
Cortex-M3 Harvard Architecture

Instructions fetched over I-bus…
while literals fetched on D-bus…

[Block diagram: Cortex-M3 (master 1) and GP-DMA (master 2) connected through the multi-layer bus matrix / arbiter to the Flash I/F and FLASH, the SRAM (slave), and the AHB/APBx bridges (slave) serving USART / SPI / I2C / ADC / TIM]

8
Cortex-M3 Harvard Architecture

Instructions fetched over I-bus…
while constants fetched on D-bus…
While Core reads peripheral…

[Block diagram: Cortex-M3 (master 1) and GP-DMA (master 2) connected through the multi-layer bus matrix / arbiter to the Flash I/F and FLASH, the SRAM (slave), and the AHB/APBx bridges (slave) serving USART / SPI / I2C / ADC / TIM]

9
DMA & Cortex-M3 Data Flow

Instructions fetched over I-bus…
while literals fetched on D-bus…
While Core reads peripheral…
While DMA reads SRAM!

[Block diagram: Cortex-M3 (master 1) and GP-DMA (master 2) connected through the multi-layer bus matrix / arbiter to the Flash I/F and FLASH, the SRAM (slave), and the AHB/APBx bridges (slave) serving USART / SPI / I2C / ADC / TIM]
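The concurrency argument of the last four slides can be illustrated with a toy cycle count. This is only a sketch under loud assumptions: the function `cycles_needed`, the master and slave names, and the one-transfer-per-cycle timing are all hypothetical simplifications, not a model of the real STM32 bus matrix.

```python
# Toy model (assumption: every transfer takes exactly one cycle).
# With a bus matrix, masters that target DIFFERENT slaves proceed in
# parallel; masters that collide on the SAME slave are serialized.
# With a single shared bus, every transfer is serialized.

def cycles_needed(requests_per_cycle, shared_bus=False):
    """Return the number of bus cycles needed to serve all requests.

    requests_per_cycle: list of dicts mapping master name -> slave name.
    shared_bus: if True, model one shared bus (one transfer per cycle).
    """
    total = 0
    for requests in requests_per_cycle:
        if not requests:
            total += 1  # an idle slot still consumes a cycle
        elif shared_bus:
            total += len(requests)
        else:
            per_slave = {}
            for slave in requests.values():
                per_slave[slave] = per_slave.get(slave, 0) + 1
            # The busiest slave decides how long this slot lasts.
            total += max(per_slave.values())
    return total

# Scenario from the slides: instruction fetch from FLASH, a peripheral
# read by the core, and a DMA read from SRAM, repeated for four slots.
slide_trace = [{"I-bus": "FLASH", "S-bus": "APB", "DMA": "SRAM"}] * 4

# A conflicting slot: core and DMA both want the SRAM.
conflict_trace = [{"S-bus": "SRAM", "DMA": "SRAM"}]

matrix_cycles = cycles_needed(slide_trace)
shared_cycles = cycles_needed(slide_trace, shared_bus=True)
conflict_cycles = cycles_needed(conflict_trace)
```

In this toy model the slide scenario needs 4 cycles with the matrix and 12 on a shared bus, while a genuine conflict on one slave still costs an extra cycle even with the matrix.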

10

The Cortex-M3 MCU Core
• High performance with low dynamic power
  - Harvard architecture
  - 30% performance improvement over ARM7TDMI
  - Single-cycle multiply
  - Hardware divide
  - Atomic bit manipulation
• Best code density
  - Thumb-2 brings 32-bit performance with 16-bit code density
• Deterministic
  - Interrupt controller inside the core
  - 12-cycle push / 12-cycle pop
  - Just 6-cycle latency for “tail-chained” interrupts
• Improved debug features
  - Serial Wire debug and JTAG
  - Serial-Wire-Viewer adds real-time data trace
  - ETM on select part numbers
  - 2 data watchpoints, 6 hardware breakpoints

11

What’s Thumb-2 ?
• Thumb-2 is a new ARM instruction set that mixes 16-bit and 32-bit instructions
• Backward compatible with the previous 16-bit Thumb instruction set
  - 12 new instructions, including several DSP-type instructions
  - Memory footprint similar to Thumb, performance similar to ARM!
  - No more interworking between ARM and Thumb modes!

12

Compact Code and Data Memory
• Cortex-M3 supports unaligned data accesses to improve constant-data and RAM utilization

[Figure: memory layout of mixed char (8), int (16) and long (32) variables, shown twice. On a 32-bit machine which does not support unaligned data, aligned placement leaves unused (wasted) space between variables. With unaligned data the same variables pack tightly, leaving free space for the rest of the application.]

Reduces SRAM Memory Requirements By Over 25%

Less Memory = Lower cost devices!
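
The padding effect can be estimated on a host machine. The sketch below is an illustration only: the field mix (char, long, char, long, 16-bit int, long) is a hypothetical example rather than the one drawn on the slide, and Python's `struct` module is used merely to compute sizes with and without natural alignment.

```python
import struct

# Hypothetical record: char, long, char, long, 16-bit int, long.
# B = 8-bit unsigned, I = 32-bit unsigned, H = 16-bit unsigned.
FIELDS = "BIBIHI"

# "@" applies the host's natural alignment, so padding bytes are
# inserted in front of fields that would otherwise be misaligned.
aligned_size = struct.calcsize("@" + FIELDS)

# "=" uses standard sizes with no alignment padding at all, which is
# what a packed layout on a core with unaligned access can achieve.
packed_size = struct.calcsize("=" + FIELDS)

wasted_bytes = aligned_size - packed_size
saving_percent = 100.0 * wasted_bytes / aligned_size
```

On a typical host where a 32-bit integer is 4-byte aligned, this record shrinks from 24 to 16 bytes. The exact saving always depends on the field order and types, so the figure quoted on the slide should be read as the presenters' example, not a guarantee.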

13

Atomic Bit Manipulation via Bit Banding
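
The slide body was a figure, so here is a hedged sketch of the address arithmetic behind bit banding. It assumes the standard Cortex-M3 memory map (1 MB bit-band regions at 0x20000000 and 0x40000000 with alias regions at 0x22000000 and 0x42000000). The function name `bit_band_alias` and the example peripheral address are illustrative, not taken from the slides.

```python
# Each bit in a bit-band region is mirrored by one 32-bit word in the
# alias region, so writing that word sets or clears the single bit
# without a read-modify-write sequence in software.

SRAM_BASE, SRAM_ALIAS = 0x20000000, 0x22000000
PERIPH_BASE, PERIPH_ALIAS = 0x40000000, 0x42000000
REGION_SIZE = 0x00100000  # each bit-band region spans 1 MB

def bit_band_alias(byte_addr, bit):
    """Return the alias word address for `bit` (0..7) of `byte_addr`."""
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in 0..7")
    for base, alias in ((SRAM_BASE, SRAM_ALIAS), (PERIPH_BASE, PERIPH_ALIAS)):
        if base <= byte_addr < base + REGION_SIZE:
            # 32 alias bytes per target byte, 4 alias bytes per bit.
            return alias + (byte_addr - base) * 32 + bit * 4
    raise ValueError("address is outside the bit-band regions")

# Bit 7 of the first SRAM byte.
first_sram_bit7 = bit_band_alias(0x20000000, 7)

# Bit 5 of an illustrative peripheral register byte at 0x40010C0C.
periph_bit5 = bit_band_alias(0x40010C0C, 5)
```

A store of 1 or 0 to the returned address then changes only that one bit, which is why the operation is atomic from the software point of view.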

14

NVIC Interrupt Handling
Interrupts are handled in hardware! There’s no instruction overhead.
• 12-cycle entry:
  - Processor state is automatically saved to the stack over the data bus: {PC, xPSR, R0-R3, R12, LR}
  - In parallel, the ISR is prefetched on the instruction bus.
  - The ISR is ready to start executing as soon as the stack PUSH completes.
• 12-cycle exit:
  - Processor state is automatically restored from the stack.
  - In parallel, the interrupted instruction is prefetched, ready for execution upon completion of the stack POP.
  - The stack POP can be interrupted, allowing a new ISR to be executed immediately without the overhead of state saving.

15

Fast Interrupt Response and Tail Chaining

Here we have 2 simultaneous interrupts, IRQ1 being of higher priority.
When IRQ1 is finished, IRQ2 is serviced.

[Timing diagram: IRQ1 (highest priority) and IRQ2 asserted together]

ARM7 (interrupt handling in assembler code):
  PUSH (26) | ISR 1 | POP (16) | PUSH (26) | ISR 2 | POP (16)
  42 cycles between ISR 1 and ISR 2

Cortex-M3 (interrupt handling in HW, tail-chaining):
  PUSH (12) | ISR 1 | tail-chain (6) | ISR 2 | POP (12)
  6 cycles between ISR 1 and ISR 2

In Cortex-M3, ISR2 has only a 6-cycle delay. ISR2 has been ‘tail-chained’.
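
The cycle counts on this slide can be checked with a few lines of arithmetic. The numbers come from the diagram; the variable names are only illustrative.

```python
# Cycle counts as printed on the slide.
ARM7_PUSH, ARM7_POP = 26, 16
CM3_PUSH, CM3_POP, CM3_TAIL_CHAIN = 12, 12, 6

# Delay between the end of ISR 1 and the start of ISR 2.
arm7_gap = ARM7_POP + ARM7_PUSH   # full POP, then full PUSH
cm3_gap = CM3_TAIL_CHAIN          # state stays stacked

# Total stacking overhead for servicing both interrupts.
arm7_overhead = 2 * (ARM7_PUSH + ARM7_POP)
cm3_overhead = CM3_PUSH + CM3_TAIL_CHAIN + CM3_POP

cycles_saved = arm7_overhead - cm3_overhead
```

With these figures the gap drops from 42 to 6 cycles and the total overhead for the pair of interrupts drops from 84 to 30 cycles.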

16

System Timer (SysTick)
• Flexible system timer is part of the Cortex-M3 core
• 24-bit self-reloading down counter with end-of-count interrupt
• 2 configurable clock sources
• Suitable for a real-time OS or other scheduled tasks
• Can only be accessed when executing in privileged mode
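
As a rough sketch of what the 24-bit width means in practice, the helper below computes a reload value for a desired tick rate. The function name `systick_reload` is hypothetical, and the 72 MHz clock is borrowed from the earlier slides as an assumed SysTick clock source.

```python
SYSTICK_MAX_RELOAD = (1 << 24) - 1  # largest value a 24-bit counter holds

def systick_reload(clock_hz, tick_hz):
    """Reload value so the counter expires tick_hz times per second."""
    cycles, remainder = divmod(clock_hz, tick_hz)
    if remainder:
        raise ValueError("tick rate does not divide the clock evenly")
    reload = cycles - 1  # the counter counts from reload down to 0
    if not 0 < reload <= SYSTICK_MAX_RELOAD:
        raise ValueError("period does not fit in a 24-bit counter")
    return reload

# 1 ms RTOS tick from an assumed 72 MHz SysTick clock.
reload_1ms = systick_reload(72_000_000, 1000)

# Longest period the counter can cover at that clock, in seconds.
longest_period_s = (SYSTICK_MAX_RELOAD + 1) / 72_000_000
```

At 72 MHz the longest single period is about 0.233 s, so slower ticks would need the slower clock source or a software divider.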

17

STM32 On-Chip Flash Memory Interface
• Mission: support 72 MHz operation directly from Flash memory
• 64-bit wide Flash with prefetch (2 × 64-bit buffers)

[Block diagram: 64-bit wide FLASH memory array feeding two 64-bit prefetch buffers through an arbiter in the Flash interface. The buffers hold a mix of 16-bit Thumb and 32-bit Thumb-2 instructions delivered to the Cortex-M3 CPU over the Instructions-BUS; 8-, 16- and 32-bit data are read over the Data/Debug-BUS.]

18

Low Power Modes

Low power modes: names, functions and consumption (STM32L)

Block columns in the original table: CPU, Periphs, Medium / high speed Osc, Speed OSC, LSI, RTC Calendar, RAM. The extracted text does not preserve which column each state belongs to, so the states are listed in extraction order.

Mode                 Block states (as extracted)                  Current consumption   Wake-up time
RUN (from Flash)     ON ON ON ON ON, 2 × "can be enabled"         230µA/MHz
RUN (from RAM)       ON ON ON ON ON, 2 × "can be enabled"         185µA/MHz
LP RUN               ON ON (LS) ON ON ON, 2 × "can be enabled"    11µA                  -
LP SLEEP             OFF OFF OFF ON ON ON, 1 × "can be enabled"   6µA                   0.35µs
STOP w/full RTC      ON ON ON                                     1.3µA                 8µs
STOP w/o RTC         OFF ON ON                                    0.43µA                57µs
STANDBY w/full RTC   ON OFF OFF                                   1µA                   8µs
STANDBY w/o RTC      OFF OFF OFF                                  0.27µA                57µs

(One further "ON" cell sits between the STOP and STANDBY rows in the extraction.)
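
To see how these figures combine, here is a hedged back-of-the-envelope estimate of average current for a duty-cycled application. The 230µA/MHz and 1.3µA values come from the table; the 16 MHz clock, the 1% awake fraction, and the function name `average_current_ua` are assumptions for illustration, and wake-up energy is ignored.

```python
RUN_FROM_FLASH_UA_PER_MHZ = 230.0  # RUN (from Flash), from the table
STOP_WITH_RTC_UA = 1.3             # STOP w/full RTC, from the table

def average_current_ua(clock_mhz, awake_fraction):
    """Average current when alternating between RUN and STOP with RTC."""
    if not 0.0 <= awake_fraction <= 1.0:
        raise ValueError("awake_fraction must be between 0 and 1")
    run_ua = RUN_FROM_FLASH_UA_PER_MHZ * clock_mhz
    sleep_fraction = 1.0 - awake_fraction
    return awake_fraction * run_ua + sleep_fraction * STOP_WITH_RTC_UA

# Assumed workload: 16 MHz core clock, awake 1% of the time.
estimate_ua = average_current_ua(16.0, 0.01)
```

Under those assumptions the average is roughly 38µA, dominated by the short active bursts rather than by the STOP current.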

19

MCU Platform for Rapid Innovation
Powerful core

Fully compatible product portfolio

Complete ecosystem

Product lines: Connectivity Line, USB Access Line, Value Line, Performance Line, Access Line, Low Power, Low Power w/ LCD

20
2D Barcode Scanner

• Easy and fast product transition thanks to scalable architecture and compatible product family

21

2D Barcode Scanner

[Block diagram of the 2D scanner: camera I/F, GPIO (user interface), HS USB device with HID class (USB connector), USART (USART connector), 802.15.4 radio, DAC / I2S, and image processing (DSP calculations)]

22

Innovative System Architecture
[Block diagram of the multi-AHB bus matrix.
Masters: Cortex-M3 120MHz w/ MPU (master 1; I-Bus, D-Bus, S-Bus), dual-port DMA1 (master 2, FIFO/8 streams), dual-port DMA2 (master 3, FIFO/8 streams), High Speed USB2.0 (master 4, FIFO/DMA), Ethernet 10/100 (master 5, FIFO/DMA).
Slaves: FLASH up to 1Mbytes behind the ART Accelerator (I-Code, D-Code), SRAM1 112KB, SRAM2 16KB, FSMC, AHB2 (DCMI, Crypto, USB Full Speed), AHB1 (GPIOs), AHB1-APB1 (slow peripherals), AHB1-APB2 (fast peripherals).
The DMA controllers also have dual-port connections (Periph1/Periph2, Mem1/Mem2).]

23

2D Barcode Scanner

[Block diagram of the 2D scanner: camera I/F, GPIO (user interface), HS USB device with HID class (USB connector), USART (USART connector), 802.15.4 radio, DAC / I2S, and image processing (DSP calculations)]

24

Media Hubs
• Convergence of different media delivery options
  - Speaker systems
  - Docking stations
  - Streaming systems
  - Media players
• Modular approach

[Block diagram of a media hub.
External devices and connectors: host PC (HTML page on host), USB media player, SD card, mobile phone w/ media player (BT), BT headset, CMOS camera, Wifi module, touch screen, tablet, key, micro phone, Eth-PHY, USB conn. (evaluation for Apple + USB conn. with analog switches), SD conn., USB Phy., BT front-end G2-Icon (STLC2690) + antenna.
MCU interfaces: Ethernet MAC, HS USB, FS USB host, SPI, UART, DCMI camera I/F, FSMC, I²C, SD card I/F, I2S (w/ I²C control).
Software blocks: TCP/IP, MS class, device (I²C ctrl), player control, file system (eFSL), BT stack, media streaming, audio streaming, MP3/WMA codec (2 possible instances at the same time), EC/NR, HMI control & display, volume control / ch. mixer, equalizer / loudness.
Outputs: authentication device, audio codec (STw5098), amplifier, FM (STLC2690), speakers / headset.]

25

Media Hubs
[Block diagram of a media hub.
External devices and connectors: host PC (HTML page on host), USB media player, SD card, mobile phone w/ media player (BT), BT headset, CMOS camera, Wifi module, touch screen, tablet, key, micro phone, Eth-PHY, USB conn. (evaluation for Apple + USB conn. with analog switches), SD conn., USB Phy., BT front-end G2-Icon (STLC2690) + antenna.
MCU interfaces: Ethernet MAC, HS USB, FS USB host, SPI, UART, DCMI camera I/F, FSMC, I²C, SD card I/F, I2S (w/ I²C control).
Software blocks: TCP/IP, MS class, device (I²C ctrl), player control, file system (eFSL), BT stack, media streaming, audio streaming, MP3/WMA codec (2 possible instances at the same time), EC/NR, HMI control & display, volume control / ch. mixer, equalizer / loudness.
Outputs: authentication device, audio codec (STw5098), amplifier, FM (STLC2690), speakers / headset.]

26

Audio Docking Station
[Block diagram of the audio docking station.
Hardware: Cortex™-M3 CPU at 120MHz, SRAM, Flash, DMA, 12-bit ADC (micro phone & pre-amplifier), I²S to audio DAC and amp (speakers / headset), PLL block, XTAL oscillators (32 kHz + 3-25 MHz), USB, FSMC, SDIO.
Software: MP3 decoder (2 instances), MP3 encoder, WMA decoder, volume / ch. mixer, loudness control, 5-band equalizer, file system *, MS class *, FS USB host *, HMI control & display.
User interface and media: key, touch screen, QVGA LCD; audio media on USB mass storage device.]

27

Architecture : DMA & Multi-Bus Matrix

Fast peripheral DMA bus to bypass the bus matrix

[Block diagram of the multi-AHB bus matrix.
Masters: Cortex-M3 120MHz w/ MPU (master 1), dual-port DMA1 (master 2), dual-port DMA2 (master 3), High Speed USB2.0 (master 4), Ethernet 10/100 (master 5).
Slaves: FLASH up to 1Mbytes behind the ART Accelerator (I-Code, D-Code), SRAM1 112KB, SRAM2 16KB, FSMC, AHB2 (DCMI, Crypto, USB Full Speed), AHB1 (GPIOs), AHB1-APB1 (slow peripherals), AHB1-APB2 (fast peripherals).]

28

STM32 F-2 ART Accelerator™
• Supports 120 MHz operation without penalty
• 128-bit wide Flash with prefetch (64 × 128-bit buffers for code, 8 × 128-bit buffers for data)
• Intelligent branch management

[Block diagram: 128-bit wide FLASH memory array; arbitration and branch management; an array of 128-bit buffers serving the Cortex-M3 CPU over the Instruction-BUS and a smaller set of 128-bit buffers serving the Data/Debug-BUS]

29

ART Accelerator™: the bottom line!

[Chart: performance (DMIPS, axis 0 to 160) versus core frequency (axis 0 to 150) for STM32F200, MCU A and MCU B.
Annotation: STM32F200 performance is almost linear with frequency.
Annotation: impact of wait states (imperfect accelerator, slow flash).]
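
Why wait states bend the curve can be sketched with a toy model. Everything here is an assumption made for illustration: the function names, the 30 MHz flash read limit, and the idea that a single "hit rate" describes how often the accelerator hides a flash access. It is not a description of how the ART Accelerator or the unnamed MCUs actually behave.

```python
import math

def wait_states(core_mhz, flash_max_mhz):
    """Wait states needed if the flash array can be read at flash_max_mhz."""
    return max(0, math.ceil(core_mhz / flash_max_mhz) - 1)

def effective_mhz(core_mhz, flash_max_mhz, hit_rate):
    """Instruction throughput when only misses pay the wait states."""
    ws = wait_states(core_mhz, flash_max_mhz)
    cycles_per_fetch = 1 + (1 - hit_rate) * ws
    return core_mhz / cycles_per_fetch

# Assumed flash limit of 30 MHz per access (hypothetical figure).
perfect = effective_mhz(120, 30, 1.0)      # every fetch is hidden
imperfect = effective_mhz(120, 30, 0.9)    # 10% of fetches stall
no_accel = effective_mhz(120, 30, 0.0)     # every fetch stalls
```

In this model a perfect accelerator keeps throughput proportional to frequency, while even a 10% miss rate costs roughly a quarter of the throughput at 120 MHz, which matches the qualitative shape the chart describes.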

30

Estimating Real World Performance

CoreMark is a better way!
• A realistic mix of read/write, integer and control operations
• Written in ANSI C and less than 16KB
• Matrix math operations, linked-list manipulation, state machine operations and CRC
• Focuses on an MCU’s pipeline, memory and integer math capabilities
• Designed not to make any library calls that could be optimized away
• CoreMark is part of EEMBC, the industry standard for benchmarking
• www.coremark.org

31

What is CoreMark?
• Simple, yet sophisticated
  - Easily ported in hours, if not minutes
  - Comprehensive documentation and run rules
• Free, but not cheap
  - Open C code source download from EEMBC website
  - Robust CPU core functionality coverage
• Dhrystone Terminator
  - The benefits of Dhrystone without all the shortcomings
  - Free, small, easily portable
  - CoreMark does real work

32
Exposing Dhrystone Weaknesses
• Major portions of Dhrystone are susceptible to a compiler’s ability to optimize the work away - NOT CoreMark
• Library calls are made within the timed portion and dominate the time consumed by the benchmark - NOT CoreMark
• Completely synthetic and does not mimic any behavior that can be expected in a real application - NOT CoreMark
• No official source code, resulting in different, and often undisclosed, versions (1.1, 2.0, 2.1) - NOT CoreMark
• Very vague and ambiguous run guidelines are not universally known and are not enforced - NOT CoreMark
• Reporting lacks standardization; various formats in use (DMIPS, Dhrystones per second, DMIPS/MHz) - NOT CoreMark
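
The three reporting formats named in the last bullet are related by simple arithmetic, which the sketch below makes explicit. The divisor 1757 is the conventional VAX 11/780 reference score; the function names and the sample score are hypothetical and not measurements from the slides.

```python
# Conventional reference: a VAX 11/780 scores 1757 Dhrystones per
# second, which is defined as 1 DMIPS.
VAX_DHRYSTONES_PER_SECOND = 1757

def dmips(dhrystones_per_second):
    """Convert a raw Dhrystone score to DMIPS."""
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND

def dmips_per_mhz(dhrystones_per_second, clock_mhz):
    """Normalize the DMIPS figure by the core clock."""
    return dmips(dhrystones_per_second) / clock_mhz

# Hypothetical score chosen so the numbers come out round.
sample_score = 263_550
sample_dmips = dmips(sample_score)
sample_dmips_per_mhz = dmips_per_mhz(sample_score, 120)
```

The same run can therefore be quoted as 263,550 Dhrystones per second, 150 DMIPS, or 1.25 DMIPS/MHz, which is exactly the kind of ambiguity the slide criticizes.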

33
CoreMark Workload Features
• Matrix manipulation allows the use of MAC and common math ops
• Linked list manipulation exercises the common use of pointers
• State machine operation represents data dependent branches
• Cyclic Redundancy Check (CRC) is a very common embedded function
• Testing for:
  - A processor’s basic pipeline structure
  - Basic read/write operations
  - Integer operations
  - Control operations
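
To give a feel for the kind of integer and control work a CRC involves, here is a generic bitwise CRC-16. It uses the widely documented CCITT-FALSE parameters purely as an illustration; it is not claimed to be the CRC variant or the code that CoreMark itself uses.

```python
def crc16_ccitt(data, crc=0xFFFF):
    """Bitwise CRC-16 with polynomial 0x1021, MSB first, no reflection."""
    for byte in data:
        crc ^= byte << 8              # bring the next byte into the top
        for _ in range(8):
            if crc & 0x8000:          # data-dependent branch per bit
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

# Standard check string used for CRC catalogues.
check_value = crc16_ccitt(b"123456789")
```

The inner loop mixes shifts, XORs and a branch that depends on the data, which is why CRC code stresses the pipeline and integer unit rather than any library routine.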

34
Summary: The Value of CoreMark

• Simple to use, yet sufficiently sophisticated for benchmarking a processor core
• Freely available, limited usage restrictions
• Provides industry standard tool to allow users to begin embedded processor analysis
• Introduces all processor users to the overall value of EEMBC

35
EEMBC CoreMark 1.0 - Summary
[Chart: CoreMark score (Iter/Sec, axis 0 to 250) versus clock frequency (MHz). STM32F2xx: 228.6 @ 120MHz and 190.30 @ 100MHz.]
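
A quick way to read the two data points is to normalize them by frequency. The scores are from the chart; the variable names are illustrative.

```python
# (clock in MHz, CoreMark iterations per second) from the chart.
points = [(120, 228.6), (100, 190.30)]

# CoreMark/MHz for each operating point.
per_mhz = [round(score / mhz, 3) for mhz, score in points]

# Relative difference between the two normalized scores, in percent.
spread_percent = 100.0 * abs(per_mhz[0] - per_mhz[1]) / per_mhz[1]
```

Both points come out at about 1.90 CoreMark/MHz, so the score scales almost exactly with the clock between 100 and 120 MHz.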

36

High-performance and Low Power!

• Who said high performance and low dynamic power consumption were not compatible?
• Less than 23mA @ 120MHz from flash, with peripherals OFF (running CoreMark benchmark)
• 150µA in Stop mode
• < 1µA in VBAT mode with RTC running
• 2µA in standby mode
• 1.8 - 3.6V power supply

37

Let’s review!

• Not all Cortex-M3 microcontrollers are the same!
• Things to consider in your next high-performance embedded design:
  - An innovative multi-layer bus architecture
  - Intelligent peripherals and sub-systems
  - Highly optimized flash memory controller
  - The efficient Cortex-M3 core
  - A cutting-edge 90nm process technology, enabling 120MHz performance with very low power consumption

38

Questions?

39
