High Performance
Embedded Designs
1D Barcode Scanner
Original 1D Barcode Scanner
[Block diagram. Labels: GPIO, User Interface, 12, 50MHz, External Fast ADC, CCD Scan sensor, 128KB Flash, 16KB RAM]
“Von-Neumann Bottleneck”
[Block diagram. Labels: DMA Controller (Master), ARM7TDMI (Master), AHB2 Fast Peripherals, USB 1.1, SRAM, EMI, External Fast ADC (12), FLASH up to 128KB, IRQ!!]
1D Scanner design then and now
[Block diagram. Labels: CORTEX-M3 72MHz (Master 1) with I-bus, D-bus and System bus; FLASH I/F, 128KB Flash; 20KB RAM (SRAM, Slave); BusMatrix; GP-DMA (Master 2); Arbiter; AHB-APB2 and AHB-APB1 Bridges; APB2: GPIOA,B,C,D,E, AFIO, USART1, SPI1, ADC1,2, TIM1, EXTI; APB1: USART2,3, SPI2, I2C1,2, TIM2,3,4, IWDG, WWDG, USB, CAN, BKP, PWR; User Interface; USB 1.1]
Innovative System Architecture
Harvard architecture + BusMatrix allows concurrent Flash execution and DMA transfer
Advanced Peripherals to further offload the CPU
Low-latency deterministic interrupt controller in the Cortex-M3 core
75% lower power at the same clock speed as ARM7
30% smaller code size with the Thumb-2 instruction set
[Block diagram: the STM32 BusMatrix architecture, with the same labels as the diagram under "1D Scanner design then and now"]
Cortex-M3 Harvard Architecture
[Block diagram, shown in three build steps. Labels: CORTEX-M3 (Master 1) with I-bus, D-bus and System bus; GP-DMA (Master 2); Multi-layer Bus Matrix / Arbiter; Flash I/F and FLASH (Slave); SRAM (Slave); AHB/APBx Bridges (Slave) to USART / SPI / I2C / ADC / TIM. Caption on the final step: "While Core reads peripheral…"]
DMA & Cortex-M3 Data Flow
[Block diagram. Labels: CORTEX-M3 (Master 1) with D-bus and System bus; GP-DMA (Master 2); Multi-layer Bus Matrix / Arbiter; Flash I/F and FLASH (Slave); SRAM (Slave); AHB/APBx Bridges (Slave) to USART / SPI / I2C / ADC / TIM. Captions: "While Core reads peripheral…" and "While DMA reads SRAM!"]
The Cortex-M3 MCU Core
High performance with low dynamic power
Harvard Architecture
30% performance improvement over ARM7TDMI
Single-cycle multiply
Hardware divide
Atomic Bit manipulation
Deterministic
Interrupt controller inside the core,
12-cycle push / 12-cycle pop
Just 6-cycle latency for “tail-chained” interrupts
What’s Thumb-2?
Thumb-2 is a NEW ARM instruction set, mixing 16 & 32-bit instructions
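How a decoder tells the two widths apart can be sketched in a few lines. The rule below comes from the ARMv7-M instruction encoding, not from this slide: a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 is the first half of a 32-bit instruction; anything else is a complete 16-bit instruction. The function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when `halfword` is the first half of a 32-bit
 * Thumb-2 instruction, false when it is a complete 16-bit instruction.
 * Top five bits 0b11101, 0b11110 or 0b11111 mark a 32-bit encoding. */
bool thumb2_is_32bit(uint16_t halfword)
{
    return (halfword >> 11) >= 0x1D;
}

/* Size in bytes of the instruction that starts with `halfword`. */
unsigned thumb2_insn_size(uint16_t halfword)
{
    return thumb2_is_32bit(halfword) ? 4u : 2u;
}
```

For example, 0x4770 (BX LR) is reported as a 2-byte instruction, while 0xF000 (the first halfword of a BL) is reported as the start of a 4-byte one. Because both sizes share one instruction stream, no mode switch is needed between them.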
Compact Code and Data Memory
• Cortex-M3 supports unaligned data accesses, which improves the utilization of constant data and RAM
[Figure labels: Unused (wasted) space; Free space for the rest of the application]
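As a rough illustration of the saving, the sketch below compares a naturally aligned record with a packed one. The record and its fields are invented for this example, and `__attribute__((packed))` is a GCC/Clang extension; other toolchains spell it differently. Packing only pays off on a core that, like the Cortex-M3, can read the misaligned fields in hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sensor record with natural alignment: the compiler inserts
 * padding so that each field sits on its own alignment boundary. */
struct sample_aligned {
    uint8_t  channel;
    uint32_t timestamp;
    uint16_t value;
    uint8_t  flags;
};

/* Same fields with the padding removed (GCC/Clang extension).
 * `timestamp` now starts at offset 1, which is an unaligned address. */
struct __attribute__((packed)) sample_packed {
    uint8_t  channel;
    uint32_t timestamp;
    uint16_t value;
    uint8_t  flags;
};

/* Bytes saved per record by packing. */
size_t bytes_saved_per_record(void)
{
    return sizeof(struct sample_aligned) - sizeof(struct sample_packed);
}
```

On a typical ABI where `uint32_t` is 4-byte aligned, the aligned record takes 12 bytes and the packed one 8, so a table of such records shrinks by a third.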
Atomic Bit Manipulation via Bit Banding
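The slide gives only the title, so here is a minimal sketch of the address arithmetic behind bit banding in the Cortex-M3 memory map. Each bit in the first 1 MB of the SRAM region (0x20000000) or the peripheral region (0x40000000) has its own word in an alias region 0x02000000 above the region base; writing that word sets or clears the single bit without a software read-modify-write sequence. The function name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Compute the bit-band alias address for bit `bit` (0..7) of the byte at
 * `byte_addr`. `byte_addr` must lie in a bit-band region: the first 1 MB
 * from 0x20000000 (SRAM) or from 0x40000000 (peripherals).
 *
 *   alias = alias_base + byte_offset * 32 + bit * 4
 */
uint32_t bitband_alias(uint32_t byte_addr, unsigned bit)
{
    uint32_t region_base = byte_addr & 0xF0000000u;
    uint32_t byte_offset = byte_addr - region_base;

    assert(region_base == 0x20000000u || region_base == 0x40000000u);
    assert(byte_offset < 0x00100000u);   /* 1 MB bit-band region */
    assert(bit < 8u);

    return region_base + 0x02000000u + byte_offset * 32u + bit * 4u;
}
```

On a target, a single store such as `*(volatile uint32_t *)bitband_alias(addr, bit) = 1;` would then set the bit. The helper only does the arithmetic, so it can be checked on a host machine.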
NVIC Interrupt Handling
Interrupts are handled in hardware! There’s no instruction overhead
12-cycle Entry:
• Processor state automatically saved to the stack over the data bus.
{PC, xPSR, R0-R3, R12, LR}
• In parallel, ISR is prefetched on the instruction bus.
- ISR is ready to start executing as soon as the stack PUSH completes.
12-cycle Exit:
• Processor state is automatically restored from the stack.
• In parallel, interrupted instruction is prefetched ready for execution upon
completion of stack POP.
• Stack POP can be interrupted, allowing new ISR to be immediately
executed without the overhead of state saving.
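The automatically saved state can be pictured as a C structure. The sketch below lists the eight registers named on the slide in the order the hardware places them in memory, lowest address first; the structure and helper names are illustrative, and an optional alignment padding word that the core may insert next to the frame is ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Registers pushed by hardware on exception entry, from the lowest stack
 * address upward. Names are illustrative. */
struct exception_frame {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;    /* link register of the interrupted code */
    uint32_t pc;    /* address at which execution resumes */
    uint32_t xpsr;  /* program status register */
};

static_assert(sizeof(struct exception_frame) == 32, "frame is eight words");

/* Hypothetical helper: read the resume address out of a saved frame,
 * for example inside a fault handler that wants to log it. */
uint32_t frame_return_address(const struct exception_frame *frame)
{
    return frame->pc;
}
```

Because R0-R3, R12 and LR are exactly the registers a C function is allowed to clobber under the ARM procedure call standard, an ISR can be written as a plain C function with no assembly wrapper.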
Fast Interrupt Response and Tail Chaining
[Timing diagram. Comparison cycle counts: 26, 16, 26, 16. Cortex-M3 tail-chaining, interrupt handling in HW: PUSH (12 cycles), ISR 1, tail-chain (6 CYCLES), ISR 2, POP (12 cycles)]
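The cycle counts on the slide can be turned into a small overhead calculation. The Cortex-M3 figures (12, 6, 12) are taken directly from the diagram. Reading the 26 and 16 as the entry and exit cost of a comparison core without tail-chaining is an assumption, as are all the names below.

```c
#include <assert.h>

enum {
    CM3_ENTRY_CYCLES      = 12,  /* stack PUSH, from the slide */
    CM3_EXIT_CYCLES       = 12,  /* stack POP, from the slide */
    CM3_TAIL_CHAIN_CYCLES = 6,   /* hand-over between back-to-back ISRs */
    REF_ENTRY_CYCLES      = 26,  /* assumption: comparison core entry cost */
    REF_EXIT_CYCLES       = 16   /* assumption: comparison core exit cost */
};

/* Overhead cycles (ISR bodies excluded) for n back-to-back interrupts
 * on the Cortex-M3: one PUSH, n-1 tail-chain hand-overs, one POP. */
unsigned cm3_overhead_cycles(unsigned n_isrs)
{
    assert(n_isrs >= 1u);
    return CM3_ENTRY_CYCLES
         + (n_isrs - 1u) * CM3_TAIL_CHAIN_CYCLES
         + CM3_EXIT_CYCLES;
}

/* Overhead without tail-chaining: a full entry and exit for every ISR. */
unsigned ref_overhead_cycles(unsigned n_isrs)
{
    assert(n_isrs >= 1u);
    return n_isrs * (REF_ENTRY_CYCLES + REF_EXIT_CYCLES);
}
```

For the two-ISR case drawn on the slide this gives 12 + 6 + 12 = 30 cycles with tail-chaining, against 84 cycles under the assumed comparison figures.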
System Timer (SysTick)
Flexible system timer is part of the Cortex-M3 Core
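A typical use is a periodic tick. SysTick is a 24-bit down-counter that reloads and raises an exception when it reaches zero, so the reload value for a given tick rate is the clock divided by the rate, minus one. The sketch below does only that arithmetic; the function name and the choice to return 0 for unrepresentable rates are illustrative.

```c
#include <stdint.h>

#define SYSTICK_MAX_RELOAD 0x00FFFFFFu  /* 24-bit counter */

/* Reload value that yields `tick_hz` ticks per second from a counter clock
 * of `clock_hz`. Returns 0 when the rate cannot be represented: a zero
 * rate, fewer than two clocks per tick, or a period beyond 24 bits. */
uint32_t systick_reload(uint32_t clock_hz, uint32_t tick_hz)
{
    if (tick_hz == 0u) {
        return 0u;
    }
    uint32_t clocks_per_tick = clock_hz / tick_hz;
    if (clocks_per_tick < 2u || clocks_per_tick - 1u > SYSTICK_MAX_RELOAD) {
        return 0u;
    }
    return clocks_per_tick - 1u;
}
```

At 72 MHz a 1 kHz tick needs a reload value of 71999. A 1 Hz tick does not fit in 24 bits at that clock, so slower rates are usually built by counting ticks in software. CMSIS ships a `SysTick_Config()` helper that programs the timer from a similar tick count.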
STM32 On-Chip Flash Memory Interface
Mission: Support 72 MHz operation directly from Flash memory
64-bit wide Flash with prefetch (2 × 64-bit buffers)
[Block diagram. Labels: CORTEX-M3 CPU; Instructions-BUS, 32 bits; Data/Debug-BUS, 32 bits; Flash Interface with ARBITER; FLASH MEMORY ARRAY, 64 bits; 64-bit prefetch buffers holding 16-bit Thumb and 32-bit Thumb-2 instructions; data accesses of 8 bits, 16 bits and 32 bits]
Low Power Modes
[Table of low power modes; column headings not shown. Entries per mode, in source order:]
RUN (from Flash): ON, ON, ON, ON, ON; two entries "Can be enabled"; 230µA/MHz
RUN (from RAM): ON, ON, ON, ON, ON; two entries "Can be enabled"; 185µA/MHz
LP RUN: ON, ON (LS), ON, ON, ON; two entries "Can be enabled"; 11µA; -
LP SLEEP: OFF, OFF, OFF, ON, ON, ON; one entry "Can be enabled"; 6µA; 0.35µs
STANDBY w/full RTC: ON, ON, OFF, OFF; 1µA; 8µs
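The per-MHz figures make it easy to estimate a power budget. The sketch below assumes run current scales linearly with clock frequency and ignores wake-up time; the 16 MHz clock and 1% duty cycle in the usage note are made-up operating points, and the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Run-mode currents from the table, in microamps per MHz. */
#define RUN_FLASH_UA_PER_MHZ 230u
#define RUN_RAM_UA_PER_MHZ   185u

/* Estimated run-mode current in microamps at `mhz`,
 * assuming linear scaling with frequency. */
uint32_t run_current_ua(uint32_t ua_per_mhz, uint32_t mhz)
{
    return ua_per_mhz * mhz;
}

/* Average current of a duty-cycled application in microamps.
 * `active_permille` is the share of time spent running, in 1/1000ths. */
uint32_t average_current_ua(uint32_t run_ua, uint32_t sleep_ua,
                            uint32_t active_permille)
{
    assert(active_permille <= 1000u);
    return (run_ua * active_permille
            + sleep_ua * (1000u - active_permille)) / 1000u;
}
```

Running from Flash at a hypothetical 16 MHz would draw about 3680 µA. If the application is active 1% of the time and spends the rest in LP SLEEP at 6 µA, the average drops to roughly 42 µA.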
MCU Platform for Rapid Innovation
Powerful core
Complete ecosystem
Product lines: Connectivity Line, USB Access Line, Value Line, Performance Line, Access Line, Low Power, Low Power w/ LCD
2D Barcode Scanner
2D Barcode Scanner
[Block diagram. Labels: 802.15.4 Radio; HID Class; Image Processing (DSP Calculations); DAC / I2S]
Innovative System Architecture
[Block diagram. Bus masters: CORTEX-M3 120MHz w/ MPU (Master 1) with I-Bus, D-Bus and S-Bus; Dual Port DMA1 (Master 2) and Dual Port DMA2 (Master 3), each FIFO/8 Streams; High Speed USB2.0 (Master 4) and Ethernet 10/100 (Master 5), each FIFO/DMA. Slaves on the Multi-AHB Bus Matrix: FLASH up to 1Mbytes behind the ART Accelerator (I-Code, D-Code); SRAM1 112KB; SRAM2 16KB; FSMC; AHB2: DCMI, Crypto, USB Full Speed; AHB1 GPIOs; AHB1-APB1 Slow Peripherals; AHB1-APB2 Fast Peripherals. DMA ports: Periph1, Periph2, Mem1, Mem2]
Media Hubs
Convergence of different media delivery options
[Block diagram. External devices: speaker systems, docking stations, host PC with HTML page on host, media player (USB, SD), BT mobile phone, BT headset, camera. Interfaces: FS USB Host, DCMI camera I/F, SPI, UART, MAC, FSMC, I²C, SD card I/F. Software blocks (modular approach): MS Class, Device (I²C ctrl), TCP/IP control, BT stack, media streaming, audio streaming, File System (eFSL), MP3/WMA codec (2 possible instances at the same time), EC/NR, HMI Control & Display, Volume control / Ch. Mixer, Equalizer / Loudness, Authentication Device. Audio output: AUDIO CODEC (STw5098), AMPLIFIER, FM (STLC2690), Speakers / Headset]
Media Hubs
[Block diagram. External devices: host PC with HTML page on host, media player (USB, SD), BT mobile phone, BT headset. Hardware: CMOS camera, Wifi module, touch screen, tablet, key, card, microphone, Eth-PHY, USB conn., SD conn., USB Phy., BT front-end G2-Icon (STLC2690) + antenna. Note: (Evaluation for Apple + USB conn. with analog switches). Functions: & Display, Volume control / Ch. Mixer, Equalizer / Loudness, Authentication Device. Audio output: AUDIO CODEC (STw5098), AMPLIFIER, FM (STLC2690), Speakers / Headset]
Audio Docking Station
[Block diagram. Labels: Cortex™-M3 CPU 120MHz; SRAM; Flash; DMA; 12-bit ADC; Microphone & Pre-amplifier; Volume / Ch. Mixer; 5-band Equalizer; I²S; Audio DAC; Amp; Speakers / Headset; MS Class *; File System *; FS USB Host *; HMI Control & Display; PLL block; XTAL oscillators 32 kHz + 3-25 MHz; USB; FSMC; SDIO; USB Key; Touch Screen; QVGA LCD; Audio Media: USB mass storage device]
Architecture : DMA & Multi-Bus Matrix
[Block diagram. Labels: CORTEX-M3 120MHz w/ MPU (Master 1) with I-Code and D-Code; Dual Port DMA1 (Master 2); High Speed USB2.0 (Master 4); Ethernet 10/100 (Master 5); Multi-AHB Bus Matrix; FLASH up to 1Mbytes behind the ART Accelerator; SRAM1 112KB; SRAM2 16KB; FSMC; AHB2: DCMI, Crypto, USB Full Speed; AHB1-APB1 Slow Peripherals. Callout: Fast Peripheral DMA bus to bypass the bus matrix]
STM32 F-2 ART Accelerator™
Supports 120 MHz operation without penalty
128-bit wide Flash with prefetch
(64 × 128-bit buffers for code, 8 × 128-bit buffers for data)
Intelligent branch management
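The buffer counts translate into concrete sizes with a little arithmetic. The sketch below uses only the numbers on the slide (128-bit lines, 64 for code, 8 for data); the constant and function names are illustrative.

```c
#include <assert.h>

enum {
    ART_LINE_BITS  = 128,  /* width of one Flash read and one buffer */
    ART_CODE_LINES = 64,   /* buffers for code */
    ART_DATA_LINES = 8     /* buffers for data */
};

/* Total capacity of the code buffers, in bytes. */
unsigned art_code_buffer_bytes(void)
{
    return ART_CODE_LINES * ART_LINE_BITS / 8;
}

/* Total capacity of the data buffers, in bytes. */
unsigned art_data_buffer_bytes(void)
{
    return ART_DATA_LINES * ART_LINE_BITS / 8;
}

/* Number of instructions of width `insn_bits` (16 or 32) in one line. */
unsigned art_insns_per_line(unsigned insn_bits)
{
    assert(insn_bits == 16u || insn_bits == 32u);
    return ART_LINE_BITS / insn_bits;
}
```

The code buffers hold 1024 bytes and the data buffers 128 bytes. One 128-bit Flash read delivers four 32-bit or eight 16-bit instructions, which is what lets a Flash array slower than the core keep a 120 MHz pipeline fed on straight-line code.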
[Block diagram. Labels: CORTEX-M3 CPU; Instruction-BUS; Data/Debug-BUS; Arbitration and Branch Management; banks of 128-bit buffers; FLASH MEMORY ARRAY, 128 bits]
ART Accelerator™: the Bottom line!
[Chart: Performance (DMIPS), axis 0 to 160, versus Core Frequency, axis 0 to 150, for STM32F200, MCU A and MCU B. Annotations: "STM32F200 performance is almost linear with frequency"; "Impact of wait states: imperfect accelerator, slow flash"]
Estimating Real World Performance
Designed not to make any library calls that could be optimized away
www.coremark.org
What is CoreMark?
Simple, yet sophisticated
Easily ported in hours, if not minutes
Comprehensive documentation and run rules
Dhrystone Terminator
The benefits of Dhrystone without all the shortcomings
Free, small, easily portable
CoreMark does real work
Exposing Dhrystone Weaknesses
Major portions of Dhrystone are susceptible to a compiler’s
ability to optimize the work away - NOT CoreMark
Library calls are made within the timed portion and dominate
the time consumed by the benchmark - NOT CoreMark
Completely synthetic and does not mimic any behavior that
can be expected in a real application- NOT CoreMark
No official source code resulting in different, and often
undisclosed, versions (1.1, 2.0, 2.1) - NOT CoreMark
Very vague and ambiguous run guidelines are not universally
known and are not enforced - NOT CoreMark
Reporting lacks standardization; various formats in use
(DMIPS, Dhrystones per second, DMIPS/MHz) - NOT
CoreMark
CoreMark Workload Features
Matrix manipulation allows the use of MAC and common math ops
Linked list manipulation exercises the common use of pointers
State machine operation represents data dependent branches
Cyclic Redundancy Check (CRC) is a very common embedded function
Testing for:
A processor’s basic pipeline structure
Basic read/write operations
Integer operations
Control operations
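To make the CRC item in the workload list concrete, here is a minimal bitwise CRC-16 routine of the kind found in embedded code. It is not CoreMark's source: it implements the widely used CRC-16/ARC variant, chosen here for illustration. It also shows why the workload suits a benchmark, since every step is a data-dependent branch plus integer shift and XOR operations.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/ARC: polynomial 0x8005 processed least significant bit
 * first (reflected form 0xA001), initial value 0, no final XOR. */
uint16_t crc16_arc(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u) {
                crc = (uint16_t)((crc >> 1) ^ 0xA001u);
            } else {
                crc >>= 1;
            }
        }
    }
    return crc;
}
```

The standard check input for CRC catalogues, the nine ASCII characters "123456789", produces 0xBB3D with this variant.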
Summary: The Value of CoreMark
EEMBC CoreMark 1.0 - Summary
[Chart: CoreMark [Iter/Sec], axis 0 to 250, versus MHz. Data labels: STM32F2xx (228.6 @ 120MHz); STM32F2xx (190.30 @ 100MHz)]
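CoreMark results are usually compared per megahertz, which removes the clock speed from the comparison. The helper below normalizes the two scores shown in the chart; the function name is illustrative.

```c
/* Normalize a CoreMark score (iterations per second) by the core clock
 * in MHz, giving CoreMark/MHz. */
double coremark_per_mhz(double iterations_per_sec, double core_mhz)
{
    return iterations_per_sec / core_mhz;
}
```

The two data points give 228.6 / 120 = 1.905 and 190.30 / 100 = 1.903 CoreMark/MHz. The near-identical values indicate that the score scales almost linearly with frequency between 100 and 120 MHz.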
High-performance and Low Power!
Let’s review!
Questions?