codenamed Haswell
Per Hammarlund
Haswell Chief Architect, Intel Fellow
August, 2013
Sequence
Family of Innovations!
Power Efficiency and Management
FIVR Fully Integrated Voltage Regulator
Cache Hierarchy and Interconnects
Gfx/Media
Intel Microarchitecture (Haswell): Core
ISA
Wrap Up
Modularity Options
Family of Innovations
Value                Range
Core Count           2-4
Graphics             GT1, GT2, GT3
Active Power Level   Tablet to Desktop
Idle Power           Variable
Cache Size           Variable
Interconnects        Variable
Platforms            Traditional, power optimized
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark, are measured using
specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and
performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to
http://www.intel.com/performance
TOCK: NEW Intel Microarchitecture (Nehalem)
TICK: Intel Microarchitecture (Nehalem)
TOCK: Sandy Bridge, NEW Intel Microarchitecture (Sandy Bridge)
TICK: Ivy Bridge, Intel Microarchitecture (Sandy Bridge)
TOCK: Haswell, NEW Intel Microarchitecture (Haswell)
Power Efficiency and Management
Increased Turbo
New C-states, improved latency
Power efficient features: better than voltage / frequency scaling
Continued focus on gating unused logic and low-power modes
Optimized manufacturing and circuits
[Chart axes: Power, Performance]
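As a rough illustration of the bar that "better than voltage / frequency scaling" sets, the sketch below uses the standard first-order model for dynamic CMOS power, P ~ C * V^2 * f. It is not taken from the slides: it ignores leakage, assumes voltage has to rise roughly in proportion to frequency, and the function name is hypothetical.

```python
def relative_dynamic_power(freq_scale: float, volt_scale: float) -> float:
    """Relative dynamic power under the first-order model P ~ C * V^2 * f.

    Both arguments are ratios against a baseline operating point
    (1.0 means unchanged). Capacitance C is assumed constant.
    """
    return volt_scale ** 2 * freq_scale


# Assumption: a 10% frequency increase needs about 10% more voltage.
power = relative_dynamic_power(freq_scale=1.10, volt_scale=1.10)
print(f"+10% frequency costs about {100 * (power - 1):.0f}% more dynamic power")
```

Under these assumptions, about a third more dynamic power buys 10% more frequency. A feature that delivers the same performance gain for less power than that beats plain voltage/frequency scaling.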
[Diagram: System Agent (Display, DMI, PCI Express*, IMC), four Core + LLC slices, Processor Graphics]
System Power
Evolutionary improvements
[Figure: Power versus Resume Time for three platform states: S0 Active, S0ix Active Idle, S3/4 Sleep. Annotations: great performance; excellent standby power; instant resume; 20x; "Haswell platform is always in this mode."]
FIVR Fully Integrated Voltage Regulator
Ivy Bridge Processor, platform VRs:
  Core VR: variable voltage, 0V-1.2V
  Graphics VR: variable voltage, 0V-1.2V
  PLL VR: 1.8V
  Input/output VR: 1.0V
  System agent VR
  DDR VR (DDRx)
Haswell Platform, Haswell Processor:
  VccIn: 0V-1.8V
  DDR VR (DDRx)
FIVR VRs: Vccsa, Vccio, Vccioa, VccCore 0, VccCore 1, VccCore 2, VccCore 3, VccCache, Graphics0, Graphics1, VccEDRAM, VccOPIO
[Diagram labels: Vccin, FIVR VRs, Logic Blocks, DDRx]
Large inductors, butterfly mounted through board: 5.4mm thick
Haswell: backside bare, 3.4mm thick
Cache Hierarchy and Interconnects
[Diagram: System Agent (Display, DMI, PCI Express*, IMC), four Core + LLC slices, Processor Graphics]
[Diagram: System Agent (PCI Express*, Display, IMC), four Core + LLC slices, Processor Graphics, with IO, 128MB eDRAM, and DDR]
Cache attributes
Inter-frame: capture texture reuse across frames due to continuity between frames
[Figure labels: WiC frame i, DDR, eDRAM]
Gfx/Media
Scalable Architecture partitioned into 6 domains:
Engine (MFX)
5. Video Quality Enhancement Engine
6. Displays
Sets the stage for Scale-up!!
Next Generation Intel Microarchitecture Code Name Haswell
Video Codec
Improved!
New!
New on Haswell
Faster speed at 0.5x power: 4.6x faster speed, 2x power saving over SW encoder
[Chart: Power (W), values 48.4 and 24.74; Speed (FPS), values 60 and 280; labels: 4K Mosaic, HD Mosaic]
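The headline ratios on this slide can be checked against the chart values with a few lines of arithmetic. The pairing used below is an assumption, since the extracted chart does not say which bar belongs to which encoder: 60 FPS at 48.4 W is taken as the software encoder and 280 FPS at 24.74 W as the hardware encoder, because that pairing reproduces the stated "4.6x" and "2x". The variable names are illustrative only.

```python
# Assumed pairing of the chart values (not stated explicitly in the source).
SW_FPS, SW_WATTS = 60.0, 48.4      # software encoder
HW_FPS, HW_WATTS = 280.0, 24.74    # hardware encoder on Haswell

speedup = HW_FPS / SW_FPS                 # how many times faster
power_saving = SW_WATTS / HW_WATTS        # how many times less power

# Energy per encoded frame in joules: watts divided by frames per second.
sw_joules_per_frame = SW_WATTS / SW_FPS
hw_joules_per_frame = HW_WATTS / HW_FPS
energy_gain = sw_joules_per_frame / hw_joules_per_frame

print(f"speedup:          {speedup:.2f}x")
print(f"power saving:     {power_saving:.2f}x")
print(f"energy per frame: {energy_gain:.1f}x lower")
```

Under that assumed pairing, the speed and power gains compound: the energy spent per encoded frame drops by roughly the product of the two ratios.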
Intel Microarchitecture (Haswell): Core
Improved front-end
  Initiate TLB and cache misses speculatively
  Handle cache misses in parallel to hide latency
Deeper buffers
  Extract more instruction parallelism
No pipeline growth
[Diagram: Branch Prediction; Icache Tag, ITLB, µop Cache Tag; Icache Data, µop Cache Data; Decode; µop Queue; µop Allocation; Out-of-Order Execution]
                        Nehalem      Sandy Bridge   Haswell
Out-of-order Window     128          168            192
In-flight Loads         48           64             72
In-flight Stores        32           36             42
Scheduler Entries       36           54             60
Integer Register File   N/A          160            168
FP Register File        N/A          144            168
Allocation Queue        28/thread    28/thread      56
Intel Microarchitecture (Haswell); Intel Microarchitecture (Nehalem); Intel Microarchitecture (Sandy Bridge)
Port 0: Integer ALU & Shift; FMA, FP Multiply; Vector Int Multiply; Vector Logicals; Branch; Divide; Vector Shifts
Port 1: Integer ALU & LEA; FMA, FP Mult, FP Add; Vector Int ALU; Vector Logicals
Port 2: Load & Store Address
Port 3: Load & Store Address
Port 4: Store Data
Port 5: Integer ALU & LEA; Vector Shuffle; Vector Int ALU; Vector Logicals
Port 6: Integer ALU & Shift; Branch
Port 7: Store Address
2xFMA
4th ALU
  Great for integer workloads
  Frees Port0 & 1 for vector
[Chart: Peak FLOPS/clock by generation: Banias, Merom, Sandy Bridge, Haswell]
[Chart: Latency (clks) on Haswell, New vs. Prior Gen, for MulPS/PD, AddPS/PD, and Mul+Add/FMA]
2X cache bandwidth to feed wide vector units
2x L2 bandwidth
                        Nehalem           Sandy Bridge               Haswell
L1 Instruction Cache    32K, 4-way        32K, 8-way                 32K, 8-way
L1 Data Cache           32K, 8-way        32K, 8-way                 32K, 8-way
  Fastest load-to-use   4 cycles          4 cycles                   4 cycles
  Load bandwidth        16 Bytes/cycle    32 Bytes/cycle (banked)    64 Bytes/cycle
  Store bandwidth       16 Bytes/cycle    16 Bytes/cycle             32 Bytes/cycle
L2 Unified Cache        256K, 8-way       256K, 8-way                256K, 8-way
  Fastest load-to-use   10 cycles         11 cycles                  11 cycles
  Bandwidth to L1       32 Bytes/cycle    32 Bytes/cycle             64 Bytes/cycle
L1 Instruction TLB
L1 Data TLB
L2 Unified TLB                                                       4K+2M shared: 1024, 8-way
All caches use 64-byte lines
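One way to read "2X cache bandwidth to feed wide vector units" is to compare L1 load bandwidth with peak arithmetic throughput in each generation. The sketch below does that under stated assumptions: the load bandwidths come from the cache table, the 16 and 32 single-precision FLOPs per cycle come from the ISA slide, and Nehalem's 8 is derived from two 128-bit SSE units (one add, one multiply). The names are hypothetical.

```python
# (L1 load bandwidth in bytes/cycle, peak single-precision FLOPs/cycle)
GENERATIONS = {
    "Nehalem":      (16, 8),    # 8 SP FLOPs/cycle derived, see lead-in
    "Sandy Bridge": (32, 16),
    "Haswell":      (64, 32),
}


def load_bytes_per_flop(load_bytes_per_cycle: int, sp_flops_per_cycle: int) -> float:
    """L1 load bytes available per peak single-precision FLOP."""
    return load_bytes_per_cycle / sp_flops_per_cycle


for name, (load_bw, flops) in GENERATIONS.items():
    ratio = load_bytes_per_flop(load_bw, flops)
    print(f"{name}: {ratio:.1f} L1 load bytes per SP FLOP")
```

With these inputs the ratio is the same in every generation, which suggests the load bandwidth was doubled to keep pace with the doubled peak FLOPs rather than to get ahead of it.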
ISA
               Instruction Set     SP FLOPs per cycle   DP FLOPs per cycle
Nehalem        SSE (128-bits)      8                    4
Sandy Bridge   AVX (256-bits)      16                   8
Haswell        AVX2 & FMA          32                   16
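The per-cycle FLOP figures follow from vector width, element size, and the number of floating-point units. The sketch below shows that arithmetic. It assumes two FP units per core in every generation (one add and one multiply unit before Haswell, two FMA units on Haswell) and counts a fused multiply-add as two FLOPs per lane. The function and parameter names are illustrative, not from the slides.

```python
def peak_flops_per_cycle(vector_bits: int, element_bits: int,
                         units: int, flops_per_lane: int) -> int:
    """Peak FLOPs per cycle per core = units * lanes * FLOPs per lane."""
    lanes = vector_bits // element_bits
    return units * lanes * flops_per_lane


# (name, vector width in bits, FP units, FLOPs per lane per instruction)
CONFIGS = [
    ("Nehalem, SSE",      128, 2, 1),  # one add unit + one multiply unit
    ("Sandy Bridge, AVX", 256, 2, 1),  # one add unit + one multiply unit
    ("Haswell, FMA",      256, 2, 2),  # two FMA units, multiply + add fused
]

for name, bits, units, per_lane in CONFIGS:
    sp = peak_flops_per_cycle(bits, 32, units, per_lane)  # single precision
    dp = peak_flops_per_cycle(bits, 64, units, per_lane)  # double precision
    print(f"{name}: {sp} SP, {dp} DP FLOPs per cycle")
```

The Haswell doubling comes entirely from FMA: the vector width is unchanged from Sandy Bridge, but each of the two units retires a multiply and an add per lane.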
Benefits
High performance computing
Audio & Video
Games
Cryptography
Group                                      Instructions
Bit Gather/Scatter                         PDEP, PEXT
Arbitrary Precision Arithmetic & Hashing   MULX, RORX
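To make the bit gather/scatter group concrete, here is a minimal software model of what PEXT and PDEP compute. It is a Python emulation of the instruction semantics for illustration only, not how the hardware implements them, and the function names simply mirror the mnemonics.

```python
def pext(src: int, mask: int) -> int:
    """Parallel bit extract: gather the bits of src selected by mask
    into the low-order bits of the result, preserving their order."""
    result, out_bit = 0, 0
    while mask:
        lowest = mask & -mask        # isolate the lowest set bit of mask
        if src & lowest:
            result |= 1 << out_bit
        out_bit += 1
        mask &= mask - 1             # clear that bit and continue
    return result


def pdep(src: int, mask: int) -> int:
    """Parallel bit deposit: scatter the low-order bits of src
    to the bit positions selected by mask."""
    result, in_bit = 0, 0
    while mask:
        lowest = mask & -mask
        if src & (1 << in_bit):
            result |= lowest
        in_bit += 1
        mask &= mask - 1
    return result


packed = pext(0xABCD, 0x0F0F)        # gathers nibbles B and D
print(hex(packed))                   # 0xbd
print(hex(pdep(packed, 0x0F0F)))     # 0xb0d
```

The two operations are inverses on the masked bits: depositing an extracted value with the same mask gives back `src & mask`.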
SHA-256:
RORX Rotates, AVX2
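SHA-256 is rotate-heavy, which is where RORX helps: it rotates right into a separate destination register without modifying the flags, so several rotations of the same input can be issued back to back. The sketch below models the two "big sigma" functions from the SHA-256 specification with a plain 32-bit rotate standing in for RORX. It is a Python illustration of the arithmetic, not Intel's implementation, and the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def rotr32(x: int, n: int) -> int:
    """Rotate a 32-bit value right by n bits (what RORX computes)."""
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32


def big_sigma0(a: int) -> int:
    """SHA-256 Sigma0: three rotations of the same word, XORed together."""
    return rotr32(a, 2) ^ rotr32(a, 13) ^ rotr32(a, 22)


def big_sigma1(e: int) -> int:
    """SHA-256 Sigma1: three rotations of the same word, XORed together."""
    return rotr32(e, 6) ^ rotr32(e, 11) ^ rotr32(e, 25)


print(hex(big_sigma0(1)))  # 0x40080400
print(hex(big_sigma1(1)))  # 0x4200080
```

Each sigma needs three rotations of one source value. A destructive rotate would require a register copy before each one, which is the overhead a non-destructive rotate avoids.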
Locks!
[Chart: Cached Lock Performance; y-axis: Min Cycles/Lock, 0 to 35]
[Chart: tsx, 4 Core HT system; y-axis: Speedup over Baseline with 1 Thread, 0 to 7; workloads: ssca2, ua, physicsSolver, nufft, histogram, canneal, AVG.]
New Accessed and Dirty bits for Extended Page Tables (EPT) eliminate a major cause of vmexits
Overhauled TLB invalidations: lower latency, less serialization
New VMFUNC instruction enables hyper-calls without a vmexit
Intel VT-d adds 4-level page walks to match Intel VT-x
Haswell reduces round-trip to <500 cycles
[Chart: round-trip cycles, y-axis 0 to 3500]
Intel Virtualization Technology (Intel VT) for Directed I/O (Intel VT-d); Intel Virtualization Technology for IA-32,
Intel 64 and Intel Architecture (Intel VT-x); Intel Microarchitecture (Haswell)
Wrap Up!
Huge family: SOC methodology, common architecture
Low power platform: 20x idle power reduction, low power IO (I2C, SDIO, I2S, UART), link power management (USB, PCIe, SATA)
Large eDRAM Cache
Platform: PSR (Panel Self Refresh)