NIOS II Processor

Outline
• What is a "soft" processor?
• What is the NIOS II?
• Architecture of the NIOS II and its implications
  - TigerSHARC vs. NIOS II
  - Pipeline issues
  - Issues related to FIR
• Hardware acceleration using FPGA logic
What Is a "Soft" Processor?
• Processor implemented in VHDL, Verilog, etc., and downloaded onto FPGA hardware
• Can implement many parallel processors on one FPGA
• Can use additional FPGA resources on the same chip that are not part of the processor core
• NIOS II is a "soft" processor

Why a "Soft" Processor?
• Higher level of design reuse
• Reduced obsolescence risk
• Simplified design update or change
• Increased design implementation options
• Lower latency between processor and FPGA components
What Is NIOS II?
• Software-defined processor
• The processor core is loaded onto an FPGA
• Programmed using "normal" programming tools (C, asm), not hardware description languages
• Can use the rest of the FPGA hardware for accelerating parts of the code
How Is NIOS II Implemented?
• The custom FPGA logic that interacts with the processor is implemented in Altera Quartus II
• The Avalon Interface bus (common instruction/data bus) is implemented in Quartus II
• The architecture is generated in Quartus II and used for programming in the Eclipse IDE
NIOS II IDE
• Coding is done in Eclipse rather than VisualDSP.
The Different NIOS II Cores
• There are 3 cores available from Altera
  - NIOS II/e: Economical core
  - NIOS II/s: Standard core
  - NIOS II/f: Fast core
What's the Difference between the Cores?
An LE is equivalent to an 8-1 NAND gate + 1 D flip-flop
An ALM is equivalent to 2 LEs
Comparison of TigerSHARC and NIOS II Architecture
TigerSHARC Architecture
NIOS II Architecture
- Thirty-two 32-bit general-purpose registers, six 32-bit control registers
- Variable cache size, depending on how much FPGA space is available
- ALU: 32-bit, two inputs to one output; performs shifts, logic, and arithmetic. The shifter is not a separate unit as in the TigerSHARC
Avalon Interface
- Separate address, data, and control lines
- Data width up to 1024 bits; can be set to any width (need not be a power of 2)
- One transfer per clock cycle
NIOS II/f pipeline
• Six stages
• One instruction can be dispatched and/or retired per cycle
• Dynamic branch prediction: 2-bit branch history table (no BTB like in TigerSHARC)
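The 2-bit dynamic prediction mentioned above can be illustrated with a saturating counter. This is a minimal sketch of the general 2-bit scheme, not the actual NIOS II/f hardware; the names `predictor_t`, `predict_taken`, `update`, and `count_mispredictions`, as well as the initial counter state, are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* One 2-bit saturating counter: states 0 and 1 predict "not taken",
 * states 2 and 3 predict "taken". Hypothetical model of a single
 * branch-history entry. */
typedef struct {
    unsigned state; /* always in the range 0..3 */
} predictor_t;

/* The prediction is "taken" when the counter is in its upper half. */
bool predict_taken(const predictor_t *p)
{
    return p->state >= 2;
}

/* Move the counter one step toward the actual outcome,
 * saturating at 0 and 3. */
void update(predictor_t *p, bool taken)
{
    if (taken) {
        if (p->state < 3) p->state++;
    } else {
        if (p->state > 0) p->state--;
    }
}

/* Count mispredictions for a loop branch that is taken on every pass
 * except the last. The counter starts at 0 ("strongly not taken"),
 * which is an assumed initial state. */
unsigned count_mispredictions(size_t iterations)
{
    predictor_t p = { 0 };
    unsigned misses = 0;
    for (size_t i = 0; i < iterations; i++) {
        bool taken = (i + 1 < iterations); /* last pass exits the loop */
        if (predict_taken(&p) != taken) misses++;
        update(&p, taken);
    }
    return misses;
}
```

Under these assumptions, a long-running loop branch is mispredicted three times: on the first two passes, while the counter warms up, and on the final pass, when the loop exits.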
NIOS II/f pipeline
The pipeline stalls for:
- Multi-cycle instructions
- Cache misses
- Data dependencies (2 cycles between calculating and using a result)
Mispredicted branch penalty: 3 cycles

Hardware Multiply
• Different multiplier options can be chosen (at the processor design stage)
  - No h/w multiply (saves FPGA gates)
    - Speed depends on algorithm
  - Use embedded multipliers (if the FPGA has them)
    - 1-5 cycles (depends on FPGA)
  - Implement multipliers in FPGA gates
    - 11 cycles
  - Division: 4-66 cycles in hardware
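With no hardware multiplier, multiplication falls back to a software routine, which is why its speed depends on the algorithm. A minimal sketch of one common approach, shift-and-add, follows. The function name `soft_mul32` is hypothetical, and this is not claimed to be the routine any particular NIOS II toolchain uses.

```c
#include <stdint.h>

/* Shift-and-add multiply: for each set bit in b, add a correspondingly
 * shifted copy of a. The loop runs once per significant bit of b, so the
 * cost varies with the operand. Returns the low 32 bits of the product. */
uint32_t soft_mul32(uint32_t a, uint32_t b)
{
    uint32_t product = 0;
    while (b != 0) {
        if (b & 1u) {
            product += a;   /* unsigned arithmetic wraps modulo 2^32 */
        }
        a <<= 1;            /* the next bit of b is worth twice as much */
        b >>= 1;
    }
    return product;
}
```

Because the loop ends as soon as the remaining bits of `b` are zero, small multipliers finish quickly while large ones take up to 32 passes.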
Compare to TigerSHARC
• No support for parallel instructions
• No support for SIMD operations
• Multicycle instructions stall the pipeline
All the above limitations can be overcome by using FPGA space unoccupied by the processor itself
Comparison of NIOS II and TigerSHARC on an FIR Algorithm
Integer FIR Algorithm
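/* Filter coefficients and an impulse input */
int coeff[] = {1, 2, 3, 4, 5, 6, 7, 8};
int data1[] = {1, 0, 0, 0, 0, 0, 0, 0};
int output[8];
int i = 0, j = 0, k = 0;

/* Clear the output buffer */
for (k = 0; k < 8; k++) output[k] = 0;

/* 8 multiply-accumulates per output sample */
for (j = 0; j < 8; j++)
{
    for (i = 0; i < 8; i++)
    {
        output[j] += data1[i] * coeff[7 - i];
    }
}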
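The loop above applies the same eight multiply-accumulates for every j. For comparison, a conventional direct-form FIR slides the coefficient window along the input. The sketch below is an assumption about what a general version might look like, not the code on the slide; `fir_filter` and its parameters are hypothetical names.

```c
#include <stddef.h>

/* Direct-form FIR: y[n] = sum over k of coeff[k] * x[n - k],
 * treating samples before the start of x as zero. */
void fir_filter(const int *x, const int *coeff, int *y,
                size_t num_samples, size_t num_taps)
{
    for (size_t n = 0; n < num_samples; n++) {
        int acc = 0;
        /* Stop at k == n so that x[n - k] never reads before x[0]. */
        for (size_t k = 0; k < num_taps && k <= n; k++) {
            acc += coeff[k] * x[n - k];
        }
        y[n] = acc;
    }
}
```

Feeding it the impulse input `data1` with the coefficients `coeff` from the slide returns the coefficients themselves, which is a quick sanity check for any FIR implementation.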
Speed Analysis
[Instruction listing for the FIR loop; not recoverable from the source]
Speed Analysis
• 9 cycles per iteration, except the first two (branch predicted not taken) and the last (branch predicted taken); those take 9+3=12 cycles
• 1 data stall; can be removed by moving the instruction from line 4 to line 7
• Speed: 8 cycles * (N-3) + 11 cycles * 3 = 8*(N-3)+33 cycles
• For 1024-tap FIR: 8201 cycles
• Clock cycle is 3 times longer (200 MHz vs 600 MHz)
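The cycle estimate above can be checked with a few lines of C. This sketch only encodes the slide's formula (8 cycles per iteration, 11 for the three mispredicted iterations) and the 200 MHz vs 600 MHz clock ratio; the function names are hypothetical.

```c
/* Cycle estimate from the slide: N - 3 iterations at 8 cycles each,
 * plus 3 mispredicted iterations at 8 + 3 = 11 cycles each.
 * Assumes taps >= 3. */
unsigned long nios2_fir_cycles(unsigned long taps)
{
    const unsigned long base_cycles = 8;     /* per iteration, data stall removed */
    const unsigned long branch_penalty = 3;  /* mispredicted branch */
    const unsigned long mispredicted = 3;    /* first two iterations and the last */
    return base_cycles * (taps - mispredicted)
         + (base_cycles + branch_penalty) * mispredicted;
}

/* Express NIOS II cycles as TigerSHARC cycles of equal duration:
 * one cycle at 200 MHz lasts as long as 600 / 200 = 3 cycles at 600 MHz. */
unsigned long to_tigersharc_cycles(unsigned long nios2_cycles)
{
    const unsigned long nios2_mhz = 200;
    const unsigned long tigersharc_mhz = 600;
    return nios2_cycles * tigersharc_mhz / nios2_mhz;
}
```

For 1024 taps, `nios2_fir_cycles(1024)` gives 8201, and `to_tigersharc_cycles(8201)` gives 24603.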
Speed Comparison
8201 NIOS II cycles are equivalent to 24603 TigerSHARC cycles
Lab3 timing:
- 56000 cycles, debug mode
- 13000 cycles, unoptimized ASM
- 4000 cycles, optimized ASM
Worse than unoptimized assembly, but no hardware acceleration was used, so this is not that bad
Hardware Acceleration
• The profiling tool in Eclipse can show how long each function takes
• If a function takes too long, it can be sped up by
  - Custom instructions
  - Hardware acceleration
• Hardware acceleration takes the function and transforms it into FPGA circuitry
Hardware Acceleration
• Can be done using the C2H compiler from Altera
• Trades off logic size for speed-up
Conclusion
• "Soft" processors such as the NIOS II offer another alternative in the embedded systems scene.
• The NIOS II offers the advantage of added configurability and customization that blur the line between FPGAs and DSPs
