
Design and Development of a

General-Purpose Processor

Scholar: Mr. Talal Khaliq

Project Investigator: Dr. Awais M. Kamboh

Project Co-Investigator: Dr. Shafqat Khan

School of Electrical Engineering and Computer Science

National University of Sciences and Technology

Islamabad, Pakistan

Date Submitted: 31 Jan 2018


Abstract
Design and development of a processor is a very long, challenging, and expensive endeavor. This is
one of the reasons why only a few countries and companies design general purpose processors.
However, modern security paradigms dictate that a certain level of processor-design understanding
and capability must be achieved so that a processor could be designed in future with custom
capabilities, if needed.

This project explores the design of a 32-bit general-purpose processor based on a Reduced
Instruction-Set Computer (RISC) architecture. A simple 32-bit RISC processor has been evaluated and
tested on an FPGA. For the initial proof of concept, an open-source 8-bit processor was selected and
synthesized to run on an FPGA, executing custom code written in C. Thereafter, the specifications
were upgraded to implement a 32-bit processor. Several open-source and commercial processors were
studied. Once a sufficient understanding of the architecture was achieved, an open-source
processor was selected, synthesized, and tested. Finally, a new peripheral was integrated into the
processor to enhance its capabilities and to adapt it for better performance in different
applications.

32-Bit Processor Design 2


Table of Contents
Ch 1: Introduction
Ch 2: Processor Architecture
Ch 3: 8-bit Processor
Ch 4: 32-bit Processor
Ch 5: Processor Extension and Customization
Ch 6: Conclusion

Chapter 1: Introduction
Objectives

To help in exploring the foundations of processor design and to obtain an architecture on which
further components could be added for improved and increased functionality.

Brief

The purpose of the project is to explore the design of a 32-bit general purpose processor which is
synthesizable on an FPGA.

Deliverables

1. Basic architecture of a 32-bit general purpose processor
2. Synthesizable processor core in Verilog / VHDL
3. Final report and source code

Organization of this Report

The report starts by discussing the basic structure of a processor in Chapter 2. Chapter 3 discusses
the bare minimum design of a processor and builds an 8-bit processor as a proof of concept. It is
synthesized, mapped, and tested on an FPGA. The processor is then modified and tested again.
Based on the architecture of Chapter 3, Chapter 4 moves to a 32-bit RISC architecture with a more
complex structure. Again the 32-bit processor is synthesized, mapped and tested on an FPGA.
Finally, ways to incorporate custom peripherals to the processor are explored in Chapter 5, leading
to the conclusion of the report in Chapter 6.

Chapter 2: Processor Architecture

Introduction:
A processor is a device capable of manipulating information in a way specified by a set of
instructions. This instruction set, in turn, defines the capabilities of the processor. A sequence of
these instructions forms a program that controls the machine. Each family of processors has a
different instruction set, so the functionality varies. The sequence of instructions may be changed
as needed to change the application.

Instruction Set and RISC:


The instruction set is the processor's vocabulary. Just as words can be combined into a sequence to
make sentences, instructions can be combined to make programs. Complex programs are broken down
into instructions by the compiler and encoded in 1s and 0s (machine language). The processor reads
and executes these instructions.

There are two major approaches to instruction set architecture:

a. Complex Instruction Set Architecture (CISC)
b. Reduced Instruction Set Architecture (RISC)

CISC processors include the Intel x86, Motorola 68xxx and National Semiconductor 32xxx series. RISC
processors include Sun's SPARC, ARM, Microchip PIC and Atmel's AVR.

The memory of a processor contains both the instructions that it will execute and the data it will
manipulate. Instructions are fetched (read) from memory, while data is both read from and written to
memory. This form of computer architecture is known as the von Neumann Architecture. Most CISC
processors use this form of architecture.

Fig 1: von Neumann Architecture

A deviation from von Neumann is the Harvard Architecture, in which instructions and data have
different memory spaces with separate address, data and control buses for each memory space. Its
main advantage is that instruction and data fetches can occur concurrently. In our current
study, we will stick with a RISC processor conforming to the Harvard Architecture.

Fig 2: Harvard Architecture

Processor Components:
The main components of a processor are:

1. Decoder
2. Memory
3. Bus
4. Peripherals
5. Arithmetic Logic Unit (ALU)

Other add-ons found on high-performance and commercial processors may include advanced
pipelining, a Floating Point Unit, co-processors, high-performance buses etc.

Fig 3: Processor Internals

Instruction Decoder:
The Instruction Decoder reads the next instruction (addressed by the program counter, which is then
incremented) from memory, and sends the component pieces of that instruction to the Arithmetic
Logic Unit (ALU) for execution. For each machine-language instruction, the control unit produces the
sequence of pulses on each control signal line required to implement that instruction (and to fetch
the next instruction). Many processors are designed with single-cycle execution.

The RISC instruction decoder is typically a very simple device. Because RISC instruction words have
a fixed length, the positions of the fields are fixed, and the processor reads the entire instruction
into the instruction register. We can therefore decode an instruction simply by separating the
machine word in the instruction register into its fields.
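
The field separation described above can be sketched in a few lines of Python. This is only an illustration: the field layout is borrowed from the SPARC V8 "format 3" arithmetic instructions, and the helper names (bits, sign_extend, decode) are hypothetical, not part of any core discussed in this report.

```python
# Sketch: decoding a fixed-length 32-bit RISC instruction by slicing bit fields.
# Assumption: SPARC V8 format-3 layout (op, rd, op3, rs1, i, then rs2 or simm13).

def bits(word, hi, lo):
    """Return bits hi..lo (inclusive) of a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def sign_extend(value, width):
    """Interpret a width-bit value as a two's-complement integer."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign

def decode(word):
    """Split an instruction word into its fixed-position fields."""
    fields = {
        "op": bits(word, 31, 30),
        "rd": bits(word, 29, 25),
        "op3": bits(word, 24, 19),
        "rs1": bits(word, 18, 14),
        "i": bits(word, 13, 13),
    }
    if fields["i"]:
        # Immediate form: the second operand is a signed 13-bit constant.
        fields["simm13"] = sign_extend(bits(word, 12, 0), 13)
    else:
        # Register form: the second operand is a register number.
        fields["rs2"] = bits(word, 4, 0)
    return fields

example = decode(0x86004002)  # encodes "add %g1, %g2, %g3"
```

Because every field sits at a fixed position, the decoder needs no state: each field is a shift and a mask, which in hardware is just wiring.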

Memory:
Processors use memory for storing the instructions and the data used by the instruction decoder and
the ALU. In a Harvard architecture, the data memory unit and the instruction memory unit are two
different units.
The memory unit is typically one of the slowest components of a processor, because the external
interface with RAM is typically much slower than the processor itself. For this reason, a high-speed
bus has to be designed. Memory is also a technology-dependent part of a processor: its
implementation depends on the target technology used for synthesis on an FPGA or fabrication as an
ASIC. This problem is encountered later when we deal with the 8- and 32-bit processors.

Fig 4: Processor Memory

Bus:
A bus is a communication system that physically links the components to each other, allowing the
transfer of control information or data between them. Communication between the functional
blocks of single-chip systems was first provided by bus-based architectures. There are many open
source and commercial bus architectures, including OpenCores' Wishbone, Altera's Avalon, ARM's
AMBA and IBM's CoreConnect. We will draw a comparison between them later.

Fig 5: Advanced Microcontroller Bus Architecture (AMBA)

Arithmetic Logic Unit (ALU):


The component that performs arithmetic and logical operations is called the ALU. The ALU is one of
the most important components in a processor. Once the ALU is designed, the rest of the processor
is implemented to feed operands and control codes to the ALU. ALUs typically need to be able to
perform the basic logical operations (AND, OR) and the addition operation.

Fig 6: Arithmetic Logic Unit (ALU)

The ALU performs operations on the one (or two) operands decoded by the instruction decoder and
read from memory. The inclusion of inverters on the inputs enables the same ALU hardware to
perform the subtraction operation (adding an inverted operand with a carry-in of one), and the
operations NAND and NOR.
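
A minimal behavioural sketch of this idea is given below, assuming an 8-bit operand width and hypothetical function and parameter names. It shows how a core that only implements AND, OR and ADD also yields subtraction, NAND and NOR once the inputs can be inverted.

```python
# Sketch: an ALU core with AND, OR and ADD plus optional input inverters.
# Assumption: 8-bit operands; all names are illustrative only.
WIDTH = 8
MASK = (1 << WIDTH) - 1

def alu(a, b, op, invert_a=False, invert_b=False, carry_in=0):
    """Apply the optional input inverters, then the selected core operation."""
    if invert_a:
        a = ~a & MASK
    if invert_b:
        b = ~b & MASK
    if op == "and":
        return a & b
    if op == "or":
        return a | b
    if op == "add":
        return (a + b + carry_in) & MASK
    raise ValueError(f"unknown op: {op}")

def sub(a, b):
    # Two's complement: a - b = a + (~b) + 1
    return alu(a, b, "add", invert_b=True, carry_in=1)

def nor(a, b):
    # De Morgan: ~(a | b) = ~a & ~b, so NOR reuses the AND path
    return alu(a, b, "and", invert_a=True, invert_b=True)

def nand(a, b):
    # De Morgan: ~(a & b) = ~a | ~b, so NAND reuses the OR path
    return alu(a, b, "or", invert_a=True, invert_b=True)
```

For example, sub(3, 5) wraps around to 254, which is the 8-bit two's-complement encoding of -2.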

Fig 6b: Block Diagram of a computer and microprocessor

Chapter 3: 8-bit Processor

Introduction:
To understand how a processor works and how it can be synthesized onto an FPGA, we chose an
open-source core compatible with the Intel 8051 architecture. Many open-source and commercial IP
cores are available. Open-source 8051 IP cores include Oregano Systems mc8051 and OpenCores'
T51 and 8051, while commercial IP cores include Evatronix R8051XC2, e8051 and Digital Core
Design DP8051CPU.

Of all the above-mentioned 8051 cores, the R8051XC2 is claimed to be the fastest fully configurable
8051, achieving a speed of 350 MHz. However, its code is not open source and is meant for
commercial purposes. For educational use, the candidates were the cores from OpenCores and
Oregano Systems. The OpenCores cores had the disadvantage that they were not easy to synthesize
and the documentation provided was not helpful. Thus, the 8051 microcontroller core written in
VHDL by Oregano Systems was chosen.

Oregano Systems mc8051:


The main features for which it was chosen are:
• Open-source VHDL code
• Instruction set compatible with the industry-standard 8051 microcontroller (Intel architecture)
• Technology independent (FPGA and ASIC)
• Active timer/counter and serial interface units selectable via an additional special function register
• Parameterizable via VHDL constants
• 256 bytes of internal RAM
• Up to 64 Kbytes of ROM and up to 64 Kbytes of external RAM
• A matching target device is available in the ARM Keil compiler for software development
The core can be divided into (see diagram):

1. Control Unit
2. ALU
3. Timer / Counter (parameterizable)
4. Serial Interface (parameterizable)

The Control Unit is further divided into a memory controller and a Finite State Machine (FSM). Note
that the core does not contain any memory unit, such as RAM or ROM, to store instructions.
Memories are added when the top module is created for synthesis and simulation, using the
selected target technology. The following table shows the signals of the 8051 core:

Signal Name     Description
clk             System Clock
reset           Asynchronous reset for all Flip Flops
all_tx0_i       Timer 0 interrupt
all_tx1_i       Timer 1 interrupt
all_rxd_i       Receive data input for serial interface units
int0_i          Interrupt 0 input
int1_i          Interrupt 1 input
p0_i            Port 0 Input
p1_i            Port 1 Input
p2_i            Port 2 Input
p3_i            Port 3 Input
all_rxdwr_0     Data direction signal for bidirectional RXD input / output
all_txd_o       Transmit Data output for serial interface
all_rxd_o       Receive data output mode 0 operation for serial interface
p0_o            Port 0 output
p1_o            Port 1 output
p2_o            Port 2 output
p3_o            Port 3 output

Fig 7: Oregano mc8051 Core

Tools Required for Synthesis and Simulation:


1. For synthesis and simulation, Xilinx ISE 14.5 was installed on a 64-bit Windows 10 computer and
configured to use the 64-bit (nt64) XST tools.

2. For compiling C programs for the 8051, Keil C51 was installed, which has a built-in target
specification for the Oregano 8051 core. Building a C file (for example, BLINKY.c or Fibonacci.c)
produces the corresponding .hex file.

Fig 8: Target 8051 Microcontroller in Keil c51

3. HEX-to-BIN converter
4. BIN-to-COE converter (the purpose of tools 3 and 4 is discussed later)

mc8051 Top Module:
Two projects for the 8051 were created in Xilinx ISE, one for synthesis and one for simulation. The
Spartan 3E (XC3S500E) was initially chosen for both. However, synthesis showed that the design
needed about 200% of the IOBs available on the Spartan 3E, so we had to move to the Virtex-5
(XC5VFX70T) evaluation board.

The top module for the 8051 was written in VHDL. It instantiates the components of the 8051 core
as well as the memories: 128 x 8 RAM, 64k x 8 ROM and 64k x 8 external RAM. The memories were
created with the Core Generator in Xilinx ISE. The configuration for the 128 x 8-bit RAM is as follows:

• Single Port RAM
• Minimum Area
• Read / Write Width: 8
• Write / Read Depth: 128
• Enable (ENA) Pin
• Write First
• Reset (RSTA)

Fig 9: RAM Configuration in Xilinx

Configuration for ROM is as follows:

➢ Single Port ROM
➢ Minimum Area
➢ Read Width: 8
➢ Read Depth: 65536
➢ Always Enabled
➢ Load Init File (COE File)
➢ Use Reset (RSTA) pin

Fig 10: ROM Configuration in Xilinx

Configuration for XRAM (External RAM) is as follows:

➢ Single Port RAM
➢ Minimum Area
➢ Write / Read Width: 8
➢ Write / Read Depth: 65536
➢ Write First
➢ Always Enabled
➢ Use Reset (RSTA) Pin

Fig 11: XRAM Configuration in Xilinx

In addition, a Phase Locked Loop (PLL) from the Xilinx Core Generator was used to derive the desired
frequency (11.675, 25 or 40 MHz) from the 100 MHz FPGA system clock. This component was also
instantiated in the top module.

The architecture of the top module, generated by PlanAhead (pre-synthesis), is shown below:

Fig 12: Top Module (Plan Ahead Pre-Synthesis)

Work on Keil C51 (Microcontroller):


For the 8051 core to work on the FPGA, we had to create a HEX file from a C file written for the
Oregano mc8051. Note that, before compilation, the frequency of the target core must be set to the
same value as the PLL output. The code used was BLINKY, an example shipped with Keil C51. After a
successful compile and build, the HEX file was created.

Fig 13: Keil c51 IDE

Conversion from HEX to COE:


Normally, the HEX file is loaded into the microcontroller's ROM as the instructions that execute a
particular function. On an FPGA, however, the ROM created with the Xilinx Core Generator does not
use a HEX file. Instead, it loads a Coefficient (COE) file.

Some open-source tools are available to convert a HEX file to a COE file, but most are not
compatible with 64-bit Windows. For this purpose, an alternative set of tools (numbers 3 and 4 in
the tools section) was used, which converts the HEX file to a BIN file and then the BIN file to a COE
file. These tools are run from the Windows Command Prompt as shown:

Fig 14: HEX to COE Conversion in Command Prompt
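
For readers who want to see what such a conversion involves, the following Python sketch goes from Intel HEX text directly to COE text. It is not the tool pair used in this project; it is a minimal illustration that assumes 16-bit addressing (data and end-of-file records only), fills unwritten addresses with 0x00, and uses hypothetical function names.

```python
# Sketch: Intel HEX -> Xilinx COE conversion (illustrative, not the project tools).

def parse_intel_hex(text, size=None):
    """Return the memory image described by the data records of an Intel HEX file."""
    image = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if not line.startswith(":"):
            raise ValueError("record does not start with ':'")
        record = bytes.fromhex(line[1:])
        # All bytes of a record, including the checksum, sum to 0 modulo 256.
        if sum(record) & 0xFF != 0:
            raise ValueError("checksum mismatch")
        count = record[0]
        address = (record[1] << 8) | record[2]
        rtype = record[3]
        data = record[4:4 + count]
        if rtype == 0x00:            # data record
            for offset, byte in enumerate(data):
                image[address + offset] = byte
        elif rtype == 0x01:          # end-of-file record
            break
        # Other record types (extended addressing) are ignored in this sketch.
    length = size if size is not None else (max(image) + 1 if image else 0)
    return bytes(image.get(addr, 0) for addr in range(length))

def to_coe(data):
    """Format bytes as COE text with one hexadecimal byte per memory word."""
    words = ",\n".join(f"{byte:02X}" for byte in data)
    return ("memory_initialization_radix=16;\n"
            "memory_initialization_vector=\n" + words + ";\n")
```

A typical use would be to read the Keil-generated .hex file as text, call parse_intel_hex on it, and write the string returned by to_coe to the .coe file referenced by the ROM core.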

The resulting COE file is referenced by the ROM core during core generation (see diagram). It holds
the instructions that the 8051 core executes once the design is programmed into the FPGA.

Fig 15: COE file load in Xilinx Generated ROM Core

Implementation Problems on FPGA:


Before implementation, a User Constraints File (UCF) was created in the project. The on-board FPGA
clock is 100 MHz, whereas the program loaded from the HEX file was built for the default 12 MHz
clock. This clock mismatch caused wrong results on the LEDs connected to P0 of the 8051 core.

To deal with this problem, a PLL core was inserted between the FPGA clock and the 8051 core clock.
The resulting clock was matched at 11.675 MHz. The core schematic is as follows:

Fig 16: Core RTL Schematic in Xilinx

Synthesis and Implementation on FPGA:
Once all problems were resolved, the programming file was generated and loaded into the FPGA,
where it worked smoothly. We tried different clock speeds to check the timing and power results of
the synthesis process. 40 MHz was the highest clock speed achieved by the 8051 core. A comparison
for 25 MHz and 40 MHz using different design strategies is given below:

Fig 17: Comparison on different design strategies

Simulation of mc8051:
A local testbench was created for the Fibonacci.c file, which was loaded into a ROM in the same way
as in the synthesis process. It was then simulated using Xilinx ISim. The output integer values were
driven on Port 0 (p0_o).

Fig 18: Fibonacci Simulation
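
When checking such a waveform, a small reference model of the expected port values can be handy. The sketch below is an assumption about how the program behaves, not a description of Fibonacci.c itself: it assumes the sequence starts at 0, 1 and that each term is truncated to the 8-bit port width.

```python
# Sketch: reference values for comparing against p0_o in simulation.
# Assumptions: sequence starts 0, 1; each term is truncated to the port width.

def fibonacci_port_values(count, width=8):
    """Return the first `count` Fibonacci terms as seen on a `width`-bit port."""
    mask = (1 << width) - 1
    a, b = 0, 1
    values = []
    for _ in range(count):
        values.append(a & mask)   # only the low `width` bits reach the port
        a, b = b, a + b
    return values
```

With an 8-bit port, the terms up to 233 appear unchanged, and the next term (377) would show up as 121 under this truncation assumption.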

Configuration of mc8051 for extra peripherals:


The original microcontroller design offers only 2 timers, one serial interface and 2 external interrupt
units. The number of these peripherals can be increased or decreased through constants in the VHDL
core. However, so that all registers of the generated units can be reached without changing the
address space of the microcontroller, only two 8-bit registers are added as additional special
function registers (SFRs). These are TSEL (address 0x8E, for timer/counter units) and SSEL (address
0x9A, for serial interface units). If these registers point to a nonexistent device number, the default
unit number 1 is selected. Efforts were made to declare these SFRs in Keil.

REG51.H is referenced by the C file in Keil. The SFRs declared are shown below:

Fig 19: REG51 SFRs

As an example, the 25 MHz synthesizable core was chosen. In this core, the file mc8051_p.vhd
contains a parameter named 'C_IMPL_N_TMR'. It can take values from 1 to 256, and its default value
is 1. We changed its value to 2, which generated 2 extra timer units, 1 additional serial port and 1
additional external interrupt source. The initial peripheral diagram (pre-synthesis) is shown below:

Fig 20: Default I/Os:74 and C_IMPL_N_TMR=1

Similarly, the core with C_IMPL_N_TMR = 2 has 82 I/Os, as shown below:

Fig 21: I/Os:82 and C_IMPL_N_TMR=2

Chapter 4: 32-bit Processor

Introduction:
The complexity of designing processors has increased over time. Designing each and every hardware
component of the system from scratch soon became impractical and expensive for most designers.
Therefore, the idea of using pre-designed and pre-tested IP cores in designs became an attractive
alternative. Softcore processors are processors whose architecture and behavior are fully described
using synthesizable Hardware Description Languages (HDLs) like Verilog or VHDL. They can be easily
synthesized to an FPGA or ASIC.

The use of these processors has advantages such as:

• Customizable
• Technology independent
• Easily understandable

As with the 8051, we will look at different open-source and commercial IP cores to find the best
32-bit RISC processor that can be easily customized to our needs.

Evaluation of Processors:
There are many 32-bit processors available, such as Altera Nios II, Xilinx MicroBlaze, Tensilica Xtensa,
OpenCores OpenRISC 1200 and Gaisler Leon 3. An overall comparison between them is drawn below:

Category              Nios II       MicroBlaze    Xtensa        OpenRISC 1200   Leon3
Max Frequency (MHz)   200 (FPGA)    200 (FPGA)    350 (ASIC)    300 (ASIC)      400 (ASIC), 125 (FPGA)
Cache                 Up to 64 KB   Up to 64 KB   Up to 32 KB   Up to 64 KB     Up to 256 KB
Pipeline Stages       6             3             5             5               7
Custom Instructions   Up to 256     None          Unlimited     Unspecified     None
Implementation        FPGA          FPGA          FPGA, ASIC    FPGA, ASIC      FPGA, ASIC
Open Source           No            No            No            Yes             Yes

From the above table, we can see that each processor has its advantages and disadvantages.
Xtensa offers unlimited ISA customization, but it is not open source and is expensive. Similarly,
OpenRISC has open-source code but is difficult to use with the given technology. Leon 3, despite
offering no custom instructions, excels in all other categories. However, other aspects remain to be
explored, such as bus architecture, software tools and a compliant ISA.

Evaluation of Bus Architectures:
Four different bus architectures are considered:
1. WishBone (OpenCores)
2. AMBA (ARM)
3. Avalon (Altera)
4. CoreConnect (IBM)

Their comparison is drawn below:

Feature                      WishBone       AMBA           Avalon         CoreConnect
Open Architecture            Yes            Yes            Partial        Yes
Hierarchical                 No             Yes            No             Yes
Pipelined                    No             Yes            Yes            Yes
Arbitration                  Yes            Yes            Yes            No
Data Transfer Hand Shaking   Yes            Yes            No             Yes
Data Transfer Pipelined      No             Yes            Yes            Yes
Split Transfer               N/A            Yes            No             Yes
Clocking                     Yes            Yes            Yes            Yes
Frequency                    User Defined   User Defined   User Defined   User Defined

Here too, we can see that AMBA from ARM has many advantages. WishBone, however, has the edge
of being adopted as the primary bus for most open-source designs. AMBA is the bus architecture
used by Leon 3; more details are given later.

In view of the above comparisons, we can safely use Leon 3 as the 32-bit processor and as the
baseline for designing our own custom processor.

SPARC Version 8 ISA:


If you choose a custom ISA, you have to create everything yourself:
• the entire chip design
• compiler
• all software including the OS
• cross-compilation and other equipment you need to develop software for this new ISA

You also lose the convenience of off-the-shelf hardware for the many systems needed for
development, testing etc. And you lose the thousands of man-years already invested in testing the
SPARC architecture and software.

SPARC is an instruction set architecture (ISA) derived from a RISC lineage. As an architecture, SPARC
allows for a spectrum of chip and system implementations at a variety of price/performance points
for a range of applications, including scientific/engineering, programming, real-time, and
commercial. SPARC was designed as a target for optimizing compilers and easily pipelined hardware
implementations. SPARC implementations provide exceptionally high execution rates and short
time-to-market development schedules. Its advantages are:
• Open architecture without patent or license fees unlike Intel, MIPS and ARM
• Well designed and documented
• Easy to implement
• Established software standard

Leon 3 Introduction:
The LEON3 is a synthesisable VHDL model of a 32-bit processor compliant with the SPARC v8
architecture. The model is highly configurable, and particularly suitable for system-on-a-chip (SOC)
designs. The full source code is available, allowing free and unlimited use for research and
education. The LEON3 processor has the following features:
• Compliant with SPARC V8 ISA
• Advanced 7-stage Pipeline
• Hardware Multiply, Divide and MAC units
• High Performance Pipelined Floating Point Unit (FPU)
• Harvard Architecture (Separate Instruction and Data Cache)
• AMBA 2.0 AHB Bus Interface
• On-Chip Debug Support
• Multiprocessor Support
• Power Down and Clock Gating
• Fault tolerant version available for High Performance space applications
• Extensively configurable
• Tools available like simulators, compilers, debuggers and kernels

Leon 3 consists of the following subsystems:
➢ Integer Unit (based on 7-Stage Pipeline Harvard Architecture)
➢ Cache (Data and Instruction)
➢ Floating Point Unit & Co-processor
➢ Hardware Multiplier and Divider
➢ Memory Management Unit
➢ Debug Support Unit
➢ Interrupt Controller

Fig 22: Leon 3 Integer Unit

Integer Unit:
The Integer Unit implements the full SPARC V8 standard, including hardware multiply and divide
instructions. The implementation is focused on high performance and low complexity. The number of
register windows is configurable within the limit of the SPARC standard (2 - 32), with a default
setting of 8. The pipeline consists of 7 stages with a separate instruction and data cache interface
(Harvard architecture). Its 7-stage pipeline is shown below:

Fig 23: 7-Stage Pipeline

These can be summarized as:


➢ FE (Instruction Fetch): If the instruction cache is enabled, the instruction is fetched from the
instruction cache. Otherwise, the fetch is forwarded to the memory controller. The
instruction is valid at the end of this stage and is latched inside the IU (Integer Unit).
➢ DE (Decode): The instruction is decoded and the CALL and Branch target addresses are
generated.
➢ RA (Register access): Operands are read from the register file or from internal data bypasses.
➢ EX (Execute): ALU (Arithmetic Logic Unit), logical, and shift operations are performed. For
memory operations (e.g. LD) and for JMPL/RETT, the address is generated.
➢ ME (Memory): Data cache is accessed. Store data read out in the execution stage is written
to the data cache at this time.
➢ XC (Exception): Traps and interrupts are resolved. For cache reads, the data is aligned as
appropriate.
➢ WR (Write): The result of any ALU, logical, shift, or cache operation is written back
to the register file.
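
How instructions overlap in these seven stages can be sketched with a small occupancy model. It is an idealized illustration only: it assumes one instruction enters the pipeline per clock cycle and ignores stalls, cache misses and traps; the function name and the instruction labels are hypothetical.

```python
# Sketch: ideal occupancy of the 7-stage pipeline (no stalls, no traps).
STAGES = ["FE", "DE", "RA", "EX", "ME", "XC", "WR"]

def pipeline_schedule(instructions):
    """Return, for each clock cycle, a {stage: instruction} occupancy map."""
    total_cycles = len(instructions) + len(STAGES) - 1
    schedule = []
    for cycle in range(total_cycles):
        occupancy = {}
        for index, name in enumerate(instructions):
            stage = cycle - index      # instruction `index` is fetched in cycle `index`
            if 0 <= stage < len(STAGES):
                occupancy[STAGES[stage]] = name
        schedule.append(occupancy)
    return schedule

schedule = pipeline_schedule(["ld", "add", "st"])
```

In this ideal model, N instructions complete in N + 6 cycles, so the cost of the pipeline depth is paid once while the steady-state throughput approaches one instruction per cycle.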

Cache Sub-system:
The processor implements a Harvard architecture with separate instruction and data buses
connected to two independent cache controllers. Both caches can be configured with 1 - 4 sets,
1 - 256 kbyte/set and 16 or 32 bytes per line. Sub-blocking is implemented with one valid bit per
32-bit word. The instruction cache uses streaming during line refill to minimize refill latency. The
data cache uses a write-through policy and implements a double-word write buffer. The data cache
also performs bus snooping on the AMBA AHB (Advanced High-performance Bus).
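
To make these configuration parameters concrete, the sketch below derives the address-field widths for one cache set. It is an interpretation rather than the LEON3 implementation: it assumes each "set" is an independently indexed array (what many texts call a way), power-of-two sizes, 32-bit addresses, and hypothetical function names.

```python
# Sketch: address-field widths for one cache set (illustrative assumptions only).

def cache_geometry(set_size_kbytes, line_bytes, address_bits=32):
    """Derive line count and tag/index/offset widths for one cache set."""
    set_bytes = set_size_kbytes * 1024
    lines = set_bytes // line_bytes
    offset_bits = line_bytes.bit_length() - 1   # log2, valid for powers of two
    index_bits = lines.bit_length() - 1
    return {
        "lines": lines,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": address_bits - index_bits - offset_bits,
        "valid_bits_per_line": line_bytes // 4,  # one valid bit per 32-bit word
    }

def split_address(address, geometry):
    """Split an address into (tag, index, offset) for the given geometry."""
    offset_bits = geometry["offset_bits"]
    index_bits = geometry["index_bits"]
    offset = address & ((1 << offset_bits) - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset

geometry = cache_geometry(4, 32)   # hypothetical 4 kbyte set with 32-byte lines
```

For this hypothetical 4 kbyte set with 32-byte lines, the model gives 128 lines, a 5-bit offset, a 7-bit index, a 20-bit tag and 8 valid bits per line.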

Memory Management Unit:


A SPARC V8 Reference Memory Management Unit (SRMMU) can optionally be enabled. The SRMMU
implements the full SPARC V8 MMU specification, and provides mapping between multiple 32-bit
virtual address spaces and 36-bit physical memory. A three-level hardware tablewalk is
implemented, and the MMU can be configured with up to 64 fully associative Translation Look-Aside
Buffer (TLB) entries.
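
The three-level tablewalk and the 36-bit physical address can be illustrated with the address split used by the SPARC V8 reference MMU: an 8-bit level-1 index, two 6-bit indices for levels 2 and 3, and a 12-bit page offset. The sketch below only shows the bit slicing; the function names are hypothetical and no page tables are modelled.

```python
# Sketch: SPARC V8 reference MMU address fields (bit slicing only).

def split_virtual_address(va):
    """Split a 32-bit virtual address into the three table indices and the page offset."""
    return {
        "index1": (va >> 24) & 0xFF,   # level-1 table: 256 entries
        "index2": (va >> 18) & 0x3F,   # level-2 table: 64 entries
        "index3": (va >> 12) & 0x3F,   # level-3 table: 64 entries
        "offset": va & 0xFFF,          # 4 kbyte page
    }

def physical_address(ppn, offset):
    """Combine a 24-bit physical page number with a 12-bit page offset."""
    return ((ppn & 0xFFFFFF) << 12) | (offset & 0xFFF)
```

A 24-bit physical page number combined with the 12-bit offset is what yields the 36-bit physical address mentioned above.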

On-Chip Debug Support:


The IU pipeline includes functionality to allow non-intrusive debugging on target hardware. To aid
software debugging, up to four watchpoint registers can be enabled. Each register can cause a
breakpoint trap on an arbitrary instruction or data address range. When the (optional) debug
support unit is attached, the watchpoints can be used to enter debug mode. Through a debug
support interface, full access to all processor registers and caches is provided. The debug interfaces
also allow single stepping, instruction tracing and hardware breakpoint/watchpoint control. An
internal trace buffer monitors and stores executed instructions, which can later be read out over the
debug interface.

Fig 24: On-Chip Debug Support

Interrupt Controller:
The processor supports the SPARC V8 interrupt model with a total of 15 asynchronous interrupts.
The interrupt interface provides functionality to both generate and acknowledge interrupts. In a
multiprocessor system, the AMBA system provides an interrupt scheme where interrupt lines are
routed together with the remaining AHB/APB bus signals. Interrupts from AHB (AMBA High
Performance Bus) and APB (AMBA Peripheral Bus) units are routed through the bus, combined
together, and propagated back to all units. The multi-processor interrupt controller core (IRQMP)
is attached to the AMBA bus as an APB slave, and monitors the combined interrupt signals. The
IRQMP core prioritizes, masks and propagates interrupts to one or more processors.
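
The prioritize-and-mask step can be sketched as follows. This is a simplified model, not the IRQMP register interface: it assumes that bit n of a 16-bit word represents interrupt n, that a mask bit of 1 means "enabled", and that a higher interrupt number has higher priority (as in the SPARC V8 interrupt levels). The function name is hypothetical.

```python
# Sketch: selecting the interrupt to forward to a processor (simplified model).

def highest_pending(pending, mask):
    """Return the highest-numbered unmasked pending interrupt (1..15), or 0 if none."""
    active = pending & mask & 0xFFFE   # bit 0 is unused: interrupts are numbered 1..15
    return active.bit_length() - 1 if active else 0
```

For example, with interrupts 2, 5 and 7 pending, the model forwards 7; if interrupt 7 is masked, it forwards 5 instead.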

Memory Controller:
The memory controller handles a memory bus hosting PROM, memory-mapped I/O devices,
asynchronous static RAM (SRAM) and synchronous dynamic RAM (SDRAM). The controller acts as a
slave on the AHB bus. Its function is programmed through memory configuration registers 1, 2 and 3
(MCR1, MCR2 and MCR3) via the APB (Advanced Peripheral Bus). The memory bus supports four types
of devices: PROM, SRAM, SDRAM and local I/O. The memory bus can also be configured in 8- or
16-bit mode for applications with low memory and performance demands. The controller decodes
three address spaces (PROM, I/O and RAM) whose mapping is determined through VHDL generics
(parameters). The following diagram shows its connections:

Fig 25: Memory Controller
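
The decoding of the three address spaces can be sketched as a simple range lookup. The base addresses and sizes below are assumptions chosen for illustration; in the actual design they are set through the VHDL generics mentioned above, and the names ADDRESS_MAP and decode_space are hypothetical.

```python
# Sketch: decoding an AHB address into one of three memory spaces.
# Assumption: the base addresses and sizes below are illustrative values only.
ADDRESS_MAP = [
    ("PROM", 0x00000000, 0x20000000),
    ("IO",   0x20000000, 0x20000000),
    ("RAM",  0x40000000, 0x40000000),
]

def decode_space(address):
    """Return (space name, offset within the space), or (None, None) if unmapped."""
    for name, base, size in ADDRESS_MAP:
        if base <= address < base + size:
            return name, address - base
    return None, None
```

Keeping the map in one table mirrors the way the generics centralize the mapping: changing a base address does not touch the decode logic.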

AMBA Bus Architecture:


The bus architecture is based on the Advanced Microcontroller Bus Architecture (AMBA). The AMBA
specification can be regarded as an on-chip communications standard for designing
high-performance embedded microcontrollers. A typical AMBA bus system is shown in the figure
below. There are two bus systems: high-speed components, such as the on-chip memory and Direct
Memory Access (DMA), are connected to the high-performance bus, whereas components that do
not need such high bandwidth are connected through a bridge to the low-power bus.

Fig 26: AMBA Shared Single Bus

The AMBA AHB is the high-performance system backbone bus. It is for the high performance, high
clock frequency system modules. It supports the efficient connection of processors, on-chip
memories and off-chip external memory interfaces with low-power peripheral macro-cell functions.
AHB is also specified to ensure ease of use in an efficient design flow by using synthesis and
automated test techniques.

AMBA APB is optimized for minimal power consumption and reduced interface complexity to
support peripheral functions. The APB is for the low power peripherals. APB can be used in
conjunction with either version of the system bus.

Example Template Design (with Bus, Memory Controller and Peripherals):


The Leon 3 architecture is centered around the AMBA Advanced High-performance Bus (AHB), to
which the processor and other high-bandwidth devices are connected. External memory is accessed
through a combined PROM/IO/SRAM/SDRAM memory controller. The on-chip peripheral devices
include Ethernet, a dual CAN 2.0 interface, serial and JTAG debug interfaces, two UARTs, an interrupt
controller, timers and an I/O port. The design is highly configurable as desired by the user. The
Leon3 SoC is shown below:

Fig 27: Leon 3 in Spartan 3E Template

Library (Source Code) and Toolchain:


The complete design environment for LEON3 including all the IP cores can be downloaded from its
website. Leon 3 design is integrated with template designs and other IP Cores in a single library file
known as GRLIB. It is distributed as a zipped file and can be installed in any location on the host
system. This library includes:
1. Make Files and script generators for shell commands (like bash or Cygwin)
2. Target FPGA Board designs from different companies like Altera, Xilinx etc
3. IP Cores including Leon 3
4. Example software files
5. FPGA and ASIC Technologies
6. Example template designs for Configuration and Synthesis

After installation of the library, a toolchain is required to use Leon 3. It is compatible with both
Windows and Linux. However, Windows is preferred due to ease of installation. It includes:
➢ Bare-C Compiler (BCC)
➢ Boot-Prom Builder (mkprom2)
➢ RTEMS Leon Cross Compiler (RTEMS)
➢ GRMON Debug Tool (GRMON2 Evaluation version)
➢ TSIM Simulator (Evaluation Version)

In a Windows environment, these tools are installed through a single installer known as GRTOOLS,
whereas in Linux every tool has to be installed separately. The installer also sets the Windows
environment variables automatically. For the Bare-C Compiler, Eclipse Kepler version 1.6 is installed
as part of the installation.

Besides ease of installation, I preferred Windows because the tools for synthesis and simulation
(Xilinx and ModelSim) were already installed and their environment variables were set. On Linux, all
these tools would have had to be installed from scratch.

However, for shell commands, Cygwin for Windows was installed. It emulates a Linux terminal in
Windows. Cygwin has a major disadvantage: it is unable to launch the ModelSim GUI. It should also
be noted that simulating the design with ModelSim requires the professional edition; the student
edition is not supported by the Leon toolchain.

Detailed work with each tool discussed above is presented later.

Example Template Configuration and Implementation:


Implementing a LEON3 system is typically done using one of the template designs in the design
directory. We will use the LEON3 template design for the Xilinx ML50x board which was also used
when we were using 8051.

Fig 28: Xilinx Virtex-5 ML507

Implementation is done in five steps:

➢ Configuration of Leon design in xconfig
➢ Simulation of design
➢ Synthesis and Place & Route
➢ Generate Bitstream
➢ Configure FPGA on board

The template design is mainly based on three files found in the ML50x folder:

➢ config.vhd - a VHDL package containing design configuration parameters. Automatically
generated by the xconfig GUI tool.
➢ leon3mp.vhd - contains the top level entity and instantiates all on-chip IP cores. It uses
config.vhd to configure the instantiated IP cores.
➢ testbench.vhd - test bench with external memory, emulating the ML50x board.

Each core in the template design is configurable using VHDL generics. The value of these generics is
assigned from the constants declared in config.vhd, created with the xconfig GUI tool.

In Windows, we install Cygwin to replicate the Linux environment. During installation, make sure to
install Tcl/Tk, which is needed to launch the GUI. With Cygwin installed, it is time to configure
Leon using the xconfig tool. Cygwin can be launched from the Desktop, and an XWin server is also
required for the display. After XWin has been launched successfully, the following command is
entered in Cygwin to export the display to the XWin server:

Fig 29: Export Display to XWin Server

Here we can see that we are in the ML50x target design directory. Invoking xconfig from the Cygwin
shell opens the xconfig GUI, as shown below:

Fig 30: Leon3 Design Configuration GUI (xconfig)

The xconfig menu is divided into the following sections:

• Synthesis: Target technology for the FPGA and other technology-related configurations. In this
case, it is Xilinx.
• Board Selection: FPGA board (Xilinx ML507) or ASIC technology
• Clock Generation: PLL-generated clock for the FPGA board. The default is 60 MHz for the 100 MHz
board.
• Processor: Main processor configuration, such as the number of processors, Integer Unit, FPU and
MMU configuration
• L2 Cache
• AMBA Bus Configuration
• Debug Link
• Peripherals: Memory Controller, On-Chip RAM/ROM, Ethernet, UART, Timer, VGA and
Keyboard Interface, PCI Express
• VHDL Debugging

This default configuration is known as the Minimal Processor. We will first simulate and synthesize
the Minimal Processor and then move on to higher-performance configurations.
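
As a worked illustration of the Clock Generation entry in the list above, the PLL output can be
viewed as the board clock scaled by an integer multiplier and divider. The following C sketch assumes
a multiplier of 6 and a divider of 10. These values are not stated in this report and are chosen only
because they reproduce the 60 MHz default from the 100 MHz board clock. The function name
pll_output_hz is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* System clock produced by a PLL: f_sys = f_board * mul / div.
 * mul and div are assumed example values; the report only states the result. */
uint32_t pll_output_hz(uint32_t board_hz, uint32_t mul, uint32_t div)
{
    assert(div != 0);   /* a zero divider is meaningless */
    /* Use a 64-bit intermediate so the multiplication cannot overflow. */
    return (uint32_t)(((uint64_t)board_hz * mul) / div);
}
```

Calling pll_output_hz(100000000, 6, 10) gives 60000000, i.e. 60 MHz.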

Scripts and Simulation:


Once we have saved the Leon 3 design in the xconfig GUI, config.vhd is updated with the selected
configuration. After cleaning the design, we generate the scripts for simulation and synthesis.
During script generation, all library files for the vendor-specific peripherals, the bus, other IP
cores and Leon 3 are loaded and checked. This process is shown below:

Fig 31: make scripts generation

Next, we invoke ModelSim for simulation. ModelSim is not launched as a separate GUI application,
because under Cygwin it crashes during launch. Instead, it is called as vsim.exe, which runs in the
background.

Synthesis of Leon 3 using ISE:


After cleaning the design files, we generate the scripts as we did for simulation. We can synthesize,
place & route and generate the bitstream with a single command in Cygwin, which runs the Xilinx tools
in the background. Unlike ModelSim, however, Xilinx ISE can also be launched as a GUI to carry out
these steps manually. After launch, the design files are loaded automatically. We implement the
design for the minimal Leon 3 processor.

Fig 32: make ise-launch

After the bit file has been generated successfully, it is loaded into the FPGA (Virtex-5 ML507).
Software can now be loaded onto, and debugged on, the Leon 3 synthesized in the FPGA.

Configuration for Minimal, General Purpose and High-Performance Processor:


The following table lists the VHDL generics that are changed to obtain the Minimal, General-Purpose
and High-Performance Processor configurations, each of which is then synthesized on the FPGA. These
generics (or global variables) are updated in the config.vhd file.

(MP: Minimal Processor, GPP: General-Purpose Processor, HP: High-Performance Processor)

VHDL Generic        MP    GPP   HP       Description
dsu                 0     1     1        Debug Support Unit
fpu                 0     1     1        Floating Point Unit
v8                  0     2     16#32#   Support for SPARC V8 MUL/DIV
nwp                 0     2     4        Hardware Watchpoints
icen / dcen         1     1     1        Processor Caches
irepl / drepl       2     2     2        Random replacement policy
dsnoop              0     6     6        Data Cache Snooping
mmuen               0     1     1        Memory Management Unit
tbuf                0     4     4        Trace Buffer
pwd                 1     2     2        Power Down Mode
smp                 0     0     1        SMP Support
bp                  0     1     1        Branch Prediction
tlb_type            1     2     2        Look-aside TLB Buffers
lddel               1     1     1        1-cycle load delay
itlbnum / dtlbnum   -     8     16       MMU look-aside buffers

The area utilization and timing analysis for each processor are summarized in the table below:

Software Development (BCC):


Writing a simple C program for our LEON3 processor is a quick way to find out if the hardware
system is working. It is also a good test of the software development environment. We will start by
using the Bare-C Compiler (BCC) installed on the system.

BCC is a cross-compiler for LEON3 processors. It is based on the GNU compiler tools and the Newlib
standalone C library. The cross-compiler system allows compilation of both tasking and non-tasking
C and C++ applications. It supports hard and soft floating-point operations, as well as SPARC V8
multiply and divide instructions.

We will write the simplest program of them all: Hello World. Here is the source code:

#include <stdio.h>

/* Prints a greeting on the console to confirm that the processor runs code. */
int main(void)
{
    printf("Hello World\n");
    return 0;
}

Here is the command to compile the source program:


sparc-elf-gcc-4.4.2 -O2 -msoft-float hello.c -o hello.exe

This command compiles hello.c into the executable hello.exe. The executable can be loaded onto the
processor in the FPGA using two methods:
1. GRMON Debugger
2. MKPROM2 PROM Programmer

Software Development (GRMON Debugger):


We will use the GRMON debug monitor to control the loading and running of our program. GRMON
is a general debug monitor for the Leon processor, and for SoC designs based on the GRLIB IP
library. It can also be launched from the Windows Command Prompt. GRMON includes the following
functions:
• Read/write access to all system registers and memory
• Built-in disassembler and trace buffer management
• Downloading and execution of LEON applications
• Breakpoint and watchpoint management
• Remote connection to GNU debugger (GDB)
• Support for USB, JTAG, RS232, PCI, Ethernet and SpaceWire debug links

We will use the JTAG link, which is also used to program the FPGA with the bit file. For GRMON to
use this link, however, a compatible driver must be installed. Once the JTAG link has been
established, the GRMON shell is launched in the Command Prompt:

Fig 33: GRMON shell launch

hello.exe, compiled with the Bare-C Compiler, can be loaded into the Leon 3 on the FPGA over the
GRMON debug link, and its output is written back to the GRMON shell:

Fig 34: hello.exe loaded and run

Software Development (PROM Programmer):


This method programs the PROM of Leon 3 with a boot loader before the design is synthesized on the
FPGA, in the same way that the ROM was loaded with a coefficient file for the 8051. Here, we use
mkprom2 to produce a PROM file that contains hello.exe. First, it compiles and creates the PROM.out
file:

Fig 35: PROM.out file creation

Once PROM.out has been created successfully, it is loaded into the PROM.srec bootloader file of Leon 3:

Software Development (TSIM Simulator):


TSIM is a Leon 3 simulator that emulates the processor environment without the use of an FPGA. ERC32
or LEON applications can be loaded and simulated from a Windows Command Prompt. A number of
commands are available to examine data, insert breakpoints and advance the simulation. To call TSIM
from the Command Prompt:

Fig 36: TSIM in Command Prompt

Here, the Leon 3 system information is presented much as it is in GRMON. This information can be
changed as needed to emulate the required environment:

Fig 37: TSIM Leon Assembly

hello.exe, compiled with the Bare-C Compiler, can again be loaded and checked here:

Fig 38: TSIM running hello.exe

Chapter 5: Processor Extension and Customization

Introduction:
Building on our knowledge of the Leon 3 processor, we now extend the work by customizing it. We
will study the factors and variables that are essential to the design of this processor. Each kind of
customization requires a different kind of understanding. To add a peripheral, we need to
understand:

• Library Structure
• Understanding and working of AMBA APB bus
• VHDL Generics and link with Leon3mp.vhd (Top Module)
• xconfig GUI Customization

Library Structure:
The automatic generation of compile scripts searches for VHDL libraries in the file lib/libs.txt, and in
lib/*/libs.txt. The libs.txt files contain paths to directories containing IP cores to be compiled into
the same VHDL library. The name of the VHDL library is the same as the directory. The main libs.txt
(lib/libs.txt) provides mappings to libraries that are always present in the existing library, or which
depend on a specific compile order (the libraries are compiled in the order they appear in libs.txt):

Fig 39: Library showing scripts for different vendors

Each directory specified in libs.txt contains the file dirs.txt, which contains paths to sub-directories
containing the actual VHDL code. Each of the sub-directories appearing in dirs.txt
should contain the files vhdlsyn.txt and vhdlsim.txt. The file vhdlsyn.txt contains the names of the
files which should be compiled for synthesis (and simulation), while vhdlsim.txt contains the names of
the files which should only be used for simulation. The files are compiled in the order they appear,
with the files in vhdlsyn.txt compiled before the files in vhdlsim.txt.
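
The ordering rule above can be sketched in a few lines of C. This is only an illustrative model of
the rule for one sub-directory, not the actual script generator: the function name compile_order and
its parameters are hypothetical, and the two input arrays stand in for the lines already read from
vhdlsyn.txt and vhdlsim.txt.

```c
#include <assert.h>
#include <stddef.h>

/* Build the compile order for one sub-directory.
 *   syn: entries of vhdlsyn.txt (used for synthesis and for simulation)
 *   sim: entries of vhdlsim.txt (used for simulation only)
 * Writes at most cap names to out and returns how many were written. */
size_t compile_order(const char *syn[], size_t nsyn,
                     const char *sim[], size_t nsim,
                     int for_simulation,
                     const char *out[], size_t cap)
{
    size_t n = 0;
    assert(out != NULL);
    for (size_t i = 0; i < nsyn && n < cap; i++)
        out[n++] = syn[i];          /* vhdlsyn.txt files always come first */
    if (for_simulation) {
        for (size_t i = 0; i < nsim && n < cap; i++)
            out[n++] = sim[i];      /* vhdlsim.txt files follow, simulation only */
    }
    return n;
}
```

For a synthesis run (for_simulation set to 0) only the vhdlsyn.txt entries are returned. For a
simulation run the vhdlsim.txt entries are appended after them, in the order they were listed.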

Fig 40: Library showing scripts for different files in AMBA folder

Why is this important? When the scripts are generated for synthesis or simulation, the library is
loaded with every file required for the processor and the assembled system. When we create or add a
new peripheral, these scripts must be updated accordingly. This is done by adding the new peripheral
file to the relevant vhdlsyn.txt file. The resulting script is shown below:

Fig 41: New files in AMBA folder

Understanding and Working of AMBA Bus:


The AMBA Advanced Peripheral Bus (APB) is a single-master bus suitable to interconnect units of low
complexity which require only low data rates. An APB bus is interfaced with an AHB bus by means of
a single AHB slave implementing the AHB/APB bridge. The AHB/APB bridge is the only APB master on
one specific APB bus. More than one APB bus can be connected to one AHB bus, by means of
multiple AHB/APB bridges. It is shown below:

Fig 42: AMBA AHB/APB Conceptual View

The access to the AHB slave input (AHBI) is decoded and an access is made on APB bus. The APB
master drives a set of signals grouped into a VHDL record called APBI which is sent to all APB slaves.
The combined address decoder and bus multiplexer controls which slave is currently selected. The
output record (APBO) of the active APB slave is selected by the bus multiplexer and forwarded to
AHB slave output (AHBO).
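
The decode-and-select behaviour described above can be sketched in C. This is a simplified software
model, not the VHDL used in the design. It assumes that each slave advertises a 12-bit base address
and a 12-bit mask that are compared against address bits 19 to 8. That field layout, the struct
apb_slave type and the function names are illustrative assumptions, not taken from this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One APB slave as seen by the bridge (all fields are illustrative). */
struct apb_slave {
    uint16_t base;      /* 12-bit base address field */
    uint16_t mask;      /* 12-bit mask: 1 = bit takes part in the compare */
    uint32_t prdata;    /* read data this slave currently drives (its APBO) */
};

/* Address decoder: return the index of the selected slave, or -1 if none.
 * A mask of zero is treated as an unused entry (a modelling choice). */
int apb_decode(const struct apb_slave *slaves, size_t n, uint32_t addr)
{
    uint16_t field = (uint16_t)((addr >> 8) & 0xFFFu);  /* address bits 19:8 */
    assert(slaves != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (slaves[i].mask != 0 &&
            ((field ^ slaves[i].base) & slaves[i].mask) == 0)
            return (int)i;
    }
    return -1;
}

/* Bus multiplexer: forward the read data of the selected slave only. */
uint32_t apb_read(const struct apb_slave *slaves, size_t n, uint32_t addr)
{
    int sel = apb_decode(slaves, n, addr);
    return sel < 0 ? 0u : slaves[sel].prdata;
}
```

With two slaves whose base fields are 0x001 and 0x00B and whose masks are 0xFFF, an access to the
hypothetical address 0x80000B00 selects the second slave, while an address that matches neither
slave makes apb_decode return -1.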

Fig 43: APB Inter-connection

An example IP core is written with an APB interface. The IP core has one memory-mapped 32-bit
register that is reset to zero. The register can be read or written at register address offset 0. The
core’s base address, mask and bus index settings are configurable via VHDL generics (pindex, paddr,
pmask). The paddr and pmask VHDL generics are propagated to the APB bridge via the apbo.pconfig
signal and the index is propagated via the apbo.pindex signal. These values are then used by the APB
bridge to generate the APB address decode and slave select logic. The core's entry in the script has
already been shown in Fig 41.
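
From the software side, a register like this is usually reached through a pointer to a volatile
structure. The sketch below is a minimal illustration of that idea for a core with a single 32-bit
register at offset 0. The type and function names are hypothetical, and the base address on real
hardware depends on the bridge address and the paddr generic chosen for the design.

```c
#include <assert.h>
#include <stdint.h>

/* Register map of the example core: a single 32-bit register at offset 0. */
struct example_apb_regs {
    volatile uint32_t reg;      /* offset 0x00, reset to zero by the hardware */
};

/* Write a value to the register. */
void example_apb_write(struct example_apb_regs *regs, uint32_t value)
{
    assert(regs != 0);
    regs->reg = value;          /* on the target this becomes a write to offset 0 */
}

/* Read the register back. */
uint32_t example_apb_read(const struct example_apb_regs *regs)
{
    assert(regs != 0);
    return regs->reg;           /* on the target this becomes a read from offset 0 */
}
```

On a host PC the functions can be exercised against an ordinary variable of type struct
example_apb_regs. On the target, the pointer would instead be the core's memory-mapped base address
cast to that type.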

Fig 44: example_apb.vhd

Many open-source and commercial IP cores are also available and can be added. The problem is that
not every IP core uses the AMBA interface. For example, cores available on OpenCores follow the
Wishbone bus architecture. For such cores we sometimes need to create a Wishbone-to-AMBA wrapper.

Fig 45: Wishbone to AMBA wrapper

VHDL Generics and Link with Top Module:


VHDL generics are global variables used as parameters, and their values originate in config.in. Each
entry there creates a new variable, which is used in config.vhd as a parameter to generate a
component in leon3mp.vhd. To understand this, we first look at config.in, which loads variables from
the different libraries containing .in files.

Fig 46: Generation of components in Top Module

xconfig extension:
This is the last step, and it draws on all of the previous work to customize the GUI shown earlier.
Each core has a set of files that are used to generate the core’s xconfig menu entries. As
an example we will look at the apb_example. The xconfig files are typically located in the same
directory as the core’s HDL files (but this is not a requirement).

Fig 47: apb_example.in files

Fig 48: apb_example.in

The first line defines a boolean option that will be saved in the variable CONFIG_I2CAHB. This will be
rendered as a yes/no question in the menu. If this constant is set to yes (‘y’), the user will be
able to select two more configuration options: first the width, which is defined as an integer (int),
and then the interrupt mask, which is defined as a hexadecimal value (hex). The GUI has a help option
for each item in the menu. When a user clicks on the help button, a help text can optionally be
displayed. The contents of the help text boxes are defined in the file that ends with .in.help:

Fig 49: apb_example.in.help


The two remaining files (apb_example.in.h and apb_example.in.vhd) are used when generating the
config.vhd file for a design. config.vhd typically consists of a set of lines for each core, where the
first line decides if the core should be instantiated in the design and the following lines contain
configuration options.

When exiting the xconfig tool, the .in.vhd files for all cores are concatenated into one file. A
pre-processor is then used to replace all the variables defined in the menus (for instance
CONFIG_I2C2AHB_CADDR) with the values they represent. In this process, additional information is
inserted via the .in.vhd.h files. The contents of apb_example.in.h are:

Fig 50: apb_example.in.h
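
The substitution step can be illustrated with the C preprocessor, under the assumption that the
pre-processor mentioned above behaves in a similar way. In the sketch below, the variable
CONFIG_APB_EXAMPLE_WIDTH, its default value of 32, the constant name in the generated line and the
function emit_config_line are all hypothetical. They are not copied from the files shown in the
figures.

```c
#include <assert.h>
#include <stdio.h>

/* Default in the style of a .in.h file: applies when the menu left the
 * variable unset. The name and the value 32 are hypothetical. */
#ifndef CONFIG_APB_EXAMPLE_WIDTH
#define CONFIG_APB_EXAMPLE_WIDTH 32
#endif

/* Two-step stringification so the macro is expanded before it is quoted. */
#define STR_(x) #x
#define STR(x)  STR_(x)

/* Template line in the style of a .in.vhd file, with the variable replaced
 * by its value at pre-processing time. */
static const char cfg_line[] =
    "constant CFG_APB_EXAMPLE_WIDTH : integer := "
    STR(CONFIG_APB_EXAMPLE_WIDTH) ";";

/* Copy the substituted line into buf and return its length. */
int emit_config_line(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%s", cfg_line);
}
```

Compiling with a different definition of CONFIG_APB_EXAMPLE_WIDTH changes the value that appears in
the emitted line, which mirrors how a menu choice ends up as a constant in config.vhd.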

The menu entries to include in xconfig are defined for each template design in the file config.in. As
an example we will look at the config.in file for the design leon3-xilinx-ml50x. In config.in we find
the entry for the apb_example port (described in the previous section) as part of one of the submenus:

Fig 51: apb_example.in included in config.in (Folder:ML50x)

These variables can now be used to generate cores for apb_example in the same way as shown in
Fig 46. The modified xconfig is shown below:

Fig 52: Modified xconfig

Chapter 6: Conclusion
Design of the processor depends heavily on the capabilities required by the intended applications.
These capabilities are manifested in the instruction set architecture. Once an instruction set has
been finalized, a processor can be created that can implement that instruction set. A bare-minimum
processor consists of a controller which implements an instruction decoder, an ALU, and supporting
registers. This processor needs external components like data-memory, instruction memory, a
communication bus, and peripherals to do any meaningful work.

A basic processor architecture was explored based on the 8-bit 8051 instruction-set architecture. It
was simulated, synthesized, and mapped on an FPGA. Custom C code was then executed on the
processor and its performance was measured.

Next, a 32-bit SPARC V8 instruction-set architecture was explored, which included more complex
components such as caches, a memory controller, and AMBA buses. This was again synthesized and
mapped on an FPGA for testing.

Finally, the capabilities of the processor were enhanced by designing and interfacing a custom
memory mapped peripheral with the AMBA bus of the system, thus incorporating the desired
functionality into the existing system.

The architecture was found to be suitable for implementation in ASICs and FPGAs, and is flexible
enough to incorporate custom peripherals.
