Remove Glitches in Physical Design

A TECHNIQUE TO REMOVE GLITCHES IN PHYSICAL
DESIGN STAGE
A Dissertation submitted in partial fulfilment of the requirements for the award of the degree of
MASTER OF SCIENCE
IN
ELECTRONICS & COMMUNICATION ENGINEERING
(VLSI & Embedded Systems Design)
Submitted by
S.S.SARAT CHANDRA
(11011J6033)
Under the esteemed guidance of
Mr P.PARAMESWARA RAO
Assistant Professor
Seer Akademi Pvt. Ltd.
Department of Electronics and Communication Engineering
JNTUH COLLEGE OF ENGINEERING HYDERABAD
(Autonomous)
Jawaharlal Nehru Technological University
Hyderabad 500085
2011 - 2013
DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING
JNTUH COLLEGE OF ENGINEERING
HYDERABAD-500085
CERTIFICATE
This is to certify that this dissertation work entitled A TECHNIQUE TO REMOVE
GLITCHES IN PHYSICAL DESIGN STAGE is a bonafide work carried out by
S.S.SARAT CHANDRA bearing Roll NO.11011J6033 in partial fulfilment of the requirement
for the award of MASTER OF SCIENCE degree in ECE with specialization in VLSI AND
EMBEDDED SYSTEMS DESIGN from JNTUH during the academic year 2011-13. The
results have been verified and found to be satisfactory.
Internal Guide
Mr P.Parameswara Rao
M.Tech, Asst. Professor
Seer Akademi Pvt.Ltd.
HYDERABAD-500085
CERTIFICATE BY THE SUPERVISOR
This is to certify that the dissertation work entitled A TECHNIQUE TO REMOVE
GLITCHES IN PHYSICAL DESIGN STAGE, being submitted by S.S.SARAT
CHANDRA bearing Roll NO.11011J6033 in partial fulfilment of the requirement for the award
of MASTER OF SCIENCE degree in ECE with specialization in VLSI AND EMBEDDED
SYSTEMS DESIGN, is a record of bonafide work carried out by him under my supervision.
The results have been verified and found to be satisfactory.
Supervisor
Mr P.Parameswara Rao
M.Tech, Asst. Professor
Seer Akademi
HYDERABAD-500085
CERTIFICATE BY HEAD OF THE DEPARTMENT
This is to certify that the dissertation work entitled A TECHNIQUE TO REMOVE
GLITCHES IN PHYSICAL DESIGN STAGE, being submitted by S.S.SARAT
CHANDRA bearing Roll NO.11011J6033 in partial fulfilment of the requirement for the award
of MASTER OF SCIENCE degree in ECE with specialization in VLSI AND EMBEDDED
SYSTEMS DESIGN, is a record of bonafide work carried out by him.
Dr. D. SREENIVASA RAO
Professor & Head of the Department
Department of ECE
JNTUH College of Engineering
Hyderabad
HYDERABAD-500085
DECLARATION OF THE CANDIDATE
I, S.S.SARAT CHANDRA, bearing Reg. No.11011J6033, hereby declare that the
dissertation work entitled A TECHNIQUE TO REMOVE GLITCHES IN PHYSICAL DESIGN
STAGE has been developed under the valuable guidance of Mr. P.Parameswara Rao and
submitted in partial fulfilment of the requirements for the Award of the Degree of Master of
Science in VLSI AND EMBEDDED SYSTEMS DESIGN.
This is a record of bonafide work carried out by me and the results obtained have not
been reproduced or copied from any source. The results of this dissertation have not been
submitted to any other University or Institute for the award of any Degree.
S.S.SARAT CHANDRA
Reg.No.11011J6033
Branch ECE (M.S VLSI & ESD)
ACKNOWLEDGEMENT
This is an acknowledgement of intensive drive and technical competence of many
individuals who have contributed to the success of my project.
I wish to express my deep sense of gratitude and sincere thanks to Mr P.
Parameswara Rao, Assistant Professor at Seer Akademi, for his valuable suggestions
and sagacious guidance in all respects during the course of the project.
I am particularly thankful to Dr. D. Sreenivasa Rao, Professor and Head of the
department of ECE, JNTU College of engineering, Hyderabad for her support during
the project work.
I express my heart-felt thanks to Mr Srikanth Jadcherla, CEO of Seer Akademi,
Mr M. Ram Kumar, Manager of Seer Akademi, and Mr P.Parameswara Rao for
their encouragement, guidance, suggestions and Mr. Ramanji Reddy for his IT
support.
I am grateful to my friends, parents, to the entire faculty and non-faculty staff for
their encouragement and support.
S.S.SARAT CHANDRA
Reg.No.11011J6033
Branch ECE (M.S VLSI & ESD)
ABSTRACT
A glitch compensation methodology is proposed in this dissertation which
reduces the undesired switching of combinational circuits in order to save
dynamic power. The proposed methodology can be seamlessly integrated into an existing
physical design flow to reduce glitch power, which is one of the major contributing
factors to both dynamic power and IR drop. A glitch is an undesired transition that
occurs before the intended value settles in a digital circuit.
A glitch occurs in CMOS circuits when the differential delay at the inputs of a
gate is greater than its inertial delay, which results in a notable amount of power
consumption. Glitch power is becoming more prominent at lower technology
nodes. Introducing buffers at the inputs of a logic gate may reduce glitches, but it
results in a large area overhead and additional dynamic power. The coupling capacitance
of the nets should be decreased by shielding, by increasing spacing, or by using the
proposed methodology. With the proposed methodology, functionality can be improved by
removing static glitches and timing can be improved by removing dynamic glitches.
Hence, the proposed methodology will ensure low dynamic power
consumption with less area. The proposed methodology has been validated using the
Synopsys 90nm SAED PDK.
Tools Used: VCS (functionality and simulation), DC (logic synthesis) & IC
Compiler (physical design).
LIST OF CONTENTS
Title Page No
LIST OF FIGURES i
LIST OF TABLES iii
ABBREVIATIONS iv
Chapter 1: INTRODUCTION 1
1.0 Introduction and background 1
1.1 Introduction to ASIC 2
1.2 Standard cell based ASIC 2
1.3 Need for low power ASIC 3
1.4 ASIC flow 4
1.5 Objective of the project 7
1.6 Organization of thesis 8
Chapter 2: Asynchronous Fifo 10
2.1 Asynchronous Interface 10
2.2 Issues in designing asynchronous fifo 10
2.3 Operation of the design 11
2.3.1 Data write operation 11
2.3.2 Fifo full status 11
2.3.3 Asynchronous fifo pointers 12
2.4 Handling full and empty conditions 15
2.4.1 Generating empty flag 15
2.4.2 Generating full flag 15
2.5 Procedure to design fifo module 16
2.5.1 Dual port RAM 20
2.5.2 Gray counters 20
2.5.3 Address pointer difference generation 20
2.5.4 Full and empty generation logic 20
2.5.5 Next read and write control logic 20
Chapter 3: Logic Synthesis 20
3.1 Synthesis and its basic flow 20
3.2 Synopsys design compiler flow for synthesis 21
3.3 Design flow 23
3.3.1 Read design 23
3.3.2 Synthesis libraries 24
3.3.2.1 Target library 24
3.3.2.2 Synthetic library 24
3.3.2.3 Link library 24
3.3.2.4 Symbol library 24
3.4 Design environment 25
3.5 Synthesis constraints 26
3.5.1 Design rule constraints 27
3.5.2 Design optimization constraints 27
3.6 Design constraints 28
Chapter 4: Design Planning 32
4.1 Introduction 32
4.2 Design planning flow 34
4.3 Tasks to be performed in design planning 35
4.4 Macro planning 36
4.5 Partitioning 36
4.6 Power planning 37
4.7 Defining chip area 38
4.8 Virtual flat placement 39
Chapter 5: Placement & Power Planning 40
5.1 Introduction 40
5.2 IO power 41
5.2.1 Core power 41
5.3 Power network synthesis 41
5.3.1 Power pads 42
5.3.2 Rectangular rings 43
5.3.3 Power straps 43
5.4 Placement 43
5.4.1 Tasks to be performed during placement 44
5.4.2 Global placement 44
5.4.3 Detailed placement 45
Chapter 6: Clock Tree Synthesis 46
6.1 Introduction 46
6.2 Clock network modelling for logic synthesis 48
6.2.1 Virtual clocks 48
6.2.2 Trail of clock network synthesis 48
6.3 Clock design at implementation stage 49
6.3.1 Clock network synthesis in a flat design flow 49
6.3.2 Clock network synthesis in a hierarchical design flow 49
6.4 Prerequisites for clock tree synthesis 50
6.4.1 Design prerequisites 50
6.4.2 Library prerequisites 51
6.5 Clock distribution architectures 51
6.5.1 Tree 52
6.5.2 H-tree 52
6.6 Algorithm for clock tree construction 53
6.7 Analyzing the clock trees 53
6.7.1 Identify the clock tree at the end points 54
6.7.2 Analyzing the clock sink groups 54
6.7.3 Defining clock root attributes 54
6.8 Clock skew scheduling 55
6.9 Optimization of registers 55
6.10 Clock analysis and on chip variations 56
Chapter 7: Routing 57
7.1 Introduction 57
7.2 Process design rules 57
7.3 Routing grid 57
7.4 Global & detailed routing 69
7.5 Routing congestion 59
7.6 Routing order 60
Chapter 8: Signal Integrity 61
8.1 Introduction 61
8.2 Crosstalk 61
8.3 Si closure methodologies 62
8.4 Si prevention 63
8.5 Si analysis and repair 64
Chapter 9: Results & Analysis 67
9.1 Synthesis results 67
9.1.1 Timing report 69
9.1.2 Data required and arrival time 70
9.1.3 Slack 70
9.1.4 Setup and hold time 70
9.1.5 Power report 72
9.1.6 QOR (quality of results) 73
9.2 Design planning results 75
9.2.1 Floorplan results 75
9.2.2 Virtual flat placement 77
9.2.3 Congestion analysis 77
9.3 Power planning results 79
9.3.1 Rectangular rings result 79
9.3.2 Power straps result 79
9.4 Power network synthesis results 80
9.4.1 IR drop analysis 80
9.5 Placement results 82
9.5.2 Area report 86
9.5.3 Power report 87
9.5.4 Congestion report 88
9.6 Clock tree synthesis results 89
9.7 Analyzing the routing results 93
9.7.1 Noise report before buffer insertion 93
9.7.2 Noise report after buffer insertion 94
9.7.3 Power report before buffer insertion 96
9.7.4 Power report after buffer insertion 97
Chapter 10: Conclusion 100
Chapter 11: Future scope 101
Bibliography 102
Department of ECE, JNTUHCEH i
List of figures
S.No Figure. No Figure Name Page. No
1 1.1 Cell based ASIC 3
2 1.3 Traditional ASIC design flow 6
3 2.2 Fifo full and empty conditions 13
4 2.3 n bit Gray code converted to n-1 gray code 14
5 2.4 Top module of asynchronous fifo 16
6 2.5 Internal architecture asynchronous fifo 17
7 3.1 Synthesis flow 21
8 3.2 Design environment for a design 26
9 3.3 Specification of input delay 29
10 3.4 Specification of output delay 30
11 4.1 Basic flow of physical design 33
12 4.2 Data setup for design flow 34
13 4.3 Design planning 35
14 5.1 Power planning 42
15 6.1 Clock skew 47
16 6.2 Tree generated by clock tree synthesis 52
17 6.3 H-tree balances skew 52
18 6.5 Routing grids 58
19 7.2 Routing with two different metals 59
20 8.1 SI closure criteria 61
21 8.2 Buffer insertion to victim and aggressor 63
22 8.3 Adding shielding to aggressor 65
23 9.1 Setup & hold time 71
24 9.2 Floorplanned design 76
25 9.3 Virtual flat placement 77
26 9.4 Rectangular rings & power straps 80
27 9.5 Power network synthesis 82
28 9.6 Placement 83
29 9.7 Clock tree synthesis 89
30 9.8 Routed clock drivers with clock nets 93
31 9.9 Victim net representations 98
32 9.10 Buffer additions to victim net 98
List of tables
S.No Table. No Table Name Page .No
1 9.1 Floorplan reports 75
2 9.2 IR drop analysis 81
ABBREVIATIONS
i. ASIC : Application specific integrated circuit
ii. BGAs : Ball grid arrays
iii. CAD : Computer aided design
iv. CPU : Central processing unit
v. CTS : Clock tree synthesis
vi. CMP : Chemical mechanical polishing
vii. CNS : Clock network Synthesis
viii. DC : Design compiler
ix. DCTB : Dynamic clock-tree building
x. DME : Deferred merge embedding
xi. DMST : Dual-MST geometric matching topology
xii. DNNA : Dynamic Nearest-Neighbour Algorithm
xiii. DLL : Delay locked loop
xiv. DSP : Digital Signal Processing
xv. DDR : Domain Deskew Register
xvi. DRC : Design rule check
xvii. DSP : Digital signal processing
xviii. DIPS : Dual In-line package
xix. ECO : Engineering Change Orders
xx. ESD : Electrostatic Discharge
xxi. ERC : Electrical Rule check
xxii. FPGA : Field programmable gate array
xxiii. FDP : Force-directed placement (FDP) framework.
xxiv. FPU : Floating point unit
xxv. GDSII : Geometric data stream
xxvi. GMA : Geometric matching algorithm
xxvii. GPL : General Public License
xxviii. GTECH : General technology
xxix. HDL : Hardware description language
xxx. ISPD : International symposium for Physical design
xxxi. I/O : Input and output
xxxii. IP : Intellectual properties
xxxiii. IC : Integrated circuits
xxxiv. ICC : Integrated circuit compiler
xxxv. IEEE : Institute of electrical and electronic engineers
xxxvi. IP : Intellectual protocol
xxxvii. LC : Library compiler
xxxviii. LCB : Local clock buffers
xxxix. MCMM : Multi-corner and Multi mode
xl. MLBB : Multi level bounding box
xli. PPO : Post placement optimization
xlii. PVT : Process voltage and temperature
xliii. PGAs : Pin grid arrays
xliv. QOR : Quality of results
xlv. RC : Resistance and capacitance
xlvi. RCD : Regional clock Driver
xlvii. RTL : Register transfer level
xlviii. SDC : Synopsys design constraints
xlix. STA : Static timing analysis
l. SDF : Standard delay format
li. SOC : System On-chip
lii. SPICE : Simulation Program with Integrated Circuit Emphasis
liii. TDF : Top design file
liv. TLU : Table look up
lv. TNS : Total Negative Slack
lvi. VCS : Verilog compiler and simulator
lvii. VHDL : Very high speed integrated circuit hardware description
language
lviii. VLSI : Very large scale integrated circuit
lix. WNS : Worst Negative Slack
lx. WSBFS : Walk-Segment Breadth First Search
Chapter 1
Introduction
Integrated circuits today pack an ever-increasing number of transistors onto a
single die, which confronts designers with many challenges that can be met only
through trade-offs. Designing a VLSI chip with low power consumption and low power
dissipation is therefore not an easy task; designers have to consider and balance
the following factors:
PVT (Process, Voltage, Temperature), clock frequencies, timing closure,
scaling, power dissipation, electromigration and IR drop.
No circuit can have low area, low power consumption, low power dissipation
and high speed all at once; one or two of these factors have to be sacrificed. As
feature sizes keep decreasing, this is a big dilemma for both design and fabrication.
However, at the time when the need and opportunity for an ASIC market were
clear, the design challenges were not as numerous and complex. As it is with any
industry and market, legacy forces become too strong to overcome. This has been the
problem with today's RTL-to-GDSII design flow, which clings to past successes
and is hobbled in tackling future problems.
Many of the design decisions that are made in today's RTL-to-GDSII
methodology are based on coarse estimates or worst-case decisions. Such decisions
can no longer lead to successful design due to the increased miniaturization of the
process which in turn leads to tighter design margins, as well as the tight market
constraints that demand a shorter turn-around design time. This dissertation is not a
survey of the various EDA algorithms involved in the RTL-to-GDSII flow, nor is it a
survey of the various design methodologies and flows that are employed by designers
in the field of high-end IC design.
1.1 Introduction to ASIC
An application-specific integrated circuit (ASIC) is an integrated circuit (IC)
customized for a particular use, rather than intended for general-purpose use. In
today's world, ASICs offer many advantages over off-the-shelf devices:
a smaller die size that leads to board size reduction, reduced power consumption,
less heat dissipation, lower costs under mass production, improved performance,
better radiation tolerance, improved testability, enhanced reliability, and
proprietary design implementation.
1.2 Standard-Cell-Based ASIC
A cell-based ASIC, or cell-based IC (CBIC), uses predefined logic cells such as
AND gates, OR gates, multiplexers, and flip-flops, known as standard cells. The
flexible blocks in a CBIC are built of rows of standard cells. In a CBIC, the ASIC
designer defines the placement of the standard cells and the interconnect. The
advantage of CBICs is that they can be designed in less time and at lower cost than
full-custom ASICs and, most importantly, that they reduce risk by using a
predesigned, pretested, and pre-characterized standard-cell library in which each
cell can be optimized individually. The disadvantages are the time or expense of
designing or buying the standard-cell library and the time needed to fabricate all
layers of the ASIC for each new design. Figure 1.1 shows a CBIC (cell based IC).
Figure 1.1 cell based ASIC
[13]
Each standard cell in the library is constructed using full-custom design
methods, but you can use these predesigned and pre-characterized circuits without
having to do any full-custom design yourself. This design style gives you the same
performance and flexibility advantages as a full-custom ASIC, but reduces design
time and risk.
1.3 Need for Low Power ASIC
For early digital circuits, high speed and minimum area were the main design
constraints. Most EDA tools were designed specifically to meet these criteria.
Power consumption was never highly visible. Nowadays, the area reduction of digital
circuits is no longer a big issue, since with the latest sub-micron techniques many
millions of transistors can be fitted in a single IC. Smaller chip size eventually
leads to high demand for portable and handheld devices. More and more applications
are battery powered, and low power ICs are the key to extending the usage time
between battery recharges, and in turn to increasing the battery life and reliability
of the product. In submicron technologies, the heat generated by power dissipation
also limits the proper functioning of circuits. Market forces are demanding low power
not only for longer battery life but also for reliability, portability, performance,
cost and time to market. This is very true in the fields of personal computing
devices, wireless communication systems and home entertainment systems, which are
becoming popular nowadays. Implantable medical devices, such as pacemakers, deep
brain stimulators for Parkinson's disease, and spinal cord stimulators for pain
management, particularly need to dissipate less power for longer battery life and
improved component reliability and safety.
As process technology shrinks to 90nm and below, performance and density
are taken to new levels, yet power loss in both switching and leakage makes designing
with these devices a major challenge. Leakage power reduction is essential in
sustaining the scaling of the CMOS process. Leakage power is now becoming
comparable to dynamic or switching power loss, as shown in the figure below. While
lowering the threshold voltage leads to a significant increase in sub-threshold
leakage current, the increase in gate tunneling leakage current is caused by thinner
gate oxides. While scaling improves transistor density, functionality and on-chip
performance, it also results in increased power dissipation. Therefore, it has
become necessary to use new techniques to manage energy at the system level.
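For reference, the switching component mentioned above is commonly estimated with the standard first-order CMOS expression below. The symbols are introduced here only for illustration and are not taken from the thesis.

```latex
P_{\mathrm{dyn}} = \alpha \, C_L \, V_{DD}^{2} \, f_{\mathrm{clk}}
```

Here \alpha is the switching activity of the node, C_L the switched load capacitance, V_{DD} the supply voltage and f_{clk} the clock frequency. In this first-order view, a glitch is an extra transition that raises \alpha without doing useful work, which is why removing glitches saves dynamic power.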
1.4 ASIC Flow
The traditional ASIC design flow consists of the following steps:
Prepare the requirement specification and create a micro-architecture document.
Carry out RTL design and development of the IPs. After this step, DFT memory BIST
insertion can also be implemented, if the design contains any memory element.
Perform functional verification of all the IPs. Check whether the RTL is free from
linting errors and analyze whether the RTL is synthesis friendly. Perform cycle-based
(functional) verification to verify the protocol behaviour of the RTL. Perform
property checking to verify that the RTL implementation matches the understanding of
the specification. Set up the design environment, which includes the technology file
to be used along with other environmental attributes.
Prepare the design constraints file needed to perform synthesis, usually called an
SDC (synopsys_constraints) file or a dc_synopsys_setup file, specific to the synthesis
tool (Design Compiler). Once the constraints file is ready, the inputs to DC for
synthesis are the library file (the library targeted by the synthesis, which holds
the functional and timing information of the standard cell library and the wire
load models for the wires based on the fan-out length of the connectivity), the RTL
files and the design constraints files, so that the synthesis tool can synthesize
the RTL files, then map and optimize the result to meet the design constraints.
After synthesis, scan insertion and JTAG scan chain insertion are implemented
and synthesis is repeated. Check whether the design meets the requirements after
synthesis. Perform block-level static timing analysis using Design Compiler's
built-in static timing analysis engine. Perform formal verification between the RTL
and the synthesized netlist to confirm that the synthesis tool has not altered the
functionality. Perform the pre-layout STA (static timing analysis) using
PrimeTime with the SDF (standard delay format) file and the synthesized netlist file
to check whether the design meets the timing requirements. Once synthesis is
complete, the synthesized netlist file (VHDL/Verilog format) and the SDC
(constraints file) are passed as inputs to the placement and routing tool to perform
the back-end activities. The tool used is IC Compiler.
Initialize the floorplan, followed by timing-driven placement of cells, clock tree
insertion and global routing. Transfer the clock tree to the original design (netlist)
residing in Design Compiler. Perform in-place optimization of the design in Design
Compiler. Perform formal verification using Formality. Extract the estimated timing
delays from the layout after the global routing step. Back annotate the estimated
timing data from the global routed design to PrimeTime. Run static timing analysis in
PrimeTime, using the estimated delays extracted after global routing. Perform detailed
routing of the design. Extract the real timing delays from the detailed routed design.
Back annotate the real extracted timing data to PrimeTime. Run post-layout static
timing analysis using PrimeTime. Run functional gate-level simulation of the design
with post-layout timing (if desired). Tape out after LVS and DRC verification.
CAD tools are involved in all stages of the VLSI design flow. Different tools can
be used at different stages thanks to common EDA data formats. CAD tools provide
several advantages:
The ability to evaluate complex conditions in which solving one problem creates
other problems. The use of analytical methods to assess the cost of a decision. The
use of synthesis methods to help provide a solution. The ability to propose and
analyze solutions at the same time.
Figure 1.3 Traditional ASIC Design Flow
[14]
Figure 1.3 graphically illustrates the typical ASIC design
flow discussed above. The acronyms STA and CT represent static timing analysis and
clock tree respectively. DC represents Design Compiler. The Synopsys CAD tool for
physical design is called Integrated Circuit Compiler (ICC).
1.5 Objective of the project
The main objective is to remove glitches or to minimize their effect on the
design.
Glitches are formed when coupling capacitances are large, so the effect of the
coupling capacitance needs to be decreased by placing a buffer on either the aggressor
or the victim net. When signals switch together unnecessarily, they cause dynamic IR
drop.
Implementing the back-end flow requires several inputs, the major one being the
gate-level netlist obtained from synthesis. Floorplanning determines the size of the
cell (or die), creates the boundary and core area, sets the aspect ratio and creates
wire tracks for power planning. The size of the die and the utilization directly
affect wire spacing, power consumption and IR drop.
Virtual flat placement is done to analyze congestion, which affects the goal of a
glitch-free design. Power network synthesis is done to find the IR drop, as it is one
of the on-chip variation problems.
Noise analysis reports the glitches produced by noise. Because of such noise, a
victim net that should remain constant takes on a changing logic value, which in turn
produces unnecessary transitions and a corrupted output. Inserting buffers reduces or
removes the glitch effect.
Buffers are inserted on either the aggressor net or the victim net, and should be
placed before the receiving side of the glitching circuit. The resulting noise should
then be noted to see whether it has decreased or been removed. If it has only
decreased, it has to be below the noise margin to produce a glitch-free output.
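The relationship between coupling capacitance, buffer insertion and the noise margin can be sketched with a first-order textbook estimate of crosstalk on a driven victim net. This is only an illustration, not the noise model used by the tools in this work: the function names, the resistance and capacitance values and the noise margin are hypothetical, and the assumption that a buffer halves the coupled wire segment and drives it more strongly is made purely for the example.

```python
def peak_noise(vdd, c_couple, c_gnd_victim, c_gnd_aggr, r_aggr, r_victim):
    """First-order peak noise on a driven victim net.

    Capacitive divider scaled by 1 / (1 + k), where k is the ratio of the
    aggressor time constant to the victim time constant.
    """
    tau_aggr = r_aggr * (c_gnd_aggr + c_couple)
    tau_victim = r_victim * (c_gnd_victim + c_couple)
    k = tau_aggr / tau_victim
    divider = c_couple / (c_gnd_victim + c_couple)
    return vdd * divider / (1.0 + k)


# Hypothetical values chosen only to illustrate the trend.
VDD = 1.2            # supply voltage in volts
NOISE_MARGIN = 0.4   # allowed bump on the victim in volts

# Before buffering: long victim wire driven by a weak driver.
noise_before = peak_noise(VDD, c_couple=30e-15, c_gnd_victim=40e-15,
                          c_gnd_aggr=40e-15, r_aggr=1e3, r_victim=4e3)

# After buffering (assumption): each victim segment sees half the coupling and
# half the ground capacitance, and is driven by a stronger buffer. The aggressor
# still drives its full wire, so its time constant is kept at the original
# value by passing its full remaining load as c_gnd_aggr.
noise_after = peak_noise(VDD, c_couple=15e-15, c_gnd_victim=20e-15,
                         c_gnd_aggr=55e-15, r_aggr=1e3, r_victim=2e3)

print(f"before: {noise_before:.3f} V, glitch-free: {noise_before < NOISE_MARGIN}")
print(f"after:  {noise_after:.3f} V, glitch-free: {noise_after < NOISE_MARGIN}")
```

With these assumed numbers the bump starts slightly above the noise margin and drops below it once the victim segment is shortened and driven harder, which mirrors the check described above.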
1.6 Organization of Thesis
This section describes the organization of the entire thesis and the flow of
the project from the introduction to the conclusion.
Chapter 1 gives the introduction to the project and its main objectives.
Chapter 2 describes the asynchronous FIFO and explains its different
operations, hierarchy and signals. It is also known as an asynchronous dual-port RAM.
It reads and writes data using counters and registers. The chapter also explains the
top level module and the modules involved in the design.
Chapter 3 explains the flow of synthesis, the libraries and the
constraints applied to the design. It also explains the design environment for the
design.
Chapter 4 explains design planning, which includes the flow
of physical design, floor planning, virtual flat placement, congestion issues and
macro planning.
Chapter 5 describes placement, power planning, power straps, IO
power, core power and the power structure of the design.
Chapter 6 explains the clock tree structure and the different types of
clock distribution structures for the design. It covers the distribution of clock
trees, the analysis of clock sink groups, clock tree attributes and clock network
modeling for synthesis. It also explains the analysis of clock trees, clock skew
scheduling, optimization of registers and clock analysis, as well as the trail of
clock network synthesis, the virtual clocks present in the design and clock design
at the implementation stage.
Chapter 7 explains routing at all stages, after the tracks and the virtual
placement of clock and power nets are formed. Routing physically interconnects all
the signals.
Chapter 8 explains signal integrity and the types of noise that occur, along with
the addition of ECO buffer cells to the aggressor or victim net.
Chapter 9 presents the results obtained at each stage of the design flow.
Chapter 10 gives the conclusion of the project.
Chapter 11 discusses the future scope of the methodology.
Chapter 2
Asynchronous FIFO
2.1 Asynchronous Interface
An asynchronous interface is the set of signals that
comprises the connection between devices of a computer system, where the transfer of
information between devices is organized by the exchange of signals not
synchronized to some controlling clock. A request signal from an initiating device
indicates the requirement to make a transfer; an acknowledging signal from the
responding device indicates the transfer completion. This asynchronous interchange is
also widely known as handshaking.
Most of the time, asynchronous designs are understood to be designs with no
clocks, but the asynchronous FIFO interface circuit in this project incorporates
multiple clocks for transmitting and receiving the data values. The design is
described below along with the top module diagram of the design.
An asynchronous FIFO refers to a FIFO design where data values are written
to a FIFO buffer (RAM) from one clock domain and the data values are read from the
same FIFO buffer from another clock domain, where the two clock domains are
asynchronous to each other. Asynchronous FIFOs are used to safely pass data from
one clock domain to another clock domain.
There are many different ways to design an asynchronous FIFO interface.
The method used in this project is FIFO partitioning with synchronized
pointer comparison. The design works on two clocks, one for transmitting and one for
receiving, and uses Gray counters to compare the pointers and generate the full and
empty status of the RAM that serves as the FIFO buffer for writing and reading the
data values.
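Gray counters are suited to pointer comparison across clock domains because only one bit changes between successive count values, so a pointer sampled in the other clock domain is never more than one count off. The conversion can be sketched as follows; the function names and the 4-bit width are illustrative assumptions, not part of the FIFO design described here.

```python
def binary_to_gray(value: int) -> int:
    """Convert a binary count to its reflected Gray code."""
    return value ^ (value >> 1)


def gray_to_binary(gray: int) -> int:
    """Convert a reflected Gray code back to the binary count."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


WIDTH = 4  # hypothetical pointer width
codes = [binary_to_gray(i) for i in range(1 << WIDTH)]

# Every increment, including the wrap from the last code to the first,
# changes exactly one bit.
single_bit_steps = all(
    bits_changed(codes[i], codes[(i + 1) % len(codes)]) == 1
    for i in range(len(codes))
)
print(single_bit_steps)
print([format(code, f"0{WIDTH}b") for code in codes[:4]])
```

A plain binary counter does not have this property: going from 7 to 8 toggles four bits at once, and sampling it mid-transition could yield an unrelated value.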
Data words are placed into a FIFO buffer memory array by control signals in
one clock domain, and the data words are removed from another port of the same
FIFO buffer memory array by control signals from a second clock domain. The
difficulty associated with doing FIFO design is related to generating the FIFO
pointers and finding a reliable way to determine full and empty status on the FIFO. [6]
Generally FIFOs are used where the write operation is faster than the read
operation. However, even with the different speeds and access types, the average rate
of data transfer remains constant. FIFO pointers keep track of the number of FIFO
memory locations read and written, and the corresponding control logic prevents the
FIFO from either underflowing or overflowing. FIFO architectures inherently face the
challenge of synchronizing each pointer with the pointer logic of the other clock
domain and of controlling the read and write operations on the FIFO memory locations
safely.
2.2 Issues in Designing Asynchronous FIFO
Although the design is described as asynchronous and works in a
multi-clock environment, it is essential to synchronize signals that cross between
the two clock domains, as data can otherwise be lost due to setup and hold
violations. It is very important to understand signal stability in multi-clock
designs, since to a travelling signal the new clock domain appears to be
asynchronous. If the signal is not synchronized to the new clock, the first
storage element of the new clock domain may go into a metastable state, and in the
worst case the resolution time cannot be predicted. The metastable value can
propagate through the new clock domain, resulting in functional failure. To prevent
such failures, the setup time and hold time specifications have to be obeyed in the
design. Manufacturers provide statistics on the probability of failure of flip-flops
due to metastability in terms of MTBF (Mean Time Between Failures). Synchronizers are
used to prevent the downstream logic from entering the metastable state in
multi-clock designs with multi-bit data values.
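As a point of reference, the MTBF of a synchronizing flip-flop is commonly modelled with the expression below. This is the usual textbook form rather than a formula taken from this thesis, and the symbols are introduced here for illustration.

```latex
\mathrm{MTBF} = \frac{e^{\,t_r/\tau}}{T_0 \, f_{\mathrm{clk}} \, f_{\mathrm{data}}}
```

Here t_r is the time allowed for metastability to resolve, \tau and T_0 are constants of the flip-flop and process, f_{clk} is the sampling clock frequency and f_{data} is the rate at which the asynchronous input changes. Because t_r appears in the exponent, adding a second synchronizer stage, which gives roughly one more clock period of resolution time, raises the MTBF sharply.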
Thus, for the FIFO architecture to work efficiently, the design of the FIFO
pointers is the key issue. At this point, a deep understanding of the FIFO read and
write pointers becomes necessary. On reset, both read and write pointers point to the
starting location of the FIFO. This location is the first location to which data has
to be written, and it is also the first location to be read. Therefore,
in general, the read pointer always points to the word to be read and the write
pointer always points to the next location to which data has to be written.
2.3 Operation of the Design
2.3.1 Data write operation
When both the read and write pointers point to the first location of the FIFO, the
empty flag is asserted, indicating that the FIFO is empty. Data can now be written.
Data is written to the location addressed by the write pointer, and after the write
operation the write pointer is incremented to point to the next location to be
written. At the same time, the empty flag is de-asserted, which indicates that the
FIFO is not empty and some data is available. One notable point regarding the read
pointer is that while the empty flag is active, the data addressed by the read
pointer is always invalid. When the first data word is written and the empty flag is
cleared (i.e. empty flag inactive), the read pointer logic immediately drives the data
from the location to which it was pointing to the read port of the dual port RAM,
ready to be read by the read logic. The biggest advantage of this implementation of
the read logic is that only one clock pulse is required to read from the read port,
since the previous clock cycle has already incremented the read pointer and driven
the data to the read port. This helps in reducing the latency in detecting the empty
and full flag status. The empty flag can also be asserted in one more condition: if,
after n data write operations, the same number n of reads is performed, both pointers
are again equal. Hence, if the pointers catch up with each other, the empty flag is
asserted.
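The pointer behaviour described above can be sketched with a small behavioural model. It is a single-clock illustration only, with no clock-domain crossing and no full flag; the class name, method names and depth are hypothetical and do not correspond to the modules of the actual design.

```python
class PointerFifo:
    """Behavioural model of the FIFO read/write pointers and empty flag."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.mem = [None] * depth
        self.wptr = 0  # next location to be written
        self.rptr = 0  # location of the word to be read

    @property
    def empty(self) -> bool:
        # Empty flag: asserted whenever both pointers are equal.
        return self.wptr == self.rptr

    def write(self, data) -> None:
        # Write to the addressed location, then advance the write pointer.
        self.mem[self.wptr] = data
        self.wptr = (self.wptr + 1) % self.depth

    def read(self):
        if self.empty:
            raise RuntimeError("read from empty FIFO: data would be invalid")
        data = self.mem[self.rptr]
        self.rptr = (self.rptr + 1) % self.depth
        return data


fifo = PointerFifo(depth=8)
empty_after_reset = fifo.empty            # pointers equal after reset

for word in (10, 20, 30):
    fifo.write(word)
empty_after_writes = fifo.empty           # de-asserted once data is available

words = [fifo.read() for _ in range(3)]   # same number of reads as writes
empty_after_reads = fifo.empty            # pointers have caught up again

print(empty_after_reset, empty_after_writes, words, empty_after_reads)
```

Note that writing `depth` words into this model without reading would also make the pointers equal, so the model would wrongly report empty; that wrap-around case is what the following sections address.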
2.3.2 FIFO full status
When the write pointer reaches the top of the FIFO, it points to the last location
that can be written. No read operation has been performed yet, so the read pointer
still points to the first location. This is one way to generate the FIFO full
condition. However, if the full flag is asserted when the write pointer reaches the top
of the FIFO, this is not the actual FIFO full condition; it is only almost full, as
there is still one location that can be written. An almost empty condition can exist
in the FIFO in a similar way. A further write operation fills that location and
increments the write pointer. Since the location was the last one, the write pointer
wraps around to the first location. Now the read and write pointers are equal, and
hence the empty flag is asserted instead of the full flag, which is a fatal mistake.
Hence the wrap-around condition of the write pointer may be a FIFO full condition.
Suppose that, after data has been written to the FIFO (with the write pointer at the
top of the FIFO), some data has been read and the read pointer is somewhere in the
middle of the FIFO. One more write operation causes the write pointer to wrap. Note
that even though the write pointer now points to the first location of the FIFO, this
is NOT a FIFO full condition, since the read pointer has moved up from the first
location. Further writes push the write pointer up. Now imagine that the read pointer
also wraps around after some more read operations. At this point both pointers have
wrapped around, but there is neither a FIFO full nor a FIFO empty condition; data can
be written to or read from the FIFO. The disadvantage of a FIFO of this kind is that
the status signals cannot be fully synchronized with the read and write clocks.
2.3.3 Asynchronous FIFO pointers
The FIFO is also full when the pointers are equal, that is, when the write pointer
has wrapped around and caught up with the read pointer. This is a problem: when the
pointers are equal, the FIFO is either empty or full, and the pointers alone do not
show which condition has occurred.
One design technique used to distinguish between full and empty is to add an
extra bit to each pointer. Whenever the write pointer increments past the final FIFO
address, it increments the otherwise unused MSB while setting the rest of the bits
back to zero, as shown in Figure 2.2 (the FIFO has wrapped and toggled the pointer
MSB). The same is done with the read pointer. If the MSBs of the two pointers are
different, the write pointer has wrapped one more time than the read pointer. If the
MSBs of the two pointers are the same, both pointers have wrapped the same number of
times.
Figure 2.2 FIFO full and empty conditions
[15]
Using n-bit pointers, where (n-1) is the number of address bits required to
access the entire FIFO memory buffer, the FIFO is empty when both pointers,
including the MSBs, are equal, and the FIFO is full when both pointers are equal in
all bits except the MSBs, which differ. The FIFO design uses n-bit pointers for a FIFO
with 2^(n-1) writable locations to help handle the full and empty conditions.
Figure 2.2 illustrates the full and empty conditions.
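The extra-MSB comparison described above can be sketched in Python. This is only an illustrative model using plain binary pointers; the 3-bit address width and the names is_empty, is_full and advance are assumptions made for the example and are not taken from the design.

```python
ADDR_BITS = 3                       # assumed: 2**3 = 8 writable locations
PTR_BITS = ADDR_BITS + 1            # n-bit pointer = address bits + wrap MSB
ADDR_MASK = (1 << ADDR_BITS) - 1
PTR_MASK = (1 << PTR_BITS) - 1


def advance(ptr, steps=1):
    """Increment an n-bit pointer; the MSB toggles on every wrap."""
    return (ptr + steps) & PTR_MASK


def is_empty(wptr, rptr):
    """Empty: all bits, including the MSBs, are equal."""
    return wptr == rptr


def is_full(wptr, rptr):
    """Full: address bits are equal but the wrap MSBs differ."""
    same_address = (wptr & ADDR_MASK) == (rptr & ADDR_MASK)
    msb_differs = (wptr >> ADDR_BITS) != (rptr >> ADDR_BITS)
    return same_address and msb_differs


wptr = rptr = 0                     # state after reset
wptr = advance(wptr, 8)             # eight writes: write pointer wraps once
print(is_empty(wptr, rptr), is_full(wptr, rptr))
rptr = advance(rptr, 8)             # eight reads: read pointer wraps too
print(is_empty(wptr, rptr), is_full(wptr, rptr))
```

With both pointers addressing location 0, the wrap bit alone decides whether the FIFO is reported as full or as empty.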
The counters designed to synchronize the signals are Gray code counters. A
Gray code counter is chosen rather than a binary counter because synchronizing a
binary count value from one clock domain to another is problematic: every bit of an
n-bit counter can change simultaneously (for example, 7->8 in binary is 0111->1000,
where all bits change). Gray codes allow only one bit to change on each clock
transition, eliminating the problem of synchronizing multiple changing signals on the
same clock edge. It is desirable to create both an n-bit Gray code counter and an
(n-1)-bit Gray code counter. It would certainly be easy to create the two counters
separately, but it is also easy and efficient to create a common n-bit Gray code
counter and then modify the 2nd MSB to form an (n-1)-bit Gray code counter with
shared LSBs. This will be called a dual n-bit Gray code counter.
Figure 2.3 n-bit Gray code converted to an (n-1)-bit Gray code
[15]
Inverting the second MSB of the second half of the 4-bit Gray code produces
the desired 3-bit Gray code sequence in the three LSBs of the 4-bit sequence. The
only remaining problem is that the 3-bit Gray code with the extra MSB is no longer a
true Gray code: when the sequence changes from 7 (Gray 0100) to 8 (~Gray 1000), and
again from 15 (~Gray 1100) to 0 (Gray 0000), two bits change instead of just one. A
true Gray code changes only one bit between counts. Figure 2.3 illustrates this
conversion.
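The conversion can be checked with a short Python sketch. The helper names bin_to_gray, dual_gray and bits_changed are hypothetical and chosen only for this illustration; the sketch models the code sequence, not the counter hardware.

```python
def bin_to_gray(value):
    """Standard binary-to-Gray conversion."""
    return value ^ (value >> 1)


def dual_gray(count, nbits=4):
    """n-bit Gray code with the 2nd MSB inverted in the second half."""
    gray = bin_to_gray(count & ((1 << nbits) - 1))
    if gray >> (nbits - 1):            # MSB set: second half of the sequence
        gray ^= 1 << (nbits - 2)       # invert the second MSB
    return gray


def bits_changed(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# The three LSBs repeat the 3-bit Gray sequence in both halves.
low_bits = [dual_gray(i) & 0b111 for i in range(16)]
print(low_bits[:8] == low_bits[8:] == [bin_to_gray(i) for i in range(8)])

# Binary 7->8 changes four bits, true Gray one bit, the modified code two.
print(bits_changed(7, 8),
      bits_changed(bin_to_gray(7), bin_to_gray(8)),
      bits_changed(dual_gray(7), dual_gray(8)))
```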
2.4 Handling full and empty conditions
Exactly how FIFO full and FIFO empty are implemented is design-dependent.
The FIFO design in this paper assumes that the empty flag will be generated in the
read-clock domain to ensure that the empty flag is detected immediately when the
FIFO buffer is empty, that is, the instant that the read pointer catches up to the write
pointer (including the pointer MSBs). The design likewise assumes that the full flag
will be generated in the write-clock domain to ensure that the full flag is detected
immediately when the FIFO buffer is full, that is, the instant that the write pointer
catches up to the read pointer (except for different pointer MSBs).
2.4.1 Generating empty flag
The FIFO is empty when the read pointer and the synchronized write pointer
are equal. The empty comparison is simple to do. Pointers that are one bit larger than
needed to address the FIFO memory buffer are used. If the extra bits of both pointers
(the MSBs of the pointers) are equal, the pointers have wrapped the same number of
times and if the rest of the read pointer equals the synchronized write pointer, the
FIFO is empty. The Gray code write pointer must be synchronized into the read-clock
domain. Since only one bit changes at a time using a Gray code pointer, there is no
problem synchronizing multi-bit transitions between clock domains. In order to
efficiently register the rempty output, the synchronized write pointer is actually
compared against the rgraynext (the next Gray code that will be registered into the
rptr).
2.4.2 Generating full flag
Since the full flag is generated in the write-clock domain by running a
comparison between the write and read pointers, one safe technique for doing FIFO
design requires that the read pointer be synchronized into the write clock domain
before doing pointer comparison. The full comparison is not as simple to do as the
empty comparison. Pointers that are one bit larger than needed to address the FIFO
memory buffer are still used for the comparison, but simply using Gray code counters
with an extra bit to do the comparison is not valid to determine the full condition.
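A short Python check shows why the simple comparison fails with true Gray code pointers. It assumes 4-bit pointers for an 8-location FIFO; the function names are hypothetical. The second test, which requires the two upper bits to differ, is one commonly described alternative and is shown only for contrast; it is not necessarily the comparison used in this design.

```python
def bin_to_gray(value):
    """Standard binary-to-Gray conversion."""
    return value ^ (value >> 1)


def naive_full(wgray, rgray):
    """MSBs differ and all remaining bits are equal (binary-style test)."""
    return (wgray ^ rgray) == 0b1000


def top_two_bits_full(wgray, rgray):
    """Both upper bits differ and the remaining bits are equal."""
    return (wgray ^ rgray) == 0b1100


# The FIFO is really full when the write pointer is 8 counts ahead.
full_cases = [(bin_to_gray((r + 8) % 16), bin_to_gray(r)) for r in range(16)]
print(any(naive_full(w, r) for w, r in full_cases))         # never detected
print(all(top_two_bits_full(w, r) for w, r in full_cases))  # always detected

# The naive test also fires when only one word is stored (w = 8, r = 7).
print(naive_full(bin_to_gray(8), bin_to_gray(7)))
```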
2.5 Procedure to Design FIFO Module
The general block diagram of the asynchronous FIFO is shown in Figure 2.4.
Functionally, the diagram comprises four main blocks, including the dual port RAM,
the read pointer logic and the write pointer logic.
Top-module ports shown in Figure 2.4: inputs Data_in, WriteEn_in, WCLK, ReadEn_in,
RCLK and Clear_in; outputs Data_out, Empty_out and Full_out.
Figure 2.4 Top Module of Asynchronous FIFO
[16]
Figure 2.5 Internal Architecture of Asynchronous FIFO
[16]
The dual port RAM has two ports: one for reading and the other for writing.
These two accesses of the FIFO are independent of each other and are completely
controlled by the read pointer logic and the write pointer logic. The number of
memory locations in a FIFO varies from 8 locations to some kilobytes. The data
width of each location also varies, from one to 256 bits, depending on the
application and technology. Modern FIFOs provide options to program the above
parameters as required.
Data is written sequentially into the FIFO and read sequentially, such that the
first data written is the first data read out, and so on with the remaining sequential
data. The architecture of the FIFO is thus completely characterized by these two
independent operations, as shown in Figure 2.5. The dual port RAM and the read and
write logic circuits with synchronizers accomplish this task. The read port has its
associated memory addressing logic, called the read pointer logic, and the write port
has the write pointer logic. When the FIFO is reset, both the read and write pointers
point to the first memory location of the FIFO. As data is written to the FIFO, the
write pointer is incremented and points to the next memory location. Similarly, the
read pointer is incremented on every read. Both pointers work in a circular fashion,
i.e. after reaching the last location they jump back to the first location of the FIFO.
The full flag and the empty flag are used to detect the status of the FIFO. These
two flags are generated from the comparison of the FIFO pointers. The full flag is
asserted when the FIFO is completely full. The empty flag is asserted when the FIFO
is empty. Assertion of the full flag indicates that no further data can be written
until at least one data word is read out of the FIFO. Assertion of the empty flag
indicates that no more data can be read from the FIFO until at least one data word is
written to the FIFO.
If data is written to the FIFO after the full flag has been asserted, an overflow
condition occurs. Similarly, if a read operation is performed after the empty flag
has been asserted, an underflow occurs. Either overflow or underflow causes data
corruption or data loss. Safe and reliable FIFO designs always avoid both extreme
conditions.
A new asynchronous FIFO design is presented here. The concept of using
pointer difference for determining the FIFO status is already used in synchronous
FIFO designs. Here same concept is extended to asynchronous FIFO. The block
diagram consists of a dual port RAM, two 4 bit binary up-counters, address pointer
gap generation logic, full and empty condition generation logic, next read control
logic and next write control logic.
One of the most interesting architectural decisions is how to calculate the depth
of the FIFO. For the worst-case scenario, the difference in data rate between write
and read should be at its maximum. Thus the maximum data rate should be considered
for the write operation and the minimum data rate for the read operation when
calculating the depth of the FIFO.
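As a rough illustration of this worst-case reasoning, the sketch below estimates the depth needed to absorb one write burst. The formula (burst length minus the words drained during the burst), the absence of idle cycles, the function name fifo_depth and the example figures are all assumptions made for the example; they are not taken from this design.

```python
def fifo_depth(burst_len, f_write_mhz, f_read_mhz):
    """Estimate the words that must be buffered during one write burst.

    burst_len   : words written back to back (assumed worst-case burst)
    f_write_mhz : maximum write clock frequency, integer MHz
    f_read_mhz  : minimum read clock frequency, integer MHz
    """
    if f_read_mhz >= f_write_mhz:
        return 1                      # assumed minimum: the reader keeps up
    # Words the reader drains while the burst is being written, rounded
    # down so that the estimate stays on the pessimistic side.
    words_read = (burst_len * f_read_mhz) // f_write_mhz
    return burst_len - words_read


# Hypothetical example: a 120-word burst written at 80 MHz, read at 50 MHz.
print(fifo_depth(120, 80, 50))
```

In practice the result would normally be rounded up to the next power of two so that the pointers wrap naturally.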
The dual port RAM has two ports, one for writing and the other for reading. When
data is written to the FIFO, the write pointer is incremented and points to the next
memory location. The read pointer is incremented similarly when a read operation
takes place.
To determine the full/empty flags and the FIFO size, the read pointer must be
synchronised to the write domain, and the write pointer must be synchronised to the
read domain. The design consists of a dual port RAM, two 4-bit Gray counters, full
and empty condition generation logic, write next control logic and read next control
logic.
The naming conventions are as follows:
Data_in : input data (8 bits wide)
Data_out : output data (8 bits wide)
ReadEn_in : read enable
WriteEn_in : write enable
Clear_in : clear input
WClk : write clock
RClk : read clock
Empty_out : FIFO empty flag, asserted when the FIFO is empty
Full_out : FIFO full flag, asserted when the FIFO is full
Mem [3:0] : memory to store data
pNextWordToWrite : write pointer
pNextWordToRead : read pointer
EqualAddresses : write pointer == read pointer
NextWriteAddressEn : write next address enable
NextReadAddressEn : read next address enable
Set_Status : set status based on pointers
Rst_Status : reset status set by pointers
Status : status of the FIFO
PresetFull : reset when the FIFO is full
PresetEmpty : reset when the FIFO is empty
2.5.1 Dual port RAM
For this design the depth of the RAM is 16 and the width is 8. Data is written
to the FIFO only if the FIFO is not full and the write enable signal is asserted.
Similarly, data is read out of the FIFO only if the FIFO is not empty and the read
enable signal is asserted.
2.5.2 Gray counters
Four-bit Gray counters are used to generate the addresses for the read and write
ports. These address generators have external reset and enable signals, called
write next enable and read next enable, which are generated and controlled by the
next write control logic and the next read control logic. A single reset is mapped
to both Gray counters to reset both the write and read pointers.
2.5.3 Address pointer difference generation
This block compares the read and write addresses and outputs the difference
between the two address pointers. It contains comparators, adders and subtractors,
and its output gives the status of the FIFO.
2.5.4 Full and Empty generation logic
This block takes the pointer difference as input and gives the status of the
FIFO. If the pointer difference is zero, the empty condition is generated, and if the
pointer difference is 15, the full condition is generated.
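The pointer-difference rule can be illustrated with a small behavioural model in Python. This is a single-clock sketch that ignores the synchronizers and Gray coding altogether; the class name PointerDiffFifo and its methods are hypothetical and only mirror the rules stated above (16 x 8 RAM, empty at a difference of 0, full at a difference of 15, gated read and write).

```python
class PointerDiffFifo:
    """Behavioural sketch of a 16 x 8 FIFO with pointer-difference flags."""

    DEPTH = 16

    def __init__(self):
        self.mem = [0] * self.DEPTH
        self.wptr = 0                 # next location to write
        self.rptr = 0                 # next location to read

    def diff(self):
        """Difference of the two 4-bit address pointers, modulo 16."""
        return (self.wptr - self.rptr) % self.DEPTH

    def empty(self):
        return self.diff() == 0

    def full(self):
        return self.diff() == 15

    def write(self, data, write_en=True):
        """Write only if enabled and not full; report whether it happened."""
        if not write_en or self.full():
            return False
        self.mem[self.wptr] = data & 0xFF          # 8-bit data width
        self.wptr = (self.wptr + 1) % self.DEPTH
        return True

    def read(self, read_en=True):
        """Read only if enabled and not empty; return None otherwise."""
        if not read_en or self.empty():
            return None
        data = self.mem[self.rptr]
        self.rptr = (self.rptr + 1) % self.DEPTH
        return data


fifo = PointerDiffFifo()
accepted = sum(fifo.write(value) for value in range(20))
print(accepted, fifo.full(), fifo.read())
```

Because full is flagged at a difference of 15, this model accepts at most 15 words even though the RAM has 16 locations.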
2.5.5 Next read and write control logic
This control logic decides whether reads and writes are enabled once the empty
and full conditions are asserted.
Chapter 3
Logic Synthesis
3.1 Synthesis and its Basic Flow
Synthesis is the process that generates a gate-level netlist for an IC design that
has been defined using a Hardware Description Language (HDL). Synthesis includes
reading the HDL source code and optimizing the design from that description. Using
the technology library's cell logical view, the Logic Synthesis tool performs the
process of mathematically transforming the ASIC's register-transfer level (RTL)
description into a technology-dependent netlist. This process is similar to a software
compiler converting a high-level C-program listing into a processor-dependent
assembly-language listing. The netlist is the standard-cell representation of the ASIC
design, at the logical view level. It consists of instances of the standard-cell library
gates, and port connectivity between gates. Proper synthesis techniques ensure
mathematical equivalency between the synthesized netlist and original RTL
description. The netlist contains no unmapped RTL statements and declarations.
Figure 3.1 shows the basic synthesis flow.
(Figure 3.1 blocks: RTL Source, Constraints and Technology Libraries feed RTL synthesis.)
Figure 3.1 Synthesis flow
[17]
3.2 Synopsys Design Compiler Flow for Synthesis
Design Compiler is a synthesis tool from Synopsys Inc. In simple terms, a
synthesis tool takes an RTL [Register Transfer Level] hardware description written in
either Verilog or VHDL, together with a standard cell library, as input, and produces
a technology-dependent gate-level netlist as output. The gate-level netlist is a
structural representation of the design built only from the cells in the standard
cell library. The synthesis tool internally performs many steps, which are listed
below. Also below is the flowchart of the synthesis process.
Design Compiler reads in technology libraries, DesignWare libraries, and
symbol libraries to implement synthesis. During the synthesis process, Design
Compiler [DC] translates the RTL description to components extracted from the
technology library and DesignWare library. The technology library consists of basic
logic gates and flip-flops.
The DesignWare library contains more complex cells, for example adders and
comparators, which can be used as arithmetic building blocks. DC can automatically
determine when to use DesignWare components and can then efficiently synthesize
these components into gate-level implementations.
Design Compiler also needs the RTL designed by the designer. It reads the
RTL hardware description written in either Verilog/VHDL.
The synthesis tool now performs many steps, including high-level RTL
optimization, translation of RTL to un-optimized Boolean logic, technology-independent
optimizations, and finally technology mapping to the available standard cells in the
technology library, known as the target library. The resulting gate-level netlist
also depends on the constraints given. Constraints are the designer's specification
of the timing and environmental restrictions [area, power, process etc.] under which
synthesis is to be performed. As an RTL designer, it is good to understand the target
standard cell library, so that one can better understand how the RTL code will be
synthesized into gates.
After the design is optimized, it is ready for DFT [design for test/ test
synthesis]. DFT is test logic; designers can integrate DFT into design during
synthesis. This helps the designer to test for issues early in the design cycle and also
can be used for debugging process after the chip comes back from fabrication.
After test synthesis, the design is ready for the place and route tools. The Place
and route tools place and physically interconnect cells in the design. Based on the
physical routing, the designer can back-annotate the design with actual interconnect
delays; DC can be used again to resynthesize the design for more accurate timing
analysis.
While running DC, it is important to monitor and check the log files, reports,
scripts etc. to identify issues which might affect the area, power and performance of
the design.
3.3 Design Flow
3.3.1 Read Design
Design Compiler reads designs into memory from design files. Many designs can
be in memory at any time. After a design is read in, you can change it in numerous
ways, such as grouping or ungrouping its subdesigns or changing subdesign
references. Design Compiler provides the following ways to read design files:
the analyze and elaborate commands, and the read_file command.
Using the analyze and elaborate Commands
The analyze command does the following:
It reads an HDL source file; checks it for errors (without building generic logic
for the design); creates HDL library objects in an HDL-independent intermediate
format; and stores the intermediate files in a location you define. If the analyze
command reports errors, fix them in the HDL source file and run analyze again. After
a design is analyzed, you must reanalyze it only when you change it.
The elaborate command does the following:
It translates the design into a technology-independent design (GTECH) from
the intermediate files produced during analysis; allows changing of parameter values
defined in the source code; allows VHDL architecture selection; replaces the HDL
arithmetic operators in the code with DesignWare components; and automatically
executes the link command, which resolves design references.
Resolving a reference means that the design library or file containing the
detailed design data for the sub-block can be found and processed. If any references
in the netlist cannot be resolved, the link command issues warnings stating which
sub-component designs are not available.
3.3.2 SYNTHESIS LIBRARIES
3.3.2.1 Target library
The target library variable defines the technology library that the tool uses to
build the circuit. That is, during the technology mapping phase Design Compiler
selects components from the library specified with the target library variable to
build the gate-level netlist.
3.3.2.2 Synthetic Library
The synthetic library variable specifies the synthetic or DesignWare libraries.
These synthetic libraries are technology-independent, microarchitecture-level design
libraries providing implementations for various IP blocks.
3.3.2.3 Link Library
The link library variable is used to resolve design references. That is, Design
Compiler must connect all the library components and designs it references. This step
is called linking the design or resolving references.
Note that in most cases the link library is the same as the target library.
3.3.2.4 Symbol Library
The symbol library defines the schematic symbols for the components in the
technology library. These symbols are needed for drawing design schematics.
3.4 Design Environment
In order to obtain optimum results from synthesis, designers have
to methodically constrain their designs by describing the design environment,
target objectives and design rules. The constraints may contain timing and/or area
information, usually derived from design specifications. Synthesis tool uses these
constraints to perform synthesis and tries to optimize the design with the aim
of meeting target objectives. It defines the environment by defining the operating
conditions, wire load models, and system interface characteristics.
Operating conditions include temperature, voltage, and process variations. Wire
load models estimate the effect of wire length on design performance.
System interface characteristics include input drives, input and output loads,
and fan-out loads. The environment model directly affects design synthesis
results.
Operating conditions describe the process, voltage and temperature
conditions of the design. The process variation accounts for deviations in the
semiconductor fabrication process. The design's supply voltage can vary from its
established ideal value during day-to-day operation. Temperature variation is
unavoidable in the everyday operation of a design. Effects on performance caused
by temperature fluctuations are most often handled as linear scaling effects. The
library contains the description of these conditions, usually described as the
WORST, TYPICAL and BEST case. The names of the operating conditions are library
dependent.
By changing the value of the operating condition command, the full range of
process variation is covered. The worst-case operating condition is generally used
during the pre-layout synthesis phase, thereby optimizing the design for maximum
setup time. The best-case condition is commonly used to fix hold-time
violations. The typical case is mostly ignored, since analysis at the worst and best
cases also covers the typical case. It is possible to optimize the design with both
the worst and the best case simultaneously. This optimization is achieved by
analysing both cases together, and it is very useful for fixing possible hold-time
violations. Wire load models are used to estimate net delays as a function of
loading; different wire load models are available for a particular block.
Figure 3.2 shows the design environment and the constraints added for the
particular blocks. The design environment consists of two blocks and a clock
divider circuit. The clock divider logic generates a clock and applies it to block B.
Block A is used to send data to block B: block A is the input to the design and
its output is applied to block B. For each block, particular constraints are added
on the inputs and outputs to drive signals to their respective locations.
Figure 3.2 Design Environment for a design
[10]
3.5 Synthesis Constraints
There are basically two types of design constraints: design rule constraints
and optimization constraints.
Design rule constraints are supplied in the technology library we
specify. They are referred to as the implicit design rules. These rules are
established by the library vendor and, for the proper functioning of the fabricated
circuit, they must not be violated. We can, however, specify stricter design rules
if appropriate. The rules we specify are referred to as the explicit design rules.
Design optimization constraints define timing and area optimization goals
for Design Compiler. These constraints are user-specified. Design Compiler
optimizes the synthesis of the design in accordance with these constraints,
but not at the expense of the design rule constraints. That is, Design Compiler
attempts never to violate the higher-priority design rules.
3.5.1 Design Rule Constraints
Maximum transition time is the longest time allowed for a driving pin of a
net to change its logic value. Maximum and minimum capacitance constraints bound
the total capacitive load that an output pin can drive. The total capacitance
comprises the load pin capacitance and the interconnect capacitance. The maximum
fanout constraint is applied to the driving pin.
Some technology libraries contain cell degradation tables. The cell
degradation tables list the maximum capacitance that can be driven by a cell as
a function of the transition times at the inputs of the cell.
3.5.2 Design Optimization Constraints
The system clock definitions and clock delays are the most important
constraints in an ASIC design. The clock signal is the synchronization signal that
controls the operation of the system. The clock signal also defines the timing
requirements for all paths in the design. Most of the other timing constraints are
related to the clock signal.
A multicycle path is an exception to the default single-cycle timing
requirement of the paths. That is, on a multicycle path the signal requires more
than a single clock cycle to propagate from the path start point to the path endpoint.
Clock uncertainty is used to define the clock skew information. Basically
this is used to add a certain amount of margin to the clock, both for setup and hold
times. During the pre-layout phase one can add more margin than in the
post-layout phase.
Input delay specifies the arrival time of an input signal in relation to the
clock. It is used at the input ports to specify the time it takes for the data to be
stable after the clock edge. Given the top-level timing specification of the design,
this information may also be extracted for the sub-blocks of the design. Output delay
is used at the output ports to define the time it takes for the data to be available
before the clock edge. Given the top-level timing specification of the design, this
information may also be extracted for the sub-blocks of the design.
Minimum and maximum path delays allow constraining paths individually
and setting specific timing constraints on those paths. Input transition and output
load capacitance can be used to constrain the input slew rate and output
capacitance on output pins.
3.6 Design Constraints
3.6.1 create_clock command is used to define a clock object with a particular
period and waveform. The period option defines the clock period, while the
waveform option controls the duty cycle and the starting edge of the clock. This
command is applied to objects of type pin or port.
In some cases, a block may only contain combinational logic. To define delay
constraints for this block, one can create a virtual clock and specify the input and
output delays in relation to the virtual clock. To create a virtual clock, designers may
replace the port name (CLK, in the above example) with the name <virtual clock
name>, in the above command. Alternatively, one can use the set_max_delay or
set_min_delay commands to constrain such blocks.
3.6.2 create_generated_clock command is used for clocks that are generated
internal to the design. This command may be used to describe frequency
divided/multiplied clocks as a function of the primary clock.
3.6.3 set_dont_touch is used to set a dont_touch property on the
current_design, cells, references or nets. This command is frequently used during
hierarchical compilation of the blocks. It can also be used to prevent DC from
inferring certain types of cells present in the technology library.
3.6.4 set_input_delay specifies the input arrival time of a signal in relation to
the clock. It is used at the input ports to specify the time it takes for the data to be
stable after the clock edge. The timing specification of the design usually contains this
information, as the setup/hold time requirements for input signals. Given the top-level
timing specification of the design, this information may also be extracted for the sub-
blocks of the design.
In Figure 3.3, a maximum input delay constraint of 23ns and a minimum input
delay constraint of 0ns are specified for the signal datain with respect to the clock
signal CLK, which has a 50% duty cycle and a period of 30ns. In other words, the
setup-time requirement for the input signal datain is 7ns, while the hold-time
requirement is 0ns.
Figure 3.3 Specification of the Input Delay
[10]
3.6.5 set_output_delay command is used at the output port to define the time
it takes for the data to be available before the clock edge. The timing specification of
the design usually contains this information. Given the top-level timing specification
of the design, this information may also be extracted for the sub-blocks of the design.
In Figure 3.4, the output delay constraint of 19ns is specified for the signal
dataout with respect to the clock signal CLK, with a 50% duty cycle and a period of
30ns. This means that the data is valid for 11ns after the clock edge.
Figure 3.4 Specification of the Output Delay
[10]
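The figures quoted for the input and output delay examples follow from simple arithmetic on the clock period, sketched below in Python. The function names are hypothetical and the relations are only those implied by the two examples: the remaining budget is the period minus the specified delay, and the hold requirement equals the minimum input delay.

```python
CLOCK_PERIOD_NS = 30          # period of CLK in both examples


def setup_requirement(clock_period, max_input_delay):
    """Time left in the cycle after the latest arrival of the input data."""
    return clock_period - max_input_delay


def hold_requirement(min_input_delay):
    """Earliest arrival of the input data after the clock edge."""
    return min_input_delay


def valid_after_edge(clock_period, output_delay):
    """Remaining part of the period once the output delay is budgeted."""
    return clock_period - output_delay


print(setup_requirement(CLOCK_PERIOD_NS, 23))   # datain, Figure 3.3
print(hold_requirement(0))                      # datain, Figure 3.3
print(valid_after_edge(CLOCK_PERIOD_NS, 19))    # dataout, Figure 3.4
```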
3.6.6 set_clock_latency command is used to define the estimated clock
insertion delay during synthesis. This is primarily used during the prelayout synthesis
and timing analysis. The estimated delay number is an approximation of the delay
produced by the clock tree network insertion (done during the layout phase).
3.6.7 set_clock_uncertainty command lets the user define the clock skew
information. Basically this is used to add a certain amount of margin to the clock,
both for setup and hold times. During the pre-layout phase one can add more margin
than in the post-layout phase.
3.6.8 set_false_path is used to instruct DC to ignore a particular path for
timing or optimization. Identification of false paths in a design is critical. Failure
to do so compels DC to optimize all paths in order to reduce total negative slack.
Consequently, the critical timing paths may be adversely affected by the optimization
of all the paths, which also includes the false paths. The valid start points for
this command are the input ports or the clock pins of the sequential elements, and the
valid endpoints are the output ports or the data pins of the sequential cells.
3.6.9 set_max_delay defines the maximum delay required in terms of time
units for a particular path. In general, it is used for blocks that contain
combinational logic only. However, it may also be used to constrain a block that is
driven by multiple clocks, each with a different frequency.
3.6.10 set_min_delay is the opposite of the set_max_delay command, and is
used to define the minimum delay required in terms of time units for a particular path.
Chapter 4
Design Planning
4.1 Introduction
Design planning was not a concern when designs were relatively small (less
than one million placeable components). The implementation of those designs relied
on a flat design methodology where the whole design was viewed as one entity.
However, as the level of integration increased and multi-million cell designs started
to appear, these designs exceeded the capacity of a flat design flow, and convergence
issues became more severe. At that point, design planning became a mandatory step
for a successful and efficient implementation of the designs.
Design planning here means the design steps that are needed to manage the
implementation and verification of the various components of the design.
To manage complexity, a design planning system is responsible for partitioning the
design into a number of components/blocks such that each can be designed and
optimized independently. The top-level design constraints are partitioned and mapped
onto the blocks to ensure that the overall design meets its design targets. Once each
block is designed, design planning is responsible for the steps needed to integrate
these blocks and ensure that the design goals are met.
Design planning was not a necessity in previous generations of process
technologies due to two reasons: the size of the design in terms of the number of
gates was reasonable, and the performance targets were modest. In current and future
process technologies, two aspects made design planning a necessity: the exponential
growth of the number of transistors that can be packed on a die, and the aggressive
and tight design constraints and market forces.
There are two kinds of digital design styles: custom and structured. Each
design style has its own design methodologies and goals. Custom designs are
typically used for high-end microprocessors, high-end graphics, and communication
designs. The other type of design is Application Specific Integrated Circuits (ASIC)
for various process generation technologies.
The basic flow of physical design is as follows
Figure 4.1 Basic flow of Physical design
[10]
Before entering design planning, the first step of the design flow is data setup.
Figure 4.2 shows the data setup for the design.
Synthesis
Data Setup
Design Planning
Placement
Clock Tree Synthesis
Routing
Chip Finishing
Figure.4.2 Data setup for design flow
[10]
Figure 4.2 shows the data setup for the physical design flow.
The logical and timing libraries are provided by the vendor. The constraints file is
written by the designer. The technology file (.tf) and RC models (.TLU) are provided by
the vendor. The gate-level netlist is obtained from synthesis. These are mandatory
inputs to the tool for obtaining the desired outputs.
4.2 Design planning
Design planning was not a concern when designs were relatively small. The
implementation of those designs relied on a flat design methodology where the
whole design was viewed as one entity. However, as the level of integration
increased and multi-million cell designs started to appear, these designs exceeded
the capacity of a flat design flow. Design planning became a mandatory step for
a successful and efficient implementation of the design.
Design planning here means the necessary design steps that are needed
to manage the implementation and verification of the various components of the
design. To manage complexity, a design planning system is responsible for
partitioning the design into a number of components/blocks that can be designed
and optimized independently.
Design planning (figure 4.3) consists of macro planning, partitioning and
global placement, power planning, top-level routing, constraints management and
top-level clock routing.
Figure 4.3 Design Planning
[10]
4.3 Tasks to be performed during Design Planning
Initializing the Floorplan. Automating Die Size Exploration. Performing an
Initial Virtual Flat Placement. Performing Power Planning. Performing Prototype
Global Routing. Performing Hierarchical Clock Planning. Performing In-Place
Optimization. Performing RC Extraction. Performing Timing Analysis. Performing
Timing Budgeting.
4.4 Macro Planning
Almost every design contains some IP blocks. IP blocks come in
three different forms: soft, firm, and hard. Soft IPs are typically
RTL designs with their verification counterparts. These IPs are usually
synthesized with the rest of the logic in the design and are handled like any other
HDL module. Firm IPs are those that consist of a synthesized netlist (logic
netlist), and are usually placed so that the module can be characterized
with respect to timing and power. Such a module is represented as a soft
macro in the physical design world, and can be placed manually or
automatically in the floorplanning stage. Hard IPs are usually the most common
form of IP. These IPs form the memory blocks, analog, RF, and other custom
circuitry. Most often, the designer is responsible for the placement of these
IPs because of their dependence on their outside connectivity (off-chip
buses), or because of the sensitivity of their circuits, as in the analog/RF case.
Such blocks are usually very sensitive and require special attention when
placing them and when routing over or near them. For the most part, traditional
placement engines do not place the top-level macros automatically. However, better
automated placement can be attained by relying on hints provided by the user that
specify the side of the die where the macros should reside, or on some form of
clustering which serves to simplify the placement job and improve the QoR.
4.5 Partitioning
The number of devices on a single die is increasing rapidly due to
the continued shrinking of the process technology. Today, billion-transistor
systems have become a reality. Such complexity necessitates a divide-and-conquer
approach to manage the design process. Partitioning plays a key role in
attaining the design goals in an acceptable turnaround time. However, due to
the tight design constraints present, partitioning becomes a formidable task.
Partitioning the top-level constraints amongst the different blocks is a
formidable task by itself, since it entails budgeting the top-level constraints
amongst the various blocks. Since the partitions have not been
implemented yet, it is hard to estimate the performance and area of these
partitions. This makes deciding on accurate timing budgets early in the design
planning stage a difficult task. To avoid all the issues mentioned above,
designers tend to opt for a flat implementation of the design whenever possible.
However, due to the large size of ICs today, the implementation algorithms are
not scaling at a rate comparable to the aggressive levels of design integration.
This fact forces designers to engage in design planning and carry out
the partitioning step to be able to manage and design the partitions
concurrently and independently. Typically, partitioning and global placement go
hand-in-hand. To produce good quality partitions, the top-level connectivity
(global routing) of the partitions has to be taken into account. Most
global placement algorithms have some form of partitioning and global routing
embedded in them to accomplish this task.
4.6 Power Planning
Power integrity is an important factor to any successful design. Power
plays a key role in achieving the speed target set for the design. In addition, it
plays a key role in the reliability and proper functionality of the design. In
nanometre technologies, designs switching at high frequencies require a
comprehensive design approach of the power network that takes into account the
chip and package. Noise sources in the package such as inductive noise, signal
reflections due to impedance mismatches, and signal coupling are no longer
negligible; they could travel to the chip core and affect the power levels seen by
the clock buffers (power supply drop and ground bounce). This will limit the
performance of the clock network and may negatively affect its reliability. If the
power network design does not account for these package effects, it cannot be
guaranteed to provide the necessary power levels and could cause the
clock network design to fail.
4.7 Defining Chip Area
Defining the chip area means choosing the size of the design while maintaining
the relative placement of hard macros, I/O cells and a power structure that meets
the voltage drop requirements. The design planning tool estimates the area in the cell
view. The die can be defined in one of three ways: by specifying the exact width and
height of the die; by specifying the aspect ratio and target utilization, so that the
tool estimates the width and height; or by specifying the exact boundary of the die
area when the block is rectilinear. Rectilinear means a die area having
more than four corners.
The core size, chip area, chip utilization and aspect ratio for the design are
given by the formulas below.
$$\text{Core area} = \frac{\text{Standard cell area}}{\text{Standard cell utilization}} + \text{Macro area} \tag{4.1}$$
$$\text{Die size} = \text{Core size} + \text{IO-to-core clearance} + \text{Pad area} \tag{4.2}$$
The IO-to-core clearance is the space from the core boundary to the inner side
of the I/O pads (design boundary). Blockages, macros and pads are combined in the
denominator of the effective utilization. The aspect ratio for the design is
$$\text{Aspect ratio} = \frac{W}{H} = \frac{\text{Horizontal routing resources}}{\text{Vertical routing resources}} \tag{4.3}$$
Chip utilization is defined as the ratio of the area of the standard cells, macros,
and pad cells to the area of the chip.
$$\text{Chip utilization} = \frac{\text{Area(standard cells)} + \text{Area(macros)} + \text{Area(pad cells)}}{\text{Area(chip)}} \tag{4.4}$$
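The area formulas (4.1), (4.2) and (4.4) can be tried out with a short Python sketch. It assumes that equation (4.1) is read as standard cell area divided by standard cell utilization, plus macro area, and that every term is expressed as an area in the same unit. The function names and the numbers are hypothetical and only serve as an illustration.

```python
def core_area(std_cell_area, macro_area, std_cell_utilization):
    """Eq. (4.1), read as std-cell area / utilization + macro area (assumed reading)."""
    if not 0.0 < std_cell_utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return std_cell_area / std_cell_utilization + macro_area


def die_area(core, io_to_core_clearance_area, pad_area):
    """Eq. (4.2), with every term treated as an area (assumption)."""
    return core + io_to_core_clearance_area + pad_area


def chip_utilization(std_cell_area, macro_area, pad_area, chip_area):
    """Eq. (4.4): fraction of the chip covered by cells, macros and pads."""
    return (std_cell_area + macro_area + pad_area) / chip_area


# Hypothetical example values, all in mm^2.
core = core_area(std_cell_area=2.8, macro_area=1.0, std_cell_utilization=0.7)
die = die_area(core, io_to_core_clearance_area=0.5, pad_area=0.5)
util = chip_utilization(2.8, 1.0, 0.5, die)
print(f"core={core:.2f} mm^2, die={die:.2f} mm^2, chip utilization={util:.1%}")
```

With these made-up inputs, a lower standard cell utilization target directly enlarges the core and, through (4.2), the die.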
4.8 Virtual Flat Placement
The initial virtual flat placement is very fast and is optimized for wire length,
congestion and timing. The steps involved in performing the initial virtual flat
placement are described below.
No straightforward criteria exist for evaluating the initial hard macro
placement. Measuring the quality of results
(QoR) of the hard macro placement can be very subjective and often depends
on practical design experience.
Constraints can be specified for hard macro placement. Different
methods can be used to control the preplacement of hard macros and improve the
QoR of the hard macro placement: creating a user-defined array of hard macros,
setting floorplan placement constraints on macro cells, and placing a macro cell
relative to an anchor object. Using a virtual flat placement strategy, one can create
macro blockages for hard macros and pad the macros at their respective positions.
The standard cell placement tile is used during the placement phase. The placement
tile is defined by one vertical routing track and the standard cell height. Placement
and routing blockage layer definitions are internal to physical design tools. To
avoid placing standard cells too close to macros, which can cause congestion or
DRC violations, one can set a user-defined padding distance or keepout
margin around the macros. One can set this padding distance on a selected
macro's cell instance master. During virtual flat placement no other cells will be
placed within the specified distance from the macro's edges.
Chapter 5
Placement and Power Planning
5.1 Introduction
Power integrity is an important factor in any successful design.
Power integrity refers to the notion of providing each circuit in the design the
required supply voltage to enable proper switching. Given the lossy nature of
chips and the various noise-inducing factors that make this task almost
impossible, the goal of power design is to provide reliable power levels within
acceptable design margins to the various switching devices in the chip.
Given that the design of a reliable and robust clock network necessitates
the design of a robust and reliable power network, in this chapter we discuss
the various factors that play a role in the design of the power delivery system.
Reliable power directly affects the performance and reliability of the design.
The delays of the switching devices depend directly on the power levels
they receive. In addition, the design of the power distribution network is a
function of the number of switching devices, their switching speeds, their sizes,
their locations, and their interconnections.
Failure to design a robust power network and provide the required
power levels to the different parts of the chip will cause the design to
violate its performance constraints, and potentially, it might lead to failure in
functionality. With the down-scaling of the process technology, the noise margins
have shrunk to tens of millivolts. Any perturbation in the power delivery network
could cause a design failure.
In nanometre technologies, designs switching at high frequencies require
a comprehensive design approach of the power network that takes into account
the chip and package. Thus, if the design of the power network does not account
for the package's effects, it cannot be guaranteed to provide the necessary
power levels and could cause the clock network design to fail. This is why
we believe an in-depth study of the power design in the package and chip is
needed.
5.2 IO Power
Due to the fast switching speeds of high-end IO circuits, careful design of the
power network for these circuits is required. The IO power network is separated
from the core power network. This is done not only because the IO circuits might
have different supply voltages, but also to protect the IO and the core from
the high-frequency effects caused by the switching of these IO cells. The IO
circuits are typically large buffers that draw large currents when switching on and
off. Due to the highly inductive nature of the package and board traces, the
inductive noise (L di/dt) is typically very large and could cause logic failures in the
core. Due to the high inductance of the package traces, the inductive noise
could wreak havoc in the power supply of the switching IO drivers as well.
5.2.1 Core Power
The core power network is separated from the IO power network as
discussed above. This results in separate power and ground pads on the chip to
supply current to the IOs and core. The number of power pads needed is a function
of the current needs of the logic, the size of the die, and the layout of the chip. If
the pads are confined to the periphery of the die, then the number of pads needed is
a function of the resistivity of the power grid and the estimate of the current needs.
This ensures that the IR Drop constraint is honored and the needed current is applied
to the design.
The inductance of the power grid is gaining importance in nanometre and
high-end designs. Careful design of the power and ground networks is needed to
make sure the current loops are as small as possible to reduce the effective
inductance that is seen by the switching devices.
5.3 Power Network Synthesis
Power network synthesis offers advanced power planning technology and
helps solve signal integrity problems without lengthy and tedious iterations. By
performing power network synthesis, you can preview an early power plan and thereby
reduce the chance of encountering electromigration and voltage drop problems
later in routing.
The prerequisites that must be completed for the design before power planning are
shown in figure 5.1.
Figure.5.1 Power Planning
[10]
5.3.1 Power Pads
Power pads are used to supply power to the core. The number of power pads
on each side of the core is decided by factors such as the total core power, the
number of sides, the core voltage, and the maximum allowable current of each pad,
which is set by the current density of that pad.
$$\text{Number of power pads on each side} = \frac{\text{Total core power}}{\text{Number of sides} \times \text{Core voltage} \times \text{Maximum allowable current on each I/O pad}} \tag{5.1}$$
$$\text{Total core power} = \frac{\text{Total dynamic power of core}}{\text{Core voltage}} \tag{5.2}$$
$$\text{Current drawn by the core} = \frac{\text{Total core power}}{\text{Worst-case voltage}} \tag{5.3}$$
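As a minimal sketch, equations (5.1) and (5.3) can be evaluated in Python as shown below. The total core power is taken as a given input, rounding the pad count up to a whole pad is an added assumption, and all names and values are hypothetical.

```python
import math


def pads_per_side(total_core_power, num_sides, core_voltage, max_pad_current):
    """Eq. (5.1); rounding up to a whole number of pads is an added assumption."""
    raw = total_core_power / (num_sides * core_voltage * max_pad_current)
    return math.ceil(raw)


def core_current(total_core_power, worst_case_voltage):
    """Eq. (5.3): current drawn by the core at the worst-case supply voltage."""
    return total_core_power / worst_case_voltage


# Hypothetical values: 2.0 W core power, 4 sides, 1.0 V core, 40 mA per pad.
n_pads = pads_per_side(2.0, 4, 1.0, 0.04)      # 2.0 / 0.16 = 12.5, rounded up
i_core = core_current(2.0, worst_case_voltage=0.8)
print(n_pads, "pads per side,", round(i_core, 3), "A drawn by the core")
```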
5.3.2 Rectangular Rings
Rectangular rings are a core part of the power network: they supply power to the
core cells and I/O cells of the design. Two rings are used,
one for VDD and one for VSS. The ring width is calculated using the following
formulas.
$$\text{Core ring width for Metal 4} = \frac{\text{Current drawn by core}}{2 \times J_{\text{Metal4}} \times \text{Core power pads on each side of the chip}} \tag{5.4}$$
Note: the current into the core splits in two directions, which is why the
denominator is multiplied by 2.
$$\text{Core ring width for Metal 5} = \frac{\text{Current drawn by core}}{2 \times J_{\text{Metal5}} \times \text{Core power pads on each side of the chip}} \tag{5.5}$$
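The ring width formulas (5.4) and (5.5) differ only in the current density of the metal layer, so a single hypothetical helper can cover both. The sketch below assumes that J is given as a current per unit width (A/um), so that the result is a width in um. The numeric values are illustrative and not taken from any technology file.

```python
def core_ring_width(core_current, j_metal, pads_per_side):
    """Eqs. (5.4)/(5.5): the factor 2 reflects the current splitting two ways."""
    return core_current / (2.0 * j_metal * pads_per_side)


# Hypothetical values: 2.5 A core current, 13 pads per side,
# J = 0.005 A/um for Metal 4 and 0.008 A/um for Metal 5 (assumed units).
w_m4 = core_ring_width(2.5, 0.005, 13)
w_m5 = core_ring_width(2.5, 0.008, 13)
print(f"Metal 4 ring: {w_m4:.2f} um, Metal 5 ring: {w_m5:.2f} um")
```

A layer with a higher allowable current density needs a narrower ring for the same current.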
5.3.3 Power Straps
If the design has a large cell count, power cannot be supplied to every cell in
the core through the rectangular rings alone: more power would be required to
drive the cells, and not every cell in the core would be reached. Power straps
are therefore used to supply power across the core.
Calculating the number of power straps required for the design
requires the core height and width, the power, the voltage and the current.
$$\text{Max vertical strap spacing} = L_{\max} = \frac{V_{\max}}{J \times R_{sh}} \tag{5.6}$$
$$\text{No. of vertical straps} = N_v = \frac{\text{Core width}}{L_{\max}} \tag{5.7}$$
$$\text{Max horizontal strap spacing} = L_h = 2 \times L_{\max} \tag{5.8}$$
$$\text{No. of horizontal straps} = N_h = \frac{\text{Core height}}{L_h} \tag{5.9}$$
$$\text{Strap width} = \frac{W_{\text{ring}}}{N_v \times N_h} \tag{5.10}$$
where $V_{\max}$ is the maximum voltage, $J$ is the current density, and $R_{sh}$ is the
sheet resistance.
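Equations (5.6) to (5.10) chain together naturally, as the Python sketch below illustrates. Rounding the strap counts up to whole straps is an assumption added here, as are the units (um, V, A/um, ohm per square). The function name and example values are hypothetical.

```python
import math


def strap_plan(core_width, core_height, v_max, j, r_sheet, ring_width):
    """Evaluate Eqs. (5.6)-(5.10); strap counts are rounded up (assumption)."""
    l_max = v_max / (j * r_sheet)            # Eq. (5.6): max vertical strap spacing
    n_v = math.ceil(core_width / l_max)      # Eq. (5.7)
    l_h = 2.0 * l_max                        # Eq. (5.8): max horizontal strap spacing
    n_h = math.ceil(core_height / l_h)       # Eq. (5.9)
    strap_width = ring_width / (n_v * n_h)   # Eq. (5.10)
    return {"l_max": l_max, "n_v": n_v, "l_h": l_h, "n_h": n_h,
            "strap_width": strap_width}


# Hypothetical core of 1050 um x 900 um with a 20 um wide ring.
plan = strap_plan(core_width=1050.0, core_height=900.0,
                  v_max=0.05, j=0.001, r_sheet=0.5, ring_width=20.0)
print(plan)
```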
5.4 Placement
This step is tasked with placing the cells (placeable components) legally
such that there is no cell overlap and congestion is minimized. The objective of
placement is to reduce area and wire length and to improve timing. Placement is
divided into two parts: global placement and detailed placement.
5.4.1 Tasks to Be Performed During Placement
In the placement stage, the primary task is to define placement blockages.
Placement blockages are areas that leaf cells must avoid during placement and
legalization, including overlapping any part of the placement blockage. Placement
blockages can be hard or soft. A hard blockage prevents cells from being put in the
blockage area. A soft blockage restricts the coarse placer from putting cells in the
blockage area, but optimization and legalization can place cells in a soft blockage area.
The second task is to set placement options to minimize congestion during
placement and optimization. Congestion occurs when the number of wires
going through a region exceeds the capacity of that region. The third task is to
automatically insert protection diodes on subdesign ports to prevent antenna
violations at the top level.
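The definition of congestion given above (wire demand exceeding the capacity of a region) can be made concrete with a small Python sketch. The grid of regions, the demand figures and the function name are all hypothetical. A real tool derives these numbers from global routing.

```python
def congested_regions(demand, capacity):
    """Return {region: overflow} for every region whose wire demand exceeds capacity."""
    return {region: demand[region] - capacity[region]
            for region in demand
            if demand[region] > capacity[region]}


# Hypothetical 2x2 grid of routing regions, keyed by (row, column).
demand = {(0, 0): 12, (0, 1): 7, (1, 0): 10, (1, 1): 15}
capacity = {(0, 0): 10, (0, 1): 10, (1, 0): 10, (1, 1): 10}

overflow = congested_regions(demand, capacity)
print(overflow)
```

A region whose demand merely equals its capacity, such as (1, 0) here, is full but not counted as congested under this definition.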
The next task is to perform placement and optimization using the commands
available in the tool. In physical optimization, one can run
incremental placement-based optimization that supports area recovery, design rule
fixing, sizing and route-based optimization.
5.4.2 Global Placement
In global placement, the objective is to distribute the cells over the die area
in such a fashion that global objectives (timing, wire length) are attained.
It is permissible to have overlaps amongst the cells. At this stage, the
objective is to be able to compile some estimates of the die area, wire length,
and timing violations. If the produced estimates are not satisfactory, better
partitioning, design guides, and re-planning of the IO signals or the hard macros
are carried out to improve the results of the global placement.
5.4.3 Detailed Placement
In detailed placement, the cells which are clustered together or are on
top of each other as a result of the previous step are spread and reordered.
The objective is to produce a legal placement (no overlaps),
minimize congestion, and improve wire length. Again, different design constraints
can be imposed on the detailed placer so that the legalization step does not
wreak havoc in the timing of the design.
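To illustrate what "spreading" overlapping cells into a legal placement means, the following Python sketch legalizes a single row with a greedy left-to-right pass. This is a deliberately simplified stand-in, not the algorithm of any particular placer: it ignores the row end, the site grid, congestion and timing. The cell names and coordinates are hypothetical.

```python
def legalize_row(cells, row_start=0.0):
    """Greedy single-row legalization.

    cells: list of (name, x, width) from global placement, possibly overlapping.
    Returns {name: legal_x} with no overlaps, keeping the left-to-right order.
    """
    placed = {}
    cursor = row_start                      # first free position in the row
    for name, x, width in sorted(cells, key=lambda c: c[1]):
        legal_x = max(x, cursor)            # push right only if it would overlap
        placed[name] = legal_x
        cursor = legal_x + width
    return placed


# Hypothetical cells: u1, u2 and u3 overlap after global placement.
cells = [("u1", 0.0, 2.0), ("u2", 1.0, 2.0), ("u3", 1.5, 1.0), ("u4", 8.0, 2.0)]
legal = legalize_row(cells)
print(legal)
```

Cells that were already legal, such as u4, stay where global placement put them, which limits the disturbance to timing.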
Chapter 6
Clock Tree Synthesis
6.1 Introduction
The design of the clock network has become a challenging task due to the
growing complexity of the design, the down-scaling of the process technology,
and the increasing frequency of the devices. In nanometre designs, tight design
constraints related to skew, power and latency are imposed on the clock network.
The task of designing the clock network is further complicated by the fact that
reasonably accurate cell and interconnect delay estimates are needed for the
design of the clock network. The clock network consumes 40% of the total power
in the design. Accurate power planning depends on knowing the placement and
sizes of the cells in the design, especially the clock buffers and sequential elements.
In a synchronous digital system, the clock signal is used to synchronize the
movement of data within the system. Clock signals must be distributed to
physically remote locations of an integrated circuit. Clock signal transitions drive
all the synchronous elements of a digital circuit, such as flip-flops and memories.
These elements are referred to as sinks. Clock Distribution Networks (CDNs) are
circuits that distribute a clock signal from a central global clock source at the
centre of the integrated circuit to all the sinks which use it.
In the process of clock distribution, the clock signal traverses many
interconnect networks and buffers which are part of the clock distribution
network. These elements introduce delay in the clock signal path. Ideally, a clock
signal should arrive at all the sinks at the same time. But due to variations in
parameters like wire interconnect length, temperature, capacitive
coupling and process, the arrival time of the clock transition at
different sink locations varies.
This spatial variation in the arrival time of the clock transition on an
integrated circuit is commonly referred to as clock skew. Clock skew, as a fraction
of the cycle time, is a growing problem for faster chips, which have fewer gate
delays per cycle and large clock loads. For two points i and j,
if the arrival times of the clock signals are $a_i$ and $a_j$ respectively, then the clock
skew between the two points is given by
$$d(i,j) = a_i - a_j$$
Figure 6.1 Clock Skew Illustrations
[18]
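The skew definition d(i,j) = a_i - a_j can be applied to a set of sinks as in the following Python sketch. The sink names and arrival times are hypothetical, and reporting the largest absolute pairwise skew as a single figure for the whole tree is an assumption made for illustration.

```python
from itertools import combinations


def skew(arrival, i, j):
    """Clock skew between sinks i and j: d(i, j) = a_i - a_j."""
    return arrival[i] - arrival[j]


def max_skew(arrival):
    """Largest absolute skew over all pairs of sinks."""
    return max(abs(skew(arrival, i, j)) for i, j in combinations(arrival, 2))


# Hypothetical clock arrival times at three flip-flops, in nanoseconds.
arrival_ns = {"ff1": 1.20, "ff2": 1.35, "ff3": 1.10}

d_12 = skew(arrival_ns, "ff1", "ff2")   # negative: the clock reaches ff1 first
worst = max_skew(arrival_ns)
print(f"d(ff1, ff2) = {d_12:.2f} ns, worst skew = {worst:.2f} ns")
```

The sign of d(i,j) indicates which sink sees the clock edge first, while the worst pairwise magnitude equals the difference between the latest and earliest arrival.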
Clock signals typically have the highest fanout and operate at the
highest speed in a synchronous digital system. Since the clock signals are
used to synchronize the operations of the entire digital circuit, the clock transitions
should be sharp and should have the minimum possible skew to avoid any data
integrity errors or race conditions. As the frequency of operation of the synchronous
circuit increases, the circuit becomes more and more susceptible to clock skew, i.e.
the timing becomes more and more critical.
At this point we introduce another term, slew rate. Slew rate is the
maximum rate of change of a signal in a circuit. Slew depends upon the time it takes
for a signal to rise (fall) from logic low (logic high) to logic high (logic low). More
commonly, it is measured by the time it takes for a signal to rise from 10% to 90% or
fall from 90% to 10% of the supply voltage. In addition to achieving minimum skew,
a clock distribution network should also try to obtain as high a slew
rate as possible (i.e. minimum time to change from one logic level to another).
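The 10% to 90% measurement described above can be sketched in Python for a sampled rising edge. The waveform samples are hypothetical, and linear interpolation between samples is an assumption of this sketch. Actual timing tools take transition times from library characterization rather than from a routine like this.

```python
def crossing_time(times, volts, level):
    """First time a rising waveform crosses `level`, by linear interpolation."""
    samples = list(zip(times, volts))
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v1 > v0 and v0 <= level <= v1:
            return t0 + (level - v0) * (t1 - t0) / (v1 - v0)
    raise ValueError("waveform never crosses the requested level")


def rise_time_10_90(times, volts, vdd):
    """Time taken to rise from 10% to 90% of the supply voltage."""
    return crossing_time(times, volts, 0.9 * vdd) - crossing_time(times, volts, 0.1 * vdd)


# Hypothetical rising edge: flat at 0 V, linear ramp from 10 ps to 30 ps, flat at 1 V.
times_ps = [0.0, 10.0, 20.0, 30.0, 40.0]
volts = [0.0, 0.0, 0.5, 1.0, 1.0]

rise = rise_time_10_90(times_ps, volts, vdd=1.0)
avg_slew_rate = 0.8 * 1.0 / rise        # volts per picosecond over the 10%-90% span
print(f"rise time = {rise:.1f} ps, average slew rate = {avg_slew_rate:.3f} V/ps")
```

A shorter rise time gives a higher slew rate, which is the direction the text asks the clock network to move in.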
Another factor of prime importance in clock distribution network design is
the power consumption. Clock distribution networks account for a significant
component of power consumption on an integrated circuit. It is therefore
absolutely essential to build clock distribution networks with minimum possible
power consumption.
6.2 Clock Network Modelling For Logic Synthesis
The clock tree should be modelled and analyzed at different stages of
the design. Modelling of clock tree is done differently in synthesis and backend
stage. The modelling of clock at these stages is discussed below.
6.2.1 Virtual Clocks
Prior to the interconnect-dominated era, a clock tree was synthesized and
the buffers were inserted based on some load-driven delay estimates.
Since the interconnect resistance was so low, those estimates did not differ
much from the actual delay values after implementation. In the
interconnect-dominated era, such an approach is no longer viable. To overcome this
problem, most designers estimate the clock timing annotations (latency, uncertainty,
skew) and annotate them on an ideal clock network. The hope is that these estimates
will be more conservative than the implementation results, so that the design converges.
6.2.2 Trial Clock Network Synthesis
A second approach is to do placement and clock tree synthesis under the
hood while doing logic synthesis in order to get reasonable clock annotations. Once
logic synthesis is done, the clock network is removed before handing the netlist off
to the physical synthesis stage. Although there is the issue of correlation between
the final clock network that is synthesized after the P&R stage and that built
during logic synthesis based on global placement information, this approach is
an improvement over the ideal clock assumption since it captures the global
placement as well as the global congestion in the design when synthesizing the
clock network. Since physical synthesis could make big changes to the design in
order to close timing or improve some metric, be it power, routability, or a
noise-related issue, the clock estimates generated during logic synthesis are likely
to be off compared to the final numbers.
6.3 Clock Design at Implementation Stage
In the implementation stage, the designer faces the same issue as in
the logic synthesis stage. As we mentioned earlier, clock network synthesis requires
cell placement, in particular latch placement, to be able to extract realistic parasitics
and do delay calculation. However, early in the design planning stage cell
placement is not done, and the need to make some assumptions about the clock
network is still present. In a similar fashion to logic synthesis, physical synthesis
has to make some assumptions about the clock annotations or synthesize a clock
network under the hood as it tries to optimize the physical netlist to meet the design
constraints.
It is worth mentioning that in both logic synthesis and
physical synthesis, the algorithms are iterative in nature. This leads to multiple clock
network synthesis runs if the chosen route is to synthesize the clock
network under the hood. Since the clock network is discarded at the end of both
stages, prior to the final clock network synthesis, a lot of time and resources
are wasted. In some cases, this iteration is due to the lack of a proper automation
algorithm, owing to the complexity of the problem at hand (NP-complete problems), and
in other cases, the iterative and incremental nature of the flow is due to legacy reasons.
6.3.1 Clock Network Synthesis in a Flat Design Flow
The clock network synthesis approach is directly affected by the RTL-to-GDSII
design flow. In a flat design flow, the clock network is synthesized in a flat fashion.
However, if the design flow is hierarchical, the clock network synthesis can be
done either hierarchically or in a similar fashion to the flat approach.
6.3.2 Clock Network Synthesis in a Hierarchical Design Flow
In a hierarchical implementation of the design, each partition will have its
own clock driver and its own clock tree or network. One way of creating the clock port
for each partition is to synthesize the clock network flat, and then use the
information from the produced network to add clock ports to the partitions. In addition,
the produced clock network will provide information on the latency and skew in
each partition.
The timing analysis in each partition is done with ideal clocks whose
latency, skew and jitter/uncertainty values are estimated based on the global clock
planning. Since neither the placement nor the routing on which the clock plan
relied is final, the estimates could be off compared
with the final placed, optimized and routed design.
6.4 Prerequisites for Clock Tree Synthesis
Before doing clock tree synthesis, some factors have to be considered
to check whether the design is ready for CTS.
6.4.1 Design Prerequisites
Before running clock tree synthesis, the design should meet the following
requirements. If issues are raised, the designer has to repeat the previous steps.
The design is placed and optimized. Check whether the placement is
legalized. The estimated QoR for the design should meet your requirements
before you start clock tree synthesis.
If congestion issues are not solved before clock tree synthesis, the addition
of clock trees and the placement of buffers can increase the congestion. If the design
is congested, you can rerun the placement step, or identify the congestion spots by
reloading the design and remove them by finding the coordinates of the particular tracks.
To ensure that the clock tree can be routed, verify that the placement is
such that the clock sinks are not in narrow channels and that there are no
blockages between the clock root and its sinks. If these conditions occur, fix the
placement before running clock tree synthesis. The power and ground nets
are prerouted. High-fanout nets, such as scan enables, are synthesized with buffers.
6.4.2 Library Prerequisites
Any cell in the logic library that you want to use as a clock tree reference (a
buffer or inverter cell that can be used to build a clock tree) or for sizing of gates
on the clock network must be usable by clock tree synthesis and optimization.
By default, clock tree synthesis and optimization cannot use buffers and
inverters that have the dont_use attribute to build the clock tree. If a particular
gate in the library has no reference pin, set dont_use on that gate,
link the library, and then perform clock tree synthesis.
The physical library should include all clock tree references (the buffer and
inverter cells that can be used to build the clock trees), routing information,
which includes layer information and non-default routing rules, and the resistance
and capacitance models used to estimate the resistance and capacitance.
6.5 Clock Distribution Architectures
The clock distribution network is responsible for providing a reliable and
stable environment for the clock signal to reach the clock cells. To do so, the
distribution network should provide immunity from systematic and random
variations which could distort the clock signal as it travels to the destinations.
The clock distribution network is typically composed of two parts: global
and local. The global clock network delivers a reliable and low-skew clock to
different parts (sections or blocks) of the chip. The global clock network could be a
grid, a synthesized tree, an H-tree or a hybrid network which uses a combination of
these topologies. An H-tree driving a mesh or a set of spines is a favourite top-level
clock network topology due to the simplicity of the H-tree, although its
implementation in nanometre technologies has become very challenging. The integrity
of both the global and the local parts of the distribution network is needed to
provide a reliable and robust clock network.
6.5.1 Tree
The tree is the most common topology choice for ASIC designs. Although trees
provide the least control over skew, they exhibit low power consumption and
low area overhead. Low to middle frequency designs employ trees, while high-end
designs employ custom-made topologies.
A tree generated by a clock tree synthesis engine is shown in figure 6.2.
Figure.6.2 A tree generated by clock tree synthesis engine
[19]
6.5.2 H-Tree
An H-tree is a tree topology which relies on matching delays to all clock sinks
in the network (figure 6.3). This is accomplished by placing nodes at equidistant
positions from their roots and by matching delays to all nodes at the same level. In
real designs, hard macros and congestion may make such an ideal network
unrealizable, so the best an H-tree can do is to minimize whatever skew is present at
the leaves of the H-tree. An H-tree in the clock distribution network is shown in
figure 6.3.
Figure.6.3 H-tree balances skew by equidistant paths from root to sinks
[19]
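The equidistant-path property of an ideal H-tree can be checked with a short recursive Python sketch. It assumes an unobstructed die, which the text notes is rarely the case in real designs, and it measures path length only, not RC delay. The function name and dimensions are hypothetical.

```python
def h_tree(x, y, half_w, half_h, levels, length=0.0):
    """Return (sink_x, sink_y, wire_length_from_root) for an ideal H-tree.

    Each level draws one 'H': a horizontal arm of half_w to either side,
    then a vertical arm of half_h up and down, and halves both for the next level.
    """
    if levels == 0:
        return [(x, y, length)]
    sinks = []
    for dx in (-half_w, half_w):
        for dy in (-half_h, half_h):
            sinks += h_tree(x + dx, y + dy, half_w / 2, half_h / 2,
                            levels - 1, length + half_w + half_h)
    return sinks


# Two-level H-tree rooted at the die centre (hypothetical dimensions in um).
sinks = h_tree(0.0, 0.0, half_w=4.0, half_h=4.0, levels=2)
lengths = {length for _, _, length in sinks}
print(len(sinks), "sinks, root-to-sink wire lengths:", lengths)
```

Every sink ends up at the same wire length from the root, which is the property that gives the topology its nominally zero skew.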
A clock buffer in the clock tree is used to balance the output loads
and minimize the clock skew. A delay line can be added to the network to
meet the minimum insertion delay (for clock balancing), and buffers are used to speed
up the clock signals. CTS affects the design in several ways: several hundred clock
buffers are added to the design, placement and routing congestion may increase, and
timing violations can be introduced.
Since clock planning with a flat implementation of the clock relies on a
preliminary placement of the logic cells in the blocks, it is obvious that such a
design flow does not provide guarantees of convergence. Nonetheless, in
hierarchical design implementation this is typically the adopted flow. It is this
lack of convergence guarantees and the complexity of the flow that make
designers lean toward a flat implementation whenever such an approach is
feasible.
6.6 Algorithms for clock tree construction
The first geometric algorithms for clock routing evaluated skew in terms
of wire length from the source to the sinks and produced minimum wire length trees
for a given sink clustering using the deferred-merge embedding (DME)
principle. The DME algorithm defers the choice of
merging (tapping) points for subtrees of the clock tree. The algorithm
works in Manhattan geometry.
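The balancing idea behind a merging (tapping) point can be sketched in Python under the simplest delay model, where delay equals wire length, as in the early geometric algorithms mentioned above. This is only the length-balancing step for two subtrees. It does not construct merging segments or perform the deferred embedding of the full DME algorithm, and the function names are hypothetical.

```python
def manhattan(p, q):
    """Manhattan distance between two points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def balanced_merge(p1, t1, p2, t2):
    """Wire lengths (e1, e2) from a tapping point to subtree roots p1 and p2.

    t1 and t2 are the root-to-sink path lengths already inside each subtree.
    The result satisfies t1 + e1 == t2 + e2 (zero skew under the path-length model).
    """
    d = manhattan(p1, p2)
    e1 = (d + t2 - t1) / 2.0
    if e1 < 0:                 # subtree 1 is much longer: tap at p1, detour towards p2
        return 0.0, t1 - t2
    if e1 > d:                 # subtree 2 is much longer: tap at p2, detour towards p1
        return t2 - t1, 0.0
    return e1, d - e1


# Subtree 1 already has 2 units of internal path length; subtree 2 is a bare sink.
e1, e2 = balanced_merge((0, 0), 2.0, (6, 4), 0.0)
# Strongly unbalanced case: the wire to the second root must be lengthened (snaked).
f1, f2 = balanced_merge((0, 0), 20.0, (3, 0), 0.0)
print((e1, e2), (f1, f2))
```

In the first case the tapping point sits closer to the subtree that already carries internal wire length. In the second, no point on the direct connection can balance the two sides, so extra wire is added instead.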
6.7 Analyzing the Clock Trees
Before running clock tree synthesis, analyze each clock tree to
determine its characteristics and its relationship to other clock trees in the design.
For each clock tree, determine the clock root pin or position of the clock tree, the
number of clock tree sinks and clock tree exceptions, the number of clock tree
levels, and whether any pre-existing cells such as clock-gating cells are present in
the design. Also determine whether any logical design rule constraints, such as
maximum fanout, maximum transition time and maximum capacitance, are applied to the
design, whether any routing constraints, such as routing rules and metal layers, are
applied to the design, and whether
the clock tree has timing relationships with other clock trees in the design, such as
inter-clock skew requirements.
6.7.1 Identify the Clock Tree End Points
Clock paths have two types of endpoints. Stop pins are the endpoints of
the clock tree that are used for delay balancing. During clock tree synthesis, IC
Compiler uses stop pins in calculations and optimizations for both design rule
constraints and clock tree timing.
Exclude pins are clock tree endpoints that are excluded from clock tree
timing calculations and optimizations. Verify that the default sink pins (implicit stop
pins), implicit nonstop pins, and implicit exclude pins are accurate by generating
a clock tree exceptions report. If the default sink pins, implicit nonstop pins, and
implicit exclude pins are correct, you are done with the clock tree exception
definition.
6.7.2 Analyzing Clock Sink Groups
A clock sink group is a group of clock sinks driven directly by
a single net. The sink group assumes the net name. Sink groups can have timing
relationships when an endpoint in a sink group has one or more start points or
endpoints in another sink group. Each start point-and-endpoint pair forms one
timing relationship path.
6.7.3 Defining Clock Root Attributes
If the clock root is an input port (without an I/O pad cell), you must
accurately specify the driving cell of the input port. A weak driving cell does not
affect logic synthesis, because logic synthesis uses ideal clocks. However, during
clock tree synthesis, a weak driving cell can cause IC Compiler to insert extra
buffers as the tool tries to meet the clock tree design rule constraints, such as
maximum transition time and maximum capacitance. If you do not specify a driving
cell (or drive strength), IC Compiler assumes that the port has infinite drive
strength. If the clock root is an input port with an I/O pad cell, you must
accurately specify the input transition time of the input port.
6.8 Clock Skew Scheduling
As clock network design became more complex, and design
convergence became harder, design emphasis shifted from designing minimum-skew
networks to designing low-skew clock networks while reducing power and
improving robustness. By utilizing the available skew, timing convergence of the
design can be enhanced. There are two approaches to skew scheduling:
useful skew and intentional skew.
Useful skew converts the present skew into a useful timing budget that can
be allocated to the critical or near-critical paths in the design. This is done by
shifting the clock assertions such that STA will use the adjusted clock
assertions when checking for timing violations and reporting critical paths.
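A small numeric sketch may help show how shifting a clock assertion moves timing budget from one path to another. It assumes the standard setup relation, slack = period + (capture latency - launch latency) - path delay - setup time. All numbers and names below are hypothetical; hold checks, which useful skew can make worse, are deliberately left out of this sketch.

```python
def setup_slack(period, launch_latency, capture_latency, path_delay, setup):
    """Setup slack of one register-to-register path (all values in ns)."""
    return period + (capture_latency - launch_latency) - path_delay - setup


PERIOD, SETUP = 6.0, 0.1          # hypothetical clock period and setup time
DELAY_AB, DELAY_BC = 6.5, 4.0     # hypothetical path delays: FF_A -> FF_B -> FF_C


def slacks(latency_b):
    """Slacks of both paths when only FF_B's clock latency is shifted."""
    into_b = setup_slack(PERIOD, 0.0, latency_b, DELAY_AB, SETUP)
    out_of_b = setup_slack(PERIOD, latency_b, 0.0, DELAY_BC, SETUP)
    return round(into_b, 3), round(out_of_b, 3)


print(slacks(0.0))   # zero skew: the path into FF_B violates setup
print(slacks(0.8))   # FF_B clocked 0.8 ns later: both paths have positive slack
```

Delaying the clock of the middle register lends 0.8 ns to the slow path and takes the same 0.8 ns from the fast path that follows it, so the total budget is unchanged and only its allocation moves.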
Intentional Skew is a sequential optimization technique to design skew in
as part of the logic/physical synthesis stage. The task becomes one of designing a
clock network satisfying the skew constraints generated by the synthesis engine to
optimize performance.
This can be accomplished either by inserting intentional delay
buffers/inverters on the clock paths that need to be delayed, or by
sizing the buffers/inverters to decrease or increase the delay
along some paths. Although skew scheduling reduces the number of close-to-zero
skew clock nodes, the actual physical implementation of the clock network
becomes harder. Another advantage of this approach is that it reduces the number of
simultaneously switching clock cells by delaying the toggling of some of the
registers. This is desirable in order to reduce the peak current consumption of
the clock network and reduce the noise injected into the power grid.
6.9 Optimization of Registers
Circuit optimization plays a key role in the performance and cost of the
clock network. Selection and optimization of the type of latch or register to be
used in the clock network has a great impact on the achieved skew and power. To
properly assign the type of latch needed, timing analysis is performed to annotate
the netlist with the correct path constraints and path slacks. Since different logic
paths have different slacks, the fastest latch is not warranted on every path. A
trade-off can be made between power and performance on the less critical paths.
Typically, such an optimization is not carried out as part of the back-end flow
since changing the registers used in the RTL netlist is not encouraged. However,
given that most of the power is in the last stage of the clock network, this
optimization can be worthwhile in order to converge on design timing.
6.10 Clock Analysis and On-Chip Variation
If statistical analysis algorithms are not employed in the verification of
the clock network, designers have to rely on worst-case decisions to study the
timing of the clock in the presence of process variations. This conservative
approach makes converging on the timing of a design with very tight timing
constraints difficult. When a cell is common between a clock path and a data path,
worst-case analysis can assign that one cell two different delays, which is not
realistic and causes pessimism in the analysis.
Chapter 7
Routing
7.1 Introduction
After CTS, the routing process determines the precise paths for interconnections.
This includes the standard cell and macro pins, the pins on the block boundary or pads
at the chip boundary. After placement and CTS, the tool has information about the
exact locations of blocks, pins of blocks, and I/O pads at chip boundaries. The logical
connectivity as defined by the netlist is also available to the tool. In routing stage,
metal and vias are used to create the electrical connection in layout so as to complete
all connections defined by the netlist. Now, to do the actual interconnections, the tool
relies on some Design Rules.
It is essential that the tool completes all connections defined by the netlist
(100% routability, i.e. no LVS errors), that no design rules are violated in completing the
routes (no DRC errors), and that all timing constraints are met.
7.2 Process Design Rules
In the physical design flow, one input to the PnR tool is a technology file
(or technology LEF for Cadence). It holds the constraints that the router should
honour. A designer's techfile will have many more parameters for each layer. As in the
layer M1 above, minimum spacing, minimum width, minimum area etc. are defined. It
also specifies which via connects the two metal layers M1 and M2. If any of these
parameters (spacing, width, via size etc.) is violated by any route the tool creates,
a DRC error is reported.
7.3 Routing Grid
Most of the routers available are grid-based routers. Routing grids are
defined for the entire layout. Consider it like a graph, as below. For grid-based routers,
there is also a preferred routing direction defined for each metal layer, e.g. metal1 has a
preferred direction of horizontal, metal2 has a preferred routing direction of vertical,
and so on. So, in the whole layout, metal1 routing grids are drawn (superimposed)
horizontally with the metal1 wire pitch between them, and metal2 grids are drawn
vertically with the metal2 wire pitch between them. The technology section above has
a pitch defined for metal1.
Figure 7.1 shows how routing grids are drawn. Only two metals are
considered here, but in a process with more metals, similar grids will
be superimposed on the layout for all available metals. Pitch is calculated by
determining the minimum spacing required between grid lines of the same metal. This
can be the minimum spacing of the metal itself, but is usually a value greater than the
minimum spacing. It is calculated by taking into account the via dimension as well,
so that no two adjacent wires on the grid create any DRC violation even when there
are vias present.
Figure 7.1 Routing grids
[20]
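The pitch calculation described above can be illustrated with the three commonly quoted centre-to-centre pitch definitions. This is a minimal sketch: the function names and the numeric rule values are hypothetical and are not taken from the technology file used in this work, and `via_width` is assumed to mean the width of the via landing pad (the via plus its metal enclosure).

```python
def line_to_line_pitch(width, spacing):
    """Centre-to-centre distance of two plain wires at minimum spacing."""
    return width + spacing


def line_to_via_pitch(width, spacing, via_width):
    """A plain wire next to a wire that carries a via landing pad."""
    return width / 2 + spacing + via_width / 2


def via_to_via_pitch(spacing, via_width):
    """Two neighbouring wires that both carry a via landing pad."""
    return via_width + spacing


# Hypothetical metal1 rules in microns.
WIDTH, SPACING, VIA_WIDTH = 0.14, 0.14, 0.20

for name, pitch in [
    ("line-to-line", line_to_line_pitch(WIDTH, SPACING)),
    ("line-to-via", line_to_via_pitch(WIDTH, SPACING, VIA_WIDTH)),
    ("via-to-via", via_to_via_pitch(SPACING, VIA_WIDTH)),
]:
    print(f"{name}: {pitch:.3f} um")
```

Whenever the via pad is wider than the wire, the via-aware pitches come out larger than the plain line-to-line pitch, which is why the grid pitch is usually greater than the bare minimum spacing would suggest.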
In a grid-based routing algorithm, the router switches the metal as per the
preferred direction to interconnect the nodes. In figure 7.2, metal1 and
metal2 wires are drawn along the metal1 and metal2 grids respectively. They are
interconnected by via1 to complete the routing path.
Figure 7.2 Routing with two different metals
[20]
7.4 Global & Detailed routing
The PnR tools do routing in various stages, such as global routing, track
assignment and detailed routing. These algorithmic stages may also be
hidden from the user behind a couple of commands. Most PnR
tools deal with the routing problem in a two-stage approach. In global routing, the tool
partitions the design into routing regions. A rough route is determined taking into
account the number of tracks available in each region. Routing congestion is also
determined at this stage by calculating 1) how many nets should pass through the
region and 2) how many routing tracks are available in the region. In detailed routing,
global routing results are used to lay the actual wires interconnecting the nodes.
Consult the man page of the routing options command to see how much controllability
is available in each of these stages for the tool of your choice.
7.5 Routing Congestion
It is difficult to route a highly congested design. Some not-so-congested
designs may have pockets of high congestion, which will again create routing issues. It
is important that the congestion is analysed and fixed before detailed routing. After
CTS, the tool can give you a congestion map based on trial route or global route values.
[20]
There are commands to check routability, such as check_routability, which report
congestion numbers, blocked pins etc.
7.6 Routing Order
It is recommended to route sensitive nets like clocks before the rest of the signal
routes. Power routing is completed after the floorplan stage. The order of
routing is:
7.6.1 Power Routing
Connect the macro and standard cell power pins to the power rings and straps
which are created for the design. IR drop
7.6.2 Clock Routing
The skew and delay values of the clock nets should be disturbed as little as possible,
so the clocks are given higher priority in using routing resources and are routed prior to
any other net routing. Clock routing can be limited to higher metal layers for reduced
RC values.
7.6.3 Signal Routing
The rest of the nets are routed. We can also route groups of nets, and non-
default routing rules can also be applied to select nets.
Chapter 8
Signal Integrity
8.0 Signal Integrity
Signal integrity is the ability of an electrical signal to carry information
reliably and to resist the effects of high-frequency electromagnetic interference from
nearby signals. The following conditions can impact signal integrity:
8.1 Introduction
For nanometre designs it is no longer sufficient to just achieve timing
closure; a design must also reach signal integrity (SI) closure. SI closure implies that
the design is free from SI-related functional problems and meets its timing goals
while accounting for the impact of SI (see Figure 8.1).
Figure 8.1 SI closure criteria
[21]
8.2 Crosstalk
Crosstalk is the undesirable electrical interaction between two or more
physically adjacent nets due to capacitive coupling. Crosstalk can lead to crosstalk-
induced delay changes or static noise.
8.3 SI Closure Methodologies
In order to efficiently achieve SI closure certain design methodology decisions
should be made up front. They should be based on product schedule and market
requirements. SI avoidance is the most efficient way to achieve SI closure, but it
needs to be balanced against trade-offs of other design metrics such as area,
performance, and power. For example, most SI problems can be avoided by spreading
wires farther apart and reducing the ratio of coupling capacitance to grounded
capacitance.
However, if this approach is applied everywhere in the design the result is a
much larger die and increased cost. For certain critical nets, such as clocks or chip-
level buses, a practical solution could involve using wider wires, shielding with power
and ground lines, using repeaters to break up wire lengths, using different routing
layers for adjacent wires, or using 2-3X minimum spacing.
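To see why the ratio of coupling capacitance to grounded capacitance matters, a first-order charge-sharing estimate can be used. This is a minimal sketch under a strong assumption: the victim net is treated as floating (its driver is ignored), so the result is a pessimistic upper bound rather than what a sign-off tool would report. The function name and the capacitance values are hypothetical.

```python
def charge_sharing_glitch(vdd, c_couple, c_ground):
    """Upper-bound glitch height (V) on a floating victim net.

    A full-swing aggressor transition is divided between the coupling
    capacitance and the victim's capacitance to ground.
    """
    return vdd * c_couple / (c_couple + c_ground)


VDD = 1.2                                              # volts
tight = charge_sharing_glitch(VDD, 20e-15, 30e-15)     # wires at minimum spacing
spread = charge_sharing_glitch(VDD, 10e-15, 30e-15)    # coupling halved by extra spacing

print(f"minimum spacing: {tight:.2f} V, extra spacing: {spread:.2f} V")
```

In this model, halving the coupling capacitance lowers the bound from 0.48 V to 0.30 V, and shielding has a similar effect because it converts coupling capacitance into capacitance to a quiet net.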
Other up-front decisions can be based around the selection of intellectual
property (IP) blocks. Ideally IP blocks should neither be noise-sensitive nor noise
sources. This applies to all forms of IP from standard cells, memories, I/Os, and
custom digital or analog cores. If an IP block is noise-sensitive or a noise source, then
early decisions can be made to protect this block, such as using guard-rings,
applying blockages to prevent over-the-block or near-block routing, spacing, or
shielding, or even selecting an alternative implementation of the same function.
All of the SI methodology choices mentioned above can be made early in the
design process. They all involve trade-offs in terms of area, performance, and
engineering schedule. They can be implemented as design methodology restrictions
and the implementation tools can be used to enforce the decisions in a
correct-by-construction fashion. Creating design restrictions that minimize or eliminate certain
noise sources or noise-sensitive blocks or nets prior to implementation will greatly
enhance SI closure productivity.
8.4 SI Prevention
A number of techniques can be used to prevent SI problems during design
creation. During placement for example, the placement can be optimized to avoid
over-congested areas. Congested areas increase the likelihood of congested wires
leading to an increase in crosstalk. Other techniques during placement include
balancing slews within the design so that there are no very fast or very slow signal
transitions. Very fast transitions when present on aggressors will lead to an increase in
crosstalk. Weakly driven nets with slow transitions are potential crosstalk victims if
there is significant coupling on these nets. Typical examples of weakly driven nets are
non-timing critical signals such as resets or scan lines. These nets tend to be long and,
consequently, subject to many potential aggressors. A noise glitch on a reset line can
cause intermittent resetting of a chip while a noise failure on the scan line will make
testing a design very problematic. Using these heuristics during placement greatly
decreases the occurrence of these types of SI failures as shown in the figure 8.2.
Figure 8.2 buffer insertion to victim & aggressor
[21]
While SI prevention during placement will help reduce certain SI problems,
the main prevention effort should come during routing. As SI is inherently a wiring
problem, it has become necessary to address SI prevention as the design is being
routed. Crosstalk effects, such as glitch and delay, can only effectively be measured
when physical wires are available and final wire topology, layer selection, and track
assignments are concrete. Since placement-based and global route-based SI
prevention solutions do not have this detailed information with which to make trade-
offs, they are only a partial solution. In the nanometre era, physical wire effects need
to be taken into account to achieve reliable timing and SI closure during the final
routing stage of the design.
The key to successful SI closure during routing comes from having native on-
the-fly incremental extraction, timing analysis, SI analysis, and optimization. This
means that potential SI problems can be addressed as they occur. A number of
prevention techniques can be employed during routing to correct SI issues:
Wire spacing, Net ordering, Layer selection to reduce coupling and resistance,
Minimizing parallel wire lengths, Shielding, Buffer insertion & Gate resizing.
8.5 SI Analysis and Repair
After SI-aware routing is complete, a full detailed extraction and analysis
should be performed to determine if there are any remaining SI problems. This
analysis should include identifying potential functional and timing problems
introduced by SI. Functionality checking should involve calculating the worst-case
potential crosstalk glitch that can occur on every wire and propagating that glitch to a
storage element such as a latch or flip-flop to determine if it will cause a stored logic
state to change. A noise failure criterion based on latching glitches, rather than noise
peak on each victim or noise rejection curves on each receiving cell, will reduce the
number of potential repairs by several orders of magnitude.
In an SI prevention-based flow that uses noise propagation as the failure
criterion, the number of potential violations found post-route should be relatively small,
typically fewer than 50 for a design with 500K instances (~2M gates) at 250 MHz
using a 130 nm process. Consequently, repairing the remaining functional noise
problems (if any) is easily achieved through automatic or even manual repair. In
contrast, if the noise failure criterion is such that thousands of functional noise
problems are reported, then the repair effort can be significant and may not converge.
To repair glitch problems, a number of techniques can be used, such as
upsizing the victim's driver, downsizing the aggressors' drivers, buffer or repeater
insertion to break down crosstalk effects into smaller constituents, and spacing,
shielding or re-routing wires as shown in figure 8.3. The key to successful
convergence on repair is to find the best solution that creates the least disturbance to
the existing design. For example, if re-routing a net to reduce coupling, the original
timing can be maintained by restricting the length of the new route to be similar to the
original route.
Figure 8.3 Adding Shielding to aggressors
[21]
More challenging than fixing functional violations, however, is fixing the
impact SI has on timing. The additional delay changes caused by crosstalk increase
the degree of difficulty for achieving timing closure. First a post-route static timing
analysis of the design must be performed to determine if any new setup or hold
violations have been introduced by SI. Each new failing path needs to be re-
optimized. This timing repair process must endeavour to fix the failing paths with the
minimum of design disturbance while identifying the optimal way to regain lost time.
Fixes for timing violations can include traditional in-place optimizations as
well as crosstalk reduction techniques such as those used to repair functional
violations. To converge quickly on repairs, both functional and timing problems
should be repaired simultaneously. As each potential repair is implemented it should
be incrementally analyzed to determine if it really fixes the problem and to ensure it
does not introduce a new timing or functional glitch.
After all repairs have been implemented, the design should be considered
closed and ready for final verification and SI sign-off.
Chapter 9
Results
9.1 Synthesis Results
The synthesis is performed on the asynchronous FIFO design (aFifo).
9.1.1 Timing Reports
****************************************
Report : timing
-path full
-delay max
-max_paths 1
Design :aFifo
Version: D-2010.03-SP4
Date : Mon Jun 3 17:42:47 2013
****************************************
Operating Conditions: TYPICAL Library: saed90nm_typ
Wire Load Model Mode: top
Startpoint: ReadEn_in (input port clocked by WClk)
Endpoint: GrayCounter_pRd/BinaryCount_reg[3]
(rising edge-triggered flip-flop clocked by RClk)
Path Group: RClk
Path Type: max
Point Incr Path
--------------------------------------------------------------------------
clock WClk (rise edge) 24.00 24.00
clock network delay (ideal) 0.00 24.00
input external delay 0.50 24.50 f
ReadEn_in (in) 0.00 24.50 f
U642/QN (NAND2X0) 0.04 24.54 r
U643/QN (INVX0) 0.13 24.67 f
U639/QN (NAND2X0) 0.08 24.75 r
U957/QN (INVX0) 0.05 24.79 f
U637/QN (NAND2X0) 0.06 24.85 r
U987/QN (NOR2X0) 0.05 24.90 f
U988/QN (NOR2X0) 0.04 24.94 r
U989/Q (MUX21X1) 0.08 25.02 r
GrayCounter_pRd/BinaryCount_reg[3]/D (DFFX1) 0.00 25.02 r
data arrival time 25.02
clock RClk (rise edge) 30.00 30.00
clock uncertainty -4.00 26.00
GrayCounter_pRd/BinaryCount_reg[3]/CLK (DFFX1) 0.00 26.00 r
library setup time -0.07 25.93
data required time 25.93
--------------------------------------------------------------------------
data arrival time -25.02
--------------------------------------------------------------------------
slack (MET) 0.91
Startpoint: Empty_out_reg
Endpoint: Empty_out (output port clocked by WClk)
Path Group: WClk
Path Type: max
Point Incr Path
------------------------------------------------------------------------------
Empty_out_reg/CLK (DFFASX2) 0.00 0.00 r
Empty_out_reg/Q (DFFASX2) 3.79 3.79 r
Empty_out (out) 0.00 3.79 r
output external delay -0.50 5.00
------------------------------------------------------------------------------
------------------------------------------------------------------------------
slack (MET) 1.21
As shown in report 9.1.1, the design has two clocks, the read clock and the
write clock, represented as RClk and WClk, so that data can be read after it is written.
The endpoint GrayCounter_pRd/BinaryCount_reg[3] is the capture flip-flop, which
must receive its data at least one setup time before the rising edge of its
clock. CLK is the clock pin of the flip-flop. The Incr column gives the incremental
delay added at each point of the path, and the Path column gives the accumulated
delay from the start pin to that point. The r and f suffixes indicate a rising or
falling transition at that pin.
9.1.2 Data Required time and Data Arrival Time
The data arrival time shown in report 9.1.1 is the amount of elapsed
time from the source of the launch clock edge to the arrival of data at the
endpoint. The data required time shown in report 9.1.1 is the latest allowable
time for the data at the path endpoint, taking into account the nominal capture clock
edge time, the clock network delay (the least possible delay along the clock path),
the clock uncertainty, and the library setup time, which is taken from the library.
9.1.3 Slack
The slack value shown in report 9.1.1 is the difference between the data required
time and the data arrival time. The slack is the amount of time by which the timing
constraint is met, considering the latest possible arrival of data at the endpoint and
the earliest possible arrival of the capture clock edge. In this example, the slack
of the RClk path is 0.91 ns, which means that the timing constraint is met with some
margin; a slack of zero would mean that the constraint is barely met. A negative slack
would require a change in the design to fix the violation. On the other hand, a
large positive slack offers opportunities for optimization.
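The arithmetic behind the RClk path in report 9.1.1 can be reproduced directly from the values printed in the report. The sketch below only restates those numbers; the variable names are chosen here for illustration, and the time unit is nanoseconds, as given by the library.

```python
# Values copied from report 9.1.1 (RClk path group), in ns.
capture_edge = 30.00       # clock RClk (rise edge)
clock_network_delay = 0.00 # ideal clock at the synthesis stage
clock_uncertainty = 4.00   # printed as -4.00 in the report
library_setup = 0.07       # printed as -0.07 in the report
data_arrival = 25.02

data_required = capture_edge + clock_network_delay - clock_uncertainty - library_setup
slack = data_required - data_arrival

print(f"data required time = {data_required:.2f} ns")
print(f"slack              = {slack:.2f} ns ({'MET' if slack >= 0 else 'VIOLATED'})")
```

The 4 ns clock uncertainty is by far the largest deduction from the 30 ns capture edge, so the reported margin depends mostly on how conservatively that uncertainty was set.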
9.1.4 Setup and Hold Time
Every flip-flop has restrictive time regions around the active clock edge
in which the input should not change. We call them restrictive because any change
in the input within these regions could affect the output. The setup time is the
interval before the clock edge during which the data must be held stable.
Figure 9.1 Setup and Hold Time
Figure 9.1 shows the timing window around the clocking event during
which the synchronous input must remain stable and unchanged in order to be
recognized. This window is defined by the setup and hold times. If either is violated,
correct operation of the flip-flop is not guaranteed.
The hold time is the interval after the clock where the data must be held
stable. Hold time can be negative, which means the data can change slightly before
the clock edge and still be properly captured. Most current flip-flops have zero
or negative hold time.
To avoid setup time violations, the combinational logic between the flip-flops
can be optimized for minimum delay, the flip-flops can be redesigned for a smaller
setup time, or the launch flip-flop can be tweaked to have a better slew at its clock
pin, which makes the launch flip-flop faster and thereby helps fix setup violations.
To avoid hold time violations, delays can be added (using buffers), or
lockup latches can be added (in cases where the hold time requirement is very large,
basically to avoid data slip).
9.1.5 Power report
****************************************
Report : power
-analysis_effort low
Design :aFifo
Date : Mon Jun 3 17:42:49 2013
****************************************
Library(s) Used:
saed90nm_typ (File: /home/11011J6033/dcshellfinal/ref/saed90nm_typ.db)
Global Operating Voltage = 1.2
Power-specific unit information :
Voltage Units = 1V
Capacitance Units = 1.000000pf
Time Units = 1ns
Dynamic Power Units = 1mW (derived from V,C,T units)
Leakage Power Units = 1pW
Cell Internal Power = 401.1208 uW (82%)
Net Switching Power = 87.0481 uW (18%)
--------------------
Total Dynamic Power = 488.1689 uW (100%)
Cell Leakage Power = 24.7386 uW
Report 9.1.5 above shows the power consumed by the design.
The total dynamic power is the sum of the cell internal power and the net switching
power. Cell internal power is obtained from the library for the particular cells used
in the design, and switching power is dissipated when charging and discharging the
load capacitance at the cell outputs. The units of voltage, capacitance and time are
taken by default from the library.
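The totals and percentages in report 9.1.5 can be checked with a few lines of arithmetic. The sketch below uses only the figures printed in the report; the variable names are illustrative, and adding leakage to obtain an overall total is an assumption made here, since the report itself does not print that sum.

```python
# Values copied from report 9.1.5, in microwatts.
cell_internal_uw = 401.1208
net_switching_uw = 87.0481
cell_leakage_uw = 24.7386

total_dynamic_uw = cell_internal_uw + net_switching_uw
internal_share = 100.0 * cell_internal_uw / total_dynamic_uw
switching_share = 100.0 * net_switching_uw / total_dynamic_uw

# Assumption: overall power taken as dynamic plus leakage.
total_uw = total_dynamic_uw + cell_leakage_uw

print(f"total dynamic = {total_dynamic_uw:.4f} uW")
print(f"internal {internal_share:.0f}% / switching {switching_share:.0f}%")
print(f"dynamic + leakage = {total_uw:.4f} uW")
```

The 82% to 18% split shows that most dynamic power is dissipated inside the cells rather than in charging the nets, as expected for a small design with short wires and many flip-flops.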
9.1.6 QOR (Quality of results)
****************************************
Report :qor
Design :aFifo
Date : Mon Jun 3 17:42:47 2013
****************************************
Timing Path Group 'RClk'
----------------------------------------------
Levels of Logic: 8.00
Critical Path Length: 0.52
Critical Path Slack: 0.91
Critical Path Clk Period: 30.00
Total Negative Slack: 0.00
No. of Violating Paths: 0.00
Worst Hold Violation: 0.00
Total Hold Violation: 0.00
No. of Hold Violations: 0.00
----------------------------------------------
Timing Path Group 'WClk'
----------------------------------------------
Levels of Logic: 0.00
Critical Path Length: 3.79
Critical Path Slack: 1.21
Critical Path Clk Period: 6.00
Total Negative Slack: 0.00
No. of Violating Paths: 0.00
Worst Hold Violation: 0.00
Total Hold Violation: 0.00
No. of Hold Violations: 0.00
----------------------------------------------
Cell Count
----------------------------------------------
Hierarchical Cell Count: 0
Hierarchical Port Count: 0
Leaf Cell Count: 551
Buf/Inv Cell Count: 7
CT Buf/Inv Cell Count: 0
----------------------------------------------
Area
---------------------------------------------------------
Combinational Area: 3700.889009
Noncombinational Area: 3924.139912
Net Area: 0.000000
---------------------------------------------------------
Cell Area: 7625.028921
Design Area: 7625.028921
Design Rules
---------------------------------------------------------
Total Number of Nets: 718
Nets With Violations: 0
Report 9.1.6 above summarizes several aspects of the design: timing path
groups, cell count, area and design rules. It gives a QoR summary without
reporting the details of each timing path. For each timing path group it lists the
levels of logic, the critical path length, the slack and the violations, and it also
gives current design statistics such as the cell count and the combinational,
non-combinational and total area. The areas are reported in um^2. Under the cell count
section, the leaf cell count includes all leaf cells that are not constant cells.
9.2 Design Planning Results
9.2.1 Floorplan reports
Parameter                 Value
Total Area                7625.028921 um^2
Core Utilization          0.930
Number of Rows            31
Core Width                91.84
Core Height               89.28
Aspect Ratio              0.972
Total Number of Nets      718
Total Number of Cells     551
Table 9.1 floor plan Report
Total area is the total area of the cells in the netlist. Core utilization is
0.930 (93%). Other values such as 40% or 80% can be chosen, but the default
utilization is 70%, because utilization rises toward 100% by the end of routing; for
small designs a core utilization of 90 to 95% can be used. If the utilization is
chosen too high, the core becomes over-utilized at the placement stage, which is why
the default value is 70%. The number of rows needed to place the cells is decided
from the number of cells in the design. The aspect ratio determines the shape of the
die that is built with the available resources.
With an aspect ratio of 0.5 the shape is a 2:1 rectangle and the clock structure is
not built well, so the aspect ratio plays an important role in building the clock
tree; here it is close to 1 (0.972). The number of rows specified in table 9.1 is the
number of rows in which standard cells are placed. These rows, or power rails, are
further used to deliver power to the placed cells, together with the power straps
attached to the cells on the core. The total numbers of nets and cells come from the
netlist obtained from the synthesis stage. Core height and width are in microns and
area is in square microns; core utilization and aspect ratio are dimensionless.
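The entries of table 9.1 are related by simple geometry, which can be verified as follows. This sketch uses only the table values; the variable names are illustrative, the aspect ratio is assumed to be defined as height divided by width, and the row height is derived here rather than read from the library.

```python
# Values copied from table 9.1 (lengths in um, area in um^2).
cell_area = 7625.028921
core_width = 91.84
core_height = 89.28
num_rows = 31

core_area = core_width * core_height
utilization = cell_area / core_area        # fraction of the core covered by cells
aspect_ratio = core_height / core_width    # assumed definition: height / width
row_height = core_height / num_rows        # derived height of one standard cell row

print(f"core area    = {core_area:.2f} um^2")
print(f"utilization  = {utilization:.3f}")
print(f"aspect ratio = {aspect_ratio:.3f}")
print(f"row height   = {row_height:.2f} um")
```

The computed utilization and aspect ratio reproduce the 0.930 and 0.972 listed in the table, and the 31 rows divide the core height into rows of 2.88 um each.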
Figure 9.2 shows the design after floor planning, with the
die, core and standard cell locations indicated.
Figure 9.2 Floor Planned Design
9.2.2 Virtual Flat Placement
The virtual flat placement is very fast and is optimized for wire
length, congestion and timing. Some tasks are carried out beforehand so that the
initial virtual flat placement can be evaluated. During virtual flat placement no
other cells are placed within the specified distance from the edges of the macros,
if macros are present. This flat placement is done to avoid
congestion-related issues in the placement of cells.
Figure 9.3 shows the virtual flat placement of the design with the
cells placed. The placement of cells is not fixed at this stage.
Figure 9.3 Virtual Flat Placement
9.2.3 Congestion Analysis
Congestion occurs where there are a lot of chip-level or inter-block wires
that need to cross an area. For instance, interconnect between cells and I/O
pins or memory ports will be very dependent on both the floorplan and
where those cells are placed within the floorplan. Global interconnect congestion
can occur even when there is low placement density. In fact, in some cases low
placement density can even cause congestion because of the need for long
connections and additional buffering. Finally, chips with a limited number of
routing layers for cost reasons can also cause global congestion.
Both Dirs: Overflow = 43 Max = 7 (1 GRCs) GRCs = 15 (1.42%)
H routing: Overflow = 4 Max = 2 (2 GRCs) GRCs = 2 (0.19%)
V routing: Overflow = 39 Max = 7 (1 GRCs) GRCs = 13 (1.23%)
The output of virtual flat placement is a congestion report. The brief
report above shows that 15 global routing cells (GRCs), or 1.42% of all GRCs, are
overflowed, with a total overflow of 43 wire tracks. The worst case is one GRC in the
vertical direction that needs 7 more wire tracks than it has available.
Congestion occurs during this stage because standard cells and
macros are close together. If the congestion issues are not solved at this stage,
congestion will increase at further stages of the design and cause DRC
violations at the final routing stage of the design.
As shown in figure 9.3, the congestion spot labelled
56/54 means that 56 signals need to pass through this wiring track, but it allows
only 54 signals. To allow the rest of the signals to pass, we have to do
incremental placement.
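The congestion figures above can be read mechanically. The following sketch parses the three report lines quoted earlier and a demand/capacity label such as 56/54. It is an illustration written for this text, not a tool utility: the regular expression and the function names are hypothetical, and they assume the report keeps exactly the layout shown.

```python
import re

REPORT = """\
Both Dirs: Overflow = 43 Max = 7 (1 GRCs) GRCs = 15 (1.42%)
H routing: Overflow = 4 Max = 2 (2 GRCs) GRCs = 2 (0.19%)
V routing: Overflow = 39 Max = 7 (1 GRCs) GRCs = 13 (1.23%)
"""

LINE = re.compile(
    r"(?P<name>[\w ]+): Overflow = (?P<overflow>\d+) Max = (?P<max>\d+) "
    r"\((?P<max_grcs>\d+) GRCs\) GRCs = (?P<grcs>\d+) \((?P<pct>[\d.]+)%\)"
)


def parse_congestion(report):
    """Return {direction: {field: value}} for each recognised report line."""
    rows = {}
    for line in report.splitlines():
        match = LINE.match(line.strip())
        if match:
            fields = match.groupdict()
            name = fields.pop("name")
            rows[name] = {k: float(v) if k == "pct" else int(v) for k, v in fields.items()}
    return rows


def edge_overflow(label):
    """Overflow of one congestion spot given as 'demand/capacity'."""
    demand, capacity = (int(part) for part in label.split("/"))
    return max(0, demand - capacity)


rows = parse_congestion(REPORT)
print(rows["V routing"])
print("overflow at 56/54 spot:", edge_overflow("56/54"))
```

The horizontal and vertical overflows (4 and 39) add up to the 43 reported for both directions, and the vertical direction accounts for 13 of the 15 overflowed GRCs, so the vertical tracks are the ones that incremental placement has to relieve.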
After virtual flat placement the congestion is 1.42%, that is, 1.42% of the
GRCs need more wire tracks than they have available. Congestion occurs during
this stage because standard cells and macros are close together. For instance,
congestion can occur in slots between memories or around corners of memories;
identifying this type of congestion requires the floorplan to be used as the input.
If the congestion issues are not solved at this stage, congestion will increase at
further stages of the design and cause DRC violations at the final routing stage of
the design, as well as signal integrity issues and timing violations for
high-frequency designs.
After a congestion spot is identified, it can be eliminated by incremental
placement, which moves some logic closer together and spreads other logic farther
apart. After eliminating the congestion spot, the congestion report is as follows.
9.3 Power Planning Results
9.3.1 Rectangular Rings Result
After floor planning, the next step is to add rings. These rings supply power to
the core cells and I/O cells of the design. They are rectangular because the chip is
rectangular; rings drawn inside a square boundary would be square rings. The
ring type, width and number of rings are specified for the design.
Current into the core is split into two directions.
As shown in figure 9.4, the rectangular rings are highlighted; metal
layers 4 and 5 are used for these rings.
9.3.2 Power Straps Result
The straps are automatically connected to the closest ring to supply power
to the cells in the core. The power straps attached to the core and rings are shown in
figure 9.4.
Figure 9.4 Rectangular Rings & Power Straps
9.4 Power Network Synthesis Results
After adding power straps, the next step is PNS (power network synthesis),
which builds an early power plan in order to avoid problems such as IR (voltage)
drop later in detailed power routing.
9.4.1 IR Drop Analysis
The power network has resistance associated with its wires. This resistance
causes a voltage drop as power is transferred from the power pads to the target
devices. To reduce the IR drop in the power grid, a sufficient number of
power IOs, decoupling capacitors, and sufficiently wide grid wires
(low resistance) are needed. The IR drop is defined as
$\text{IR Drop} = \max_i (V_{dd} - v_i)$,
where $v_i$ is the potential at any node $i$ on the power grid.
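The definition above can be tried on a toy model. The sketch below assumes a single power strap fed from one end, with equal segment resistances and hypothetical tap currents; it only illustrates why lower wire resistance (a wider strap) reduces the maximum drop, and it does not model the real grid.

```python
def rail_voltages(vdd, seg_res_ohm, tap_currents_a):
    """Node voltages along one strap fed from a single end (toy model)."""
    voltages = []
    v = vdd
    for i in range(len(tap_currents_a)):
        # Segment i carries the current of tap i and of every tap after it.
        seg_current = sum(tap_currents_a[i:])
        v -= seg_current * seg_res_ohm
        voltages.append(v)
    return voltages


def ir_drop(vdd, voltages):
    """IR drop = max over nodes of (Vdd - v_i)."""
    return max(vdd - v for v in voltages)


VDD = 1.5                      # supply voltage used in this design
taps = [0.010] * 4             # hypothetical: four taps drawing 10 mA each
narrow = ir_drop(VDD, rail_voltages(VDD, 1.0, taps))   # 1.0 ohm per segment
wide = ir_drop(VDD, rail_voltages(VDD, 0.5, taps))     # 0.5 ohm per segment
print(f"narrow strap: {narrow * 1000:.1f} mV, wide strap: {wide * 1000:.1f} mV")
```

In this model the largest drop is at the node farthest from the feed point, and halving the segment resistance halves the drop.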
Parameter                     Value
Target IR drop                150 mV
Average power dissipation     2000.00 mW
Total number of cells         551
Average power current         1333.33 mA
Power supply voltage          1.5 V
Maximum IR drop on VDD        144.699 mV
Maximum IR drop on VSS        217.00 mV
Maximum current               41.003 mA
Maximum instance IR drop      352.398 mV (instance U479)
Table 9.2 IR Drop Analysis
Table 9.2 summarises the IR drop and power analysis for the design. The
IR drop is worst at a distance of L/2 across the chip: if the span is 50 um,
the worst IR drop occurs at 25 um, because that point is farthest from the power
rings and the supply current has to travel the longest resistive path to reach it.
The target IR drop and the power supply voltage are related to each other, since the
target is set as a percentage of the supply voltage. Average power dissipation is the
total power budget of the die. U479 is the name of the instance at which the
maximum IR drop occurs. Maximum current is the amount of current supplied to the
core.
As shown in figure 9.5, power network synthesis marks the position of the
worst IR drop in red; the remaining colours show the levels of IR drop
in the other areas of the chip.
Figure.9.5 Power Network Synthesis
As shown in table 9.2, the target IR drop (150 mV, which is 10% of Vdd) is less than
the maximum instance IR drop of 352.398 mV, which means the IR drop target is not
met. When the target IR drop is not met, it can be met with two techniques:
increasing the mesh width and adding more straps. Here the IR drop target is met by
increasing the mesh width. IR drop may sometimes also be caused by glitches, so after
the routing stage, once the glitches are removed, IR drop analysis needs to be repeated
to confirm that the target is met.
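The numbers in table 9.2 can be cross-checked with a few lines of arithmetic. The sketch below assumes the target is 10% of the 1.5 V supply and that the average current is simply power divided by voltage; the variable names are illustrative only.

```python
VDD_V = 1.5               # power supply voltage from table 9.2
AVG_POWER_MW = 2000.0     # average power dissipation from table 9.2
TARGET_FRACTION = 0.10    # assumption: target IR drop is 10% of Vdd

avg_current_ma = AVG_POWER_MW / VDD_V                      # I = P / V
target_mv = round(TARGET_FRACTION * VDD_V * 1000.0, 3)     # 10% of Vdd in mV

# Reported drops from table 9.2, in mV.
reported_mv = {
    "max IR drop on VDD": 144.699,
    "max IR drop on VSS": 217.00,
    "max instance IR drop (U479)": 352.398,
}
meets_target = {name: drop <= target_mv for name, drop in reported_mv.items()}

print(f"average current: {avg_current_ma:.2f} mA, target: {target_mv:.0f} mV")
for name, ok in meets_target.items():
    print(f"{name}: {'met' if ok else 'NOT met'}")
```

With these inputs only the VDD figure is within the target; the VSS and instance figures exceed it, which is why the mesh had to be widened.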
9.5 Placement Results
Placement is an essential step in electronic design automation: it is the portion
of the physical design flow that assigns exact locations to the various circuit
components within the chip's core area. Typical placement objectives include
minimizing total wire length, power, timing, and runtime.
The main focus of the placement algorithm is to make the chip as dense
as possible (area constraint) and to minimize the total wire length (reducing the
length of critical nets and the number of horizontal/vertical wire segments crossing a
line). During placement the total number of cells increases because additional buffers
and inverters are inserted and because the length and width of soft macro cells change;
the net length also adds to the area of the design. Figure 9.6 shows the design
after placement. At this stage the cells are placed and the early power plan has
been made. The colours (pink, blue) represent the power straps that supply power
to the cells in the design: pink refers to VDD and blue refers to VSS.
Figure 9.6 Placement
9.5.1 Timing report
****************************************
Report : timing
-path full
-delay max
-max_paths 1
Design :aFifo
Version: D-2010.03-ICC-SP4
Date : Tue Jun 4 13:21:36 2013
****************************************
* Some/all delay information is back-annotated.
Path Group: RClk
Path Type: max
Point Incr Path
--------------------------------------------------------------------------
U642/QN (NAND2X0) 0.12 * 24.62 r
U643/QN (INVX0) 0.25 * 24.87 f
U639/QN (NAND2X0) 0.12 * 24.98 r
U957/QN (INVX0) 0.06 * 25.05 f
U637/QN (NAND2X0) 0.06 * 25.11 r
U987/QN (NOR2X0) 0.06 * 25.17 f
U988/QN (NOR2X0) 0.04 * 25.22 r
U989/Q (MUX21X1) 0.08 * 25.30 r
GrayCounter_pRd/BinaryCount_reg[3]/D (DFFX1) 0.00 * 25.30 r
--------------------------------------------------------------------------
--------------------------------------------------------------------------
slack (MET) 0.63
Startpoint: Full_out_reg
(rising edge-triggered flip-flop clocked by WClk)
Endpoint: Full_out (output port clocked by WClk)
Path Group: WClk
Path Type: max
Point Incr Path
------------------------------------------------------------------------
Full_out_reg/CLK (DFFASX2) 0.00 0.00 r
Full_out_reg/Q (DFFASX2) 3.80 3.80 r
Full_out (out) 0.17 * 3.97 r
-----------------------------------------------------------
-----------------------------------------------------------
slack (MET) 1.03
9.5.2 Area report
Area
----------------------------------------------
Combinational Area: 3700.889009
Noncombinational Area: 3924.139912
Net Area: 0.000000
Net XLength: 6753.28
Net YLength: 6021.40
----------------------------------------------
Cell Area: 7625.028921
Design Area: 7625.028921
Net Length: 12774.69
Design Rules
----------------------------------------------
Total Number of Nets: 718
Nets With Violations: 0
9.5.3 Power report
Report : power
Design :aFifo
Date : Mon Jun 3 17:42:49 2013
****************************************
Library(s) Used:
saed90nm_typ (File: /home/11011J6033/icshellfinal/ref/saed90nm_typ.db)
Voltage Units = 1V
Time Units = 1ns
---------------------------------------------------------------------
Total Dynamic Power = 505.8137uW (100%)
Cell Leakage Power = 34.6414uW
9.5.4 Congestion report
Both Dirs: Overflow = 2028 Max = 25 (1 GRCs) GRCs = 237 (22.40%)
H routing: Overflow = 205 Max = 11 (1 GRCs) GRCs = 47 (4.44%)
V routing: Overflow = 1823 Max = 25 (1 GRCs) GRCs = 190 (17.96%)
After placement the congestion has increased. Earlier, in virtual flat
placement, the cells were placed in an unorganised way, but now they have to be placed
in an orderly way. The cells are therefore packed closely, which leaves less space for
routing channels, and that is why the congestion has increased considerably. To remove
congestion at the placement stage, soft blockages are created where the cell density is
high so that the utilization there decreases, taking care not to disturb timing.
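The congestion report in 9.5.4 can be read programmatically to see where the overflow concentrates. The following is a minimal sketch that parses the three lines exactly as they appear above; the regular expression and the derived figures (vertical share, estimated GRC count) are illustrative assumptions, not tool output.

```python
import re

REPORT = """\
Both Dirs: Overflow = 2028 Max = 25 (1 GRCs) GRCs = 237 (22.40%)
H routing: Overflow = 205 Max = 11 (1 GRCs) GRCs = 47 (4.44%)
V routing: Overflow = 1823 Max = 25 (1 GRCs) GRCs = 190 (17.96%)
"""

LINE = re.compile(
    r"^(?P<label>[^:]+):\s*Overflow\s*=\s*(?P<overflow>\d+)\s+"
    r"Max\s*=\s*(?P<max>\d+)\s+\(\d+ GRCs\)\s+"
    r"GRCs\s*=\s*(?P<grcs>\d+)\s+\((?P<pct>[\d.]+)%\)$"
)


def parse(report):
    """Return {label: {overflow, max, grcs, pct}} for each matching line."""
    rows = {}
    for line in report.splitlines():
        m = LINE.match(line.strip())
        if m:
            rows[m["label"]] = {
                "overflow": int(m["overflow"]),
                "max": int(m["max"]),
                "grcs": int(m["grcs"]),
                "pct": float(m["pct"]),
            }
    return rows


rows = parse(REPORT)
both, h, v = rows["Both Dirs"], rows["H routing"], rows["V routing"]
vertical_share = round(100.0 * v["overflow"] / both["overflow"], 1)
estimated_total_grcs = round(both["grcs"] / (both["pct"] / 100.0))
print(f"vertical share of overflow: {vertical_share}%")
print(f"estimated number of GRCs: {estimated_total_grcs}")
```

The horizontal and vertical overflows add up to the combined figure, and most of the overflow is in the vertical direction, so that is where soft blockages would help most.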
Leakage power has increased because IR drop can create many shorts or
opens in the device, so many devices become inactive, which creates leakage power.
Due to the shorts, unnecessary transitions also take place, which increases the dynamic
power. The design is operated at a low frequency; at a high frequency it would
generally get timing violations and signal integrity issues.
9.6 Clock Tree Synthesis Results
After placement of all blocks, an actual clock is assigned to the
design and a clock tree is formed accordingly, replacing the virtual
clock that has been used in the design until now. The clock tree branches out to the
individual blocks. As shown in figure 9.7, the propagation of the clock is highlighted.
Figure 9.7 clock tree synthesis
In the CTS stage the clock changes its state from ideal to propagated, so
real delays are added to the design and there is a variation in the
slack.
9.6.1 Timing report:
****************************************
Report : timing
-path full
-delay max
-max_paths 1
Design :aFifo
Date : Tue Jun 4 13:42:20 2013
****************************************
* Some/all delay information is back-annotated.
Parasitic source : LPE
Parasitic mode : RealRC
Extraction mode : MAX
Extraction derating : 25/25/25
Path Group: RClk
Path Type: max
Point Incr Path
-------------------------------------------------------------------------------
U642/QN (NAND2X0) 0.11 * 24.61 r
U643/QN (INVX0) 0.23 * 24.85 f
U639/QN (NAND2X0) 0.11 * 24.95 r
U957/QN (INVX0) 0.06 * 25.02 f
U637/QN (NAND2X0) 0.07 * 25.08 r
U987/QN (NOR2X0) 0.06 * 25.14 f
U988/QN (NOR2X0) 0.04 * 25.19 r
U989/Q (MUX21X1) 0.08 * 25.27 r
GrayCounter_pRd/BinaryCount_reg[3]/D (DFFX1) 0.00 * 25.27 r
clock network delay (propagated) 0.01 30.01
---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------
slack (MET) 0.67
Startpoint: Full_out_reg
(rising edge-triggered flip-flop clocked by WClk)
Endpoint: Full_out (output port clocked by WClk)
Path Group: WClk
Path Type: max
Point Incr Path
------------------------------------------------------------------------------
clock network delay (propagated) 0.54 0.54
Full_out_reg/CLK (DFFASX2) 0.00 0.54 r
Full_out_reg/Q (DFFASX2) 3.81 4.36 r
Full_out (out) 0.17 * 4.53 r
-----------------------------------------------------------
-----------------------------------------------------------
slack (MET) 0.47
As explained in 9.1.1, there is a minor change in the slack at this stage
of the implementation. This is because an actual clock is assigned to the design and
a clock tree is formed accordingly, replacing the virtual clock. The clock network
delay, which was ideal in all the steps above, has changed to propagated in this
step, so the clock network delay also comes into the picture and increases the
delay slightly.
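The slack change between placement and CTS can be traced with simple arithmetic. The sketch below assumes the usual setup relation, slack = required time - arrival time, and infers the required time from the arrival and slack values printed in 9.5.1 and 9.6.1, since the required-time lines are not shown in those reports.

```python
# (data arrival time, slack) in ns, copied from the reports in 9.5.1 and 9.6.1.
REPORTS = {
    "RClk": {"placement": (25.30, 0.63), "cts": (25.27, 0.67)},
    "WClk": {"placement": (3.97, 1.03), "cts": (4.53, 0.47)},
}


def implied_required(arrival, slack):
    """Required time inferred from slack = required - arrival (max path)."""
    return round(arrival + slack, 2)


results = {}
for group, stages in REPORTS.items():
    (arr_p, slack_p), (arr_c, slack_c) = stages["placement"], stages["cts"]
    results[group] = {
        "required_placement": implied_required(arr_p, slack_p),
        "required_cts": implied_required(arr_c, slack_c),
        "arrival_change": round(arr_c - arr_p, 2),
        "slack_change": round(slack_c - slack_p, 2),
    }
    print(group, results[group])
```

For WClk the inferred required time stays the same while the arrival time grows, mostly through the 0.54 ns propagated clock network delay on the launch side, so the slack shrinks by the same amount. For RClk the inferred required time moves by about 0.01 ns, matching the propagated clock network delay shown on the capture side.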
9.7 Analyzing the Routing Results
Figure 9.8 Routed Clock Drivers with Clock Nets
9.7.1 Noise report before buffer insertion
****************************************
Report : noise
-verbose
Design :aFifo
Date : Sun Jun 9 16:27:31 2013
****************************************
slack type: area
noise_region: above_low
pin name width height slack
------------------------------------------------------------------------
Full_out_reg/SETB
Aggressors:
Full_out 0.21 0.43
WriteEn_in 0.03 0.05
n273 0.08 0.03
Total: 0.21 0.43 -0.08
noise_region: below_high
------------------------------------------------------------------------
Full_out_reg/SETB
Aggressors:
Full_out 0.20 0.42
n273 0.09 0.04
Total: 0.20 0.42 -0.07
****************************************
9.7.2 Noise Report after buffer insertion
Report : noise
-verbose
Design :aFifo
Date : Sun Jun 9 16:27:31 2013
****************************************
slack type: area
noise_region: above_low
-----------------------------------------------------------------------
Full_out_reg/SETB
Aggressors:
Full_out 0.03 0.12
n273 0.04 0.01
Total: 0.03 0.12 0.23
noise_region: below_high
-----------------------------------------------------------------------
Full_out_reg/SETB
Aggressors:
Full_out 0.04 0.11
n273 0.03 0.02
Total: 0.04 0.11 0.24
9.7.3 Power report before buffer insertion
****************************************
Report : power
Design :aFifo
Date : Mon Jun 3 17:42:49 2013
****************************************
Library(s) Used:
Voltage Units = 1V
Time Units = 1ns
--------------------
9.7.4 Power report after buffer insertion
****************************************
Report : power
Design :aFifo
Date : Mon Jun 3 17:42:49 2013
****************************************
Library(s) Used:
Voltage Units = 1V
Time Units = 1ns
--------------------
As shown in the noise report (9.7.2) and the power report (9.7.4), inserting a
downsized buffer at the victim net decreases the height and width of the noise.
Unnecessary transitions are therefore reduced, so the net switching power decreases,
which in turn reduces the dynamic power. In a small design, however, the reduction in
dynamic power is not visible, because the buffer itself also consumes dynamic power;
in large designs the reduction in dynamic power is visible.
Figure 9.9 Victim net representations
Figure 9.10 Buffer additions to victim net
By adding a buffer to the victim net, the width and height of the glitch decrease.
When the smaller glitch reaches the succeeding cell, which has a higher noise margin,
that cell does not propagate it further. Here the noise threshold is 0.35; anything
that exceeds this falls in the unacceptable region, so such a glitch is unacceptable
and needs to be eliminated. Inserting a buffer at the victim net decreases the
coupling capacitance, and thereby the glitch dimensions decrease.
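The effect of the buffer can be checked against the 0.35 threshold using the values from the noise reports in 9.7.1 and 9.7.2. The sketch below is only an observation on those numbers: although the reports state "slack type: area", the reported slack for this pin happens to equal the threshold minus the total glitch height. It is not a statement of how the tool computes noise slack.

```python
NOISE_THRESHOLD = 0.35  # noise threshold quoted in the text

# (total glitch height, reported slack) for pin Full_out_reg/SETB.
REPORTS = {
    ("above_low", "before"): (0.43, -0.08),
    ("below_high", "before"): (0.42, -0.07),
    ("above_low", "after"): (0.12, 0.23),
    ("below_high", "after"): (0.11, 0.24),
}


def height_margin(height, threshold=NOISE_THRESHOLD):
    """Margin left between the glitch height and the threshold."""
    return round(threshold - height, 2)


margins = {key: height_margin(height) for key, (height, _) in REPORTS.items()}
acceptable = {key: margin >= 0 for key, margin in margins.items()}

for key, (height, slack) in REPORTS.items():
    region, stage = key
    print(f"{region:10s} {stage:6s} height={height:.2f} "
          f"margin={margins[key]:+.2f} reported slack={slack:+.2f}")
```

Before buffer insertion both regions exceed the threshold; after insertion both are well inside it.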
Chapter 10
Conclusion
A glitch is an unwanted signal that propagates in the design without prior
knowledge. When glitches occur in large numbers, they consume a large share of the
dynamic switching power. As technology scales down, cell delays decrease, so two or
more signals that propagate with different arrival times produce a glitch. The
methodology presented here does not need to change the library cells or create
any new library cell; it only adjusts the arrival times so that the difference between
them is less than the delay of the cell that receives the arriving signals.
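The arrival-time condition stated above can be written as a short check. This is a simplified sketch of that condition only, with hypothetical arrival times and cell delay; the function names are illustrative and the model ignores slew, load, and coupling effects.

```python
def produces_glitch(arrival_a_ns, arrival_b_ns, cell_delay_ns):
    """True when the arrival-time difference is not less than the cell delay."""
    return abs(arrival_a_ns - arrival_b_ns) >= cell_delay_ns


def padding_lower_bound_ns(arrival_a_ns, arrival_b_ns, cell_delay_ns):
    """Delay that must be exceeded on the earlier signal to satisfy the condition."""
    return max(0.0, abs(arrival_a_ns - arrival_b_ns) - cell_delay_ns)


# Hypothetical values: inputs arrive at 0.10 ns and 0.30 ns, cell delay 0.12 ns.
before = produces_glitch(0.10, 0.30, 0.12)
bound = padding_lower_bound_ns(0.10, 0.30, 0.12)
# Delaying the earlier input by a 0.10 ns buffer moves its arrival to 0.20 ns.
after = produces_glitch(0.10 + 0.10, 0.30, 0.12)
print(before, round(bound, 2), after)
```

In this example the added delay of 0.10 ns exceeds the 0.08 ns bound, so the difference in arrival times falls below the cell delay.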
The proposed methodology reduces the dynamic power when the glitch effect
is large, and it ensures the correct functionality of the device by meeting timing in
terms of setup and hold times. The cell switching power therefore decreases, which
saves power.
The methodology is not well suited to smaller designs: although glitches
decrease, the added buffers increase the cell internal power, which leads to an
increase in dynamic power. At present, the proposed methodology is the best option
for obtaining a design with a reduced glitch effect.
Chapter 11
Future Scope
Due to heavy scaling in technology, the increase in gate count, wire lengths, net
delay, IR drop, and electromigration, together with the performance of the design,
plays a huge role in design implementation, as do the coupling capacitance and
resistance along the paths. Some amendments need to be made to improve performance
with rich functionality (a noise-free environment).
Glitch reduction in the proposed methodology removes or decreases glitches to
some extent, and it is done manually. There needs to be an automated mechanism or
script that detects glitches and adds buffers at the victim or aggressor nets. The
algorithms should also perform automatic shielding and spacing for congested designs.
With such an automated mechanism added, the proposed methodology would work well in
reducing noise in the design to the maximum extent.
