
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR PAPERS, VOL. 65, NO. 5, MAY 2018 1591

A Low-Overhead Dynamic TCAM With Pipelined Read-Restore Refresh Scheme
Sandeep Mishra, Member, IEEE, Telajala Venkata Mahendra, Member, IEEE,
Jyotishman Saikia, and Anup Dandapat, Senior Member, IEEE

Abstract— Hardware search using content addressable memory (CAM) delivers the fastest lookup but occupies a larger design area and consumes relatively high power. Dynamic CAM is an alternative that addresses these issues, but its slower search speed and lack of a ternary approach act as primary design constraints. A fully ternary dynamic storage is presented for a high-density, low-leakage associative memory. A unique self-refresh scheme with a pipelined read-restore mechanism is introduced that updates the CAM cells at a much lower refresh overhead. The proposed 4-kb dynamic ternary CAM structure has been designed in a predictive 45-nm CMOS technology with a 48% smaller cell area than a traditional ternary CAM and simulated using SPECTRE at a supply voltage of 1.0 V. The proposed design dissipates 0.583 fJ/bit/search at a search rate of 440 MHz. The refresh module has been integrated at an additional area of 1.92% and updates the TCAM at a 0.21% refresh overhead.
Index Terms— Content addressable memory (CAM), dynamic CAM, high-density design, search engine, ternary CAM.

Fig. 1. Organization of a dynamic CAM bank with self-refresh module.

Manuscript received April 22, 2017; revised July 25, 2017 and September 12, 2017; accepted September 21, 2017. Date of publication October 5, 2017; date of current version April 2, 2018. This work was supported in part by the Ministry of Electronics and Information Technology under Project SMDP-C2SD 9(1)/2014-MDD and in part by the Fellowship YFRF, Department of Science & Technology, Government of India under Project YSS/2015/001198. This paper was recommended by Associate Editor Y. Pu. (Corresponding author: Anup Dandapat.)
S. Mishra, T. V. Mahendra, and A. Dandapat are with the Department of Electronics and Communication Engineering, National Institute of Technology at Meghalaya, Shillong 793003, India (e-mail: ssandeep.mmishra@nitm.ac.in; telajalamahendra@nitm.ac.in; anup.dandapat@nitm.ac.in).
J. Saikia is with the Department of Electronics and Electrical Engineering, IIT Guwahati, Guwahati 781039, India (e-mail: saikia.jyotishman.93@gmail.com).
Digital Object Identifier 10.1109/TCSI.2017.2756662

I. INTRODUCTION

DIRECT access to the match address is necessary for faster search in the modern information technology era. Software-based algorithms lack good search speed when used for applications requiring repeated hits/searches. Later, hardware search engines such as content addressable memories (CAMs) were introduced, which perform a parallel comparison of the search key with all pre-stored data in a single clock cycle [1]–[4]. A ternary CAM (TCAM) renders an efficient search by supplying an auxiliary don't care state "X" that makes it suitable for popular applications such as network routing, high-speed associative cache and pattern recognition. In spite of this advantage of fast lookup, a TCAM implementation takes a larger design space and is power hungry. The major sources of power consumption in a TCAM are the charging and discharging of capacitances associated with the searchline (SL) and matchline (ML). The leakage in the static TCAM is also significant through the two static random access memories (SRAMs) used for data and mask storage [5]–[7]. Designs of TCAM arrays using hybrid core cells for a trade-off between hit rate and power dissipation largely suffer from scalability issues [8], [9]. Improvement in TCAM performance is achieved through gating or segmentation [10], [11]. The gating technique is not very effective in reducing sub-threshold current and requires additional cell area. A loadless four-transistor SRAM that consumes low in-cell current is introduced in [12]. Many CAMs have been inspired by it and show promising features [13]–[15], but threshold control, which is essential for reliable operation, is a major concern. The traditional static TCAM cell consists of a larger number of transistors, which increases the macro area. Since the density of CAM is low compared to SRAM and DRAM, technology scaling becomes crucial, particularly in portable devices.

Dynamic CAMs (DCAMs) are possible alternatives that address these issues with reduced leakage and smaller design area [16]–[23]. In dynamic memory, data is stored as electrical charge either in a capacitor or as a charged net, which discharges over time. Thus, DCAMs suffer from one major issue of retention time [24]–[28], but it can be suppressed through periodic refreshing as structured in Fig. 1, which can be designed at low additional area and refresh overhead. However, during the process of refresh, both search and sense modules are disabled, which in turn reduces the frequency of search.
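The ternary matching described in the Introduction can be illustrated with a minimal software sketch. It assumes entries are written as strings over "0", "1" and "X" (a representation chosen here purely for illustration) and models the single-cycle parallel comparison as a loop over entries. The function names are hypothetical and are not part of the proposed design.

```python
from typing import List


def ternary_match(stored: str, key: str) -> bool:
    """Return True if every stored bit equals the key bit or is a don't care 'X'."""
    if len(stored) != len(key):
        raise ValueError("stored word and search key must have equal width")
    return all(s == "X" or s == k for s, k in zip(stored, key))


def tcam_search(entries: List[str], key: str) -> List[int]:
    """Return the addresses (indices) of all entries that match the key.

    A hardware TCAM evaluates all entries in parallel; the loop here only
    models the logical result, not the timing.
    """
    return [addr for addr, word in enumerate(entries) if ternary_match(word, key)]


if __name__ == "__main__":
    table = ["10X1", "0XX0", "1011", "1100"]  # illustrative contents only
    print(tcam_search(table, "1011"))
```

A stored "X" matches either key bit, so a single entry can cover several keys; this is the property that makes the TCAM suitable for prefix-style lookups such as network routing.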
1549-8328 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Besides the low area requirement and power consumption, a low-overhead refresh scheme is highly encouraged in the DCAM to meet the required search rate. Carbon nanotube transistors have been used in designing a high-density dynamic TCAM (DTCAM) [28], but this requires excessive device property adjustment and is not compatible with existing interfacing circuits. The following design and performance aspects are considered to improve the applicability and refresh characteristics of the dynamic CAM:
1) A decoupled dataline-searchline structure is employed to avoid charge sharing between the I/O capacitances of the search transistors. A fully ternary approach (local and global masking) is followed to meet modern application requirements. Section II describes the storage cell, its detailed functionality, limitations and the drivers used in designing the DTCAM architecture.
2) Conventional dynamic random access memory (DRAM) refresh is inherited by the available DCAMs. The scheme first equalizes the bitlines to half-VDD, then reads the data values through sense amplifiers (S/As) and finally writes the values back into the cells. This refreshing (equalization-read-restore) suffers from two major performance issues: 1) higher refresh overhead; 2) faults due to bitline coupling. A pipelined read-restore refresh scheme is introduced in Section III which improves the search rate by eliminating the additional equalization phase.
3) An in-depth performance analysis and comparison among the relevant designs are carried out in Section IV with both core cells (NAND and NOR) and with consideration of variation and macro size to prove the efficacy of the design.
Scalability issues in CMOS fundamentally need a reduction in the design components [29]. DCAMs have not been exploited much due to their lower search speed, but the attractive feature of low cell area earns them a place in the hardware search engine class. Such a design is presented and analyzed in this paper; it provides reliable, fully dynamic data and mask storage with low-overhead refreshing.

II. DYNAMIC TERNARY CAM

A basic structure of a dynamic CAM architecture is shown in Fig. 1. The writing scheme, storage and matchline sensing are similar to those of a conventional CAM except for the precharge and search control. A hold signal, set through retention time profiling, is provided by the refresh module to the DCAM macro to block the search by deactivating the associated modules. The proposed structure adds a full ternary approach and the self-refresh module to the earlier DCAMs [18]–[20], [22]. The DTCAM cell is shown in Fig. 2, which also illustrates the parasitics of the search transistors (T3 and T4) and the coupling capacitors between the storage nets (N1 and N2) and the bitlines (BLs). A large amount of the cell area is consumed by the interconnections and routing at lower technology nodes. A conventional TCAM has 16T/18T (regular/transmission gate search transistors) with 8 I/Os (bitlines, searchlines, separate mask and data writelines with IN and OUT matchlines). Thus, most of the core area is occupied by the TCAM array. Traditionally, dynamic CAMs use a coupled bitline-searchline approach with a reduced number of transistors (4T/6T) [16], [18], [20]. This approach is beneficial for high density requirements but encounters coupling between the drain and gate capacitances of the search transistors during data write.

Fig. 2. Dynamic ternary content addressable memory showing write and search transistor parasitics.

The presented DTCAM comprises decoupled data, mask and search lines. The forced bitline values (BL and BL̄) are written to the storage nets (N1 and N2), respectively, unlike the cross-coupled latch in SRAM. Parasitics from gate-source and gate-drain (CGS3, CGD3 for transistor T3 and CGS4, CGD4 for T4) have a great impact on the net charges during the phase change. In particular, when BL and SL are coupled together, frequent switching occurs in the global line that affects the storage net charges. Decoupling the sources of T1–T3 and T2–T4 is necessary for a higher retention time. The storage nets in Fig. 2 are shown as connected to the refresh module through dotted lines, which are used for gate control during the refresh duration (the detailed functionalities are discussed in Section III).

Swapped-XOR matching for search [30] is faster, but the design of the refresh module is challenging due to coupling. The mask transistor (T7) has been provided to make the design fully ternary (global and local). A NAND-type ML scheme with matchline pass transistors (T5 and T6) has been used for discussion and the results are compared with the NOR-type ML scheme in Section IV. The data storage in the presented DTCAM has been emulated from the 5T DCAM [19], with the advantage of a fully ternary storage and an improved low-overhead refresh scheme. The DTCAM functionality is summarized in Table I and discussed in brief in accordance with the various operational phases.
1) Write: During the write phase, the bitlines (BL and BL̄) are written into the storage nets. Write control (W) is deselected during non-write phases and the prefix bit (P) is deselected in the local mask state as depicted in Fig. 3. The mask bit (M) is either "0" for the no-mask state or "1" in case of masking, and is selected from the SRAMs. The data-mask bit register holds the data (D) and prefix values. A 32-bit dataline-maskline driver is used for the 4 kb memory. The searchlines (SL and SL̄) are held low to block the ML value passage through transistor T5. A high value of M turns ON the ML path through T6.

It also acts as a gating operation to deselect the bitlines from write. This concept has a power advantage over regular write as described in [8].
2) Precharge: All matchlines are pulled up to VDD during this phase. Both bitlines and searchlines are deselected during this phase. The bitlines can retain their original state, but this requires the write driver to act. Local masking in all cells of an entry will result in considerable short-circuit current, but it is an unlikely event. The precharge as well as the search phase are disabled during the refresh (REF=1) duration and are noted as floating in Table I.
3) Search: A complementary search value (SL ≠ SL̄) results in a regular match or mismatch state. In a global masking state, the search lines are set equal (SL=SL̄=1 for NAND in the present discussion and SL=SL̄=0 for NOR core cells), which provides a high gate voltage to the matchline pass transistor (T5) for all storage states. In case of local masking, a hit (match) occurs in the corresponding matchline segment for all stored and search values.

TABLE I
STATE TABLE OF A DYNAMIC TERNARY CONTENT ADDRESSABLE MEMORY (*: SAME BITLINES AND SEARCHLINES)

Fig. 3. Dataline-maskline driver for a DTCAM.

The dynamic TCAM array structure and the matchline sense amplifier (MLSA) are shown in Fig. 4. The DTCAM bank is supplied with the BL, BL̄ and M values from the 32-bit coupled dataline-maskline driver. Storage nets N1 and N2 are connected to the search transistors as well as to the read-decoupled refresh transistors (RC1 and RC2). The read wordline (RWL) is activated only during refreshing, and the equalized readline values (Q and Q̄) change depending on the storage pattern and are sensed by the column S/As. The refresh timing and detailed functionalities are discussed in the next section. A latch-type voltage mode MLSA [14] has been used for faster access. It has a monoport charge-up sensing structure useful for TCAMs where weak signals drive the matchline control transistors. A refresh module with low search overhead is necessary for a potential large-scale search. An estimation of the area and refresh time requirement is presented based on the TCAM array size.

Fig. 4. TCAM array structure of proposed design.

III. PIPELINED READ-RESTORE REFRESH SCHEME

The storage net charge leaks over time through the coupling parasitics at the bitline and search pass transistors. Thus, the DCAM cell eventually loses its data or produces a false match (flipped value). The DCAM cells must be refreshed periodically, on or before the refresh interval, to retain the stored data. The refresh structure is designed based on the retention time profiling used for the estimation of the refresh interval.

Fig. 5. Dependency and significance of retention time (TRET) over environment variation.

A. Retention Time Profiling

Retention time (TRET) profiling is key in designing the timing and control unit. The significance of the various parameters on which retention time depends is not the same as in DRAM due to the access and storage schemes. The simulated TRET variation over these parameters and their significance are plotted in Fig. 5. The dependency on size, supply, temperature

and process variation is similar to that of DRAM. Variable retention time does not incur a cell failure as the shortest retention time is still larger. The charge stored in a cell is vulnerable to coupling through data pattern dependency (DPD), which is significant in DRAM due to the column read [column access strobe (CAS)] through S/As. A DTCAM, on the contrary, matches across a word (row) and the stored data are never sensed directly. The charge sensing during refresh is performed through decoupled read transistors as shown in Fig. 4. Frequent search in a DTCAM affects the retention time due to the associated global SL capacitance. A 10% safety measure is considered in the search duration to maintain reliable data retention.

B. Refresh Module

The timing of the refresh duration and the associated signal variation are shown in Fig. 6. A readline precharge is executed in every dynamic memory at the start of each refresh duration. The improvement in refreshing time is achieved by performing read and write in alternate cycles during the refresh. The refresh scheme is elaborated based on the structure shown in Fig. 7 and the refresh timing diagram (Fig. 6). The structure primarily contains three modules: 1) refresh driver; 2) write/refresh control; 3) write driver, which is also shared during the regular write operation. The refreshing is discussed with respect to these modules for an 8-entry DTCAM.

Fig. 6. Port charge variation prior to, during and after the refresh duration.

1) Refresh Driver: An active low read precharge signal (R_PRE) pulls up both readlines (Q and Q̄) to VDD as shown in Fig. 6. Depending upon the storage pattern, either RC1 or RC2 discharges one of the readlines. The nMOS transistors connected to nets N1 and N2 are scaled to have a lower threshold voltage, and the charge variations are sensed by the column sense amplifier [31] present in the refresh driver. The strengthened sensed-out values (SO and SŌ) are then sent to the write/refresh control. The values SO and SŌ can be visualized as the data (D and D̄) values during the regular write phase. The refresh driver is precharged in pipeline for the subsequent read while the present row is being written.
2) Write/Refresh Control: The write/refresh control is used for switching between regular write and sensed-out data write during refresh. The active high refresh signal (REF) activates the upper TG to pass the SO and SŌ values [through (A) and (B), respectively]. The register data for regular data write is denoted as "Reg".
3) Write Driver: The driver used for writing the refreshed data is the same as the dataline-maskline driver shown in Fig. 3. The outputs of the write/refresh control for the SO/Reg values are coupled together and provided as the register value to the write driver. The write driver is only partially illustrated in Fig. 7 as it is discussed earlier in Section II. It can be observed from Fig. 6 that the write control (W) signal is enabled during the whole refresh duration, which writes the SO/SŌ values into the DTCAM cells.
Search drivers are disabled during the refresh and all matchlines are set at mismatch to avoid any false match. Use of the column sense amplifier leads to a small area overhead, but it is essential to provide a low search overhead. The sensed-out values (SO and SŌ) are discharged after the refresh. The readlines (Q and Q̄) are also released to avoid any false coupling through the refresh transistors (RC1 or RC2) during non-refresh states. The write signal is deselected at the end of the refresh duration, which allows the regular CAM search.

C. Estimation of Area and Refresh Overhead

Estimation of the area and refresh overhead is carried out for the DTCAM refresh module. The estimation takes the module layout size and the average interconnection routing space into consideration. The data provided here are normalized and can be taken as a standard for all DTCAMs of any matchline size at any technology node.
The area overhead is the relative percentage of extra space acquired by the refresh modules and the associated routing space. Most of the interconnections required by the refresh structure are shared by the DTCAM bank drivers and hence are not considered in the estimation. The primary modules in the architecture

shown in Fig. 7 are: 1) a DTCAM cell; 2) write and search drivers; 3) row decoder; 4) matchline sense amplifier; 5) refresh driver and controller. The size dependency for an M-word × N-bit DTCAM can be represented as:

DTCAM bank → M × N × 1-bit DTCAM cell (A)
Search driver → N × 1-bit searchline driver (B)
Write driver → N × 1-bit dataline–maskline driver (C)
Row decoder → $\left(\sum_{Z=4}^{M} \frac{M}{Z}\right)$ × M × 2-input NAND (D)
Matchline S/A → N × 1-entry MLSA (E)
Refresh driver → N × 1-bit refresh driver (F)
Write/refresh control → N × 2 × 1-bit controller (G)

Fig. 7. Refresh module of the proposed architecture.

Most of these are dependent on the matchline size. The row decoder is reliant on the words/entries and the DTCAM bank is array size dependent. The relative design space difference between the refresh module and the other modules can be written as:

$$
\begin{aligned}
OH_A &= 1 - \frac{\left[MNA + NB + NC + \left(\sum_{Z=4}^{M} \frac{M}{Z}\right) MD + NE\right] - (NF + 2NG)}{MNA + NB + NC + \left(\sum_{Z=4}^{M} \frac{M}{Z}\right) MD + NE} \\
&= 1 - \frac{N\left[(MA + B + C + E) - (F + 2G)\right] + \left(\sum_{Z=4}^{M} \frac{M}{Z}\right) MD}{N\left[(MA + B + C + E)\right] + \left(\sum_{Z=4}^{M} \frac{M}{Z}\right) MD}
\end{aligned}
\tag{1}
$$

The overhead is mostly ML size dependent, as concluded from Equation (1), which allows the designer to increase the entry size to the maximum. The time required for refresh is word (M) dependent. The delay through the refresh driver and the write/refresh control decides the entry refresh time. The refresh overhead is estimated based on the worst-case entry refresh for NAND and NOR MLs. The overhead can be calculated as:

$$
OH_R = \left[\frac{(\text{1-entry refresh time}) \times \text{Entries}}{\text{Refresh interval}} \times 100\right] \% \tag{2}
$$

The presented DTCAM structure has been designed using the predictive 45-nm CMOS technology. Substituting the areas of all the 1-bit/1-entry modules designed in 45-nm into Equation (1), the area overhead can be restructured as

$$
OH_A(45\text{-nm}) = 1 - \frac{[15N(M + 5) + 22M] - 40N}{[15N(M + 5) + 22M]} \tag{3}
$$

The area overhead is almost independent of the TCAM array size irrespective of NAND or NOR-type matchlines. The refresh overhead is dependent on it but has been minimized to a great extent through the proposed refresh scheme.

IV. RESULTS AND PERFORMANCE COMPARISON

The proposed structure has been implemented using the generic process design kit (GPDK) 45-nm CMOS technology. An extensive performance comparison with relevant dynamic CAMs (decoupled 4T [18] and 6T DCAM [21]) has been carried out, with a reorganization for both NOR and NAND-type ML sensing, to prove the efficacy of the proposed design. Arrays of 128×32-bit DCAM structures have been designed using the same technology for comparison in the same environment (PVT). Transistors with standard thresholds (0.36 and –0.4 V) and sizes (120/45 nm) have been used in all the compared designs, except in IV-C, for a legitimate analysis. The structures are scaled from 8-bit to 64-bit for testing their stability and integrity. Energy-delay analyses are made on

the compared DCAMs with different transistor sizes. A feature and performance comparison summary is made to present the advantages of the proposed DTCAM over the referred designs.

A. Refresh Scheme Performance

As a design demonstration, search is performed prior to and after refresh; the DTCAM performance is presented in Fig. 8. The DTCAM cell net charge degrades over time and affects the matchline delay. The low match time after refresh is obtained through the use of the column S/As shown in Fig. 7. A NAND core cell, which has 8 I/O ports, is shown in Fig. 8; the NOR counterpart does not have the matchline of prefix segment (MLP) port. The performance results shown here are from the extracted layout of an 8 × 8 NOR-type DTCAM and the results are compared in brief with the NAND structure. The good search frequency performance of the 'NOR' design betters the good power performer 'NAND', and all the comparisons with the referred structures are made using this scheme except in IV-F.

Fig. 8. DTCAM core cell layouts, refresh performance and NAND versus NOR-ML scheme performance comparison.

The refresh module is designed at an additional area of 18% as illustrated in Fig. 9. The extremely low refresh overhead is due to the smaller DTCAM size of 8 × 8-bit (≈ 0.43% for 256 rows). During the write phase, the sensed-out values (SO and SŌ) are written into the DTCAM cells through the write driver and a simultaneous precharging occurs on the readlines (Q and Q̄). Hence, a low write energy dissipation in carrying out these operations can be noted, which is a certain advantage of using the proposed refresh scheme. The low leakage in the dynamic cells gives an added benefit and compensates for the dissipation during the refreshing. The read energy is dependent on the S/A consumption and is almost twice the write dissipation.

Fig. 9. Extracted layout and performance results of the proposed architecture.

B. Power Dissipation Analysis

The CAM power performance is characterized by the consumption during search as well as the accompanying precharge. This consumption occurs due to the frequent ML voltage switching and is one of the noticeable metrics in the measurement. A power distribution comparison over a temperature range of –20 °C to 100 °C is plotted in Fig. 10. A high amount of short-circuit current flows in the 6T DCAM [21] during the phase change from precharge to search, which puts it on the higher side in both the peak and average power dissipation metrics. On the other hand, both the decoupled 4T [18] and the proposed structure have gate-controlled ML discharge and result in low power dissipation.

Fig. 10. Power dissipation dependency over temperature variation.

Fig. 11. Power dissipation distribution comparison between (a) Static TCAM. (b) Proposed dynamic TCAM.

Static TCAMs have the attractive feature of a higher hit rate but have two leaky static storages for data and mask. A larger TCAM macro hence consumes more power, and this becomes significant at lower technology nodes [2]. Dynamic CAMs solve this issue, and it can be seen from Fig. 11 that the DTCAM has 19% less leakage than the static TCAM. At higher search frequency, short-circuit power becomes pronounced and the lower value in the proposed structure ensures better

power performance. The energy dissipation over the mentioned temperature range is illustrated in Fig. 12(a). A low average change over the variation can be noted in the proposed DTCAM, with the lowest dissipation among all the compared structures.

Fig. 12. Power distribution and stability analysis. (a) Energy dissipation comparison over temperature variation. (b) Operational phase dependent power distribution comparison. (c) Peak power analysis over process corner variation.

Fig. 12(b) shows the normalized power distribution among all three operational phases (write, precharge and search). The 6T DCAM consumes less power during write but has notable dissipation in the search phase. The proposed structure has the least dissipation during this phase, which improves the energy per search (EfS) metric. The decoupled 4T has the highest write dissipation among the compared designs, which is a drawback as the dynamic CAM goes through frequent cell restore.
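As a cross-check of the overhead figures quoted in Sections III-C and IV-A, the 45-nm area overhead expression (3) and the refresh overhead expression (2) can be evaluated numerically. The sketch below is illustrative only: the function and argument names are assumptions introduced here, and the comment OH_A = 40N / [15N(M + 5) + 22M] is simply an algebraic simplification of (3). For a 128-word × 32-bit array it gives roughly 1.9%, and for an 8 × 8 array roughly 18%, in line with the values reported in the text.

```python
def area_overhead_45nm(n_bits: int, m_words: int) -> float:
    """Area overhead fraction from Eq. (3).

    With T = 15N(M + 5) + 22M, Eq. (3) reads 1 - (T - 40N)/T, i.e. 40N/T.
    """
    total = 15 * n_bits * (m_words + 5) + 22 * m_words
    return 1.0 - (total - 40 * n_bits) / total


def refresh_overhead_percent(entry_refresh_time: float,
                             entries: int,
                             refresh_interval: float) -> float:
    """Refresh overhead in percent from Eq. (2).

    Both times must be given in the same unit; the values themselves are
    design-specific and are not supplied here.
    """
    return entry_refresh_time * entries / refresh_interval * 100.0


if __name__ == "__main__":
    for m_words, n_bits in [(128, 32), (8, 8)]:
        oh = 100.0 * area_overhead_45nm(n_bits, m_words)
        print(f"{m_words} x {n_bits}: {oh:.2f}% area overhead")
```

Because (2) is linear in the number of entries, doubling the rows at a fixed entry refresh time and refresh interval doubles the refresh overhead, whereas (3) changes only weakly with M once the array term 15NM dominates.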
C. Sensitivity to Process Corner and Technology Variation

The compared design performances have been analyzed at various process corners (FF-TT) as depicted in Fig. 12(c) and 13. The 6T DCAM exhibits a large change in the peak power of 46.9%, whereas the proposed scheme has a smaller variation of 25%. The design presented in [18] is vulnerable to process variation, while the proposed structure performs with a lower search rate at slow corners. The cell has a ternary structure and the ML voltage has to be discharged through both evaluation transistors (from data and mask storage), unlike the cells in [18] and [21], which results in a higher ML delay, but the ternary storage is essential to meet the network and pattern matching application requirements. The NAND-type ML scheme does not suffer from this issue, and the proposed DTCAM with this scheme performs faster than the compared designs as discussed later in this section. Table II presents the compared design performances at various channel lengths, at the aspect ratios presented in [14]. Device scaling of the presented DCAMs is not performed in the other analyses for a fair comparison. From the table it is clear that at larger channel sizes the proposed design performs at a higher hit rate. It dissipates the least at all transistor sizes among the compared DCAMs, which ensures its stable power performance at any technology node.

Fig. 13. Energy dissipation and EDP comparison at various process corners (FF: fast corner; FS: fast nMOS, slow pMOS; SF: slow nMOS, fast pMOS; SS: slow corner; TT: typical corner).

TABLE II
EFFECT OF DEVICE SCALING ON THE ENERGY DISSIPATION [fJ] AND ENERGY-DELAY [fJ×ns] OF COMPARED DCAMs

D. Supply Voltage Scaling Performance

Wireless devices are expected to operate at low supply voltages without significant degradation in the search performance. Transistor threshold becomes crucial at lower technology nodes for a good power-delay trade-off. The supply voltage scaling performances of the compared NOR-type designs (1.2 to 0.6 V) are plotted in Fig. 14 and those of the NAND-DCAMs (1.2 to 0.8 V) in Fig. 18. All compared structures are designed with the same thresholds (0.36 V for nMOS and –0.4 V for pMOS) for an efficient comparison. The peak power reduces linearly for both the 6T DCAM and the proposed structure, but an increment can be observed in the decoupled 4T design at lower supply voltages. It also has an increased energy dissipation

below 0.8 V supply. The coupled bitline-searchline approaches are therefore least preferred for low voltage operations. Higher delay in the proposed design due to the ternary structure results in the trivial peaks in Fig. 14(c).

Fig. 14. Design sensitivity to supply voltage scaling from 1.2 to 0.6 V. (a) Peak power variation. (b) Energy dissipation. (c) Energy delay product variation.

Fig. 15. Variation of the matchline voltage and evaluation current over 1000 runs of MC sampling method.

Fig. 16. Scattergram of the energy dissipation versus matchline delay on 1000 runs of MC sampling method.

E. DTCAM Performance Over Variation

Besides the environment variation (PVT), local mismatches can cause independent random variations that can change the design behaviour. The Monte-Carlo (MC) sampling method helps in assessing the design reliability over these variations in a rigorous way. The MC method is applied for 1000 runs with a random search key. The variation of the matchline voltage and the evaluation current during the phase change are plotted in Fig. 15. The proposed architecture has shown good stability in the matchline delay with a maximum search time of 0.794 ns. A peak variation of 14% can be noted in the evaluation current.
A scattergram of the two most important metrics under consideration (EfS and matchline delay) is illustrated in Fig. 16. It shows less scattered points in the energy per search metric with moderate variation in the ML delay. The negative slope (m = –1.08) ensures a good energy-delay trade-off. The matchline delay averaged over 1000 MC runs with 3 sigma (3σ) dispersion is depicted in Fig. 17. A standard deviation of 41.22 ps in the matchline delay with a mean value of 653.27 ps indicates the robustness of the design.

Fig. 17. ML delay averaged over 1000 MC runs illustrating the 3σ dispersion.
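The reported Monte-Carlo statistics can be turned into a 3σ delay window with a short calculation. This is a sketch that takes the mean (653.27 ps) and standard deviation (41.22 ps) from the text; the helper name and the use of a symmetric mean ± 3σ band are assumptions made here for illustration, and no particular delay distribution is implied by the source.

```python
from typing import Tuple


def three_sigma_window(mean_ps: float, sigma_ps: float) -> Tuple[float, float]:
    """Return the (lower, upper) bounds of the mean +/- 3*sigma band in ps."""
    return mean_ps - 3.0 * sigma_ps, mean_ps + 3.0 * sigma_ps


# Values quoted in Section IV-E for the matchline delay over 1000 MC runs.
MEAN_PS = 653.27
SIGMA_PS = 41.22

low_ps, high_ps = three_sigma_window(MEAN_PS, SIGMA_PS)
relative_spread = SIGMA_PS / MEAN_PS  # sigma as a fraction of the mean

if __name__ == "__main__":
    print(f"3-sigma window: {low_ps:.2f} ps to {high_ps:.2f} ps")
    print(f"sigma/mean: {100.0 * relative_spread:.2f}%")
```

With these inputs the band spans roughly 530 ps to 777 ps, and the standard deviation is about 6.3% of the mean delay.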

F. TCAM Core Cell Comparison

The key provided in search is compared bit by bit with all entries in the TCAM array. When all the bits in one entry match the search key, a hit is indicated. The hit or match signal [32] can be written as

$$
M = [SL(N-1) \odot BL(N-1)] \cdot \ldots \cdot [SL(0) \odot BL(0)] \tag{4}
$$

Similarly, the mismatch signal is expressed as

$$
MM = [SL(N-1) \oplus BL(N-1)] + \ldots + [SL(0) \oplus BL(0)] \tag{5}
$$

Most designers prefer NOR core cells over the NAND-type for their lower matchline delay, but the NAND counterpart has the advantages of lower energy dissipation and better stability. A comprehensive comparison between these is summarized in Table III. A miss at any cell stops the serial matchline path in a NAND-type ML, which reduces the switching in subsequent cells and is the primary reason for low power consumption. In a DCAM, this includes lower refresh energy dissipation and lower peak current, with 35.79% and 12.3% reductions over the NOR-type structure, respectively, as presented in Fig. 8. The extremely low dissipation results in lower energy-delay for all structures, with the lowest difference in the case of the proposed structure.

TABLE III
TCAM CORE CELL PERFORMANCE OF COMPARED DYNAMIC CAMs

In a NOR-type matchline scheme, the parallel combination of ML capacitance and pull-down resistance through each core cell in the mismatch state helps improve the discharge rate with a higher number of bit mismatches, which increases the search rate and allows a higher search frequency. The ability to function at lower supply voltages in combination with lower ML delay makes the NOR-type ML structure superior to the NAND-type DCAM. NAND-type designs do not perform well at lower supply voltages due to the serial passage of match data, and hence the illustration of matchline delay in Fig. 18 is made for supply voltage scaling from 1.2 to 0.8 V. The NAND cells are cascaded to form a matchline in which a match in a cell passes the match value "0" to the subsequent segment. A large serial resistance is created in this state and the weak nets in DCAMs control each stage evaluation, resulting in a higher ML delay. The coupled 4T DCAM does not function below 0.9 V and has a higher ML delay at other supply voltages. Higher delay limits the number of searches per second and makes the NAND-type DCAMs less suitable for forming large TCAM arrays.

Fig. 18. Energy-delay comparison between compared DCAMs with NAND-type ML sensing at supply voltage scaling from 1.2 to 0.8 V.

G. Performance Comparison Summary

The proposed structure has a dynamic TCAM storage that provides lower energy dissipation at a comparable search rate. Lower leakage than the static TCAM, together with the low core cell area, allows transistor channel shrinkage to ensure higher design density. The proposed DTCAM is shown to be a better power performer in the mentioned temperature and supply voltage range. Device sizing improves the proposed structure performance as discussed in IV-C, and the structure is sturdy over supply voltage scaling. It also shows good stability over temperature variation at various process corners. The NAND-type ML scheme suffers from higher matchline delay, while the NOR-DCAM suffers from higher energy dissipation. The proposed structure performs better than the compared designs in both respects with the respective core cells.

Energy-delay performance comparisons at various DCAM macro sizes are illustrated in Fig. 19. The proposed design performs better in both energy and delay measures from 8-bit to 64-bit matchline size. This affirms the cascading capability to form larger TCAMs. Features and performances of the proposed structure are compared with recently published relevant designs and summarized in Table IV. The dynamic structure with a low-overhead refresh module provides a good design density. Low leakage through TCAM cells is reflected in the energy dissipation

Fig. 19. Energy-delay performance comparison at various CAM macros.
TABLE IV
FEATURE AND PERFORMANCE COMPARISON SUMMARY OF REFERRED DESIGNS

metric, where the proposed DTCAM has the least EfS among all the referred designs. A matchline delay of 0.65 ns is acceptable considering the lower EDP metric.

The following conversion estimations, inspired by [33], are used to provide normalized energy dissipation and ML delay. They render a fair comparison among the referred architectures designed with any technology and tested at any supply voltage.

$$\text{Normalized EfS} = \text{EfS} \times \left(\frac{45}{\text{ref. tech.}}\right) \times \left(\frac{1}{V_{DD}}\right)^{2} \tag{6}$$

Similarly, the normalized matchline delay can be estimated as

$$\text{Normalized MLD} = \text{MLD} \times \left(\frac{45}{\text{ref. tech.}}\right) \times \left(\frac{V_{DD}}{1}\right) \tag{7}$$

Equation (7) is valid for both binary and ternary CAMs with a NAND-type structure. But when a NOR-type ML scheme is used for a TCAM [8], the ON-state pull-down resistance (R_ML) increases. Therefore, Equation (7) for ternary CAMs with a NOR matchline structure can be estimated as

$$\text{Nor. MLD}_{TNOR} = 2 \times \text{MLD} \times \left(\frac{45}{\text{ref. tech.}}\right) \times \left(\frac{V_{DD}}{1}\right) \tag{8}$$

The design presented in [30] has an AND-type ML structure but with a binary CAM cell, which results in the lower EDP. The referred structures presented in [4] and [14] are proficient for high density requirements, but the proposed design outperforms them considering the low energy-delay.

V. CONCLUSION

A dynamic ternary CAM with a low-overhead refresh scheme is presented in this paper. As CMOS size shrinkage is inevitable in the modern information age, leakage becomes more pronounced and lessens the scope of high density TCAM implementation. DCAMs certainly are the best alternatives, other than complex non-CMOS technologies, to fill this gap, but they have the issue of periodic cell refreshing. A pipelined read-restore refresh scheme is proposed which reduces both area and refresh overheads. It accommodates a larger number of searches per second while reducing the refresh energy dissipation by avoiding the extra equalization phase from conventional refreshing. The proposed structure is compared with other dynamic CAMs to prove the efficacy of its performance, and it stands out as the best performer with both NAND- and NOR-type core cells. Stability analyses were performed with all environment variations, transistor sizes and TCAM macros. A low average change in the energy-delay for matchline size variation (8-bit to 64-bit) secures the cascadability of the proposed structure. With a low energy dissipation of 0.583 fJ/bit/search, the DTCAM is competent enough to perform in the low power TCAM search engine class.

REFERENCES

[1] V. Gaudet, “A survey and tutorial on contemporary aspects of multiple-valued logic and its application to microelectronic circuits,” IEEE Trans. Emerg. Sel. Topics Circuits Syst., vol. 6, no. 1, pp. 5–12, Mar. 2016.
[2] R. Karam, R. Puri, S. Ghosh, and S. Bhunia, “Emerging trends in design and applications of memory-based computing and content-addressable memories,” Proc. IEEE, vol. 103, no. 8, pp. 1311–1330, Aug. 2015.
[3] Z. Ullah, K. Ilgon, and S. Baeg, “Hybrid partitioned SRAM-based ternary content addressable memory,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 12, pp. 2969–2979, Dec. 2012.
[4] C.-C. Wang, C.-H. Hsu, C.-C. Huang, and J.-H. Wu, “A self-disabled sensing technique for content-addressable memories,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 1, pp. 31–35, Jan. 2010.
[5] A.-T. Do, S. Chen, Z.-H. Kong, and K. S. Yeo, “A high speed low power CAM with a parity bit and power-gated ML sensing,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 1, pp. 151–156, Jan. 2013.
[6] S. K. Maurya and L. T. Clark, “A dynamic longest prefix matching content addressable memory for IP routing,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 6, pp. 963–972, Jun. 2011.
[7] S. Matsunaga et al., “Standby-power-free compact ternary content-addressable memory cell chip using magnetic tunnel junction devices,” Appl. Phys. Exp., vol. 2, no. 2, p. 023004, Feb. 2009.
[8] Y.-J. Chang, K.-L. Tsai, and H.-J. Tsai, “Low leakage TCAM for IP lookup using two-side self-gating,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 6, pp. 1478–1486, Jun. 2013.
[9] H. Jarollahi et al., “A nonvolatile associative memory-based context-driven search engine using 90 nm CMOS/MTJ-hybrid logic-in-memory architecture,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 4, no. 4, pp. 460–474, Dec. 2014.
[10] Y.-J. Chang and T.-C. Wu, “Master–Slave match line design for low-power content-addressable memory,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 9, pp. 1740–1749, Sep. 2015.
[11] Y.-J. Chang, “Using the dynamic power source technique to reduce TCAM leakage power,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 11, pp. 888–892, Nov. 2010.
[12] K. Noda, K. Matsui, K. Takeda, and N. Nakamura, “A loadless CMOS four-transistor SRAM cell in a 0.18-μm logic technology,” IEEE Trans. Electron Devices, vol. 48, no. 12, pp. 2851–2855, Dec. 2001.
[13] I. Arsovski, T. Chandler, and A. Sheikholeslami, “A ternary content-addressable memory (TCAM) based on 4T static storage and including a current-race sensing scheme,” IEEE J. Solid-State Circuits, vol. 38, no. 1, pp. 155–158, Jan. 2003.
[14] S. Mishra, T. V. Mahendra, and A. Dandapat, “A 9-T 833-MHz 1.72-fJ/bit/search quasi-static ternary fully associative cache tag with selective matchline evaluation for wire speed applications,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 63, no. 11, pp. 1910–1920, Nov. 2016.
[15] L. Frontini, S. Shojaii, A. Stabile, and V. Liberali, “A new XOR-based content addressable memory architecture,” in Proc. 19th IEEE Int. Conf. Electron., Circuits Syst. (ICECS), Dec. 2012, pp. 701–704.
[16] J. P. Wade and C. G. Sodini, “Dynamic cross-coupled bit-line content addressable memory cell for high-density arrays,” IEEE J. Solid-State Circuits, vol. SCC-22, no. 1, pp. 119–121, Feb. 1987.
[17] H. Noda et al., “A 143 MHz 1.1 W 4.5 Mb dynamic TCAM with hierarchical searching and shift redundancy architecture,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2004, pp. 208–523.
[18] M. Chae, J.-W. Lee, and S. H. Hong, “Decoupled 4T dynamic CAM suitable for high density storage,” Electron. Lett., vol. 47, no. 7, pp. 434–436, Mar. 2011.
[19] V. Vinogradov, J. Ha, C. Lee, A. Molnar, and S. H. Hong, “Dynamic ternary cam for hardware search engine,” Electron. Lett., vol. 50, no. 4, pp. 256–258, Feb. 2014.
[20] J. G. Delgado-Frias, J. Nyathi, and T. Sb, “Decoupled dynamic ternary content addressable memories,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 10, pp. 2139–2147, Oct. 2005.
[21] S. Hanzawa, T. Sakata, K. Kajigaya, R. Takemura, and T. Kawahara, “A large-scale and low-power CAM architecture featuring a one-hot-spot block code for IP-address lookup in a network router,” IEEE J. Solid-State Circuits, vol. 40, no. 4, pp. 853–861, Apr. 2005.
[22] V. Lines et al., “66 MHz 2.3 M ternary dynamic content addressable memory,” in Proc. Rec. IEEE Int. Workshop Memory Technol., Des. Test., Aug. 2000, pp. 101–105.
[23] H. Noda et al., “A cost-efficient high-performance dynamic TCAM with pipelined hierarchical searching and shift redundancy architecture,” IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 245–253, Jan. 2005.
[24] Y. Riho and K. Nakazato, “Partial access mode: New method for reducing power consumption of dynamic random access memory,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 7, pp. 1461–1469, Jul. 2014.
[25] I. Bhati, M.-T. Chang, Z. Chishti, S. L. Lu, and B. Jacob, “DRAM refresh mechanisms, penalties, and trade-offs,” IEEE Trans. Comput., vol. 65, no. 1, pp. 108–121, Jan. 2016.
[26] Y.-H. Gong and S. Chung, “Exploiting refresh effect of DRAM read operations: A practical approach to low-power refresh,” IEEE Trans. Comput., vol. 65, no. 5, pp. 1507–1517, May 2016.
[27] A. Teman, P. Meinerzhagen, R. Giterman, A. Fish, and A. Burg, “Replica technique for adaptive refresh timing of gain-cell-embedded DRAM,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 61, no. 4, pp. 259–263, Apr. 2014.
[28] D. Hellkamp and K. Nepal, “Metallic tube-tolerant ternary dynamic content-addressable memory based on carbon nanotube transistors,” IET Micro Nano Lett., vol. 10, no. 4, pp. 209–212, Mar. 2015.
[29] S. Jeloka, N. B. Akesh, D. Sylvester, and D. Blaauw, “A 28 nm configurable memory (TCAM/BCAM/SRAM) using push-rule 6T bit cell enabling logic-in-memory,” IEEE J. Solid-State Circuits, vol. 51, no. 4, pp. 1009–1021, Apr. 2016.
[30] A. Agarwal et al., “A 128 × 128 b high-speed wide- and match-line content addressable memory in 32 nm CMOS,” in Proc. ESSCIRC, Sep. 2011, pp. 83–86.
[31] B. Wicht, T. Nirschl, and D. Schmitt-Landsiedel, “Yield and speed optimization of a latch-type voltage sense amplifier,” IEEE J. Solid-State Circuits, vol. 39, no. 7, pp. 1148–1158, Jul. 2004.
[32] N. Mohan, “Low-power high-performance ternary content addressable memory circuits,” Ph.D. dissertation, Dept. Elect. Comput. Eng., Univ. Waterloo, Waterloo, ON, Canada, 2006.
[33] T.-S. Chen, D.-Y. Lee, T.-T. Liu, and A.-Y. Wu, “Dynamic reconfigurable ternary content addressable memory for openflow-compliant low-power packet processing,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 63, no. 10, pp. 1661–1672, Oct. 2016.

Sandeep Mishra (M’14) received the B.Tech. and M.Tech. degrees in electronics and communication engineering from the Biju Patnaik University of Technology, Rourkela, India, in 2011 and 2013, respectively. He is currently pursuing the Ph.D. degree with the Department of Electronics and Communication Engineering, National Institute of Technology at Meghalaya, Shillong, India.
His research interests include low-power memory design, high-speed sense amplifier, and intelligent transportation system.

Telajala Venkata Mahendra (M’16) received the B.Tech. degree in electronics and communication engineering from JNTU, Kakinada, India, in 2013, and the M.Tech. degree in VLSI design from the National Institute of Technology at Meghalaya, Shillong, India, in 2016, where he is currently pursuing the Ph.D. degree with the Department of Electronics and Communication Engineering.
His research interests include the design of low-power VLSI circuits, content addressable memories, volatile memories, and digital circuits.

Jyotishman Saikia received the B.Tech. degree in electronics and telecommunication engineering from KIIT University, Bhubaneswar, India, in 2016. He is currently an Assistant Project Engineer with the Department of Electronics and Electrical Engineering, IIT Guwahati, Amingaon, India.
His research interests include the design of memory systems, computer architecture, SoC/NoC, secure hardware, and reconfigurable computing.

Anup Dandapat (M’10–SM’15) received the Ph.D. degree in digital VLSI design from Jadavpur University, Kolkata, India, in 2008.
He is currently an Associate Professor with the Department of Electronics and Communication Engineering, National Institute of Technology at Meghalaya, Shillong, India. He has authored or co-authored over 50 national and international journal papers. His current research interests include low-power VLSI design, low-power memory design, and low-power digital design.
