
THE JOURNAL OF CHINA UNIVERSITIES OF POSTS AND TELECOMMUNICATIONS

Volume 15, Issue 1, March 2008

XING Ji-peng, ZOU Xue-cheng, GUO Xu

Ultra-low power S-Boxes architecture for AES


CLC number TN47

Document A

Abstract It is crucial to design energy-efficient advanced encryption standard (AES) cryptography for low power embedded systems powered by limited battery. Since the S-Boxes consume much of the total AES circuit power, an efficient approach to reducing the AES power consumption is to reduce the S-Boxes power consumption. Among various implementations of S-Boxes, the most energy-efficient one is the decoder-switch-encoder (DSE) architecture. In this paper, we refine the DSE architecture and propose a faster, more compact S-Boxes architecture of lower power: an improved and full-balanced DSE architecture. This architecture achieves a low power consumption of 68 μW at 10 MHz using 0.25 μm 1.8 V UMC CMOS technology. Compared with the original DSE S-Boxes, it further reduces the delay, gate count and power consumption by 8%, 14% and 10% respectively. At the same time, simulation results show that the improved DSE S-Boxes has the best performance among various S-Boxes architectures in terms of power-area product and power-delay product, and it is optimal for implementing low power AES cryptography.

Article ID 1005-8885 (2008) 01-0112-06

Keywords AES, S-Boxes, DSE, cryptography, low power

Received date 2007-02-25
XING Ji-peng (✉), ZOU Xue-cheng, GUO Xu
Research Center for VLSI and Systems, Department of Electronic Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
E-mail jpxing@126.com

1 Introduction
AES is a symmetric block cipher standard, which was issued by the National Institute of Standards and Technology (NIST) on November 26, 2001 [1]. There have been many studies on hardware implementations of the AES algorithm using FPGAs [2-4] and ASIC libraries [5, 6].
AES consists of an initial round key addition, Nr-1 regular rounds and a final round, where Nr is 10, 12 or 14 depending on the key length. Each round is composed of sixteen 8-bit S-Boxes computing SubBytes, a 128-bit ShiftRows operation, and four 32-bit MixColumns operations. The equivalent decryption structure has exactly the same sequence of transformations as the encryption structure. The structure of the AES encryption algorithm is shown in Fig. 1, which includes the SubBytes, ShiftRows, MixColumns, AddRoundKey and KeyExpansion units.

Fig. 1 AES encryption architecture
The SubBytes transformation is a non-linear byte substitution that operates on each byte of the state independently. It can be processed by using sixteen byte substitution tables. Each S-Box computes the multiplicative inverse in the Galois field GF(2^8) followed by an affine transformation. ShiftRows is a cyclic shift operation applied to each row with a different offset. The MixColumns function operates on the columns of the state by multiplying the data modulo x^4 + 1 with a fixed polynomial. AddRoundKey is a simple bit-wise XOR operation between the 128-bit round key and the data.
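The definition of the S-Box as "multiplicative inverse followed by an affine transformation" can be sketched in software as follows. This is a minimal reference model, not the hardware architecture discussed in this paper; the function names are illustrative, and the reduction polynomial x^8 + x^4 + x^3 + x + 1 and the affine constant 0x63 are taken from the AES standard [1].

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B  # reduce by the AES polynomial
        b >>= 1
    return result


def gf_inverse(a):
    """Multiplicative inverse in GF(2^8); 0 is mapped to 0 by convention."""
    if a == 0:
        return 0
    # The multiplicative group has order 255, so a^254 is the inverse of a.
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def affine(b):
    """AES affine transformation over GF(2) with the constant 0x63."""
    def rotl(x, n):
        return ((x << n) | (x >> (8 - n))) & 0xFF
    return b ^ rotl(b, 1) ^ rotl(b, 2) ^ rotl(b, 3) ^ rotl(b, 4) ^ 0x63


def sbox(byte):
    """S-Box output: inverse in GF(2^8), then the affine transformation."""
    return affine(gf_inverse(byte))
```

Evaluating `sbox` for all 256 inputs yields the truth table from which LUT, SOP, BDD and DSE style implementations start, whereas composite field implementations realize `gf_inverse` arithmetically.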
Our power estimation results for each primitive AES component are shown in Table 1, where the S-Boxes are implemented in the composite field GF(2^4) [7]. The SubBytes operations clearly account for the largest share of the total power consumption in AES encryption. The same fact was reported in S. Morioka's research [8]. Generally, there are twenty S-Boxes in the basic AES structure: sixteen S-Boxes for the SubBytes unit and another four S-Boxes for the KeyExpansion unit. Therefore, the main concern in designing low power AES cryptography lies in how to design a low power SubBytes transformation, i.e., low power S-Boxes.



Table 1 Power consumption of each AES component
(UMC 0.25 μm 1.8 V CMOS standard cell)
Function unit    Average power/(mW @ 10 MHz)    Ratio/%
SubBytes         24.7                           46
MixColumns       10.7                           20
KeySchedule      9.30                           17
Others           8.90                           17

In this paper, we propose a low power S-Boxes architecture: an improved full-balanced DSE S-Boxes architecture. In this S-Boxes, we focus on logic level and architecture optimization:
1) The signal arrival times at each gate are kept as close as possible to avoid generating dynamic hazards;
2) The reuse ratio of cells is kept as high as possible to decrease the number of gates;
3) OR gates and AND gates are replaced with NOR gates and NAND gates respectively to reduce the size of the gates.
This architecture achieves the lowest power consumption of 68 μW at 10 MHz using 0.25 μm 1.8 V UMC CMOS technology. Compared with the original DSE S-Boxes, it further reduces the delay, gate count and power consumption by 8%, 14% and 10% respectively.

2 Related works
There is a rich literature devoted to the efficient design of cryptographic S-Boxes, which can be divided into three basic approaches.
The first approach constructs the circuit directly from the truth table of the S-Boxes. In the simplest case, an asynchronous ROM with 256 bytes is instantiated for each S-Boxes. Since ROMs have neither good electrical characteristics nor a short response time, combinatorial logic is usually chosen instead for the implementation of S-Boxes. The second approach implements the multiplicative inverse and the affine transform with combinatorial circuits, using look-up tables or the direct relationship between the input and output values of the S-Boxes. The third approach implements the S-Boxes in combinatorial logic by exploiting its arithmetic properties.
For the second approach, the S-Boxes hardware can be derived from its truth table by using two-level logic, such as sum of products (SOP), or by using decision diagrams, such as the binary decision diagram (BDD) [9]. In addition, the decoder-switch-encoder (DSE) structure [10] has been developed, which is more efficient than a straightforward implementation of a hardware look-up table (LUT) in terms of delay and power, although both of them directly use the input-output relations.
For the third approach, the implementation of the multiplicative inverse in the composite field (denoted as GF) [7], which yields compact structures, is well studied as a substitute for the original implementation in the Galois field GF(2^8). Then, by converting some parts of the GF S-Boxes into two-level logic, a power-optimized structure called 3-stage positive polarity Reed-Muller (PPRM) [8] has also been developed.

3 Power analysis of S-Boxes

In order to find an optimal S-Boxes structure that reduces the power consumption, we first investigate the origins of the power dissipation. Since the S-Boxes contains only combinatorial circuits, the power dissipation consists of several main components, as shown in Eq. (1):
$$P_{total} = P_{switching} + P_{internal} + P_{leakage} \qquad (1)$$
where $P_{switching}$ is the dynamic switching power due to the charging or discharging of the load capacitance; $P_{internal}$ is the intrinsic power dissipated within a logic cell; and $P_{leakage}$ is the leakage power due to the reverse-biased junction leakage and sub-threshold leakage.
Dynamic switching power is the dominant active power dissipation component in CMOS circuits, which is defined as [11]
$$P_{switching} = \alpha_{0 \to 1} C_L V_{DD}^2 f \qquad (2)$$
where $\alpha_{0 \to 1}$ is the switching activity; $C_L$ is the load capacitance; $V_{DD}$ is the supply voltage; and $f$ is the frequency of operation.
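As a numerical illustration of Eq. (2), the sketch below evaluates the switching power of a single node. The activity factor and load capacitance are hypothetical values chosen only for illustration; the 1.8 V supply and 10 MHz frequency match the operating point used in this paper.

```python
def switching_power(alpha, c_load, v_dd, freq):
    """Dynamic switching power alpha * C_L * V_DD^2 * f, in watts."""
    return alpha * c_load * v_dd ** 2 * freq


# Hypothetical node: activity 0.2, 50 fF load, 1.8 V supply, 10 MHz.
p_node = switching_power(alpha=0.2, c_load=50e-15, v_dd=1.8, freq=10e6)
print(f"{p_node * 1e6:.3f} uW")

# Halving the activity halves the power; this is why suppressing
# spurious transitions pays off directly.
p_half_activity = switching_power(alpha=0.1, c_load=50e-15, v_dd=1.8, freq=10e6)
```

Since the supply voltage, frequency and cell library are fixed in this work, the remaining levers are the switching activity and the switched capacitance, which is what the logic level optimization targets.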
Once more, we take the S-Boxes in the composite field GF(2^4) as an example to illustrate the share of the total power consumed by each component.
According to Table 2, the switching power and the internal power are the main sources of the total consumption, while the leakage power, also called static power, is almost negligible. The internal power, determined by gate count, clock speed and technology library, is due mostly to either short-circuit power or the charging and discharging of internal capacitance within library cells. Our work focuses on logic level optimization to reduce the gate count. The switching power of a driving cell is the power dissipated by the charging and discharging of the load capacitance at the output of the cell. Because such charging and discharging result from logic transitions at the output of the cell, the switching power increases as the number of logic transitions increases. The glitch power, the part of the switching power caused by dynamic hazards, is also considered. The illustrated S-Boxes in the composite field involves many crossing and branching signal paths. The signal arrival times at the internal gates differ widely; hence, if multiple gates are connected serially, the hazards propagate along the circuit path and extra power is consumed.
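The effect of unequal signal arrival times can be illustrated with a deliberately simple model. The sketch below assumes a zero-delay 2-input XOR gate whose two inputs both rise from 0 to 1, the second one arriving `skew` time steps late; the gate, the discrete time steps and the function name are illustrative assumptions, not part of the circuits analyzed here.

```python
def xor_output_toggles(skew):
    """Count output transitions of an XOR gate when input b lags input a by `skew` steps."""
    toggles = 0
    previous = 0  # both inputs start at 0, so the output starts at 0
    for t in range(skew + 3):
        a = 1 if t >= 1 else 0
        b = 1 if t >= 1 + skew else 0
        out = a ^ b
        if out != previous:
            toggles += 1
        previous = out
    return toggles


balanced = xor_output_toggles(0)    # inputs arrive together
unbalanced = xor_output_toggles(2)  # second input arrives late
```

With balanced arrival the output never moves, while any skew produces a 0-1-0 glitch, i.e., two transitions that charge and discharge the load without contributing to the final result.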
Table 2 Average power of the S-Boxes
(UMC 0.25 μm 1.8 V CMOS standard cell)
Power type    Average power/(μW @ 10 MHz)    Ratio/%
Total         478                            100
Internal      289                            60.5
Switching     189                            39.5
Leakage       0.003                          ≈0


Based on the above analysis, we consider that the power consumption of the S-Boxes is strongly influenced by the number of dynamic hazards and the amount of internal power consumption. Our strategy is therefore to construct balanced signal paths, which eliminates the differences in signal arrival time at each gate, and to reduce the gate count in the S-Boxes circuits.

4 Improved DSE S-Boxes

4.1 Original DSE S-Boxes architecture

The original DSE S-Boxes architecture presented in Ref. [10] is shown in Fig. 2. It is composed of three parts: decoder, switch, and encoder. The decoder is the function unit which translates the 8-bit inputs $I_0$-$I_7$ to $2^8$-bit outputs in one-hot representation. The switch unit performs a hard-wired permutation that realizes the non-linear S-Boxes mapping, with $2^8$-bit inputs and $2^8$-bit outputs. The encoder unit translates the one-hot $2^8$-bit inputs to the 8-bit binary outputs $O_0$-$O_7$.
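The decoder-switch-encoder data flow can be sketched as a behavioural model. The code below is only an illustration of the three stages; the function names are hypothetical, and `DEMO_TABLE` is an arbitrary stand-in permutation rather than the AES substitution table.

```python
def decoder(byte):
    """8-bit binary input -> 256-line one-hot vector."""
    return [1 if i == byte else 0 for i in range(256)]


def switch(one_hot, table):
    """Pure wire permutation: input line i is routed to output line table[i]."""
    out = [0] * 256
    for i, bit in enumerate(one_hot):
        out[table[i]] = bit
    return out


def encoder(one_hot):
    """256-line one-hot vector -> 8-bit binary value.

    Output bit k is the OR of every line whose index has bit k set.
    """
    value = 0
    for k in range(8):
        bit = 0
        for i, line in enumerate(one_hot):
            if (i >> k) & 1:
                bit |= line
        value |= bit << k
    return value


def dse_substitute(byte, table):
    return encoder(switch(decoder(byte), table))


# Stand-in permutation (7 is coprime to 256, so this is a bijection).
DEMO_TABLE = [(7 * i + 3) % 256 for i in range(256)]
```

Because the switch stage is wiring only, all logic gates sit in the decoder (AND-type) and the encoder (OR-type), and exactly one of the 256 internal lines is active for any input.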

Fig. 2 Schematic diagram of DSE S-Boxes

In the DSE S-Boxes, the switch unit consumes zero power. Therefore, the power optimization is mainly concerned with the low power design of the encoder and the decoder.

4.2 Improved encoder architecture

In Ref. [10], the encoder design only considered the maximum reuse of the OR gates as optimized by synthesis tools. However, aiming at minimizing the gate count, we find that the resulting net-list can still be made more power-efficient.
Through the implementation of various S-Boxes we notice that when the circuit structure of the S-Boxes is changed, the power of the S-Boxes circuits can vary by more than several-fold. This is due to changes in the creation and propagation of dynamic hazards, whereas the total circuit size has less effect on the power consumption than expected. If we merely consider resource reuse, the number of gates can be decreased, but we would neglect other factors that may further decrease the power consumption. Inspired by the 3-stage decoder structure, we develop a 4-stage encoder structure that has balanced signal paths to eliminate the dynamic hazards and maximizes the reuse ratio of the gates.
We first consider a simple sub-encoder which translates four one-hot inputs $X_0$, $X_1$, $X_2$ and $X_3$ to two outputs $Y_1$ and $Y_2$. Its Boolean function can be written as follows:
$$Y_1 = X_1 + X_3, \qquad Y_2 = X_2 + X_3 \qquad (3)$$
From Eq. (3), we observe that the input $X_0$ is unused and that we can arrange the eight outputs of the encoder into four pairs with four sub-encoders. As shown in Fig. 3, all 256 inputs ($I_{0x00}$-$I_{0xff}$) are enumerated. Note that XXXXXX00 denotes all 8-bit binary numbers whose last two bits are 00, and x0 denotes all 8-bit hexadecimal numbers whose last four bits are 0x0. Both of them represent the subscripts of the inputs. The values in the dashed frames are all unused, like $X_0$ in Eq. (3), and the values in the solid frames are ORed by 2-way OR gates to form the inputs of Eq. (3).

Fig. 3 Schematic diagram of the improved encoder
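The decomposition of the encoder into four sub-encoders can be sketched as below. The code reads Eq. (3) as $Y_1 = X_1 + X_3$ and $Y_2 = X_2 + X_3$, and groups the 256 one-hot lines by the two index bits that belong to each output pair; the function names are illustrative, and the flat OR over each group ignores the staged, balanced gate structure of Fig. 3.

```python
def sub_encoder(x0, x1, x2, x3):
    """One-hot 4-to-2 encoder of Eq. (3); x0 is accepted but never used."""
    y1 = x1 | x3
    y2 = x2 | x3
    return y1, y2


def encode_256(one_hot):
    """Encode a 256-line one-hot vector with four sub-encoders, one per output pair."""
    value = 0
    for pair in range(4):            # (O0, O1), (O2, O3), (O4, O5), (O6, O7)
        shift = 2 * pair
        groups = [0, 0, 0, 0]
        for index, line in enumerate(one_hot):
            # Lines are grouped by the two index bits of this output pair.
            # Group 0 corresponds to the unused (dashed frame) inputs and
            # would not be built in hardware.
            groups[(index >> shift) & 0b11] |= line
        y1, y2 = sub_encoder(*groups)
        value |= (y1 << shift) | (y2 << (shift + 1))
    return value
```

Only three of the four groups per pair feed any gate, which is the software counterpart of the observation that $X_0$ is unused.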

From Fig. 3 we note that many groups of four consecutive inputs can be reused to generate several outputs; for instance, $I_{0xcc}$-$I_{0xcf}$ can be reused to generate $O_2$, $O_3$, $O_6$ and $O_7$. Therefore, we try to maximize the reuse ratio of the 4-way OR gates. On the other hand, in order to eliminate spurious transitions caused by the creation and propagation of dynamic hazards, we build the circuits with balanced signal paths.
In addition, since NAND and NOR gates have smaller area, lower delay and lower power consumption than OR gates, we substitute the OR gates with NAND or NOR gates while maintaining the logic function of the encoder module and the total number of logic cells. Since the encoder unit is composed wholly of OR gates, we simply use Eq. (4) to represent it. According to De Morgan's law, Eq. (4) can be translated to Eq. (5) and further to Eq. (6). Accordingly, the OR gates of the 1st and 3rd stages can be substituted by NOR gates, and the OR gates of the 2nd and 4th stages can be substituted by NAND gates.
$$Y = X_0 + X_1 + X_2 + X_3 \qquad (4)$$
$$Y = \overline{\overline{X_0} \cdot \overline{X_1} \cdot \overline{X_2} \cdot \overline{X_3}} \qquad (5)$$
$$Y = \overline{\overline{X_0 + X_1} \cdot \overline{X_2 + X_3}} \qquad (6)$$
Adopting the above three methods, we obtain the improved encoder structure shown in detail in Fig. 4. Figure 4(a) shows the generation of outputs $O_0$ and $O_1$, for which none of the stage outputs can be reused. Figure 4(b) shows the generation of outputs $O_2$-$O_7$, for which the outputs of the 1st stage can be reused at a high ratio.
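The equivalence expressed by Eqs. (4) to (6) can be checked exhaustively. The sketch below uses 2-input gates as in Eq. (6) and extends the same alternation of NOR and NAND stages to a 16-input, 4-stage tree; the helper names are illustrative, and the fan-in of the gates actually used in Fig. 4 may differ.

```python
from itertools import product


def nor(a, b):
    return 1 - (a | b)


def nand(a, b):
    return 1 - (a & b)


def or4_nor_nand(x0, x1, x2, x3):
    """Eq. (6): a NOR stage followed by a NAND stage."""
    return nand(nor(x0, x1), nor(x2, x3))


def or_tree_nor_nand(bits):
    """OR of 4^n inputs built from alternating NOR and NAND 2-input stages."""
    level = list(bits)
    use_nor = True
    while len(level) > 1:
        gate = nor if use_nor else nand
        level = [gate(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        use_nor = not use_nor
    return level[0]


def check_equivalence():
    """Compare both structures against a plain OR for every input pattern."""
    ok4 = all(or4_nor_nand(*x) == (1 if any(x) else 0)
              for x in product((0, 1), repeat=4))
    ok16 = all(or_tree_nor_nand(x) == (1 if any(x) else 0)
               for x in product((0, 1), repeat=16))
    return ok4 and ok16
```

The polarity works out because every NOR stage inverts its result and the following NAND stage absorbs that inversion, so an even number of stages returns the true OR without extra inverters.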

Fig. 4 Improved encoder architecture: (a) generation of outputs $O_0$ and $O_1$; (b) generation of outputs $O_2$-$O_7$

4.3 Improved decoder architecture

Although the power-efficient decoder module within the original DSE structure has been well studied in Ref. [10], we can further optimize the power consumption by decreasing the area of the gates, i.e., decreasing the internal power, while maintaining the logic function of the decoder unit and its 3-stage full-balanced structure. Since the decoder unit is composed wholly of AND gates, we simply use Eq. (7) to represent it. According to De Morgan's law, Eq. (7) can be translated to Eq. (8) and further to Eq. (9). So the AND gates of the 1st stage and the AND gates of the 3rd stage can be substituted by NAND gates and NOR gates respectively.
$$Y = X_0 \cdot X_1 \cdot X_2 \cdot X_3 \cdot X_4 \cdot X_5 \cdot X_6 \cdot X_7 \qquad (7)$$
$$Y = \overline{\overline{X_0 \cdot X_1} + \overline{X_2 \cdot X_3} + \overline{X_4 \cdot X_5} + \overline{X_6 \cdot X_7}} \qquad (8)$$
$$Y = \overline{\left(\overline{X_0 \cdot X_1} + \overline{X_2 \cdot X_3}\right) + \left(\overline{X_4 \cdot X_5} + \overline{X_6 \cdot X_7}\right)} \qquad (9)$$
Figure 5 shows the improved decoder architecture in detail. By substituting the 2-way AND gates with 2-way NAND gates, we save the area of about 5 gates in the first stage. In the second stage, the area is unchanged since the 2-way OR gate and the 2-way AND gate have the same size. In the last stage, the area of 85 gates is saved by replacing the 2-way AND gates with 2-way NOR gates. In all, the silicon area of the decoder is reduced to only 78% of that of the original architecture. Combining our proposed decoder and encoder circuits, we obtain a new full-balanced S-Boxes characterized by low delay, compactness and power efficiency.

Fig. 5 Improved decoder architecture (1st stage: INV and 2-way NAND gates; 2nd stage: 2-way OR gates; 3rd stage: 2-way NOR gates)

5 Simulation results

To analyze the power consumption of the S-Boxes circuits, a simulation-based analysis method is used. In this method, after functional simulation, a net-list is obtained with the UMC 0.25 μm 1.8 V technology library using Synopsys Design Compiler. Then, a timing simulation at the gate level is performed using a given set of test input data, and the switching activities of all internal gates are logged. The simulation is performed at a clock frequency of 10 MHz with all patterns of the primary input switching. The circuit average power is computed using Synopsys PrimePower. It should be pointed out that although the absolute values of the circuit performance differ among ASIC libraries, the ratios of the parameters are almost the same.
For comparison with our proposed S-Boxes, we have implemented all known S-Boxes; some of the results are shown in Table 3. Our improved DSE S-Boxes is denoted as IDSE. It can be seen from the simulation results that IDSE is the most power-efficient S-Boxes among the known S-Boxes. IDSE outperforms the original DSE S-Boxes in all aspects: the delay, gate count and power consumption are reduced by 8%, 14% and 10% respectively. Furthermore, the critical path delay of the IDSE S-Boxes is lower than that of all others except the BDD S-Boxes, which obtains the lowest critical path delay at the expense of the largest size and the highest power consumption.
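The quoted improvements can be recomputed from the DSE and IDSE entries of Table 3. The sketch below assumes those entries are (3.17 ns, 780 gates, 76 μW) for DSE and (2.92 ns, 670 gates, 68 μW) for IDSE; the dictionary keys and function name are illustrative.

```python
def reduction_percent(original, improved):
    """Relative reduction of `improved` with respect to `original`, in percent."""
    return 100.0 * (original - improved) / original


DSE = {"delay_ns": 3.17, "gates": 780, "power_uW": 76}
IDSE = {"delay_ns": 2.92, "gates": 670, "power_uW": 68}

reductions = {key: reduction_percent(DSE[key], IDSE[key]) for key in DSE}
for key, value in reductions.items():
    print(f"{key}: {value:.1f}%")
```

The results are about 7.9%, 14.1% and 10.5%, in line with the 8%, 14% and 10% stated in the text.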
Table 3 Comparison of various S-Boxes architectures
(UMC 0.25 μm 1.8 V CMOS standard cell, 1 gate = 2-way NAND)
Architecture    Delay/ns    Size/gate    Average power/(μW @ 10 MHz)
LUT             4.54        573          180
GF              8.04        373          478
PPRM            6.70        709          111
SOP             4.46        575          156
BDD             1.15        3283         1744
DSE             3.17        780          76
IDSE            2.92        670          68

Fig. 6 Comparison of power-area product of various S-Boxes
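The power-area and power-delay products can be recomputed from the Table 3 values and normalized to IDSE. The sketch below assumes the rows of Table 3 are ordered LUT, GF, PPRM, SOP, BDD, DSE, IDSE; the data structure and function name are illustrative.

```python
# name: (delay in ns, size in gates, average power in uW at 10 MHz)
TABLE3 = {
    "LUT":  (4.54, 573, 180),
    "GF":   (8.04, 373, 478),
    "PPRM": (6.70, 709, 111),
    "SOP":  (4.46, 575, 156),
    "BDD":  (1.15, 3283, 1744),
    "DSE":  (3.17, 780, 76),
    "IDSE": (2.92, 670, 68),
}


def normalized_products(table, reference="IDSE"):
    """Return {name: (power-area, power-delay)} normalized to the reference design."""
    ref_delay, ref_gates, ref_power = table[reference]
    result = {}
    for name, (delay, gates, power) in table.items():
        power_area = (power * gates) / (ref_power * ref_gates)
        power_delay = (power * delay) / (ref_power * ref_delay)
        result[name] = (power_area, power_delay)
    return result


products = normalized_products(TABLE3)
best_power_area = min(products, key=lambda name: products[name][0])
best_power_delay = min(products, key=lambda name: products[name][1])
```

Under these assumptions both minima fall on IDSE, and the BDD products are more than an order of magnitude above the IDSE reference.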

Figures 6 and 7 show our results in terms of the power-area product and the power-delay product respectively. Due to its high power consumption and large size, BDD has a much larger power-area product and power-delay product than the others (see Table 3), so it is not illustrated in the figures for display convenience. Both the power-area product and the power-delay product have been normalized to those of IDSE. The power-area metric is particularly relevant to applications which require both small silicon area and low power consumption, e.g., cryptographically enhanced RFID tags or sensor nodes [12]. Moreover, the power-delay product is a very important metric of circuit performance [11]. As expected, our proposed improved DSE S-Boxes achieves the smallest power-area product and power-delay product.

Fig. 7 Comparison of power-delay product of various S-Boxes

6 Conclusions
In this paper, we have developed an improved DSE architecture for S-Boxes circuits with low critical path delay, small size and low power. The designed S-Boxes for AES cryptography, using optimized balanced architectures of a 3-stage decoder and a 4-stage encoder, is applicable to security applications which require high speed, compact area and power efficiency. The power consumption of S-Boxes circuits can be reduced by avoiding the creation and propagation of dynamic hazards, and the silicon size and the critical path delay can be decreased by optimizing the logic at gate level. Simulation results obtained at 10 MHz using a UMC 0.25 μm 1.8 V CMOS technology show that the delay, gate count and power consumption are reduced by 8%, 14% and 10% respectively compared with the original DSE S-Boxes. Moreover, we have analyzed and compared two cost metrics of the six selected S-Boxes implementations. Our proposed IDSE S-Boxes achieves the smallest power-area product and power-delay product. To the best of our knowledge, it is the lowest power S-Boxes circuit among all known S-Boxes architectures.
Acknowledgements This work is supported by the Hi-Tech Research and Development Program of China (2006AA01Z226), HUST-SRF (2006Z011B), Program for New Century Excellent Talents in University and the Natural Science Foundation of Hubei (2006ABA080).

References

1. AES. Federal Information Processing Standards Publication 197, 2001
2. Hodjat A, Verbauwhede I. A 21.54 Gb/s fully pipelined AES processor on FPGA. Proceedings of 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'04), Apr 20-23, 2004, Napa, CA, USA. Los Alamitos, CA, USA: IEEE Computer Society, 2004: 308-309
3. Fischer V, Drutarovsky M. Two methods of Rijndael implementation in reconfigurable hardware. Proceedings of 3rd International Workshop on Cryptographic Hardware and Embedded Systems (CHES'01), May 14-16, 2001, Paris, France. Heidelberg, Germany: Springer Verlag, 2001: 77-92
4. Elbirt A J, Yip W, Chetwynd B, et al. An FPGA-based performance evaluation of the AES block cipher candidate algorithm finalists. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2001, 9(4): 545-557
5. Verbauwhede I, Schaumont P, Kuo H. Design and performance testing of a 2.29-GB/s Rijndael processor. IEEE Journal of Solid-State Circuits, 2003, 38(3): 569-572
6. Morioka S, Satoh A. A 10 Gbps full-AES crypto design with a twisted-BDD S-Box architecture. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2004, 12(7): 686-691
7. Wolkerstorfer J, Oswald E, Lamberger M. An ASIC implementation of the AES S-Boxes. Proceedings of Cryptographers Track at the RSA Conference, Feb 18-22, 2002, San Jose, CA, USA. Heidelberg, Germany: Springer Verlag, 2002: 67-78
8. Morioka S, Satoh A. An optimized S-boxes circuit architecture for low power AES design. Proceedings of 4th International Workshop on Cryptographic Hardware and Embedded Systems (CHES'02), Aug 13-15, 2002, San Francisco, CA, USA. Heidelberg, Germany: Springer Verlag, 2002: 172-186
9. Bryant R E. Graph-based algorithms for Boolean function manipulation. IEEE Transactions on Computers, 1986, 35(8): 677-691
10. Bertoni G, Macchetti M, Negri L, et al. Power-efficient ASIC synthesis of cryptographic S-boxes. Proceedings of the 14th ACM Great Lakes Symposium on VLSI (GLSVLSI'04), Apr 26-28, 2004, Boston, MA, USA. New York, NY, USA: ACM Press, 2004: 277-281
11. Rabaey J M, Chandrakasan A, Nikolic B. Digital integrated circuits: A design perspective. 2nd ed. Upper Saddle River, NJ, USA: Prentice-Hall, 2003
12. Tillich S, Feldhofer M, Großschädl J. Area, delay, and power characteristics of standard-cell implementations of the AES S-boxes. Proceedings of 6th Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS'06), Jul 17-20, 2006, Samos, Greece. Heidelberg, Germany: Springer Verlag, 2006: 457-466

Biographies: XING Ji-peng, from Hubei, Ph. D. candidate in the Department of Electronic Science and Technology, Huazhong University of Science and Technology, interested in research on VLSI design, wireless sensor networks and embedded systems.

ZOU Xue-cheng, a professor and doctoral advisor in the Department of Electronic Science and Technology, Huazhong University of Science and Technology. His research interests include VLSI design, RFID systems, and information security SoC design.
