






Abstract
A high-performance implementation of the International Data Encryption Algorithm (IDEA) is presented in this paper. Using a novel bit-serial architecture to perform multiplication modulo $2^{16} + 1$, the implementation occupies a minimal amount of hardware. The bit-serial architecture enabled the algorithm to be deeply pipelined to achieve a system clock rate of 125MHz on a Xilinx Virtex XCV300-6, delivering a throughput of 500Mb/sec. With an XCV1000-6 device, the estimated performance is 2Gb/sec, three orders of magnitude faster than a software implementation on a 450MHz Intel Pentium II. This design is suitable for applications in on-line encryption for high-speed networks.
Cryptography is concerned with the transfer of information between parties so that only the intended parties can read the data. Even under the assumption that an adversary has full knowledge of the algorithms used and access to the medium over which data is transmitted, retrieving the data without knowledge of a secret piece of information called a key should be intractable.

We believe that cryptography is an ideal application for Field-programmable Custom Computing Machines (FCCMs), since they offer the following advantages over VLSI technologies:
• it is possible to use the same FCCM hardware for many different cryptographic protocols
• Moore's law continues to offer improved silicon technology at exponential rates, which is available to FCCM designers without the costly manufacturing process required in VLSI
• it is possible to specialize the hardware to an extent not possible in VLSI devices to improve performance
• the reconfigurable nature makes it feasible to attempt designs employing more sophisticated algorithms, which leads to an improvement in performance.
The Data Encryption Standard (DES) algorithm has been a popular secret-key encryption algorithm and is used in many commercial and financial applications. Although introduced in 1976, it has proved resistant to all forms of cryptanalysis. However, its key size is too small by current standards and its entire 56-bit key space can be searched in approximately 22 hours [1].

In 1990, Lai and Massey introduced an iterated block cipher known as the Proposed Encryption Standard (PES) [2]. The same authors, joined by Murphy, proposed a modification of PES called Improved PES (IPES) [3], which improves the security of the original algorithm against differential analysis and truncated differentials [4, 5, 6]. In 1992, IPES was commercialized and renamed the International Data Encryption Algorithm (IDEA). Some believe that, to date, the algorithm is the best and the most secure block algorithm available to the public [7].

Although IDEA involves only simple 16-bit operations, software implementations of this algorithm still cannot offer the encryption rate required for on-line encryption in high-speed networks. Ascom's implementation of IDEA (Ascom are the holders of the patent on the IDEA algorithm) achieves $0.37 \times 10^6$ encryptions per second, or an equivalent encryption rate of 23.53Mb/sec, on an Intel Pentium II 450MHz machine. Our optimized software implementation running on a Sun Enterprise E4500 machine with twelve 400MHz Ultra-IIi processors performs $2.30 \times 10^6$ encryptions per second, or an equivalent encryption rate of 147.13Mb/sec, and still cannot be applied to applications such as encryption for 155Mb/sec Asynchronous Transfer Mode (ATM) networks.

Hardware implementations offer significant speed improvements over software implementations by exploiting parallelism among operators. In addition, they are likely to be cheaper, have lower power consumption and a smaller footprint in embedded applications than a high-speed software implementation. A paper design of an IDEA processor which achieves 528Mb/sec on four XC4020XL devices was proposed by Mencer et al. [8]. The first VLSI implementation of IDEA was developed and verified by Bonnenberg et al. in 1992 using a 1.5 µm CMOS technology [9]. This implementation had an encryption rate of 44Mb/sec. In 1994, VINCI, a 177Mb/sec VLSI implementation of the IDEA algorithm in 1.2 µm CMOS technology, was reported by Curiger et al. [10, 11]. A 355Mb/sec implementation of IDEA in 0.8 µm technology was reported in 1995 by Wolter et al. [12]. The fastest single-chip implementation of which we are aware is a 424Mb/sec implementation in 0.7 µm technology by Salomao et al. [13]. A commercial implementation of IDEA called the IDEACrypt coprocessor, developed by Ascom, achieves 300Mb/sec [14].

In this paper, a Xilinx Virtex XCV300-6 based implementation of the IDEA algorithm is described with a throughput of 500Mb/sec. Furthermore, with an XCV1000-6 device, the estimated performance is 2Gb/sec. This design is faster than all VLSI implementations mentioned above. The implementation employs a novel bit-serial architecture which offers the following advantages:
• high degree of fine-grain parallelism
• scalable so that throughput and area tradeoffs can be addressed
• high clock rate
• compact implementation.
Applications of this design include Virtual Private Networks (VPNs) and embedded encryption/decryption devices.

This paper is organized as follows. In Section 2 the IDEA algorithm as well as algorithms for multiplication modulo $2^n + 1$ are described. In Section 3 the bit-serial implementation of IDEA is presented. In Section 4 results are given. Conclusions are drawn in Section 5.
IDEA belongs to a class of cryptosystems called secret-key cryptosystems, which are characterized by the symmetry of the encryption and decryption processes and the possibility of deriving the decryption key from the encryption key and vice versa. IDEA takes 64-bit plaintext inputs and produces 64-bit ciphertext outputs using a 128-bit key.

The design philosophy behind IDEA is mixing operations from different algebraic groups including XOR, addition modulo $2^{16}$, and multiplication modulo the Fermat prime $2^{16} + 1$. All these operations work on 16-bit sub-blocks.

The IDEA block cipher [7] (depicted in Figure 1) consists of a cascade of eight identical blocks known as rounds, followed by a half-round or output transformation. In each round, XOR, addition and modular multiplication operations are applied. IDEA is believed to be of strong cryptographic strength because its primitive operations are of three distinct algebraic groups of $2^{16}$ elements, multiplication modulo $2^{16} + 1$ provides desirable statistical independence between plaintext and ciphertext, and its property of having iterative rounds makes differential attacks difficult.

The encryption process is as follows. The 64-bit plaintext is divided into four 16-bit plaintext sub-blocks, $X_1$ to $X_4$. The algorithm converts the plaintext blocks into ciphertext blocks of the same bit-length, similarly divided into four 16-bit sub-blocks, $Y_1$ to $Y_4$. Fifty-two 16-bit subkeys, $Z_i^{(r)}$, where $i$ and $r$ are the subkey number and round number respectively, are computed from the 128-bit secret key. Each round uses six subkeys and the remaining four subkeys are used in the output transformation. The decryption process is essentially the same as the encryption process except that the subkeys are derived using a different algorithm [7].

The algorithm for computing the encryption subkeys (called the key schedule) involves only logical rotations. Order the 52 subkeys as $Z_1^{(1)}, \ldots, Z_6^{(1)}, Z_1^{(2)}, \ldots, Z_6^{(2)}, \ldots, Z_1^{(8)}, \ldots, Z_6^{(8)}, Z_1^{(9)}, \ldots, Z_4^{(9)}$. The procedure begins with partitioning the 128-bit secret key $Z$ into eight 16-bit blocks and assigning them directly to the first eight subkeys. $Z$ is then rotated left by 25 bits, partitioned into eight 16-bit blocks and again assigned to the next eight subkeys. The process continues until all 52 subkeys are assigned. The decryption subkeys ${Z'}_i^{(r)}$ can be computed from the encryption subkeys with reference to Table 1.

In electronic codebook (ECB) mode [7], the data
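The rotation-based key schedule described above can be sketched in a few lines of Python. This is a minimal illustration, not the hardware implementation: the function name `idea_encryption_subkeys` is hypothetical, the key is assumed to be supplied as a 128-bit integer, and the eight 16-bit blocks are assumed to be taken most significant block first (the text does not state the bit order, so this follows the usual IDEA convention).

```python
def idea_encryption_subkeys(key: int) -> list[int]:
    """Return the 52 16-bit encryption subkeys for a 128-bit key (hypothetical helper)."""
    if not 0 <= key < (1 << 128):
        raise ValueError("key must be a non-negative 128-bit integer")
    mask128 = (1 << 128) - 1
    subkeys: list[int] = []
    while len(subkeys) < 52:
        # Partition the current 128-bit value into eight 16-bit blocks,
        # most significant block first (assumed ordering).
        for i in range(8):
            subkeys.append((key >> (112 - 16 * i)) & 0xFFFF)
        # Rotate the whole 128-bit value left by 25 bits.
        key = ((key << 25) | (key >> 103)) & mask128
    # Seven passes yield 56 blocks; only the first 52 are used.
    return subkeys[:52]
```

Calling `idea_encryption_subkeys(0x00010002000300040005000600070008)` returns a list whose first six entries would serve as $Z_1^{(1)}, \ldots, Z_6^{(1)}$ and whose last four would serve as $Z_1^{(9)}, \ldots, Z_4^{(9)}$.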
Figure 2: Bit-serial XOR and addition operators.
is transferred along the associated data bus. To reduce area, control signals can be shared among the variables. Since bit-serial operators usually require the first bits of their operands to enter the operators on the same clock cycle, appropriate stage latches must be inserted for time-alignment [17].

Two of the primitive operators used in IDEA, namely XOR and addition modulo $2^{16}$, can be implemented in a bit-serial fashion using the circuits shown in Figure 2. These two operators have latencies of one clock cycle and are capable of taking consecutive bit-serial operands. The multiplication modulo $2^{16} + 1$ operator has a latency of 35 clock cycles. The corresponding pipelined datapath for one round of IDEA is illustrated in Figure 3. For the best area-efficiency, stage latches and constants are implemented using Virtex SRL16E primitives [18, 19]. More specifically, a constant is implemented as an SRL16E primitive, with its output connected to its input to form a cyclic shift register.
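The behaviour of the two bit-serial operators can be modelled in software. The sketch below is an assumption-laden illustration rather than a description of the actual circuit: it assumes operands arrive least significant bit first, models the adder as a full adder with a single carry latch that is cleared before each word, and ignores the one-cycle output latency. All function names are hypothetical.

```python
def to_bits_lsb_first(value: int, n: int = 16) -> list[int]:
    """Serialize an n-bit word, least significant bit first (assumed bit order)."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits_lsb_first(bits: list[int]) -> int:
    """Reassemble a word from a least-significant-bit-first stream."""
    return sum(bit << i for i, bit in enumerate(bits))


def bit_serial_xor(a_bits: list[int], b_bits: list[int]) -> list[int]:
    """One XOR gate processing one bit pair per clock cycle."""
    return [a ^ b for a, b in zip(a_bits, b_bits)]


def bit_serial_add(a_bits: list[int], b_bits: list[int]) -> list[int]:
    """One full adder plus a carry latch; yields the sum modulo 2**n."""
    carry = 0  # models the 'clear' control resetting the carry before the LSB
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)                # sum bit of the full adder
        carry = (a & b) | (carry & (a ^ b))      # carry held for the next cycle
    return out  # the final carry is discarded, giving addition modulo 2**n
```

For example, `from_bits_lsb_first(bit_serial_add(to_bits_lsb_first(0xFFFF), to_bits_lsb_first(1)))` wraps around to zero, as expected for addition modulo $2^{16}$.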
3.2 Multiplication Modulo $2^{16} + 1$

As described in Section 2.1, multiplication modulo $2^{16} + 1$ is the most critical operation in the IDEA algorithm. Choosing a suitable multiplier is therefore a crucial design issue.

An $N \times N$-bit multiplier generates a $2N$-bit result, and requires $2N$ cycles to complete. Thus, the throughput of bit-serial multipliers is restricted because the minimum interval between consecutive multiplications must be at least $2N$ cycles. In the IDEA algorithm, one of the operands of every modular multiplication is a subkey and is treated as a constant. Recall that in the modular multiplication algorithm the intermediate result $t$ is divided into two portions (lines 6 to 8, in Section 2.1). The two portions are respectively the upper and lower 16 bits of the double-word, which are operands to subsequent operations. A design that computes the upper and lower words of $t$ independently is desirable, allowing all the inputs, outputs and intermediate variables of the operator to be 16 bits long. Using this scheme and duplicating hardware, the throughput of a modular multiplication operation can be doubled.

Figure 3: Pipelined datapath for one round of IDEA.

Figure 4: Lyon's serial-parallel multiplier.

A modified version of Lyon's serial-parallel multiplier [20] was developed which addresses this problem. The original design of Lyon's multiplier is shown in Figure 4. To generate two 16-bit results in 16 cycles, the throughput of the multiplier must be doubled. We achieved this by duplicating the hardware for multiplication, as illustrated in Figure 5. Registers storing the constant are shared among the two multiplication pipelines. The outputs $p$ and $q$ correspond to the results of two consecutive multiplications, where the two 32-bit long variables have a time-difference of 16 cycles. The control signal, which is high one clock cycle
before the least significant bit enters the module, toggles the control register. The vector of input variables $a_{n-1} \ldots a_1 a_0$ is consequently redirected into the two multiplication pipelines alternately. While the vector is being redirected to one pipeline, logic zero enters the other pipeline, carrying out zero-padding. A timing diagram of the modified multiplier is shown in Figure 6.

To obtain the time-aligned upper and lower words of $t$, a 16-stage shift register is required. The input and output of the shift register are the upper and lower words of $t$ respectively, 16 cycles after $t$ is valid. In the implementation the shift register is implemented as an SRL16E [18] primitive. The complete architecture for the modular multiplication operation is shown in Figure 7. Upon initialization, the subkey associated with the operator is passed into the operator bit-serially. The pre-decremented subkey is shifted into the registers of the multiplier, and at the same time stored into the SRL16E primitive responsible for key storage.

Utilizing the idea of multiple pipelines, the modular multiplication operation offers a throughput of one result every 16 cycles, even though a 32-bit intermediate result is computed. This scheme doubles the throughput, but since the $b$ registers can be shared, the hardware cost is less than double.
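The role of the upper and lower words of $t$ can be illustrated with a short software model. The sketch below assumes that the algorithm referred to in Section 2.1 is the usual low/high-word reduction for the Fermat prime, which relies on $2^{16} \equiv -1 \pmod{2^{16} + 1}$, and that the all-zero 16-bit word stands for $2^{16}$ as in the standard IDEA convention. The name `mul_mod_fermat` is hypothetical, and the model says nothing about the pre-decremented operand encoding used inside the hardware.

```python
MODULUS = (1 << 16) + 1  # the Fermat prime 65537


def mul_mod_fermat(a: int, b: int) -> int:
    """Multiply two 16-bit words modulo 2**16 + 1 (0 is assumed to encode 2**16)."""
    if not (0 <= a <= 0xFFFF and 0 <= b <= 0xFFFF):
        raise ValueError("operands must be 16-bit words")
    # Map the all-zero word to 2**16 (assumed IDEA convention).
    a = a or 0x10000
    b = b or 0x10000
    t = a * b                 # double-word intermediate result
    lo = t & 0xFFFF           # lower word of t
    hi = t >> 16              # upper word of t (equals 2**16 only when a = b = 2**16)
    # Since 2**16 = -1 (mod 2**16 + 1), t = hi * 2**16 + lo is congruent to lo - hi.
    result = lo - hi if lo >= hi else lo - hi + MODULUS
    return result & 0xFFFF    # 2**16 is written back as the all-zero word
```

Because the reduction needs only `lo` and `hi`, a design that delivers the two words separately and time-aligned, as described above, has everything required for the final subtraction.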
3.3 IDEA Core
The core implementation of IDEA is obtained by cascading eight identical rounds of the operations shown in Figure 3, followed by an output transformation. For convenient interfacing, four parallel-to-serial converters are inserted before the first round and four serial-to-parallel converters are appended after the output transformation. The core takes one 64-bit plaintext once every 16 cycles, yielding an effective encryption rate of $f \times 64 / 16$ Mb/sec at a system clock rate of $f$ MHz.

All the shift registers for key storage are linked during initialization cycles. Upon initialization, the pre-computed 52 16-bit subkeys (a total of 832 bits) are passed bit-serially into the core via the shift registers. This key scheduling mechanism is advantageous for its minimum routing and logic requirements. To further optimize area, the control circuitry of all the modules is extracted and replaced by a global 16-bit one-hot encoding state machine.

The data flow diagram of the IDEA core is shown in Figure 8. Each round has a latency of 109 cycles. The output transformation has a latency of 35 cycles. Each serial-to-parallel converter at the outputs has a latency of 16 cycles. Therefore, the IDEA core has an overall latency of $109 \times 8 + 35 + 16 = 923$ cycles. At a 125MHz system clock rate, the equivalent latency is 7.384 µs, which is acceptable for many applications.
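The throughput and latency figures above follow from simple arithmetic, which the following sketch reproduces. The function names and default parameters are illustrative only; the numerical constants are taken from the text.

```python
def core_throughput_mbps(clock_mhz: float, block_bits: int = 64,
                         cycles_per_block: int = 16) -> float:
    """Encryption rate in Mb/sec: one 64-bit block is accepted every 16 cycles."""
    return clock_mhz * block_bits / cycles_per_block


def core_latency_cycles(rounds: int = 8, round_latency: int = 109,
                        output_latency: int = 35, s2p_latency: int = 16) -> int:
    """Overall latency: eight rounds, output transformation, serial-to-parallel stage."""
    return rounds * round_latency + output_latency + s2p_latency


def core_latency_us(clock_mhz: float) -> float:
    """Latency in microseconds (cycles divided by the clock frequency in MHz)."""
    return core_latency_cycles() / clock_mhz
```

At a 125MHz clock, `core_throughput_mbps(125)` gives 500 Mb/sec and `core_latency_cycles()` gives 923 cycles, matching the values quoted in this section.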
3.4 Scalability
Given more resources, the bit-serial implementation of IDEA can be efficiently scaled up to achieve a higher encryption rate. This is achieved by instantiating multiple IDEA cores whose control signals are shifted with respect to one another. Starting from a single core, the implementation can be scaled up by a factor of up to 16 to achieve 16 times the original encryption rate without affecting latency. A maximally scaled version of the implementation is illustrated in Figure 9. The timing diagram in Figure 10 illustrates the mechanism of input data forwarding and output data merging of a maximally scaled implementation.
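The round-robin forwarding of plaintext blocks to staggered cores can be sketched as a simple schedule. This is a hypothetical model: it assumes that core $k$ has its control signals offset by $k$ clock cycles, that each core accepts a new block every 16 cycles, and that every block emerges 923 cycles after it is accepted. The function name and the tuple layout are illustrative.

```python
CYCLES_PER_BLOCK = 16   # each core accepts one 64-bit block every 16 cycles
CORE_LATENCY = 923      # overall core latency in cycles, as derived in Section 3.3


def round_robin_schedule(num_blocks: int, num_cores: int = 16):
    """Return (core index, input cycle, output cycle) for each plaintext block."""
    if not 1 <= num_cores <= CYCLES_PER_BLOCK:
        raise ValueError("between 1 and 16 cores can be interleaved")
    schedule = []
    for n in range(num_blocks):
        core = n % num_cores             # blocks are handed out in round-robin order
        batch = n // num_cores           # how many blocks this core has already taken
        start = batch * CYCLES_PER_BLOCK + core   # assumed one-cycle stagger per core
        schedule.append((core, start, start + CORE_LATENCY))
    return schedule
```

With 16 cores the schedule accepts one block on every clock cycle, while the latency seen by each individual block remains 923 cycles.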
The bit-serial IDEA processor was verified with Synopsys VHDL Simulator, and was synthesized using Synopsys FPGA Express 3.3 and Xilinx Foundation Series 2.1i, with a Xilinx Virtex XCV300-4 as the target device. The fully-pipelined implementation requires 2801 Virtex slices, accounting for 91.18% of the total 3072 slices on an XCV300 device.

The basic building block of the Virtex FPGA is the logic cell (LC). An LC includes a 4-input function generator, carry logic and a storage element. Each Virtex CLB contains four LCs, organized in two slices. The 4-input function generators are implemented as 4-input look-up tables (LUTs). Each of them can provide the functions of one 4-input LUT or a 16 × 1-bit synchronous RAM (called "distributed RAM"). Furthermore, two LUTs in a slice can be combined to create a 16 × 2-bit or 32 × 1-bit synchronous RAM, or a 16 × 1-bit dual-port synchronous RAM.

Our design was successfully implemented on the Annapolis Micro Systems Wildcard Reconfigurable Computing Engine [21]. The device is a Type II PCMCIA Card with a 33MHz 32-bit CardBus interface, consisting of a Xilinx Virtex XCV FPGA as Processing Element (PE) and two 64k × 32-bit SDRAMs. The design was found to be operational at room temperature up to 125MHz and this clock rate was used for all performance tests. Reliable operation over the full commercial temperature range would be expected if an XCV300-6 was used.
Figure 8: Data flow diagram of the IDEA core.

Figure 9: Maximally scaled bit-serial implementation of IDEA.

Figure 10: Timing diagram of a maximally scaled implementation, showing the instantiations of IDEA cores which take plaintext and produce ciphertext in a round-robin order.
Speed grade                      -4        -5        -
Reported clock rate (MHz)        106.378   116.229   125.
Encryptions per second (×10^6)   6.648     7.264     7.
Encryption rate (Mb/sec)         425.5     464.9     500.
Latency (µs)                     8.677     7.941     7.

Table 2: Performance of IDEA core on devices of different speed grades.

Device (XCV)                     300-6     600-6     1000-
Scaling                          1         2         4
Number of slices                 2801      5602
Device slices utilization        91.18%    81.05%    91.18%
Clock rate (MHz)                 125.202   125.202   125.
Encryptions per second (×10^6)   7.825     15.650    31.
Encryption rate (Mb/sec)         500.8     1001.6    2003.

Table 3: Tradeoffs between performance and area of the IDEA core.
4.1 Performance of IDEA Core
Bit-serial architectures facilitate high system clock rates compared with traditional bit-parallel implementations. The performance of the core (assuming a high-bandwidth interface to the data sources and sinks) is summarized in Table 2. Reported clock rates refer to the clock frequencies reported by timing analysis.

In an attempt to explore tradeoffs between performance and area, the core was generated for FPGAs of different capacities. The core was maximally scaled within the resource limitation of each device using the method described in Section 3.4. Results are summarized in Table 3.
It is estimated that a maximally scaled implementation requires $2801 \times 16 = 44816$ slices, which can produce an encryption rate of $500 \times 16 = 8000$ Mb/sec (8Gb/sec) at a 125MHz clock rate.
4.2 Performance on the Wildcard Platform
On the Wildcard implementation, the time taken to complete a transaction between the FPGA and host is dominated by operating system overheads. When designing the interface between the IDEA core and the host, it is crucial that the number of discrete CardBus read and write transactions is minimized and the amount of data transferred per transaction is maximized.

A block diagram of the interface is shown in Figure 11. Data is written directly to the core using a burst mode transfer of 512 64-bit plaintext blocks. After the latency period, the ciphertext is written to consecutive locations in the BlockRAM. For XCV devices, there are eight 256 × 32-bit BlockRAMs [22] on the chip and they are all used in the host/IDEA interface. The results are read by the host from the IDEA processor by doing a burst mode transfer of the contents of the BlockRAM. The decryption process is similar except that the ciphertext is written to the IDEA core and the plaintext appears in the BlockRAM.

The maximum transfer rate of CardBus is 33MHz × 32 bits = 1056 Mb/sec, but the bit-serial core, clocked at 125MHz, has an encryption rate of approximately $125 \times 64 / 16 = 500$ Mb/sec. In order to match the bandwidth of the IDEA core, the host inserts three blank 32-bit words between every two 64-bit plaintext double-words, so that only two of every five words on the bus carry data. The maximum data rate is therefore $1056 \times 0.4 = 422.4$ Mb/sec. The interface between host and IDEA core on Wildcard requires an additional 238 slices, resulting in a total of 3039 slices, or 98.93% utilization of the XCV300.

Although the CardBus has a 1056Mb/sec maximum transfer rate, its actual data transfer rate using programmed I/O is degraded due to very large operating system overheads in setting up a CardBus transaction. The implementation achieves a measured performance of $0.61 \times 10^6$ encryptions per second (39Mb/sec).
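The bandwidth-matching arithmetic above can be checked with a short calculation. The helper names below are hypothetical; the constants (33MHz, 32-bit bus, two data words followed by three blank words, 64-bit blocks) come from the text.

```python
def padded_bus_rate_mbps(bus_mhz: float = 33, bus_width_bits: int = 32,
                         data_words: int = 2, blank_words: int = 3) -> float:
    """Useful CardBus data rate when blank words are interleaved with data words."""
    raw_rate = bus_mhz * bus_width_bits              # 1056 Mb/sec for CardBus
    useful_fraction = data_words / (data_words + blank_words)
    return raw_rate * useful_fraction


def encryptions_to_mbps(encryptions_per_sec: float, block_bits: int = 64) -> float:
    """Convert a block encryption rate into Mb/sec (taking 1 Mb = 10**6 bits)."""
    return encryptions_per_sec * block_bits / 1e6
```

Under these assumptions the padded bus delivers 422.4 Mb/sec, which stays below the 500 Mb/sec the core can absorb, and the measured $0.61 \times 10^6$ encryptions per second corresponds to roughly 39 Mb/sec.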
The situation could be improved by using Direct Memory Access (DMA), but the DMA interface requires an additional 400 slices and would not fit on an XCV300. The DMA interface was tested in a stand-alone configuration, and the measured performance for a write of 5120 words (2048 / 0.4) followed by a read of 2048 words was 142Mb/sec. A larger device which can accommodate both the IDEA core and the DMA
[10] A. Curiger, H. Bonnenberg, R. Zimmerman, N. Felber, H. Kaeslin, and W. Fichtner, "VINCI: VLSI implementation of the new secret-key block cipher IDEA," in Proceedings of the IEEE Custom Integrated Circuits Conference, pp. 15.5.1-15.5.4, 1993.
[11] R. Zimmermann, A. Curiger, H. Bonnenberg, H. Kaeslin, N. Felber, and W. Fichtner, "A 177Mb/sec VLSI implementation of the international data encryption algorithm," IEEE Journal of Solid-State Circuits, vol. 29, pp. 303-307, March 1994.
[12] S. Wolter, H. Matz, A. Schubert, and R. Laur, "On the VLSI implementation of the international data encryption algorithm IDEA," in Proceedings of the IEEE International Symposium on Circuits and Systems, vol. 1, pp. 397-400, 1995.
[13] S. L. C. Salomao, V. C. Alves, and E. M. C. Filho, "HiPCrypto: A high-performance VLSI cryptographic chip," in Proceedings of the Eleventh Annual IEEE ASIC Conference, pp. 7-11, 1998.
[14] Ascom, IDEACrypt Coprocessor Data Sheet,
[15] A. V. Curiger, H. Bonnenberg, and H. Kaeslin, "Regular VLSI architectures for multiplication modulo $2^n + 1$," IEEE Journal of Solid-State Circuits, vol. 26, pp. 990-994, July 1991.
[16] C. Meier and R. Zimmerman, "A multiplier modulo $2^n + 1$," Diploma thesis, Institut für Integrierte Systeme, ETH, Zürich, Switzerland, February 1991.
[17] R. Hartley and K. K. Parhi, Digit-Serial Computation. Kluwer Academic Publishers, 1995.
[18] Xilinx, Inc., Xilinx Libraries Guide, 1999.
[19] M. George and P. Alfke, Linear Feedback Shift Registers in Virtex Devices. Xilinx, Inc., August
[20] R. F. Lyon, "Two's complement pipeline multipliers," IEEE Transactions on Communications, vol. 12, pp. 418-425, April 1976.
[21] Annapolis Micro Systems, Inc., Wildcard Reference Manual, 1999. Revision 1.1.
[22] Xilinx, The Programmable Logic Data Book,