
Integration of Hardware Cryptography Acceleration on Embedded Systems under Linux 2.

Olaf Christ, mycable GmbH, Gartenstrasse 10, 24534 Neumünster, Germany. oc@mycable.de, Tel: +49 4321 559 56-22, http://www.mycable.de

Introduction

As more embedded devices are connected to each other and to the internet, the need for secure transmission is increasing steadily. Not only industrial, telecom and network devices, but also consumer and automotive embedded devices are being connected to networks. Establishing encrypted VPN connections with IPsec, for example, puts a heavy load on current embedded processors. Technologies such as DRM (Digital Rights Management) also require cryptographic operations. The problem with encryption algorithms on embedded processors is that today's processors are scalar and therefore perform poorly on these algorithms. This calls for dedicated cryptography hardware, to which the operating system should offload cryptographic operations. This paper describes the problems encountered while developing a driver to access such devices under Linux 2.6.

Linux Kernel 2.6's Cryptographic API

The Cryptographic API (see Figure 1) is a core part of Linux Kernel 2.6 and was introduced to provide cryptographic functionality to the whole kernel. Several other kernel components, for example cryptoloop and the IPsec stack, call these functions in their cryptographic routines. The Cryptographic API is divided into three parts: the Algorithm API, the Transform Operations (OPS), which are split into ciphers, digests and compressions, and the Transform API.

The Algorithm API provides functions for registering algorithms with the Cryptographic API. Algorithms can be compiled statically or as modules, which call the function crypto_register_alg() (in crypto/api.c) on initialisation. Each algorithm can register with only one facility within the API and is therefore either a cipher, a digest or a compression algorithm.

Clients access these transform operations through the Transform API, which maintains transformation state and handles common logical operations (e.g. HMAC). The available functions are crypto_alg_available() for checking whether the desired algorithm is registered with the API, crypto_alloc_tfm() for allocating a transformation session and crypto_free_tfm() for freeing a session (all in crypto/api.c).

Figure 1: Cryptographic API Block Diagram

All cryptographic operations (transformations) are performed on scatterlists, which are arrays of scatterlist structures. Each scatterlist structure describes a specific memory area by a pointer to the memory page it is located on, its offset within that page and its length. Linux divides its main memory into zones, which are in turn split into equally sized pages. The page size depends on the processor architecture: it is 4 KB on i386 and x86-64, and one of 4 KB, 8 KB, 16 KB or 64 KB on MIPS machines. Scatterlists are defined in include/asm/scatterlist.h.
struct scatterlist {
    struct page *page;
    unsigned int offset;
    dma_addr_t dma_address;
    unsigned int length;
};

For memory directly accessible by other bus masters (DMA memory), the scatterlist structure contains the optional entry dma_address. Scatterlists were introduced to speed up cryptographic operations. The best cryptographic performance is achieved when the data is located on a single page. This ensures that the data is contiguous and does not have to be copied before processing. A scatterlist should furthermore contain an amount of data that is a multiple of the cipher's block size (typically 8 bytes).

Hardware support and the Cryptographic API

Knowing the architecture of the Cryptographic API, implementing hardware cryptography drivers is straightforward. There is just one issue to resolve: the Cryptographic API is a synchronous interface. This means that the kernel waits until a function in the Cryptographic API has finished processing a request. For algorithms implemented in software this may be fine, since the CPU has to do the work anyway. Hardware cryptography chips, however, may take some time to process a request. During this time the CPU has to wait for the Cryptographic API function call to return.

The best way to overcome this problem would be to use acrypto [1], a kernel patch created to allow asynchronous cryptographic operations. It makes the Cryptographic API friendlier to crypto hardware by providing call-back functions and notifying the API upon completion of a request. Additional features include load balancing between multiple hardware chips and software-implemented algorithms, and a priority mechanism for selecting which implementation to use first.

Although the acrypto patch would be the best approach, it has some disadvantages. First, the patch is very large and thus not guaranteed to work with future kernels, which makes it difficult to support drivers for several kernels. Additionally, acrypto's API and internals are subject to change before being included in the kernel. Drivers based on this API would have to be rewritten every time a change is made. Therefore another solution has to be found. In the following, two community patches for the Cryptographic API are presented.

OCF-Linux

OCF-Linux [2] is a port of OpenBSD's Cryptographic Framework (OCF) to Linux. It aims to bring asynchronous hardware cryptography support to Linux Kernel 2.6. Unfortunately, it currently only provides acceleration for OpenSSL and programs that rely on this library. Future plans include direct processing of skbuffs and IPsec. OCF-Linux currently supports only i386, SuperH and ARM platforms.

Eugene Surovegin's Hardware Crypto Patches

Eugene's first patch [3] extends Linux Kernel 2.6's built-in Cryptographic API to support hardware cryptography accelerators. This patch is simple, since it only extends the Cryptographic API's structures and functions to make them aware of hardware cryptography chips. His second patch is somewhat more complicated. It changes the function esp_hmac_digest() to be friendlier to cryptography hardware by allowing multiple blocks of data to be processed at a time. This increases performance significantly if the crypto hardware supports it.

Eugene's Hardware Crypto Patches appear best suited for adding hardware crypto support to Linux Kernel 2.6 quickly. They allow writing a driver that plugs into the existing Cryptographic API, enabling hardware cryptography support throughout the whole kernel. They also allow writing a modular driver that can easily be adapted to (and enhanced for) acrypto when it becomes available.

Example implementation on AMD's Au1550 processor

This section presents an example solution for AMD's Au1550 processor. This processor includes a so-called security engine, which supports the encryption algorithms DES, 3DES, AES and ARC-4 and the hash algorithms MD5 and SHA-1 in hardware. It also features a DMA engine and an interrupt controller. The example can easily be ported to other hardware.

First, Eugene Surovegin's hardware cryptography patches have to be applied to a current kernel (Linux-mips version 2.6.5 rc4 in this case). The Cryptographic API is then ready to support hardware cryptography chips. Registration with the Cryptographic API is easy. First, a structure describing the module's capabilities has to be created:
static struct crypto_alg des3_ede_alg = {
    .cra_name      = "des3_ede",
    .cra_flags     = CRYPTO_ALG_TYPE_CIPHER,
    .cra_blocksize = DES3_EDE_BLOCK_SIZE,
    .cra_ctxsize   = sizeof(struct au1550_crypto_cipher_ctx),
    .cra_module    = THIS_MODULE,
    .cra_list      = LIST_HEAD_INIT(des3_ede_alg.cra_list),
    .cra_u         = {
        .cipher = {
            .cia_min_keysize = DES3_EDE_KEY_SIZE,
            .cia_max_keysize = DES3_EDE_KEY_SIZE,
            .cia_setkey      = au1550_crypto_des3_ede_setkey,
            .cia_encrypt     = au1550_crypto_des3_ede_encrypt,
            .cia_decrypt     = au1550_crypto_des3_ede_decrypt
        }
    }
};

This structure is then passed to the Cryptographic API by calling crypto_register_alg() during module initialisation. The Cryptographic API places crypto requests by calling one of the module's registered functions. In these functions the request is converted into a (proprietary) request packet that the security engine understands. This request packet is then transmitted to the security engine, which fetches the payload from main memory via DMA, performs the requested cryptographic operations, writes the result back to main memory via DMA and raises an interrupt on the CPU. The driver goes to sleep after handing the request over to the security engine and resumes processing at that point once the request has completed. This approach blocks the Cryptographic API for the duration of the request, but since processing is synchronous anyway, this issue can be neglected.

Performance

To show that the driver works correctly, performance tests were carried out. Figure 2 shows the test network used for benchmarking. Two private networks, 10.0.0.0/8 on the left and 192.168.128.0/24 on the right, are connected via two routers. The left router is a regular personal computer with an Intel Pentium 4 1.3 GHz CPU, 512 MB RAM and two 100 Mbit/s network interfaces; the right router is AMD's DBAu1550 development board. The computers in the private networks have their routing tables set up to reach the other network.

Figure 2: Test network

The two test machines (10.0.0.2 and 192.168.128.71) are ordinary personal computers as well. They run standard SuSE Linux 9.2 installations without any netfilter rules and with their X servers shut down. Cron jobs and unnecessary servers were disabled too. The same applies to the router PC (10.0.0.1/5.0.0.1). Both routers used the then-current version of the IPsec-Tools (0.6.3). Since no hardware network testing equipment was available for these benchmarks, software tools were used instead. FTP transfers of random data were measured to obtain a general overview of throughput at the end-user level (excluding protocol overhead).

Figure 3: FTP throughput

FTP throughput results are shown in Figure 3. They are divided into unencrypted, IPsec (software) and IPsec (hardware) connections. Unencrypted communication reached a throughput of 63.65 Mbit/s, software encryption 4.91 Mbit/s and hardware-supported encryption 24.32 Mbit/s.

Figure 4: CPU load IPsec

Figure 4 shows the results of the CPU load measurements. The first values show the connection establishment phase, where CPU loads are small. The results have to be split into two independent measurements: software-only connections and hardware-supported ones on the DBAu1550. For software-only connections the CPU load is about 31% on the router PC and almost 100% on the DBAu1550. For hardware-supported connections it drops to about 52% on the DBAu1550 but increases to about 74% on the PC. This is because the PC now has to perform its software encryption faster, since the DBAu1550 increases its encryption speed by using the security engine. Since both values for the PC are below 90%, the PC can be excluded as the bottleneck. With hardware-supported encryption, CPU load decreases significantly on the DBAu1550.

Conclusion

The benchmark shows a significant increase in performance when using hardware encryption support for IPsec links. The performance boost with hardware-supported encryption is 415% compared to pure software encryption. Additionally, CPU load on the embedded processor drops from 100% to 52%, leaving processing power for other tasks. All this is achieved with a synchronous Cryptographic API. With an asynchronous interface, the values should improve even further.

References

[1] Evgeniy Polyakov: Asynchronous Crypto Layer [http://lists.logix.cz/pipermail/cryptoapi/2004/000163.html]
[2] David McCullough: OCF-Linux [http://ocf-linux.sourceforge.net/]
[3] Eugene Surovegin: HW Crypto Patches [http://kernel.ebshome.net/]

- All trademarks herein before mentioned are the property of their respective owners. -
