Efficient Parallel Implementation of CTR Mode of ARX-Based Block Ciphers on ARMv8 Microcontrollers
Abstract
1. Introduction
Contributions
- First parallel implementation of CTR mode on embedded devices using ARMv8 architecture: Until now, CTR mode optimization has been conducted only on 8-bit AVR, Intel Core i2, and Intel Core i7 [4,5,6,7,8]. However, there are many types of IoT platforms. Among them, ARMv8 is widely used in various IoT devices, and it supports a NEON engine capable of efficient parallel processing. Therefore, we present the first parallel implementation of CTR mode with ARX-based block ciphers on the ARMv8 architecture. For the parallel implementation, we present not only an efficient data parallelism technique but also a register scheduling that maximizes the use of vector registers so that multiple encryptions are processed simultaneously. Finally, we apply the proposed parallelism technique to the CTR mode optimization, which precomputes the initial few rounds using the fixed nonce of the input block. The proposed parallel implementation of CTR mode can be easily applied to other parallel environments such as Single Instruction Multiple Data (SIMD) and Advanced Vector Extensions (AVX2).
- Proposing the efficient data parallelism of ARX-Based Block Ciphers: We propose a parallel implementation of ARX-based block ciphers (LEA, HIGHT, and revised CHAM) that utilizes the NEON engine in the ARMv8 architecture. The proposed parallel technique is more efficient than the existing data parallel processing techniques in [9,10]. In LEA and revised CHAM, we eliminate the separate transpose operations required for data parallelism by using the LD4 and ST4 instructions, which transpose the data while loading it from memory into four vector registers and while storing it from four vector registers into memory. HIGHT processes its round operations in 8-bit units, so it is difficult to apply data parallelism without additional cost. Thus, we present an optimized transpose operation for data parallelism in HIGHT. Furthermore, to perform as many encryptions as possible simultaneously, we present an efficient vector register scheduling. With the proposed data parallelism techniques, 24 encryptions are performed simultaneously in LEA, revised CHAM-128/128, and revised CHAM-128/256, and 48 encryptions are performed simultaneously in HIGHT and revised CHAM-64/128. For HIGHT and revised CHAM, more encryptions are performed simultaneously than in the previous works. As a result, according to the cycles-per-byte comparison tables, the proposed data parallelism outperforms the previous works by up to 3.09% for LEA, 5.26% for HIGHT, and 9.52% for revised CHAM. The proposed data parallelism techniques can also be applied to other lightweight ciphers such as SIMECK [11] and SKINNY [12].
- Presenting the first parallel implementation of CTR mode: We apply the proposed parallel techniques to the CTR mode implementation. The existing works on CTR optimization have been conducted on 8-bit AVR MCUs [6,7]. They utilize the property of CTR mode that the input block consists of a nonce part and a counter part, and that the nonce part stays the same during encryption. Thus, the initial few rounds that depend only on the fixed nonce part can be precomputed. In other words, by precomputing the operations related to the nonce part, performance is improved by looking up precomputed values instead of computing the round operations for up to 4 rounds in LEA, 5 rounds in HIGHT, and 7 rounds in revised CHAM. We extend the optimization concept of the existing works to the proposed parallel implementation of CTR mode. In the case of LEA, the input position of the nonce in CTR mode is not fixed. Thus, by changing the position of the nonce used in the existing CTR mode optimization, we can precompute one more round operation than the previous work [7]. In addition, with the proposed data parallelism, the maximum number of encryptions is performed simultaneously in CTR mode, and this number is the same as in the data parallel implementation. According to the cycles-per-byte comparison tables, the parallel implementation of the CTR mode optimization improves performance by up to 8.76%, 8.62%, and 15.87% in LEA-CTR, HIGHT-CTR, and revised CHAM-CTR, respectively, compared with the previous works.
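The effect of the nonce position described in the last contribution can be illustrated with a small word-level dependency model. This is only a sketch under assumptions that the text does not state: the 128-bit LEA input is treated as four 32-bit words, exactly one word holds the counter, and a round is counted as partly precomputable while at least one of its input words is free of the counter. The names `LEA_DEPS` and `rounds_with_constant_input` are hypothetical.

```python
# Word-level dependency model of one LEA round (illustrative assumption):
# new word i depends on the old words listed in LEA_DEPS[i].
LEA_DEPS = {0: (0, 1), 1: (1, 2), 2: (2, 3), 3: (0,)}


def rounds_with_constant_input(counter_word, deps=LEA_DEPS):
    """Count the leading rounds whose input still has a counter-free word."""
    tainted = {counter_word}          # words that depend on the counter
    rounds = 0
    while len(tainted) < len(deps):   # some input word depends on the nonce only
        rounds += 1
        tainted = {new for new, olds in deps.items()
                   if tainted.intersection(olds)}
    return rounds


if __name__ == "__main__":
    for word in range(4):
        print("counter in word", word, "->", rounds_with_constant_input(word))
```

Under this model the count is 4 when the counter occupies the last word and 3 for any other position, which is consistent with the figure of 4 rounds quoted above for LEA.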
2. Background
2.1. NEON Engine and ASIMD Instructions
2.2. Target Block Ciphers
2.2.1. LEA Block Cipher
2.2.2. HIGHT Block Cipher
2.2.3. Revised CHAM Block Cipher
3. Related Works
Parallel Implementation of Block Ciphers on NEON Engine and CTR Mode Optimization
4. Proposed Parallel Implementation of LEA-CTR, HIGHT-CTR, and Revised CHAM-CTR on ARMv8 Microcontrollers
4.1. Proposed Data Parallelism Technique
4.1.1. LEA Optimization
Algorithm 1 Data parallel implementation of LEA-128 Round Function. |
Require: v0–v3 (Plaintexts), v28–v31 (Roundkeys) |
Ensure: v0–v3 (Ciphertexts) |
Loading the PT, (x0: PT Address) |
1: LD4 {v0.4s-v3.4s}, [x0], #64 |
Loading the RK, (x1: RK Address) |
2: LD1 {v28.4s-v31.4s}, [x1], #64 |
Round Function |
3: EOR v3.16b, v3.16b, v31.16b |
4: EOR v24.16b, v2.16b, v30.16b |
5: ADD v24.4s, v3.4s, v24.4s |
ROR3 Operation |
6: SHL v3.4s, v24.4s, #29 |
7: SRI v3.4s, v24.4s, #3 |
Round Function |
8: EOR v2.16b, v2.16b, v31.16b |
9: EOR v24.16b, v1.16b, v29.16b |
10: ADD v24.4s, v2.4s, v24.4s |
ROR5 Operation |
11: SHL v2.4s, v24.4s, #27 |
12: SRI v2.4s, v24.4s, #5 |
Round Function |
13: EOR v1.16b, v1.16b, v31.16b |
14: EOR v24.16b, v0.16b, v28.16b |
15: ADD v24.4s, v1.4s, v24.4s |
ROL9 Operation |
16: SHL v1.4s, v24.4s, #9 |
17: SRI v1.4s, v24.4s, #23 |
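The SHL/SRI pairs in Algorithm 1 implement 32-bit rotations, since the listing uses no dedicated vector rotate instruction. The following scalar model of one lane is a minimal sketch, not the authors' code: `shl`, `sri`, `ror`, `lea128_round`, and `lea128_round_lanes` are hypothetical helper names, and the key arguments `k0`, `k1`, `k2`, `ks` stand for the contents of v28, v29, v30, and v31.

```python
MASK32 = 0xFFFFFFFF


def shl(src, n):
    """SHL on one 32-bit lane: logical left shift, bits shifted out are lost."""
    return (src << n) & MASK32


def sri(dst, src, n):
    """SRI on one 32-bit lane: insert src >> n, keeping the top n bits of dst."""
    keep = (MASK32 << (32 - n)) & MASK32
    return (dst & keep) | (src >> n)


def ror(x, r):
    """Rotate right by r, written as the SHL/SRI pair used in Algorithm 1."""
    return sri(shl(x, 32 - r), x, r)


def lea128_round(x, k0, k1, k2, ks):
    """One LEA-128 round on one block x = [X0, X1, X2, X3]."""
    x0, x1, x2, x3 = x
    new2 = ror(((x3 ^ ks) + (x2 ^ k2)) & MASK32, 3)    # steps 3-7
    new1 = ror(((x2 ^ ks) + (x1 ^ k1)) & MASK32, 5)    # steps 8-12
    new0 = ror(((x1 ^ ks) + (x0 ^ k0)) & MASK32, 23)   # steps 13-17 (ROL9)
    return [new0, new1, new2, x0]


def lea128_round_lanes(blocks, k0, k1, k2, ks):
    """Apply the same round to every block, as the four .4s lanes do."""
    return [lea128_round(b, k0, k1, k2, ks) for b in blocks]


if __name__ == "__main__":
    print([hex(w) for w in lea128_round([1, 0, 0, 0], 0, 0, 0, 0)])
```

Rotating right by 23 equals rotating left by 9, which is why steps 16 and 17 use the shift amounts 9 and 23.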
4.1.2. HIGHT Optimization
Algorithm 2 Efficient Transpose Operation for HIGHT. |
Require: v0–v7(Plaintexts) |
Ensure: v0–v7(Ciphertexts) |
Transpose operation to be performed before encryption |
1: LD4 {v0.16b-v3.16b}, [x0], #64 |
2: LD4 {v4.16b-v7.16b}, [x0], #64 |
3: MOV v24.16b, v0.16b |
4: TRN1 v0.16b, v0.16b, v4.16b |
5: TRN2 v4.16b, v4.16b, v24.16b |
6: MOV v24.16b, v1.16b |
7: TRN1 v1.16b, v1.16b, v5.16b |
8: TRN2 v5.16b, v5.16b, v24.16b |
9: MOV v24.16b, v2.16b |
10: TRN1 v2.16b, v2.16b, v6.16b |
11: TRN2 v6.16b, v6.16b, v24.16b |
12: MOV v24.16b, v3.16b |
13: TRN1 v3.16b, v3.16b, v7.16b |
14: TRN2 v7.16b, v7.16b, v24.16b |
Transpose operation to be performed after encryption |
15: MOV v24.16b, v4.16b |
16: TRN1 v4.16b, v0.16b, v4.16b |
17: TRN2 v0.16b, v0.16b, v24.16b |
18: MOV v24.16b, v5.16b |
19: TRN1 v5.16b, v1.16b, v5.16b |
20: TRN2 v1.16b, v1.16b, v24.16b |
21: MOV v24.16b, v6.16b |
22: TRN1 v6.16b, v2.16b, v6.16b |
23: TRN2 v2.16b, v2.16b, v24.16b |
24: MOV v24.16b, v7.16b |
25: TRN1 v7.16b, v3.16b, v7.16b |
26: TRN2 v3.16b, v3.16b, v24.16b |
27: ST4 {v0.16b-v3.16b}, [x0], #64 |
28: ST4 {v4.16b-v7.16b}, [x0], #64 |
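The purpose of combining LD4 with TRN1/TRN2 can be checked with a small lane model. The sketch below is not a transcription of Algorithm 2: the helper names are hypothetical, each register is modeled as a list of 16 bytes, and the operand order of every `trn1`/`trn2` call is an assumption chosen so that lane i of all eight registers belongs to the same 8-byte HIGHT block.

```python
def ld4(mem):
    """LD4 with .16b lanes: de-interleave 64 bytes into four registers."""
    return [list(mem[j::4]) for j in range(4)]


def st4(regs):
    """ST4 with .16b lanes: interleave four registers back into 64 bytes."""
    out = []
    for i in range(16):
        out.extend(r[i] for r in regs)
    return out


def trn1(n, m):
    """TRN1: interleave the even-numbered lanes of n and m."""
    out = []
    for i in range(0, len(n), 2):
        out += [n[i], m[i]]
    return out


def trn2(n, m):
    """TRN2: interleave the odd-numbered lanes of n and m."""
    out = []
    for i in range(1, len(n), 2):
        out += [n[i], m[i]]
    return out


def transpose_in(mem):
    """128 bytes (16 blocks) -> 8 registers; register k holds byte k of each block."""
    lo, hi = ld4(mem[:64]), ld4(mem[64:])
    regs = [trn1(lo[j], hi[j]) for j in range(4)]     # bytes 0..3
    regs += [trn2(lo[j], hi[j]) for j in range(4)]    # bytes 4..7
    return regs


def transpose_out(regs):
    """Inverse of transpose_in: 8 registers -> 128 bytes in block order."""
    lo = [trn1(regs[j], regs[j + 4]) for j in range(4)]
    hi = [trn2(regs[j], regs[j + 4]) for j in range(4)]
    return st4(lo) + st4(hi)


if __name__ == "__main__":
    regs = transpose_in(list(range(128)))
    print(regs[0][:4], regs[5][:4])
```

In this model the lanes of every register follow the block order 0, 8, 1, 9, and so on, so a byte-wise round function operates on matching bytes of the same block in all registers.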
Algorithm 3 Data parallel implementation of HIGHT Round Function. |
Require: v0–v7(Plaintexts), v28–v31(Roundkeys) |
Ensure: v0–v7(Ciphertexts) |
Loading the RK, (x1: RK Address) |
1: LD1 {v28.16b-v31.16b}, [x1], #64 |
F1 Function |
2: SHL v24.16b, v7.16b, #3 |
3: SRI v24.16b, v7.16b, #5 |
4: SHL v25.16b, v7.16b, #4 |
5: SRI v25.16b, v7.16b, #4 |
6: EOR v24.16b, v24.16b, v25.16b |
7: SHL v25.16b, v7.16b, #6 |
8: SRI v25.16b, v7.16b, #2 |
9: EOR v24.16b, v24.16b, v25.16b |
Round Function |
10: EOR v24.16b, v24.16b, v28.16b |
11: ADD v6.16b, v6.16b, v24.16b |
F0 Function |
12: SHL v24.16b, v5.16b, #1 |
13: SRI v24.16b, v5.16b, #7 |
14: SHL v25.16b, v5.16b, #2 |
15: SRI v25.16b, v5.16b, #6 |
16: EOR v24.16b, v24.16b, v25.16b |
17: SHL v25.16b, v5.16b, #7 |
18: SRI v25.16b, v5.16b, #1 |
19: EOR v24.16b, v24.16b, v25.16b |
Round Function |
20: ADD v24.16b, v24.16b, v29.16b |
21: EOR v4.16b, v4.16b, v24.16b |
F1 Function |
22: SHL v24.16b, v3.16b, #3 |
23: SRI v24.16b, v3.16b, #5 |
24: SHL v25.16b, v3.16b, #4 |
25: SRI v25.16b, v3.16b, #4 |
26: EOR v24.16b, v24.16b, v25.16b |
27: SHL v25.16b, v3.16b, #6 |
28: SRI v25.16b, v3.16b, #2 |
29: EOR v24.16b, v24.16b, v25.16b |
Round Function |
30: EOR v24.16b, v24.16b, v30.16b |
31: ADD v2.16b, v2.16b, v24.16b |
F0 Function |
32: SHL v24.16b, v1.16b, #1 |
33: SRI v24.16b, v1.16b, #7 |
34: SHL v25.16b, v1.16b, #2 |
35: SRI v25.16b, v1.16b, #6 |
36: EOR v24.16b, v24.16b, v25.16b |
37: SHL v25.16b, v1.16b, #7 |
38: SRI v25.16b, v1.16b, #1 |
39: EOR v24.16b, v24.16b, v25.16b |
Round Function |
40: ADD v24.16b, v24.16b, v31.16b |
41: EOR v0.16b, v0.16b, v24.16b |
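Each group of SHL/SRI/EOR instructions in Algorithm 3 evaluates one of the two HIGHT auxiliary functions on 8-bit lanes: rotations by 1, 2, and 7 give F0, and rotations by 3, 4, and 6 give F1. A minimal single-lane sketch follows; the function names are hypothetical, and the list indices mirror the register numbers of the listing (v0 to v7, with `rk[0..3]` standing for v28 to v31).

```python
MASK8 = 0xFF


def rotl8(x, r):
    """Rotate an 8-bit value left by r (the SHL #r / SRI #(8 - r) pair)."""
    return ((x << r) | (x >> (8 - r))) & MASK8


def f0(x):
    """HIGHT F0: ROL1 xor ROL2 xor ROL7."""
    return rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 7)


def f1(x):
    """HIGHT F1: ROL3 xor ROL4 xor ROL6."""
    return rotl8(x, 3) ^ rotl8(x, 4) ^ rotl8(x, 6)


def hight_round_in_place(v, rk):
    """One round as laid out in Algorithm 3, for a single byte lane."""
    v = list(v)
    v[6] = (v[6] + (f1(v[7]) ^ rk[0])) & MASK8       # steps 2-11
    v[4] = v[4] ^ ((f0(v[5]) + rk[1]) & MASK8)       # steps 12-21
    v[2] = (v[2] + (f1(v[3]) ^ rk[2])) & MASK8       # steps 22-31
    v[0] = v[0] ^ ((f0(v[1]) + rk[3]) & MASK8)       # steps 32-41
    return v


if __name__ == "__main__":
    print([hex(b) for b in hight_round_in_place([0, 1, 0, 1, 0, 1, 0, 1], [0, 0, 0, 0])])
```

The byte rotation between rounds is not modeled here; as in the listing, it is assumed to be handled by renaming the registers in the following round.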
4.1.3. Revised CHAM Optimization
Algorithm 4 Data parallel implementation of revised CHAM-64/128 Round Function. |
Require: v0–v3(Plaintexts), v27(Counter), v28–v31(Roundkeys) |
Ensure: v0–v3(Ciphertexts) |
Loading the PT and RK, (x0: PT Address, x1: RK Address) |
1: LD4 {v0.8h-v3.8h}, [x0], #64 |
2: LD1 {v28.8h-v31.8h}, [x1], #64 |
Counter ⊕ PT |
3: EOR v0.16b, v0.16b, v27.16b |
ROL1 Operation |
4: SHL v24.8h, v1.8h, #1 |
5: SRI v24.8h, v1.8h, #15 |
Roundkey ⊕ Step 5 |
6: EOR v24.16b, v24.16b, v28.16b |
Step 3 ⊞ Step 6 |
7: ADD v0.8h, v24.8h, v0.8h |
ROL8 Operation |
8: REV16 v0.16b, v0.16b |
Adding 1 to the counter value |
9: MOVi v24.8h, #1 |
10: ADD v27.8h, v27.8h, v24.8h |
Counter ⊕ PT |
11: EOR v1.16b, v1.16b, v27.16b |
ROL8 Operation |
12: REV16 v24.16b, v2.16b |
Roundkey ⊕ Step 12 |
13: EOR v24.16b, v24.16b, v29.16b |
Step 11 ⊞ Step 13 |
14: ADD v24.8h, v24.8h, v1.8h |
ROL1 Operation |
15: SHL v1.8h, v24.8h, #1 |
16: SRI v1.8h, v24.8h, #15 |
Adding 1 to the counter value |
17: MOVi v24.8h, #1 |
18: ADD v27.8h, v27.8h, v24.8h |
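Algorithm 4 covers one even and one odd round of revised CHAM-64/128 on 16-bit lanes, with REV16 serving as the rotation by 8. The scalar model below is a sketch under the assumption that the listing follows the standard CHAM round structure; `rotl16`, `rev16`, and `cham64_round_pair` are hypothetical names, `i` is the round counter held in v27, and the results are written in place as in the listing.

```python
MASK16 = 0xFFFF


def rotl16(x, r):
    """Rotate a 16-bit value left by r."""
    return ((x << r) | (x >> (16 - r))) & MASK16


def rev16(x):
    """REV16 on one halfword: swap its two bytes, i.e. rotate by 8."""
    return ((x & 0xFF) << 8) | (x >> 8)


def cham64_round_pair(x, rk_even, rk_odd, i):
    """Rounds i (even) and i + 1 (odd) on one block x = [X0, X1, X2, X3]."""
    x0, x1, x2, x3 = x
    # Even round: ROL8((X0 xor i) + (ROL1(X1) xor rk)), steps 3-8.
    t = ((x0 ^ i) + (rotl16(x1, 1) ^ rk_even)) & MASK16
    x0 = rev16(t)
    # Odd round: ROL1((X1 xor (i + 1)) + (ROL8(X2) xor rk)), steps 11-16.
    t = ((x1 ^ (i + 1)) + (rev16(x2) ^ rk_odd)) & MASK16
    x1 = rotl16(t, 1)
    return [x0, x1, x2, x3]


if __name__ == "__main__":
    print([hex(w) for w in cham64_round_pair([1, 0, 0, 0], 0, 0, 0)])
```

Because a byte swap within a halfword is the same as a 16-bit rotation by 8, a single REV16 replaces the SHL/SRI pair that the rotation by 1 requires.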
4.2. Parallel Implementation of CTR Mode of Operation on the NEON Engine
4.2.1. LEA-CTR Optimization
Algorithm 5 Parallel implementation of LEA-128 CTR mode optimization. |
Require: v0–v3(Plaintexts), v26–v28(Table), v29–v31(Roundkeys) |
Ensure: v0–v3(Ciphertexts) |
Loading the RK and the table, (x1: RK Address, x2: Pre-computation table Address) |
1: LD1 {v29.4s-v31.4s}, [x1], #48 |
2: LD1 {v26.4s-v28.4s}, [x2], #48 |
1 Round Function |
3: EOR v3.16b, v3.16b, v29.16b |
4: ADD v24.4s, v3.4s, v26.4s |
5: SHL v3.4s, v24.4s, #29 |
6: SRI v3.4s, v24.4s, #3 |
2 Round Function |
7: EOR v24.16b, v3.16b, v30.16b |
8: ADD v24.4s, v24.4s, v27.4s |
9: SHL v0.4s, v24.4s, #29 |
10: SRI v0.4s, v24.4s, #3 |
11: EOR v24.16b, v3.16b, v31.16b |
12: ADD v24.4s, v24.4s, v28.4s |
13: SHL v3.4s, v24.4s, #27 |
14: SRI v3.4s, v24.4s, #5 |
Loading the RK and table |
15: LD1 {v29.4s-v31.4s}, [x1], #48 |
16: LD1 {v26.4s-v28.4s}, [x2], #48 |
3 Round Function |
17: EOR v24.16b, v0.16b, v30.16b |
18: ADD v24.4s, v24.4s, v26.4s |
19: SHL v1.4s, v24.4s, #29 |
20: SRI v1.4s, v24.4s, #3 |
21: EOR v0.16b, v0.16b, v31.16b |
22: EOR v24.16b, v3.16b, v29.16b |
23: ADD v24.4s, v24.4s, v0.4s |
24: SHL v0.4s, v24.4s, #27 |
25: SRI v0.4s, v24.4s, #5 |
26: EOR v3.16b, v3.16b, v31.16b |
27: ADD v24.4s, v27.4s, v3.4s |
28: SHL v3.4s, v24.4s, #9 |
29: SRI v3.4s, v24.4s, #23 |
Loading the RK |
30: LD1 {v28.4s-v31.4s}, [x1], #64 |
4 Round Function |
31: EOR v24.16b, v1.16b, v30.16b |
32: ADD v24.4s, v24.4s, v28.4s |
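The first round of Algorithm 5 touches only the register that holds the counter and adds a table value to it. The sketch below reproduces that idea for round 1 on a single block, assuming the counter occupies the last 32-bit word and the nonce the first three. The names `precompute_round1` and `round1_with_table` are hypothetical, and the table layout is illustrative rather than the authors' actual format.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, r):
    """Rotate a 32-bit value left by r."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def rotr32(x, r):
    """Rotate a 32-bit value right by r."""
    return rotl32(x, 32 - r)


def lea_round(x, rk):
    """Reference LEA round; rk holds the six round-key words of this round."""
    x0, x1, x2, x3 = x
    return [
        rotl32(((x0 ^ rk[0]) + (x1 ^ rk[1])) & MASK32, 9),
        rotr32(((x1 ^ rk[2]) + (x2 ^ rk[3])) & MASK32, 5),
        rotr32(((x2 ^ rk[4]) + (x3 ^ rk[5])) & MASK32, 3),
        x0,
    ]


def precompute_round1(nonce, rk):
    """Compute once per nonce everything in round 1 that ignores the counter."""
    n0, n1, n2 = nonce
    const = lea_round([n0, n1, n2, 0], rk)   # words 0, 1, 3 do not use X3
    return {"new0": const[0], "new1": const[1], "new3": const[3],
            "addend": n2 ^ rk[4]}


def round1_with_table(table, counter, rk):
    """Per-block work in round 1: one XOR, one addition, one rotation."""
    new2 = rotr32((table["addend"] + (counter ^ rk[5])) & MASK32, 3)
    return [table["new0"], table["new1"], new2, table["new3"]]


if __name__ == "__main__":
    rk = [0x11111111, 0x22222222, 0x33333333, 0x22222222, 0x44444444, 0x22222222]
    nonce = [0x01234567, 0x89ABCDEF, 0x0F1E2D3C]   # arbitrary example values
    table = precompute_round1(nonce, rk)
    print(round1_with_table(table, 7, rk) == lea_round(nonce + [7], rk))
```

The same reasoning extends to the following rounds, where the number of counter-dependent words grows until no operand can be taken from the table any more.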
4.2.2. HIGHT-CTR Optimization
Algorithm 6 Parallel implementation of HIGHT-CTR mode optimization. |
Require: v0–v7(Plaintexts), v24–v31(Roundkeys and Table) |
Ensure: v0–v7(Ciphertexts) |
Loading RK and the table, (x1: RK Address, x2: pre-computation table Address) |
1: LD1 {v30.16b-v31.16b}, [x1], #32 |
2: LD1 {v28.16b-v29.16b}, [x2], #32 |
2 Round Function |
3: ADD v7.16b, v7.16b, v28.16b |
4: SHL v24.16b, v4.16b, #3 |
5: SRI v24.16b, v4.16b, #5 |
6: SHL v25.16b, v4.16b, #4 |
7: SRI v25.16b, v4.16b, #4 |
8: EOR v24.16b, v24.16b, v25.16b |
9: SHL v25.16b, v4.16b, #6 |
10: SRI v25.16b, v4.16b, #2 |
11: EOR v24.16b, v24.16b, v25.16b |
12: EOR v24.16b, v31.16b, v24.16b |
13: ADD v3.16b, v29.16b, v24.16b |
Loading RK |
14: LD1 {v29.16b-v31.16b}, [x1], #48 |
3 Round Function |
15: SHL v24.16b, v3.16b, #1 |
16: SRI v24.16b, v3.16b, #7 |
17: SHL v25.16b, v3.16b, #2 |
18: SRI v25.16b, v3.16b, #6 |
19: EOR v24.16b, v24.16b, v25.16b |
20: SHL v25.16b, v3.16b, #7 |
21: SRI v25.16b, v3.16b, #1 |
22: EOR v24.16b, v24.16b, v25.16b |
23: ADD v24.16b, v31.16b, v24.16b |
Loading table |
24: LD1 {v30.16b-v31.16b}, [x2], #32 |
25: EOR v2.16b, v24.16b, v30.16b |
Loading RK |
26: LD1 {v28.16b-v30.16b}, [x1], #48 |
4 Round Function |
27: SHL v24.16b, v2.16b, #3 |
28: SRI v24.16b, v2.16b, #5 |
29: SHL v25.16b, v2.16b, #4 |
30: SRI v25.16b, v2.16b, #4 |
31: EOR v24.16b, v24.16b, v25.16b |
32: SHL v25.16b, v2.16b, #6 |
33: SRI v25.16b, v2.16b, #2 |
34: EOR v24.16b, v24.16b, v25.16b |
35: EOR v24.16b, v30.16b, v24.16b |
36: ADD v1.16b, v31.16b, v24.16b |
Loading table |
37: LD1 {v30.16b-v31.16b}, [x2] |
38: EOR v7.16b, v7.16b, v30.16b |
Loading RK |
39: LD1 {v28.16b-v31.16b}, [x1], #64 |
5 Round Function |
40: SHL v24.16b, v1.16b, #1 |
41: SRI v24.16b, v1.16b, #7 |
42: SHL v25.16b, v1.16b, #2 |
43: SRI v25.16b, v1.16b, #6 |
44: EOR v24.16b, v24.16b, v25.16b |
45: SHL v25.16b, v1.16b, #7 |
46: SRI v25.16b, v1.16b, #1 |
47: EOR v24.16b, v24.16b, v25.16b |
48: ADD v24.16b, v31.16b, v24.16b |
4.2.3. Revised CHAM-CTR Optimization
Algorithm 7 Parallel implementation of the revised CHAM-64/128-CTR mode optimization. |
Require: v0–v3(Plaintexts), v28–v31(Roundkeys and Table) |
Ensure: v0–v3(Ciphertexts) |
Loading RK (x1: RK Address) |
1: LD1 {v29.8h}, [x1], #16 |
Loading the table (x2: pre-computation table Address) |
2: LD1 {v28.8h-v29.8h}, [x2], #32 |
2 Round Function |
3: EOR v1.16b, v27.16b, v1.16b |
4: ADD v24.8h, v1.8h, v28.8h |
5: SHL v1.8h, v24.8h, #1 |
6: SRI v1.8h, v24.8h, #15 |
7: MOVi v29.8h, #1 |
8: ADD v27.8h, v27.8h, v29.8h |
3 Round Function |
9: MOVi v29.8h, #1 |
10: ADD v27.8h, v27.8h, v29.8h |
Loading RK |
11: LD1 {v30.8h-v31.8h}, [x1], #32 |
4 Round Function |
12: REV16 v24.16b, v0.16b |
13: EOR v24.16b, v30.16b, v24.16b |
14: ADD v24.8h, v29.8h, v24.8h |
15: SHL v3.8h, v24.8h, #1 |
16: SRI v3.8h, v24.8h, #15 |
17: MOVi v29.8h, #1 |
18: ADD v27.8h, v27.8h, v29.8h |
Loading the table |
19: LD1 {v30.8h-v31.8h}, [x2] |
6 Round Function |
20: EOR v1.16b, v27.16b, v1.16b |
21: ADD v24.8h, v1.8h, v30.8h |
22: SHL v1.8h, v24.8h, #1 |
23: SRI v1.8h, v24.8h, #15 |
24: MOVi v29.8h, #1 |
25: ADD v27.8h, v27.8h, v29.8h |
Loading RK |
26: LD1 {v28.8h-v29.8h}, [x1], #32 |
7 Round Function |
27: SHL v24.8h, v3.8h, #1 |
28: SRI v24.8h, v3.8h, #15 |
29: EOR v24.16b, v24.16b, v28.16b |
30: ADD v24.8h, v24.8h, v31.8h |
31: REV16 v2.16b, v24.16b |
32: MOVi v29.8h, #1 |
33: ADD v27.8h, v27.8h, v29.8h |
5. Evaluation
5.1. Parallel Implementation of LEA-CTR Mode on the NEON Engine
5.2. Parallel Implementation of HIGHT-CTR and Revised CHAM-CTR Mode on the NEON Engine
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Hong, D.; Lee, J.K.; Kim, D.C.; Kwon, D.; Ryu, K.H.; Lee, D.G. LEA: A 128-Bit Block Cipher for Fast Encryption on Common Processors. In Proceedings of the 14th International Workshop, WISA 2013, Jeju Island, Korea, 19–21 August 2013; Revised Selected Papers. Springer: Cham, Switzerland, 2013; pp. 3–27.
- Hong, D.; Sung, J.; Hong, S.; Lim, J.; Lee, S.; Koo, B.S.; Lee, C.; Chang, D.; Lee, J.; Jeong, K.; et al. HIGHT: A new block cipher suitable for low-resource device. In Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Yokohama, Japan, 10–13 October 2006; pp. 46–59.
- Roh, D.; Koo, B.; Jung, Y.; Jeong, I.W.; Lee, D.G.; Kwon, D. Revised Version of Block Cipher CHAM. In Information Security and Cryptology—ICISC 2019, Proceedings of the 22nd International Conference, Seoul, Korea, 4–6 December 2019; Revised Selected Papers; Springer: Cham, Switzerland, 2019; pp. 1–19.
- Park, J.H.; Lee, D.H. FACE: Fast AES CTR mode Encryption Techniques based on the Reuse of Repetitive Data. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2018, 469–499.
- Kim, K.; Choi, S.; Kwon, H.; Liu, Z.; Seo, H. FACE–LIGHT: Fast AES–CTR Mode Encryption for Low-End Microcontrollers. In Information Security and Cryptology—ICISC 2019, Proceedings of the 22nd International Conference, Seoul, Korea, 4–6 December 2019; Revised Selected Papers; Springer: Cham, Switzerland, 2019; pp. 102–114.
- Kwon, H.; An, S.; Kim, Y.; Kim, H.; Choi, S.J.; Jang, K.; Park, J.; Kim, H.; Seo, S.C.; Seo, H. Designing a CHAM Block Cipher on Low-End Microcontrollers for Internet of Things. Electronics 2020, 9, 1548.
- Kim, Y.; Kwon, H.; An, S.; Seo, H.; Seo, S.C. Efficient Implementation of ARX-Based Block Ciphers on 8-Bit AVR Microcontrollers. Electronics 2020, 8, 1837.
- Kim, Y.; Seo, S.C. An Efficient Implementation of AES on 8-Bit AVR-Based Sensor Nodes. In Information Security Applications; Springer International Publishing: Cham, Switzerland, 2020; pp. 276–290.
- Seo, H. High Speed Implementation of LEA on ARMv8. J. Korea Inst. Inf. Commun. Eng. 2017, 21, 1929–1934.
- Song, J.; Seo, S.C. Secure and Fast Implementation of ARX-Based Block Ciphers Using ASIMD Instructions in ARMv8 Platforms. IEEE Access 2020, 8, 193138–193153.
- Yang, G.; Zhu, B.; Suder, V.; Aagaard, M.D.; Gong, G. The Simeck Family of Lightweight Block Ciphers. In Cryptographic Hardware and Embedded Systems—CHES 2015, Proceedings of the 17th International Workshop, Saint-Malo, France, 13–16 September 2015; Springer: Berlin/Heidelberg, Germany, 2015; Volume 9293, pp. 307–329.
- Beierle, C.; Jean, J.; Kölbl, S.; Leander, G.; Moradi, A.; Peyrin, T.; Sasaki, Y.; Sasdrich, P.; Sim, S.M. The SKINNY Family of Block Ciphers and Its Low-Latency Variant MANTIS. In Advances in Cryptology—CRYPTO 2016, Proceedings of the 36th Annual International Cryptology Conference, Santa Barbara, CA, USA, 14–18 August 2016; Lecture Notes in Computer Science; Proceedings, Part II; Robshaw, M., Katz, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2016; Volume 9815, pp. 123–153.
- Arm® A64 Instruction Set Architecture: Armv8, for Armv8-A Architecture Profile. Available online: https://developer.arm.com/docs/ddi0596/c/simd-and-floating-point-instructions-alphabetic-order (accessed on 2 February 2021).
- ISO. ISO/IEC 29192-2: 2019: Information Security—Lightweight Cryptography—Part 2: Block Ciphers; International Organization for Standardization: Geneva, Switzerland, 2019.
- ISO. ISO/IEC 18033-3: 2010: Information Technology—Security Techniques—Encryption Algorithms—Part 3: Block Ciphers; International Organization for Standardization: Geneva, Switzerland, 2010.
- Koo, B.; Roh, D.; Kim, H.; Jung, Y.; Lee, D.G.; Kwon, D. CHAM: A Family of Lightweight Block Ciphers for Resource-Constrained Devices. In Proceedings of the International Conference on Information Security and Cryptology (ICISC’17), Seoul, Korea, 29 November–1 December 2017.
- Beaulieu, R.; Shors, D.; Smith, J.; Treatman-Clark, S.; Weeks, B.; Wingers, L. The SIMON and SPECK lightweight block ciphers. In Proceedings of the 52nd Annual Design Automation Conference, San Francisco, CA, USA, 7–11 June 2015; pp. 175:1–175:6.
- Seo, H.; Park, T.; Heo, S.; Seo, G.; Bae, B.; Hu, Z.; Zhou, L.; Nogami, Y.; Zhu, Y.; Kim, H. Parallel Implementations of LEA, Revisited. In Information Security Applications—WISA 2016, Proceedings of the 17th International Workshop, Jeju Island, Korea, 25–27 August 2016; Revised Selected Papers; Springer: Cham, Switzerland, 2016; pp. 318–330.
- Seo, H.; An, K.; Kwon, H.; Park, T.; Hu, Z.; Kim, H. Parallel Implementations of CHAM. In Information Security Applications—WISA 2018, Proceedings of the 19th International Conference, Jeju Island, Korea, 23–25 August 2018; Revised Selected Papers; Springer: Cham, Switzerland, 2018; pp. 93–104.
- Fujii, H.; Carvalho Rodrigues, F.; López, J. Fast AES Implementation Using ARMv8 ASIMD Without Cryptography Extension. In Information Security and Cryptology—ICISC 2019, Proceedings of the 22nd International Conference, Seoul, Korea, 4–6 December 2019; Revised Selected Papers; Springer: Cham, Switzerland, 2019; pp. 84–101.
- Raspberry Pi 4B Specification. Available online: https://www.raspberrypi.org/products/raspberry-pi-4-model-b/specifications/ (accessed on 2 February 2021).
Instructions | Operands | Description | Cycles |
---|---|---|---|
ADD | Vd, Vn, Vm | Vector addition (Vd = Vn + Vm) | 1 |
SHL | Vd, Vn, #shift | Vector left shift | 1 |
SRI | Vd, Vn, #shift | Vector right shift and insert | 2 |
TRN1 | Vd, Vn, Vm | Vector transpose (primary) | 1 |
TRN2 | Vd, Vn, Vm | Vector transpose (secondary) | 1 |
REV16 | Vd, Vn | Vector reverse elements in 16-bit halfwords (reverse) | 1 |
LD4 | {Vt-Vt4}, [Xn] | Loading data from memory to 4 vector registers by applying the transpose operation | 4 |
ST4 | {Vt-Vt4}, [Xn] | Storing data from 4 vector registers to memory by applying the transpose operation | 4 |
Cipher | Block size (bits) | Key size (bits) | Rounds | Word size (bits) |
---|---|---|---|---|
LEA-128 | 128 | 128 | 24 | 32 |
LEA-192 | 128 | 192 | 28 | 32 |
LEA-256 | 128 | 256 | 32 | 32 |
Cipher | Block size (bits) | Key size (bits) | Rounds | Word size (bits) |
---|---|---|---|---|
CHAM-64/128 | 64 | 128 | 80 | 16 |
CHAM-128/128 | 128 | 128 | 80 | 32 |
CHAM-128/256 | 128 | 256 | 96 | 32 |
Methods | Target Device | Target Block Cipher | #Data Parallelism |
---|---|---|---|
Seo et al. [18] | Cortex-A9 (ARMv7) | LEA | 12 |
Seo et al. [19] | Cortex-A53 (ARMv8) | CHAM-64/128 | 24 |
Seo Hwajeong [9] | Apple A7 and Apple A9 (ARMv8) | LEA | 24 |
H. Fujii et al. [20] | Cortex-A53 (ARMv8) | AES | 4 |
Song et al. [10] | Cortex-A72 (ARMv8) | HIGHT | 24 |
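Song et al. [10] | Cortex-A72 (ARMv8) | Revised CHAM-64/128 | 16 |
Song et al. [10] | Cortex-A72 (ARMv8) | Revised CHAM-128/128 | 10 |
Song et al. [10] | Cortex-A72 (ARMv8) | Revised CHAM-128/256 | 8 |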
Work | Suggested Structure | #Data Parallelism | Cpb | Improvement |
---|---|---|---|---|
LEA-128 [9] | ECB mode | 24 | 3.88 | - |
LEA-192 [9] | ECB mode | 24 | 4.69 | - |
LEA-256 [9] | ECB mode | 24 | 5.32 | - |
LEA-128 (Our Work 1) | ECB mode | 24 | 3.76 | 3.09% |
LEA-192 (Our Work 1) | ECB mode | 24 | 4.56 | 2.77% |
LEA-256 (Our Work 1) | ECB mode | 24 | 5.18 | 2.63% |
LEA-128 (Our Work 2) | CTR mode (Our Work) | 24 | 3.54 | 8.76% |
LEA-192 (Our Work 2) | CTR mode (Our Work) | 24 | 4.39 | 6.39% |
LEA-256 (Our Work 2) | CTR mode (Our Work) | 24 | 5.01 | 5.82% |
Work | Suggested Structure | #Data Parallelism | Cpb | Improvement |
---|---|---|---|---|
HIGHT-64/128 [10] | ECB mode | 24 | 8.35 | - |
Revised CHAM-64/128 [10] | ECB mode | 16 | 6.3 | - |
Revised CHAM-128/128 [10] | ECB mode | 10 | 9.85 | - |
Revised CHAM-128/256 [10] | ECB mode | 8 | 10.81 | - |
HIGHT-64/128 (Our Work 1) | ECB mode | 48 | 7.91 | 5.26% |
HIGHT-64/128 (Our Work 2) | CTR mode [7] | 48 | 7.63 | 8.62% |
Revised CHAM-64/128 (Our Work 1) | ECB mode | 48 | 5.7 | 9.52% |
Revised CHAM-128/128 (Our Work 1) | ECB mode | 24 | 9.71 | 1.52% |
Revised CHAM-128/256 (Our Work 1) | ECB mode | 24 | 10.38 | 4.02% |
Revised CHAM-64/128 (Our Work 2) | CTR mode [6] | 48 | 5.3 | 15.87% |
Revised CHAM-128/128 (Our Work 2) | CTR mode [6] | 24 | 9.56 | 2.94% |
Revised CHAM-128/256 (Our Work 2) | CTR mode [6] | 24 | 10.23 | 5.36% |
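The Improvement column in the two comparison tables appears to be the relative reduction in cycles per byte (Cpb) against the ECB result of the previous work for the same cipher. This reading is inferred from the table values and is not stated explicitly in the text. The helper name `improvement_percent` is hypothetical.

```python
def improvement_percent(baseline_cpb, new_cpb):
    """Relative reduction in cycles per byte, in percent."""
    return (baseline_cpb - new_cpb) / baseline_cpb * 100.0


if __name__ == "__main__":
    # LEA-128: previous work [9] in ECB mode vs. the CTR mode result.
    print(improvement_percent(3.88, 3.54))
    # Revised CHAM-64/128: previous work [10] in ECB mode vs. the CTR mode result.
    print(improvement_percent(6.3, 5.3))
```

Most rows agree with this formula to two decimals; a few differ in the last digits, possibly because the published Cpb values are themselves rounded.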
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Song, J.; Seo, S.C. Efficient Parallel Implementation of CTR Mode of ARX-Based Block Ciphers on ARMv8 Microcontrollers. Appl. Sci. 2021, 11, 2548. https://doi.org/10.3390/app11062548