
8/28/19


Introduction to Post-Quantum Cryptography

CERG: Cryptographic Engineering Research Group

3 faculty members, 9 Ph.D. students,

6 affiliated scholars

Previous Cryptographic Contests

[Figure: timeline of cryptographic contests, 1997-2017]

• AES (IX.1997 – X.2000): 15 block ciphers → 1 winner
• NESSIE (I.2000 – XII.2002)
• CRYPTREC
• eSTREAM (XI.2004 – IV.2008): 34 stream ciphers → 4 HW winners + 4 SW winners
• SHA-3 (X.2007 – X.2012): 51 hash functions → 1 winner
• CAESAR (I.2013 – II.2019): 57 authenticated ciphers → multiple winners

Cryptographic Contests 2007-Present

[Figure: timeline of cryptographic contests, 2007-2022; legend: completed / in progress]

• SHA-3 (X.2007 – X.2012, completed): 51 hash functions → 1 winner
• CAESAR (I.2013 – II.2019, completed): 57 authenticated ciphers → multiple winners
• Post-Quantum (XII.2016 – TBD, in progress): 69 Public-Key Post-Quantum Cryptography Schemes
• Lightweight (VIII.2018 – TBD, in progress): 56 Lightweight authenticated ciphers & hash functions

Why Cryptographic Contests?

• Avoid back-door theories
• Speed up the acceptance of a new standard
• Stimulate non-classified research on methods of designing a specific cryptographic transformation
• Focus the effort of a relatively small cryptographic community

Evaluation Criteria

• Security
• Software Efficiency: µProcessors, µControllers
• Hardware Efficiency: FPGAs, ASICs
• Simplicity
• Flexibility
• Licensing

AES Contest, 1997-2000: Block Ciphers

[Figure: two panels, "Straw Poll @ NIST AES 3 conference" and "GMU FPGA Results"]

Rijndael was second best in FPGAs and was selected as the winner due to much better performance in software & ASICs.

SHA-3 Round 2: 14 candidates

[Figure: Throughput vs. Area, normalized to results for SHA-2 and averaged over 7 FPGA families. Axes: Area, Throughput. Best: fast & small. Worst: slow & big.]

SHA-3 Contest, 2007-2012: Hash Functions

[Figure: two panels, "ASIC" and "Stratix III FPGA"]

Keccak – an early leader in hardware and winner of the competition

CAESAR, 2013-2018: Authenticated Ciphers

Relative (w.r.t. AES-GCM) Throughput in Virtex 6

in red – algorithms qualified to Round 3

14-16x better than the current standard

AEGIS and MORUS – finalists in the use case High-Performance Applications

Post-Quantum Cryptography (PQC)

• Public-key cryptographic algorithms for which there are no known attacks using quantum computers

• Capable of being implemented using any traditional methods, including software and hardware

• Running efficiently on any modern computing platforms: PCs, tablets, smartphones, servers with FPGA accelerators, etc.

• Term introduced by Dan Bernstein in 2003

• Equivalent terms: quantum-proof, quantum-safe, or quantum-resistant

• Based entirely on traditional semiconductor VLSI technology!

Quantum Computers

• First proposed by physicists (Richard Feynman, David Deutsch) in the 1980s

• First significant quantum algorithms (capable of running on quantum computers only) developed in the 1990s

• First practical realization in 1998 (2 qubits)

• Significant technological advances during the last 20 years

• One of ten breakthrough technologies of 2017

Photo: https://www.technologyreview.com

Source: Vandersypen, PQCrypto 2017

Major advances during the last 20 years

Timeline of Quantum Computing: https://en.wikipedia.org/wiki/Timeline_of_quantum_computing

Quantum Computers

• Substantial investments by: Google, IBM, Intel, Microsoft, Alcatel-Lucent, NTT

• Quantum computers based on superconducting circuits operating at temperatures close to absolute zero (~0.01 K)

• November 2017: IBM’s 50-qubit chip

• January 2018: Intel’s 49-qubit chip, “Tangle Lake”

• March 2018: Google’s 72-qubit chip, “Bristlecone”

Photos: https://www.technologyreview.com

Progress in Quantum Computing

Photos: https://www.technologyreview.com

Remaining Challenges in Quantum Computing

1. High sensitivity to manufacturing variations
   Solution: Best industry cleanrooms, e.g., QuTech-Intel collaboration toward quantum-dot arrays made on Intel 300 mm wafers

2. Scalable control circuits (currently bulky & expensive)
   Solution: Tailored cryo-CMOS digital control

3. Multitude of interconnects and external pins
   Solution: Multiplexing electronics co-integrated with qubits

4. Non-standard architecture & limited programmability
   Solution: System layer approach

These challenges are likely to be overcome in the next 10-15 years.

Source: Vandersypen, PQCrypto 2017

System Layer Approach

Source: Vandersypen, PQCrypto 2017

What Can Quantum Computers Do?

Source: Vandersypen, PQCrypto 2017

Quantum Computers & Cryptography

1994: Shor’s Algorithm breaks major public-key cryptosystems based on

• Factoring: RSA

• Discrete logarithm problem (DLP): DSA, Diffie-Hellman

• Elliptic Curve DLP: Elliptic Curve Cryptosystems

independently of the key size, assuming that a sufficiently powerful and reliable quantum computer is available.

Underlying Mathematical Problem - RSA

N = P * Q (P, Q random primes)

N = 1230186684530117755130494958384962720772853569595334792197322452151726400507263657518745202199786469389956474942774063845925192557326303453731548268507917026122142913461670429214311602221240479274737794080665351419597459856902143413

P = 33478071698956898786044169848212690817704794983713768568912431388982883793878002287614711652531743087737814467999489

Q = 36746043666799590428244633799627952632279158164343087642676032283815739666511279233373417143396810270092798736308917

Record Using Classical Computers: 232 digits, 768 bits
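The factorization above can be checked directly with arbitrary-precision integers. The sketch below only re-multiplies the two factors printed on the slide and measures the size of the product; the helper name `describe` is introduced here for illustration and is not part of the original material.

```python
# Sanity check of the factoring record shown above.
# The digit strings are copied from the slide; each number is split into
# two string literals purely for readability.
P = int(
    "334780716989568987860441698482126908177047949837137685689124313889828837938780022876147116525317"
    "43087737814467999489"
)
Q = int(
    "36746043666799590428244633799627952632279158164343087642676032283815739666511279233373417143396"
    "810270092798736308917"
)
N = int(
    "1230186684530117755130494958384962720772853569595334792197322452151726400507263657518745202199786469389956474942774063845925192557326303453731548268507917026122142913461670429214311602221240"
    "479274737794080665351419597459856902143413"
)


def describe(n: int) -> tuple[int, int]:
    """Return (number of decimal digits, number of bits) of a positive integer."""
    return len(str(n)), n.bit_length()


print("P * Q == N:", P * Q == N)
print("digits, bits of N:", describe(N))
```

Multiplying is instantaneous; recovering P and Q from N alone is the hard direction on classical computers, and it is the direction Shor's algorithm would make feasible on a sufficiently large quantum computer.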


How Real Is the Danger?

Source: Vandersypen, PQCrypto 2017

“There is a 1 in 7 chance that some fundamental public-key crypto will be broken by quantum by 2026, and a 1 in 2 chance of the same by 2031.”
Dr. Michele Mosca, Deputy Director of the Institute for Quantum Computing, University of Waterloo, April 2015

“Theorem” by Mosca

y – Time to Develop & Deploy PQC Standards

x – Time Information Must Remain Protected

z – Time to Build Quantum Computers

If z < y + x, then worry!

Encrypted Data Stored by Powerful Adversaries

No Announcement when Quantum Computer Available to NSA, Foreign Governments, or Organized Crime
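Mosca's inequality can be written as a one-line check. The sketch below is illustrative only: the function name and the sample year values are hypothetical and are not estimates from the talk.

```python
def should_worry(x_years: float, y_years: float, z_years: float) -> bool:
    """Mosca's inequality: worry when z < y + x.

    x_years: time the information must remain protected
    y_years: time to develop and deploy PQC standards
    z_years: time until a large quantum computer is built
    """
    return z_years < y_years + x_years


# Hypothetical numbers: data must stay secret for 10 years, migration
# takes 5 years, and a quantum computer arrives in 12 years.
print(should_worry(x_years=10, y_years=5, z_years=12))  # True, since 12 < 5 + 10
```

The point of the inequality is that data encrypted today can be recorded and decrypted later, so the protection lifetime x counts from now, not from the day a quantum computer appears.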

NIST PQC Standardization Process

Feb. 2016: NIST announcement of standardization plans at PQCrypto 2016, Fukuoka, Japan

Dec. 2016: NIST Call for Proposals and Request for Nominations for Public-Key Post-Quantum Cryptographic Algorithms

Nov. 30, 2017: Deadline for submitting candidates

Dec. 2017: Announcement of the First Round Candidates

Three Types of PQC Schemes

1. Public Key Encryption

2. Digital Signature

3. Key Encapsulation Mechanism (KEM)

Five Security Categories

Level   Security Description
I       At least as hard to break as AES-128 using exhaustive key search
II      At least as hard to break as SHA-256 using collision search
III     At least as hard to break as AES-192 using exhaustive key search
IV      At least as hard to break as SHA-384 using collision search
V       At least as hard to break as AES-256 using exhaustive key search

Leading PQC Families

Family           Encryption/KEM   Signature
Hash-based                        XX
Code-based       XX               X
Lattice-based    XX               X
Multivariate     X                XX
Isogeny-based    X

XX – high-confidence candidates, X – medium-confidence candidates

Underlying Mathematical Problem – PQC

Closest Vector Problem

Lattice in dimension n=2: set of points given by

p = i⋅b1 + j⋅b2

where i and j are arbitrary integers.

Problem: Find the point of the lattice given by the base vectors b1 and b2 that is closest to an arbitrary point t of the n-dimensional space.

Imagine n in the range of 825.
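For n=2 the problem can be solved by simply trying all small coefficient pairs, which makes the definition concrete. The sketch below is a toy exhaustive search under assumed values: the basis vectors, the target point, and the search bound are made up for illustration.

```python
from itertools import product


def closest_lattice_point(b1, b2, t, bound=10):
    """Search all p = i*b1 + j*b2 with |i|, |j| <= bound for the point nearest to t.

    Returns (i, j, p). This is exhaustive search, usable only in tiny dimensions.
    """
    best = None
    best_dist2 = None
    for i, j in product(range(-bound, bound + 1), repeat=2):
        p = (i * b1[0] + j * b2[0], i * b1[1] + j * b2[1])
        dist2 = (p[0] - t[0]) ** 2 + (p[1] - t[1]) ** 2  # squared Euclidean distance
        if best_dist2 is None or dist2 < best_dist2:
            best, best_dist2 = (i, j, p), dist2
    return best


# Hypothetical basis and target point.
i, j, p = closest_lattice_point(b1=(2, 1), b2=(1, 3), t=(4.3, 5.8))
print(i, j, p)  # 2 1 (5, 5)
```

The search space here has (2·bound+1)^2 points; in dimension n it has (2·bound+1)^n, which is why this naive approach is hopeless for the dimensions used in lattice-based cryptography.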

Underlying Mathematical Problem – PQC

Solving a system of m quadratic equations with n unknowns

Imagine m and n in the range of 70 and above
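A toy instance shows what "solving a system of quadratic equations" means in practice. The sketch below assumes the equations are over GF(2), which the slide does not state; the encoding of each equation and the 3-equation example are hypothetical.

```python
from itertools import product


def eval_quadratic(equation, x):
    """Evaluate one quadratic polynomial over GF(2).

    equation = (quadratic_terms, linear_terms, constant), where
    quadratic_terms is a list of index pairs (i, j) standing for x_i*x_j,
    linear_terms is a list of indices i standing for x_i.
    """
    quadratic_terms, linear_terms, constant = equation
    value = constant
    for i, j in quadratic_terms:
        value ^= x[i] & x[j]
    for i in linear_terms:
        value ^= x[i]
    return value


def solve_mq(system, n):
    """Return all x in {0,1}^n for which every polynomial evaluates to 0."""
    return [
        x
        for x in product((0, 1), repeat=n)
        if all(eval_quadratic(eq, x) == 0 for eq in system)
    ]


# Toy system with m = 3 equations and n = 3 unknowns:
#   x0*x1 + x2 + 1 = 0
#   x0*x2 + x1 + 1 = 0
#   x1*x2 + x0 + 1 = 0
system = [
    ([(0, 1)], [2], 1),
    ([(0, 2)], [1], 1),
    ([(1, 2)], [0], 1),
]
print(solve_mq(system, 3))  # [(0, 1, 1), (1, 0, 1), (1, 1, 0)]
```

Exhaustive search costs 2^n evaluations, so it stops being practical long before n reaches the sizes mentioned above.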

Round 1 Candidates

Family          Signature   Encryption/KEM   Overall
Lattice-based   5           22               27
Code-based      2           16               18
Multivariate    7           2                9
Hash-based      2                            2
Isogeny-based               1                1
Other           3           4                7
Total           19          45               64

69 accepted as complete, 5 withdrawn
26 countries, 260 co-authors

Round 1 Submissions

69 Submissions, 26 Countries, 260 co-authors

NIST PQC Standardization Process

Jan. 30, 2019: Announcement of candidates qualified to Round 2

April 10, 2019: Publication of Round 2 submission packages

Aug. 22-24, 2019: Second NIST PQC Conference

2020: Beginning of Round 3 and/or selection of first future standards

2022-2024: Draft standards published


Round 2 Candidates

Family            Signature   Encryption/KEM   Overall
Lattice-based     3           9                12
Code-based                    7                7
Multivariate      4                            4
Symmetric-based   2                            2
Isogeny-based                 1                1
Total             9           17               26

26 candidates announced on January 30, 2019

Round 2 Submissions

Source: Moody, PQCrypto, May 2019

The 2nd Round Candidates

Encryption/KEMs (17)

• Lattice-based: CRYSTALS-KYBER, FrodoKEM, LAC, NewHope, NTRU (merger of NTRUEncrypt/NTRU-HRSS-KEM), NTRU Prime, Round5 (merger of Hila5/Round2), SABER, Three Bears
• Code-based: BIKE, Classic McEliece, HQC, LEDAcrypt (merger of LEDAkem/pkc), NTS-KEM, ROLLO (merger of LAKE/LOCKER/Ouroboros-R), RQC
• Isogenies: SIKE

Digital Signatures (9)

• Lattice-based: CRYSTALS-DILITHIUM, FALCON, qTESLA
• Symmetric-based: Picnic, SPHINCS+
• Multivariate: GeMSS, LUOV, MQDSS, Rainbow

NIST Report on the 1st Round: https://doi.org/10.6028/NIST.IR.8240

Similarities with Previous Contests

• Evaluation to be performed in rounds (12-18 months each)
• A pool of candidates narrowed down after each round
• Small tweaks allowed at the beginning of each round
• Optimized software implementations developed for various platforms
• No immediate plans for obligatory hardware implementations

Differences from Previous Contests

• Candidates not qualified to the next round (and not withdrawn by the authors) may be considered at a later date
• Taking into account quantum attacks, possible only on platforms that do not exist at the time of the standard development
• Security analysis much more challenging and often controversial

Adi Shamir’s Proposal

After the initial evaluation period (e.g., 3 years), the division of all schemes into the following categories:

• 2 Production Schemes: Highly Trusted; recommended for actual wide-scale deployment.
• 4 Development Schemes: Time-Tested, Trusted; at least 15 years of analysis behind them; intended for initial R&D by industry.
• 8 Research Schemes: Promising Properties, Good Performance. May contain some high-risk candidates. Main Goal: Concentrate the effort of the research community.

PQCRYPTO Consortium

• 11 universities and companies
• Funded by the European Commission under the H2020 program
• Initial Recommendations published in 2015
• Co-authors of 22 Submissions

The Most Trusted Schemes – Encryption/KEM

Classic McEliece

• Proposed 40 years ago as an alternative to RSA
• Code-based family
• Based on binary Goppa codes
• No patents
• Conservative parameters (Category 5, 256-bit security):
  a) length n=6960, dimension k=5413, errors=119
  b) length n=8192, dimension k=6528, errors=128
• Complexity of the best attack identical after 40 years of analysis and more than 30 papers devoted to thorough cryptanalysis
• Sizes:
  Public key: a) 1,047,319 bytes, b) 1,357,824 bytes
  Private key: a) 13,908 bytes, b) 14,080 bytes
  Ciphertext: a) 226 bytes, b) 240 bytes
• Efficient Software (Haswell, larger parameter set): 295,930 cycles for encryption, 355,152 cycles for decryption; constant time
• Efficient Hardware (Yale)

The Most Trusted Schemes – Signatures

Hash-based Schemes:

Representatives:

• SPHINCS-256 => SPHINCS+

Security based on the security of a single underlying primitive: hash function

Features:

• Efficient signature generation and verification
• Relatively large keys (~ tens of kilobytes)

Likely Development Schemes – Lattice-based

• NTRUEncrypt – proposed in 1996, published in 1998

• Efficient encryption & decryption

• Relatively small key sizes (kilobytes)

• Suitable for constrained environments

• Standards: IEEE P1363.1-2008, ANSI X9.98-2010, Consortium for Efficient Embedded Security EESS 2015

• No proof of security

Likely Development Schemes – Lattice-based

• New lattice-based schemes with extended security proof, smaller key sizes, and better efficiency

• Underlying problems: Ring-LWE (Learning with Errors), Binary RLWE

• Schemes: NewHope, CRYSTALS-KYBER, CRYSTALS-DILITHIUM

Pilot Hardware Implementations:

• Ruhr University of Bochum, Germany
• Technical University Munich, Germany
• ESAT/COSIC KU Leuven, Belgium
• The University of Texas at Austin, USA
• George Mason University, USA, etc.

PQC Major Challenges

[Figure: Fairness vs. Number of Candidates]

Challenges of Post-Quantum Cryptography

• Mathematical complexity
• Large amount of man-power
• Large keys and internal states
• Hardware resources required
• New types of basic operations
• Need for random sampling not only from uniform but also from discrete Gaussian and/or other distributions
• Constant-time implementations
• Need for new SCA (Side-Channel Attack) countermeasures against power and electromagnetic analysis
• Plug-and-play replacement for current public-key cryptography units
• Intermediate use of hybrid systems

Hardware Benchmarking

Major Optimization Targets

High-Speed

• Parallel processing
• Constant-time
• Parametric code

Lightweight

• Small area, power, energy per bit
• Resistance to power & electromagnetic analysis

Round 2 Candidates in Hardware

Contest   # Round 2 candidates   Implemented in hardware   Percentage
AES       5                      5                         100%
SHA-3     14                     14                        100%
CAESAR    29                     28                        97%
PQC       26                     ?                         ?

Software/Hardware Codesign

[Figure: the most time-critical operation is moved from software to hardware]

SW/HW Codesign: Motivational Example 1

[Figure: execution-time breakdown, Software vs. Software/Hardware]

• Software: 91% major operation(s), 9% other operations
• Software/Hardware: ~1% major operation(s) in HW, 9% other operations in SW
• Speed-up of the major operation(s) ≥ 100
• Time saved: 90%
• Total Speed-Up ≥ 10

SW/HW Codesign: Motivational Example 2

[Figure: execution-time breakdown, Software vs. Software/Hardware]

• Software: 99% major operation(s), 1% other operations
• Software/Hardware: ~1% major operation(s) in HW, 1% other operations in SW
• Speed-up of the major operation(s) ≥ 100
• Time saved: 98%
• Total Speed-Up ≥ 50
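Both motivational examples follow from Amdahl's law. The sketch below recomputes the total speed-up from the fraction of time spent in the major operation(s) and the accelerator speed-up; the function name is introduced here for illustration.

```python
def total_speedup(major_fraction: float, accelerator_speedup: float) -> float:
    """Amdahl's law: only the offloaded fraction of the run time gets faster."""
    other_fraction = 1.0 - major_fraction
    return 1.0 / (other_fraction + major_fraction / accelerator_speedup)


# Example 1: 91% of the time in the major operation(s), accelerated 100x.
# Example 2: 99% of the time in the major operation(s), accelerated 100x.
for fraction in (0.91, 0.99):
    print(f"{fraction:.0%} offloaded -> total speed-up {total_speedup(fraction, 100):.1f}")
```

With a 100x accelerator the totals come out just above 10 and just above 50, matching the bounds on the slides. The part left in software dominates: even an infinitely fast accelerator could not exceed 1/0.09 ≈ 11.1 in Example 1.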

SW/HW Codesign: Advantages

• Focus on a few major operations, known to be easily parallelizable
  – much shorter development time (at least by a factor of 10)
  – guaranteed substantial speed-up
  – high flexibility to changes in other operations (such as candidate tweaks)

• Insight regarding performance of future instruction set extensions of modern microprocessors

• Possibility of implementing multiple candidates by the same research group, eliminating the influence of different
  – design skills
  – operation subset (e.g., including or excluding key generation)
  – interface & protocol
  – optimization target
  – platform

SW/HW Codesign: Potential Pitfalls

• Performance & ranking may strongly depend on
  A. features of a particular platform
     – Software/hardware interface
     – Support for cache coherency
     – Differences in max. clock frequency
  B. selected hardware/software partitioning
  C. optimization of an underlying software implementation

• Limited insight on ranking of purely hardware implementations

First step, not the ultimate solution!

Two Major Types of Platforms

1. FPGA Fabric & Hard-core Processors (processor with memory & I/O next to the FPGA fabric)

Examples:
• Xilinx Zynq 7000 System on Chip (SoC)
• Xilinx Zynq UltraScale+ MPSoC
• Intel Arria 10 SoC FPGAs
• Intel Stratix 10 SoC FPGAs

2. FPGA Fabric, including Soft-core Processors (soft-core processor inside the FPGA fabric)

Examples:
• Xilinx Virtex UltraScale+ FPGAs
• Intel Stratix 10 FPGAs

including soft-core processors:
• Xilinx MicroBlaze
• Intel Nios II
• RISC-V, originally UC Berkeley

Two Major Types of Platforms

Feature                                           FPGA Fabric and Hard-core Processor   FPGA Fabric with Soft-core Processor
Processor                                         ARM                                   MicroBlaze, Nios II, RISC-V, etc.
Clock frequency                                   >1 GHz                                max. 200-450 MHz
Portability                                       similar FPGA SoCs                     various FPGAs, FPGA SoCs, and ASICs
Hardware accelerators                             Yes                                   Yes
Instruction set extensions                        No                                    Yes
Ease of design (methodology, tools, OS support)   Easy                                  Dependent on a particular soft-core processor and tool chain

Experimental Setup

Platform: Xilinx Zynq UltraScale+ MPSoC, 1.2 GHz ARM Cortex-A53 + UltraScale+ FPGA logic

[Figure: block diagram of the experimental setup. Blocks: Zynq Processing System, AXI DMA, Input FIFO, Hardware Accelerator, Output FIFO, AXI Timer, Clocking wizard. Connections: AXI Lite, AXI Full, and AXI Stream interfaces, FIFO interfaces, IRQ. Clocks: Main Clock, UUT_clk, clk, rd_clk, wr_clk.]

All elements located on a single chip

Code Release

• Full Code & Configuration of the Experimental Setup

• Software/Hardware Codesign of Round 1 NTRUEncrypt

to be made available at https://cryptography.gmu.edu/athena under PQC by August 31, 2019

GMU CERG Case Study

SW/HW Codesign: Case Study

7 IND-CCA*-secure Lattice-Based Key Encapsulation Mechanisms (KEMs) representing 5 NIST PQC Round 2 Submissions

• LWE (Learning with Errors)-based: FrodoKEM
• RLWR (Ring Learning with Rounding)-based: Round5
• Module-LWR-based: Saber
• NTRU-based:
  – NTRU: NTRU-HPS, NTRU-HRSS
  – NTRU Prime: Streamlined NTRU Prime, NTRU LPRime

* IND-CCA = Indistinguishability under Chosen Ciphertext Attack

SW/HW Partitioning

Top candidates for offloading to hardware

From profiling:

• Large percentage of the execution time
• Small number of function calls

From manual analysis of the code:

• Small size of inputs and outputs
• Potential for combining with neighboring functions

From knowledge of operations and concurrent computing:

• High potential for parallelization

Example: LightSaber Decapsulation

[Figure: pie chart of software execution time by function]

Function          Share of execution time
MatrixVectorMul   43.44%
InnerProduct      43.52%
GenMatrix         5.03%
GenSecret         2.30%
Hash              3.30%
Other             2.40%

LightSaber Decapsulation

[Figure: execution time before and after offloading to the hardware accelerator]

• Execution time of functions to be moved to hardware: 97.60%
• Execution time of functions remaining in software: 2.40%
• Hardware accelerator: 8.77%
• Execution time remaining: 11.17%
• Execution time saved: 88.83%

Accelerator Speed-Up = 97.60/8.77 = 11.1

Total Speed-Up = 100/11.17 = 9.0
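The LightSaber numbers can be reproduced from two measured percentages: the share of time left in software and the share taken by the accelerator after offloading. The sketch below uses only values from the slide; the function and key names are illustrative.

```python
def codesign_speedups(other_pct: float, accelerator_pct: float) -> dict:
    """Derive speed-ups from percentages of the original software run time.

    other_pct: time of functions that stay in software
    accelerator_pct: time the offloaded functions take once in hardware
    """
    offloaded_pct = 100.0 - other_pct
    remaining_pct = other_pct + accelerator_pct
    return {
        "offloaded_pct": offloaded_pct,
        "remaining_pct": remaining_pct,
        "saved_pct": 100.0 - remaining_pct,
        "accelerator_speedup": offloaded_pct / accelerator_pct,
        "total_speedup": 100.0 / remaining_pct,
    }


result = codesign_speedups(other_pct=2.40, accelerator_pct=8.77)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Rounded to one decimal place, the accelerator speed-up is 11.1 and the total speed-up is 9.0, as stated above.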

Tentative Results

Total Execution Time in Software [μs]: Encapsulation

[Figure: bar chart, y-axis 0 to 4,500 μs, series Level 1 to Level 5. Candidates in x-axis order: Round5, Saber, Str NTRU Prime, NTRU LPRime, NTRU-HRSS, NTRU-HPS, FrodoKEM. Off-scale bar labels: 16,192; 62,076; 34,609.]

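Total Execution Time in Software/Hardware [μs]: Encapsulation

[Figure: bar chart, y-axis 0 to 600 μs, series Level 1 to Level 5. Off-scale bar labels: 1,223; 2,186; 1,642.]

Candidates in x-axis order, with rank changes relative to the software-only results:

1. Round5
2. Saber
3. NTRU-HRSS (5⇒3)
4. Str NTRU Prime (3⇒4)
5. NTRU-HPS (6⇒5)
6. NTRU LPRime (4⇒6)
7. FrodoKEM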

Total Speed-ups: Encapsulation

[Figure: bar chart, y-axis 0.0 to 30.0, series Level 1 to Level 5. Candidates in x-axis order: NTRU-HRSS, FrodoKEM, Round5, NTRU-HPS, Saber, NTRU LPRime, Str NTRU Prime. Bar labels in extraction order: 17.8, 13.2, 8.3, 7.9, 7.1, 3.2, 2.4, 21.1, 10.2, 9.3, 12.7, 3.4, 2.5, 3.6, 2.6, 28.4, 10.6, 18.1.]

Accelerator Speed-ups: Encapsulation

[Figure: bar chart, y-axis 0.0 to 200.0, series Level 1 to Level 5. Candidates in x-axis order: NTRU-HRSS, NTRU-HPS, FrodoKEM, NTRU LPRime, Str NTRU Prime, Round5, Saber. Bar labels in extraction order: 146.4, 140.4, 43.5, 8.7, 8.5, 20.4, 12.3, 192.2, 44.3, 27.5, 15.3, 10.6, 14.5, 32.5, 17.7, 46.1, 11.1, 21.3.]

SW Part Sped up by HW [%]: Encapsulation

[Figure: bar chart, y-axis 50.00 to 100.00, series Level 1 to Level 5. Candidates in x-axis order: Round5, Saber, NTRU-HRSS, FrodoKEM, NTRU-HPS, NTRU LPRime, Str NTRU Prime. Bar labels in extraction order: 99.36, 98.49, 95.03, 94.62, 88.03, 70.35, 62.53, 99.59, 98.93, 97.45, 89.71, 71.44, 63.43, 73.26, 65.00, 99.55, 99.14, 98.62.]

Total Execution Time in Software [μs]: Decapsulation

[Figure: bar chart, y-axis 0 to 12,000 μs, series Level 1 to Level 5. Candidates in x-axis order: Round5, Saber, NTRU LPRime, Str NTRU Prime, NTRU-HPS, NTRU-HRSS, FrodoKEM. Off-scale bar labels: 16,192; 62,377; 34,649.]

Order reversed compared to encapsulation for positions 3 and 4 and for positions 5 and 6.

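Total Execution Time in Software/Hardware [μs]: Decapsulation

[Figure: bar chart, y-axis 0 to 700 μs, series Level 1 to Level 5. Off-scale bar labels: 1,319; 3,120; 1,866.]

Candidates in x-axis order, with rank changes relative to the software-only results:

1. Round5
2. Saber
3. NTRU-HPS (5⇒3)
4. Str NTRU Prime
5. NTRU-HRSS (6⇒5)
6. NTRU LPRime (3⇒6)
7. FrodoKEM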

Total Speed-ups: Decapsulation

[Figure: bar chart, y-axis 0.0 to 130.0, series Level 1 to Level 5. Candidates in x-axis order: NTRU-HPS, NTRU-HRSS, Str NTRU Prime, FrodoKEM, Saber, Round5, NTRU LPRime. Bar labels in extraction order: 77.4, 74.1, 12.3, 9.0, 8.1, 38.3, 3.9, 119.3, 45.5, 18.6, 13.4, 9.4, 4.1, 54.8, 4.5, 20.0, 17.9, 9.8.]

Accelerator Speed-ups: Decapsulation

[Figure: bar chart, y-axis 0.0 to 250.0, series Level 1 to Level 5. Candidates in x-axis order: NTRU-HRSS, NTRU-HPS, Str NTRU Prime, FrodoKEM, NTRU LPRime, Saber, Round5. Bar labels in extraction order: 188.1, 182.8, 44.4, 11.1, 8.1, 86.2, 27.5, 235.7, 112.0, 46.0, 32.8, 17.7, 9.4, 132.7, 37.7, 46.0, 24.6, 9.8.]

SW Part Sped up by HW [%]: Decapsulation

[Figure: bar chart, y-axis 50.00 to 100.00, series Level 1 to Level 5. Candidates in x-axis order: Round5, NTRU-HPS, NTRU-HRSS, Str NTRU Prime, Saber, FrodoKEM, NTRU LPRime. Bar labels in extraction order: 100.00, 99.25, 99.18, 97.60, 93.96, 98.53, 76.47, 100.00, 99.58, 98.69, 98.10, 96.78, 77.46, 98.92, 79.22, 100.00, 98.41, 97.11.]

Conclusions

• Total speed-ups
  – for encapsulation from 2.4 (Str NTRU Prime) to 28.4 (FrodoKEM)
  – for decapsulation from 3.9 (NTRU LPRime) to 119.3 (NTRU-HPS)

• Total speed-up dependent on the percentage of the software execution time taken by functions offloaded to hardware and the amount of acceleration itself

• Hardware accelerators thoroughly optimized using Register-Transfer Level design methodology

• Determining optimal software/hardware partitioning requires more work

• Ranking of the investigated candidates affected, but not dramatically changed, by hardware acceleration

• It is possible to complete similar designs for all Round 2 candidates within the evaluation period (12-18 months)

• Additional benefit: Comprehensive library of major operations in hardware

Future Work

[Figure: roadmap of future work along two axes, Breadth and Depth]

Breadth:

• Current work
• 4 Remaining Lattice-based KEMs
• 3 Lattice-based Digital Signatures
• 7 Code-based KEMs
• 7 Other Candidates

Depth:

• More operations moved to hardware / C code optimized for ARM Cortex-A53*
• Algorithmic optimizations of software and hardware*
• Full hardware implementations

Libraries:

• Hardware library of basic operations of lattice-based candidates
• Hardware library of basic operations of code-based candidates
• Hardware library for PQC

*collaboration with submission teams and other groups very welcome

High-Level Synthesis (HLS)

[Figure: High-Level Language (C, C++, Java, Python, etc.) → High-Level Synthesis (HLS) → Hardware Description Language (VHDL or Verilog)]

Popular HLS Tools

Commercial (FPGA-oriented):

• Vivado HLS: Xilinx – selected for this study
• FPGA SDK for OpenCL: Intel

Academic:

• Bambu: Politecnico di Milano, Italy
• DWARV: Delft University of Technology, The Netherlands
• GAUT: Universite de Bretagne-Sud, France
• LegUp: University of Toronto, Canada

Software/Hardware Codesign with HLS

[Figure: the most time-critical operation is moved from software to HLS-generated hardware]

Transformation to HLS-ready C/C++ Code

1. Interface mapping

2. Addition of HLS Tool directives (pragmas)

3. Hardware-driven code refactoring

Using Existing Code

Function: Polynomial Multiplication in NTRUEncrypt

Source of C code: OnBoard Security

Goal: The same or comparable number of clock cycles as in the Register-Transfer Level (manual) implementation in VHDL.

Attempt 1: Reference implementation based on the grade school algorithm for multiplication (a.k.a. schoolbook, paper-and-pencil, etc.).

Attempt 2: Optimized implementation based on rotation.

Multiple attempts at optimization using Vivado HLS directives (pragmas) and minor code changes.
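The two attempts can be illustrated with a small model of polynomial multiplication in the NTRUEncrypt ring Z_q[x]/(x^n − 1). This is a sketch, not the OnBoard Security code: the function names are hypothetical, and the rotation variant shows only one plausible reading of "based on rotation", namely accumulating rotated copies of one operand.

```python
def schoolbook_cyclic_mul(a, b, q):
    """Multiply a and b in Z_q[x]/(x^n - 1) with the O(n^2) schoolbook method."""
    n = len(a)
    assert len(b) == n
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = (i + j) % n  # x^n = 1, so exponents wrap around
            c[k] = (c[k] + a[i] * b[j]) % q
    return c


def rotation_cyclic_mul(a, b, q):
    """Same product, computed as the sum over j of b[j] * (a rotated by j positions)."""
    n = len(a)
    assert len(b) == n
    c = [0] * n
    rotated = list(a)
    for j in range(n):
        if b[j]:
            # One step updates all n coefficients; in hardware these
            # updates can happen in parallel within a single clock cycle.
            c = [(ci + b[j] * ri) % q for ci, ri in zip(c, rotated)]
        rotated = [rotated[-1]] + rotated[:-1]  # multiply by x: cyclic shift
    return c


a, b, q = [1, 2, 3], [4, 5, 6], 2048
print(schoolbook_cyclic_mul(a, b, q))  # [31, 31, 28]
print(rotation_cyclic_mul(a, b, q))    # [31, 31, 28]
```

In the rotation form the outer loop runs n times and each iteration is a fully parallel vector update, which is the structure that yields roughly n clock cycles in a hardware datapath.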


Outcome & Revised Approach

Outcome 1: Tens of thousands of clock cycles, compared to the expected n=743 clock cycles

Solution: Rewriting the code in C in such a way as to match the block diagram used to generate VHDL code

Outcome 2:

• Expected functionality
• Around n clock cycles of the execution time

Similar approach applied and successfully repeated for 4 NTRU-based Round 1 KEMs!

Sources of Productivity Gains

• Higher level of abstraction
• Focus on datapath rather than control logic
• Debugging in software (C/C++)
  – Faster run time
  – No timing waveforms

Achievements: Number of Clock Cycles

Algorithm        RTL     HLS     HLS/RTL

Encapsulation
NTRUEncrypt      744     750     1.008
NTRU-HRSS        702     708     1.009
Str NTRU Prime   762     769     1.009
NTRU LPRime      1524    1538    1.009

Decapsulation
NTRUEncrypt      1491    1502    1.007
NTRU-HRSS        2111    2162    1.024
Str NTRU Prime   2291    2346    1.024
NTRU LPRime      2287    2307    1.009

#Clock Cycles increased by less than 2.5%
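The HLS/RTL column and the 2.5% bound can be recomputed from the cycle counts. The sketch below uses the values from the table above; the dictionary layout and function name are illustrative.

```python
# (operation, algorithm) -> (RTL cycles, HLS cycles), copied from the table.
CYCLES = {
    ("Encapsulation", "NTRUEncrypt"): (744, 750),
    ("Encapsulation", "NTRU-HRSS"): (702, 708),
    ("Encapsulation", "Str NTRU Prime"): (762, 769),
    ("Encapsulation", "NTRU LPRime"): (1524, 1538),
    ("Decapsulation", "NTRUEncrypt"): (1491, 1502),
    ("Decapsulation", "NTRU-HRSS"): (2111, 2162),
    ("Decapsulation", "Str NTRU Prime"): (2291, 2346),
    ("Decapsulation", "NTRU LPRime"): (2287, 2307),
}


def hls_to_rtl_ratios(cycles):
    """Return {key: HLS cycles / RTL cycles} for every table row."""
    return {key: hls / rtl for key, (rtl, hls) in cycles.items()}


ratios = hls_to_rtl_ratios(CYCLES)
for (operation, algorithm), ratio in ratios.items():
    print(f"{operation:14s} {algorithm:15s} {ratio:.3f}")

worst = max(ratios, key=ratios.get)
print("largest overhead:", worst, f"{(ratios[worst] - 1) * 100:.1f}%")
```

The largest ratio belongs to NTRU-HRSS decapsulation at about 1.024, consistent with the stated bound of less than 2.5%.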

Overhead: Clock Frequency & Resource Utilization

Algorithm        RTL      HLS      HLS/RTL

Clock Frequency [MHz]
NTRUEncrypt      330      270      0.82
NTRU-HRSS        300      261      0.87
Str NTRU Prime   255      191      0.75
NTRU LPRime      255      191      0.75

Slices
NTRUEncrypt      4,431    7,361    1.66
NTRU-HRSS        5,476    7,358    1.34
Str NTRU Prime   9,699    18,428   1.90
NTRU LPRime      8,483    15,585   1.84

Clock Frequency reduced by 25% or less
#Slices increased by 90% or less

Overhead: Resource Utilization

Algorithm        RTL      HLS       HLS/RTL

LUTs
NTRUEncrypt      27,912   50,061    1.79
NTRU-HRSS        33,230   48,452    1.46
Str NTRU Prime   65,207   117,424   1.80
NTRU LPRime      52,297   94,062    1.80

Flip-flops
NTRUEncrypt      24,697   24,728    1.00
NTRU-HRSS        30,327   33,235    1.10
Str NTRU Prime   32,929   51,253    1.56
NTRU LPRime      39,730   40,089    1.00

#LUTs increased by 80% or less
#Flip-flops increased by 56% or less

Possible Future Work

[Figure: most time-critical operation implemented as HLS-generated hardware with minimal hand optimization and as HLS-generated hardware with thorough hand optimization]

PQC Opportunities & Challenges

• The biggest revolution in cryptography since the invention of public-key cryptography in the 1970s

• Efficient hardware implementations in FPGAs and ASICs desperately needed to prove the candidates' suitability for high-performance applications and constrained environments. Collaboration sought by submission teams!

• Likely extensions to Instruction Set Architectures of multiple major microprocessors

• Start-up & new-product opportunities

Once-in-a-lifetime opportunity! Get involved!

Q&A

Questions? Comments? Suggestions?

CERG: http://cryptography.gmu.edu

ATHENa: http://cryptography.gmu.edu/athena

Thank You!