Fakultät für Elektrotechnik und Informationstechnik
Technische Universität München

Memory Efficient Signature Matching in Deep Packet Inspection Applications at Line Rates

Shiva Shankar Subramanian

Vollständiger Abdruck der von der Fakultät für Elektrotechnik und Informationstechnik der Technischen Universität München zur Erlangung des akademischen Grades eines Doktor-Ingenieurs (Dr.-Ing.) genehmigten Dissertation.

Vorsitzender: Prof. Dr.-Ing. Georg Sigl
Prüfende der Dissertation: 1. Prof. Dr. sc.techn. Andreas Herkersdorf 2. Prof. Dr.-Ing. Ulf Schlichtmann

Die Dissertation wurde am 19.12.2018 bei der Technischen Universität München eingereicht und durch die Fakultät für Elektrotechnik und Informationstechnik am 19.08.2019 angenommen.

Abstract
Deep Packet Inspection (DPI) is the process of inspecting all the headers and the payload in a network packet. Various content-aware networking applications such as intrusion detection/prevention, load balancing, copyright enforcement and content-aware Quality-of-Service (QoS) drive the need for performing DPI in next-generation networks. Comparing the payload bytes against the signatures, which are defined through string and regular expression patterns, is the most time-critical function in DPI. It is therefore essential to accelerate the signature matching function to perform DPI at multi-gigabit rates (∼10 Gbps) in modern embedded network processors.
The signatures or patterns which are used for payload inspection are represented through either the Non-Deterministic Finite Automaton (NFA) or the Deterministic Finite Automaton (DFA). The DFA is generally preferred to represent the signatures, especially for high speed signature matching applications, as it is highly processing efficient in comparison to the NFA. On the other hand, the DFA is highly storage inefficient due to the presence of redundant state transitions and the exponential state blow-up corresponding to certain signature sets. So, the DFA is generally compressed before being stored in the memory, while the payload bytes are actually compared against the compressed DFA.
The transition compression algorithms play a significant role in defining the memory footprint of the compressed DFA, as well as in determining the rate at which the signature matching function can be performed. The state-of-the-art transition compression algorithms either focus on effectively compressing the DFA [1, 2, 3] or on compressing the DFA in such a way that the signature matching can be performed at multi-gigabit rates [4, 5], but not both. Addressing this shortcoming, two transition compression methods, the MSBT and the LSCT, are proposed and evaluated in this dissertation, which can effectively compress the DFA as well as perform the signature matching function at about 10 Gbps. The proposed methods efficiently compress the redundant transitions through a combination of bitmaps and bitmasks. Additionally, the linearity in the arrangement of state transitions in the DFA is leveraged to compress the redundant bitmasks in the compressed DFA. The MSBT and the LSCT are capable of achieving transition compression rates of over 98%, primarily due to the introduction of bitmasks, which is 4-5% higher than the state-of-the-art bitmap based compression algorithms. The improvement in the transition compression rates, together with the bitmask compression, reduces the memory footprint of the compressed DFA by ∼70% in comparison to the state-of-the-art transition compression algorithms.
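To make the bitmap idea concrete: in classical bitmap based intra-state transition compression (the baseline technique the proposed methods build upon, reviewed later in Chapter 2), a per-state bitmap marks which transitions differ from their predecessor, only those "unique" transitions are stored, and a population count over the bitmap recovers the index of the stored transition during decompression. The following sketch illustrates only this baseline, not the MSBT or LSCT algorithms themselves; the function names are illustrative.

```python
def compress_state(transitions):
    """Bitmap based intra-state compression: bit i is 1 when transition i
    differs from transition i-1 (bit 0 is always 1), so only the
    run-starting 'unique' transitions need to be stored."""
    bitmap, unique = [], []
    prev = None
    for i, t in enumerate(transitions):
        if i == 0 or t != prev:
            bitmap.append(1)
            unique.append(t)
        else:
            bitmap.append(0)
        prev = t
    return bitmap, unique

def lookup(bitmap, unique, char):
    """Decompression: the popcount of the bitmap bits up to and including
    position 'char' gives the index into the unique transition list."""
    idx = sum(bitmap[:char + 1]) - 1
    return unique[idx]

# Example: a state with 8 next-state transitions, many of them redundant.
bm, uq = compress_state([5, 5, 5, 7, 7, 2, 2, 2])
# bm == [1, 0, 0, 1, 0, 1, 0, 0]; uq == [5, 7, 2] -> 3 stored instead of 8
```

In hardware, the `sum` over the bitmap prefix corresponds to a popcount circuit, which is why the lookup remains a constant-time operation suitable for line-rate matching.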
A bitmap based signature matching engine called BiSME is proposed and implemented which performs the decompression corresponding to the MSBT. BiSME encompasses two efficient storage architectures, the Packed Storage Methodology and the Shared Memory Methodology, to effectively store the compressed DFA in the on-chip memories in a configurable manner without resulting in memory wastage. The Bitmask Optimized BiSME (BOBiSME) extends the BiSME architecture to support bitmask compression, further reducing the area requirements of BiSME. The BiSME and the BOBiSME were effectively pipelined to perform the signature matching function at 9.3 Gbps and 10.6 Gbps, respectively. The signature matching engines were designed and synthesized on the TSMC 28nm technology node and are capable of storing a maximum of 1000 string signatures. The BiSME and BOBiSME require 1.43 mm2 and 1.2 mm2 of silicon area and consume only 155 mW and 170 mW of power, respectively, making them area-efficient and power-efficient hardware coprocessor architectures. The BiSME and the BOBiSME architectures were prototyped on the Cadence Palladium platform, and network traffic traces of over 2 GBytes consisting of different traffic characteristics were injected to validate the hardware implementation.
Acknowledgments
This dissertation would not have been possible without the support and encouragement from a number of people to whom I would like to express my gratitude from the bottom of my heart.
First and foremost, I would like to thank Prof. Dr. sc. techn. Andreas Herkersdorf for agreeing to be my doctoral advisor. The profound technical discussions which I had with him greatly helped me to structurally formulate my research problems and propose innovative solutions during my research work. His extensive experience in working with industry based research programs enabled me to not only make quality contributions to the academic community, but also to conceive effective solutions which could be easily adopted by the industry. Additionally, I would like to thank him for accommodating me as a visiting researcher in his institute, which also allowed me to experience the research environment in TUM.
I would like to thank Dr. Pinxing Lin from Intel Technology Asia Pte. Ltd. for accepting to be my industry supervisor for the research program. He has always been the go-to person during all the challenging times and I would like to thank him for all the great discussions which he has had with me over these times. His calm demeanor always put me at ease and is a quality which I have always admired in him.
I would also like to thank Dr.-Ing. Thomas Wild for his technical insights during various stages of my research work. I really enjoyed the discussions which I had with him and I really appreciate his prompt and timely feedback over the course of the research project.
I would like to express my gratitude to Mr. Mario Traeber for believing in me and giving me an opportunity to be the first candidate for the pioneering Industry PhD Program (IPP). I would like to thank Ms. Ko Kah Goh and Mario for the enormous effort which they put in to define the organization of this program. I would also like to thank Mr. Bing Tao Xu, for supporting my career aspirations and allowing me to proceed with the IPP. My sincere thanks to Mr. Daniel Artusi for his continuous moral support throughout the course of the IPP. Finally, I would like to thank Intel Technology Asia Pte. Ltd., Singapore for funding this research program and providing me an opportunity to work on such a great research topic.
I would like to acknowledge the technical contribution of certain members within Intel without whose help this work would not have been possible. My thanks to Roshini John who setup the software simulations on automata models which helped me to get a deeper understanding on automata based signature matching. My thanks to Hariharan for helping me to prepare the synthesis setup for the hardware implementation of the signature matching engine which was developed in this research work. My thanks to Anmol Prakash Surhonne, Nihar Sriram and Sagar Paramesh for their support in the
preparation of the evaluation setup to verify the hardware implementation. Additionally, I would like to thank all my colleagues in Intel Technology Asia Pte. Ltd. who have helped me over the course of this program. My special thanks to Abishek Godaa for all the thought-provoking discussions which he has had with me. Many thanks to Jyothi, Kavitha, Dharmesh, Seshagiri, Giridhar, Jincy, Rekha, Suresh Nagaraj and Suresh Kumar for being great friends and colleagues at work. Many thanks to all my colleagues at the Institute for Integrated Systems in TUM who made my stay in Munich memorable. Especially, my special thanks to Michael Vonbun for all his help during my stay in Munich and during the preparation of this dissertation. I would also like to thank Mrs. Doris Zeller, Mrs. Draga Verena and Mrs. Gabriele Spohrle for all their secretarial support during the course of the program.
Last but not least, I would like to thank my family for their unconditional love and affection which gave me the strength to take up such a monumental journey. I would also like to thank my wife Varsha Sharma for her love. Her support and encouragement was in the end what made this dissertation possible. Knowing that these words are wholly inadequate, I express my deepest thanks to them for bearing with me, for being so constant and supportive, and for giving me their faith and trust. It is to them that I dedicate this dissertation.
Contents
List of Figures xi
List of Tables xv
1 Introduction 1
1.1 Motivation . . . 1
1.2 Introduction to Deep Packet Inspection . . . 3
1.2.1 Packet Inspection Methodologies - Overview . . . 3
1.2.2 Steps in Deep Packet Inspection . . . 4
1.2.3 Signature Matching - Requirements & Challenges . . . 6
1.3 Thesis Contributions . . . 8
1.4 Dissertation Organization . . . 10

2 State-of-the-Art 13
2.1 String Based Signature Matching Engines . . . 13
2.2 Automata Based Approaches . . . 15
2.2.1 Introduction to NFA & DFA . . . 15
2.2.2 Automata Based Signature Matching - Complexity Analysis . . . 18
2.3 DFA Compression . . . 18
2.3.1 State Compression . . . 19
2.3.1.1 State Explosion Problem . . . 19
2.3.1.2 State Compression Solutions . . . 21
2.3.2 Transition Compression . . . 23
2.3.2.1 Software Oriented Algorithms . . . 23
2.3.2.2 Hardware Oriented Algorithms . . . 26
2.3.2.3 Alphabet Compression . . . 32
2.3.3 Summary of DFA Compression . . . 34
2.4 Line Rate Signature Matching Engine Implementations . . . 34
2.4.1 FPGA - Logic Based Implementations . . . 35
2.4.2 GPU Based Signature Matching Engines . . . 36
2.4.3 Hardware Accelerators - ASIC . . . 36
2.4.3.1 Automata Storage - TCAM . . . 37
2.5 State-of-the-Art Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3 Bitmask: A Secondary Indexing Layer for Bitmap based Transition Compression 43
3.1 Member State Bitmask Technique . . . 44
3.1.1 MSBT - An Example . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.1.2 Compressed DFA Organization . . . . . . . . . . . . . . . . . . . . 47
3.1.3 Decompression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1.3.2 Hardware Decompression Engine for MSBT - Logical Block Level Description . . . . . . . . . . . . . . . . . . . 50
3.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.1 Leader Transition Bitmask . . . . . . . . . . . . . . . . . . . . . . 53
3.2.2 Compressed DFA Organization - LSCT . . . . . . . . . . . . . . . 54
3.2.3 Decompression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2.3.2 Hardware Decompression Engine for LSCT - Logical Block Level Description . . . . . . . . . . . . . . . . . . . . . . 57
3.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3.3 Estimated Memory Usage . . . . . . . . . . . . . . . . . . . . . . . 62
3.3.4 Functional Evaluation - Software Model . . . . . . . . . . . . . . . 64
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4 Memory Footprint Optimizations - MSBT & LSCT 69
4.1 Motivation . . . 69
4.2 Improved Transition Compression through State Grouping . . . . . . . . . 70
4.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2.2.1 The Divide Step . . . . . . . . . . . . . . . . . . . . . . . 73
4.2.2.2 State Reorganization . . . . . . . . . . . . . . . . . . . . 74
4.2.2.4 Optimal Leader State Identification . . . . . . . . . . . . 78
4.2.3 Complexity Analysis Comparison . . . . . . . . . . . . . . . . . . . 79
4.2.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . 80
4.2.5 Discussion & Summary . . . . . . . . . . . . . . . . . . . . . . . . 84
4.3 Alphabet Compression . . . 85
4.3.1 Combining Alphabet Compression with MSBT & LSCT . . . 85
4.3.2 Experimental Evaluation . . . 88
4.3.2.1 Transition Compression Rate . . . 89
4.3.2.2 Estimated Memory Usage . . . 90
4.3.3 Discussion & Summary . . . 90
4.4 Bitmask Compression . . . 91
4.4.1 Background . . . 92
4.4.2 Bitmask Optimized Member State Bitmask Technique . . . 94
4.4.2.1 Bitmask Compression . . . 94
4.4.2.2 Memory Organization . . . 97
4.4.2.3 Transition Decompression . . . 97
4.4.2.4 Hardware Engine for Decompression - Logical Block Level Description . . . 101
4.4.3 Experimental Evaluation . . . 102
4.4.3.1 Estimated Memory Usage . . . 102
4.4.3.2 Functional Evaluation - Software Model . . . 104
4.4.4 Discussion & Summary . . . 105
4.5 Conclusion . . . 106
5 Hardware Coprocessors for Signature Matching 109
5.1 Overview . . . 109
5.2 Bitmask Storage . . . 110
5.2.1 Requirements . . . 110
5.2.2 Packed Storage Methodology . . . 112
5.2.2.1 Split Memory Implementation . . . 113
5.2.2.2 Bitmask Extraction . . . 115
5.2.3 Summary . . . 117
5.3 Compressed Transition Storage . . . 117
5.3.1 Requirements . . . 117
5.3.2 Shared Memory Methodology . . . 120
5.3.2.1 Transition Memory Access . . . 121
5.3.3 Discussion & Summary . . . 124
5.4 BiSME - Internal Architecture . . . 124
5.4.1 Memory Shell . . . 126
5.4.2 Address Decoder & Memory Access Multiplexer . . . 128
5.4.3 Decompression Engine . . . 129
5.4.4 BOBiSME - Modified BiSME to support Bitmask Compression . . . 132
5.4.5 Throughput . . . 134
5.5 Deep Packet Inspection Accelerator . . . 135
5.5.1 DPIA Interfaces . . . 136
5.5.2 DPIA Internal Architecture . . . 137
5.5.3 Network Data Management Engine . . . 138
5.5.3.1 Configurable Stream Mapping . . . . . . . . . . . . . . . 138
5.6 Experimental Evaluation & Discussion . . . 141
5.6.1 Synthesis Results . . . 141
5.6.2 Signature Capacity . . . 144
5.6.2.1 BiSME . . . 144
5.6.2.2 BOBiSME . . . 144
5.6.3 Hardware Implementation Validation . . . 145
5.6.3.1 Traffic Generation . . . 146
5.6.3.2 Internal Architecture of the Evaluation Setup . . . 149
5.6.3.3 Results & Discussion . . . 152
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
List of Figures
1.1 A smart home which has many IoT devices connected to it. (Image downloaded from [6]). . . . 2
1.2 An overview of various packet inspection methodologies . . . 4
1.3 Overview of the various steps to be performed in deep packet inspection . . . 5
2.1 Classification of various signature matching engines proposed in the literature . . . 14
2.2 A signature set with 2 signatures abc, def.*12 represented as (a) NFA and (c) DFA. The states traversed during (b) NFA and (d) DFA based signature matching. . . . 17
2.3 (a) state explosion scenario 1 (b) state explosion scenario 2 (c) state explosion scenario 3 . . . 20
2.4 (a) A DFA with state transitions for 5 states and 4 characters (b) The compressed D2FA representation of the DFA (c) δFA representation of the DFA . . . 24
2.5 A signature a[b-d]ef represented as a DFA and through RCDFA . . . 25
2.6 (a) A sequence of 8 Transitions within a state (b) The bitmap corresponding to the state transitions to identify if a state transition is compressed (c) Compressed state transition representation after bitmap based compression stored in a unique transition list (d) Transition decompression - Example . . . 27
2.7 Example to explain the FEACAN transition compression (a) An uncompressed DFA with 8 states and state transitions for 8 characters for each state (b) The compressed DFA after bitmap based intra-state transition compression (c) The DFA states grouped into subsets of states after intra-state compression (d) The compressed DFA after inter-state transition compression . . . 29
2.8 (a) An uncompressed DFA with 8 states and state transitions for 8 characters for each state (b) The compressed DFA after RCDFA (c) The bitmaps across the vertical state axis (d) Unique bitmap after being combined . . . 30
2.9 (a) Bitmaps corresponding to character ‘d’ and ‘f’ combined (b) Additional transitions stored corresponding to the unique transition list for character ‘f’ . . . 31
2.10 Transition Compression through Alphabet Compression in a DFA . . . . . 33
3.1 (a) Original DFA before compression (b) The DFA after bitmap based intra-state transition compression (c) The DFA states grouped into subset of states (d) Encoded state representation of the DFA states (e) The DFA after inter-state transition compression (f) The MTB and the cumulative sum of transitions after inter-state compression (g) Compressed DFA generated after FEACAN . . . 45
3.2 The compressed transitions stored in the Leader and Member Transitions Tables while the control information is stored in Address Mapping Table and Member Bitmask Table . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3 Representation of the compressed DFA with respect to MSBT decompression system . . . 50
3.4 Functional description of the hardware based decompression architecture for MSBT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5 (a) Compressed DFA after the MSBT (b) Compressed DFA after LSCT (c) The LTB & the MTB for each state . . . . . . . . . . . . . . . . . . . 52
3.6 The compressed DFA organized into the AMT, BT and the TT after LSCT transition compression . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.7 Representation of the compressed DFA with respect to the LSCT decom- pression system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.8 Logical block level description of the hardware based decompression engine for the LSCT . . . 57
3.9 Comparison of transition compression rate across various techniques . . . 60
3.10 Comparison of transition and control memory usage across various techniques . . . 63
3.11 Overview of the software based simulation environment to verify the transition compression . . . 64
4.1 (a) An uncompressed DFA with 8 states and 8 characters (b) DFA after Intra-state transition compression step in MSBT (c) The DFA split into 3 groups after the state clustering step (d) DFA after Inter-state transition compression step in MSBT (e) State 7 merged with G1 as it shares the same bitmap defying the constraint with transition threshold (f) More transitions compressed when Inter-state transition compression is performed after merging G1 and G2 . . . 71
4.2 Example of DFA states clustered into a set of 5 groups using the divide step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3 Example of state reorganization step performed on the states after the divide step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4 State 9 merged with G2 as part of conquer step . . . . . . . . . . . . . . . 77
4.5 States 5 and 8 swapped in G3 as part of optimal leader state identification 79
4.6 Percentage reduction in the Leader and Member transitions between FEACAN and DC grouping in MSBT, T=80%, B=256 . . . 82
4.7 Estimated on-chip SRAM memory usage to store the compressed DFA . . 84
4.8 (a-c) MSBT transition compression in a DFA without Alphabet Compression (d-f) MSBT transition compression in a DFA after Alphabet Compression . . . 87
4.9 Example showing the LSCT implemented after Alphabet Compression . . 87
4.10 Estimated on-chip SRAM memory usage to store the compressed DFA . . 91
4.11 Example of redundant bitmasks generated during MSBT transition compression . . . 93
4.12 Compressing the MTBs to reduce the storage requirements of bitmasks . 95
4.13 Organization of the compressed DFA between the control and the data memories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.14 Compressed Transition fetch from the MTT when the unique bitmask bit for the member state is 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.15 Compressed Transition fetch from the MTT when the unique bitmask bit for the member state is 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.16 Functional Description of the Hardware Accelerator to perform the Transition Decompression . . . 102
4.17 The total number of MTBs generated before and after Bitmask Compression . . . 103
4.18 Comparison of estimated on-chip SRAM memory usage to store the compressed DFA before and after bitmask compression . . . 103
4.19 Comparison of the control memory usage between MSBT and BOMSBT . 104
5.1 Overview of the Deep Packet Inspection Accelerator . . . . . . . . . . . . 109
5.2 MTB width variation across signature sets and within a signature set . . 111
5.3 Example of how the bitmasks are stored using the “Packed Storage Methodology” . . . 113
5.4 Packed Storage Methodology - split memory storage implementation . . . 114
5.5 Packed Storage Methodology - split memory implementation data reorganization . . . 115
5.6 Extracting the bitmask from the data fetched from the memory . . . . . . 116
5.7 Hardware logic used to extract the bitmask from the memory . . . . . . . 118
5.8 Shared memory architecture with state transitions flexibly stored in the LTT, the MTT and the shared memories . . . . . . . . . . . . . . . . . . . 120
5.9 Overview of memory block allocation during compile time . . . . . . . . . 122
5.10 Overview of the transition memory access using the shared memory methodology . . . 123
5.11 (a) Internal Architecture of the Signature Matching Engine (b) SRAM interface description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.12 (a) DFA state encoding (b) Organization of the compressed transitions in the memory (c) Table to store the MTB and the cumulative sum of transitions (d) Table to store the base addresses and other control information . . . 127
5.13 Details of the byte stream interface and signature match output interface 129
5.14 (a) DFA state encoding (b) Organization of the compressed transitions in the memory (c) Member Bitmask Table to store the MTB and the cumulative sum of transitions (d) Table to store the base addresses and other control information . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.15 (a) ALS & LTBFS split into multiple pipeline stages in BiSME (b) ALS and LTBFS split into multiple pipeline stages in BOBiSME (c) Pipelined operation of BiSME with context based byte interleaving . . . . . . . . . 136
5.16 Block level description of the Deep Packet Inspection Accelerator . . . 137
5.17 Functional description of the programmable rule based stream to context mapping . . . 139
5.18 The accepting states used as the hash index to identify the postprocessing function associated with a signature match . . . 140
5.19 (a) DPIA with multiple instances of signature matching engine (b) Signature count scalability (c) Signature matching throughput scalability . . . 140
5.20 A pictorial representation of the evaluation setup used to verify the SME hardware . . . 147
5.21 Functional overview of Traffic generation . . . 148
5.22 Detailed block level internal architecture of the evaluation setup . . . 150
5.23 Extraction of the payload bytes and injection into the signature matching engine . . . 151
5.24 Signature Matching results on the exact match signature set in BiSME for 10 passes . . . 154
5.25 Signature matching results after 2.3 GB of controlled traffic inspection on the exact match signature set . . . 155
5.26 Signature matching results after 2.3 GB of controlled traffic inspection on the bro217 signature set . . . 155
5.27 Signature matching results after 1 GB of random traffic inspection on the bro217 signature set . . . 157
5.28 Signature matching results after 2.3 GB of controlled traffic inspection on the exact match signature set . . . 158
5.29 Signature matching results after 2.3 GB of controlled traffic inspection on the bro217 signature set . . . 158
5.30 Signature matching results after 1 GB of random traffic inspection on the bro217 signature set . . . 159
A.1 Example DFA to explain the trace generation mechanism . . . . . . . . . 175
B.1 Example of the Accumulative Parallel Adder circuitry to perform the population count function . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
List of Tables
2.1 Processing and storage complexity associated with NFA & DFA . . . 18
2.2 Comparison of various platforms with respect to signature matching engine implementation . . . 35
3.1 Signature sets used for evaluation . . . 58
3.2 Signature characteristics after bitmap compression and state grouping . . . 59
3.3 Comparison of the average number of transitions in the member state after compression . . . 61
3.4 Average number of transitions fetched before fetching the compressed state transition . . . 61
3.5 Parameters used for memory usage estimation . . . 62
3.6 Comparison of signature matching results across the compression methods across different PM values . . . 66
4.1 Algorithmic complexity of various algorithms proposed in DC . . . 80
4.2 Comparison of the Number of Groups and the Average Members per Group between FEACAN & DC, B=256 . . . 81
4.3 Comparison of transition compression rate achieved in MSBT with FEACAN and DC state grouping, B=256 . . . 81
4.4 Comparison of transition compression achieved in LSCT with FEACAN and DC state grouping, B=256 . . . 83
4.5 Comparison of transition compression achieved in MSBT & LSCT when the maximum number of states in the groups, T=80% . . . 83
4.6 Summary of the Signature set before and after alphabet compression . . . 89
4.7 Compressed Transitions Before and After Alphabet Compression . . . 90
4.8 Comparison of signature matching results across the compression methods across different PM values . . . 105
5.1 MTB address calculation example . . . 113
5.2 Example values used for the parameters to explain the Shared Memory Methodology . . . 121
5.3 Shared memory architecture - memory addressing . . . 122
5.4 Synthesis Area Results . . . 142
5.5 Comparison of BiSME against other Hardware Engines . . . 143
5.6 Signature sets compiled into BiSME . . . 144
5.7 Signature sets compiled into BOBiSME . . . 145
5.8 Evaluation BOBiSME string signature capacity . . . 145
5.9 Source and Destination MAC address combinations to identify a context . . . 149
1 Introduction
1.1 Motivation
The evolution of the Internet of Things (IoT) and the possibility to connect numerous electronic devices through a multitude of communication technologies is revolutionizing the way in which we interact with these devices. The ability to remotely monitor and manage these devices enables systems to make data-driven decisions, saving time for people and businesses and improving the quality of our lives. According to a report from Accenture [7], the evolution of IoT devices is still in its nascent stages and the total number of connected devices is expected to reach about 40 billion by 2024. The growth in the number of IoT devices spans a multitude of environments such as industrial automation, wearables, transportation, infrastructure applications and home automation.
A wide range of IoT devices and applications are emerging for use in the home, including connected devices such as personal computers, mobile phones, tablets, smart televisions, gaming consoles, smart cameras and many other smart appliances, as shown in Figure 1.1. Though these devices bring additional convenience to our lives, their networking capability presents cybercriminals with an opportunity to exploit them [8]. Competitive cost, technical constraints and time-to-market pressure make it challenging for manufacturers to design adequate security features into these devices. Distributed Denial-of-Service (DDoS) attacks such as the Mirai botnet [9] are an example of the devastation that can be caused by cybercriminals who capitalize on the always-on network connectivity of these devices. Additionally, the sheer growth in the number of IoT devices heightens the attackers' interest in targeting them. The connected devices in the home network not only allow attackers to execute complex attacks from the compromised devices, but also give them access to precious user data within these devices and the other devices connected to the home network. Various studies have found that home appliances such as smart refrigerators or smart televisions are susceptible to compromise [10, 11], and the biggest worry is that the end users are completely unaware of it. Cisco's annual security report states that the patching rate for these IoT devices is minimal and in most cases the vulnerabilities are not patched at all [12]. So, securing the IoT devices in the home network is a major issue which has to be addressed before our homes can truly become Smart Homes.
The Residential Gateway Router (RGR) is the centralized networking hub in the connected home. The RGR has traditionally been the networking node which manages the network communication between the network service provider and the devices connected to the network. However, due to the explosion in the number of connected devices in the
Figure 1.1: A smart home which has many IoT devices connected to it. (Image downloaded from [6]).
home network, the RGR now has to take the additional responsibility of securing and managing the devices connected to it. So, it is essential to implement Content-Aware Networking (CAN) services such as intrusion detection/prevention, load balancing, content-aware Quality-of-Service (QoS) and copyright enforcement in the RGR, in addition to enabling connectivity with the network service provider. In this way, the RGR becomes the central networking device in the home network through which the various connected devices can be effectively secured and managed.
A Network Processor (NP) is the heart of the RGR and is responsible for analyzing the network packets generated by the devices connected to the home network. The data communicated in a network packet is generally classified into the header and the payload. The payload consists of the actual application data which is exchanged by the network devices, while the header information in a packet helps to identify the next-hop device to which the packet is sent. In order to perform the packet forwarding function (switching/routing), the NP in the RGR generally inspects the packet header across various layers specified as part of the Open Systems Interconnection (OSI) specification [13]. However, in order to perform the content-aware networking functions, it is essential to inspect the complete network packet, including the headers at various
layers and the content of the payload in a packet. So, it is essential to perform Deep Packet Inspection (DPI) in the NP, the technology which allows a network packet to be inspected completely and thus enables content-aware networking in the next-generation RGR.
A typical network processor which is used in the RGR consists of multiple wireline1
and wireless network interfaces2 through which it can enable communication between multiple connected devices. The incredible development in the wireline (2.5/5 Gbps Ethernet [17]) and wireless (802.11ax [18]) communication technologies used in the home networking ecosystem is allowing devices to communicate at multigigabit rates, even within the home network. Moreover, the deployment of services such as fiber-to-the-home (FTTH) [19, 20] mandates that the network processors used in the RGR perform packet processing at multi-gigabit rates with the network service provider [21]. As more applications begin to leverage the available bandwidth in the home network, it is crucial to perform DPI at multi-gigabit rates to process packets in a content-aware manner.
The next section provides a short introduction to various packet inspection methodologies and further describes the challenges associated with performing DPI.
1.2 Introduction to Deep Packet Inspection
1.2.1 Packet Inspection Methodologies - Overview
Figure 1.2 shows the various layers3 which are inspected as part of network packet processing. Based on the depth of the layers up to which a packet is inspected, the packet inspection methodologies are classified into Shallow Packet Inspection (SPI), Medium-depth Packet Inspection (MPI) and Deep Packet Inspection (DPI) [22].
The SPI and the MPI methodologies primarily inspect the packet header alone. The difference between the two methodologies arises from the specific header layers which are inspected in a packet. Only the packet headers corresponding to layers 2, 3 and 4 are inspected as part of the SPI, while the header information across all the header layers is inspected as part of the MPI. Various rules are defined to inspect specific portions of the headers in a packet, and the rules also specify actions which have to be applied after a rule match. It should be noted that the location of the header fields which are inspected in a packet is known a priori, as the headers adhere to standard network protocols. So, the processing associated with the SPI
1The current-generation network processors in an RGR consist of 4 Ethernet ports, each of which is capable of communicating at 10/100/1000 Mbps [14, 15, 16].
2Next-generation network processor architectures used for the RGR are equipped with multiple (typically 4) WiFi baseband processors which provide a consolidated wireless throughput of multiple gigabits per second (∼6 Gbps) [14, 15, 16].
3According to the OSI model, the information in a network packet is primarily split into 7 layers: the physical layer, data link layer, network layer, transport layer, session layer, presentation layer and the application layer. The physical layer forms the lowest level of abstraction while the application layer is the highest. The physical layer represents the medium in which the data is communicated. So the header information is generally added for the other 6 layers and the packet inspection is also performed corresponding to these 6 layers.
Figure 1.2: An overview of various packet inspection methodologies
and the MPI primarily involves searching for specific fields in the header whose location is known.
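The header-rule matching described above can be sketched as follows. This is an illustrative sketch only: the rule format, field names and actions are assumptions for the example, not taken from this dissertation or any specific product.

```python
# Hypothetical SPI-style rule lookup: each rule is a set of exact-match
# constraints on header fields at known locations, plus an action to apply
# on a match. Field names and actions below are made up for illustration.

RULES = [
    ({"proto": "TCP", "dst_port": 23}, "drop"),   # e.g., block telnet
    ({"proto": "UDP", "dst_port": 53}, "log"),    # e.g., log DNS queries
]

def classify(headers: dict) -> str:
    """Return the action of the first rule whose constraints all match."""
    for constraints, action in RULES:
        if all(headers.get(k) == v for k, v in constraints.items()):
            return action
    return "forward"                              # default action

pkt = {"src_ip": "10.0.0.5", "dst_ip": "8.8.8.8",
       "proto": "UDP", "dst_port": 53}
print(classify(pkt))   # → log
```

Because every inspected field sits at a known offset in a standard protocol header, this lookup involves no searching through the payload, which is what keeps SPI/MPI processing cheap compared to DPI.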
As shown in Figure 1.2, in the case of DPI the packet payload is inspected in addition to the packet headers across the various layers. So, the rules which are used for packet inspection in DPI consist of two parts. The first part consists of specific fields which have to match in the packet header. After the required fields in the packet headers match, a signature is searched for in the payload portion of the packet. The signatures which are used for payload inspection are composed of either strings or regular expressions. Unlike header inspection, in which the location of the inspected portion of the header is known, the location of the signature being searched for in the payload is unknown. Moreover, the signature can start at any byte location within the sequence of payload bytes and can even be split across the payload portions of multiple packets. So, all the payload bytes across multiple packets have to be sequentially inspected to check whether the signatures are found within the payload byte sequence. Thus, payload inspection is a computationally challenging task in DPI in comparison to header inspection.
1.2.2 Steps in Deep Packet Inspection
The various functions which are performed as part of DPI can be broadly classified into packet normalization, packet reordering, packet prefiltering, signature matching and postprocessing [23]. Figure 1.3 shows a logical ordering of these functions, and a short description of each function is outlined below.
Figure 1.3: Overview of the various steps to be performed in deep packet inspection
• Normalization: Normalization [24] is the first step in DPI, in which various sanity checks are performed on a network packet before performing the payload inspection. Various researchers [24, 25] have identified that the network and the transport layer headers can be synthetically crafted to evade the signature matching function. So, the main aim of normalization is to eliminate such malformed packets and to ensure that only those packets which comply with the network protocols are inspected.
• Packet Reordering: The packets which are processed by a network processor are generated by different network devices and have to be initially split into network packet streams before performing the payload inspection. Furthermore, the packets corresponding to a stream should also be reordered to make sure that they are inspected in the right sequence [26]. The normalization and the packet reordering functions have to be performed on each and every individual packet.
• Packet Prefiltering: Since the main aim of DPI is to enable content-aware networking, the signatures are generally associated with a certain higher-layer application protocol, e.g., HTTP, FTP etc. So, the signature database is also divided into subsets of signatures, where each subset identifies the signatures associated with a specific application protocol. The application protocol information is extracted by inspecting the transport layer or the application layer headers. The extracted information can be used to compare the packet payloads against a signature subset, instead of comparing the payload bytes against all the
signatures. The process of comparing the packet payload against a signature subset after inspecting the packet headers is called packet prefiltering and has been used in various DPI implementations [27, 28, 29]. Packet prefiltering is a one-time step which is generally performed during the packet classification stage, where a new stream is categorized into a specific higher-layer protocol.
• Signature Matching: Signature matching is the most important step in DPI, in which the payload bytes are compared against the signatures. In this step, each and every byte in the payload is compared against the database of signatures. Since there can be multiple signatures in a signature subset, it is essential to match the payload bytes against all the signatures at once to perform the function in a computationally effective manner. Moreover, since the application data is split across multiple packet payloads, a signature may be split across multiple network packets. So, it is essential to match signatures across multiple network packets. To summarize, signature matching is the most computationally challenging task in DPI and is also the most suitable function for hardware acceleration to perform DPI at multi-gigabit line rates [30]. So, considering the fact that network traffic at such high rates has to be inspected by the network processors in the RGR, hardware acceleration of signature matching becomes quintessential to perform line-rate signature matching within the home network [31].
• Postprocessing: Once a signature is successfully matched, the action corresponding to that signature is performed on the packet or the stream. The action associated with a signature is generally described as part of its definition, while the criticality of the postprocessing function varies depending on the individual signatures.
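The cross-packet requirement in the signature matching step can be made concrete with a toy sketch. Signature set, packet contents and function names below are made up for illustration; real engines carry automaton state across packets rather than buffering bytes, but the effect is the same.

```python
# A signature split over two packet payloads is missed by a per-packet
# search, but found by a stream-aware scan that carries over the tail of
# the previous payload. All names and data here are illustrative.

SIGNATURES = [b"attack", b"evil"]
MAX_LEN = max(len(s) for s in SIGNATURES)

def per_packet_scan(packets):
    """Naive: inspect each payload in isolation."""
    return any(sig in p for p in packets for sig in SIGNATURES)

def stream_scan(packets):
    """Stream-aware: keep enough trailing bytes to bridge packet boundaries."""
    carry = b""
    for p in packets:
        window = carry + p
        if any(sig in window for sig in SIGNATURES):
            return True
        carry = window[-(MAX_LEN - 1):]   # longest possible partial match
    return False

packets = [b"GET /att", b"ack.js HTTP/1.1"]   # "attack" straddles the split
print(per_packet_scan(packets), stream_scan(packets))   # → False True
```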
To summarize, signature matching is the most time-critical function in DPI; accelerating it is the key to enabling network processors to perform DPI at line rates. The next section discusses the computational and storage challenges associated with the signature matching function.
1.2.3 Signature Matching - Requirements & Challenges
The signatures which are used for DPI are represented through either strings or regular expressions. Since the signatures cannot be directly processed by modern processor architectures, they are converted into either a Deterministic Finite Automaton (DFA) or a Non-deterministic Finite Automaton (NFA). Both are machine-readable state machine representations which are functionally equivalent to the signature set. Since the processing complexity of comparing a payload byte against the DFA is constant (O(1)), it is the preferred form to represent the signatures, especially for high-speed signature matching applications [32]. On the other hand, the DFA is highly storage-inefficient due to the presence of redundant state transitions. Moreover, when the signatures are described through regular expressions, the total number of states generated in the DFA can grow exponentially; this is referred to as the state explosion problem [33].
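To make the O(1)-per-byte property concrete, the following sketch builds a dense DFA transition table for a single string signature and matches payload bytes with one table lookup per byte. The construction is a standard KMP-style one chosen for illustration; the pattern and all names are made up, and this is not the dissertation's implementation.

```python
# Illustrative dense DFA for the string signature "evil": each of the
# 5 states stores 256 transitions, which is exactly why an uncompressed
# DFA is storage-hungry — most entries just fall back towards state 0.

SIGMA = 256                       # alphabet: one column per payload byte value
PATTERN = b"evil"
N_STATES = len(PATTERN) + 1       # states 0..4; state 4 is accepting

next_state = [[0] * SIGMA for _ in range(N_STATES)]
for s in range(N_STATES):
    for c in range(SIGMA):
        # longest suffix of (matched prefix + byte c) that is a pattern prefix
        k = min(s + 1, len(PATTERN))
        while k > 0 and (PATTERN[:s] + bytes([c]))[-k:] != PATTERN[:k]:
            k -= 1
        next_state[s][c] = k

def match(payload: bytes) -> bool:
    """Sequentially inspect every payload byte: one table lookup per byte."""
    s = 0
    for b in payload:
        s = next_state[s][b]
        if s == len(PATTERN):     # accepting state reached
            return True
    return False

print(match(b"GET /ev"), match(b"GET /evil.js"))   # → False True
```

Note that each row of the table holds 256 transitions, so even this 4-byte pattern occupies 5 × 256 entries; the compression techniques discussed next target exactly this redundancy.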
So, the standard approach proposed in the literature is to compress the DFA, store it in memory and eventually perform the signature matching against the compressed DFA [34]. The state explosion problem during DFA generation is addressed by performing state compression, while the redundant transitions are compressed through transition compression algorithms. Since these two approaches address independent problems, the transition and the state compression algorithms are orthogonal to each other [35].
The transition compression algorithms play an important role in defining the memory footprint of the compressed DFA [1]. Achieving high transition compression rates reduces the memory footprint of the compressed DFA. However, it is equally important to make sure that the algorithmic approach towards transition compression allows the decompression to be performed in a dedicated hardware accelerator. This enables the signature matching function, and in turn DPI, to be performed at line rates.
The transition compression algorithms proposed in the literature either focus on efficiently compressing the DFA or on performing the decompression at line rates, but not both. The transition compression algorithms [1, 32, 36, 3] which belong to the former category achieve high transition compression rates, typically of the order of over 95%. However, in these techniques the redundant transitions are compressed at the cost of increased memory bandwidth, which does not allow the decompression to be performed through dedicated hardware accelerators [4]. On the other hand, the transition compression algorithms which belong to the latter category [4, 5] focus on compressing the DFA through bitmaps. Compressing the redundant state transitions through a bitmap maintains the same constant processing complexity as the DFA, which allows the decompression to be performed through dedicated accelerators. However, these algorithms require the bitmap to be stored together with the compressed DFA to identify the location of the compressed transition. So, in order to reduce the number of bitmaps stored together with the compressed DFA, the bitmap based compression solutions proposed in the literature compromise on the transition compression rates and only achieve ∼90-95%. Considering the strategic importance of transition compression, there is a need for a class of transition compression algorithms which can effectively compress the DFA as well as enable hardware acceleration of the decompression function to perform signature matching at line rates.
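The bitmap idea can be illustrated with a small single-state sketch; this is a generic example of bitmap compression, not the MSBT/LSCT scheme itself. Transitions identical to a default target are dropped, a bitmap marks the characters that keep explicit transitions, and a population count over the bitmap prefix indexes into the compressed list, preserving the constant-time lookup.

```python
# Generic bitmap compression of one DFA state row (illustrative only).

def compress_row(row, default):
    """Drop transitions equal to `default`; bitmap marks the survivors."""
    bitmap = [0] * len(row)
    compressed = []
    for c, t in enumerate(row):
        if t != default:
            bitmap[c] = 1
            compressed.append(t)
    return bitmap, compressed

def lookup(bitmap, compressed, default, c):
    """Constant-work lookup: in hardware the prefix sum is a popcount."""
    if bitmap[c] == 0:
        return default                 # redundant transition, never stored
    idx = sum(bitmap[:c])              # bits set before position c
    return compressed[idx]

# Toy row: 256 transitions, almost all falling back to state 0.
row = [0] * 256
row[ord("e")] = 1
row[ord("x")] = 7
bitmap, compressed = compress_row(row, default=0)
assert all(lookup(bitmap, compressed, 0, c) == row[c] for c in range(256))
print(len(compressed), "explicit transitions instead of", len(row))   # → 2 ... 256
```

The price of the compression is visible here too: the 256-bit bitmap must be stored alongside the two explicit transitions, which is exactly the control-data overhead that the methods proposed in this dissertation aim to reduce.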
The signature matching throughput achieved by a hardware accelerator engine in DPI applications depends heavily on the organization of the compressed state transitions in the memory. As in any computing system, the data access latency plays a big role in determining the achievable throughput in signature matching applications. The signature matching throughput is highest when the compressed DFA is completely stored in on-chip memories, as the data can be accessed at low latencies. However, if the compressed DFA is stored in a combination of on-chip and off-chip memories, the throughput of the hardware accelerator is limited by the latency to fetch the data from the off-chip memories [37]. It is possible to reduce the access latencies through the introduction of cache memories as in the case of [37]. However, loading the specific portions of the compressed DFA
into the cache memories is highly unpredictable, as it completely depends on the characteristics of the payload bytes being inspected. Thus, a great amount of effort would be required to design an effective caching algorithm for loading the data into the cache memories. So, if the DFA is compressed efficiently enough to be completely stored in the on-chip memories, the hardware accelerator can be effectively used to perform the signature matching function at line rates.
Considering that the hardware accelerator will be part of the network processor used in the RGR, the following requirements should be satisfied so that the architecture remains reusable across multiple generations of systems.
• Scalability: The signature matching engine should be scalable to support increasing network bandwidth requirements and increasing signature counts. While there is no specific data currently available regarding the signature counts required for the RGR, the significant growth in new malware [38] will require support for signature counts of the order of a few thousand. The signature matching throughput to be supported by the engine also depends on the network bandwidth and the application scenarios. Considering that multiple network interfaces are available in the RGR, each of which can support network bandwidths of multiple gigabits per second, the signature matching throughput to be supported will cross the 10 Gbps barrier in the future for the RGR.
• Flexibility & Programmability: When the compressed DFA is completely stored in the on-chip memories, it is essential to make sure that the data is effectively stored in the physical memories. Effective storage architectures should be proposed to store the compressed DFA in a flexible way so that the on-chip memory space is effectively utilized. Additionally, the access to the physical memory and the sharing of physical memory space should be made programmable to introduce a certain level of flexibility with respect to compressed DFA storage. In this way, when the accelerator is integrated with a network processor, the processor can configure the functionality of the accelerator depending on specific application requirements.
1.3 Thesis Contributions
Addressing the challenges mentioned above, two transition compression methods are proposed in this dissertation which can efficiently compress the DFA and also enable the decompression to be performed in dedicated hardware accelerators. Building on the compression algorithms, a flexible, scalable and programmable hardware accelerator is proposed which performs the signature matching function at ∼10 Gbps. The following are the key contributions of this dissertation:
• Transition Compression Through Bitmaps & Bitmasks: Though the bitmap based transition compression methods proposed in the literature perform the signature matching through dedicated hardware accelerators, they compromise on
the transition compression rates to reduce the number of bitmaps stored together with the compressed DFA. Addressing this drawback, two bitmap based transition compression methods called the Member State Bitmask Technique (MSBT) and the Leader State Compression Technique (LSCT) are proposed in this dissertation. These methods achieve transition compression rates of about 97-98%, a 4-5% improvement over the state-of-the-art solutions [39, 40]. The redundant state transitions in the DFA are compressed through a combination of bitmaps and bitmasks, a secondary layer of indexing introduced in these methods. Though the bitmasks have to be stored together with the compressed transitions, the improvement in the transition compression rates reduces the memory footprint of the compressed DFA by 50% in comparison to the state-of-the-art bitmap based solutions. The compressed DFA generated after bitmap based transition compression mainly consists of the compressed transitions and the control data (bitmaps, bitmasks etc.) which enables the compressed transitions to be located. The main idea behind the proposed solutions is to add more control data in the form of bitmasks to effectively index the redundant state transitions. Thus, even a small improvement in the transition compression rate considerably reduces the overall memory footprint of the compressed DFA.
• Optimizing the Memory Footprint of the Compressed DFA after the MSBT & the LSCT: Since the compressed DFA is intended to be completely stored in the on-chip memories after the MSBT and the LSCT, it is paramount to make sure that its memory footprint is well optimized. The following three methods are proposed to further improve the memory footprint of the compressed DFA.
1. Divide & Conquer State Grouping Method: State grouping is one of the integral steps in the proposed methods, in which the states are grouped into subsets before the redundant transitions are compressed. A compression-aware Divide and Conquer (DC) state grouping method is proposed for this step, through which the transition compression rates are improved by 0.5-2%, reaching overall compression rates of about 98-99% [41].
2. Bitmask Compression: As part of the MSBT and the LSCT, a bitmask is generated for each state in the DFA, which enables the redundant state transitions to be effectively compressed after bitmap based compression. However, a majority of the bitmasks generated in this process are redundant and are compressed through the bitmask compression process. Experimental evaluation shows that about ∼60-70% of the bitmasks are redundant and are efficiently compressed through the proposed method.
3. Combining Alphabet Compression with Bitmap Compression: Alphabet compression is a well-known method which is used to merge the indistinguishable characters in an alphabet, i.e., the ASCII character set in the
case of DPI applications. Alphabet compression is proposed to be combined with the MSBT and the LSCT, as the bitmap alone cannot compress certain redundant state transitions in the DFA.
All of the methods proposed above are orthogonal to each other and can be implemented together with the MSBT and the LSCT to optimize the memory footprint of the compressed DFA. When combined with the MSBT and the LSCT, the overall memory footprint of the compressed DFA reduces by 70% compared with the state-of-the-art solutions, an additional improvement of 20%.
• Hardware Coprocessor for Signature Matching: Utilizing the proposed methods to compress the DFA, a Bitmap based Signature Matching Engine called BiSME is proposed which performs the signature matching function in a dedicated accelerator to achieve line-rate DPI [42]. BiSME is a programmable, flexible and scalable coprocessor which stores the compressed DFA in on-chip memories after performing the MSBT. The compressed DFA is flexibly stored in the on-chip SRAMs through various efficient storage architectures. The Shared Memory Methodology efficiently stores the dynamically varying compressed transitions, whose count is signature dependent, within the predefined on-chip memory boundaries. The Packed Storage Methodology enables the unstructured bitmasks to be stored flexibly, without wasting precious on-chip memory resources. BiSME is effectively pipelined to achieve a signature matching throughput of 9.3 Gbps. Furthermore, an extension to BiSME called BOBiSME is proposed which implements the bitmask decompression in hardware and is capable of performing signature matching at 10.6 Gbps. The proposed signature matching engines were synthesized with a commercial 28nm technology library operating at 0.81V. BiSME occupies 1.43 mm² of silicon area and consumes 0.155 W of power, while BOBiSME occupies 1.18 mm²
of silicon area and consumes 0.167 W of power. A compiler was designed to convert the signature sets into the BiSME and BOBiSME memory formats. The functionality of the hardware implementation was thoroughly verified on the Cadence Palladium platform by injecting over 2 GB of identical traffic into BiSME, BOBiSME and a DFA based signature matching engine. The identical signature matching results verified the functional correctness of the hardware implementation.
1.4 Dissertation Organization
The dissertation is organized as follows. Chapter 2 first provides a detailed overview of the state-of-the-art signature matching engines and summarizes the advantages and disadvantages of the various methods proposed in the literature.
Chapter 3 proposes and evaluates the MSBT and the LSCT in which the bitmasks are introduced to significantly improve the transition compression rates in comparison to existing bitmap based solutions.
Chapter 4 proposes and evaluates three different methods, i.e., the Divide and Conquer state grouping method, the alphabet compression and the bitmask compression, which focus on further reducing the memory footprint of the compressed DFA.
Chapter 5 first discusses the two flexible storage architectures through which the compressed DFA can be efficiently stored in the on-chip memories, and then the resulting hardware engines through which the transition decompression is performed after compressing the DFA through the Member State Bitmask Technique. Furthermore, this chapter discusses the key results with respect to the achievable signature matching throughput and the functional evaluation of the proposed hardware accelerators.
Chapter 6 finally concludes the dissertation and discusses the directions for some of the future work.
2 State-of-the-Art
String and regular expression based signature matching is a problem which has been well addressed by the research community [34]. The signatures which are used for DPI have greatly evolved over time. Initially, the signatures were primarily composed using simple string patterns. However, due to the high false positive rate observed during string based signature matching, Sommer et al. [43] proposed the usage of regular expressions to describe the signatures. Since regular expressions offer better flexibility than strings, most applications prefer to use regular expressions to define the signatures [44].
The signature matching engines have correspondingly evolved with the evolution of the signature representations. Based on the type of signatures which are processed, the signature matching engines can primarily be classified into string based and automata based signature matching engines, as shown in Figure 2.1 [34]. The string based signature matching engines are capable of processing string signatures alone, while the automata based signature matching engines are capable of processing both string and regular expression signatures. Due to this advantage, a majority of the research work has primarily focused on automata based signature matching engines [34].
Existing research work on automata based signature matching engines primarily focuses on two aspects, i.e., automata compression and line-rate signature matching implementations [34, 23]. This chapter provides an overview of solutions which focus on both of these aspects.
This chapter is organized as follows. Section 2.1 provides an overview of the various string based signature matching engines proposed in the literature. Section 2.2 provides an introduction to the theory of automata and the basics of automata based signature matching. Section 2.3 provides an overview of the various DFA compression approaches while Section 2.4 provides an overview of the state-of-the-art solutions which target the problem of line rate signature matching. Finally, Section 2.5 provides an overall summary of the state-of-the-art approaches with respect to automata based signature matching.
2.1 String Based Signature Matching Engines
The very first signature matching engines which were proposed in the literature were targeted at processing string based signatures and are classified under the string based signature matching engines [34]. Initially, various classical string matching algorithms such as Knuth-Morris-Pratt (KMP) [45], Aho-Corasick (AC) [46], Boyer-Moore (BM) [47], etc., were used to match the payload bytes against the string signatures. Dharmapurikar et al. [48] identified that these algorithms are primarily tuned towards software based
Figure 2.1: Classification of various signature matching engines proposed in the literature
signature matching implementations and are not suitable for line-rate signature matching. So, they proposed to use bloom filters [49] for string based line-rate signature matching implementations. The bloom filter is a time- and space-efficient probabilistic data structure that stores a database of strings compactly in a memory vector. However, the result of a query on the bloom filter only indicates whether an element (signature) is likely to be present or definitely not present, i.e., false positives are possible as part of the querying process but false negatives will never occur. So, a secondary source is required to confirm whether the signatures were actually identified after the querying process.
The usage of bloom filters for signature matching involves a filter programming step, in which k different hash functions are used to set k different bits in an m-bit vector. For each of the signatures in the signature set, the hash functions generate a set of indices at which the bits in the vector are set to 1. As part of signature matching, the payload bytes are passed to the k different hash functions to generate the indices at which the memory vector is queried. If all the queried bits in the memory vector are set, a signature is said to be identified within the payload byte stream. Since a single m-bit vector is used to store multiple signatures, the identified indices could have been set by a signature other than the one that matched. So, a secondary analyzer is used to verify the signature matching results. However, the most important property of this data structure is that the computation time involved in performing the query is independent of the number of strings stored in the vector.
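The programming and query steps described above can be sketched as follows. The double-hashing construction over a SHA-256 digest and the parameters m and k are illustrative choices for this sketch, not those used in [48] or [30], which rely on hardware-friendly hash functions:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k hash functions set/check k bits in an m-bit vector."""
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, s):
        # Derive k indices via double hashing over one SHA-256 digest
        # (an illustrative construction, not a hardware hash).
        d = hashlib.sha256(s.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def program(self, signature):
        # Filter programming step: set the k bits for this signature.
        for i in self._indices(signature):
            self.bits[i] = 1

    def query(self, s):
        # All k bits set -> s is *possibly* stored (false positives allowed,
        # false negatives impossible); any unset bit -> definitely absent.
        return all(self.bits[i] for i in self._indices(s))

bf = BloomFilter()
for sig in ["attack", "exploit", "overflow"]:   # hypothetical signature set
    bf.program(sig)

print(bf.query("exploit"))   # programmed signatures are always reported
print(bf.query("benign"))    # usually False; True would be a false positive
```

Note that the query cost is k hash evaluations regardless of how many signatures were programmed, which is the property highlighted above.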
The very first bloom filter based signature matching methodology was proposed in [48], where multiple bloom filters were used in parallel to accelerate the string matching function. A hash based secondary analyzer was used to distinguish between a false positive and a true positive. On the other hand, Nourani et al. [30] proposed a bloom filter
based prefilter to enable a first level of filtering, together with a hash based hardware engine to perform signature matching at multi-gigabit rates. Various other enhancements of bloom filter based signature matching architectures were proposed following [48] and [30]. However, due to the emergence of regular expressions as the preferred mode to describe signatures, the research community has primarily focused on automata based signature matching engines, which are capable of processing both string and regular expression signatures [44, 34, 23, 50, 33]. The next section provides an introduction to the theory behind automata and further describes the basics of automata based signature matching approaches.
2.2 Automata Based Approaches
2.2.1 Introduction to NFA & DFA
An automaton or a Finite State Machine (FSM) is a state machine which can understand the language expressed by a set of signatures comprising both string and regular expression signatures. To simplify, an automaton is an equivalent description of a signature set in a machine readable format. A finite automaton [51] is represented as a 5-tuple (Q, Σ, δ, q0, F) as shown below:
• Q, is a finite set of states,
• Σ is a finite set of characters called alphabet,
• δ: Q × Σ → Q1, is the state transition function,
• q0 ∈ Q, is the initial state,
• F ⊆ Q, is the set of accepting states.
An automaton primarily consists of a finite set of nodes called states, Q, and labeled directed edges between the states, called state transitions. The labels on the directed edges correspond to one or multiple characters in the alphabet, Σ. An automaton contains a single initial state called the root state, q0, from which the state machine starts the signature matching function. The state transition function, δ, takes a state and a character2 as input and generates a single next state or multiple next states as output. In an automaton, F is the subset of all the states in Q which identify a signature match and is referred to as the set of accepting states.
As part of the signature matching, the root state is first assigned as the current state. Subsequently, each and every payload byte is input to the state transition function which computes the next state transition corresponding to a current state and payload byte combination. The computed next state is assigned as the current state and the
1In the case of the Non-Deterministic Finite Automaton, the state transition function generates a set of next states P(Q) corresponding to a state-character combination, i.e., δ: Q × Σ → P(Q)
2As part of the dissertation, the terms character and payload byte are used interchangeably to represent an individual payload byte in the network packets.
subsequent character is input to the machine to compute the next state transition. This process continues until all the payload bytes have been compared against the automaton. If at any point in this process the computed next state belongs to the accepting states, a signature is said to be detected by the automaton.
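The matching loop just described can be sketched as follows. The transition table below is a hypothetical toy DFA for the single string signature 'abc' over a reduced alphabet; all transitions not listed fall back to the root state (a real DFA stores a transition for each of the 256 extended ASCII characters):

```python
def dfa_match(delta, q0, accepting, payload):
    """Run the DFA over the payload, reporting the offset of every match.
    delta maps (state, character) -> single next state."""
    current = q0                           # the root state starts the matching
    matches = []
    for pos, ch in enumerate(payload):
        current = delta[(current, ch)]     # exactly one lookup per payload byte
        if current in accepting:
            matches.append(pos)            # signature detected at this offset
    return matches

# Toy DFA for the string signature "abc"; states 0..3, state 3 accepting.
states, alphabet = range(4), "abcx"
delta = {(s, c): 0 for s in states for c in alphabet}   # default: back to root
delta[(0, "a")] = 1
delta[(1, "a")] = 1       # a repeated 'a' keeps the one-character prefix alive
delta[(1, "b")] = 2
delta[(2, "a")] = 1       # an 'a' after "ab" restarts a potential match
delta[(2, "c")] = 3       # accepting state

print(dfa_match(delta, 0, {3}, "xxabcx"))   # [4]
```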
Since the 8-bit extended American Standard Code for Information Interchange (ASCII)3 character set is used for internet communication, the signatures for DPI applications are constructed using the ASCII character set. A signature set can be represented either through the Non-deterministic Finite Automaton (NFA) or the Deterministic Finite Automaton (DFA). The primary difference between the representations is the output of the state transition function, which also affects the total number of states in the machine4. In the case of the NFA, the number of next states generated by the state transition function is non-deterministic, i.e., the state transition function can generate a single next state, multiple next states or even no next state. In the case of the DFA, the state transition function always generates exactly one next state, resulting in a deterministic output. It should be noted that both machines (NFA & DFA) are equivalent representations of the language described by a signature set.
A signature set consisting of two different signatures, abc and def.*12, is used in the following to explain the concepts behind the NFA & DFA representations. The former is a string signature and is only matched if there is a sequence of an ‘a’, followed by a ‘b’ and a ‘c’ in the payload. The latter is a regular expression which is only matched if there is a sequence ‘def’ followed by a ‘12’, with zero or more characters in between. Figure 2.2(a) and (c), respectively, show the NFA and the DFA representations corresponding to this signature set. The initial state in both automaton representations is shown using a green circle, while the red circles represent the accepting states.
Figure 2.2(b) and (d), respectively, show the sequence of states traversed in the NFA and the DFA when the byte sequence to be inspected is ‘defabc12’. It can be seen that this byte sequence contains both the signatures in the signature definitions. In the case of the NFA, due to the non-determinism associated with the state transitions, multiple states are active at the same time during signature matching, as seen in Figure 2.2(b). The states which are active during the signature matching operation are called the ‘active set’ of states. So, the state transition function corresponding to a character has to be executed for all of the states in the active set. On the other hand, it can be seen from Figure 2.2(d) that only a single state is active during DFA based signature matching. Irrespective of the automaton representation, it can be seen from Figure 2.2 that the signatures matched by the sequence of bytes are identical in both the NFA and the DFA, which further demonstrates their functional equivalence.
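The active-set behaviour of the NFA can be reproduced with a small simulation. The NFA table below is hand-built for the two signatures abc and def.*12; its state names are illustrative and need not match the numbering used in Figure 2.2:

```python
ANY = object()   # wildcard edge label for the self-loops of q0 and the .* state

nfa = {
    ("q0", ANY): {"q0"},                 # unanchored search: root always active
    ("q0", "a"): {"a1"}, ("a1", "b"): {"a2"}, ("a2", "c"): {"ACC_abc"},
    ("q0", "d"): {"d1"}, ("d1", "e"): {"d2"}, ("d2", "f"): {"dot"},
    ("dot", ANY): {"dot"},               # the .* part of def.*12
    ("dot", "1"): {"n1"}, ("n1", "2"): {"ACC_def12"},
}
accepting = {"ACC_abc", "ACC_def12"}

def nfa_step(active, ch):
    """Execute the transition function for every state in the active set."""
    nxt = set()
    for q in active:
        nxt |= nfa.get((q, ch), set())
        nxt |= nfa.get((q, ANY), set())
    return nxt

active, matched = {"q0"}, set()
for ch in "defabc12":
    active = nfa_step(active, ch)
    matched |= active & accepting
    print(ch, sorted(active))            # several states active simultaneously

print(sorted(matched))                   # both signatures found in 'defabc12'
```

The printed active sets illustrate the cost discussed below: every payload byte requires the transition function to be evaluated for each member of the active set.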
Converting the signatures into an automaton consists of various steps. A variety of algorithms [52, 53, 54] have been proposed in the literature to convert a signature into
3With respect to DPI applications, the extended ASCII character set is generally referred to as the ASCII character set. So, as part of the dissertation, the usage of the ASCII character set refers to the 8-bit extended ASCII character set.
4In the rest of the dissertation, the term state-character combination will be used to refer the inputs to the state transition function.
Figure 2.2: A signature set with 2 signatures abc, def.*12 represented as (a) NFA and (c) DFA. The states traversed during (b) NFA and (d) DFA based signature matching.
the NFA, which is the first step in converting the signatures into an automaton. After converting the signatures into the NFA, the subset construction algorithm [52] is used to convert the NFA into the DFA. As part of the subset construction algorithm, all the unique active set combinations seen in the NFA are converted into unique DFA states. In the theoretical worst case, if there are ‘N’ states in the NFA, the equivalent DFA could consist of a maximum of 2^N states. However, this exponential state blow-up, called state explosion, only results when the regular expressions in the signatures consist of constrained or unconstrained repetitions of wildcards or large character ranges [33, 50, 44]. Once the DFA is generated from the signature set, the state minimization algorithm [55] is used to generate the most compact DFA equivalent to the signature set.
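A minimal sketch of the subset construction step, applied to a small hand-built NFA. The full algorithm in [52] additionally handles ε-transitions, which are omitted here for brevity, and the example NFA for the unanchored string "ab" is illustrative only:

```python
from collections import deque

def subset_construction(nfa, start, nfa_accepting, alphabet):
    """Convert an NFA (dict: (state, char) -> set of next states) into a DFA
    whose states are frozensets of NFA states: every unique active set seen
    becomes one DFA state, as described above."""
    start_set = frozenset([start])
    delta, accepting = {}, set()
    seen, queue = {start_set}, deque([start_set])
    while queue:
        current = queue.popleft()
        if current & nfa_accepting:
            accepting.add(current)   # any accepting NFA member -> accepting DFA state
        for ch in alphabet:
            nxt = frozenset(q2 for q in current for q2 in nfa.get((q, ch), ()))
            delta[(current, ch)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return delta, accepting, seen

# NFA for the single unanchored string signature "ab" over the alphabet {a, b}.
nfa = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}, ("q1", "b"): {"q2"}}
delta, acc, states = subset_construction(nfa, "q0", {"q2"}, "ab")
print(len(states))   # 3: each reachable active-set combination is one DFA state
```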
2.2.2 Automata Based Signature Matching - Complexity Analysis
Table 2.1 shows a comparison of the processing complexity and the storage cost incurred to store the NFA and the DFA. Since the state transition function in the DFA results in a single next state, the processing complexity associated with comparing an individual payload byte is always constant. On the other hand, since the state transition function has to be executed across multiple states in the active set, the worst case processing complexity of comparing a payload byte in the case of the NFA is quadratic in the number of states5. With respect to automata storage, however, the amount of memory required to store the DFA grows exponentially with the number of states in worst case scenarios, while it is linear in the case of the NFA. To summarize, the DFA is processing efficient but storage inefficient, while the NFA is processing inefficient but storage efficient.
Table 2.1: Processing and storage complexity associated with NFA & DFA

         Processing complexity / payload byte    Storage cost
  NFA    O(N^2)                                  O(N)
  DFA    O(1)                                    O(2^N)
Due to the constant processing complexity associated with processing a payload byte, the DFA is generally preferred for high speed signature matching applications [32]. Though the constant processing complexity of the DFA enables signature matching at high rates, the storage problem has to be addressed for efficient signature matching implementations [1]. To address the storage problem of the DFA, the research community has primarily focused on compressing the DFA before it is stored in the memory [23, 34]. The next section provides a detailed overview of the various DFA compression algorithms discussed in the literature.
2.3 DFA Compression
Since the DFA is a deterministic state machine representation, it stores a single state transition for each of the characters in the alphabet, Σ, for every state. Consequently, the total number of state transitions stored in the DFA depends linearly on the total number of states as well as on the size of the alphabet. Assuming that the DFA consists of ‘S’ states6, the total number of state transitions (DFA Trans) that have to be stored in the memory is given by (2.1).
DFA Trans = S × |Σ| (2.1)
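A worked instance of (2.1): assuming a hypothetical DFA with 15k states, the 256-character extended ASCII alphabet and an assumed 4-byte next-state encoding, the raw transition storage already reaches roughly 15 MB:

```python
S = 15_000           # hypothetical DFA state count (e.g. after rule grouping)
SIGMA = 256          # |Σ|: the 8-bit extended ASCII alphabet
BYTES_PER_TRANS = 4  # assumed width of one stored next-state pointer

dfa_trans = S * SIGMA                        # equation (2.1)
print(dfa_trans)                             # 3,840,000 stored transitions
print(dfa_trans * BYTES_PER_TRANS / 2**20)   # ~14.6 MiB of raw transition memory
```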
5The number of states here represents the total number of states generated in the NFA when a signature set is converted into the automaton.
6It is a convention in the field of automata theory to represent a DFA with ‘S’ states corresponding to an NFA with ‘N’ states.
The DFA compression methods proposed in the literature can be broadly classified into state compression and transition compression approaches [34]. The state compression algorithms primarily focus on reducing the number of states in the DFA, while the transition compression algorithms focus on compressing the redundant state transitions in the DFA. A detailed description of both approaches is given below.
2.3.1 State Compression
2.3.1.1 State Explosion Problem
Under certain circumstances, the conversion of an NFA into the DFA can result in an exponential explosion in the number of states created in the DFA. This is called the state explosion problem [50] and is typically addressed by the state compression algorithms. The exponential explosion of states primarily occurs when the regular expressions contain constrained or unconstrained repetitions either over the wildcard character (.) or over character ranges. Various analyses [50, 33] have identified that the state explosion problem can be narrowed down to the following 3 scenarios7, as described below:
• Scenario 1: Length Restriction on an Anchored Regular Expression State explosion occurs when the signature contains a regular expression of the following format: ˆA+[A-Z]{k}B. This pattern can be broken down into 3 parts: the prefix (A+), the middle portion with the length restriction on the character range or wildcard ([A-Z]{k}), and the suffix (B). The state explosion in this case occurs due to the overlap between the character(s) in the prefix and the character range in the middle portion. The generated DFA contains states which record all possible combinations of the overlap occurrence in order to remain functionally equivalent to the original signature representation. In addition to the states generated for the prefix and the suffix, O(k^2) states are generated to track the overlap occurrence. For example, Figure 2.3(a) shows this scenario, where the DFA corresponding to the signature ˆA+[ˆ\n]{3}B consists of 9 states (states 1-10) to keep track of all possible combinations which potentially lead to a signature match. It should be noted that an anchored regular expression (an expression with ˆ as the first symbol) is matched in the payload bytes if and only if the pattern matches exactly from the first byte of the payload byte sequence.
• Scenario 2: Length Restriction on an Un-anchored Regular Expression The second scenario is very similar to the first, but concerns the case where the regular expression does not have the anchor symbol, i.e., the regular expression is of the format .*A[A-Z]{k}B. In this scenario, the state explosion also occurs because of the overlap between the characters in the prefix and the ones in the character
7The examples associated with the 3 scenarios are taken from [34].
Figure 2.3: (a) state explosion scenario 1 (b) state explosion scenario 2 (c) state explosion scenario 3
class in the middle portion. Since this is not an anchored regular expression, there is no restriction with respect to the position of the signature within the sequence of payload bytes. For example, an ‘A’ seen anywhere in the sequence of bytes would potentially lead the DFA to search for the signature, irrespective of whether the previous few bytes resulted in a partial signature match. In this scenario, O(2^k) additional states are required to keep track of all possible combinations leading towards a potential signature match. Figure 2.3(b) shows an example of this scenario, where a DFA is shown corresponding to the regular expression .*A.{2}B.
• Scenario 3: Regular Expression Combinations Unlike the previous scenarios, where the state explosion resulted from the specific characteristics of an individual signature, this scenario occurs due to the interaction of multiple regular expressions containing wildcard terms. Two regular expressions of the format RE1=sub1.*sub2 and RE2=sub3.*sub4 are used to show the state explosion in this scenario, where each subi refers to a sequence of characters. The states corresponding to the sub-expressions are duplicated as shown in Figure 2.3(c), resulting in state explosion. This scenario occurs because the unrestricted wildcard character in
a signature could potentially match the sub expressions of other signatures. So, the resulting DFA should be able to keep track of the following scenarios:
1. After matching a sub-expression within a regular expression (e.g. sub1 in RE1 / sub3 in RE2), the second sub-expression should be tracked for a possible signature match (e.g. sub2 in RE1/sub4 in RE2).
2. After matching the first sub-expression in a certain regular expression (e.g. sub1 in RE1 / sub3 in RE2), the sub-expressions which lead to a different signature should be tracked simultaneously (e.g. sub3 in RE2 / sub1 in RE1).
Thus, when multiple regular expressions with unrestricted wildcard character repetitions are converted into a single DFA, the number of states in the resulting DFA grows rapidly.
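The O(2^k) growth of scenario 2 can be checked empirically by brute-force subset construction over a hand-built NFA for .*A.{k}B. The 3-character alphabet is a simplifying assumption that keeps the enumeration small while still exposing the growth; this is a sketch, not the analysis of [50, 33]:

```python
from collections import deque

def dfa_state_count(k, alphabet="ABx"):
    """Count reachable DFA states for the regex  .*A.{k}B  via brute-force
    subset construction (reduced 3-character alphabet for brevity)."""
    # NFA states: 0 = start (self-loop on any char), 1..k+1 = positions
    # reached after the 'A', k+2 = accepting state reached on the final 'B'.
    def step(active, ch):
        nxt = set()
        for q in active:
            if q == 0:
                nxt.add(0)                 # .* self-loop on every character
                if ch == "A":
                    nxt.add(1)
            elif 1 <= q <= k:
                nxt.add(q + 1)             # consume one of the k wildcards
            elif q == k + 1 and ch == "B":
                nxt.add(k + 2)             # signature matched
        return frozenset(nxt)

    start = frozenset({0})
    seen, queue = {start}, deque([start])
    while queue:
        cur = queue.popleft()
        for ch in alphabet:
            nxt = step(cur, ch)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)

for k in (2, 3, 4, 5):
    print(k, dfa_state_count(k))   # the count roughly doubles as k grows by 1
```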
2.3.1.2 State Compression Solutions
Various solutions have been proposed in the literature targeting the state explosion problem. Based on the algorithmic approach used to solve the state explosion problem, the solutions can be classified into rule grouping, semi-deterministic FAs and decomposed FAs [34]. A brief overview of each approach is given below:
• Rule Grouping: Since the state explosion primarily happens when multiple signatures are combined into a single DFA, one way to avoid this problem is to create multiple DFAs corresponding to a signature set. Solutions such as [33, 56] propose rule grouping algorithms to split the signatures into ‘m’ subsets, so that a single DFA is created for each subset. Since the payload bytes of the network streams have to be compared against the original signature set, they have to be matched against each of the ‘m’ individual DFAs. So, the cost paid for rule grouping is the increase in the processing complexity, i.e., O(m), to process each individual payload byte. For example, the state-of-the-art rule grouping algorithm proposed in [56] generates 7 DFAs corresponding to a signature set consisting of only 107 regular expressions. When this signature set is directly converted into a single DFA, it consists of more than a million states; after rule grouping, the 7 DFAs consist of a total of only 15k states. However, with an increase in the number of signatures to be supported, the number of groups into which the signatures have to be split also increases, which linearly reduces the achievable signature matching throughput. One method discussed in the literature is the usage of multicore processors to match against multiple DFAs in parallel, hiding the linear reduction in the signature matching throughput. However, the number of DFAs into which the signatures are split depends entirely on the characteristics of the signature set. So, the scalability of this approach to support an increasing number of signatures is a major concern [34].
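The O(m) matching cost of rule grouping can be illustrated with Python's re module standing in for the m per-group DFAs. The grouping and the signatures below are hypothetical; a real rule grouping algorithm [33, 56] chooses the subsets so that each group compiles into a small DFA:

```python
import re

# Hypothetical signature set, already split into m = 3 groups.
groups = [
    re.compile(r"abc|def.*12"),        # group 0
    re.compile(r"exploit[0-9]{2}"),    # group 1
    re.compile(r"GET /admin"),         # group 2
]

def match_payload(payload):
    # Every payload must be scanned once per group: O(m) work per byte,
    # which is exactly the processing cost rule grouping pays.
    return [i for i, g in enumerate(groups) if g.search(payload)]

print(match_payload("xxdefzz12yy"))           # [0]
print(match_payload("GET /admin exploit42"))  # [1, 2]
```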
• Semi-deterministic FA: Solutions such as the Hybrid Finite Automata (HFA) [50], the Tunable Finite Automata (TFA) [57] and the DFA with extended character set (DFA-eC) [58] fall into this category and are semi-deterministic automata representations. In the case of the HFA, a certain portion of the automaton is deterministic, while the signature structures which lead to state explosion are kept non-deterministic. So, the number of states in the active set during signature matching varies depending on whether the deterministic or the non-deterministic portion of the HFA is traversed. On the other hand, solutions such as [57, 58] propose a fixed number of states which are always active during signature matching. In the case of the TFA, the number of active states is greater than 1 and is predefined to a fixed, configurable value (i.e., 2/3/4), while in the case of the DFA-eC, it is always set to 2. With respect to state compression, more than 90% of the states on average are compressed by all of these methods, while the exact number of states compressed varies depending on the characteristics of the specific signature sets. Since there is no standard signature set used for evaluation purposes, comparing the state compression results across the techniques discussed above is not straightforward. In general, the cost paid for the state compression in these solutions is the increase in the memory bandwidth due to the semi-deterministic nature of the generated automata. In the case of the HFA, the memory bandwidth varies dynamically depending on the specific portion of the automaton that is being traversed; in worst case scenarios, the memory bandwidth required by the HFA is equivalent to that of the NFA. In the case of the DFA-eC, the memory bandwidth doubles, which is also reflected in the 50% reduction in signature matching performance in comparison to the DFA.
• Decomposed FA: Decomposed FAs are a collection of solutions in which additional control information is augmented to the state transitions to combat the state explosion. Solutions such as the eXtended Finite Automata (XFA) [44] and the Jump Finite Automata (JFA) [35] belong to this category. For example, if the regular expression is of the format ab.*cd, a set bit is used to indicate in the state transition that the sub-expression ab has been matched. The state transition leading to the state after matching ‘c’ is only taken after checking the status of the set bit. In this way, the state explosion resulting from the combination of regular expressions with unrestricted wildcard characters is easily combated. The additional information which is augmented into the state transitions varies depending on the characteristics of the signature set. On the other hand, Becchi et al. [59] proposed the usage of multiple counters to keep track of the characters seen in length restricted regular expressions8, to evade the state explosion in the DFA. So, the decomposed FAs introduce additional control information along with the state transitions to solve the state explosion problem. Unlike the semi-deterministic FAs, the memory bandwidth associated with the state transition fetch in the case of the decomposed FAs is the same as that of the DFA. However, the additional processing introduced as part of the execution of the state
8Explained as part of scenario 1 and scenario 2 in the previous section.
transition function depends on the information augmented in the state transition [34].
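The set-bit idea for ab.*cd can be sketched as follows. This is only an illustration of the auxiliary-flag principle behind the decomposed FAs, not the actual XFA instruction set of [44]:

```python
def match_ab_dotstar_cd(payload):
    """Decomposed-FA style matcher for  ab.*cd : a single auxiliary bit
    records that 'ab' has been seen, so no .*-induced state blow-up occurs."""
    seen_ab = False          # auxiliary bit attached to the automaton
    prev = ""
    for ch in payload:
        if prev + ch == "ab":
            seen_ab = True   # 'set bit' action on the edge that completes "ab"
        if prev + ch == "cd" and seen_ab:
            return True      # the cd-completing edge is taken only if bit set
        prev = ch
    return False

print(match_ab_dotstar_cd("xxabyycdzz"))  # True: ab ... cd in order
print(match_ab_dotstar_cd("cdab"))        # False: cd precedes ab
```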
After performing the state compression, the total number of states in the DFA reduces from S to S′ (S′ ≪ S). However, each of the states in the state compressed DFA still stores the state transitions for all the characters in the alphabet Σ. So, even after performing the state compression, a major portion of the state transitions is redundant and has to be compressed through the transition compression algorithms. Performing the transition compression is therefore mandatory in the case of the DFA, irrespective of whether the state compression is performed or not.
2.3.2 Transition Compression
Though there are 256 state transitions for each of the states in the DFA, a majority of these state transitions are redundant and lead the machine towards the root state or one of the states close to the root state [1]. The process of compressing these redundant state transitions in the DFA is called transition compression. As discussed previously, transition compression can be performed either on the DFA or on a state compressed DFA. Various transition compression algorithms have been proposed in the literature and they can be broadly classified into hardware oriented and software oriented algorithms [39]. This section also discusses the alphabet compression mechanism which com