Calhoun

Institutional Archive of the Naval Postgraduate School

Calhoun: The NPS Institutional Archive DSpace Repository

Theses and Dissertations

1. Thesis and Dissertation Collection, all items

2017-03

Identification of low-latency obfuscated traffic using multi-attribute analysis

Dougherty, Kevin R.

Monterey, California: Naval Postgraduate School

http://hdl.handle.net/10945/52975

This publication is a work of the U.S. Government as defined in Title 17, United States Code, Section 101. Copyright protection is not available for this work in the United States.

Downloaded from NPS Archive: Calhoun

DUDLEY KNOX LIBRARY

http://www.nps.edu/library

Calhoun is the Naval Postgraduate School's public access digital repository for research materials and institutional publications created by the NPS community. Calhoun is named for Professor of Mathematics Guy K. Calhoun, NPS's first appointed and published scholarly author.

Dudley Knox Library / Naval Postgraduate School 411 Dyer Road / 1 University Circle Monterey, California USA 93943

NAVAL POSTGRADUATE SCHOOL

MONTEREY, CALIFORNIA

THESIS

IDENTIFICATION OF LOW-LATENCY OBFUSCATED

TRAFFIC USING MULTI-ATTRIBUTE ANALYSIS

by

Kevin R. Dougherty

March 2017

Thesis Advisor:

Shelley Gallup

Co-Advisor:

Thomas Anderson

Approved for public release. Distribution is unlimited.

THIS PAGE INTENTIONALLY LEFT BLANK

REPORT DOCUMENTATION PAGE

Form Approved OMB No. 0704-0188

Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instruction, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188) Washington, DC 20503.

1. AGENCY USE ONLY (Leave blank)

2. REPORT DATE: March 2017

3. REPORT TYPE AND DATES COVERED: Master's thesis

4. TITLE AND SUBTITLE: IDENTIFICATION OF LOW-LATENCY OBFUSCATED TRAFFIC USING MULTI-ATTRIBUTE ANALYSIS

5. FUNDING NUMBERS

6. AUTHOR(S): Kevin R. Dougherty

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Naval Postgraduate School

8. PERFORMING ORGANIZATION REPORT NUMBER

9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES): N/A

10. SPONSORING/MONITORING AGENCY REPORT NUMBER

11. SUPPLEMENTARY NOTES: The views expressed in this thesis are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government. IRB number: N/A.

12a. DISTRIBUTION/AVAILABILITY STATEMENT: Approved for public release. Distribution is unlimited.

12b. DISTRIBUTION CODE

13. ABSTRACT (maximum 200 words)

There is no process or system capable of detecting obfuscated network traffic on Department of Defense (DOD) networks, and the quantity of obfuscated traffic on DOD networks is unknown. The presence of this traffic on a DOD network creates significant risk from both insider-threat and network-defense perspectives. This study used quantitative correlation and simple network-traffic analysis to identify common characteristics, relationships, and sources of obfuscated traffic. Each characteristic was evaluated individually for its ability to detect obfuscated traffic and in combination in a set of Naive Bayes multi-attribute prediction models. The best performing evaluations used multi-attribute analysis and proved capable of detecting approximately 80 percent of obfuscated traffic in a mixed dataset. By applying the methods and observations of this study, the threat to DOD networks from obfuscation technologies can be greatly reduced.

14. SUBJECT TERMS: Tor, onion routing, obfuscation, network traffic analysis, multi-attribute analysis

15. NUMBER OF PAGES: 109

16. PRICE CODE

17. SECURITY CLASSIFICATION OF REPORT: Unclassified

18. SECURITY CLASSIFICATION OF THIS PAGE: Unclassified

19. SECURITY CLASSIFICATION OF ABSTRACT: Unclassified

20. LIMITATION OF ABSTRACT

NSN 7540-01-280-5500

Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. 239-18


THIS PAGE INTENTIONALLY LEFT BLANK


Approved for public release. Distribution is unlimited.

IDENTIFICATION OF LOW-LATENCY OBFUSCATED TRAFFIC USING

MULTI-ATTRIBUTE ANALYSIS

Kevin R. Dougherty
Lieutenant, United States Navy
B.G.S., Fort Hayes State University, 2009

Submitted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE IN CYBER SYSTEMS AND OPERATIONS

from the

NAVAL POSTGRADUATE SCHOOL March 2017

Approved by: Shelley Gallup, Ph.D.

Thesis Advisor

Thomas Anderson, Ph.D. Co-Advisor

Cynthia Irvine, Ph.D.

Chair, Cyber Academic Group

THIS PAGE INTENTIONALLY LEFT BLANK


ABSTRACT

There is no process or system capable of detecting obfuscated network traffic on Department of Defense (DOD) networks, and the quantity of obfuscated traffic on DOD networks is unknown. The presence of this traffic on a DOD network creates significant risk from both insider-threat and network-defense perspectives. This study used quantitative correlation and simple network-traffic analysis to identify common characteristics, relationships, and sources of obfuscated traffic. Each characteristic was evaluated individually for its ability to detect obfuscated traffic and in combination in a set of Naive Bayes multi-attribute prediction models. The best performing evaluations used multi-attribute analysis and proved capable of detecting approximately 80 percent of obfuscated traffic in a mixed dataset. By applying the methods and observations of this study, the threat to DOD networks from obfuscation technologies can be greatly reduced.


THIS PAGE INTENTIONALLY LEFT BLANK


TABLE OF CONTENTS

I. INTRODUCTION . 1

A. RESEARCH MOTIVATION . 1

B. PURPOSE AND SCOPE . 2

C. THESIS ORGANIZATION . 2

II. BACKGROUND AND PREVIOUS WORK . 5

A. BACKGROUND . 5

1. Early Onion Routing . 5

2. Generational Evolution . 7

3. Third Generation OR . 10

B. TOR VULNERABILITIES . 12

1. Malicious Guards . 12

2. Directory Server Control . 13

3. Leaking DNS Requests . 14

C. IDENTIFICATION OF TOR TRAFFIC . 16

1. Traffic Analysis . 16

2. HTTP Flow Analysis . 21

3. Entropy Based . 22

4. Semantic Based . 23

D. OTHER OBFUSCATION TECHNOLOGIES . 23

1. Protocol Obfuscation . 24

2. Browser-Based Proxy websites . 25

3. Decoy Routing . 26

E. CONCLUSION . 27

III. METHODOLOGY AND TESTING . 29

A. DATA GENERATION . 29

1. Baseline Data . 30

2. Assumptions . 30

B. VIRTUAL LAB CONFIGURATION . 30

C. OBFUSCATION INDICATORS . 32

1. KCC One: Low TTL Count . 32

2. KCC Two: Common Tor Packet Sizes . 33

3. KCC Three: High TCP Offset . 34

4. KCC Four: Known Tor Exit Node . 34

D. MULTI-ATTRIBUTE DECISION MODEL . 35

E. CONCLUSION . 36


IV. DATA ANALYSIS AND RESULTS . 39

A. DATA ANALYSIS . 39

1. KCC One: Low TTL Count . 39

2. KCC Two: Common Tor Packet Size . 41

3. KCC Three: High TCP Offset . 43

4. KCC Four: Known Tor Exit Node . 45

5. Inter-Attribute Correlation . 45

B. SINGLE ATTRIBUTE ANALYSIS . 47

1. KCC One . 47

2. KCC Two . 48

3. KCC Three . 48

C. MULTI-ATTRIBUTE ANALYSIS . 49

1. Baseline Testing . 49

2. Filtered Training . 50

D. CONCLUSION . 51

V. CONCLUSION . 53

A. RESULTS . 53

B. FUTURE RESEARCH . 54

1. High RTT . 54

2. HTTP Flow Analysis . 55

3. Varied Source IP . 55

4. Vector Relational Data Modeling . 55

5. Real-Time Detection Axioms . 56

C. FINAL THOUGHTS . 56

APPENDIX A. MANUAL FIREFOX PROXY CONFIGURATION . 57

APPENDIX B. SNORT AND LOGPARSER 2.2 CONFIGURATION . 59

A. SNORT . 59

1. MSSQL Logging . 59

2. snort.conf Configuration . 59

3. Windows Batch File Configuration . 60

B. LOGPARSER 2.2 . 60

1. LogParser 2.2 MSSQL Database Creation . 60

2. LogParser 2.2 On-Demand Parsing . 61

APPENDIX C. NAIVE BAYES MULTI-ATTRIBUTE TESTING RESULTS . 63

A. MULTI-ATTRIBUTE TEST ONE . 63

B. MULTI-ATTRIBUTE TEST TWO . 65


C. MULTI-ATTRIBUTE TEST THREE . 66

APPENDIX D. DATA ANALYSIS R-SCRIPT . 69

APPENDIX E. PROPOSED AXIOMS AND ACCUMULATOR MODEL . 77

A. DETECTION AXIOMS . 77

1. IP TTL . 77

2. IP Packet Length . 78

3. TCP Offset . 79

4. Known Tor Exit Node . 80

5. High RTT . 80

6. HTTP Flow Analysis . 81

7. Varied Source IP . 82

B. ACCUMULATOR . 83

LIST OF REFERENCES . 85

INITIAL DISTRIBUTION LIST . 89


THIS PAGE INTENTIONALLY LEFT BLANK


LIST OF FIGURES

Figure 1. Complete forward onion. Source: [15] . 7

Figure 2. First generation OR topology depicting a five hop OR circuit. Source: [15] . 8

Figure 3. Basic Tor configuration. Source: [21] . 10

Figure 4. Diffie-Hellman based routing onion. Source: [23] . 11

Figure 5. Basic DefecTor DNS traffic attack. Source: [20] . 15

Figure 6. Default IP and TCP header fields. Source: [30] . 17

Figure 7. Linear relationship between hop-count (TTL) and RTT. Source: [28] . 19

Figure 8. Linear relationship between geographical distance and RTT. Source: [31] . 19

Figure 9. Common Tor packet size distribution. Source: [7] . 20

Figure 10. Fully encapsulated packet depicting TCP header and payload location. Source: [33] . 21

Figure 11. Internet Freedom badge enabling access to browser-based proxy services. Source: [40] . 25

Figure 12. Simplified sequence of events from client to Tor relay using a browser-based flash proxy. Source: [40] . 26

Figure 13. Decoy routing process from client to covert destination via a covert tunnel from the decoy router. Source: [42] . 27

Figure 14. DarkNet virtual lab configuration . 31

Figure 15. Notional multi-attribute decision model using Naive Bayes analysis . 36

Figure 16. Simplified Venn diagram illustrating the probability of obfuscation based on two attributes. Source: [44] . 36

Figure 17. Density of Tor and non-Tor TTL values 0-250 . 40

Figure 18. Concentration of Tor and non-Tor TTL values 30-60 . 41

Figure 19. Unique Tor and non-Tor IP packet sizes . 42

Figure 20. Tor and non-Tor TCP offset density values 5-15 . 43

Figure 21. Tor and non-Tor TCP offset values 9-15 . 44

Figure 22. Filtered inter-attribute correlation of KCCs One through Three . 46

Figure 23. Classification probability of Tor traffic based on IP TTL . 47

Figure 24. Classification probability of Tor traffic based on IP packet size . 48

Figure 25. Classification probability of Tor traffic based on TCP offset characteristics . 49

Figure 26. Required Firefox manual proxy configuration settings to proxy over an active Tor circuit . 57

Figure 27. Successful Firefox manual proxy over Tor . 58

Figure 28. Table configuration required to allow MSSQL GUID assignment after data is parsed by LogParser 2.2 . 62

Figure 29. Proposed IP TTL three-step detection process . 78

Figure 30. Proposed IP packet length five-step detection process . 79

Figure 31. Proposed TCP offset two-step detection process . 79

Figure 32. Proposed known Tor exit node three-step detection process . 80

Figure 33. Proposed high RTT five-step detection process . 81

Figure 34. Proposed HTTP flow analysis three-step detection process . 82

Figure 35. Proposed varied source IP three-step detection process . 83

Figure 36. Proposed multi-criteria accumulator model . 83


LIST OF TABLES

Table 1. Observed HTTP, HTTP over Tor, and HTTPS over Tor flow packet size. Source: [4] . 22

Table 2. Required low TTL count data fields and source databases . 33

Table 3. Required common Tor packet size data fields and source databases . 33

Table 4. Required TCP offset data fields and source databases . 34

Table 5. Required Tor blacklisting data fields and source databases . 35

Table 6. Observed IP TTL mean and standard deviation . 41

Table 7. Observed IP packet size mean and standard deviation . 43

Table 8. Observed TCP offset mean and standard deviation . 45

Table 9. Naive Bayes unfiltered training and unfiltered test results . 50

Table 10. Naive Bayes IP TTL filtered training and test results . 51

Table 11. Naive Bayes IP TTL filtered training and unfiltered test results . 51

Table 12. Naive Bayes test 1A filtered training and test results . 64

Table 13. Naive Bayes test 1B filtered training and unfiltered test results . 64

Table 14. Naive Bayes test 1C unfiltered training and filtered test results . 64

Table 15. Naive Bayes test 2A filtered training and test results . 65

Table 16. Naive Bayes test 2B filtered training and unfiltered test results . 65

Table 17. Naive Bayes test 2C unfiltered training and filtered test results . 66

Table 18. Naive Bayes test 3A filtered training and test results . 66

Table 19. Naive Bayes test 3B filtered training and unfiltered test results . 67

Table 20. Naive Bayes test 3C unfiltered training and filtered test results . 67


THIS PAGE INTENTIONALLY LEFT BLANK


LIST OF ACRONYMS AND ABBREVIATIONS

BBNM    Behavior-Based Network Management
DefecTor    DNS-enhanced fingerprinting and egress correlation on Tor
DH    Diffie-Hellman
DOD    Department of Defense
ERSPAN    encapsulated remote switched port analyzer
FNR    false negative rate
FPR    false positive rate
FQDN    fully qualified domain name
FTP    file transfer protocol
GUI    graphical user interface
GUID    globally unique identifier
HTTP    HyperText Transfer Protocol
HTTPS    HyperText Transfer Protocol Secure
IIS    Internet Information System
IP    Internet protocol
KCC    Key Cyber Concept
NIDS    network intrusion detection system
NPS    Naval Postgraduate School
NRL    Naval Research Lab
OR    onion routing
OTV    obfuscated threshold value
P2P    peer-to-peer
PCAP-NG    packet capture-next generation
PKI    public key infrastructure
RDP    remote desktop protocol
RSA    Rivest, Shamir and Adleman (encryption technique)
RSPAN    remote switched port analyzer
RTT    round trip time
SOCKS    socket secure (protocol)
SPAN    switched port analyzer
TCP    transmission control protocol
TLD    top-level domain
TLS    transport layer security
TOR    The Onion Routing
TPR    true positive rate
VRDM    Vector Relational Database Model
WOV    weighted obfuscation value
WWW    world wide web


ACKNOWLEDGMENTS

I would like to thank my beautiful wife, Ashley, for supporting me throughout my entire career. As a military spouse, entrepreneur, and mother to our beautiful daughter, you truly inspire me every day. Avery, your mother is the strongest and most resilient woman I know, and every day I strive to raise you to be like her.

To my thesis advisors, I appreciate your insight and “bumpers” along the way that kept my research focused, relevant, and most importantly, on-track. The knowledge and wisdom I have gained will follow me through the rest of my career, and I thank you for enriching that experience. I would also like to thank Noel Yucuis from the Dudley Knox Library’s Graduate Writing Center. You helped hone my ability to write clearly and concisely; I truly thank you for all the time you spent molding my thesis and writing into what it is now.


THIS PAGE INTENTIONALLY LEFT BLANK

I. INTRODUCTION

Obfuscation of network traffic is a process designed to circumvent censorship and specific user identification on the Internet [1]. Most obfuscation techniques in use today are based on the research of well-known cryptographer David Chaum, who created the first high-latency anonymous routing model [2]. In Chaum's model, commonly referred to as MixNet, a series of proxies was used to ensure that nodes along the transmission path could not identify the sender or recipient of the information [3], [4]. Since 1996, popular low-latency obfuscation providers, such as Tor, have attempted to hide the source of network traffic by routing it through a network of voluntary anonymous nodes before delivering it to its requested destination [5]. Extensive research has been conducted to identify innovative methods for detecting obfuscated traffic, and researchers have had reasonable success using traffic analysis [1], [2], [4], [6]-[10] and packet entropy [1], as well as heuristic [4], semantic [1], [6], [11], deep packet inspection (DPI) [12], and machine learning-based techniques [1], [4].

Research by Wang et al. [1] combined these approaches and detected obfuscated traffic with a false positive rate (FPR) ranging from 0.002 to 0.00003 and a true positive rate (TPR) of 0.98 to 1.0. However, as new methods are discovered to identify obfuscated traffic, each obfuscation technique is modified in an attempt to circumvent detection. To counter this, a dynamic model incorporating a combination of identification techniques, blacklisting, and evaluation can be specified as a multi-attribute evaluation model. Using a multi-attribute approach, obfuscation indicators from disparate databases can be aggregated and evaluated to determine the likelihood of obfuscation.
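To make the aggregation idea concrete, the following minimal sketch combines several binary obfuscation indicators with a Naive Bayes rule. The attribute names, prior, and per-attribute likelihoods are illustrative placeholders, not values measured in this study.

```python
# Minimal Naive Bayes combination of binary obfuscation indicators.
# All probabilities below are illustrative placeholders, not measured values.

PRIOR_OBFUSCATED = 0.05  # assumed base rate of obfuscated flows

# P(indicator fires | obfuscated), P(indicator fires | not obfuscated)
LIKELIHOODS = {
    "low_ttl":         (0.70, 0.10),
    "tor_packet_size": (0.80, 0.20),
    "high_tcp_offset": (0.60, 0.05),
}

def p_obfuscated(observations):
    """Posterior probability that a flow is obfuscated given binary indicators."""
    p_obf, p_norm = PRIOR_OBFUSCATED, 1.0 - PRIOR_OBFUSCATED
    for name, fired in observations.items():
        p_true, p_false = LIKELIHOODS[name]
        # Multiply in P(observation | class) for each independent indicator.
        p_obf *= p_true if fired else (1.0 - p_true)
        p_norm *= p_false if fired else (1.0 - p_false)
    return p_obf / (p_obf + p_norm)

if __name__ == "__main__":
    flow = {"low_ttl": True, "tor_packet_size": True, "high_tcp_offset": False}
    print(f"P(obfuscated | indicators) = {p_obfuscated(flow):.3f}")
```

Adding or removing indicators only changes the dictionary of likelihoods, which is what makes this kind of model easy to update as obfuscation techniques evolve.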

A. RESEARCH MOTIVATION

Presently, there is no system or process deployed on the Department of Defense’s (DOD) networks capable of detecting obfuscated network traffic, and it is not known what quantity of traffic on DOD networks is obfuscated. This poses a significant security risk from both insider threat and network defense perspectives. However, by using advanced traffic analysis [1] and blacklisting [13], it is possible to identify many popular


types of obfuscation. Additionally, by aggregating each indication, or signature, into a multi-attribute detection model, it may be possible to identify obfuscated traffic on DOD networks both forensically and in real time [6]. Once indicators are identified, system administrators can establish policies for responding to obfuscated traffic; this technique could reduce network activity to only that which is attributable and provide a means to hold persons accountable.

B. PURPOSE AND SCOPE

The purpose of this research is to identify obfuscation indicators, or Key Cyber Concepts (KCC), from the current body of research that can be used on a test network to evaluate whether low-latency Transmission Control Protocol/Internet Protocol (TCP/IP) traffic, specifically HyperText Transfer Protocol Secure (HTTPS) traffic, is employing obfuscation techniques. The KCCs will be statistically evaluated using an R script, and then a multi-attribute analysis model will attempt to detect whether traffic is obfuscated. This thesis attempts to answer the following research questions:

1. Can low-latency obfuscated network traffic be identified in real time?

2. What IP traffic indications can be used to identify obfuscated low-latency network traffic?

3. Can multiple indications be incorporated into a multi-attribute analysis model to accurately identify obfuscated traffic?

4. Can a multi-attribute analysis model be used in a tool to provide a real-time processing capability to analyze obfuscated traffic data for automated response?

C. THESIS ORGANIZATION

The remainder of this thesis is organized as follows:

Chapter II reviews the background and previous work that define network traffic obfuscation, with an emphasis on The Onion Routing (Tor), including its history, purpose, common uses, vulnerabilities, and identifiable characteristics.

Chapter III explains the methodology to be used: data generation, construction of the virtual lab, the KCCs, and the multi-attribute analysis model.


Chapter IV examines the captured data for each KCC, their inter-attribute correlation, and the effectiveness of both single-attribute and multi-attribute decision models.

Chapter V reviews the final results of the research and provides suggestions for its real-world application on DOD networks. Recommendations for future research opportunities will also be presented.


THIS PAGE INTENTIONALLY LEFT BLANK


II. BACKGROUND AND PREVIOUS WORK

A. BACKGROUND

The World Wide Web has become the modern venue for expression and censorship due to its widespread assimilation into everyday life. As a result, multiple technologies have been developed both to facilitate a user's access to the Web and to prevent it. The tools created to facilitate uncensored or anonymous browsing fall into the category of obfuscation technologies. The current body of research into obfuscation technologies has focused both on how to better obfuscate a user's activity and on how to detect or "de-anonymize" it. Research from [4] and [14] contends that the most widely used obfuscation technology today is Tor. As a result, a significant amount of research has been focused on Tor's design, operating characteristics, and vulnerabilities.

Early in the Naval Research Lab's (NRL) development of what is now known as Tor, Goldschlag, Syverson, and Reed contended it was not their intention to create an anonymous routing system. Instead, as stated by Syverson, the intent was to "separate identification from routing" [5], [15]. In doing so, Tor was created to connect to a series of nodes to obfuscate the association between user and activity [15]. A large factor in Tor's ability to obscure activity comes from its release as open-source software. Best put by Syverson, "A basic dilemma arises from the difficulty of trying to be anonymous by yourself" [5]. To that end, the earliest version of Tor was released to the public in 1996 with the intention that its widespread adoption would add an additional layer of security to the anonymous routing network by increasing its anonymity set [16].1

1. Early Onion Routing

The first generation of onion routing (OR) sought to reduce the vulnerability that traffic analysis created for network traffic and the resulting association of users with their activity. The term onion, as defined by Goldschlag et al., refers to the data structure composed of the layers of encryption constructed around the payload between initiator and responder. Specifically,

1 Dingledine, et al. define the anonymity set as the group of users employing a single point for traffic flow for the purpose of obfuscating their browsing activity [16].


the initiator’s proxy node created the path, or virtual circuit, to the responder and encapsulated each encryption layer like the layers of an onion. To complete the creation of the virtual circuit, the initiator’s proxy node sent the onion along the route. By design, each node “knew” only the identity of adjacent nodes, and the responder knew only the identity of the last node in the path. As a result, the initiator and responder were protected from simple traffic-analysis correlation [15].

According to [5] and [15], to create the onion, the initiator's proxy identified the entire virtual circuit and formed the onion in reverse order with the responder's layer at the core. Specifically, the encryption for the responder's proxy was constructed around the payload, and then each preceding node's encryption layer was wrapped back to the first node. The encryption layer included the required encryption keysets and expiration time, enabling bidirectional communication and replay-attack protection.2 As a result, each node knew the preceding node and where it should forward the onion, but nothing else. The only exception was the last node, which knew the responder's identity [15]. Syverson acknowledged the inherent vulnerabilities in sending the symmetric key within the onion in the first two generations of Tor, but the NRL used symmetric keys instead of Diffie-Hellman (DH)-based circuit building to prioritize computational efficiency over forward secrecy [5].3

Figure 1 depicts a forward encryption onion from a notional initiator proxy that flows through nodes X and Y to node Z [15].

2 A replay attack occurs when an attacker intercepts a message stream between sender and receiver and then resends, or replays, a duplicate message to either party [17].

3 Forward secrecy provides protection against unauthorized viewing or collection. To ensure forward secrecy, DH key exchange and ephemeral keys are used to protect the confidentiality of a message even if the DH-based encryption is compromised. In the event of a compromise, the message would remain protected since the ephemeral key used would have expired when the original session was terminated [18].


[Figure: a complete forward onion with nested layers for nodes X, Y, and Z, each layer carrying an expiration time, the next hop, forward and backward function/key pairs, and padding.]

Figure 1. Complete forward onion. Source: [15].

As described by [15], each layer was constructed as follows:

{exp_time, next_hop, Ff, Kf, Fb, Kb, payload}PKx, where:

exp_time was the time the onion expired, where expiration protects against replay attacks;

next_hop was the next hop in the onion route;

(Ff, Kf) were the function and symmetric key applied to data flowing forward;

(Fb, Kb) were the function and symmetric key applied to data flowing backward; and

the padding concatenated to the end of the payload was constructed from random-sized bit strings from each layer and appended before forwarding. This was done to minimize the possibility of a compromised node systematically stripping off the layers of encrypted routing information to find out where it was in the routing chain. Padding was added to all onions to fix the size and hide the length of the chain from initiator to responder.
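A minimal sketch of the layering idea follows. It uses base64-encoded JSON envelopes labeled per node as a stand-in for real public-key encryption (an assumption made only for readability, not the original system's cryptography), purely to show how layers are wrapped in reverse route order and peeled one hop at a time.

```python
import base64, json

# Stand-in "encryption": wrap a payload in a labeled, base64-encoded envelope.
# A real onion would encrypt each layer with that node's public key.
def wrap(node, inner):
    return base64.b64encode(json.dumps({"for": node, "data": inner}).encode()).decode()

def unwrap(node, blob):
    layer = json.loads(base64.b64decode(blob))
    assert layer["for"] == node, "layer is not addressed to this node"
    return layer["data"]

def build_onion(route, payload):
    """Wrap layers in reverse route order; the last hop's layer carries the payload."""
    onion, next_hop = payload, None
    for node in reversed(route):
        onion = wrap(node, {"next_hop": next_hop, "payload": onion})
        next_hop = node
    return onion

if __name__ == "__main__":
    route = ["X", "Y", "Z"]
    onion = build_onion(route, "hello responder")
    for node in route:                      # each hop peels exactly one layer
        layer = unwrap(node, onion)
        print(node, "forwards to", layer["next_hop"])
        onion = layer["payload"]
    print("payload delivered to responder:", onion)
```

Each node can open only its own layer, so it learns the next hop and nothing else, mirroring the property described above.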

2. Generational Evolution

In total, there have been three generations of OR designed by NRL with the third being the iteration in use today. Although the functionality has remained largely unchanged through each generation, there were distinct characteristics and configurations used to disassociate users from activity in each one [5].

(1) First generation OR

Fixed five-node (instead of three-node) routing circuits were used because they allowed communication between two enclave firewalls with a fixed three-hop route between them to disassociate the origin and destination of the traffic. If only three hops had been used in this configuration, the middle node would have known both the origin and destination of the traffic [5]. Figure 2 depicts the general topology of a first-generation onion route.

[Figure: first-generation OR topology showing link-encrypted connections between routing nodes and routing/proxy nodes.] The proxy node W is the source of the communication, nodes X and Y are routing nodes, and nodes U and Z act as routing/proxy nodes.

Figure 2. First generation OR topology depicting a five-hop OR circuit. Source: [15].

Client and router were fully combined and all communication was done in a peer-to-peer (P2P) configuration. Syverson explains that this meant a client would have to use an application to proxy its traffic to an onion router if it did not actively participate in the Tor network as a node. If a client chose this option, its traffic would be sent to the first routing node before the actual circuit was built [5].

Dynamic network updates and topology changes were not supported. It was thought that Tor communication had been between stable networks that did not have inherent trust relationships and that all network configuration changes were handled offline [5].

Because the Socket Secure (SOCKS) protocol was not widely adopted in the mid-1990s,4 multiple application-specific protocol proxies were required to route the various types of information on the Tor network [5]. Further, each application-specific request required its own Transmission Control Protocol (TCP) stream which increased the volume of traffic and posed a possible security threat based on how many TCP streams were present [16].

(2) Second generation OR

The client and router were separated, which allowed a client to learn about available nodes and build its own circuits without having to route traffic for others or blindly trust a remote node to build a P2P circuit, as was required in first-generation OR [5].

Circuit lengths were no longer fixed; variable-length circuits up to 11 hops were supported within a single onion [5]. 5

The SOCKS protocol was adopted for applications able to support it [5].

A redirector feature was included. It forced all TCP traffic over the Tor network. By design, the redirector only worked on Windows NT and was not intended for public release [5].6

OR entry and exit policies were enabled to give organizations the ability to tailor the traffic for users who had access to their onion routers [5]. For example, an organization could set an exit policy to allow forward web traffic only on ports 80 or 443, or choose to whitelist trusted IPs and ports to minimize possible abuse. When implemented properly, entry and exit policies could prevent an onion router from appearing to be the source of malicious activity [16], [20].7
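A minimal sketch of such a whitelist-style exit policy check follows; the rule format, port list, and address ranges are hypothetical illustrations, not Tor's actual configuration syntax.

```python
# Illustrative whitelist-style exit policy: allow only web traffic to approved
# ports, optionally pinned to trusted destination networks. Rules are hypothetical.
from ipaddress import ip_address, ip_network

ALLOWED_PORTS = {80, 443}
TRUSTED_NETS = [ip_network("192.0.2.0/24")]  # example (TEST-NET-1) range

def exit_permits(dst_ip: str, dst_port: int) -> bool:
    """Return True if the exit policy allows forwarding to (dst_ip, dst_port)."""
    if dst_port not in ALLOWED_PORTS:
        return False
    return any(ip_address(dst_ip) in net for net in TRUSTED_NETS)

if __name__ == "__main__":
    print(exit_permits("192.0.2.10", 443))    # True: trusted net, allowed port
    print(exit_permits("192.0.2.10", 25))     # False: SMTP blocked
    print(exit_permits("198.51.100.7", 443))  # False: outside trusted nets
```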

4 The SOCKS protocol provides a standardized method for applications to communicate without the need for multiple application-specific proxies by uniformly encapsulating packets for both TCP and UDP (in version 5 only) applications. Conceptually, the SOCKS protocol operates between the application and transport layers [19].

5 In the second generation of OR, a single onion could be constructed with encryption layers for up to 11 hops, with an even greater length possible if tunneling techniques were used [5].

6 According to Syverson, with the exception of the TCP redirector, "all other onion routing code of all generations" was intended for public release [5].

7 Goldschlag et al. state, "Most onion routers in the current network function as restricted exits that permit connections to the world at large, but prevent access to certain abuse-prone addresses and services such as SMTP" [16].


Two security features, cell mixing and traffic shaping, were implemented to lower the likelihood that someone would be able to associate user and activity through simple traffic analysis. Experimentation into cell mixing focused on blending TCP streams from separate onion routes to mitigate the threat of a man-in-the-middle (MITM) attack. Traffic shaping used an average of previous packet lengths to pad new traffic in an attempt to protect it from identification by simple traffic analysis. Neither cell mixing nor traffic shaping was included in follow-on generations [5].
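The padding idea can be sketched as follows; this is an assumed, simplified reading of the description above (pad each payload up to the running average of previously observed lengths), not the actual second-generation implementation.

```python
# Simplified traffic-shaping sketch: pad each outgoing payload up to the running
# average length of previously observed payloads. Purely illustrative.
def shape(payloads):
    history = []
    for data in payloads:
        target = max(len(data), round(sum(history) / len(history))) if history else len(data)
        history.append(len(data))
        yield data + b"\x00" * (target - len(data))  # zero-byte padding

if __name__ == "__main__":
    sizes = [len(p) for p in shape([b"a" * 512, b"b" * 100, b"c" * 40])]
    print(sizes)  # [512, 512, 306]: short payloads are padded toward the average
```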

3. Third Generation OR

The third generation of OR is commonly known as Tor. Tor provides the same basic protections as its predecessors but affords considerably more, owing to how the virtual circuit is created and subsequently encrypted [5]. Tor's basic design uses three nodes to route obfuscated traffic between initiator and responder (see Figure 3).

Figure 3. Basic Tor configuration. Source: [21].

Departing from the computationally expensive onion built using Public Key Infrastructure (PKI) encryption,8 Tor uses DH key exchange-based circuit building and RSA encryption to construct the onion from hop to hop (see Figure 4) [5].9 Further, since public ephemeral keys are used during the encryption process, each party can derive the session key by combining its own private value with the other party's public ephemeral value, providing replay protection and forward secrecy [5], [18].

8 According to Syverson, PKI was used to encrypt the communication circuit in order to share symmetric keys from initiator to responder with an unpredictable route. The idea of using DH key exchange-based circuit building with RSA encryption was considered but abandoned during the first two generations of OR [5].

[Figure: a Diffie-Hellman based routing onion built from source, through Routers A, B, and C, to destination.] Routers A, B, and C represent the entry, middle, and exit nodes, respectively.

Figure 4. Diffie-Hellman based routing onion. Source: [23].
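A toy Diffie-Hellman exchange (with a deliberately small prime chosen only for readability, not for security) illustrates how each side derives the same session key from its own private value and the other side's public ephemeral value:

```python
import secrets

# Toy DH parameters: a small prime and generator chosen only for readability.
# Real deployments use large, standardized groups.
P, G = 0xFFFFFFFB, 5

def dh_keypair():
    private = secrets.randbelow(P - 2) + 1
    public = pow(G, private, P)            # ephemeral public value
    return private, public

if __name__ == "__main__":
    init_priv, init_pub = dh_keypair()      # initiator
    resp_priv, resp_pub = dh_keypair()      # responder (e.g., an onion router)
    # Each side combines its own private value with the other's public value.
    k_initiator = pow(resp_pub, init_priv, P)
    k_responder = pow(init_pub, resp_priv, P)
    assert k_initiator == k_responder
    print("shared session key material:", hex(k_initiator))
```

Because the private values are ephemeral and discarded after the session, a later compromise of long-term keys does not expose past session keys, which is the forward-secrecy property noted above.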

According to Dingledine et al., the SOCKS protocol is used in Tor to provide compatibility for most TCP-based applications with no additional modifications. Consequently, Tor relies on outside application-level proxies to provide application-level services not compatible with SOCKS [16]. Additionally, Tor relies on a small set of trusted directory servers to distribute network status and information to routing nodes and clients.10 Each server is owned and operated by an independent party and functions as an HTTP server. The HTTP server functionality allows available onion routers to upload their statuses and clients to download network state and listings of available routers.

9 RSA is an asymmetric public key algorithm that facilitates encryption and digital signatures by using public and private keys, which are produced by generating large numbers from their prime factors. It is named after its creators Ron Rivest, Adi Shamir, and Leonard Adleman [22].

10 In the current distribution of Tor, nine directory servers are hard coded into the browser package. The full listing is available from https://atlas.torproject.org/#search/flag:authority, accessed 18 October 2016.


Each trusted directory server controls which nodes are allowed to join the routing network in an effort to reduce the risk of malicious routers entering the network [16]. Syverson acknowledges there are distinct gains and losses with the use of directory servers. Tor clients are allowed to obtain a listing of known onion routers that may be used to construct a circuit. However, Syverson contends that a limited number of trusted directory servers creates a network bottleneck [5].

B. TOR VULNERABILITIES

Due to Tor’s widespread adoption and routing configuration, numerous vulnerabilities, or attack vectors, are inherently present in its design. Dingledine et al. gave significant consideration into how to mitigate the threat caused by existence of malicious actors who can monitor sizeable portions of the Internet. Though Tor was not specifically designed to withstand an Internet-scale attack, it was designed to prevent attacks using traditional traffic analysis [16]. Elahi et al. contend significant vulnerabilities exist because the network is comprised of volunteer nodes;11 this is especially true if a malicious actor is able to control both the entry and exit node in a Tor circuit [14].

1. Malicious Guards

Elahi et al. revealed that each node can be bandwidth-weighted to increase the likelihood of selection. As a result, it is possible for a malicious actor to volunteer multiple high-bandwidth entry and exit nodes in order to conduct end-to-end correlation of user activity. To break the malicious node "kill chain," various fixes can be employed to lower the persistence of malicious entry nodes [14], [24]. The Tor Project (torproject.org) proposes multiple solutions, including increasing the size of the Tor network and using fewer entry guards [24].

Although not fully resolved, the Tor Project admits there are flaws in the way the guards are selected [24]. It is difficult to establish a set of available guards without

11 The Tor network is comprised of volunteer routing nodes; each additional routing node increases the bandwidth and availability of the Tor network. Additional Tor FAQs are available at https://www.torproject.org/docs/faq.html.en, accessed 24 October 2016.


inclusion of malicious nodes. However, by scaling the network faster than the adversary can, the adversary's foothold will decrease, making attribution more difficult. Of note, the Tor Project contends that merely increasing the number of relays may not always decrease the risk of attribution; increasing the number of entry nodes also increases the number of nodes an adversary can monitor [24]. Further, torproject.org contends that the set of entry guards selected by a client can serve as a fingerprint, as another client is not likely to have the same set. This fingerprint could be used by an adversary to identify a specific client's obfuscated traffic. However, reducing the size of the guard set to two increases the chance of collision between fingerprints, making attribution more challenging [24].
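The fingerprinting argument can be made concrete with a back-of-the-envelope calculation. Assuming, purely for illustration, that clients pick their guard set uniformly at random from N candidate guards (which ignores Tor's real bandwidth-weighted selection), the chance that two clients end up with an identical set of k guards is 1 / C(N, k), which grows sharply as k shrinks:

```python
from math import comb

# Illustrative only: assumes guards are chosen uniformly at random, which
# ignores bandwidth-weighted selection in the real Tor network.
def p_same_guard_set(n_guards: int, set_size: int) -> float:
    """Probability two clients independently choose the identical guard set."""
    return 1 / comb(n_guards, set_size)

if __name__ == "__main__":
    for k in (3, 2, 1):
        print(f"{k} guards: {p_same_guard_set(2000, k):.2e}")
```

With 2,000 hypothetical candidate guards, shrinking the set from three guards to two raises the collision probability by roughly three orders of magnitude, which is the intuition behind the trade-off described above.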

2. Directory Server Control

Tor uses a limited set of nine redundant trusted nodes that serve as directory servers.12 Dingledine et al. emphasize that it is the responsibility of the directory servers to reach a network consensus and distribute it to all Tor clients.13 Dingledine et al. define the four-step process used by the directory servers to reach a consensus. First, each directory server broadcasts its signed opinion of the network state to the pool of directory servers. Then, each directory server subsequently broadcasts all signed network states it has received. The second step ensures no directory server has signed more than one network state. In step three, each directory server combines the received network states with its own and broadcasts its signed version of the full network state. In step four, the directory servers rebroadcast the complete network consensus with signatures from all participating directory servers. To facilitate this level of information flow, each directory server also acts as an HTTP server. This allows clients to request a listing of all available

12 The full listing of directory servers is available from https://atlas.torproject.org/#search/flag:authority, accessed 24 October 2016.

13 Biryukov et al. define the network consensus as, "The list of all Tor relays is distributed by the Tor authorities in the [network] consensus document. The consensus is updated once an hour by the directory authorities and remains valid for three hours. Every consensus document has a 'valid-after' (VA) time, a 'fresh-until' (FU) time and a 'valid-until' (VU) time. The 'valid-after' timestamp denotes the time at which the Tor authorities published the consensus document. The consensus is considered fresh for one hour (until 'fresh-until' has passed) and valid for two hours more (until 'valid-until' has passed)" [10].


routers and their states, as well as other onion routers to upload their statuses. However, the limited number of servers also presents multiple vectors for attack [16].

Dingledine et al. argue that if an adversary were to control a directory server, they could use the server to distribute a listing of malicious nodes under adversary control. The impact is much greater if an adversary can subvert control of a majority of directory servers, because the adversary would then have the ability to sign the network consensus with as many malicious nodes as it pleases. Additionally, if an adversary were able to either physically destroy a directory server or use other computer-based attacks, it could render the server useless, but simply taking down one directory server would not have a severe impact on routing [16]. However, if over half of the available directory servers can be taken offline, there will not be enough servers to sign the network consensus for distribution to clients. To mitigate these attacks, Dingledine et al. maintain it is necessary that each directory server be independently operated and hardened against computer-based attacks [16].

3. Leaking DNS Requests

According to research conducted by Greschbach et al., almost 40 percent of DNS requests by Tor users are observed by Google’s DNS resolver; this poses a significant risk of attribution to Tor clients [20]. There are two ways DNS queries make Tor users vulnerable to attribution. First, as described by Dingledine et al., the SOCKS protocol, which is used heavily by Tor, can pose a risk to users. This is because some applications send the fully qualified domain name (FQDN) to the Tor client while others resolve the