Description
Phase overview
-
Implement a classifier for encrypted traffic
-
-
Capture traffic samples generated by various actions performed in the Android VM
-
-
-
Split dataset in training/evaluation
-
-
-
Use training traces to train model (Random forest, SVM, or another method of your choice) to distinguish between actions
-
-
-
Verify accuracy using evaluation traces
-
-
Use tcpdump (or any tool of your choice) to capture traffic generated by performing a set of actions on the Android VM
-
Convert network flows in traces into feature vectors suitable for a classifier
-
Start Android browser (homepage must be set to en.wikipedia.org)
-
Start Youtube app
-
Start Weather Channel app
-
Start Google News app
-
Start Fruit Ninja app
-
(all apps above can be installed for free using the Google Play app, already pre-installed)
-
Start tcpdump, execute action, terminate tcpdump
-
-
Either by hand, or by using Android test automation commands (start, monkey) via adb shell
-
-
-
You should aim at having ~50 traces per action (although less may work too)
-
-
-
Label each trace w/ the action it captured
-
-
Data cleanup: the Android VM is relatively quiet in terms of network chatter, but you may end up capturing flows unrelated to the action you are performing
-
Suggestion for data cleanup:
-
-
Discard obvious noise (e.g., ARP)
-
Look at DNS requests to figure out the IPs of flows generate by the app
-
-
You may also decide not to cleanup your data and hope the classifier can figure it out
-
Looking at captures using Wireshark may help
-
Once you cleaned up the traces, divide them in bursts as explained last week
-
All traffic in each burst must be partitioned into flows
-
A flow is a set of packets sent between the same pair of addresses/ports and carrying the same protocol (TCP/UDP) (note, traffic flows in both directions)
-
Once you have a set of flows, you must convert them in feature vectors
-
Vectorization: the process of representing an object with a vector of scalar features, suitable for classification algorithms
-
Features you can’t use:
-
-
IP addresses
-
-
-
MAC addresses
-
-
-
Packet payloads
-
-
Everything else is fair game
-
Examples of vectorization:
-
-
Convert each flow in a vector including the lengths of the first 10 packets
-
-
-
Convert each flow in a vector containing statistical features of the sequence of packet lengths
-
-
-
…
-
-
Train a classifier to distinguish between vectors generated by different apps
-
Zhuoqun will give a brief demo later for those of you not familiar with scikit and machine learning in general
-
-
A python script named classifyFlows that, given a pcap trace, must print out a list of bursts, flows in every bursts, and label of the action that generated a certain flow (if any)
-
-
-
Output format:
-
-
classifyFlows mytrace PCAP
Burst 1:
<timestamp> <src addr> <dst addr> <src port> <dst port> <proto>\ <#packets sent> <#packets rcvd> <#bytes send> <#bytes rcvd> <label>
<label> must be either the name of an app, or unknown if the classifier is unable to determine which app was detected
-
Internally, your code must:
-
-
Partition the traffic in bursts
-
-
-
Partition each burst into flows
-
-
-
Generate feature vectors from flows
-
-
-
Attempt to classify each vector using the model you trained
-
-
We will evaluate the accuracy of your code in classifying traces generated using the Android VM
-
By the phase 3 deadline (4/15, 10:45am) you must upload to Canvas a .zip file containing:
-
-
Your Python script
-
-
-
A README file specifying any Python package on which your code depends, and any information we need to be aware of when testing your code
-
-
-
If your code has known limitations or issues, also briefly document them in the file.
-