Traffic profiling: distinguish between streaming and downloading and other services?

Question

I'm a libpcap and Wireshark novice: for my school project I have to distinguish between different types of traffic (SMTP, web traffic, VoIP, online gaming, downloading, streaming, ...). At first I relied on port numbers (25 for SMTP, 80/443 for HTTP/HTTPS, ...), but some problems came up: more and more sites support HTTPS (so payload inspection is no longer possible), and the port number alone can't tell me important differences (port 443 may carry different types of services).

So I thought of classifying traffic according to known behaviours. For example, downloading and streaming have different bandwidth (bitrate) profiles: the first has constant high bandwidth, the second has spikes of high bandwidth that go back to zero once you have the "piece" you need.
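To make the idea concrete, here is a minimal Python sketch of how that difference might be measured, assuming each flow is already available as a list of `(timestamp, size)` pairs. The function names and the sample data are made up for illustration; they do not come from libpcap or Wireshark.

```python
from collections import defaultdict


def per_second_bytes(packets):
    """Sum packet sizes into one-second buckets.

    packets: iterable of (timestamp_seconds, size_bytes) for one flow.
    Returns a list of byte counts, one per second, including empty seconds.
    """
    buckets = defaultdict(int)
    for ts, size in packets:
        buckets[int(ts)] += size
    if not buckets:
        return []
    first, last = min(buckets), max(buckets)
    return [buckets[s] for s in range(first, last + 1)]


def idle_fraction(series):
    """Share of one-second buckets in which nothing was transferred."""
    if not series:
        return 0.0
    return sum(1 for b in series if b == 0) / len(series)


# Made-up flows: a steady transfer and a bursty one (hypothetical data).
download = [(t + 0.5, 1_000_000) for t in range(10)]
streaming = [(t + 0.5, 1_000_000) for t in range(0, 11, 5)]

download_idle = idle_fraction(per_second_bytes(download))
streaming_idle = idle_fraction(per_second_bytes(streaming))
```

With this toy data the steady flow has no idle seconds, while the bursty flow is idle most of the time. Real traffic would need a bucket size and threshold chosen from actual captures.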

Because of my unfamiliarity with the topic, this is the only known behaviour I found on the Web. Can anyone point me in the right direction?

You will first have to decide which network layer's information you would like to use to differentiate between the traffic. — Haris, Oct 22 '15 at 09:45
That seems to be a *huge* effort you're willing to make; behavioral traffic categorization is definitely a not-yet-sufficiently solved problem. — Marcus Müller, Oct 22 '15 at 09:49
@Haris, from different layers I get different information, so shouldn't I use the information from all the layers? — elmazzun, Oct 22 '15 at 09:51
@elmazzun: yes, but that makes your system even more complex! — Marcus Müller, Oct 22 '15 at 09:51
My professor suggested that I run the hotspot Raspberry Pi (which is running a `libpcap` program) for a while, accepting and logging all the connections, and then classify the traffic according to the information I get from the logs; but even with everything from the Ethernet, IP and TCP/UDP headers, I don't know how to classify the traffic. — elmazzun, Oct 22 '15 at 09:55
Keeping it to only one layer would make it simpler, and simpler to present as well. For instance, IP traffic can be either UDP or TCP; TCP can then be differentiated further based on the application layer. — Haris, Oct 22 '15 at 09:57

score 0 · Accepted Answer · answered Oct 22 '15 at 09:58

1. Use Wireshark to partition your traffic into sessions.
2. For those where categorization is clear based on protocol/port, categorize (e.g. port 25 = SMTP should be a given).
3. For those that need further analysis, find appropriate features, such as:
   - average packet size,
   - packet size std deviation/variance,
   - packets per second in upstream/downstream direction,
   - overall amount of data,
   - up/downstream data amount ratio,
   - up/downstream packet number ratio,
   - so much more you could think of.
4. With the numerical values for the features from 3., build vectors, and apply all your classification knowledge: Maybe this is a case for support vector machines? Maybe you just look at the clusters you might see and come to conclusions? Maybe you just generate "known" traffic of all relevant kinds and map that into that vector space, categorizing each unknown session as the Euclidean distance-"closest" known traffic type? Maybe you pre-condition your vectors by what you learn from a principal component analysis?
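As a rough illustration of the feature-vector idea and the "closest known traffic type" option, here is a minimal Python sketch. It assumes each session is a list of `(timestamp, size, direction)` tuples; that layout, the helper names, and the synthetic sessions are all hypothetical, and only a subset of the features listed above is computed. The features are standardized before measuring distance, which is an added assumption: without it the total byte count would dominate the Euclidean distance.

```python
import math
from statistics import mean, pstdev


def session_features(packets):
    """packets: list of (timestamp_s, size_bytes, direction), direction 'up'/'down'.

    Returns [avg size, size std dev, packets/s, total bytes,
             up/down byte ratio, up/down packet ratio].
    """
    sizes = [s for _, s, _ in packets]
    times = [t for t, _, _ in packets]
    duration = max(times) - min(times) or 1.0  # avoid division by zero
    up = [s for _, s, d in packets if d == "up"]
    down = [s for _, s, d in packets if d == "down"]
    return [
        mean(sizes),
        pstdev(sizes),
        len(packets) / duration,
        sum(sizes),
        sum(up) / max(sum(down), 1),
        len(up) / max(len(down), 1),
    ]


def fit_scaler(vectors):
    """Per-feature (mean, std) so that every feature gets a comparable scale."""
    return [(mean(col), pstdev(col) or 1.0) for col in zip(*vectors)]


def scale(vec, scaler):
    return [(v - m) / s for v, (m, s) in zip(vec, scaler)]


def nearest_label(vec, known, scaler):
    """known: list of (label, feature_vector); returns the Euclidean-closest label."""
    target = scale(vec, scaler)
    return min(known, key=lambda kv: math.dist(target, scale(kv[1], scaler)))[0]


def make_session(n, down_size, up_size, up_every, interval):
    """Synthetic session generator, for illustration only."""
    pkts = []
    for i in range(n):
        direction = "up" if i % up_every == 0 else "down"
        size = up_size if direction == "up" else down_size
        pkts.append((i * interval, size, direction))
    return pkts


# "Known" traffic generated on purpose, then one unknown session to classify.
known = [
    ("download", session_features(make_session(200, 1500, 60, 10, 0.001))),
    ("voip", session_features(make_session(200, 160, 160, 2, 0.02))),
]
scaler = fit_scaler([vec for _, vec in known])
unknown = session_features(make_session(150, 1400, 60, 8, 0.002))
label = nearest_label(unknown, known, scaler)
```

Here the unknown session, made of large downstream packets and sparse small upstream ones, ends up closest to the "download" sample. With real captures you would want many labelled sessions per type, and the choice of features and distance measure is exactly where the classification knowledge comes in.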

As you see in 4., there are a lot of tools for classification, and you will need some proficiency in classification theory to deal with your problem.

Could you point me in the right direction? According to your experience, which classification tool would suit my project best? — elmazzun, Nov 10 '15 at 12:01
@elmazzun: You're missing my point. **Nobody** can tell you what the right tools are for your job until you have the knowledge to *understand* and describe your job. The No. 1 tool of every classification author is his own knowledge on classification. So the tool to give you that is probably a longish textbook, I'm afraid :) — Marcus Müller, Nov 10 '15 at 16:44
