In real time from CCSW09: more website fingerprinting

13 November 2009

I am currently listening to the presentation of:

  • Dominik Herrmann, Rolf Wendolsky, Hannes Federrath, “Website Fingerprinting: Attacking Popular Privacy Enhancing Technologies with the Multinomial Naive-Bayes Classifier” ACM CCSW 2009. Chicago, US.

This work looks at the classic problem of identifying browsed pages through SSL encryption or low-latency anonymization networks like Tor. The authors cast the problem of identifying web pages as a supervised learning and classification problem. Instead of applying a jacard coefficient or naive bayes classifier, the authors use a multinomial naive Bayes classifier to learn and identify the pages from their sizes. Furthermore they use filtering on the raw page sizes inspired from information retrieval — such as using the logarithms of the sizes to avoid large bias, or cosine normalisation to make the overall size of the website irrelevant.

That is yet another fine example of casting a traffic analysis problem in terms of machine learning and Bayesian inference.

A very interesting comparison is presented on how different single-hop and multi-hop anonymity systems compare. Single Hop systems are broken with probability over 95%. Multi-hop systems are pretty robust with JonDoNym traffic only being identified 20% of the time, and Tor about 3% of the time. Why is Tor doing so well? It turns out that the cell quantization does add noise, and mess up the attack. It is not clear if this is a fundamental finding, or whether a better / more specific attack could do better.

Datasets should be available soon. Sadly I cannot find the paper on-line. Put your work on the web people!


2 Responses to “In real time from CCSW09: more website fingerprinting”

  1. Clive Robinson said

    @ gdanezis,

    “It turns out that the cell quantization does add noise”

    It also limits side channel bandwidth as well.

    Which is why you have the EmSec maxim of,

    “Clock the inputs and clock the outputs”

    Pipelining and clocking with a common clock with appropriate fault handeling can reduce the effective time based side channel down to a bit per message started.

    Thus if you fold over the restart for any given message with an exponential or geometric time increase per fail starting at a second or two. The effective channel capacity of “time based side channels” can be kept down to say 6 bits / min without causing undue traffic problems to human users.

    Obviously other side channels such as message length still remain usable but there are usually very easy ways (random padding or unknown code expansion/compression etc) to deal with those. However the normal method is to maintain full channel capacity from point to point with link encryption to hide traffic.

  2. Hi George,

    thanks for mentioning this on your blog. I have uplaoded the paper to the website. Feel free to get in touch with me in case you have any comments or questions.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: