In real time from CCSW09: more website fingerprinting
13 November 2009
I am currently listening to the presentation of:
- Dominik Herrmann, Rolf Wendolsky, Hannes Federrath, “Website Fingerprinting: Attacking Popular Privacy Enhancing Technologies with the Multinomial Naive-Bayes Classifier” ACM CCSW 2009. Chicago, US.
This work looks at the classic problem of identifying browsed pages through SSL encryption or low-latency anonymization networks like Tor. The authors cast the problem of identifying web pages as a supervised learning and classification problem. Instead of applying a jacard coefficient or naive bayes classifier, the authors use a multinomial naive Bayes classifier to learn and identify the pages from their sizes. Furthermore they use filtering on the raw page sizes inspired from information retrieval — such as using the logarithms of the sizes to avoid large bias, or cosine normalisation to make the overall size of the website irrelevant.
That is yet another fine example of casting a traffic analysis problem in terms of machine learning and Bayesian inference.
A very interesting comparison is presented on how different single-hop and multi-hop anonymity systems compare. Single Hop systems are broken with probability over 95%. Multi-hop systems are pretty robust with JonDoNym traffic only being identified 20% of the time, and Tor about 3% of the time. Why is Tor doing so well? It turns out that the cell quantization does add noise, and mess up the attack. It is not clear if this is a fundamental finding, or whether a better / more specific attack could do better.
Datasets should be available soon. Sadly I cannot find the paper on-line. Put your work on the web people!