The base rate fallacy and the traffic analysis of Tor
30 September 2008
A few days ago an anonymous contributor, The23rd Raccoon, sent a very insightful post to the or-dev list entitled “How I Learned to Stop Ph34ring NSA and Love the Base Rate Fallacy”. The key point is that tracing anonymous communications is an identification exercise: the adversary has to detect the single correct target amongst the noise of incorrect identities. Therefore simply reporting false positive and false negative rates is misleading, since even a moderate false positive rate will lead to the vast majority of positives being misclassifications.
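To make the effect of the base rate concrete, here is a small illustrative sketch (the function name and the particular detector rates are mine, not from the original post). It computes the probability that a stream flagged by the detector is actually the target, given one true target hidden amongst many candidate streams:

```python
def precision(tp_rate, fp_rate, n_candidates):
    """Probability that a flagged stream is the true target,
    assuming exactly one true target among n_candidates streams.

    tp_rate: probability the detector flags the true match.
    fp_rate: probability the detector flags a non-match.
    """
    expected_true_positives = tp_rate * 1
    expected_false_positives = fp_rate * (n_candidates - 1)
    return expected_true_positives / (
        expected_true_positives + expected_false_positives
    )

# A seemingly excellent detector (99% true positive rate, 1% false
# positive rate) is swamped once the candidate population is large:
# with 10,000 streams, fewer than 1% of its positives are correct.
print(precision(0.99, 0.01, 10_000))
```

The numbers are purely illustrative, but they show why intermediate false positive rates alone say little about identification accuracy.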
This is a very cool observation, and it leads, amongst other things, to the conclusion that low-latency anonymity is not dead:
“[…] Second, it gives us a small glimmer of hope that maybe all is not lost against IX, National, or ISP level adversaries. Especially those with only sampled or connection-level resolution, where fine-grained details on inter-packet timings is not available (as will likely be the case with EU data retention).”
This post is of great interest because it re-opens the problem of high-precision traffic analysis, with a clearly understood and precisely known error rate. Current techniques, mostly based on heuristics, are not capable of delivering such detectors with high reliability guarantees attached to them.
At the same time, The23rd Raccoon’s analysis overlooks one issue: many detectors do not simply match streams on a pair-by-pair basis, but consider them as a whole. This means that the “best match” is selected according to some ranking metric. The mathematical analysis that then has to be presented to support the technique is the probability that any non-match is selected over the correct match. Many papers in the literature implicitly provide such results (some based on the rank of the result, others based on selecting the best match and reporting the total probability of error for that selection). Approaches that provide such overall performance metrics avoid the pitfalls of presenting “intermediate” false positive / false negative probabilities, and give an intuitive understanding of how well traffic analysis techniques work on a full body of streams or message traces.
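The best-match selection described above can be sketched with a small Monte Carlo simulation. Everything here is a hypothetical illustration: the function name and the score distributions (the correct match scoring uniformly in [0.5, 1], non-matches in [0, 0.9]) are assumptions of mine, not a model from any specific paper. The point is only that the relevant error metric is the probability that the top-ranked candidate is a non-match, and that this probability grows with the number of candidate streams:

```python
import random

def best_match_error_rate(n_candidates, trials=10_000, seed=0):
    """Monte Carlo estimate of the probability that some non-match
    outscores the correct match, when the detector ranks all
    candidates and picks the single best-scoring one.

    Score distributions are purely illustrative:
      correct match:  Uniform(0.5, 1.0)
      each non-match: Uniform(0.0, 0.9)
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        true_score = rng.uniform(0.5, 1.0)
        # Highest score achieved by any of the n_candidates - 1 non-matches.
        best_noise = max(
            rng.uniform(0.0, 0.9) for _ in range(n_candidates - 1)
        )
        if best_noise > true_score:
            errors += 1
    return errors / trials

# The ranking error grows as more candidate streams compete
# with the correct match for the top position.
for n in (2, 10, 100):
    print(n, best_match_error_rate(n))
```

This is the kind of overall metric the paragraph above argues for: a single probability of mis-identification over the whole body of streams, rather than per-pair false positive rates.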