I am at PETS 2016 writing this summary in real-time as Tao Wang presents his work with Ian Goldberg on “On Realistically Attacking Tor with Website Fingerprinting“. This is the latest paper addressing challenges identified by the Tor project to make website fingerprinting (WF) research more informative about real-world threats. We have a stake in this since Jamie Hayes from UCL will be presenting a joint paper on k-fingerprinting: a Robust Scalable Website Fingerprinting Technique.
The key problem is that a number of reported website fingerprinting results are evaluated in laboratory conditions. Website fingerprinting considers an adversary looking at an encrypted and anonymized stream, and the task is to recognize which website is being accessed through the protections. However, the evaluation of the attacks consider a restricted and stylized setting, where a single client accesses a single (usually landing) page from a server. This is restrictive: it does not consider accessing multiple pages, it assumed the start and end of pages is known, and that the training data is up to date.
So this paper consider how to deal with such setting. First, how to deal with noisy data? For example, what is another tab creates some continuous traffic, aka noise that may confuse the attacker? In this case the attacker will see a combined stream of the signal from a page load superimposed on the noise. Tao illustrates that as the volume of noise increases the accuracy of attacks radically drops. Trivial solutions such as trying to subtract noise are not likely to succeed.So the actual solution is to train on the noisy version of the stream, and that indeed increases the accuracy of the attack back to dangerous levels.
The second problem is splitting: how can an attacker detect the start and end of loading a specific bundle of resources corresponding to a web page. There can either be a lot of quiet time between loads, no time, or even a negative time when the loads overlap. It turns out when there is a large gap it can be overcome with little effect on the True positive rate — as the split is obvious. However, the problem has a generic solution for other cases: you simply use a classifier to learn where to split, and then apply the classifier. I do wonder if using both incoming and outgoing channels could improve this.
Finally, they address the issue of maintaining a training set, since it is expensive to collect such data continuously. So either you have a small fresh training set, or a larger but out of date training set. There is a trade-off, but Tao demonstrates that even old-ish (10 days) still provides good accuracy. In conclusion, it seems that those roadblocks can be tackled to make WF practical, and there is an increased call to implement defenses.
My meta-conclusion is twofold: first, it is clear that the roadblocks presented first by the tor project blog post are indeed issues, but are not a game stopper for website fingerprinting. It took the community about 2 years to mature into using more modern machine learning (ML) techniques, and this work illustrates this. Yet, my second conclusion is that the maturity of the ML usage for fingerprinting is still low: it is pretty self-evident that to be robust to noise one should train on noisy data — I am glad this is now explicit. It is also self-evident that we can take small data sets of pages and network conditions and synthesize a very large set of training examples to make learning even more robust to noise and better at detecting splits. No one has done this yet — maybe that would be a nice paper for PETs next year.
Boing Boing just released a classified GCHQ document that was meant to act as the Sept 2011 guide to open research problems in Data Mining. The intended audience, Heilbronn Institute for Mathematical Research (HIMR), is part of the University of Bristol and composed of mathematicians working for half their time on classified problems with GCHQ.
First off, a quick perusal of the actual publication record of the HIMR makes a sad reading for GCHQ: it seems that very little research on data mining was actually performed post-2011-2014 despite this pitch. I guess this is what you get trying to make pure mathematicians solve core computer science problems.
However, the document presents one of the clearest explanations of GCHQ’s operations and their scale at the time; as well as a very interesting list of open problems, along with salient examples.
Overall, reading this document very much resembles reading the needs of any other organization with big-data, struggling to process it to get any value. The constrains under which they operate (see below), and in particular the limitations to O(n log n) storage per vertex and O(1) per edge event, is a serious threat — but of course this is only for un-selected traffic. So the 5000 or so Tor nodes probably would have a little more space and processing allocated to them, and so would known botnets — I presume.
Secondly, there is clear evidence that timing information is both recognized as being key to correlating events and streams; and it is being recorded and stored at an increasing granularity. There is no smoking gun as of 2011 to say they casually de-anonymize Tor circuits, but the writing is on the wall for the onion routing system. GCHQ at 2011 had all ingredients needed to trace Tor circuits. It would take extra-ordinary incompetence to not have refined their traffic analysis techniques in the past 5 years. The Tor project should do well to not underestimate GCHQ’s capabilities to this point.
Thirdly, one should wonder why we have been waiting for 3 years until such clear documents are finally being published from the Snowden revelations. If those had been the first published, instead of the obscure, misleading and very non-informative slides, it would have saved a lot of time — and may even have engaged the public a bit more than bad powerpoint.
4 November 2015
At last the UK government today published the draft Investigatory Powers Bill, after about a week of carefully crafted briefings aimed at managing opinion, and even dissent. The document comes bundled with a lot of supplementary material, purporting to be from “A Guide” to “Explanatory Notes”. As Richard Clayton advised me a while back: don’t read them! Those are simply smoke-and-mirrors, designed to mislead, provide material for lazy journalists and confuse the reader — the only thing that has legal validity is the law itself on pages 35-227.
The good news is that I read through those 181 pages, and extracted the “juicy bits” from a technology public policy point of view. I am no lawyer, but am not as much interested in the fine print of the law. I am interested in the capabilities that the government wants to grant itself when it comes to, basically, attacking computers and telecommunication systems — with a view to understanding the business of policing and intelligence. So here are my notes…
The paper “I Know Why You Went to the Clinic: Risks and Realization of HTTPS Traffic Analysis” is revisiting the vulnerability of HTTPS encryption to traffic analysis.
The authors note that most secure web connections access directly a specific website that is requested by a user (as opposed to the setting where a user is observed using Tor, where the site is not known). There are challenges in doing this: different websites are usually quite distinct, while many sub-pages in the same website may share a common layout that makes them hard to distinguish through encryption. Additional complications are added by cache, cookies, and the use of Flash. However, this setting may be easier than the Tor case, in that the adversary could realistically enumerate all pages on a website.
The paper chose 10 sites, including health, finance, video rental, and civil liberties. Traces are collected from encrypted requests and replies, and some simple pre-processing is done. In particular, the traffic volumes are collated in bursts and only their sizes are stored. This is done to add robustness to jitter in the traffic. Traces are then clustered using k-means and Gaussians are used to model each of the clusters. The use of the Gaussian distribution provides some further resilience to noise in the size of resources.
A machine learning approach is then followed to select which features distinguish pages: a logistic regression model is used to learn weights that best separate pages. A Hidden Markov Model is then used to track full traces of used navigation through an encrypted web-site.
To evaluate the approach traces were collected using virtual machines. This allows for the full state of a VM to be frozen, to reset the machine state perfectly between trace collection. Traces were collected in sessions where user state was meant to be preserved, and otherwise a fresh VM could be spun to clear the state. This way different conditions of cache on/off, cookies and browsing different sites were used for the collection. Interestingly the effect of cache being on systematically reduced the accuracy of the attack by 10%-20%.
Finally, the paper presents a defense against traffic analysis. It is based on padding the bursts of traffic to some threshold in order to confuse the feature extraction stage. Since many thresholds are possible, and the maximum is chosen, subject to a size “inflation” budget. Other defenses were also tried and compared, and there is a substantial reduction in the accuracy of this attack.
The paper “A Predictive Differentially-Private Mechanism for Mobility Traces” looks at sanitizing mobility traces within the paradigm of differential privacy.
The idea is that a user wishes to use a location-based service. However, a user also wishes to maintain some privacy, so they will somewhat distort the reported location to hide their exact location. Of course, there is a clear trade-off between the utility, namely the accuracy of the reported location, and the privacy that can be maintained. In the case of this work the utility is related with the accuracy of the reported location compared with the real location.
At the heart of the system lies a definition of privacy based on “geo-indistinguishability”. The insight is that the locations “close-by” need to be indistinguishable, while locations very far apart may be distinguished. This offers a higher degree of information leakage than traditional differential privacy, but stands a chance to offer some utility for a single user trace.
A previously proposed mechanism offers such privacy, through first perturbing the center point of the user location (using a 2D generalization of the Laplacian distribution) and the requesting information about a larger mechanism. Its privacy degrades in a predictable manner when multiple observations are seen by the adversary. However, the authors note that real-world traces are very correlated with each other. For example a user stays in a cafe for a while, or they follow a certain path on a motorway. This insight may be used to reduce the noise introduced by the mechanism.
First a straw man mechanism is proposed: a prediction function is defined that tries to predict the next position to report from the previous published position. If the prediction yields a “good enough” location, subject to some threshold, the old prediction is used. If not a new prediction is used, at the cost to the privacy budget used. However, this simple scheme breaks the definition of geo-indistinguishability.
The strawman mechanism can be strengthened by doing a private test for the accuracy of the prediction, which in itself consumes some amount of the privacy budget. This results in information not leaking from the private prediction test on the data, and yields a geo-indistringuishable mechanism.
To make the mechanism more easy to integrate into location-based services a privacy budget manager is also devised. The manager is provided with a certain target privacy level and utility, and will configure the parameters of the mechanism to offer good utility subject to the constraints.
The evaluation was gone on the GeoLife traces from MSR, that were processed to simulate queries to an LBS, through sub-sampling.
Interestingly, a plug-in for Firefox and Chrome is available that implements the approach.
The paper “Quantifying the Effect of Co-location Information on Location Privacy” presented by Alexandra-Mihaela Olteanu discusses the important problem of co-location on location privacy in mobile networks.
The work notes that co-location information is widespread: for example users tag a number of people on pictures in social networks; face recognition is getting better; and devices record the devices around them. Interestingly, the information an adversary may infer from this co-location information is not hidden by traditional location privacy mechanisms. In fact this is an instance of privacy technologies not doing a very good job at hiding information leaking from interactions between users. When correlations between user activity is observable by an adversary, the adversary may combine the information from both to increase the accuracy of their inferences. In fact the actions of one user may have serious consequences on the privacy of another (a co-target).
This work attempts to quantify the degree of privacy lost from such combined inferences. It uses the location privacy model by Shokri et al (S&P). Users locations and traces are modeled using a Markov model and the adversary observes co-locations. Tracking a single user is the traditional inferences of the hidden Markov model. With co-locations the attack is more complex since the constraints implied by co-locations need to be honored. This increases the information of the adversary, but also increases the computational complexity of the attack. As a result the authors also propose a heuristic attack, that ignores some of the information / constraints but tries to keep a certain set of observations consistent.
The evaluation was performed on the GeoLife (MSR Asia) dataset. Interestingly co-location information can be inferred from such raw traces, to simulate a number of scenarios. For example one may vary the fraction of co-location events observed by the adversary or the location privacy mechanism used. The result show, unsurprisingly, that the adversary observing co-locations has a significant impact on privacy. They also show that observing a target with low privacy settings may lead to the compromise of a co-target with high privacy settings.
This is a very nice work. As someone in the audience pointed out, some of the computational complexities may be tackled through using sampling based estimations of the posterior distribution that is otherwise intractable. This is probably a worthy space for follow-up research. Relaxing the Markov mobility assumption (which makes the posterior computations even more complex) is another avenue for future work.
(The first session of PETS2014 deal with privacy for mobile devices. )
The paper “Exploiting Delay Patterns for User IPs Identification in Cellular Networks” observes that a significant number of mobile operators (over 50% of the market in North America and over 30% in Europe) assign public and routable IP addresses to mobile devices. That opens the devices to denial of service attacks, resource depletion, etc. But how can an adversary find the IP address of a specific user? If the adversary can observe a large fraction of the network this may be easy. The paper shows that even a small entity — like a single user — may track a specific user’s IP.
The tracking attack uses specificities of how mobile devices use the cell network. It turns out that the network stack may be in different states: “Idle”, “cell dispatch” or “cell fetch”, to conserve battery. Most of the time a device is in “Idle”. The authors notice that the push notification of IM services allows for indirectly injecting delay patterns on a target device. To receive the notification the device goes to a high everge reception / transmission state and for a time window (of 10-15 sec) the latency to reach the device drops. An adversary may use this insight to identify the IP of a device associated with an IM identifier.
So the attack methodology goes as follows: the attacker send an IM message, lowering the latency of the target device. Then they scan a range of devices by IP to detect if the latency is compatible with the hypothesis it has received a push notification. If not, the IP may be excluded. The attack is repeated a number of times to gain increasing certainty about the exact address of the target. Graphs suggest that about 10-20 iterations the space of possible device IPs becomes significantly smaller. However, for best result the repeated probes should be spread quite a few minutes apart. However, there is still some confusion between the target device and devices that always transmit. Eliminating those further improves the accuracy of the attack.
Simple countermeasures involve either firewalling devices to make it difficult to reach them, or use IPv6 making the space of possible addresses to scan unfeasibly large.
4 October 2010
I am just sitting in the first WPES10 talk:
Balancing the Shadows by Max Schuchard, Alex Dean, Victor Heorhiadi, Yongdae Kim, and Nicholas Hopper (University of Minnesota)
ShadowWalker is a peer-to-peer anonymity system designed by Prateek Mittal (who was our intern in 2008) and Nikita Borisov to prevent corrupt peers jeopardising the network. The authors of this new paper “Balancing the shadows” present an attack on the system, where a malicious coalition of nodes can compromise routing security and can bias the probability of choosing a malicious node as a relay. It turns out that a naïve fix opens the system instead to selective denial-of-service attack.
How does the eclipse attack on ShadowWalker work? The adversary controls a full neighbourhood of the network, i.e. a sequence of peers in the distributed hash table (DHT). This allows an adversary to corrupt the “shadow” mechanism in shadow walker. When Alice asks a malicious node in this neighbourhood about another node in the network, they can provide a false ID, along with a set of false shadows. This attack is not too bad on its own, except that the same mechanism is used during the construction of the routing tables of the DHT. As a result an adversary that controls about 10% of the nodes can corrupt about 90% of the circuits, after a few rounds of the protocol (this was backed by simulations).
How to fix the attack? Can we increase the number of shadows of each node that can testify of the correctness of its ID? It turns out this is not a good idea: the more shadows the higher the probability one of them is malicious. In that case they can maliciously refuse to attest honest nodes, effectively taking them out of the protocol. The authors propose to change the protocol to only require a fraction of shadows providing signatures to attest an ID-node relationship — time will show if this withstands attacks.
What do we learn from this: first the level of security in peer-to-peer anonymity systems is still questionable, as designs keep being proposed and broken on a yearly basis. Second, it highlights that DHT based designs inherit the characteristic that routing tables are designed as part of the protocol. This offers the adversary an opportunity of amplify their attacks. Designs should therefore not consider that the DHT is in an honest steady-state, but instead consider attacks at the time of network formation. Finally, it is worth keeping in mind that these systems try to prevent adversaries using a small fraction of malicious nodes (5%-20%) to compromise the security of a large fraction of the network. This is still far from our hope that peer-to-peer anonymity could withstand large Sybil attacks where the adversary controls a multiple of honest nodes.
12 May 2010
An article in the Greek newspaper “Eleftherotypia”on 11 May 2011, covers a worrying trend in surveillance practices of the Greek anti-terrorist squad since 2007. In multiple occasions the police has “lifted” the confidentiality of communications legal provisions, and has requested information about communications taking place within a whole region, for a window of time up to 12 hours. For example:
- On 31-12-2008 they requested data for 3 hours (4am-7am) for a region of Athens, covering the polytechnic school (that is also covered by special privileges / asylum when it comes to freedom of speech) following an attack on a riot police van.
- On 9-3-2009 a large region in Koukaki was targeted for 12 hours. (Map of two first regions: http://s.enet.gr/resources/2010-05/21-thumb-large.jpg)
- Traffic data were collected for whole regions on 12-5-2009, 9-3-2009 (for 90 minutes), 25-11-2009, 18-2-2009 and 7-5-2007.
The article mentions telecommunications (voice and sms) in the first two cases (that might include content), while only mentioning traffic data for the last cases. Furthermore it points out that the selection of time, regions and targets and processing of the information collected happens in an unaccountable manner by officers. The blanket lifting of confidentiality is done under provisions for “state security”, but the article further points out, has now become routine. These practices are also linked with the Data Retention Directive (2006), that has not yet been translated into Greek law, making the legal context for surveillance requests and providers uncertain.
(Original in Greek: “Είμαστε όλοι ύποπτοι…” by ΧΡΗΣΤΟΣ ΖΕΡΒΑΣ)
This comes as a surprise to me, since I always through that the criteria applied for conducting surveillance have to be tied to a network endpoint, or at least a person’s identity.
Micah Sherr presented at PETS a few days ago his work on “Scalable Link-Based Relay Selection for Anonymous Routing“. The key idea is that paths are generated by taking into account the network performance of each link to be used. The overhead of distributing performance information can be reduced by associating with each server a network coordinate, that allows to estimate the latency between pairs of nodes.
This is a pure path selection proposal, as quite a few have appeared in the past year to reduce latency, or increase node utilization in Tor. The question with all those proposals is: how much anonymity would these path selection strategies provide?
The methodology we present in “The Bayesian Traffic Analysis of Mix Networks” provides a way of answering such questions, by carefully modelling the path selection strategy. Applying the same methodology to these path selection proposals would be of clear benefit, and an excellent project for anyone interested in understanding better how to apply inference based techniques to traffic analysis.