27 July 2011
I am currently sitting at the PETS 2011 symposium in Waterloo, CA. I will be blogging about selected papers (depending on the sessions I attend) over the next couple of days — authors and other participants are welcome to comment!
The first session is about data mining and privacy.
“How Unique and Traceable are Usernames?”
Daniele Perito, Claude Castelluccia, Mohamed Ali Kaafar, and Pere Manils (INRIA)
The first paper looks at the identifiably of on-line usernames. The authors looked at user names from different sites and assess the extent to which they can be linked together, as well as link them to a real person. Interestingly they used Google Profiles as ground truth, since they allow users to provide links to other accounts. First they assess the uniqueness of pseudonyms based on a probabilistic model: a k-th order markov chain is used to compute the probability of each pseudonym, and its information content (i.e. -log_2 P(username)). The authors show that most of the usernames observed have “high entropy” and should therefore be linkable.
The second phase of the analysis looks at usernames from different services, and attempts to link them even given small modifications to the name. The key dataset used was 300K google profiles, that list (sometimes — they used 10K tuples of usernames) other accounts as well. They then show that the Levenshtein distance (i.e. edit distance) of usernames from the same person is small compared to the distance of two random user names. A classifier is built, based on a threshold of the probabilistic Levenshtein distance, to assess whether a pair of usernames belongs to the same or a different user. The results seem good: about 50% of usernames are linkable with no mistakes.
So what are the interesting avenues for future work here? First, the analysis used is a simple threshold on the edit distance metric. It would be surprising if more advanced classifiers could not be applied to the task. I would definitely try to use random forests for the same task. Second, the technique can be used for good not evil: as users try to migrate between services, they need an effective way to find their contacts — maybe the proposed techniques can help with that.
“Text Classification for Data Loss Prevention” (any public PDF?)
Michael Hart (Symantec Research Labs), Pratyusa Manadhata (HP Labs), and Rob Johnson (Stony Brook University)
The paper looks at the automatic classification of documents as sensitive or not. This is to assist “data loss prevention” systems, that raise an alarm when personal data is about to be leaked (i.e. when it is about to be emailed or stored on-line — mostly by mistake). Traditionally DLP try to describe what is confidential through a set of simple rules, that are not expressive enough to describe and find what is confidential — thus the authors present a machine learning approach to automatically classify documents using training data as sensitive or not. As with all ML techniques there is a fear of mistakes: the technique described tries to minimise errors when it comes to classifying company media (ie. public documents) and internet documents, to prevent the system getting on the way of day to day business activities.
The results were rather interesting: the first SVN classifier looked at unigrams with binary weights to classify documents. Yet, it first had a very high rate of false positives for public documents. It seems the classifiers also had a tendency to classify documents as “secret”. A first solution was to supplement the training set with public documents (from wikipedia), to best described “what is public”. Second, the classifier was tweaked to (in a rather mysterious way to me) by “pushing the decision boundary closer to the true negative”. A further step does 3-category classification as secret, public and non-enterprise, rather than just secret and not-secret.
Overall: They manage to get the false positive / false negative rate down to less than 2%-3% on the largest datasets evaluated. That is nice. The downside of the approach, is common to most work that I have seen using SVNs. It requires a lot of manual tweaking, and further it does not really make much sense — it is possible to evaluate how well the technique performs on a test corpus, but difficult to tell why it works, or what makes it good or better than other approaches. As a resut, even early positive resutls should be considered as preliminary until more extensive evaluation is done (more like medicine rather than engineering). I would personally like to see a probabilistic model based classifier on similar features that integrates the two-step classification process into one model, to really understand what is going on — but then I tend to have a Baysian bias.
P3CA: Private Anomaly Detection Across ISP Networks
Shishir Nagaraja (IIIT Delhi) and Virajith Jalaparti, Matthew Caesar, and Nikita Borisov (University of Illinois at Urbana-Champaign)
The final paper in the session looks at privacy preserving intrusion detection to enable cooperation between internet service providers. ISPs would like to pool data from their networks to detect attacks: either because the volume of communications is abnormal at certain times, or because some frequency component is odd. Cooperation between multiple ISPs is important, but this cooperation should not leak volumes of traffic at each IPS since they are a commercial secret.
Technically, privacy of computations is achieved by using two semi-trusted entities, a coordinator and key holder. All ISPs encrypt their traffic under an additive homomorphic scheme (Paillier) under the keyholder key, and send it to the coordinator. The coordinator is using the key-holder as an oracle to perform a PCA computation to output the top-n eighen vectors and values of traffic. The cryptographic techniques are quite but standard, and involve doing additions, subtraction, multiplication, comparison and normalization of matrices privately though a joint private two-party computation.
Surprisingly, the performance of the scheme is quite good! Using a small cluster, can process a few tens of time slots from hundresds of different ISPs in tens of minutes. A further incremental algorithm allows on-line computations of eighen vector/value pairs in seconds — making real-time use of the privacy preserving algorithm possible (5 minutes of updates takes about 10 seconds to process).
This is a surprising result: my intuition would be that the matrix multiplication would make the approach impractically slow. I would be quite interested to compare the implementation and algorithm used here with a general MPC compiler (under the same honest-but-curious model).
6 July 2011
Shishir Nagaraja has pointed out that our Drac anonymity system is not the first one to consider an anonymity network overlayed on a social network. The performance versus security of routing messages over a social network was already considered in his work entitled ‘anonymity in the wild’.
This is important prior work and we should have cited it properly. It presents an analysis of an anonymity provided by different synthetic social network topologies, as well as real-world data from LiveJournal.
30 March 2011
My team at Microsoft research has spent the past 6 months grappling with the problem of privacy in next generation energy systems. In parallel with the good honest scientific work I also participated in the UK government consultation on smart metering, in writing and in person, specifically on the issue of privacy. Its conclusions have finally been made public (see DECC’s site and Ofgem’s detailed responses).
First, what is the problem? Smart-meters are to be fitting in most homes, and they provide facilities for recording fine-grained readings of energy consumption. These are to be used for time of use billing, energy advice, the backend settlement process, financial projections of suppliers, fraud detection, customer service, and network management. The problem is that these readings are also personal data, and leak information about the occupancy of households, devices used, habits, etc. So here we have a classic privacy dilemma: where to strike the balance between the social value of sharing data (even mandating such sharing) versus the intrusion to home life?
Or do we? As it is often the case when privacy is framed as a balance, what is ignored is that we can use technology to achieve both privacy and extract value from the data. In fact we show no balancing act is necessary. We designed a host of privacy technologies to fulfill the needs of the energy industry (even the rather exotic ones) while preserving extremely high levels of privacy and user control. Lets look at them in detail:
- We developed a set of protocols to perform computation on private data while maintaining a high degree of integrity and availability. This allows customers to calculate their bills, provide indicators of consumed energy value to be used in settlement, routing demand response requests, and do profiling to support network operation or even marketing. Our framework guarantees that the computations only leak their results to third parties, and also that those results are in fact derived from the real meter readings. The raw meter readings are not necessarily shared, but can be used locally on any user client to offer a rich experience — i.e. pretty graphs of consumption and comparison with their neighbours. A non technical overview is available as a white paper, a technical introduction for meter manufacturers is provided, and a preliminary technical report with all the crypto is also online.
- Sometimes it is important to aggregate information from multiple meters without revealing anything about individual readings. The traditional approach has been to give all readings to a trusted third-party that performs the aggregation and only publishes the sum. We show that a set of meters can in fact perform the aggregation without the need for a trusted party. This is simple, efficient and compact — the computations can be done inside the trusted meter or outside along with cryptographic verification. All details are available in our technical report on aggregation.
- Some smart-meters may be deployed in extremely high-security settings. In such places leaking even the final bill or statistics aggregated over time may leak information and a positive guarantee that the information leakage is limited might be necessary. We developed techniques inspired from differential privacy to inject noise to aggregate readings that guarantee any specific time period consumption is masked. Further more we allow customers to recuperate the bulk of the costs though an oblivious cryptographic rebate system. Our technical report on differential privacy and rebates in metering is available.
- Finally proving that protocols are correct is not sufficient, so we explore options for proving actual implementation of the protocols are in fact providing the necessary security and privacy properties. A report on the certified implementation of a variants of the proposed protocols using refinement types is also available.
The project web-page on privacy in metering links to all those any more.
So much about the science, what about the engagement with government. On the positive side, our rather limited goal has been achieved: we wanted to put privacy technologies, that provide solutions beyond the dilemmas and balance between privacy and value, on the map. The government response to the consultation takes note, in a limited way, of the potential use of privacy technologies. On page 10 it shyly mentions that:
“2.18. Work is in process to understand the options for aggregating or anonymising smart metering data and whether it is necessary for the data to be accessed by a party that carries out the data minimisation. Privacy enhancing technology can potentially enable anonymised or aggregated data to be provided without any party having access to the personal data itself. The programme will work with industry and academics in order to explore the applicability of privacy enhancing technologies within the smart metering system.”
This is actually a rather fair representation of the capabilities of the technology, even if it is presented as a far away goal, rather than the concrete protocols we have proved correct and the implementations we have built.
Paragraph 2.18 mentioning privacy technology is a ray of light amidst an otherwise ambivalent government response. On the up side it recognizes energy consumption as private data from the onset, it mandates meters to hold 13 months of consumption and provide local access to it, it defines narrowly the scope of data that can be gathered without explicit consent and puts them under the data protection regime. On the down side there is confused language about what constitutes personal data (2.17), and the final technical solution involves collecting data in clear through a centralized systems (the glorious DCC) and protecting it using access control — a far cry from what we know possible in terms of technical privacy protection.
The metering privacy geeks (legal & technical) might also find other interesting nuggets in this report:
- It mentions privacy-by-design, but without support for privacy technologies (except a mention of aggregation in 2.14). This is a damaging trend set by the Ontario report on privacy in the smart grid that takes a purely management approach to privacy in the local smart grid deployment. A response to this trend is provided by Prof. Claudia Diaz and her colleagues that highlights the technical protections necessary to engineer privacy-by-design. This is only the start of this tussle.
- The report seems to suggest that personal data is not personal if it is not readily identifiable by the data controller (sect. 2.17 and 3.7). This is the classic argument of “what is de-identified personal data”. Does it mean the data controller cannot identify it, or anyone in the world? It seems the government is as confused as everyone else on this matter.
- The key outcome of the consultation is that the energy industry needs some data to perform “regulated duties”. This concept was present in the initial consultation, but funnily enough there was no description of that those duties were. It transpired in meetings that Ofgem was not in fact clear about what they were, and a large part of the consultation centered around fleshing those out. A list of those duties is available in Appendix 3 of the report, and is probably welcome by all (a similar list is available in the NIST privacy reports).
- So (in 3.15) the government concedes that industry must have access to the data necessary to perform its regulated duties by default, yet this data should be subject to the DPA requirements (3.16 for example specifically calls principle 5 — that the data should not be kept longer than necessary). Well that is a mine field: it is clear that the data is collected for a specified purpose (principle 2). If the other principles are also applied it means that it should not be used without explicit consent for other purposes (*cough*added value services*cough*) and furthermore it should not be excessive for the stated purpose. Well here we are: our technical reports offer ways in which most of the stated purposes in appendix 3 could be fulfilled without collecting the data. Is this a contradiction? Not automatically. The government’s view is clearly that our proposed protocols are not yet ready for prime time — of course as these technologies become better known and deployed this objection will evaporate. Will the data minimization requirement then mandate the use of privacy technologies? This is a rhetorical question at the moment.
- It is interesting to note that the restrictions associated with limiting the automatic collection of data by suppliers was possibly set in place on the grounds of market competition rather than privacy per-se (section 3.32). Automatic collection by suppliers would put them in an advantageous position vis-a-vis third-party providers of value added services. This is an open issue (3.36).
- The government is keen for a local repository of consumption data in the meter (4.6) and the use of geeky toys to visualize it (4.12). This is the setting in which our solutions enable strong privacy guarantees. That is positive, if only half-way.
In conclusion, the debate around privacy in metering has been informed by consumer concerns, privacy concerns, industry needs and technology alternatives. They are all represented in the government response. Yet the final solution is rather conservative: it relies on a centralised conduit for personal information protected by access control layers and management layers. It is far from what we know possible with privacy technologies. The argument today is that those technologies are too new — which is questionable given how quickly IT inovations are brought to market. This argument will lose its potency in the long term if we keep developping and deploying privacy firendly solutions.
Back in 2009 we had a close look at the surveillance commisionners reports and the implementation of RIPA Part III that makes failure to decrypt material an offense. Today the BBC is reporting that Oliver Drage, 19, of Liverpool has been convicted for refusing to give police the password to his computer. He is looking at spending 16 weeks in jail, merely for not handing out an encryption key.
BBC journalists, in their usual “impartial” style are quick to report the offence under which Mr Drake was arrested, but of course never convicted of. I will not be repeating it here as it might constitute slander, since the accusation was never in fact show to be true, and it is not even clear if the basis of the original suspicion played any role in the conviction.
The BBC also relays verbatim Det Sgt Neil Fowler, of Lancashire police, as saying: “Drage was previously of good character so the immediate custodial sentence handed down by the judge in this case shows just how seriously the courts take this kind of offence. [...] It sends a robust message out to those intent on trying to mask their online criminal activities that they will be taken before the courts with the ultimate sanction, as in this case, being a custodial sentence.”
Of course what the BBC’s impartial style fails to comment on, is that Mr Drake was in fact never shown to be participating in any online criminal activities aside the activity of not revealing his key to the police. At best it sends a robust message that innocent people mindful of their privacy in relation to the state will end up in jail, and at worse it signals to every serious criminal that if they do not reveal their keys they will get off with a light sentence. The police have powers to obtain warrants to enter premisses covertly, install surveillance equipment to retrieve keys, but instead they chose to simply ask the suspect to self incriminate themselves — this is poor policing, and will inevitably lead to travesties of justice.
This is just the beginning of RIPA part III being used, and of course I am looking forward to monitoring the legislation being used against people with legitimate needs for privacy, such as political activists, journalists, lawyers, whistleblowers, etc. Watch this space.
Americans. Attitudes About Internet Behavioral Advertising Practices
Aleecia M. Mcdonald and Lorrie Faith Cranor (Carnegie Mellon University)
This is a very interesting paper on people’s attitudes to behavioural advertising. Researchers used a mix of a small-scale (14 people) study and a larger (100s of people) statistical study. A few findings are remarkable:
- First, they see that users apply their intuition of off-line ads to the experience of on-line ads — many see on-line ads as a push mechanism and do not realise that data about themselves are collected. They seem to not object in general to the idea of advertising, and consider it as a fact of life, and even see it as ‘ok’ to support services.
- The landscape of attitudes to behavioural advertising is fascinating. When faced with a description of what behavioural advertising collects, as a hypothetical scenario, and how it functions, a large percentage of users said this is not possible, and some of them even claimed it would be illegal. When it comes to attitudes towards receiving ‘better’ ads only 18% of them liked the idea for web-based services, and 4% for email based services (like hotmail & gmail). In general the authors found that a lot of extremely common practices cause “surprise”.
- The researchers also looked at the formulation of the text of the NAI site, that offers an opt out from behavioural advertising. They find that what the system does is unclear, even after reading the page where the operation is described.
In general people prefer random ads rather than personal ads, with the exception of contextual ads (like books on on-line book stores). There is still a lot of ignorance about how technical systems work, and education when it comes to privacy and the ability to self-help themselves to protect privacy is clearly not working.
This research is pointing in the direction that the presumed tolerance of users to privacy invasion is due to ignorance of common practices. Once those practices are revealed it produces surprise, and even feeling of betrayal that will not be beneficial to any company and customer confidence.
4 October 2010
- Jack: Scalable Accumulator-based Nymble System by Zi Lin(University of Minnesota) and Nicholas Hopper (University of Minnesota)
The potential for abuse is a key challenge when it comes to deploying anonymity systems, and the privacy technology community has been researching solutions to this problem for a long time. Nymble systems allow administrators to blacklist anonymous accounts, without revealing or even knowing their identity.
What is the model: a user registers an account with a service, such as wikipedia. Then the user can use an anonymous channel like Tor, to do operations, like edit encyclopedia articles. This prevents identification of the author, and also bypasses a number of national firewalls that prevent users accessing the service (China for example blocks Wikipedia for some reason). If abuse it detected then the account can be blacklisted, but without revealing which one it was! The transcript of the edit operation is sufficient for preventing any further edits, but without tracing back the original account or network address of the user.
Nymble systems had some limitations. They either required trusted third parties for registration, or they were slow. A new generation of Nymble systems, including Jack, is now addressing these limitations: they use modern cryptographic accumulator constructions that have proofs of non-membership in O(1) time, to prove a hidden identity is not blacklisted. Jack can do authentication in 200ms, and opening a Nymble address in case of abuse in less than 30ms. This is getting real practical, and it is time that Wikipedia starts using this system instead of blacklisting Tor nodes out of fear of abuse.
4 October 2010
- Using Social Networks to Harvest Email Addresses by Iasonas Polakis, Georgios Kontaxis, Eleni Gessiou, Thanasis Petsas, Evangelos P. Markatos and Spiros Antonatos (Institute of Computer Science, Foundation for Research and Technology Hellas)
- Turning Privacy Leaks into Floods: Surreptitious Discovery of Social Network Friendships and Other Sensitive Binary Attribute Vectors by Arthur Ascuncion and Michael Goodrich (University of California, Irvine) (not on-line yet).
The first work by Polakis et al, looks at how easy it is to massively harvest email addresses using social networks and search engines to further use them as targets for spam. Furthermore, they attach to each email address contextual social information and personal information to produce more convincing spam and phishing emails. For this they used different techniques on three target platforms.
On facebook, they use the facility that allows users to find other by email to harvest personal information. This acts as an oracle to map harvested email addresses to real world names and demographic information. Once a facebook profile is linked, a very convincing fishing email can be crafted – including an email to ask to befriend them. (About 30% of users would befriend a stranger in that manner – a result not reported in the paper.)
A second vector of information is the use of nicknames that are constant across different sites. They use twitter to harvest pairs of (nicknames, email) and then further use the Facebook email to name oracle to link them to real world addresses. Finally, the authors use a google buzz feature to extract emails: every Buzz user ID is also their gmail address – this means that by searching buzz for particular words you can harvest gmail addresses as well as personal information of the users.
But how effective are the email harvesting techniques? How do you even assess this? The authors check the name and address they have harvested against an exact match with the name extracted from Facebook. The first technique yields about 0.3% correct addresses, the second 7%, and the final one 40%, showing that the techniques are practical when linking email to real names.
The second paper by Ascuncion et al. looks at how to aggregate information leaked by social networks to construct complete profiles of users on Social Networks. The aim is to reconstruct the friendship network as well as recovering attributes even if the privacy settings are used.
The techniques assume you can use an oracle to ask group queries against the social network site to check for a particular attribute. The objective is then to find a scheme that minimises the number of queries. It turns out there is a body of work on combinatorial group testing, including adaptive variants, that are readily applicable to this problem. This is not unlike our work on prying data out of a social network. Applying these techniques to social networks is even narrower allowing a lower number of queries to extract attributes (a logarithmic number of queries in the size of possible profiles, and linear in the number of profiles with a certain attribute to be extracted).
The attack is applied and validated by applying it to extract friends in Facebook, DNA sequences in mitochondrial databases, and movie preferences in the NetFlix database. These techniques are interesting as they are very general. At the same time it is likely that faster ways exist to extract specific attributes of users in real-world social networks, as there are strong correlation between attributes and the social structure of users.
4 October 2010
I am just sitting in the first WPES10 talk:
Balancing the Shadows by Max Schuchard, Alex Dean, Victor Heorhiadi, Yongdae Kim, and Nicholas Hopper (University of Minnesota)
ShadowWalker is a peer-to-peer anonymity system designed by Prateek Mittal (who was our intern in 2008) and Nikita Borisov to prevent corrupt peers jeopardising the network. The authors of this new paper “Balancing the shadows” present an attack on the system, where a malicious coalition of nodes can compromise routing security and can bias the probability of choosing a malicious node as a relay. It turns out that a naïve fix opens the system instead to selective denial-of-service attack.
How does the eclipse attack on ShadowWalker work? The adversary controls a full neighbourhood of the network, i.e. a sequence of peers in the distributed hash table (DHT). This allows an adversary to corrupt the “shadow” mechanism in shadow walker. When Alice asks a malicious node in this neighbourhood about another node in the network, they can provide a false ID, along with a set of false shadows. This attack is not too bad on its own, except that the same mechanism is used during the construction of the routing tables of the DHT. As a result an adversary that controls about 10% of the nodes can corrupt about 90% of the circuits, after a few rounds of the protocol (this was backed by simulations).
How to fix the attack? Can we increase the number of shadows of each node that can testify of the correctness of its ID? It turns out this is not a good idea: the more shadows the higher the probability one of them is malicious. In that case they can maliciously refuse to attest honest nodes, effectively taking them out of the protocol. The authors propose to change the protocol to only require a fraction of shadows providing signatures to attest an ID-node relationship — time will show if this withstands attacks.
What do we learn from this: first the level of security in peer-to-peer anonymity systems is still questionable, as designs keep being proposed and broken on a yearly basis. Second, it highlights that DHT based designs inherit the characteristic that routing tables are designed as part of the protocol. This offers the adversary an opportunity of amplify their attacks. Designs should therefore not consider that the DHT is in an honest steady-state, but instead consider attacks at the time of network formation. Finally, it is worth keeping in mind that these systems try to prevent adversaries using a small fraction of malicious nodes (5%-20%) to compromise the security of a large fraction of the network. This is still far from our hope that peer-to-peer anonymity could withstand large Sybil attacks where the adversary controls a multiple of honest nodes.
10 September 2010
I have been teaching a 4 hour course as part of the FOSAD 2010 summer school in Bertinoro, Italy, on high latency anonymous communications. Hence, I have developed my course from 3 years ago, and integrated newer material on Bayesian traffic analysis (including slides contributed by Carmela Troncoso). I also prepared a fifth lecture on low-latency anonymity system, such as Tor and Crowds, but my colleague Paul Syverson gave a separate 4 hour course just on this topic.
The structure of the 5 lectures is:
- Basic definitions & unconditional anonymity with DC-networks.
- Long-term attacks on anonymity systems (Statistical / Disclosure) and their Bayesian formulation.
- Mix networks and anonymity metrics.
- The Bayesian traffic analysis of mix networks.
- Low-latency anonymity with onion routing and crowds.
12 May 2010
An article in the Greek newspaper “Eleftherotypia”on 11 May 2011, covers a worrying trend in surveillance practices of the Greek anti-terrorist squad since 2007. In multiple occasions the police has “lifted” the confidentiality of communications legal provisions, and has requested information about communications taking place within a whole region, for a window of time up to 12 hours. For example:
- On 31-12-2008 they requested data for 3 hours (4am-7am) for a region of Athens, covering the polytechnic school (that is also covered by special privileges / asylum when it comes to freedom of speech) following an attack on a riot police van.
- On 9-3-2009 a large region in Koukaki was targeted for 12 hours. (Map of two first regions: http://s.enet.gr/resources/2010-05/21-thumb-large.jpg)
- Traffic data were collected for whole regions on 12-5-2009, 9-3-2009 (for 90 minutes), 25-11-2009, 18-2-2009 and 7-5-2007.
The article mentions telecommunications (voice and sms) in the first two cases (that might include content), while only mentioning traffic data for the last cases. Furthermore it points out that the selection of time, regions and targets and processing of the information collected happens in an unaccountable manner by officers. The blanket lifting of confidentiality is done under provisions for “state security”, but the article further points out, has now become routine. These practices are also linked with the Data Retention Directive (2006), that has not yet been translated into Greek law, making the legal context for surveillance requests and providers uncertain.
(Original in Greek: “Είμαστε όλοι ύποπτοι…” by ΧΡΗΣΤΟΣ ΖΕΡΒΑΣ)
This comes as a surprise to me, since I always through that the criteria applied for conducting surveillance have to be tied to a network endpoint, or at least a person’s identity.