I am today attending the first Internet Privacy Engineering Network (IPEN), where the issue of translating Data Protection principles into requirements has been raised a number of times. While this exercise needs to be repeated for each given service or application, it reminded me that I had drafted a number of generic Technical Requirements for Processing PII. These need to be reviewed and validated, but I hope they offer at least proof that the problem can be made tractable.

The latest revelations about the NSA attacking some of the largest US cloud providers’ communications, are also accompanied by Cambridge Member of Parliament, Julian Huppert, call to revise the oversight of UK intelligence agencies. Similar calls were made in the US about better oversight of their security agencies. Julian concludes in this Guardian “Comment is Free” piece that:

“Who can read this, and how do we want to protect this? We need to agree the rules now, before we completely lose control.”

While better oversight is in itself a good thing, the over-reliance on “oversight” or privacy regulation, such as data protection regimes, is a typical example of what I call the “liberal fallacy”. The liberal fallacy is the belief that privacy is a complex social technical issue, and as a result it needs to be addressed first and foremost by better regulation, since it cannot be addressed by technical means alone.

The argument is extremely appealing for a number of reasons, and when put so reasonably I would be surprised if most privacy and security professionals, as well as policy makers and civil society advocates would not agree with it. After all, privacy in indeed both complex, and not merely a technical property. Privacy is not an absolute right, and regulation can “balance” the rights of the individual against the collective needs to revoke this right in certain circumstances. In a liberal democracy both the state and companies operate within the rule of the law, therefore proper regulation seems a light weight mechanism to solve the privacy problem.

The problem is that the “better regulation and oversight” argument is just non-sense in the context of the NSA and GCHQ spying allegations. The reason for thi, is that the national regulations do not affect the willingness, legality or ability of other states to conduct mass surveillance operations. Better German privacy legislation would not have protected the German head of state’s telephone conversation against US agencies. Similarly, better UK oversight of GCHQ will not extend any protections the US afford to US persons only to the UK population. For any national legislation offering you strong privacy guarantees and good oversight, there are about 205 other jurisdictions in which spying on you is not only legal, but highly ethical, patriotic, in the national interest, and rather well funded by tax payers.

National legislation works best in the context of territorial matters, where proximity and ability to harm is related to physical distance and location, and an army ensures territorial integrity. The internet is not like that: a US, Russian or Chinese router is as close to your UK web-site or switch as one in the UK. Benefiting from strong protections by UK entities, does nothing to protect you from other dangers that are just as close. It is shocking that US agencies were targeting cloud providers, but now we know they were not doing so only using their legal authority, but also just intercepting their communications. Even given perfect US oversight, better regulation will not prevent other countries doing the same — only better encryption systems can do that.

I am quite delighted that Julian Huppert does also mention that that placing back doors into security products makes everyone less safe, in line with the statement many UK security researchers made earlier this year. Yet, the focus on regulation is misplaced: against nation state level threats, sadly, only better security and privacy technologies can provide a credible defense for privacy.

I am currently sitting at the PETS 2011 symposium in Waterloo, CA. I will be blogging about selected papers (depending on the sessions I attend) over the next couple of days — authors and other participants are welcome to comment!

The first session is about data mining and privacy.

“How Unique and Traceable are Usernames?”
Daniele Perito, Claude Castelluccia, Mohamed Ali Kaafar, and Pere Manils (INRIA)

The first paper looks at the identifiably of on-line usernames. The authors looked at user names from different sites and assess the extent to which they can be linked together, as well as link them to a real person. Interestingly they used Google Profiles as ground truth, since they allow users to provide links to other accounts. First they assess the uniqueness of pseudonyms based on a probabilistic model: a k-th order markov chain is used to compute the probability of each pseudonym, and its information content (i.e. -log_2 P(username)). The authors show that most of the usernames observed have “high entropy” and should therefore be linkable.

The second phase of the analysis looks at usernames from different services, and attempts to link them even given small modifications to the name. The key dataset used was 300K google profiles, that list (sometimes — they used 10K tuples of usernames) other accounts as well. They then show that the Levenshtein distance (i.e. edit distance) of usernames from the same person is small compared to the distance of two random user names. A classifier is built, based on a threshold of the probabilistic Levenshtein distance, to assess whether a pair of usernames belongs to the same or a different user. The results seem good: about 50% of usernames are linkable with no mistakes.

So what are the interesting avenues for future work here? First, the analysis used is a simple threshold on the edit distance metric. It would be surprising if more advanced classifiers could not be applied to the task. I would definitely try to use random forests for the same task. Second, the technique can be used for good not evil: as users try to migrate between services, they need an effective way to find their contacts — maybe the proposed techniques can help with that.

“Text Classification for Data Loss Prevention” (any public PDF?)
Michael Hart (Symantec Research Labs), Pratyusa Manadhata (HP Labs), and Rob Johnson (Stony Brook University)

The paper looks at the automatic classification of documents as sensitive or not. This is to assist “data loss prevention” systems, that raise an alarm when personal data is about to be leaked (i.e. when it is about to be emailed or stored on-line — mostly by mistake). Traditionally DLP try to describe what is confidential through a set of simple rules, that are not expressive enough to describe and find what is confidential — thus the authors present a machine learning approach to automatically classify documents using training data as sensitive or not. As with all ML techniques there is a fear of mistakes: the technique described tries to minimise errors when it comes to classifying company media (ie. public documents) and internet documents, to prevent the system getting on the way of day to day business activities.

The results were rather interesting: the first SVN classifier looked at unigrams with binary weights to classify documents. Yet, it first had a very high rate of false positives for public documents. It seems the classifiers also had a tendency to classify documents as “secret”. A first solution was to supplement the training set with public documents (from wikipedia), to best described “what is public”. Second, the classifier was tweaked to (in a rather mysterious way to me) by “pushing the decision boundary closer to the true negative”. A further step does 3-category classification as secret, public and non-enterprise, rather than just secret and not-secret.

Overall: They manage to get the false positive / false negative rate down to less than 2%-3% on the largest datasets evaluated. That is nice. The downside of the approach, is common to most work that I have seen using SVNs. It requires a lot of manual tweaking, and further it does not really make much sense — it is possible to evaluate how well the technique performs on a test corpus, but difficult to tell why it works, or what makes it good or better than other approaches. As a resut, even early positive resutls should be considered as preliminary until more extensive evaluation is done (more like medicine rather than engineering). I would personally like to see a probabilistic model based classifier on similar features that integrates the two-step classification process into one model, to really understand what is going on — but then I tend to have a Baysian bias.

P3CA: Private Anomaly Detection Across ISP Networks
Shishir Nagaraja (IIIT Delhi) and Virajith Jalaparti, Matthew Caesar, and Nikita Borisov (University of Illinois at Urbana-Champaign)

The final paper in the session looks at privacy preserving intrusion detection to enable cooperation between internet service providers. ISPs would like to pool data from their networks to detect attacks: either because the volume of communications is abnormal at certain times, or because some frequency component is odd. Cooperation between multiple ISPs is important, but this cooperation should not leak volumes of traffic at each IPS since they are a commercial secret.

Technically, privacy of computations is achieved by using two semi-trusted entities, a coordinator and key holder. All ISPs encrypt their traffic under an additive homomorphic scheme (Paillier) under the keyholder key, and send it to the coordinator. The coordinator is using the key-holder as an oracle to perform a PCA computation to output the top-n eighen vectors and values of traffic. The cryptographic techniques are quite but standard, and involve doing additions, subtraction, multiplication, comparison and normalization of matrices privately though a joint private two-party computation.

Surprisingly, the performance of the scheme is quite good! Using a small cluster, can process a few tens of time slots from hundresds of different ISPs in tens of minutes. A further incremental algorithm allows on-line computations of eighen vector/value pairs in seconds — making real-time use of the privacy preserving algorithm possible (5 minutes of updates takes about 10 seconds to process).

This is a surprising result: my intuition would be that the matrix multiplication would make the approach impractically slow. I would be quite interested to compare the implementation and algorithm used here with a general MPC compiler (under the same honest-but-curious model).

  • Using Social Networks to Harvest Email Addresses by Iasonas Polakis, Georgios Kontaxis, Eleni Gessiou, Thanasis Petsas, Evangelos P. Markatos and Spiros Antonatos (Institute of Computer Science, Foundation for Research and Technology Hellas)
  • Turning Privacy Leaks into Floods: Surreptitious Discovery of Social Network Friendships and Other Sensitive Binary Attribute Vectors by Arthur Ascuncion and Michael Goodrich (University of California, Irvine) (not on-line yet).

The first work by Polakis et al, looks at how easy it is to massively harvest email addresses using social networks and search engines to further use them as targets for spam. Furthermore, they attach to each email address contextual social information and personal information to produce more convincing spam and phishing emails. For this they used different techniques on three target platforms.

On facebook, they use the facility that allows users to find other by email to harvest personal information. This acts as an oracle to map harvested email addresses to real world names and demographic information. Once a facebook profile is linked, a very convincing fishing email can be crafted – including an email to ask to befriend them. (About 30% of users would befriend a stranger in that manner – a result not reported in the paper.)

A second vector of information is the use of nicknames that are constant across different sites. They use twitter to harvest pairs of (nicknames, email) and then further use the Facebook email to name oracle to link them to real world addresses. Finally, the authors use a google buzz feature to extract emails: every Buzz user ID is also their gmail address – this means that by searching buzz for particular words you can harvest gmail addresses as well as personal information of the users.

But how effective are the email harvesting techniques? How do you even assess this? The authors check the name and address they have harvested against an exact match with the name extracted from Facebook. The first technique yields about 0.3% correct addresses, the second 7%, and the final one 40%, showing that the techniques are practical when linking email to real names.

The second paper by Ascuncion et al. looks at how to aggregate information leaked by social networks to construct complete profiles of users on Social Networks. The aim is to reconstruct the friendship network as well as recovering attributes even if the privacy settings are used.

The techniques assume you can use an oracle to ask group queries against the social network site to check for a particular attribute. The objective is then to find a scheme that minimises the number of queries. It turns out there is a body of work on combinatorial group testing, including adaptive variants, that are readily applicable to this problem. This is not unlike our work on prying data out of a social network. Applying these techniques to social networks is even narrower allowing a lower number of queries to extract attributes (a logarithmic number of queries in the size of possible profiles, and linear in the number of profiles with a certain attribute to be extracted).

The attack is applied and validated by applying it to extract friends in Facebook, DNA sequences in mitochondrial databases, and movie preferences in the NetFlix database. These techniques are interesting as they are very general. At the same time it is likely that faster ways exist to extract specific attributes of users in real-world social networks, as there are strong correlation between attributes and the social structure of users.

I have been teaching a 4 hour course as part of the FOSAD 2010 summer school in Bertinoro, Italy, on high latency anonymous communications. Hence, I have developed my course from 3 years ago, and integrated newer material on Bayesian traffic analysis (including slides contributed by Carmela Troncoso). I also prepared a fifth lecture on low-latency anonymity system, such as Tor and Crowds, but my colleague Paul Syverson gave a separate 4 hour course just on this topic.

The structure of the 5 lectures is:

  1. Basic definitions & unconditional anonymity with DC-networks.
  2. Long-term attacks on anonymity systems (Statistical / Disclosure) and their Bayesian formulation.
  3. Mix networks and anonymity metrics.
  4. The Bayesian traffic analysis of mix networks.
  5. Low-latency anonymity with onion routing and crowds.

Slides are available in Office 2010 format, Office 97-03 format and PDF (but without the cute animations).

I am currently listening to the presentation of:

  • Dominik Herrmann, Rolf Wendolsky, Hannes Federrath, “Website Fingerprinting: Attacking Popular Privacy Enhancing Technologies with the Multinomial Naive-Bayes Classifier” ACM CCSW 2009. Chicago, US.

This work looks at the classic problem of identifying browsed pages through SSL encryption or low-latency anonymization networks like Tor. The authors cast the problem of identifying web pages as a supervised learning and classification problem. Instead of applying a jacard coefficient or naive bayes classifier, the authors use a multinomial naive Bayes classifier to learn and identify the pages from their sizes. Furthermore they use filtering on the raw page sizes inspired from information retrieval — such as using the logarithms of the sizes to avoid large bias, or cosine normalisation to make the overall size of the website irrelevant.

That is yet another fine example of casting a traffic analysis problem in terms of machine learning and Bayesian inference.

A very interesting comparison is presented on how different single-hop and multi-hop anonymity systems compare. Single Hop systems are broken with probability over 95%. Multi-hop systems are pretty robust with JonDoNym traffic only being identified 20% of the time, and Tor about 3% of the time. Why is Tor doing so well? It turns out that the cell quantization does add noise, and mess up the attack. It is not clear if this is a fundamental finding, or whether a better / more specific attack could do better.

Datasets should be available soon. Sadly I cannot find the paper on-line. Put your work on the web people!

Anonymity at CCS’09

11 November 2009

I am currently sitting in the anonymity techniques session of ACM CCS 2009 in Chicago, and thought I might as well provide a roadmap to what is being presented here regarding traffic analysis and anonymity. Aside the main event, WPES was also hosted on Monday and contained quite a few relevent papers.

It turns out that altruism does not pay: if you become a relay when also using an anonymity network, your attack surface grows and novel traffic analysis attacks are possible to guess what you are surfing. This paper presents a long term intersection attack, where the time you are on-line can be used to correlate observed browsing events (like tweets), as well as an extension of what has now been named clogging attacks, where traffic can be modulated on the Tor client, and observed on a remote server. Very cool! This paper seems to be part of Nick Hopper’s group new strategy, of presenting an attack paper at WPES, and a solution at CCS (this is understandable as he has been penalised, by not being able to publish at PETS 2010, as he is aPC chair). Part of the solution against this attack is presented in “Membership-concealing overlay networks” later in the conference.

  • XPay: Practical anonymous payments for Tor routing and other networked services (not on-line!)
    Yao Chen, Radu Sion and Bogdan Carbunar (WPES)

A controversial payment mechanism for paying routers that relay traffic in Tor. More details will follow when I manage to get a copy of the paper, as the proceedings of WPES are not available yet (*cough*cock-up*cough*).

Many have tried to build anonymity systems over DHTs, providing some of us an endless supply of systems to perform traffic analysis. This work looks at pretty much most of the DHT based anonymity systems, and points out that queries in DHTs are easy to observe, manipulate and DoS given only a small number of corrupt nodes. This is the attack paper that motivated the Torsk system, that will be presented later in the CCS conference.

  • NISAN: Network Information Service for Anonymization Networks (not on-line?)
    Andriy Panchenko, Arne Rache and Stefan Richter

Directory services are a bit of a drag on anonymity systems, and NISSAN is a proposal to use a DHT to distribute the directory functionality. Special care is taken to ensure the DHT routing resolution mechanism is not susceptible to an active adversary that controls a fraction of nodes in the DHT, to get truly random nodes. The authors also controversially argue that the observability of queries is not important, and suggest that bridging and fingerprinting is not a big deal so far. DHTs are scary, and NISSAN is likely to come under close scrutiny in the next few years (as long as the authors put the paper on-line!)

The paper proposes the use of certificateless encryption to build circuits in onion routing (or mix) systems. It is unclear how useful this is, as the list of nodes already has to be signed to avoid directory and sybil attacks, but having different options to do crypto in anonymity networks is always interesting.

After many attack papers, Nikita and Prateek decide to jump in the deep end, and propose their own DHT based path construction mechanism for anonymity networks. ShadowWalker is a system that allows secure sampling in a DHT. It relies on commitments on routing tables shared amongst some “shadow” nodes to ensure the adversary cannot adaptively lie about their routing tables, and a tit-for-tat strategy to avoid DoS attacks. Very cool ideas, and a very cool name.

Given a trace of a mix network, what is the probability of Alice talking to Bob? This seemingly simple question turns out to be remarkably hard to answer given the constraints of path selection and a real-world trace of traffic. More details on how to do this are also available in our report: The Application of Bayesian Inference to Traffic Analysis. I will write a post about how I hope this will become the standard way in which to assess the passive information leakage of anonymity networks.

The story so far was that a bigger Tor network is a more secure Tor network. This paper points out it is not so simple: despite having more nodes, Tor still routes most traffic over a small number of AS (Autonomous Systems), that can see both ends of the connection, and should be able to do timing attacks. A method to chose paths to mitigate this problem is proposed — more would be welcome.

Finding out remotely who is using an anonymity network might lead to trouble for some users. This work proposes a method to anonymize traffic, in a peer-to-peer manner, but without ever connecting to many strangers. Its very nice to see the idea of social network being used for infiltration resistance taking off.

Papers to come tomorrow:

  • A New Cell Counter Based Attack Against Tor
    Zhen Ling, Junzhou Luo, Wei Yu, Xinwen Fu, Dong Xuan and Weijia Jia

I just sat thought the first session of PET2009, that was about privacy policies where two really interesting pieces of research were presented.

Ram presented a work on “Capturing Social Networking Privacy Preferences” [pdf], where he proposes to infer automatically privacy policies for social networks, and present them as templates or starting points for users to define their own policies. The methodology used is really neat: they record the location of a number of users, and every night they ask the users whether they would be happy to share their locations with different circles of theirs. Then they try to extract a set of standard policies, based on time, location, and the type of contact that can see your location.

The second study, presented by Aleecia, is on how easy and pleasing is to read privacy policies (“A Comparative Study of Online Privacy Policies and Formats“). They find that privacy policies in different formats are more or less easy to read and understand, but across the board privacy policies are difficult to understand, easy to misunderstand, and totally unpleasant to read.

Lords recommend PETs

6 February 2009

The house of Lords Constitution Committeehas just published a report on Surveillance: Citizens and the State as well as the evidence they heard. As part of their recommendations they push Privacy enhancing Technologies to be part of the procurement process of government projects. In particular they say:

485. We recommend that the Government review their procurement processes so as to incorporate design solutions that include privacy-enhancing technologies in new or planned data gathering and processing systems. (paragraph 349)

They also push, albeit in an indirect way, for privacy enhanced identification schemes and ID cards, citing the example of Austria. This is basically a recommendation to implement selective disclosure credential technologies:

478. We recommend that the Government’s development of identification systems should give priority to citizen-oriented considerations. (paragraph 268)

Which refers to:

268. The Information Commissioner’s Office (ICO) drew attention to the use in Austria of a system of identification numbers that allows access to information in different databases “without the need for a single widely known personal identification number that may be misused.” (p 5) The Royal Academy of Engineering (RAE) explained that it is possible for individuals to fulfil their legitimate need or desire to maintain multiple roles or identities in transactions with state or other organisations and to avoid the possibility of those organisations needlessly correlating them. The technology involved in identification can be developed to suit an individual’s preference to keep domestic status and work life separate, where the protection of identity is necessary to avoid abusive relationships or stalking, or where witnesses and children need protection.118 We recommend that the Government’s development of identification systems should give priority to citizen-oriented considerations.

This is all good news! It is indeed at the procurement phase that such requirements for PETs should be specified and entrenched in the delivery contracts. Negotiating PETs for complex surveillance technologies will also make the cost of recording data just-in-case visible.

Carmela Troncoso points to an article in “Telematics Update” about Norwich Union (NU) discontinuing their pilot pay-as-you-drive scheme. Despite them being rather coy about the exact reasons, some experts guess:

“Strategy Analytics analyst, Clare Hughes, said that prohibitive launch costs, privacy violations, patent fees, back-office data integration and difficulties in measuring costs versus benefits would inhibit the immediate widespread launch of PAYD schemes.”

There are two items of interest in this list, one obvious and one less so. The obvious one concerns privacy violations, and the perception from drivers that the insurance company and anyone who can get hold of the data can spy on their every movement. This fear is to some extent justified, which led us to propose PriPAYD, to alleviate exactly those privacy concerns. It is good to see that once more it is proved that privacy technology is an enabler, sadly the hard way for NU.

More subtlety the “back-office data integration” also relates to privacy. Requiring every data point to be transmitted to a central server, and building a gigantic silo of location data comes at a price. Processing this information to extract billing data, or even worse securing and managing access to it is an expensive business. If only PAYD providers stick to their core business model, i.e. provide insurance by the mile, type of road and time of day, they could get rid of data as soon as it is processed, reducing the costs of storage, further processing and management.

Interestingly the article points out that the future of PAYD rests in the services area: roadside assistance, emergency help, etc. This points towards an integration of the PAYD box with other components of a car, making it in fact part of a more general computing platform. Again, lets hope that this platform will be user centric, and will not be emitting a location trail to third parties.


Get every new post delivered to your Inbox.