The latest revelations about the NSA attacking the communications of some of the largest US cloud providers have been accompanied by a call from Cambridge Member of Parliament Julian Huppert to revise the oversight of UK intelligence agencies. Similar calls have been made in the US about better oversight of their security agencies. Julian concludes in this Guardian “Comment is Free” piece that:

“Who can read this, and how do we want to protect this? We need to agree the rules now, before we completely lose control.”

While better oversight is in itself a good thing, over-reliance on “oversight” or privacy regulation, such as data protection regimes, is a typical example of what I call the “liberal fallacy”. The liberal fallacy is the belief that, since privacy is a complex socio-technical issue that cannot be addressed by technical means alone, it needs to be addressed first and foremost by better regulation.

The argument is extremely appealing for a number of reasons, and when put so reasonably I would be surprised if most privacy and security professionals, as well as policy makers and civil society advocates, did not agree with it. After all, privacy is indeed both complex and not merely a technical property. Privacy is not an absolute right, and regulation can “balance” the rights of the individual against the collective need to revoke this right in certain circumstances. In a liberal democracy both the state and companies operate within the rule of law, so proper regulation seems a lightweight mechanism to solve the privacy problem.

The problem is that the “better regulation and oversight” argument is simply nonsense in the context of the NSA and GCHQ spying allegations. The reason for this is that national regulations do not affect the willingness, legality or ability of other states to conduct mass surveillance operations. Better German privacy legislation would not have protected the German head of state’s telephone conversations against US agencies. Similarly, better UK oversight of GCHQ will not extend the protections the US affords to US persons only to the UK population. For any national legislation offering you strong privacy guarantees and good oversight, there are about 205 other jurisdictions in which spying on you is not only legal, but highly ethical, patriotic, in the national interest, and rather well funded by taxpayers.

National legislation works best in the context of territorial matters, where proximity and the ability to harm are related to physical distance and location, and an army ensures territorial integrity. The internet is not like that: a US, Russian or Chinese router is as close to your UK web-site or switch as one in the UK. Benefiting from strong protections by UK entities does nothing to protect you from other dangers that are just as close. It is shocking that US agencies were targeting cloud providers, but we now know they were doing so not only through their legal authority, but also by simply intercepting the providers’ communications. Even given perfect US oversight, better regulation will not prevent other countries from doing the same — only better encryption systems can do that.

I am quite delighted that Julian Huppert also mentions that placing back doors into security products makes everyone less safe, in line with the statement many UK security researchers made earlier this year. Yet the focus on regulation is misplaced: against nation-state-level threats, sadly, only better security and privacy technologies can provide a credible defense for privacy.

Moving to UCL …

31 October 2013

After a good few years at Microsoft Research, I am now moving to University College London to take a position in Security and Privacy Engineering at the Computer Science Department.

I am joining a fantastic team of researchers: Angela Sasse heads the group and is doing pioneering work on human aspects of security; Jens Groth is an expert on cryptography and zero-knowledge proofs; Nicolas Courtois is a leading cryptanalyst, and has hit the news many times in the past by demonstrating vulnerabilities in deployed systems. Alongside myself, Emiliano De Cristofaro, who works on applied cryptography and privacy, and David Pym, who has a dual interest in formal methods and the economics of security, are also joining the group.

One of my first non-research tasks at UCL is to teach the Computer Security 1 course, which is a broad introduction to the basics of computer security. As a matter of principle, namely that the highest quality of protection is achieved when computer security is discussed in public, I consider the class to be a public event, open to anyone who would like to attend (subject to space restrictions). So if you are based in London and would like access, just let me know.

I was sent yesterday articles about the new round of policy proposals related to Internet surveillance and traffic data monitoring (bbc, guardian). What is depressing, but also really revealing, is how similar both the sought powers and the arguments for them are to previous attempts just a few years ago.

In essence the powers seek to extend the ability of law enforcement agencies to access all Internet traffic data, a power they largely already have when it comes to conventional telecommunications or email services. What are traffic data? Everything that you have not typed as a message: the identity and time of your facebook chats, your facebook likes, the log of your visits to all web-pages, your clicks on on-line polls, the location data your phone sends to access on-line location services, the times and places you were in the same chat room with your friends, your on-line friends, etc. Basically, you can think of blanket traffic data retention and access as having a policeman following you around 24 hours a day, 7 days a week, making notes about where you have been, what you have looked at, who you are talking to, what you are doing, where you are sleeping (and with whom), everything you bought, every political and trade union meeting you went to, and so on — but not actually hearing any of the conversations or seeing what you wrote. Traffic data provide an X-ray of your whole life, and the policy suggests they should be available to law enforcement and the intelligence services without any judicial oversight (only political review or police oversight).

As has been discussed many times before, recording all this information is very expensive, unlikely to ever be totally comprehensive, technically near-infeasible, and prone to over-collection and abuse. In fact a focus on “more data” detracts attention from careful on-the-ground case work, and turns policing into a computer game you cannot win. A lot of data is already sitting in databases and can be accessed by police; in fact the police are under-equipped, under-trained and under-resourced to make use of those investigative resources, let alone more raw data. The focus on ubiquitous surveillance also increases concerns about privacy, and the idea that everything communicated can be used against a person puts a brake on the adoption of information technologies like computational clouds.

On that basis the Conservative and Lib Dem opposition argued against those measures in the past. They are now in government, so one would think this debate is not necessary. Yet the internet surveillance plans are back on the table! What is going on here?

This time around, I personally think that the campaign against those measures should also seek to dislodge those in the permanent and non-elected institutions of the state who keep bringing this policy back to the table. I find it very hard to believe that these near-carbon-copy proposals naturally “re-emerge” despite the prolonged public debates against them. It is much more likely that the policy is simply repackaged and presented anew to every new minister by career civil servants, under permanent pressure from the agencies.

I find it troubling that there is a non-elected set of state institutions whose permanent policy agenda is to undermine civil liberties, despite consistently losing the public debate whenever specific powers are considered by Parliament. I would call this political subversion. What is even more troubling is that the architects of such policies are hiding under the cloak of secrecy, making it impossible for those outside government and the security services to really call them to account. I appreciate that operational necessities mean that some aspects of the security services’ work must remain secret. Yet I cannot see why the branch of the security services that aims to directly affect public policy, by pushing a permanent agenda of ubiquitous surveillance, should enjoy that privilege.

Today Theresa May, our Home Secretary, “insisted only data – times, dates, numbers and addresses – not content would be accessible” and that “ordinary people” had nothing to fear. Requesting such information about the communications between the Home Office, civil servants, and the members of the security services that advised them about this policy may change her opinion about how sensitive such information is. In fact, she may discover that providing such a map of the policy campaign and its network of support gives the opponents of the policy an undeniable advantage. I am very much looking forward to turning my automated social network analysis tools on their call graphs and email logs, and providing all results and intuitions to journalists.

ACM CCS 2011 just took place this week, so I decided to give a bit more insight into a few of the processes the program chairs used behind the scenes to manage what is the largest security conference to date. Vitaly Shmatikov (CCS 2011 program co-chair) has already given a short introduction to this year’s process: we received 429 full papers that we had to review with 54 PC members. While no hard target was set at the start of the process, we expected to accept around the 60 papers that now form the program of CCS 2011. These are my views and opinions on the process, and they are not automatically shared by anyone else, including Vitaly.

Note: This post describes automated statistics we used to interpret review scores, to guide us in assigning more reviews or steering discussion. All final acceptance decisions were taken the old-fashioned way, through qualitative assessment of the reviews and discussion in the PC.


Privacy-friendly Aggregation for the Smart-grid
Klaus Kursawe (Radboud Universiteit Nijmegen) and George Danezis and Markulf Kohlweiss (Microsoft Research)

Privacy for smart electricity provision seems to be a rising topic, and this year there is a whole session on it at PETS 2011. The first paper (of which I am a coauthor) looks at the problem of gathering aggregate data from groups of smart meters, without allowing any third party to get the individual measurements. This can be applied as a PET to solve real-world problems such as fraud detection, leakage detection, load estimation, demand response and weather prediction — all of which only require aggregate data (sometimes in real time), not individual measurements.

The key challenge in providing private aggregation protocols is the specific constraints of smart meters. They are cheap devices with modest resources, hardly any bandwidth, and no ability to communicate with one another. Two specific protocols are presented: the first allows the sum of meter readings to be compared with a reference number (maybe measured from a feeder meter). This protocol allows for fancy proofs of correctness, but it is slow in terms of computation and bandwidth (it requires public key operations for each reading). The second protocol is extremely fast and has no communication overhead. In both cases a pragmatic approach to the threat model is followed: we assume that the utilities will honestly define groups of meters and facilitate the key management protocol. For the second protocol there is no overhead of public key operations after the initial key setup.
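To give a flavour of the masking idea behind the second protocol, here is a minimal Python sketch of my own. It is an illustration, not the deployed design: key derivation and the modular range are simplified, and in reality each meter only ever computes its own masks from its own pairwise keys.

    import hashlib
    import os
    from itertools import combinations

    MOD = 2**32  # readings and masks live in a fixed modular range

    def mask(key: bytes, round_no: int) -> int:
        # Derive a per-round pseudo-random mask from a shared pairwise key.
        h = hashlib.sha256(key + round_no.to_bytes(8, "big")).digest()
        return int.from_bytes(h[:4], "big")

    def blind_readings(readings, pairwise_keys, round_no):
        """Meter i adds +mask for peers j > i and -mask for peers j < i,
        so all masks cancel when the blinded readings are summed."""
        blinded = list(readings)
        for i, j in combinations(range(len(readings)), 2):
            m = mask(pairwise_keys[(i, j)], round_no)
            blinded[i] = (blinded[i] + m) % MOD
            blinded[j] = (blinded[j] - m) % MOD
        return blinded

    # Toy run: 4 meters, one key per pair (set up once, out of band).
    keys = {pair: os.urandom(16) for pair in combinations(range(4), 2)}
    readings = [230, 480, 120, 310]
    blinded = blind_readings(readings, keys, round_no=1)
    assert sum(blinded) % MOD == sum(readings) % MOD  # aggregate survives

Each blinded value on its own looks pseudo-random, but the utility recovers the group total exactly, which is all the applications above need.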

The key highlight of this work is not so much its technical depth (tricks with DC networks and hash functions that would not surprise any PETS regular). What is interesting is that the protocols were designed for a real industrial application, and are now fully integrated into real smart meters and their communication protocols, in collaboration with our partners at Elster.

Plug-in privacy for Smart Metering billing
Marek Jawurek, Martin Johns, and Florian Kerschbaum (SAP Research)

This second paper looks at the problem of billing for fine-grained time-of-use tariffs, where energy consumption at different times costs a different rate per unit. This is a very important topic, as correct billing under time-of-use tariffs is a key driver of fine-grained data collection through smart meters — if we can do billing privately, then maybe less personal information needs to be collected.

Technically, the proposed protocols are based on the homomorphic properties of Pedersen commitments: readings are commitments, and you can use multiplication by a constant and addition to compute the bill, and (most importantly) prove that it is correct. The system model is that the meter outputs signed commitments to readings, a privacy component computes the bill and proofs of correctness, and those are sent to the supplier for verification (and printing the bills!).

This is the core of a nice solution for the basic billing case (which is likely to be the common one in smart grids). We have shown in related work that the protocol can be further improved to have zero communication overhead. Since it avoids expensive zero-knowledge proofs, producing and verifying the correctness proofs is fast. It also provides the basic infrastructure to support further, more expressive billing policies and general computations.
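To make the commitment arithmetic concrete, here is a toy sketch of the billing check. The group parameters are illustrative and far too small to be secure, and the real protocol additionally has the meter sign each commitment.

    import random

    # Toy Schnorr group (p = 2q + 1, both prime); real schemes use ~256-bit groups.
    p, q = 2039, 1019
    g = 4                    # generator of the order-q subgroup
    h = pow(g, 777, p)       # toy second generator; in practice log_g(h) must be unknown

    def commit(m, r):
        return (pow(g, m, p) * pow(h, r, p)) % p

    readings = [5, 3, 7]                           # meter readings per time slot
    rates    = [2, 1, 3]                           # public tariff per slot
    rands    = [random.randrange(q) for _ in readings]
    coms     = [commit(m, r) for m, r in zip(readings, rands)]

    # Privacy component: compute the bill and the matching opening.
    bill = sum(t * m for t, m in zip(rates, readings)) % q
    R    = sum(t * r for t, r in zip(rates, rands)) % q

    # Supplier: recombine commitments homomorphically and check the claimed bill.
    agg = 1
    for c, t in zip(coms, rates):
        agg = (agg * pow(c, t, p)) % p
    assert agg == commit(bill, R)   # bill verified without seeing the readings

The supplier learns only the total bill: the homomorphism lets it check that the claimed total is consistent with the committed readings, without ever opening them.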

An Accurate System-Wide Anonymity Metric for Probabilistic Attacks
Rajiv Bagai, Huabo Lu, Rong Li, and Bin Tang (Wichita State University)

Traditional entropy-based anonymity metrics look at the security of single messages. But how can you quantify the security provided by a whole system? The first paper in this session looks at a system-wide definition of anonymity based on “counting” the possible matchings between inputs and outputs of an anonymity system. Furthermore, the metric extends to probabilities over perfect matchings, to express the subtleties of modern anonymity systems. The paper also provides a thorough critique of the earlier metric by Edman et al. (there was also previous work on this metric by the Leuven crew).

In a nutshell, the proposed system-wide metric associates a probability with each possible matching, and computes the (normalized) entropy over this distribution as a measure of anonymity. The choice of Shannon entropy to summarise quality can be swapped for min-entropy or others (which is very cool!). One key issue with system-wide metrics is how they express the properties that any individual message receives. Paul Syverson points out that these types of metrics express more the anonymity capacity of a system — namely, how much anonymity the system could provide as a whole. The question of how this capacity for protection is distributed across users may need an extension to those metrics. For anyone who would like to extend metrics to capture this aspect, this paper is a very solid foundation.
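For intuition, here is a brute-force sketch of such a metric on a toy system, assuming we are handed the adversary’s probabilities that input i corresponds to output j. It enumerates all n! matchings, so it is only feasible for tiny examples; the numbers are made up.

    from itertools import permutations
    from math import log2, factorial

    # P[i][j]: adversary's probability that input i maps to output j (toy values).
    P = [[0.7, 0.2, 0.1],
         [0.2, 0.6, 0.2],
         [0.1, 0.2, 0.7]]
    n = len(P)

    # Weight of each complete matching (permutation), up to normalization.
    weights = {}
    for perm in permutations(range(n)):
        w = 1.0
        for i, j in enumerate(perm):
            w *= P[i][j]
        weights[perm] = w
    total = sum(weights.values())

    # Shannon entropy over matchings, reported against the maximum log2(n!).
    H = -sum((w / total) * log2(w / total) for w in weights.values() if w > 0)
    print(f"system-wide anonymity: {H:.3f} / {log2(factorial(n)):.3f} bits")

Swapping the Shannon entropy line for a min-entropy computation (the negative log of the most likely matching’s probability) gives the worst-case variant the authors also support.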

DefenestraTor: Throwing out Windows in Tor
Mashael AlSabah, Kevin Bauer and Ian Goldberg (University of Waterloo), Dirk Grunwald (University of Colorado), and Damon McCoy, Stefan Savage, and Geoffrey Voelker (University of California-San Diego)

This paper looks at performance issues within the Tor network, and in particular the effects of its congestion and flow control protocols. Tor implements a simple end-to-end flow control mechanism at the granularity of circuits and streams. It turns out that the implemented window-based flow control has detrimental effects on performance: it does not protect intermediate routers (which are likely to be the congested ones) from congestion.

Two approaches were followed to solve this problem. First, a smaller window could be used, or windows could be computed dynamically — though this alone does not solve the problem. Second, the N23 congestion control protocol (used for ATM) could be used in Tor. N23 is simple and guarantees that no packets are dropped, while maintaining a steady flow of data. It is a credit-based system, where packets are sent when credits are available (and consume them), and credits are sent back upstream when bandwidth is available.
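Here is a highly simplified illustration of the credit principle (my own toy, not the actual N23 parameters or the Tor implementation): a sender may only push cells while it holds credits, so the downstream queue stays bounded and nothing is ever dropped.

    from collections import deque

    class CreditLink:
        """Toy credit-based hop: upstream may only send while it holds
        credits, so the downstream queue never exceeds the allowance."""
        def __init__(self, initial_credits=10):
            self.credits = initial_credits
            self.queue = deque()

        def try_send(self, cell) -> bool:
            if self.credits == 0:
                return False              # back-pressure: sender must wait
            self.credits -= 1
            self.queue.append(cell)
            return True

        def drain(self, k=1):
            # Downstream forwards k cells onward and returns k credits upstream.
            for _ in range(min(k, len(self.queue))):
                self.queue.popleft()
                self.credits += 1

    link = CreditLink(initial_credits=3)
    print([link.try_send(f"cell{i}") for i in range(5)])
    # [True, True, True, False, False]: the queue is bounded at 3 cells
    link.drain(2)                          # downstream drains, credits flow back
    print(link.try_send("cell5"))          # True again

The point of the mechanism is that congestion surfaces as back-pressure at the sender, rather than as ballooning queues at the (likely congested) intermediate routers.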

The evaluation was done under realistic conditions on ExperimenTor. The improvement over the current Tor strategy is significant when it comes to the time to get the first byte, but the time to complete larger (bulk) downloads does suffer (which is part of the point of the protocol).

I am really happy to see research at the intersection of traditional networking and anonymous communications. I had never heard of N23 before (shame on me!), and it seems to be a good fit for the problem of congestion in anonymity networks (where reliability is not an issue when TCP is used).

Privacy Implications of Performance-Based Peer Selection by Onion Routers: A Real-World Case Study using I2P
Michael Herrmann and Christian Grothoff (Technische Universität München)

This is an attack paper on the I2P network, and in particular its performance-based peer selection. It uses a denial-of-service attack to influence the selection of peers within the network and force a victim to choose corrupt servers.

This is a cute attack that combines denial of service, traffic analysis to confirm you are on the same circuit, and interactions with the infrastructure under attack. It is a very good reminder that anonymity engineering is not simply systems work: every design choice about performance can affect security in dramatic ways. The evaluation was also very sensitive to protecting users: the researchers tried their attack on the real network, but targeted their own circuits (I still want to see details to make sure no other users were affected).

Tor too implements circuit selection on the basis of performance — I am wondering to what extent similar ideas could be applied there …

Quantifying Location Privacy: The Case of Sporadic Location Exposure
Reza Shokri and George Theodorakopoulos (EPFL), George Danezis (Microsoft Research), and Jean-Pierre Hubaux and Jean-Yves Le Boudec (EPFL)

This work evaluates the privacy of using location-based services sporadically, under a set of location privacy mechanisms. Sporadic services are those that require location infrequently rather than continuously (think of restaurant suggestions rather than relaying real-time GPS streams). The key novelty of the approach is that the model of location exposure, as well as of privacy protection, is very general: it encompasses anonymization, generalization and obfuscation of location, use of fake traffic, and suppression of location. The analysis relies on advanced models of location and mobility (based on Markov chains) and on Bayesian inference. The evaluation of different location privacy techniques is done on real-world traces from San Francisco taxis.
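As a rough illustration of the inference step, here is a minimal forward-algorithm sketch, assuming a known Markov mobility model and a known obfuscation distribution; the paper’s framework is considerably more general, and all numbers here are invented.

    import numpy as np

    # Toy world: 3 regions. T[i][j] = P(move i -> j) is the mobility model,
    # O[i][k] = P(report obfuscated region k | true region i).
    T = np.array([[0.80, 0.15, 0.05],
                  [0.10, 0.80, 0.10],
                  [0.05, 0.15, 0.80]])
    O = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.3, 0.6]])
    prior = np.array([1/3, 1/3, 1/3])

    def posterior_track(observations):
        """Adversary's belief about the true region after each sporadic,
        obfuscated exposure: predict with T, then update on the report."""
        belief, beliefs = prior.copy(), []
        for obs in observations:
            belief = belief @ T           # mobility step between exposures
            belief = belief * O[:, obs]   # Bayesian update on the observation
            belief /= belief.sum()
            beliefs.append(belief.copy())
        return beliefs

    for b in posterior_track([0, 0, 1]):
        print(np.round(b, 3))

The degree of privacy a mechanism offers can then be read off such posteriors, for example as the adversary’s expected error in guessing the true location.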

I am one of the authors of this work, so of course I think it is awesome! More seriously, it is one of the first works to combine a multitude of location privacy mechanisms under a common framework, and to provide a common evaluation framework to quantify the degree of protection they offer relative to each other against different adversaries. It is also one of the first systematic applications of Bayesian inference to analyse location privacy — extending the inference paradigm beyond the analysis of network anonymity systems.

Of course this is not the last word. Only a subset of protection techniques and combinations of techniques were looked at, and other protection mechanisms can be integrated and evaluated in the same framework (the tracing model and threat model can remain unchanged). Secondly, the analysis itself may be augmented with side information — be it commercial transactions or traces of network traffic — that may reveal something about location, to increase the capabilities of the adversary (or make them more realistic). The model we use, based on Markov chains, has the benefit of giving analytically tractable results, but numerical techniques may be used to extend it to be more true to real-life attacks.

The Location-privacy Meter tool that can be used to evaluate custom location privacy protections is available for download!

Privacy in Mobile Computing for Location-Sharing-Based Services
Igor Bilogrevic and Murtuza Jadliwala (EPFL), Kubra Kalkan (Sabanci University), Jean-Pierre Hubaux (EPFL), and Imad Aad (Nokia)

This paper looks at applications where users need to share their location. For example, two users may want to find out whether they are close to each other, or where they should meet in order to share a taxi ride. Yet those users do not want to leak any of their location information to the other users or the service provider. More specifically, two users specify a set of ranked preferred locations where they could meet, and the system needs to determine one of those fairly, without revealing the current locations or other preferences (except the one chosen to meet). This is called the fair rendez-vous problem.

The key contribution of this work is to show that this problem can be solved with a set of concrete cryptographic protocols. It also presents an implementation of these algorithms on a real mobile phone to show that they are practical. The cryptographic computations are based on homomorphic encryption schemes, as well as interactions with the service (to do multiplications, which are not possible with Paillier alone). The implementation on a mobile phone takes a few seconds on the client and the server, and is parallelizable in the number of users. Unusually, the authors also did a user study: users were asked what their concerns were, and after using the application on the phone they were asked how usable it was, and whether they appreciated the privacy provided by the application.
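For readers unfamiliar with Paillier, here is a toy sketch of the additive homomorphism such protocols rely on, with illustrative (insecure) key sizes; it also shows why multiplying two encrypted values needs the interaction with the service mentioned above.

    import random
    from math import gcd

    # Toy Paillier key (real keys use primes of 1024+ bits each).
    p_, q_ = 293, 433
    n, n2 = p_ * q_, (p_ * q_) ** 2
    lam = (p_ - 1) * (q_ - 1) // gcd(p_ - 1, q_ - 1)   # lcm(p-1, q-1)
    g = n + 1

    def L(u): return (u - 1) // n
    mu = pow(L(pow(g, lam, n2)), -1, n)

    def enc(m):
        r = random.randrange(1, n)
        while gcd(r, n) != 1:
            r = random.randrange(1, n)
        return (pow(g, m, n2) * pow(r, n, n2)) % n2

    def dec(c):
        return (L(pow(c, lam, n2)) * mu) % n

    a, b = enc(12), enc(30)
    assert dec((a * b) % n2) == 42        # addition of two ciphertexts
    assert dec(pow(a, 5, n2)) == 60       # multiplication by a plaintext constant
    # Multiplying two *encrypted* values is not possible locally, hence
    # the interactive rounds with the server in the paper's protocols.

Distances and preference aggregates decompose into exactly these additions and constant multiplications, which is what makes the phone-side computation feasible in seconds.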

This is a really nice example of a privacy application that applies advanced crypto, but is also evaluated on a real platform for performance as well as users’ reaction to it. The obvious extensions to this work would be to generalize it to more complex rendez-vous protocols, as well as other location sharing applications. It is good to see that modern mobile devices can do plenty of crypto in a few seconds, so I am very hopeful we will see more work in this field.

On The Practicality of UHF RFID Fingerprinting: How Real is the RFID Tracking Problem?
Davide Zanetti, Pascal Sachs, and Srdjan Capkun (ETH Zurich)

This paper looks at UHF tags — the dumb tags that can be read at about 2m, attached to things you buy to facilitate stock management or customer aftercare. Interestingly, this study looks at how identifiable the tags are at the physical layer, without using the actual tag ID! These techniques may therefore bypass any privacy protection that attempts to prevent access to the tag ID. It turns out that one can build a unique and reliable ID for a tag from its physical characteristics, which can be used to trace people as they move around.

What is new about this work is its focus on the practicality and cost of extracting a reliable fingerprint (previous approaches relied on expensive equipment and laboratory conditions). The solution was implemented using a cheap software radio (a USRP2 device plus a PC).

I am not quite sure what to conclude from the evaluation of the quality of the fingerprint. It seems that an adversary can place tags within one of 83 to 100 groups. Is this really a good result or not? I guess it depends on the application and the density of tags. Of course, if more than one tag is carried, the adversary can combine fingerprints to identify individuals more easily: at roughly log2(83) ≈ 6.4 bits per tag, carrying 5 tags yields an identifier of around 30 bits, assuming independence. Interestingly, there is an extensive evaluation of the stability of the fingerprint under temperature changes and mobility — it turns out that these factors do affect its quality, and further reduce the effective number of unique IDs that can be extracted (down to about 49 classes).

It would be interesting to combine this attack vector with the ideas from the first paper (treating the short physical IDs as a version of a privacy protection system) to evaluate the effectiveness of tracing a set of individuals throughout town.

Andreas Pfitzmann sadly passed away last year, and a special panel session is taking place right now at PETS 2011 commemorating his work on anonymous communications and privacy. Andreas’ technical contributions span about 30 years, and as such he can be considered a founding father of the field of anonymous communications. His work in educating policy makers, and in advocating privacy in the public sphere, had a profound impact on German technology policy.

The panel includes a short excerpt from an interview with Andreas, as well as recorded contributions by collaborators (Michael Waidner and Marit Hansen), former students (Anna Krasnova and Hannes Federrath) and people in the PET community (Paul Syverson and Caspar Bowden).

I am currently sitting at the PETS 2011 symposium in Waterloo, Canada. I will be blogging about selected papers (depending on the sessions I attend) over the next couple of days — authors and other participants are welcome to comment!

The first session is about data mining and privacy.

“How Unique and Traceable are Usernames?”
Daniele Perito, Claude Castelluccia, Mohamed Ali Kaafar, and Pere Manils (INRIA)

The first paper looks at the identifiability of on-line usernames. The authors collected user names from different sites and assessed the extent to which they can be linked together, as well as linked to a real person. Interestingly, they used Google Profiles as ground truth, since these allow users to provide links to their other accounts. First they assess the uniqueness of pseudonyms using a probabilistic model: a k-th order Markov chain is used to compute the probability of each pseudonym, and from it its information content (i.e. -log2 P(username)). The authors show that most of the usernames observed have “high entropy” and should therefore be linkable.
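A minimal sketch of this uniqueness measure, using a first-order Markov chain over characters (the paper uses higher orders, a much larger corpus, and their own smoothing choices):

    from collections import defaultdict
    from math import log2

    def train(usernames):
        # Count character-bigram transitions; ^ and $ mark start and end.
        counts = defaultdict(lambda: defaultdict(int))
        for u in usernames:
            for a, b in zip("^" + u, u + "$"):
                counts[a][b] += 1
        return counts

    def surprisal(counts, username, alpha=1.0, vocab=64):
        """-log2 P(username): high values mean rare, hence linkable, names."""
        bits = 0.0
        for a, b in zip("^" + username, username + "$"):
            total = sum(counts[a].values())
            pr = (counts[a][b] + alpha) / (total + alpha * vocab)  # smoothed
            bits += -log2(pr)
        return bits

    corpus = ["alice", "alicia", "bob99", "bobby", "carol_x", "carolyn"]
    model = train(corpus)
    for name in ["alice", "zx9_qt"]:
        print(name, round(surprisal(model, name), 1), "bits")

A username with high surprisal under the population model is unlikely to be chosen independently by two different people, which is what makes it a good linking handle.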

The second phase of the analysis looks at usernames from different services, and attempts to link them even given small modifications to the name. The key dataset used was 300K Google profiles that list other accounts as well (they used 10K tuples of usernames). They show that the Levenshtein distance (i.e. edit distance) between usernames from the same person is small compared to the distance between two random usernames. A classifier is built, based on a threshold on a probabilistic Levenshtein distance, to assess whether a pair of usernames belongs to the same or different users. The results seem good: about 50% of usernames are linkable with no mistakes.
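The linkage step can be sketched as follows (the threshold and the normalisation by length are my own illustrative choices, not the paper’s fitted probabilistic variant):

    def levenshtein(a: str, b: str) -> int:
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,               # deletion
                                curr[j - 1] + 1,           # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def same_user(u1: str, u2: str, threshold: float = 0.5) -> bool:
        # Normalised distance below a threshold => predict 'same user'.
        return levenshtein(u1, u2) / max(len(u1), len(u2)) <= threshold

    print(same_user("daniele.perito", "d.perito"))    # likely the same person
    print(same_user("daniele.perito", "xkcd_fan42"))  # likely different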

So what are the interesting avenues for future work here? First, the analysis used is a simple threshold on the edit distance metric. It would be surprising if more advanced classifiers could not be applied to the task; I would definitely try random forests. Second, the technique can be used for good, not evil: as users migrate between services, they need an effective way to find their contacts — maybe the proposed techniques can help with that.

“Text Classification for Data Loss Prevention” (any public PDF?)
Michael Hart (Symantec Research Labs), Pratyusa Manadhata (HP Labs), and Rob Johnson (Stony Brook University)

The paper looks at the automatic classification of documents as sensitive or not. This is to assist “data loss prevention” (DLP) systems, which raise an alarm when personal data is about to be leaked (i.e. when it is about to be emailed or stored on-line — mostly by mistake). Traditionally, DLP systems try to describe what is confidential through a set of simple rules, which are not expressive enough to describe and find confidential material — thus the authors present a machine learning approach that automatically classifies documents as sensitive or not, using training data. As with all ML techniques there is a fear of mistakes: the technique described tries to minimise errors when classifying company media (i.e. public documents) and internet documents, to prevent the system getting in the way of day-to-day business activities.

The results were rather interesting: the first SVM classifier used unigram features with binary weights to classify documents. At first, it had a very high rate of false positives on public documents; the classifier seemed to have a tendency to label documents as “secret”. A first solution was to supplement the training set with public documents (from Wikipedia), to better describe “what is public”. Second, the classifier was tweaked (in a rather mysterious way, to me) by “pushing the decision boundary closer to the true negatives”. A further step does 3-category classification into secret, public and non-enterprise documents, rather than just secret and not-secret.
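A sketch of the basic unigram, binary-weight SVM set-up, using scikit-learn and a stand-in corpus; this is obviously not the authors’ pipeline or data, just the shape of the approach.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    # Tiny stand-in corpus: 'secret' enterprise text vs public text, including
    # generic web documents, mirroring the trick of adding Wikipedia negatives.
    docs = [
        ("q3 revenue forecast internal draft do not distribute", "secret"),
        ("acquisition target shortlist confidential board memo", "secret"),
        ("press release announcing our new product line", "public"),
        ("wikipedia article about the history of electricity", "public"),
        ("blog post on open source release engineering", "public"),
    ]
    texts, labels = zip(*docs)

    # Unigram features with binary weights, fed to a linear SVM.
    clf = make_pipeline(CountVectorizer(binary=True), LinearSVC())
    clf.fit(texts, labels)
    print(clf.predict(["confidential forecast memo", "history of electricity"]))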

Overall, they manage to get the false positive / false negative rates down to less than 2%-3% on the largest datasets evaluated. That is nice. The downside of the approach is common to most work I have seen using SVMs: it requires a lot of manual tweaking, and it does not really explain much — it is possible to evaluate how well the technique performs on a test corpus, but difficult to tell why it works, or what makes it better than other approaches. As a result, even early positive results should be considered preliminary until more extensive evaluation is done (more like medicine than engineering). I would personally like to see a probabilistic model-based classifier on similar features that integrates the two-step classification process into one model, to really understand what is going on — but then I tend to have a Bayesian bias.

P3CA: Private Anomaly Detection Across ISP Networks
Shishir Nagaraja (IIIT Delhi) and Virajith Jalaparti, Matthew Caesar, and Nikita Borisov (University of Illinois at Urbana-Champaign)

The final paper in the session looks at privacy-preserving intrusion detection to enable cooperation between internet service providers. ISPs would like to pool data from their networks to detect attacks, either because the volume of communications is abnormal at certain times, or because some frequency component is odd. Cooperation between multiple ISPs is important, but it should not leak the traffic volumes at each ISP, since these are a commercial secret.

Technically, privacy of the computations is achieved by using two semi-trusted entities, a coordinator and a key holder. All ISPs encrypt their traffic under an additively homomorphic scheme (Paillier) under the key holder’s key, and send it to the coordinator. The coordinator uses the key holder as an oracle to perform a PCA computation that outputs the top-n eigenvectors and eigenvalues of the traffic. The cryptographic techniques are cute but standard, and involve doing additions, subtractions, multiplications, comparisons and normalizations of matrices privately, through a joint two-party computation.
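To see what is being computed under the encryption, here is a plaintext sketch of the PCA step via power iteration on a toy pooled traffic matrix; P3CA performs the equivalent multiplications and normalizations on Paillier ciphertexts, and the data below is invented.

    import numpy as np

    def top_eigenvector(Y, iters=100):
        """Power iteration on Y^T Y: the dominant principal component
        (the 'normal traffic' direction) of the pooled matrix."""
        v = np.random.default_rng(0).normal(size=Y.shape[1])
        for _ in range(iters):
            v = Y.T @ (Y @ v)
            v /= np.linalg.norm(v)
        return v

    # Rows: time slots; columns: link volumes pooled across ISPs.
    # Normal slots scale together; the last slot spikes on one link only.
    Y = np.array([[10.0, 12.0,  8.0],
                  [20.0, 24.0, 16.0],
                  [15.0, 18.0, 12.0],
                  [10.0, 20.0,  8.0]])
    v1 = top_eigenvector(Y)
    residual = Y - np.outer(Y @ v1, v1)   # traffic unexplained by normal behaviour
    print(np.linalg.norm(residual, axis=1).round(1))  # last slot stands out

Anomaly detection then flags time slots whose residual energy (traffic outside the normal subspace) is unusually large, without any ISP revealing its raw volumes.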

Surprisingly, the performance of the scheme is quite good! Using a small cluster, it can process a few tens of time slots from hundreds of different ISPs in tens of minutes. A further incremental algorithm allows on-line computation of eigenvector/eigenvalue pairs in seconds, making real-time use of the privacy-preserving algorithm possible (5 minutes of updates takes about 10 seconds to process).

This is a surprising result: my intuition would have been that the matrix multiplications would make the approach impractically slow. I would be quite interested to compare the implementation and algorithm used here against a general MPC compiler (under the same honest-but-curious model).

Shishir Nagaraja has pointed out that our Drac anonymity system is not the first to consider an anonymity network overlaid on a social network. The performance-versus-security trade-offs of routing messages over a social network were already considered in his work entitled “Anonymity in the Wild”.

Shishir Nagaraja: Anonymity in the Wild: Mixes on Unstructured Networks. Privacy Enhancing Technologies 2007: 254-271 [pdf][ppt]

This is important prior work and we should have cited it properly. It presents an analysis of the anonymity provided by different synthetic social network topologies, as well as by real-world data from LiveJournal.
