The UK goes every ten years through a national census, where every household is called to fill in details about their demographics, habits, travel and income. The next one will be the UK 2011 census.

The office for national statistics has a statutory duty to ensure that the data released from this census cannot be used to identify any individual or to infer any of unknown attribute. Techniques for doing so are called statistical disclosure control, and have been the subject of intense study for the last 40 years at least. One could never have guessed by reading the documents on confidentiality for the next UK census.

To make a long story short: the ONS never considered modern well defined notions of privacy, it lacks a reliable evaluation framework to establish the degree of risk of different methods (let alone utility), and has proposed disclosure control measures that fall rather short of the state of the art.

Moving households around (a bit)

The consultation is not totally over yet, but the current favorite after two rounds of evaluation seems to be a technique called “Record Swapping”. How does it work? The technique takes the database of all responses to the census and outputs another database, that is sufficiently different to avoid identification and inference. Record swapping first categorises all records by the household size, sex, broad age, and hard-to-count variables. Then it selects 2-20% of the records, and each of them are paired with a record from the same category. Then the geographical data of each pair of records (yes, right, only the geographical data) are swapped.

This procedure has the effect to disperse geographically the population a bit so that, it is not possible to know whether single cells in tables are indeed providing information about an individual in a region or, whether they are the product of a swap from a different region. The advantage is that the totals are the same (since swapping things around is invariant to addition), the swaps are with “similar” households, and the procedure is simple to implement.

This is in-line with the definition of privacy of the census office, namely that: 

“The Registrars General concluded that the Code of Practice statement can be met in relation to census outputs if no statistics are produced that allow the identification of an individual (or information about an individual) with a high degree of confidence. The Registrars General consider that, as long as there has been systematic perturbation of the data, the guarantee in the Code of Practice would be met.”

Problems with “Record Swapping”

So far a whole process has been followed to evaluate a list of proposed disclosure control measures, present a methodolody to evaluate them, shortlist some, and perform more in-depth research about their utility and privacy. There is a lot of repetition in these documents, a few ad-hoc indicators of quality and privacy, and no security analysis what-so-ever about inference attacks on the proposed schemes. The subject of ” disclosure by differencing” is left as a suggestion for future work in the latest interim report, while the only method left on the list is Record Swapping, as well as ABS, that has apparently not been tested yet at all.

Why is that a problem? Records include many other potentially identifying fields aside from location. Since the rest of the record stand as it is, and is aggregated into tables, with a secret small cell adjustment technique, we cannot really be sure at all that there are no re-identification attacks. (Apparently revealing the details of the technique cannot be divulged for confidentiality reasons, violating even the most basic principle of security engineering! See page 3).

The utility measures used to assess how acceptable these disclosure control measures will be to data users (Shlomo et al.), are themselves very simplistic and do not offer very tight bounds on possible errors but I will leave this matter for the statisticians to blog about.

To make the problem worse, this time the ONS, is seriously thinking of allowing data users to submit their own queries to the database of statistics. The queries are not likely to be full SQL any time soon, but tables on 3 categories (called cubes) are likely to be allowed. This leaves wide open quite a range of attacks in the literature on inference in statistical databases.

At this point there is absolutely no evidence that the disclosure control scheme is actually secure, which in security engineering means that it is probably not.

How did we get to this situation?

It seems the bulk of the work on disclosure control has been done by the ONS, in conjunction with researchers from the University of Southampton. None of the authors of any of the evaluations has a substancial research experience in privacy technology or theoretical computer security that deals with these privacy matters in a systematic way.

What is revealing is the fact that the most relevant related work is never mentioned. It includes:

  • The work of Denning on trackersand inference in statistical databases (1980). Instead the archaic term “differencing” is used.
  • The work of Sweeney and Samarati on linkage attacks and k-anonymity (1997).
  • The work of Dwork on Differential Privacy (2007), which is the most current and strongest definition of privacy for statistical databases.

These works show repeatedly that ad-hoc inference control measures, that only aim to suppress a handful of known and obvious attacks, are systematically bypassed.

Dwork in her work on Differential Privacy (that won the 2009 year’s PET Award) provides clear arguments on why simpler ad-hoc techniques cannot provide the same guarantee of privacy: their results can be aggregated with side information known to the adversary to facilitate inference. Differential privacy on the other hand guarantees that the results of a query to the database, or published table, reveals no more information when composed with other such queries or any side information. 

This is a hot topic in research today, and all the details may not be ready for a census in 2 years time. This does not justify the ONS’s ignorance of this field.

Micah Sherr presented at PETS a few days ago his work on ”Scalable Link-Based Relay Selection for Anonymous Routing“. The key idea is that paths are generated by taking into account the network performance of each link to be used. The overhead of distributing performance information can be reduced by associating with each server a network coordinate, that allows to estimate the latency between pairs of nodes.

This is a pure path selection proposal, as quite a few have appeared in the past year to reduce latency, or increase node utilization in Tor. The question with all those proposals is: how much anonymity would these path selection strategies provide?

The methodology we present in  “The Bayesian Traffic Analysis of Mix Networks” provides a way of answering such questions, by carefully modelling the path selection strategy. Applying the same methodology to these path selection proposals would be of clear benefit, and an excellent project for anyone interested in understanding better how to apply inference based techniques to traffic analysis.

In the last year, we have been developping a set of systematic techniques to analyse anonymity systems, to perform traffic analysis. These cast the problem of traffic analysis as a Bayesian inference problem, where the adversay observes some traces, according to a threat model, and then has to infer the hidden state of the system, that is equivalent to tracing who is talking to whom.

So far we have looked at the analysis of mix networks, the analysis of Crowds, and a Bayesian approach to long term intersection attacks. The papers describing each of these are available online:

I am currently at ACM CCS 2008 listening to the talk on “Dependent link padding algorithms for low latency anonymity systems” by Wei Wang, Mehul Motani and Vikram Srinivasan (the pdf does not seem to be on-line yet). They propose a scheme to provably defeat all packet matching attacks against low-latency anonymity systems, by introducing the minimal amount of cover traffic. The results are theoretically well founded, and of great practical importance since they show how one could provide strong anonymity without “constant rate” padding (as it is often assumed necessary.)

A quick fix for Minx

18 August 2008

Last month I attended the Privacy Enhancing Technologies ( PET 2008 ) Symposium in Leuven, Belgium. The programwas fantastic, with a strong focus on anonymous communications, and many papers on traffic analysis. The associated HotPETS event, was also very fun, with plenty of time for discussion, and the added advantage that all the papers are on-line.

A paper that had to catch my attention was entitled “Breaking and Provably Fixing Minx” by Erik Shimshock, Matthew Staats, and Nicholas Hopper, that shows an attack against the Minx scheme Ben Laurie and myself had proposed back in 2004. Minx is a cryptographic packet format to be used by anonymous remailers (or mixes) for high-latency, email like, communication. It was designed to be space efficient, meaning that we radically cut down on the padding and redundancy within the packet, and used raw RSA.

That last use of raw RSA proved to be a bridge too far: recent results show that all bits of RSA are hardcore, meaning that if you do not know the key you cannot guess them. Sadly the inverse is also true, and if you can know even a single bit of the plaintext with non-negligible advantage, there is a polynomial time algorithm to extract the key.

Read the rest of this entry »

The Privacy Enhinacing Technologies Symposium for 2008 is getting closer, and many of us are looking forward to a program packed with anonymous communications and traffic analysisresearch. Paul Syverson and myself (George Danezis) will be presenting our recent work on “Bridging and Fingerprinting” (PDF), a family of attacks that affect mix or onion routing based systems in which users do not know all possible relays in the network.

The starting point of our analysis is that it is getting expensive for all clients in anonymity networks (like Tor) to know all anonymizing routers, along with their cryptographic keys and addressing information. There are about a few thousands right now, that their number will hopefully go up. So would anything bad happen if each client only knows a random subset of those routers, and builds anonymous circuits using those? We show that there is a new class of attacks that might be possible.

First there is path or circuit Fingerprinting, first introduced in a work with Richard Clayton. If Alice builds paths from a small subset of routers that she knows (that is also known to the adversary), it is likely that the resulting paths and path fragments uniquely identify her. No other client would know all nodes necessary to build them. If the adversary observed a circuit fragment, they can reduce the set of possible initiators to those  clients that know all relays in the fragment — a number that is smaller than all clients in the network.

Secondly there is Bridging, the novel attack proposed. The adversary observes connections or messages going through an honest anonymizing relay, and tries to infer which input correspond to which output. It must be the case that for any potential connection though the honest node the resulting path fragment should be known to at least one node in the network. Therefore the adversary can eliminate all potential routes that could not have been created by any client, and may end up `bridging’ this honest stage.

The paper is filled with theory and probabilities describing when these attacks succeed. But what do they mean for anonymous communication designers? First they are not really a direct threat for low-latency communications like Onion Routing or Tor, since the effort of collecting the data necessary to perform these attacks is outside the threat model of these systems — an adversary that can perform bridging or fingerprinting is likely to be also to otherwise break those systems. Still it is important to point out that the only information necessary to perform them is link level information, and not fine-grained traffic data necessary for timing analysis. They are still within the threat models of mix networks like mixminion.

Still this work shows that naively allowing users to know only part of the network reduces the security of anonymous communications. Non-naive strategies for achieving such architectures could include splitting the networks or learning a random subset of nodes without the adversary finding them out. The second architecture opens the way for a new type of service: a Private Information Retrieval based Directory Server. Clients can connect to it and retrieve some router records, without any observer or the service itself learning that set: this would make both Bridging and Fingerprinting much harder.

For a long time I have been sceptical about Private Information Retrieval (PIR) schemes and security schemes based on them. My first experience of PIR was in the single server setting, where communication and computation complexity makes them impractical. Re-reading the The Pynchon Gate I realized that multi-server PIR systems are computationally cheap, bandwidth efficient and relatively simple to implement.

The ‘only’ downside of multi-server PIR is that they are subject to compulsion attacks. A powerful adversary can force servers, after a query, to reveal the client queries, and can infer which document was retrieved. This is an inherent limitation of using a collection of trusted parties, so it is difficult to eliminate. On the other hand a system can make the task of the attacker much more expensive and difficult, though the use of forward security mechanisms.

Here is a proposal for achieving forward-secure compulsion-resistant multi-server PIR: the user contacts the servers one by one, using an encryption channel providing forward secrecy (OTR would work; so would SSL using signed ephemeral DH.) After the result of the query is returned, the server securely deletes all information about the query, and forgets the session keys associated with the channel. At this point an adversary will never be able to retrieve any information about the query or the result, even if they get access to all the secrets on the server.

The user can then proceed to perform the same protocol sequentially with all the other servers participating in the PIR scheme. After sessions with each server close, the user is guaranteed that the query information will never be retrieved in the future. A single honest server, willing to provide strong guarantees against compulsion, is sufficient to guarantee this property, even if all the others log requests and are ready to hand them over to the adversary.

Furthermore the sequential nature of the requests allow a client to terminate the query early, if there is any suspicion that one or more servers act under compulsion. This could be detected through a covert channel, a change of key, or unavailability. This technique is a further argument for operators to terminate their services instead of giving in to compulsion.

The list of accepted papers for the Privacy Enhancing Technologies Symposium for 2008 (PET) has just been published. The program this year is very heavy on anonymous communication, and traffic analysis, cementing the symposiumas the de-facto venue in these disciplines.

A selection of papers to appear, that make a very interesting read  (thanks to the authors giving me a copy!), include:

  • Studying Timing Analysis on the Internet with SubRosa
    Hatim Daginawala and Matthew Wright (University of Texas at Arlington, USA)
  • Perfect Matching Disclosure Attacks
    Carmela Troncoso, Benedikt Gierlichs, Bart Preneel, and Ingrid Verbauwhede (COSIC, K.U. Leuven, Belgium)
  • On the Impact of Social Network Profiling on Anonymity
    Claudia Diaz, Carmela Troncoso (K.U.Leuven, Belgium), and Andrei Serjantov (The Free Haven Project, UK)
  • Chattering Laptops
    Tuomas Aura (Microsoft Research, UK), Janne Lindqvist (Helsinki University of Technology, Finland), Michael Roe (Microsoft Research, UK), and Anish Mohammed (Royal Holloway, University of London, UK)
  • Metrics for Security and Performance in Low-Latency Anonymity Systems
    Steven J. Murdoch and Robert N. M. Watson (Computer Laboratory, University of Cambridge, UK)

I should blog in detail about our own contributions, when the final drafts are in, so that I can provide a link to the full text:

  • Bridging and Fingerprinting: Epistemic Attacks on Route Selection
    George Danezis (Microsoft Research, Cambridge, UK) and Paul Syverson (Naval Research Laboratory, USA)
  • How to Bypass Two Anonymity Revocation Schemes
    George Danezis (Microsoft Research, Cambridge, UK) and Len Sassaman (K.U. Leuven, Belgium)

Lets not forget that the PET Award for 2008 deadline is still ahead of us, and good papers from the last two years deserve to be nominated for it!

I was kindly invited by Peeter Laud and Jan Villemson to teach part of the Tartu University course on cryptographic protocols. The theme chosen for the 6 hours of lectures was “Identity and Anonymity“.

 It started with an introduction to authentication protocols, illustarted by the state-of-the-art Just Fast Keying, an authenticated Diffie-Hellaman with DoS protection and privacy, as well as the PAK protocol for authenticated key exchange using a short password (like a PIN). The second 1.5 hours of the lecture series went into the technical details of Brands anonymous credentials, building from simple zero-knowledge protocols, such as Schnorr identification, all the way to how to prove a that a set of linear relation holds on some attributes.

  • Lecture notes on authentication protocols and anonymous credentials (PDF, PPTX)

The second half of the lecture series (3 hours) was devoted to anonymous communications — my pet subject. It contained a review of the Dining Cryptographers protocol for unconditional anonymity, as well as the basic mix constructions to build practical systems. It also described how onion routing works as well as how to measure anonymity and the basics of disclosure attacks on repeated communications.

  • Lecture notes on anonymous communications and traffic analysis (PDF, PPTX)

I just finished reading:

Anonymous Networking amidst Eavesdroppers (PDF)
by Parvathinathan Venkitasubramaniam, Ting He, and Lang Tong.
Pre-print available as arXiv:0710.4903v1 at arxiv.org, October 2007.

This piece of work has the potential to become a milestone is anonymity research. It puts forward, for the first time, an analytical model of the anonymous communication capacity for a certain degree of quality of service, in terms of latency and throughput.

The key idea behind the anonymity mechanism is that relays transmit a schedule that is statistically independent from the messages they receive. This ensure that no traffic characteristics propagate in the network to allow tracing. When it is time to send a message out, a node checks if an input message is in the queue, otherwise it sends a dummy. Input messages in the internal queues expire after a certain time to ensure that message are delivered in a timely fashion (introducing unreliability, and thus a reduction in capacity, instead.)

The measure of anonymity used is entropic (but different from previous ones), in that it measure the fraction of information gain. Say the adversary has H(S) bits of uncertainty before the observation of the system, and H(S|Y) afterwards, the metric of security is defines as a = H(S|Y) / H(S). It is cute that Fano’s inequality allows to calculate the probability the adversary identifies the correct set of communicating partners from this metric.

Excellent, highly recommended, work!