The UK goes every ten years through a national census, where every household is called to fill in details about their demographics, habits, travel and income. The next one will be the UK 2011 census.

The office for national statistics has a statutory duty to ensure that the data released from this census cannot be used to identify any individual or to infer any of unknown attribute. Techniques for doing so are called statistical disclosure control, and have been the subject of intense study for the last 40 years at least. One could never have guessed by reading the documents on confidentiality for the next UK census.

To make a long story short: the ONS never considered modern well defined notions of privacy, it lacks a reliable evaluation framework to establish the degree of risk of different methods (let alone utility), and has proposed disclosure control measures that fall rather short of the state of the art.

Moving households around (a bit)

The consultation is not totally over yet, but the current favorite after two rounds of evaluation seems to be a technique called “Record Swapping”. How does it work? The technique takes the database of all responses to the census and outputs another database, that is sufficiently different to avoid identification and inference. Record swapping first categorises all records by the household size, sex, broad age, and hard-to-count variables. Then it selects 2-20% of the records, and each of them are paired with a record from the same category. Then the geographical data of each pair of records (yes, right, only the geographical data) are swapped.

This procedure has the effect to disperse geographically the population a bit so that, it is not possible to know whether single cells in tables are indeed providing information about an individual in a region or, whether they are the product of a swap from a different region. The advantage is that the totals are the same (since swapping things around is invariant to addition), the swaps are with “similar” households, and the procedure is simple to implement.

This is in-line with the definition of privacy of the census office, namely that: 

“The Registrars General concluded that the Code of Practice statement can be met in relation to census outputs if no statistics are produced that allow the identification of an individual (or information about an individual) with a high degree of confidence. The Registrars General consider that, as long as there has been systematic perturbation of the data, the guarantee in the Code of Practice would be met.”

Problems with “Record Swapping”

So far a whole process has been followed to evaluate a list of proposed disclosure control measures, present a methodolody to evaluate them, shortlist some, and perform more in-depth research about their utility and privacy. There is a lot of repetition in these documents, a few ad-hoc indicators of quality and privacy, and no security analysis what-so-ever about inference attacks on the proposed schemes. The subject of ” disclosure by differencing” is left as a suggestion for future work in the latest interim report, while the only method left on the list is Record Swapping, as well as ABS, that has apparently not been tested yet at all.

Why is that a problem? Records include many other potentially identifying fields aside from location. Since the rest of the record stand as it is, and is aggregated into tables, with a secret small cell adjustment technique, we cannot really be sure at all that there are no re-identification attacks. (Apparently revealing the details of the technique cannot be divulged for confidentiality reasons, violating even the most basic principle of security engineering! See page 3).

The utility measures used to assess how acceptable these disclosure control measures will be to data users (Shlomo et al.), are themselves very simplistic and do not offer very tight bounds on possible errors but I will leave this matter for the statisticians to blog about.

To make the problem worse, this time the ONS, is seriously thinking of allowing data users to submit their own queries to the database of statistics. The queries are not likely to be full SQL any time soon, but tables on 3 categories (called cubes) are likely to be allowed. This leaves wide open quite a range of attacks in the literature on inference in statistical databases.

At this point there is absolutely no evidence that the disclosure control scheme is actually secure, which in security engineering means that it is probably not.

How did we get to this situation?

It seems the bulk of the work on disclosure control has been done by the ONS, in conjunction with researchers from the University of Southampton. None of the authors of any of the evaluations has a substancial research experience in privacy technology or theoretical computer security that deals with these privacy matters in a systematic way.

What is revealing is the fact that the most relevant related work is never mentioned. It includes:

  • The work of Denning on trackersand inference in statistical databases (1980). Instead the archaic term “differencing” is used.
  • The work of Sweeney and Samarati on linkage attacks and k-anonymity (1997).
  • The work of Dwork on Differential Privacy (2007), which is the most current and strongest definition of privacy for statistical databases.

These works show repeatedly that ad-hoc inference control measures, that only aim to suppress a handful of known and obvious attacks, are systematically bypassed.

Dwork in her work on Differential Privacy (that won the 2009 year’s PET Award) provides clear arguments on why simpler ad-hoc techniques cannot provide the same guarantee of privacy: their results can be aggregated with side information known to the adversary to facilitate inference. Differential privacy on the other hand guarantees that the results of a query to the database, or published table, reveals no more information when composed with other such queries or any side information. 

This is a hot topic in research today, and all the details may not be ready for a census in 2 years time. This does not justify the ONS’s ignorance of this field.

I just come back from a visit to COSIC at K.U. Leuven, to teach a course on Computer Security. Claudia Diaz and myself discussed over lunch the idea of putting together a syllabus for Privacy Technologies. Many in this field have been teaching courses and giving guest lectures, but there does not seem to be yet a canonical curriculum, describing that an advanced course in Privacy Technology should teach.

Here is my attempt at proposing such a syllabus — which I will probably revise after discussions at PETS 2009 next week.

  1. An introduction to Privacy Technology
    An overview of the basic concepts, different fields like technology and law, motivation, threat models, Soft versus Hard privacy technology.
    Slides from the 2007 COSIC course: Introduction to Privacy Technology [pdf]
    (Claudia Diaz has vastly improved these slides to present a lecture on the same topic in this years COSIC course.)
  2. Privacy in authentication
    Modern authentication protocols, initiator privacy and responder privacy, JFKi and JFKr examples, secure password authentication, PAK.
    Slides from Estonia computer security course in 2007: Secure authentication[pdf, start at slide 3]
  3. Selective Disclosure Credentials
    Zero knowledge proofs, selective disclosure for discrete logs, Brands credentials, CL signatures and CL credentials, e-cash, abuse prevention.
    Slides from Estonia computer security course in 2007: Anonymous credentials[pdf, start at slide 45]
  4. Anonymous communications
    Proxies, Crowds, DC networks, mix networks and onion routing.
    Slides from ITE talk in 2006: Introducing Anonymous Communications[pdf]
  5. An introduction to traffic analysis
    History, identification, information extraction, military applications, internet security, traffic data availability
    Slides from SantaCrypt 2005 in Prague: An introduction to traffic analysis[pdf]
  6. The traffic analysis of anonymous communications
    Cryptographic attacks, long term intersection and disclosure, short term disclosure, bridging, network discovery.
    Slides from Umass Amherst talk: Introduction to traffic analysis[pdf]
  7. Privacy in Databases
    Inference control, k-anonymity, differential privacy, perturbation, trackers.
    Slides from CFP 2007: Privacy in Databases[pdf, start at slide 58]
  8. Privacy in Storage
    Encrypted storage, steganographic storage, remote storage, traffic analysis of storage protocols.
  9. Secure Elections
    Electronic voting technologies, secure crypto elections, manual zero-knowledge proofs, receipt freeness, robust mixing.
  10. Censorship resistance and availability
    Blocking technologies, counter-blocking technologies, RF technologies, peer-to-peer file sharing, decentralisation and reputation technologies, sybil attacks.
  11. Location privacy
    Location based services, Mix zones, ad-hoc network privacy, location privacy friendly location services (PriPAYD), charging schemes.
  12. Identity management protocols
    Federated identity management, Liberty, InfoCards, OpenID, PRIME project concepts, privacy policies, P3p, SecPAL.
  13. Economic, legal and policy issues of Privacy Technology
    Privacy economics & attitudes, data protection, data retention, interception by design, lawful access, coercion, privacy as a right, health information.

Note that the order of the topics is arbitrary, and mostly related to what slides I have already available. One could start with less technical subjects and then go to the more cryptographic and statistical topics. If anyone has any nice pointers to slide decks for the topics that have none for the moment, I would appriciate them.

Lords recommend PETs

6 February 2009

The house of Lords Constitution Committeehas just published a report on Surveillance: Citizens and the State as well as the evidence they heard. As part of their recommendations they push Privacy enhancing Technologies to be part of the procurement process of government projects. In particular they say:

485. We recommend that the Government review their procurement processes so as to incorporate design solutions that include privacy-enhancing technologies in new or planned data gathering and processing systems. (paragraph 349)

They also push, albeit in an indirect way, for privacy enhanced identification schemes and ID cards, citing the example of Austria. This is basically a recommendation to implement selective disclosure credential technologies:

478. We recommend that the Government’s development of identification systems should give priority to citizen-oriented considerations. (paragraph 268)

Which refers to:

268. The Information Commissioner’s Office (ICO) drew attention to the use in Austria of a system of identification numbers that allows access to information in different databases “without the need for a single widely known personal identification number that may be misused.” (p 5) The Royal Academy of Engineering (RAE) explained that it is possible for individuals to fulfil their legitimate need or desire to maintain multiple roles or identities in transactions with state or other organisations and to avoid the possibility of those organisations needlessly correlating them. The technology involved in identification can be developed to suit an individual’s preference to keep domestic status and work life separate, where the protection of identity is necessary to avoid abusive relationships or stalking, or where witnesses and children need protection.118 We recommend that the Government’s development of identification systems should give priority to citizen-oriented considerations.

This is all good news! It is indeed at the procurement phase that such requirements for PETs should be specified and entrenched in the delivery contracts. Negotiating PETs for complex surveillance technologies will also make the cost of recording data just-in-case visible.

The German Big Brother award winners’ list is out, and the Technology award was given to a Pay-as-you-drive insurance company. The nomination and award statement reads:

The Big Brother Award 2007 in the “Technology” category goes to

PTV Planung Transport Verkehr AG,
represented by Dr Hans Hubschneider

for their system for individual rating of car insurances with the so-called “pay-as-you-drive” technology, i.e. a device that records routes and driving behaviour in a car and transmits these data to the insurance company. [...]

It is good to see that for once Privacy Technology proposals are not lagging behind: an interdisciplinary team from K.U.Leuven has already put forward a proposal for a privacy friendly pay-as-you-drive scheme.

Carmela Troncoso, George Danezis, Eleni Kosta, Bart Preneel: Pripayd: privacy friendly pay-as-you-drive insurance. WPES 2007: 99-107

The list of accepted papers for the Privacy Enhancing Technologies Symposium for 2008 (PET) has just been published. The program this year is very heavy on anonymous communication, and traffic analysis, cementing the symposiumas the de-facto venue in these disciplines.

A selection of papers to appear, that make a very interesting read  (thanks to the authors giving me a copy!), include:

  • Studying Timing Analysis on the Internet with SubRosa
    Hatim Daginawala and Matthew Wright (University of Texas at Arlington, USA)
  • Perfect Matching Disclosure Attacks
    Carmela Troncoso, Benedikt Gierlichs, Bart Preneel, and Ingrid Verbauwhede (COSIC, K.U. Leuven, Belgium)
  • On the Impact of Social Network Profiling on Anonymity
    Claudia Diaz, Carmela Troncoso (K.U.Leuven, Belgium), and Andrei Serjantov (The Free Haven Project, UK)
  • Chattering Laptops
    Tuomas Aura (Microsoft Research, UK), Janne Lindqvist (Helsinki University of Technology, Finland), Michael Roe (Microsoft Research, UK), and Anish Mohammed (Royal Holloway, University of London, UK)
  • Metrics for Security and Performance in Low-Latency Anonymity Systems
    Steven J. Murdoch and Robert N. M. Watson (Computer Laboratory, University of Cambridge, UK)

I should blog in detail about our own contributions, when the final drafts are in, so that I can provide a link to the full text:

  • Bridging and Fingerprinting: Epistemic Attacks on Route Selection
    George Danezis (Microsoft Research, Cambridge, UK) and Paul Syverson (Naval Research Laboratory, USA)
  • How to Bypass Two Anonymity Revocation Schemes
    George Danezis (Microsoft Research, Cambridge, UK) and Len Sassaman (K.U. Leuven, Belgium)

Lets not forget that the PET Award for 2008 deadline is still ahead of us, and good papers from the last two years deserve to be nominated for it!

In security engineering it is a known fact that few practical systems can claim unconditional security. Security is always reducible to a set of assumptions that are believed to hold.

In traditional cryptography these assumptions are quite clear: keys have to be kept safe; block ciphers and hash functions should be indistinguishable from random permutations and random functions; and public key primitives rely on the hardness of a small set of mathematical problems.

Traditional security engineering is encouraging us to think along similar lines: any secure system has to have a well defined Trusted Computing Base (TCB) that functions according to the specification, and ensure that the security properties of the larger system overall hold. It is the role of the TCB to make sure access control, key storage, memory protection, and privileges, to only mention a few hold in commodity operating systems. In operating systems with mandatory access control the TCB is in charge of imposing the security policy, on sometimes unwilling users. Often the correctness of the TCB cannot be shown mathematically and sound engineering practices have to be employed to implement it: these include code reviews, safe languages, verification and most importantly simplicity and few lines of code.

Hidden under the mathematical assumption or the secure TCB assumption, are people. People will hold access to keys, and coalitions of them are going to have access and permissions to modify the TCB. Shamir’s secret sharing scheme, despite being mathematically unconditionally secure, still relies on some fraction of users holding shares to abide by a security policy and not divulge their shares. For a long time people were considered outside the system, and now security engineering starts considering them explicitly.

So far so good: mathematics, trusted computing bases, or quorums of people are a same foundation on which one can build secure systems.

Yet a worrying trend comes from some corners of systems research: researchers extend or build over complex underlying system without fully understanding their security assumptions — and assume that the final system will be secure. A typical example is any security paper that makes use of a secure Distributed Hash Table (DHT). Some paper discuss how one could build such a beast, either making assumption about admission control and detection of mis-routing, or trying to use social structure to alleviate routing problems and the sybil attack. Those `secure’ DHTs are fragile, make multiple assumptions about their operational environment and impose restrictions and overheads on how the DHT can be used. It is not acceptable to use them as black box replacements for insecure DHTs thinking that their security properties will magically make the overall design secure.

Even worse than not understanding the composition of security properties or assumptions it to make assumptions that are demonstrably false in the real world for the sake of providing a security argument. To use again an example of DHTs, it has become a meme in the systems research community that it is acceptable to do addmission and rate control to limit the Sybil attack, based on IP addresses. The assumption is that IP addresses are a scarse resource, and that an adversary will find it hard to be able to route streams from multiple IP addresses. This is simply not the case: powerful adversaries control large national firewalls / routers that they can use to pretend they are any machine they wish, and pathetic adversaries can buy bot-net time on thousands of machines for little money. How have we allowed this bogus `security assumption’ to appear in so many publications?

Another meme is that selecting a quorum of nodes from an undefined population will increase security. This is an exterme form of the `kindness of strangers’ assumption and comes down to assuming that given enough people a proportion of them will be good. This might be true (even though sociologists will have to work hard to prove it) if the population is guaranteed to be drawn at random, or representative at least of the user base. But if a system can be joined by adversaries perfoming a Sybil attack, it is not clear there is any basis for trusting 1000 nodes more than 10, or think that within those 1000 nodes only 30% would be corrupt. This is a parody of the quorum of people assumption in Shamir’s secret sharing sheme: in secret sharing you know who the people are, and assume that a quotum of them will be honest — given that you know their identity, and have made a trust decision upon it!

Researchers in the Privacy Technology field have for a long time warned that services collecting and keeping personally identifiable information may face hidden costs down the line — in case the data gets compromised! If those in charge of procurement took this risk into account, then they would start finding Privacy-friendly technologies, that cost more upfront, cheaper in the long run. This would be true, even without taking into account abstract costs such as reputation loss, resulting from unauthorised access.

For a long time this theory went untested, but one case has emerged that puts a value on this risk. As an article from Infosecurity reports on the 17th December: the UK’s Financial Services Authority has fined life assurance company Norwich Union Life £1.26 million ($2.54m, €1.77m) for “not having effective systems and controls in place to protect customers’ confidential information and manage its financial crime risks.” (Emphasis added.)

This is a lot of money, and it would have been even more (NU got a 30% discount) if the building society had not fully cooperated from the start of the investigation. Given that in today’s world the probability of compromise of non protected information is very high, this gives us an estimate on how much NU, should have spent on Privacy Technologies, to ensure they can conduct their business without collecting much data as well as better protecting the data they collect.

Incidentally, and more related to traffic analysis, another branch of Norwich Union runs wide-scale trials for Pay-as-you-drive insurance. They better take note of what happened to their sister company, and change their technical infrastructure to something more privacy friendly: right now they simply collect and store the full location over time of all insured vehicles. Some colleagues any myself have written about a detailed architecture to provide the same functionality without the collection of such data:

Troncoso, C., Danezis, G., Kosta, E., and Preneel, B. 2007. Pripayd: privacy friendly pay-as-you-drive insurance. In Proceedings of the 2007 ACM Workshop on Privacy in Electronic Society (Alexandria, Virginia, USA, October 29 – 29, 2007). WPES ‘07. ACM, New York, NY, 99-107. DOI= http://portal.acm.org/citation.cfm?doid=1314333.1314353

I was kindly invited by Peeter Laud and Jan Villemson to teach part of the Tartu University course on cryptographic protocols. The theme chosen for the 6 hours of lectures was “Identity and Anonymity“.

 It started with an introduction to authentication protocols, illustarted by the state-of-the-art Just Fast Keying, an authenticated Diffie-Hellaman with DoS protection and privacy, as well as the PAK protocol for authenticated key exchange using a short password (like a PIN). The second 1.5 hours of the lecture series went into the technical details of Brands anonymous credentials, building from simple zero-knowledge protocols, such as Schnorr identification, all the way to how to prove a that a set of linear relation holds on some attributes.

  • Lecture notes on authentication protocols and anonymous credentials (PDF, PPTX)

The second half of the lecture series (3 hours) was devoted to anonymous communications — my pet subject. It contained a review of the Dining Cryptographers protocol for unconditional anonymity, as well as the basic mix constructions to build practical systems. It also described how onion routing works as well as how to measure anonymity and the basics of disclosure attacks on repeated communications.

  • Lecture notes on anonymous communications and traffic analysis (PDF, PPTX)