I was very honored to be invited to Asiacrypt 2013 to present some of our work on privacy-friendly computations. It was an opportunity to consolidate a presentation that includes an overview of privacy-friendly billing and aggregation for smart metering. The slides of the presentation are available in Powerpoint 2012 format (and an older ppt format).

The key references providing more technical details on smart-metering privacy are:

Privacy-friendly Aggregation for the Smart-grid
Klaus Kursawe (Radboud Universiteit Nijmegen) and George Danezis and Markulf Kohlweiss (Microsoft Research)

Privacy for smart electricity provision seems to be a rising topic, and this year there is a whole session on it at PETS 2011. The first paper (of which I am a coauthor) looks at the problem of gathering aggregate data from groups of smart meters without allowing any third party to get the individual measurements. This can be applied as a PET to solve real-world problems such as fraud detection, leakage detection, load estimates, demand response, and weather prediction — all of which only require aggregate data (sometimes in real time), not individual measurements.

The key challenge in providing a private aggregation protocol is the specific constraints of smart meters. They are cheap devices, with modest resources, hardly any bandwidth, and no ability to communicate with each other. Two specific protocols are presented: the first allows the sum of meter readings to be compared with a reference value (perhaps measured at a feeder meter). This protocol allows for fancy proofs of correctness, but it is slow in terms of computation and bandwidth (it requires public-key operations for each reading). The second protocol is extremely fast and has no communication overhead. In both cases a pragmatic approach to the threat model is followed: we assume that the utilities honestly define the groups of meters and facilitate the key management protocol — for the second protocol there is no overhead of public-key operations after the initial key setup.
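To give a flavour of the second protocol, here is a minimal sketch of the masking idea in Python. It assumes a dealer hands each meter a random share of zero for the round; the actual protocol derives these shares from pairwise keys and a hash function per period, so no dealer is needed after setup.

```python
# Sketch of mask-based aggregation: each meter blinds its reading with a share
# of zero, so only the group total is recoverable by the aggregator.
import secrets

MODULUS = 2**64  # large enough that realistic consumption totals never wrap around

def zero_shares(n):
    """Random values s_1..s_n with sum(s_i) == 0 (mod MODULUS)."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n - 1)]
    shares.append(-sum(shares) % MODULUS)
    return shares

def blind(reading, share):
    return (reading + share) % MODULUS

readings = [12, 7, 30, 0, 18]            # Wh consumed by each meter this period
shares = zero_shares(len(readings))      # in the real protocol: derived per round
blinded = [blind(r, s) for r, s in zip(readings, shares)]

# The aggregator only ever sees `blinded`, yet the masks cancel in the sum.
total = sum(blinded) % MODULUS
assert total == sum(readings)
print("aggregate consumption:", total)
```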

The key highlight of this work is not so much its technical depth (tricks with DC networks and hash functions that would not surprise any PETS regular). What is interesting is that the protocols were designed for a real industrial application and are now fully integrated into real smart meters and their communication protocols, in collaboration with our partners at Elster.

Plug-in privacy for Smart Metering billing
Marek Jawurek, Martin Johns, and Florian Kerschbaum (SAP Research)

This second paper looks at the problem of billing for fine-grained time-of-use tariffs — where energy consumed at different times of day costs a different rate per unit. This is a very important topic, as correct billing under time-of-use tariffs is a key driver of fine-grained data collection through smart meters — if we can do billing privately, then less personal information needs to be collected.

Technically the protocols proposed are based on the homomorphic properties of Pedersen commitments: readings are commitments, and you can use multiplication by a constant and addition to compute the bill, and (most importantly) prove that it is correct. The system model is that the meter outputs signed commitments of readings, a privacy component computes the bill and proofs of correctness, and those are sent to the supplier for verification (and printing the bills!).
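To make the homomorphic trick concrete, here is a toy sketch of the billing check in Python. The group parameters, field names and tariffs are mine and deliberately insecure; the real protocol uses a proper prime-order group and signed commitments from the meter.

```python
# Toy Pedersen-commitment billing check (illustrative parameters, not secure).
# The meter commits to each reading; the household computes the bill and the
# combined blinding; the supplier verifies the bill against the product of
# commitments without ever seeing individual readings.
import secrets

P = 2**127 - 1          # toy modulus; a real system uses a proper prime-order group
Q = P - 1               # exponents are reduced mod Q here
G, H = 5, 7             # toy generators; in practice log_G(H) must be unknown

def commit(m, r):
    return (pow(G, m, P) * pow(H, r, P)) % P

readings = [3, 5, 2, 8]              # kWh per half-hour slot (from the meter)
tariffs  = [10, 10, 25, 25]          # pence per kWh per slot (public)

rands = [secrets.randbelow(Q) for _ in readings]
commitments = [commit(m, r) for m, r in zip(readings, rands)]   # signed by the meter

# Household side: the bill and the matching blinding factor.
bill   = sum(t * m for t, m in zip(tariffs, readings))
bill_r = sum(t * r for t, r in zip(tariffs, rands)) % Q

# Supplier side: raise each commitment to its (public) tariff, multiply, and
# compare with a commitment to the claimed bill.
expected = 1
for c, t in zip(commitments, tariffs):
    expected = (expected * pow(c, t, P)) % P
assert expected == commit(bill, bill_r)
print("bill verified:", bill, "pence")
```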

This is the core of a nice solution for the basic billing case (which is likely to be the common one in smart grids). We have shown in related work that the protocol can be further improved to have zero communication overhead. Since it avoids expensive zero-knowledge proofs, producing and verifying the bill is fast. It also provides the basic infrastructure to support more expressive billing policies and general computations.

Quantifying Location Privacy: The Case of Sporadic Location Exposure
Reza Shokri and George Theodorakopoulos (EPFL), George Danezis (Microsoft Research), and Jean-Pierre Hubaux and Jean-Yves Le Boudec (EPFL)

This work evaluates the privacy offered by a set of location privacy mechanisms when location-based services are used sporadically. Sporadic services are those that require location infrequently rather than continuously (think of restaurant suggestions rather than relaying real-time GPS streams). The key novelty of the approach is that the model of location exposure, as well as of privacy protection, is very general: it encompasses anonymization, generalization and obfuscation of location, use of fake traffic, and suppression of location. The analysis in turn relies on advanced models of location and mobility (based on Markov chains) and on Bayesian inference. The evaluation of the different location privacy techniques is done on real-world traces from San Francisco taxis.
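To illustrate the kind of inference involved, here is a stripped-down sketch (entirely my own toy numbers) of a Bayesian tracking attack against a simple obfuscation mechanism, using a Markov mobility model and a hidden-Markov-model forward pass. The paper's framework is far more general, covering anonymization, fake reports and suppression as well.

```python
# The adversary models mobility as a Markov chain over regions and uses Bayes'
# rule to compute a posterior over the user's true location from obfuscated
# check-ins. All numbers are invented for illustration.
import numpy as np

regions = ["home", "work", "cafe"]
transition = np.array([[0.7, 0.2, 0.1],    # P(next | current), rows sum to 1
                       [0.3, 0.6, 0.1],
                       [0.4, 0.3, 0.3]])
prior = np.array([0.5, 0.3, 0.2])

# Obfuscation: the service sees the true region with prob 0.6, a random other
# region otherwise. This is the observation model the adversary inverts.
obs_model = np.full((3, 3), 0.2) + 0.4 * np.eye(3)   # P(reported | true)

def posterior(reported_indices):
    belief = prior.copy()
    for obs in reported_indices:
        belief = belief @ transition          # predict one step of mobility
        belief *= obs_model[:, obs]           # weight by likelihood of the report
        belief /= belief.sum()                # normalise (Bayes' rule)
    return belief

reports = [regions.index(r) for r in ["cafe", "work", "work"]]
for region, p in zip(regions, posterior(reports)):
    print(f"P(true location = {region}) = {p:.2f}")
```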

I am one of the authors of this work, so of course I think it is awesome! More seriously, it is one of the first works to combine a multitude of location privacy mechanisms under a common framework, and to provide a common evaluation framework to quantify the degree of protection they offer relative to each other against different adversaries. It is also one of the first systematic applications of Bayesian inference to the analysis of location privacy — extending the inference paradigm beyond the analysis of network anonymity systems.

Of course this is not the last word. Only a subset of protection techniques and combinations of techniques were looked at, and other protection mechanisms can be integrated and evaluated in the same framework (the tracing model and threat model can remain unchanged). Secondly, the analysis itself may be augmented with side information — be it commercial transactions or traces of network traffic — that may give away some information about location, to increase the capabilities of the adversary (or make them more realistic). The model we use, based on Markov chains, has the benefit of giving analytically tractable results, but numerical techniques may be used to extend it to be more true to real-life attacks.

The Location-privacy Meter tool that can be used to evaluate custom location privacy protections is available for download!

Privacy in Mobile Computing for Location-Sharing-Based Services
Igor Bilogrevic and Murtuza Jadliwala (EPFL), Kubra Kalkan (Sabanci University), Jean-Pierre Hubaux (EPFL), and Imad Aad (Nokia)

This paper looks at applications where users need to share their location. For example, two users may want to find out if they are close to each other, or where they should meet in order to share a taxi ride. Yet those users do not want to leak any of their location information to the other users or the service provider. More specifically, two users each specify a ranked set of preferred locations where they could meet, and the system needs to determine one of those fairly without revealing their current locations or other preferences (except the one chosen for the meeting). This is called the fair rendez-vous problem.

The key contribution of this work is to show that this problem can be solved with a set of concrete cryptographic protocols. It also presents an implementation of these algorithms on a real mobile phone to show that they are practical. The cryptographic computations are based on homomorphic encryption schemes as well as interactions with the service (to do the multiplications that are not possible with Paillier alone). The implementation on a mobile phone takes a few seconds on the client and the server, and is parallelizable in the number of users. Atypically, the authors also did a user study: users were asked what their concerns were, and after using the application on the phone they were asked how usable it was and whether they appreciated the privacy provided by the application.

This is a really nice example of a privacy application that applies advanced crypto, but also evaluates it on a real platform, both for performance and for users' reactions to it. The obvious extensions to this work would be to generalize it to more complex rendez-vous protocols, as well as other location-sharing applications. It is good to see that modern mobile devices can do plenty of crypto in a few seconds, so I am very hopeful we will see more work in this field.

On The Practicality of UHF RFID Fingerprinting: How Real is the RFID Tracking Problem?
Davide Zanetti, Pascal Sachs, and Srdjan Capkun (ETH Zurich)

This paper looks at UHF tags — the dumb tags readable at about 2m that are attached to things you buy to facilitate stock management or customer aftercare. Interestingly, this study looks at how identifiable the tags are at the physical layer, not using the actual tag ID! These techniques may therefore bypass any privacy protection that attempts to prevent access to the tag ID. It turns out that one can build a unique and reliable ID for a tag from its physical characteristics, which can be used to trace people as they move around.

What is new about this work is that the focus was on practicality and cost of extracting a reliable fingerprint (previous approaches relied on expensive equipment and laboratory conditions). The solution was implemented using a cheap software radio (USRP2 device + PC).

I am not quite sure what to conclude from the evaluation of the quality of the fingerprint. It seems that an adversary can place tags into one of 83 to 100 groups. Is this a good result or not? I guess it depends on the application and the density of tags. Of course, if more than one tag is carried, then the adversary can combine fingerprints to identify individuals more easily — if you carry 5 tags you have roughly a 20-bit ID. Interestingly, there is an extensive evaluation of the stability of the fingerprint under temperature changes and mobility — it turns out that these factors do affect its quality and further reduce the effective number of unique IDs that can be extracted (down to about 49 classes).

It would be interesting to combine this attack vector with the ideas from the first paper (treating the short physical IDs as a version of a privacy protection system) to evaluate the effectiveness of tracing a set of individuals throughout town.

I am currently sitting at the PETS 2011 symposium in Waterloo, Canada. I will be blogging about selected papers (depending on the sessions I attend) over the next couple of days — authors and other participants are welcome to comment!

The first session is about data mining and privacy.

“How Unique and Traceable are Usernames?”
Daniele Perito, Claude Castelluccia, Mohamed Ali Kaafar, and Pere Manils (INRIA)

The first paper looks at the identifiability of on-line usernames. The authors took usernames from different sites and assessed the extent to which they can be linked together, as well as linked to a real person. Interestingly, they used Google Profiles as ground truth, since these allow users to provide links to their other accounts. First they assess the uniqueness of pseudonyms using a probabilistic model: a k-th order Markov chain is used to compute the probability of each pseudonym, and from it its information content (i.e. -log_2 P(username)). The authors show that most of the usernames observed have “high entropy” and should therefore be linkable.
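As a rough illustration of the uniqueness measure, here is a toy first-order character Markov model over usernames and the corresponding information content in bits. The paper trains higher-order models on large username corpora; the corpus and smoothing constants here are placeholders of my own.

```python
# Back-of-the-envelope uniqueness: a first-order character Markov model over
# usernames, and -log2 P(name) as its information content.
import math
from collections import Counter, defaultdict

corpus = ["alice", "alicia", "bob99", "bobby", "carol_k", "carlos", "dani.p"]

counts = defaultdict(Counter)
for name in corpus:
    for a, b in zip("^" + name, name + "$"):     # ^ and $ mark start and end
        counts[a][b] += 1

def prob(prev, ch, alpha=0.1, alphabet=64):
    c = counts[prev]
    return (c[ch] + alpha) / (sum(c.values()) + alpha * alphabet)  # smoothed estimate

def information_bits(name):
    return -sum(math.log2(prob(a, b)) for a, b in zip("^" + name, name + "$"))

for name in ["bob99", "xq_zrk42"]:               # common-looking vs unusual
    print(f"{name}: {information_bits(name):.1f} bits")
```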

The second phase of the analysis looks at usernames from different services, and attempts to link them even given small modifications to the name. The key dataset used was 300K Google Profiles that sometimes list other accounts as well (this yielded about 10K tuples of usernames). They then show that the Levenshtein distance (i.e. edit distance) between usernames of the same person is small compared to the distance between two random usernames. A classifier is built, based on a threshold on a probabilistic Levenshtein distance, to assess whether a pair of usernames belongs to the same or to different users. The results seem good: about 50% of usernames are linkable with no mistakes.
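Reduced to its bare bones, the linkage step might look like the sketch below: a normalised edit distance and a threshold. The paper uses a probabilistic variant of the Levenshtein distance and tunes the threshold against the Google Profiles ground truth; the threshold and example names here are made up.

```python
# Link two usernames if their normalised edit distance falls below a threshold.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def same_user(u1, u2, threshold=0.35):
    dist = levenshtein(u1.lower(), u2.lower()) / max(len(u1), len(u2))
    return dist <= threshold

print(same_user("jon.smith84", "jonsmith84"))   # small edit: likely the same user
print(same_user("jon.smith84", "grumpycat"))    # unrelated names: not linked
```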

So what are the interesting avenues for future work here? First, the analysis used is a simple threshold on the edit distance metric. It would be surprising if more advanced classifiers could not be applied to the task. I would definitely try to use random forests for the same task. Second, the technique can be used for good not evil: as users try to migrate between services, they need an effective way to find their contacts — maybe the proposed techniques can help with that.

“Text Classification for Data Loss Prevention” (any public PDF?)
Michael Hart (Symantec Research Labs), Pratyusa Manadhata (HP Labs), and Rob Johnson (Stony Brook University)

The paper looks at the automatic classification of documents as sensitive or not. This is to assist “data loss prevention” (DLP) systems, which raise an alarm when personal data is about to be leaked (i.e. emailed or stored on-line — mostly by mistake). Traditionally DLP systems try to describe what is confidential through a set of simple rules, which are not expressive enough to describe and find confidential material — thus the authors present a machine learning approach that automatically classifies documents as sensitive or not using training data. As with all ML techniques there is a fear of mistakes: the technique described tries to minimise errors when it comes to classifying company media (i.e. public company documents) and internet documents, to prevent the system getting in the way of day-to-day business activities.

The results were rather interesting: the first SVM classifier used unigrams with binary weights to classify documents. Yet it initially had a very high rate of false positives for public documents; the classifier also had a tendency to label documents as “secret”. A first fix was to supplement the training set with public documents (from Wikipedia), to better describe “what is public”. Second, the classifier was tweaked (in a way that remains rather mysterious to me) by “pushing the decision boundary closer to the true negatives”. A further step does 3-category classification into secret, public and non-enterprise documents, rather than just secret and not-secret.
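A skeleton of such a classifier (binary unigram features fed to a linear SVM, with Wikipedia documents folded in as extra "public" training data) might look like the sketch below, assuming scikit-learn is available. The documents are obviously placeholders, and the real system does the three-way classification and boundary adjustment described above.

```python
# Binary unigram features + linear SVM, with public "background" documents
# mixed into the non-sensitive class.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

sensitive  = ["q3 acquisition target shortlist", "employee salary review draft"]
public     = ["press release: new product launch", "annual report summary"]
background = ["wikipedia article about turtles", "wikipedia article about rivers"]

docs   = sensitive + public + background
labels = ["secret"] * len(sensitive) + ["public"] * (len(public) + len(background))

vectorizer = CountVectorizer(binary=True)        # unigram presence/absence features
features = vectorizer.fit_transform(docs)

classifier = LinearSVC()
classifier.fit(features, labels)

print(classifier.predict(vectorizer.transform(["draft of salary figures"])))
```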

Overall: they manage to get the false positive / false negative rates down to less than 2%-3% on the largest datasets evaluated. That is nice. The downside of the approach is common to most work I have seen using SVMs: it requires a lot of manual tweaking, and further it does not really make much sense — it is possible to evaluate how well the technique performs on a test corpus, but difficult to tell why it works, or what makes it better than other approaches. As a result, even early positive results should be considered preliminary until more extensive evaluation is done (more like medicine than engineering). I would personally like to see a probabilistic model-based classifier on similar features that integrates the two-step classification process into one model, to really understand what is going on — but then I tend to have a Bayesian bias.

P3CA: Private Anomaly Detection Across ISP Networks
Shishir Nagaraja (IIIT Delhi) and Virajith Jalaparti, Matthew Caesar, and Nikita Borisov (University of Illinois at Urbana-Champaign)

The final paper in the session looks at privacy-preserving intrusion detection to enable cooperation between internet service providers. ISPs would like to pool data from their networks to detect attacks: either because the volume of communications is abnormal at certain times, or because some frequency component is odd. Cooperation between multiple ISPs is important, but this cooperation should not leak the traffic volumes at each ISP, since they are a commercial secret.

Technically, privacy of the computations is achieved by using two semi-trusted entities, a coordinator and a key holder. All ISPs encrypt their traffic volumes under an additively homomorphic scheme (Paillier) under the key holder's key, and send them to the coordinator. The coordinator then uses the key holder as an oracle to perform a PCA computation and output the top-n eigenvectors and eigenvalues of the traffic. The cryptographic techniques are quite standard, and involve doing additions, subtractions, multiplications, comparisons and normalizations of matrices privately through a joint two-party computation.
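For readers who have not met Paillier, the additive homomorphism that the whole protocol leans on is easy to demonstrate with a textbook toy implementation (tiny parameters, purely illustrative; the matrix operations and the key-holder interaction are of course where the real work is):

```python
# Toy textbook Paillier: ISPs encrypt per-slot traffic volumes under the key
# holder's public key; multiplying ciphertexts adds the plaintexts, so the
# coordinator can sum volumes without seeing them.
import math, secrets

def keygen(p=104729, q=104723):                      # toy primes, not secure
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)                             # valid because g = n + 1 below
    return (n,), (n, lam, mu)

def encrypt(pub, m):
    (n,) = pub
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2  # g^m * r^n mod n^2, with g = n+1

def decrypt(priv, c):
    n, lam, mu = priv
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n * mu) % n     # L(c^lam mod n^2) * mu mod n

def add(pub, c1, c2):                                # E(a) * E(b) mod n^2 = E(a + b)
    (n,) = pub
    return (c1 * c2) % (n * n)

pub, priv = keygen()
volumes = [1200, 430, 975]                           # per-slot traffic from three ISPs
ciphertexts = [encrypt(pub, v) for v in volumes]

total = ciphertexts[0]
for c in ciphertexts[1:]:
    total = add(pub, total, c)

assert decrypt(priv, total) == sum(volumes)
print("aggregate volume:", decrypt(priv, total))
```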

Surprisingly, the performance of the scheme is quite good! Using a small cluster, it can process a few tens of time slots from hundreds of different ISPs in tens of minutes. A further incremental algorithm allows on-line computation of eigenvector/eigenvalue pairs in seconds — making real-time use of the privacy-preserving algorithm possible (5 minutes of updates takes about 10 seconds to process).

This is a surprising result: my intuition would be that the matrix multiplication would make the approach impractically slow. I would be quite interested to compare the implementation and algorithm used here with a general MPC compiler (under the same honest-but-curious model).

Shishir Nagaraja has pointed out that our Drac anonymity system is not the first one to consider an anonymity network overlaid on a social network. The performance versus security trade-off of routing messages over a social network was already considered in his work entitled ‘Anonymity in the Wild’.

Shishir Nagaraja: Anonymity in the Wild: Mixes on Unstructured Networks. Privacy Enhancing Technologies 2007: 254-271 [pdf][ppt]

This is important prior work and we should have cited it properly. It presents an analysis of the anonymity provided by different synthetic social network topologies, as well as real-world data from LiveJournal.

My team at Microsoft Research has spent the past six months grappling with the problem of privacy in next-generation energy systems. In parallel with the good honest scientific work, I also participated in the UK government consultation on smart metering, in writing and in person, specifically on the issue of privacy. The consultation's conclusions have finally been made public (see DECC's site and Ofgem's detailed responses).

First, what is the problem? Smart meters are to be fitted in most homes, and they provide facilities for recording fine-grained readings of energy consumption. These are to be used for time-of-use billing, energy advice, the backend settlement process, suppliers' financial projections, fraud detection, customer service, and network management. The problem is that these readings are also personal data, and leak information about the occupancy of households, devices used, habits, and so on. So here we have a classic privacy dilemma: where to strike the balance between the social value of sharing data (even mandating such sharing) and the intrusion into home life?

Or do we? As is often the case when privacy is framed as a balance, what is ignored is that we can use technology to achieve both privacy and extract value from the data. In fact, we show that no balancing act is necessary. We designed a host of privacy technologies to fulfil the needs of the energy industry (even the rather exotic ones) while preserving extremely high levels of privacy and user control. Let's look at them in detail:

  • We developed a set of protocols to perform computations on private data while maintaining a high degree of integrity and availability. This allows customers to calculate their bills, provide indicators of consumed energy value to be used in settlement, route demand response requests, and do the profiling needed to support network operation or even marketing. Our framework guarantees that the computations only leak their results to third parties, and also that those results are in fact derived from the real meter readings. The raw meter readings are not necessarily shared, but can be used locally by any user client to offer a rich experience — i.e. pretty graphs of consumption and comparisons with the neighbours. A non-technical overview is available as a white paper, a technical introduction for meter manufacturers is provided, and a preliminary technical report with all the crypto is also online.
  • Sometimes it is important to aggregate information from multiple meters without revealing anything about individual readings. The traditional approach has been to give all readings to a trusted third-party that performs the aggregation and only publishes the sum. We show that a set of meters can in fact perform the aggregation without the need for a trusted party. This is simple, efficient and compact — the computations can be done inside the trusted meter or outside along with cryptographic verification. All details are available in our technical report on aggregation.
  • Some smart meters may be deployed in extremely high-security settings. In such places, leaking even the final bill or statistics aggregated over time may leak information, and a positive guarantee that the information leakage is limited might be necessary. We developed techniques inspired by differential privacy to inject noise into aggregate readings, guaranteeing that the consumption in any specific time period is masked. Furthermore, we allow customers to recuperate the bulk of the resulting costs through an oblivious cryptographic rebate system. Our technical report on differential privacy and rebates in metering is available.
  • Finally, proving that protocols are correct is not sufficient, so we explore options for proving that actual implementations of the protocols do provide the necessary security and privacy properties. A report on the certified implementation of variants of the proposed protocols using refinement types is also available.

The project web-page on privacy in metering links to all of those and more.

So much for the science; what about the engagement with government? On the positive side, our rather limited goal has been achieved: we wanted to put privacy technologies, which provide solutions beyond the dilemmas and balancing acts between privacy and value, on the map. The government response to the consultation takes note, in a limited way, of the potential use of privacy technologies. On page 10 it shyly mentions that:

“2.18. Work is in process to understand the options for aggregating or anonymising smart metering data and whether it is necessary for the data to be accessed by a party that carries out the data minimisation. Privacy enhancing technology can potentially enable anonymised or aggregated data to be provided without any party having access to the personal data itself. The programme will work with industry and academics in order to explore the applicability of privacy enhancing technologies within the smart metering system.”

This is actually a rather fair representation of the capabilities of the technology, even if it is presented as a far away goal, rather than the concrete protocols we have proved correct and the implementations we have built.

Paragraph 2.18, mentioning privacy technology, is a ray of light amidst an otherwise ambivalent government response. On the up side, it recognizes energy consumption as private data from the outset, it mandates meters to hold 13 months of consumption and provide local access to it, and it narrowly defines the scope of data that can be gathered without explicit consent and puts that data under the data protection regime. On the down side, there is confused language about what constitutes personal data (2.17), and the final technical solution involves collecting data in the clear through a centralized system (the glorious DCC) and protecting it using access control — a far cry from what we know to be possible in terms of technical privacy protection.

The metering privacy geeks (legal & technical) might also find other interesting nuggets in this report:

  • It mentions privacy-by-design, but without support for privacy technologies (except a mention of aggregation in 2.14). This is a damaging trend set by the Ontario report on privacy in the smart grid, which takes a purely managerial approach to privacy in the local smart-grid deployment. A response to this trend is provided by Prof. Claudia Diaz and her colleagues, highlighting the technical protections necessary to engineer privacy-by-design. This is only the start of this tussle.
  • The report seems to suggest that personal data is not personal if it is not readily identifiable by the data controller (sect. 2.17 and 3.7). This is the classic “what is de-identified personal data” argument: does it mean that the data controller cannot identify the subjects, or that no one in the world can? It seems the government is as confused as everyone else on this matter.
  • The key outcome of the consultation is that the energy industry needs some data to perform its “regulated duties”. This concept was present in the initial consultation, but funnily enough there was no description of what those duties were. It transpired in meetings that Ofgem was not in fact clear about what they were, and a large part of the consultation centered around fleshing them out. A list of those duties is available in Appendix 3 of the report, and is probably welcomed by all (a similar list is available in the NIST privacy reports).
  • So (in 3.15) the government concedes that industry must have access by default to the data necessary to perform its regulated duties, yet this data should be subject to the DPA requirements (3.16, for example, specifically invokes principle 5 — that the data should not be kept longer than necessary). Well, that is a minefield: it is clear that the data is collected for a specified purpose (principle 2). If the other principles are also applied, it means that the data should not be used without explicit consent for other purposes (*cough*added value services*cough*) and furthermore should not be excessive for the stated purpose. Well, here we are: our technical reports offer ways in which most of the stated purposes in Appendix 3 could be fulfilled without collecting the data. Is this a contradiction? Not automatically. The government’s view is clearly that our proposed protocols are not yet ready for prime time — of course, as these technologies become better known and deployed, this objection will evaporate. Will the data minimization requirement then mandate the use of privacy technologies? This is a rhetorical question at the moment.
  • It is interesting to note that the restrictions limiting the automatic collection of data by suppliers were possibly put in place on the grounds of market competition rather than privacy per se (section 3.32). Automatic collection by suppliers would put them in an advantageous position vis-a-vis third-party providers of value-added services. This is an open issue (3.36).
  • The government is keen for a local repository of consumption data in the meter (4.6) and the use of geeky toys to visualize it (4.12). This is the setting in which our solutions enable strong privacy guarantees. That is positive, if only half-way.

In conclusion, the debate around privacy in metering has been informed by consumer concerns, privacy concerns, industry needs and technology alternatives. They are all represented in the government response. Yet the final solution is rather conservative: it relies on a centralised conduit for personal information, protected by access control and management layers. It is far from what we know to be possible with privacy technologies. The argument today is that those technologies are too new — which is questionable given how quickly IT innovations are brought to market. This argument will lose its potency in the long term if we keep developing and deploying privacy-friendly solutions.

Americans’ Attitudes About Internet Behavioral Advertising Practices
Aleecia M. McDonald and Lorrie Faith Cranor (Carnegie Mellon University)

This is a very interesting paper on people’s attitudes to behavioural advertising. Researchers used a mix of a small-scale (14 people) study and a larger (100s of people) statistical study. A few findings are remarkable:

  • First, they see that users apply their intuitions about off-line ads to the experience of on-line ads — many see on-line ads as a push mechanism and do not realise that data about themselves are collected. They do not seem to object in general to the idea of advertising: they consider it a fact of life, and even see it as ‘ok’ as a way to support services.
  • The landscape of attitudes to behavioural advertising is fascinating. When faced with a description of what behavioural advertising collects, as a hypothetical scenario, and how it functions, a large percentage of users said this is not possible, and some of them even claimed it would be illegal. When it comes to attitudes towards receiving ‘better’ ads only 18% of them liked the idea for web-based services, and 4% for email based services (like hotmail & gmail). In general the authors found that a lot of extremely common practices cause “surprise”.
  • The researchers also looked at the formulation of the text on the NAI site, which offers an opt-out from behavioural advertising. They find that what the system does is unclear, even after reading the page where the operation is described.

In general people prefer random ads to personalised ads, with the exception of contextual ads (like books in on-line book stores). There is still a lot of ignorance about how these technical systems work, and education when it comes to privacy, as well as users' ability to help themselves protect their privacy, is clearly not working.

This research points in the direction that the presumed tolerance of users to privacy invasion is due to ignorance of common practices. Once those practices are revealed, they produce surprise, and even a feeling of betrayal, which benefits neither companies nor customer confidence.

The potential for abuse is a key challenge when it comes to deploying anonymity systems, and the privacy technology community has been researching solutions to this problem for a long time. Nymble systems allow administrators to blacklist anonymous accounts, without revealing or even knowing their identity.

What is the model? A user registers an account with a service, such as Wikipedia. Then the user can use an anonymous channel, like Tor, to perform operations, like editing encyclopedia articles. This prevents identification of the author, and also bypasses a number of national firewalls that prevent users from accessing the service (China, for example, blocks Wikipedia for some reason). If abuse is detected then the account can be blacklisted, but without revealing which one it was! The transcript of the edit operation is sufficient for preventing any further edits, without tracing back the original account or network address of the user.

Nymble systems used to have some limitations: they either required trusted third parties for registration, or they were slow. A new generation of Nymble systems, including Jack, is now addressing these limitations: they use modern cryptographic accumulator constructions, with proofs of non-membership in O(1) time, to prove that a hidden identity is not blacklisted. Jack can do authentication in 200ms, and opening a Nymble address in case of abuse in less than 30ms. This is getting really practical, and it is time that Wikipedia started using such a system instead of blacklisting Tor nodes out of fear of abuse.
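For intuition, here is the linkability trick of the original Nymble reduced to hash chains, in my own much-simplified form (the real design splits trust between a Pseudonym Manager and a Nymble Manager and adds MACs; Jack replaces this machinery with accumulators):

```python
# A user gets one nymble per time slot; publishing the chain state from the
# slot of an abuse complaint lets the server block all *later* slots without
# learning earlier activity or the user's identity.
import hashlib

def h(tag, data):
    return hashlib.sha256(tag + data).digest()

def nymbles(seed, slots):
    out, state = [], seed
    for _ in range(slots):
        out.append(h(b"nymble", state))      # token shown to the server
        state = h(b"evolve", state)          # one-way step to the next slot
    return out

user_seed = h(b"seed", b"user-42|wikipedia|window-2011-06")
tokens = nymbles(user_seed, slots=6)

# Complaint about slot 3: the manager reveals the chain state for that slot.
state_at_3 = user_seed
for _ in range(3):
    state_at_3 = h(b"evolve", state_at_3)
blacklist = set(nymbles(state_at_3, slots=3))    # covers slots 3, 4 and 5

assert tokens[4] in blacklist and tokens[1] not in blacklist
```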

Other Nymble systems: The original nymble | Newer Nymble | BLAC | Nymbler with VERBS | PEREA. Each of them offers a different trade-off of efficiency and security.

This is the title of the paper resulting from the interdisciplinary collaboration between computer scientists and social scientists last November in Dagstuhl. The full version is available on SSRN at:

The topic of the seminar was “Network Democracy”, and for five days we discussed tools for representation, direct democracy, power, transparency and democratic institutions. This was a refreshing break from the traditional “e-voting = e-democracy” caricature.

The gap between computer and social scientists was initially wide, and for a few days we concentrated on formulating questions that the two communities want to ask each other (see appendix 1). A few examples include:

  • Computer to social scientists about Conflicting Values. What are prime examples where democratic values come in conflict with each other? What types of conflicts are inherent in democratic systems? Is the integrity of technical systems the key requirement for e-democracy solutions? Is it more important than privacy? Is availability more important than both? What are the social dangers for democracy in a network society?
  • Social to computer scientists about Privacy and Surveillance. How will future technologies enable all branches of government to discover what citizens and other residents are doing, thinking and saying? To what extent can existing and new privacy and security technologies limit the government’s ability to know more about the public than the public wants to reveal? Can privacy technologies help both enhance and protect the democratic process (e.g. by preventing widespread disclosure of the names of persons signing petitions in a way that could lead to subsequent harassment because of their support of a controversial measure – at the same time as allowing dissemination of information that the wider public would like to know, such as how many people signed the petition and their broad demographic characteristics, but not their individual identities)?

One of the most insightful remarks, and by far my favorite:

“Technologies may be used to cement existing power relations or offer merely an ineffectual ‘play democracy’. Technologies may disadvantage certain groups and worsen power imbalances (e.g. some types of surveillance technologies). Political forces may seek widespread deployment of such technologies or try to limit their use.”

The UK goes every ten years through a national census, where every household is called to fill in details about their demographics, habits, travel and income. The next one will be the UK 2011 census.

The Office for National Statistics has a statutory duty to ensure that the data released from this census cannot be used to identify any individual or to infer any unknown attribute about them. Techniques for doing so are called statistical disclosure control, and have been the subject of intense study for at least the last 40 years. One would never guess this by reading the documents on confidentiality for the next UK census.

To make a long story short: the ONS never considered modern, well-defined notions of privacy, it lacks a reliable evaluation framework to establish the degree of risk of different methods (let alone their utility), and it has proposed disclosure control measures that fall rather short of the state of the art.

Moving households around (a bit)

The consultation is not totally over yet, but the current favorite after two rounds of evaluation seems to be a technique called “record swapping”. How does it work? The technique takes the database of all responses to the census and outputs another database that is sufficiently different to avoid identification and inference. Record swapping first categorises all records by household size, sex, broad age band, and hard-to-count variables. Then it selects 2-20% of the records, and each of them is paired with a record from the same category. Finally, the geographical data of each pair of records (yes, only the geographical data) are swapped.

This procedure has the effect of dispersing the population geographically a bit, so that it is not possible to know whether single cells in tables really provide information about an individual in a region, or whether they are the product of a swap from a different region. The advantages are that the totals stay the same (since swapping leaves sums unchanged), the swaps are with “similar” households, and the procedure is simple to implement.
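A minimal version of the swapping step, as I read the description, is sketched below; the field names and the swap rate are illustrative.

```python
# Group records by the matching variables, then swap the geography of randomly
# chosen pairs drawn from the same group.
import random
from collections import defaultdict

def record_swap(records, rate=0.10, rng=random.Random(0)):
    groups = defaultdict(list)
    for rec in records:
        key = (rec["household_size"], rec["sex"], rec["age_band"], rec["hard_to_count"])
        groups[key].append(rec)

    for members in groups.values():
        k = int(len(members) * rate) // 2 * 2            # an even number, to pair up
        chosen = rng.sample(members, k)
        for a, b in zip(chosen[::2], chosen[1::2]):
            a["area"], b["area"] = b["area"], a["area"]  # swap geography only
    return records

records = [
    {"household_size": 2, "sex": "F", "age_band": "30-44", "hard_to_count": 0, "area": "E01"},
    {"household_size": 2, "sex": "F", "age_band": "30-44", "hard_to_count": 0, "area": "E02"},
]
print(record_swap(records, rate=1.0))   # with rate 1.0 the two areas are swapped
```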

This is in line with the census office's definition of privacy, namely that:

“The Registrars General concluded that the Code of Practice statement can be met in relation to census outputs if no statistics are produced that allow the identification of an individual (or information about an individual) with a high degree of confidence. The Registrars General consider that, as long as there has been systematic perturbation of the data, the guarantee in the Code of Practice would be met.”

Problems with “Record Swapping”

So far a whole process has been followed to evaluate a list of proposed disclosure control measures, present a methodology to evaluate them, shortlist some, and perform more in-depth research about their utility and privacy. There is a lot of repetition in these documents, a few ad-hoc indicators of quality and privacy, and no security analysis whatsoever of inference attacks on the proposed schemes. The subject of “disclosure by differencing” is left as a suggestion for future work in the latest interim report, while the only methods left on the list are record swapping and ABS, which has apparently not been tested at all yet.

Why is that a problem? Records include many other potentially identifying fields aside from location. Since the rest of each record stands as it is, and is aggregated into tables with a secret small cell adjustment technique, we cannot really be sure that there are no re-identification attacks. (Apparently the details of the technique cannot be divulged for confidentiality reasons, violating even the most basic principle of security engineering! See page 3.)

The utility measures used to assess how acceptable these disclosure control measures will be to data users (Shlomo et al.) are themselves very simplistic and do not offer very tight bounds on possible errors, but I will leave this matter for the statisticians to blog about.

To make the problem worse, this time the ONS is seriously thinking of allowing data users to submit their own queries to the database of statistics. The queries are not likely to be full SQL any time soon, but tables over 3 categories (called cubes) are likely to be allowed. This leaves the system wide open to quite a range of attacks from the literature on inference in statistical databases.
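For readers unfamiliar with the term, "differencing" is the oldest trick in this literature: two perfectly ordinary-looking aggregate queries whose populations differ by one person reveal that person's attribute. A toy example, with data and queries invented for illustration:

```python
# Two legitimate counting queries; their difference is one person's attribute.
people = [
    {"age": 34, "area": "E01", "long_term_illness": 0},
    {"age": 51, "area": "E01", "long_term_illness": 1},
    {"age": 29, "area": "E01", "long_term_illness": 0},
    {"age": 87, "area": "E01", "long_term_illness": 1},   # the only person over 80
]

def count(rows, **conditions):
    return sum(all(r[k] == v for k, v in conditions.items()) for r in rows)

# Both queries look innocuous on their own...
with_illness = count(people, area="E01", long_term_illness=1)                    # 2
under_80_with_illness = sum(1 for r in people
                            if r["area"] == "E01" and r["age"] < 80
                            and r["long_term_illness"] == 1)                     # 1

# ...but their difference is the 87-year-old's health status.
print("target has a long-term illness:", with_illness - under_80_with_illness == 1)
```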

At this point there is absolutely no evidence that the disclosure control scheme is actually secure, which in security engineering means that it is probably not.

How did we get to this situation?

It seems the bulk of the work on disclosure control has been done by the ONS, in conjunction with researchers from the University of Southampton. None of the authors of any of the evaluations has substantial research experience in privacy technology or theoretical computer security, fields that deal with these privacy matters in a systematic way.

What is revealing is the fact that the most relevant related work is never mentioned. It includes:

  • The work of Denning on trackers and inference in statistical databases (1980). Instead, the archaic term “differencing” is used.
  • The work of Sweeney and Samarati on linkage attacks and k-anonymity (1997).
  • The work of Dwork on Differential Privacy (2007), which is the most current and strongest definition of privacy for statistical databases.

These works show repeatedly that ad-hoc inference control measures, that only aim to suppress a handful of known and obvious attacks, are systematically bypassed.

Dwork, in her work on Differential Privacy (which won the 2009 PET Award), provides clear arguments for why simpler ad-hoc techniques cannot provide the same guarantee of privacy: their results can be combined with side information known to the adversary to facilitate inference. Differential privacy, on the other hand, guarantees that the result of a query to the database, or a published table, reveals no more information when composed with other such queries or with any side information.
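For comparison, the basic differentially private mechanism for counting queries is essentially a one-liner: add Laplace noise scaled to the query's sensitivity divided by epsilon. A sketch (the epsilon value here is an arbitrary choice of mine, and a real deployment must also track the budget spent across queries):

```python
# Laplace mechanism for a counting query: adding or removing any one person
# changes the count by at most `sensitivity`, so the released value is almost
# equally likely with or without them in the data.
import numpy as np

def dp_count(true_count, epsilon=0.1, sensitivity=1):
    return true_count + np.random.laplace(0.0, sensitivity / epsilon)

print(round(dp_count(412)))   # a noisy version of a small-area census count
```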

This is a hot topic in research today, and all the details may not be ready for a census in two years' time. But this does not justify the ONS's ignorance of the field.
