“Privacy” for the UK 2011 census?
17 August 2009
The UK goes every ten years through a national census, where every household is called to fill in details about their demographics, habits, travel and income. The next one will be the UK 2011 census.
The office for national statistics has a statutory duty to ensure that the data released from this census cannot be used to identify any individual or to infer any of unknown attribute. Techniques for doing so are called statistical disclosure control, and have been the subject of intense study for the last 40 years at least. One could never have guessed by reading the documents on confidentiality for the next UK census.
To make a long story short: the ONS never considered modern well defined notions of privacy, it lacks a reliable evaluation framework to establish the degree of risk of different methods (let alone utility), and has proposed disclosure control measures that fall rather short of the state of the art.
Moving households around (a bit)
The consultation is not totally over yet, but the current favorite after two rounds of evaluation seems to be a technique called “Record Swapping”. How does it work? The technique takes the database of all responses to the census and outputs another database, that is sufficiently different to avoid identification and inference. Record swapping first categorises all records by the household size, sex, broad age, and hard-to-count variables. Then it selects 2-20% of the records, and each of them are paired with a record from the same category. Then the geographical data of each pair of records (yes, right, only the geographical data) are swapped.
This procedure has the effect to disperse geographically the population a bit so that, it is not possible to know whether single cells in tables are indeed providing information about an individual in a region or, whether they are the product of a swap from a different region. The advantage is that the totals are the same (since swapping things around is invariant to addition), the swaps are with “similar” households, and the procedure is simple to implement.
This is in-line with the definition of privacy of the census office, namely that:
“The Registrars General concluded that the Code of Practice statement can be met in relation to census outputs if no statistics are produced that allow the identification of an individual (or information about an individual) with a high degree of confidence. The Registrars General consider that, as long as there has been systematic perturbation of the data, the guarantee in the Code of Practice would be met.”
Problems with “Record Swapping”
So far a whole process has been followed to evaluate a list of proposed disclosure control measures, present a methodolody to evaluate them, shortlist some, and perform more in-depth research about their utility and privacy. There is a lot of repetition in these documents, a few ad-hoc indicators of quality and privacy, and no security analysis what-so-ever about inference attacks on the proposed schemes. The subject of ” disclosure by differencing” is left as a suggestion for future work in the latest interim report, while the only method left on the list is Record Swapping, as well as ABS, that has apparently not been tested yet at all.
Why is that a problem? Records include many other potentially identifying fields aside from location. Since the rest of the record stand as it is, and is aggregated into tables, with a secret small cell adjustment technique, we cannot really be sure at all that there are no re-identification attacks. (Apparently revealing the details of the technique cannot be divulged for confidentiality reasons, violating even the most basic principle of security engineering! See page 3).
The utility measures used to assess how acceptable these disclosure control measures will be to data users (Shlomo et al.), are themselves very simplistic and do not offer very tight bounds on possible errors but I will leave this matter for the statisticians to blog about.
To make the problem worse, this time the ONS, is seriously thinking of allowing data users to submit their own queries to the database of statistics. The queries are not likely to be full SQL any time soon, but tables on 3 categories (called cubes) are likely to be allowed. This leaves wide open quite a range of attacks in the literature on inference in statistical databases.
At this point there is absolutely no evidence that the disclosure control scheme is actually secure, which in security engineering means that it is probably not.
How did we get to this situation?
It seems the bulk of the work on disclosure control has been done by the ONS, in conjunction with researchers from the University of Southampton. None of the authors of any of the evaluations has a substancial research experience in privacy technology or theoretical computer security that deals with these privacy matters in a systematic way.
What is revealing is the fact that the most relevant related work is never mentioned. It includes:
- The work of Denning on trackersand inference in statistical databases (1980). Instead the archaic term “differencing” is used.
- The work of Sweeney and Samarati on linkage attacks and k-anonymity (1997).
- The work of Dwork on Differential Privacy (2007), which is the most current and strongest definition of privacy for statistical databases.
These works show repeatedly that ad-hoc inference control measures, that only aim to suppress a handful of known and obvious attacks, are systematically bypassed.
Dwork in her work on Differential Privacy (that won the 2009 year’s PET Award) provides clear arguments on why simpler ad-hoc techniques cannot provide the same guarantee of privacy: their results can be aggregated with side information known to the adversary to facilitate inference. Differential privacy on the other hand guarantees that the results of a query to the database, or published table, reveals no more information when composed with other such queries or any side information.
This is a hot topic in research today, and all the details may not be ready for a census in 2 years time. This does not justify the ONS’s ignorance of this field.