1 July 2015
One of my key annoyances while doing work in privacy technologies is the poor support of key cryptography libraries in my favorite Python programming language. Today, I would like to share my work on building petlib, a more or less pythonic, wrapper around the OpenSSL low level crypto and math libraries, as well as numerous example privacy technologies (PETs) that I have implemented as examples.
The needs of someone doing research in the PETs field are quite different from other developers: on one hand we need access to low level primitives (such as block cipher and hash function operations), as well as low level mathematical functions on big integers and elliptic curves on finite fields. A number of available libraries try to hide those primitives from developers behind abstractions such as “signed envelopes” or “secure sockets” — which so not serve those who try to build different abstractions. On the other hand, issues such a tight memory management and absolute control over other low-level aspects of the library are not essential; in fact a clean programming interface that leads to beautifully clear reference code for proposed protocols is preferable.
The petlib library is available for everyone to use, and after installing the OpenSSL prerequisites can be acquired through the python repositories through:
pip install petlib
The petlib library was used as the basis for teaching the labs of the Privacy Enhancing Technologies course at UCL, and thus has extensive documentations, and is properly version controlled, packaged and tested:
- petlib github repository
- petlib installation and programming documentation (read the docs)
- petlib listing on pypi
The best way to get a feel for how the library can be used to build PETs prototypes is to browse the examples in the source tree:
- A toy RSA example
- A simple Schnorr Zero-Knowledge Proof
- An additivelly homomorphic public key encryption scheme
- A generic engine for building zero-knowledge proofs using sigma protocols
- The Groth-Kohlweiss ring signature and zero-cash scheme
- The algebraic MAC scheme by Chase, Meiklejohn and Zaverucha
- An anonymous credential scheme based on the aMAC scheme above
- The Baldimtsi and Lysyanskaya anonymous credentials light scheme
In terms of more real-work research project, we have already used petlib for implementing prototypes for a few projects and labs:
- The centrally banked cryptocurrency framework (with Sarah Meiklejohn)
- A private stats collection system for Tor (with Melis and De Cristofaro)
- Exercises in Privacy Enhancing Technologies (for UCL COMPGA17)
One key missing component from the underlying OpenSSL crypto library is support for computations on pairings of elliptic curves. This limits the types of protocols that can currently be implemented with petlib, until such functionality becomes available in the underlying libraries (please contribute!) Bug reports and pull requests with fixes to the code and documentation are very welcome.
30 June 2015
The course covers principally, and in some detail, engineering aspects of PETs and caters for an audience of CS / engineering students that already understands the basics of information security and cryptography (although these are not hard prerequisites). Students were also provided with a working understanding of legal and compliance aspects of data protection regimes, by guest lecturer Prof. Eleni Kosta (Tilburg); as well as a world class introduction to human aspects of computing and privacy, by Prof. Angela Sasse (UCL). This security & cryptographic engineering focus sets this course apart from related courses.
The taught part of the course runs for 20 hours over 10 weeks, split in 10 topics:
- Introduction and privacy in communication. (01-GA17-IntroComms)
- Anonymous communications & Traffic analysis (02-GA17-Anonymous-Comms)
- Private Computations with homomorphic encryption and secret sharing (03-GA17-Private-Computations)
- Privately checking inputs using Zero-Knowledge Proofs (04-GA17-ZeroKnowlegde)
- Private authorization using selective disclosure credentials (05-GA17-Selective-Disclosure)
- Data anonymization & de-anonymziation attacks (08-GA17-Data-Anonymization)
- Private Storage, queries and lookups (09-GA17-Storage-Retrieval)
- Privacy by design case-studies (10-GA17-Privacy-by-design-case-studies – Copy)
- Guest lectures: Human aspects (Angela Sasse)
- Guest lectures: Data Protection (Eleni Kosta)
Most importantly the course includes 10 hours of labs (20 next year!), split into 5 exercises, that give students (and their teachers!) hands on experience implementing extremely advanced privacy enhancing technologies. More generally the course provides an introduction to solid cryptographic engineering, test-driven development, testing & QA tools and code audits. The programming language used was Python on a Linux environment, with the petlib library that was specially developed for this course.
For each lab exercise students in pairs were provided with a partial code file, and a set of unit tests, and were asked to fill in the remaining code to fulfill the task, and at least make the unit tests pass. The topics of the exercises track the first 5 lecture topics:
- Private communications and basic programming with petlib
- Building a simple mix server and client
- Building a private polling system with homomorphic encryption
- Basics of zero-knowledge proofs of knowledge, equality and linear statement
- A basic selective disclosure authorization credential system
Finally, part of the grading was based on students performing a code review of other groups, looking for code defects leading to security or other bugs.
Overall, I am very proud of the progress everyone made. The course was attended by 16 MSc student and 2 MEng students. Everyone eventually was able to complete all lab assignment — not a given considering the advanced nature of the tasks at hand. It was evident while discussing with student the final exercise, on building a selective disclosure credential, that many had developed an intuitive understanding of how to build solutions based on zero-knowledge protocols, and all had definitely overcome their initial fear of these more advanced concepts in PETs.
I was also very impressed with many students that were able to tackle the hardest questions in the exam. One of those questions, basically asked students to re-invent a variant of the privacy preserving genomic testing protocol we presented at WPES 2014 — and many did successfully. Similarly, they were asked to de-anonymize a mechanism very similar to the 15:15 rule in place in California to “protect” smart meter reading, and again many did so successfully under time constrain and the high pressure environment of exams. As ever, the great engagement from students was the most rewarding part of teaching the course.
All material is available online (see links to slides, and git repositories), and I would be delighted to share / receive any additional exercises by others finding this material relevant to their courses.
I will be participating on a panel this afternoon on “Creating Usable and Secure Software”, in the context of the conference on Digital Citizenship and Surveillance Society. I share a platform with a number of illustrious people — Dave Hrycyszyn, Lola Oyelalo and Blaine Cook — with a much deeper experience in usable software and services development. However, I will attempt to provide some context, and my opinions, on why we can observe a broadly poor state of affairs when it comes to usability of privacy technologies — and hopefully open a discussion on how to overcome roadblocks.
My main two positions will be as follows:
- The political context within which technical security and privacy research and development had to be conducted over the past 40 years greatly contributed to the lack of wide deployment and poor usability of privacy technologies.
- The lack of “knowledge” about methods for developing usable privacy friendly solutions only offer a partial explanation for this poor state of affairs, and has to compete with other roadblocks that systemically undermined the deployment of usable privacy technologies.
First, it is worth reminding ourselves that research into security technologies and strong cryptography specifically, was until recently the prerogative of governments. Public discussion and know-how on this topic was developed seriously after the mid-1980s, and often despite serious pressure from the US and other governments. The technical security community is small and there remain serious technical challenges to providing privacy friendly solutions — solutions that require deep expertise developed over years of practice (requiring funding).
Second, the export control regimes, and also requirements for cooperation with law-enforcement slowed down significantly the blanket deployment of privacy technologies even after the strict export control regime of the 1990s was lifted. What makes a number of privacy technologies unusable — email encryption, instant messaging encryption — is the fact that common clients do not support them transparently and by default — requiring plug-ins, user configuration and manual key management. Thus the lasting impact of these regulation has not been the non-proliferation of strong crypto technologies, but the lack of integration of these into mainstream platforms. It is telling that the current Law Enforcement and Government narrative is not about preventing encryption know-how from spreading, but rather discouraging wide deployment of such technologies without the ability for back-door or front-door access.
Third, there are commercial pressures — which again have been related with government hostility of the wide deployment of privacy technologies. It is easy to forget that governments, are major customers of technologies. Thus they are able to dictate requirements that make it difficult to widely deploy privacy technologies. It is telling that mainstream mail clients — such as Microsoft Outlook — do not transparently support PGP based end-to-end encryption and have instead opted for S/MIME and models that make the use of encryption by individuals rather difficult. In this context one may assume that the key customers of this software — large enterprises and governments — simply never asked for such features, and in fact probably considered such a feature to conflict with other requirements (such as the need to recover mail of employees, backup, …).
These commercial pressures, have changed in the past few years, as large internet companies start relying heavily on serving end-users (search, webmail, social networking). Sadly, these companies have adopted both a business model — ad-based monetization — and a technical architecture — cloud computing — that makes meaningful privacy protection very difficult. In turn the “success” of those architectures has lead to an extreme ease of developing using this model, and an increasing difficulty in providing end-user solutions with appropriate privacy protections — let alone usable ones.
The rise of services has pushed a number of key privacy technologies into not being commercially supported and a key feature, and in effect at best a “common” — with the governance and funding problems this entails. We have recently learned about the systemic under funding of key privacy technologies such as OpenSSL and GPG. Technologies like Tor are mostly funded for their national firewall traversal features, seeing development on anonymity features suffer. Unlike other commons (health, parks, quality assurance in medicines), the state has not stepped in to either help with governance or with funding — all the opposite. For example, standardization efforts have systematically promoted “surveillance by design” instead of best of breed privacy protection; funding for surveillance technology is enormous compared to funding for privacy technologies, and somehow ironically, a number of calls for funding of privacy technologies are in the context of making surveillance more “privacy friendly” — leading to largely non-nonsensical outcomes.
So, lack of “knowledge” about how to develop usable software, while also a contributing factor, has to be seen within the context of the above structural pressures. In parallel, pressures undoubtedly exist when it comes to the discipline of UX which is in itself recent, and constantly involving. Along with serious funding for collaboration on building more usable privacy software (which the Simply Secure project that I am associated with attempts to provide), we need a strategy to counter those systemic pressures to ensure the wide deployment of usable privacy technologies.
I have a rather enlightening chat with Yvo Desmedt after the Cambridge Security Protocol‘s Workshop, who was kind enough to give me an overview and insight into the group key agreement protocols of the ’90s. As part of an on-going conversation on secure chat, I am trying to understand the genealogy, and requirement, for key group agreements to be “contributory” — namely ensure all participants contribute to the final group key, even under malice. It seems that there is also a preference for symmetric schemes, where all participants perform the same operations.
Yvo’s classic Eurocrypt paper [BD94] is the basis of GOTR [LVH13], which manages to complicate it considerably. The original paper had a O(N) broadcasts cost, and was “peer-to-peer”, meaning that everyone just does the same thing. However it did not consider an active adversary, and was also not “contributory” meaning that an insider (active) adversary could force the key to anything they liked. Interestingly, Yvo points me to [BD96], which he thinks is superior to [BD94]. This paper really illustrates that there is no magic to group key agreement: a master key for the group is determined and then propagated using a key-sharing graph. This reduces the cost from O(N) broadcasts to just O(N) point to point messages — which to me seems optimal.
Now, the idea that schemes must be “contributory” (ie. no participant is special in determining the key — no one can force the key to be some specific value) emerged sometime in the late ’90s. The first reference I found to this property is [BW98], where the authors look at the round complexity of key agreement. However they state that “If the group key is generated and distributed by a central trusted party, then it is not necessary to discuss the communication complexity“. Then, they just launch into those schemes with no justification as to the reason centralized distribution may be a problem …
Katz and Young [KY03] also state that “(we exclude here centralized protocols in which a designated group manager is assumed; such asymmetric, non-contributory schemes place an unfairly high burden on one participant who is a single point of failure and who must also be trusted to properly generate keys)“. So it seems that the security issue contributory schemes try to mitigate is a flawed RNG. However this is a marginal threat — if the RNG is bad, than it may be likely the adversary can also corrupt other aspects of the platform to extract keys. In any case the flawed RNG threat can be dealt with by including some entropy from other participants assuming that they do not act in an adaptive malicious manner (if they do they can just leak the key to the adversary). I find it strange that KY03 argue that a single participant must not be burdened, when this results in proposed protocols that burden all participants to a great extent instead.
Around 2000 a number of works spring up that attempt to extend key agreement to dynamic membership setting, including [STW00] and [KPT00]. It is not at all clear to me whether those are in fact superior to running the key exchange multiple times, or even having a central party distributing keys.
Finally, Goldberg and others [GUGC09] propose extensions to OTR for a multi-user setting. These focus on deniability and signatures (and call a generic key exchange protocol) and a shared authentic transcript. This is a fine property, however I am a bit surprised the protocols are (a) so complex to establish ephemeral signatures and (b) so simple if they are to establish transcript consistency. My understanding is that they rely on the channel to offer consistent ordering, and then simply cryptographically ensure it was not tampered by an adversary — however I only read the paper obliquely.
Conclusion: It seems that a lot of the literature on group key exchange is based on the premise that the protocols need to be symmetric and contributory. Yet, I fail to see any justification beyond the fact that centralized schemes are simple and efficient, and no one could possibly write an academic paper about them. All schemes I have seem rely on the honest channel offering ordering, and being reliable. If that is not the case some of them detect it and hard fail (for example the integrity checks fail, with no hint that it is due to missing messages). This means that they assume some ordering happens on the outside of the crypto, which is dubious without some leader election. Few works have dealt with how you determine the group, which would either go the admin way or the voting way (can of worms).
[BD94] Mike Burmester, Yvo Desmedt: A Secure and Efficient Conference Key Distribution System (Extended Abstract). EUROCRYPT 1994: 275-286
[KY03] Katz, Jonathan, and Moti Yung. “Scalable protocols for authenticated group key exchange.” Advances in cryptology-CRYPTO 2003. Springer Berlin Heidelberg, 2003. 110-125.
[BW98] Becker, Klaus, and Uta Wille. “Communication complexity of group key distribution.” Proceedings of the 5th ACM conference on Computer and communications security. ACM, 1998.
[KPT00] Kim, Yongdae, Adrian Perrig, and Gene Tsudik. “Simple and fault-tolerant key agreement for dynamic collaborative groups.” Proceedings of the 7th ACM conference on Computer and communications security. ACM, 2000.
I was very honored to be invited to Asiacrypt 2013 to present some of our work on privacy-friendly computations. It was an opportunity to consolidate a presentation that includes an overview of privacy-friendly billing and aggregation for smart metering. The slides of the presentation are available in Powerpoint 2012 format (and an older ppt format).
The key references providing more technical details on smart-metering privacy are:
- Alfredo Rial, George Danezis: Privacy-preserving smart metering. WPES 2011: 49-60
- Klaus Kursawe, George Danezis, Markulf Kohlweiss: Privacy-Friendly Aggregation for the Smart-Grid. PETS 2011: 175-191
- George Danezis, Markulf Kohlweiss, Alfredo Rial: Differentially Private Billing with Rebates. Information Hiding 2011: 148-162
- George Danezis, Benjamin Livshits: Towards ensuring client-side computational integrity. CCSW 2011: 125-130
- Andres Molina-Markham, George Danezis, Kevin Fu, Prashant J. Shenoy, David E. Irwin: Designing Privacy-Preserving Smart Meters with Low-Cost Microcontrollers. Financial Cryptography 2012: 239-253
- Gilles Barthe, George Danezis, Benjamin Grégoire, César Kunz, Santiago Zanella Béguelin: Verified Computational Differential Privacy with Applications to Smart Metering. CSF 2013: 287-301
- Cedric Fournet, Markulf Kohlweiss, George Danezis and Zhengqin Luo. ZQL: A Compiler for Privacy-Preserving Data Processing. USENIX Security Symposium, Washington DC, 2013.
- George Danezis, Cedric Fournet, Markulf Kohlweiss and Santiago Zanella-Beguelin. Smart Meter Aggregation via Secret-Sharing. ACM SEGS 2013: Smart Energy Grid Security Workshop, Berlin, 2013.
- Carmela Troncoso, George Danezis, Eleni Kosta, Josep Balasch, Bart Preneel: PriPAYD: Privacy-Friendly Pay-As-You-Drive Insurance. IEEE Trans. Dependable Sec. Comput. 8(5): 742-755 (2011)
- George Danezis, Markulf Kohlweiss, Benjamin Livshits, Alfredo Rial: Private Client-Side Profiling with Random Forests and Hidden Markov Models. Privacy Enhancing Technologies 2012: 18-37
28 July 2011
Privacy-friendly Aggregation for the Smart-grid
Klaus Kursawe (Radboud Universiteit Nijmegen) and George Danezis and Markulf Kohlweiss (Microsoft Research)
Privacy in for smart electricity provision seems to be a rising topic, and this year there is a whole session on it at PETS 2011. The first paper (one which I am a coauthor) looks at the problem of gathering aggregate data from groups of smart meters, without allowing any third party to get the the individual measurements. This can be applied as a PET to solve real-world problems such as fraud detection, leakage detection, load estimates, demand response, weather prediction — all of which only require aggregate data (sometimes in real time), not individual measurements.
The key challenge to providing a private aggregation protocols are the specific constraints of smart meters. They are cheap devices, with modest resources, hardly any bandwidth, no ability to communicate, etc. Two specific protocols are presented: the first one allows to compare the sum of meter readings with a reference number (maybe measured from a feeder meter). This protocol allows for fancy proofs of correctness, but it slow in terms of computation and bandwidth (it requires public key operations for each reading). The second protocol is extremely fast and has no communication overhead. In both cases a pragmatic approach to the threat model is followed: we assume that the utilities will be honestly defining groups of meters and facilitating the key management protocol — for the second protocol there is no overhead of public key operations after the initial key setup.
The key highlight from this work is not as much its technical depth (tricks with DC networks and hash function that would not surprise any PETS regular). What is interesting is that the protocols were designed for a real industrial application and now fully integrated on real smart meters and their communication protocols in collaboration with our collaborators at Elster.
Plug-in privacy for Smart Metering billing
Marek Jawurek, Martin Johns, and Florian Kerschbaum (SAP Research)
This second paper looks at the problem of billing for fine-grained time of use tariffs — their energy consumption at different times costs a different rate per unit. This is a very important topic, as correct billing and time of use tariffs are a key driver of fine-grained data collection through smart meters — if we can do billing privately then maybe less personal information may be collected.
Technically the protocols proposed are based on the homomorphic properties of Pedersen commitments: readings are commitments, and you can use multiplication by a constant and addition to compute the bill, and (most importantly) prove that it is correct. The system model is that the meter outputs signed commitments of readings, a privacy component computes the bill and proofs of correctness, and those are sent to the supplier for verification (and printing the bills!).
This is the core of a nice solution for the basic billing case (which is likely to be the common one in smart grids). We have shown in related work that the protocol can be further improved to have zero communication overhead. Since it avoids expensive zero-knowledge proofs it is fast for its proofs and verification. It also provides the basic infrastructure to support further more expressive billing policies and general computations.
27 July 2011
Quantifying Location Privacy: The Case of Sporadic Location Exposure
Reza Shokri and George Theodorakopoulos (EPFL), George Danezis (Microsoft Research), and Jean-Pierre Hubaux and Jean-Yves Le Boudec (EPFL)
This work evaluates the privacy of using location-based services sporadically using a set of location privacy mechanisms. Sporadic services include those that require location infrequently, rather than continuously (think of restaurant suggestions rather than relaying real-time GPS streams). The key novelty of the approach is that the model of location exposure, as well as privacy protection is very general. It encompasses anonymization, generalization and obfuscation of location, use of fake traffic and suppression of location. In turn the analysis relies on advanced models of location and mobility (based on markov chains) and is based on Bayesian inference. The evaluation of different location privacy techniques is done on real-world traces from SF taxis.
I am one of the authors of this work, so of course I think it is awesome! More seriously, it is one of the fist works to combine under a common framework a multitude of location privacy mechanisms, and provide a common evaluation framework for them, to quantify the degree of protection they offer relatively to each other for different adversaries. It is also one of the first systematic applications of Bayesian inference to analyze location privacy — extending the inference paradigm beyond the analysis of network anonymity systems.
Of course this is not the last word. Only a subset of protection techniques and combination of techniques were look at, and other protection mechanisms can be integrated and evaluated in the same framework (the tracing model and threat model can be unchanged). Secondly, the analysis itself may be augmented with side-information — be it commercial transactions or traces of network traffic — that may be giving some information about location, to increase the capabilities of the adversary (or make them more realistic). The model we use, based on markov chains, has the benefit of giving analytically tractable results, but numerical techniques may be used to extend it to be more true to real-life attacks.
Privacy in Mobile Computing for Location-Sharing-Based Services
Igor Bilogrevic and Murtuza Jadliwala (EPFL), Kubra Kalkan (Sabanci University), Jean-Pierre Hubaux (EPFL), and Imad Aad (Nokia)
This paper looks at applications where users need to share their location. For example, two users may want to find out if they are close to each other or where they should meet in order to share a taxi ride. Yet, those users do not want to leak any of their location information to the other users or the service provider. More specifically two users specify a set of ranked prefered location they could meet and the system needs to determine on of those fairly without revealing the current location or other preferences (except the one chosen to meet). This is called the fair rendez-vous problem.
The key contribution of this work is to show that this problem can be set with a set of concrete cryptographic protocols. It also presents an implementation of these algorithms on a real mobile phone to show that it is practical. The cryptographic computations are based on homomorphic encryption schemes as well as interactions with the service (to do multiplication that is not possible with Paillier). The implementation on a mobile phone takes a few seconds on the client and the server, and is paralelizable in the number of users. Untypically, the authors also did a user study: users were asked what their concerns were, and after using the application of the phone they were asked how usable it was, and whether they appriciated the privacy provided by the application.
This is a really nice example of a privacy application, that applies advanced crypto, but also evaluates it on a real platform for performance as well as users’ reaction to it. The obvious extensions to this work would be to generalize it to more complex rendez-vous protocols, as well as other location sharing applications. It is good to see that modern mobile devices can do plenty of crypto in a few seconds, so I am very hopeful we will see more work in this field.
On The Practicality of UHF RFID Fingerprinting: How Real is the RFID Tracking Problem?
Davide Zanetti, Pascal Sachs, and Srdjan Capkun (ETH Zurich)
This paper looks that UHF tags — they are the dumb tags that can be read at about 2m that are attached to things you buy to facilitate stock management or customer aftercare. Interestingly this study looks at how identifiable the tags are at the physical layer, not using the actual tag ID! Therefore these techniques may bypass any privacy protection that attempt to prevent access to the tag ID. It turns our that one can build a unique and reliable ID for a tag from its physical characteristics that can be used to trace people as they move around.
What is new about this work is that the focus was on practicality and cost of extracting a reliable fingerprint (previous approaches relied on expensive equipment and laboratory conditions). The solution was implemented using a cheap software radio (USRP2 device + PC).
I am not quite sure what to conclude from the evaluation on the quality of the fingerprint. It seems that an adversary can place tags within one of 83 to 100 groups. Is this really a good results or not? I guess it depends on the application and the density of tags. Of course if more than one tag is carried, then the adversary could combine fingerprints to identify individuals more easily — if you carry 5 tags you have a 20 bit IDs. Interestingly, there is extensive evaluation of the stability of the tag to temperature and mobility — it turns out that these factors do affect the quality of the fingerprint and further reduce the effective number of unique IDs that can be extracted (down to about 49 classes).
It would be interesting to combine this attack vector with the ideas from the first paper (pretending that the short physical IDs are a version of a privacy protection system) to evaluate the effectivness of tracing a set of individual throughout town.
27 July 2011
I am currently sitting at the PETS 2011 symposium in Waterloo, CA. I will be blogging about selected papers (depending on the sessions I attend) over the next couple of days — authors and other participants are welcome to comment!
The first session is about data mining and privacy.
“How Unique and Traceable are Usernames?”
Daniele Perito, Claude Castelluccia, Mohamed Ali Kaafar, and Pere Manils (INRIA)
The first paper looks at the identifiably of on-line usernames. The authors looked at user names from different sites and assess the extent to which they can be linked together, as well as link them to a real person. Interestingly they used Google Profiles as ground truth, since they allow users to provide links to other accounts. First they assess the uniqueness of pseudonyms based on a probabilistic model: a k-th order markov chain is used to compute the probability of each pseudonym, and its information content (i.e. -log_2 P(username)). The authors show that most of the usernames observed have “high entropy” and should therefore be linkable.
The second phase of the analysis looks at usernames from different services, and attempts to link them even given small modifications to the name. The key dataset used was 300K google profiles, that list (sometimes — they used 10K tuples of usernames) other accounts as well. They then show that the Levenshtein distance (i.e. edit distance) of usernames from the same person is small compared to the distance of two random user names. A classifier is built, based on a threshold of the probabilistic Levenshtein distance, to assess whether a pair of usernames belongs to the same or a different user. The results seem good: about 50% of usernames are linkable with no mistakes.
So what are the interesting avenues for future work here? First, the analysis used is a simple threshold on the edit distance metric. It would be surprising if more advanced classifiers could not be applied to the task. I would definitely try to use random forests for the same task. Second, the technique can be used for good not evil: as users try to migrate between services, they need an effective way to find their contacts — maybe the proposed techniques can help with that.
“Text Classification for Data Loss Prevention” (any public PDF?)
Michael Hart (Symantec Research Labs), Pratyusa Manadhata (HP Labs), and Rob Johnson (Stony Brook University)
The paper looks at the automatic classification of documents as sensitive or not. This is to assist “data loss prevention” systems, that raise an alarm when personal data is about to be leaked (i.e. when it is about to be emailed or stored on-line — mostly by mistake). Traditionally DLP try to describe what is confidential through a set of simple rules, that are not expressive enough to describe and find what is confidential — thus the authors present a machine learning approach to automatically classify documents using training data as sensitive or not. As with all ML techniques there is a fear of mistakes: the technique described tries to minimise errors when it comes to classifying company media (ie. public documents) and internet documents, to prevent the system getting on the way of day to day business activities.
The results were rather interesting: the first SVN classifier looked at unigrams with binary weights to classify documents. Yet, it first had a very high rate of false positives for public documents. It seems the classifiers also had a tendency to classify documents as “secret”. A first solution was to supplement the training set with public documents (from wikipedia), to best described “what is public”. Second, the classifier was tweaked to (in a rather mysterious way to me) by “pushing the decision boundary closer to the true negative”. A further step does 3-category classification as secret, public and non-enterprise, rather than just secret and not-secret.
Overall: They manage to get the false positive / false negative rate down to less than 2%-3% on the largest datasets evaluated. That is nice. The downside of the approach, is common to most work that I have seen using SVNs. It requires a lot of manual tweaking, and further it does not really make much sense — it is possible to evaluate how well the technique performs on a test corpus, but difficult to tell why it works, or what makes it good or better than other approaches. As a resut, even early positive resutls should be considered as preliminary until more extensive evaluation is done (more like medicine rather than engineering). I would personally like to see a probabilistic model based classifier on similar features that integrates the two-step classification process into one model, to really understand what is going on — but then I tend to have a Baysian bias.
P3CA: Private Anomaly Detection Across ISP Networks
Shishir Nagaraja (IIIT Delhi) and Virajith Jalaparti, Matthew Caesar, and Nikita Borisov (University of Illinois at Urbana-Champaign)
The final paper in the session looks at privacy preserving intrusion detection to enable cooperation between internet service providers. ISPs would like to pool data from their networks to detect attacks: either because the volume of communications is abnormal at certain times, or because some frequency component is odd. Cooperation between multiple ISPs is important, but this cooperation should not leak volumes of traffic at each IPS since they are a commercial secret.
Technically, privacy of computations is achieved by using two semi-trusted entities, a coordinator and key holder. All ISPs encrypt their traffic under an additive homomorphic scheme (Paillier) under the keyholder key, and send it to the coordinator. The coordinator is using the key-holder as an oracle to perform a PCA computation to output the top-n eighen vectors and values of traffic. The cryptographic techniques are quite but standard, and involve doing additions, subtraction, multiplication, comparison and normalization of matrices privately though a joint private two-party computation.
Surprisingly, the performance of the scheme is quite good! Using a small cluster, can process a few tens of time slots from hundresds of different ISPs in tens of minutes. A further incremental algorithm allows on-line computations of eighen vector/value pairs in seconds — making real-time use of the privacy preserving algorithm possible (5 minutes of updates takes about 10 seconds to process).
This is a surprising result: my intuition would be that the matrix multiplication would make the approach impractically slow. I would be quite interested to compare the implementation and algorithm used here with a general MPC compiler (under the same honest-but-curious model).
6 July 2011
Shishir Nagaraja has pointed out that our Drac anonymity system is not the first one to consider an anonymity network overlayed on a social network. The performance versus security of routing messages over a social network was already considered in his work entitled ‘anonymity in the wild’.
This is important prior work and we should have cited it properly. It presents an analysis of an anonymity provided by different synthetic social network topologies, as well as real-world data from LiveJournal.
30 March 2011
My team at Microsoft research has spent the past 6 months grappling with the problem of privacy in next generation energy systems. In parallel with the good honest scientific work I also participated in the UK government consultation on smart metering, in writing and in person, specifically on the issue of privacy. Its conclusions have finally been made public (see DECC’s site and Ofgem’s detailed responses).
First, what is the problem? Smart-meters are to be fitting in most homes, and they provide facilities for recording fine-grained readings of energy consumption. These are to be used for time of use billing, energy advice, the backend settlement process, financial projections of suppliers, fraud detection, customer service, and network management. The problem is that these readings are also personal data, and leak information about the occupancy of households, devices used, habits, etc. So here we have a classic privacy dilemma: where to strike the balance between the social value of sharing data (even mandating such sharing) versus the intrusion to home life?
Or do we? As it is often the case when privacy is framed as a balance, what is ignored is that we can use technology to achieve both privacy and extract value from the data. In fact we show no balancing act is necessary. We designed a host of privacy technologies to fulfill the needs of the energy industry (even the rather exotic ones) while preserving extremely high levels of privacy and user control. Lets look at them in detail:
- We developed a set of protocols to perform computation on private data while maintaining a high degree of integrity and availability. This allows customers to calculate their bills, provide indicators of consumed energy value to be used in settlement, routing demand response requests, and do profiling to support network operation or even marketing. Our framework guarantees that the computations only leak their results to third parties, and also that those results are in fact derived from the real meter readings. The raw meter readings are not necessarily shared, but can be used locally on any user client to offer a rich experience — i.e. pretty graphs of consumption and comparison with their neighbours. A non technical overview is available as a white paper, a technical introduction for meter manufacturers is provided, and a preliminary technical report with all the crypto is also online.
- Sometimes it is important to aggregate information from multiple meters without revealing anything about individual readings. The traditional approach has been to give all readings to a trusted third-party that performs the aggregation and only publishes the sum. We show that a set of meters can in fact perform the aggregation without the need for a trusted party. This is simple, efficient and compact — the computations can be done inside the trusted meter or outside along with cryptographic verification. All details are available in our technical report on aggregation.
- Some smart-meters may be deployed in extremely high-security settings. In such places leaking even the final bill or statistics aggregated over time may leak information and a positive guarantee that the information leakage is limited might be necessary. We developed techniques inspired from differential privacy to inject noise to aggregate readings that guarantee any specific time period consumption is masked. Further more we allow customers to recuperate the bulk of the costs though an oblivious cryptographic rebate system. Our technical report on differential privacy and rebates in metering is available.
- Finally proving that protocols are correct is not sufficient, so we explore options for proving actual implementation of the protocols are in fact providing the necessary security and privacy properties. A report on the certified implementation of a variants of the proposed protocols using refinement types is also available.
The project web-page on privacy in metering links to all those any more.
So much about the science, what about the engagement with government. On the positive side, our rather limited goal has been achieved: we wanted to put privacy technologies, that provide solutions beyond the dilemmas and balance between privacy and value, on the map. The government response to the consultation takes note, in a limited way, of the potential use of privacy technologies. On page 10 it shyly mentions that:
“2.18. Work is in process to understand the options for aggregating or anonymising smart metering data and whether it is necessary for the data to be accessed by a party that carries out the data minimisation. Privacy enhancing technology can potentially enable anonymised or aggregated data to be provided without any party having access to the personal data itself. The programme will work with industry and academics in order to explore the applicability of privacy enhancing technologies within the smart metering system.”
This is actually a rather fair representation of the capabilities of the technology, even if it is presented as a far away goal, rather than the concrete protocols we have proved correct and the implementations we have built.
Paragraph 2.18 mentioning privacy technology is a ray of light amidst an otherwise ambivalent government response. On the up side it recognizes energy consumption as private data from the onset, it mandates meters to hold 13 months of consumption and provide local access to it, it defines narrowly the scope of data that can be gathered without explicit consent and puts them under the data protection regime. On the down side there is confused language about what constitutes personal data (2.17), and the final technical solution involves collecting data in clear through a centralized systems (the glorious DCC) and protecting it using access control — a far cry from what we know possible in terms of technical privacy protection.
The metering privacy geeks (legal & technical) might also find other interesting nuggets in this report:
- It mentions privacy-by-design, but without support for privacy technologies (except a mention of aggregation in 2.14). This is a damaging trend set by the Ontario report on privacy in the smart grid that takes a purely management approach to privacy in the local smart grid deployment. A response to this trend is provided by Prof. Claudia Diaz and her colleagues that highlights the technical protections necessary to engineer privacy-by-design. This is only the start of this tussle.
- The report seems to suggest that personal data is not personal if it is not readily identifiable by the data controller (sect. 2.17 and 3.7). This is the classic argument of “what is de-identified personal data”. Does it mean the data controller cannot identify it, or anyone in the world? It seems the government is as confused as everyone else on this matter.
- The key outcome of the consultation is that the energy industry needs some data to perform “regulated duties”. This concept was present in the initial consultation, but funnily enough there was no description of that those duties were. It transpired in meetings that Ofgem was not in fact clear about what they were, and a large part of the consultation centered around fleshing those out. A list of those duties is available in Appendix 3 of the report, and is probably welcome by all (a similar list is available in the NIST privacy reports).
- So (in 3.15) the government concedes that industry must have access to the data necessary to perform its regulated duties by default, yet this data should be subject to the DPA requirements (3.16 for example specifically calls principle 5 — that the data should not be kept longer than necessary). Well that is a mine field: it is clear that the data is collected for a specified purpose (principle 2). If the other principles are also applied it means that it should not be used without explicit consent for other purposes (*cough*added value services*cough*) and furthermore it should not be excessive for the stated purpose. Well here we are: our technical reports offer ways in which most of the stated purposes in appendix 3 could be fulfilled without collecting the data. Is this a contradiction? Not automatically. The government’s view is clearly that our proposed protocols are not yet ready for prime time — of course as these technologies become better known and deployed this objection will evaporate. Will the data minimization requirement then mandate the use of privacy technologies? This is a rhetorical question at the moment.
- It is interesting to note that the restrictions associated with limiting the automatic collection of data by suppliers was possibly set in place on the grounds of market competition rather than privacy per-se (section 3.32). Automatic collection by suppliers would put them in an advantageous position vis-a-vis third-party providers of value added services. This is an open issue (3.36).
- The government is keen for a local repository of consumption data in the meter (4.6) and the use of geeky toys to visualize it (4.12). This is the setting in which our solutions enable strong privacy guarantees. That is positive, if only half-way.
In conclusion, the debate around privacy in metering has been informed by consumer concerns, privacy concerns, industry needs and technology alternatives. They are all represented in the government response. Yet the final solution is rather conservative: it relies on a centralised conduit for personal information protected by access control layers and management layers. It is far from what we know possible with privacy technologies. The argument today is that those technologies are too new — which is questionable given how quickly IT inovations are brought to market. This argument will lose its potency in the long term if we keep developping and deploying privacy firendly solutions.