Machiavelli Confronts 21st Century Digital Technology
6 January 2010
This is the title of the paper resulting from the interdisciplicary collaboration between computer scientists and social scientists, last November in Dagstuhl. The full version is available on SSRN at:
Machiavelli Confronts 21st Century Digital Technology: Democracy in a Network Society
by Walter S. Baer, Nikita Borisov, George Danezis, Seda F. Guerses, Marek Klonowski, Miroslaw Kutylowski, Ursula Maier-Rabler, Tal Moran, Andreas Pfitzmann, Bart Preneel, Ahmad-Reza Sadeghi, Thierry Vedel, Tracy Westen, Filip Zagorski, and William H. Dutton.
The topic of the seminar was “Network Democracy” and for five days, we discussed tools for representation, direct democracy, power, trasnparency and democratic institutions. This was a refreshing break form the traditional “e-voting = e-democracy” caricature.
The gap between computer and social scientists was initialy wide, and for a few days we concentrated on formulating questions that communities want to ask each other (see appendix 1). A few examples include:
- Computer to social scientists about Conflicting Values. What are prime examples where democratic values come in conflict with each other? What types of conflicts are inherent in democratic systems? Is the integrity of technical systems the key requirement for edemocracy solutions? Is it more important than privacy? Is availability more important than both? What are the social dangers for democracy in a network society?
- Social to computer scientists about Privacy and Surveillance. How will future technologies enable all branches of government to discover what citizens and other residents are doing, thinking and saying? To what extent can existing and new privacy and security technologies limit the government’s ability to know more about the public than the public wants to reveal? Can privacy technologies help both enhance and protect the democratic process (e.g. by preventing widespread disclosure of the names of persons signing petitions in a way that could lead to subsequent harassment because of their support of a controversial measure – at the same time as allowing dissemination of information that the wider public would like to know, such as how many people signed the petition and their broad demographic characteristics, but not their individual identities)?
One of the most insightul remarks, and by far my favorite:
“Technologies may be used to cement existing power relations or offer merely an ineffectual ‘play democracy’. Technologies may disadvantage certain groups and worsen power imbalances (e.g. some types of surveillance technologies). Political forces may seek widespread deployment of such technologies or try to limit their use.”
In real time from CCSW09: more website fingerprinting
13 November 2009
I am currently listening to the presentation of:
- Dominik Herrmann, Rolf Wendolsky, Hannes Federrath, “Website Fingerprinting: Attacking Popular Privacy Enhancing Technologies with the Multinomial Naive-Bayes Classifier” ACM CCSW 2009. Chicago, US.
This work looks at the classic problem of identifying browsed pages through SSL encryption or low-latency anonymization networks like Tor. The authors cast the problem of identifying web pages as a supervised learning and classification problem. Instead of applying a jacard coefficient or naive bayes classifier, the authors use a multinomial naive Bayes classifier to learn and identify the pages from their sizes. Furthermore they use filtering on the raw page sizes inspired from information retrieval — such as using the logarithms of the sizes to avoid large bias, or cosine normalisation to make the overall size of the website irrelevant.
That is yet another fine example of casting a traffic analysis problem in terms of machine learning and Bayesian inference.
A very interesting comparison is presented on how different single-hop and multi-hop anonymity systems compare. Single Hop systems are broken with probability over 95%. Multi-hop systems are pretty robust with JonDoNym traffic only being identified 20% of the time, and Tor about 3% of the time. Why is Tor doing so well? It turns out that the cell quantization does add noise, and mess up the attack. It is not clear if this is a fundamental finding, or whether a better / more specific attack could do better.
Datasets should be available soon. Sadly I cannot find the paper on-line. Put your work on the web people!
Anonymity at CCS’09
11 November 2009
I am currently sitting in the anonymity techniques session of ACM CCS 2009 in Chicago, and thought I might as well provide a roadmap to what is being presented here regarding traffic analysis and anonymity. Aside the main event, WPES was also hosted on Monday and contained quite a few relevent papers.
- On the risks of serving whenever you surf: Vulnerabilities in Tor’s blocking resistance design
Jon McLachlan and Nicholas J. Hopper (WPES)
It turns out that altruism does not pay: if you become a relay when also using an anonymity network, your attack surface grows and novel traffic analysis attacks are possible to guess what you are surfing. This paper presents a long term intersection attack, where the time you are on-line can be used to correlate observed browsing events (like tweets), as well as an extension of what has now been named clogging attacks, where traffic can be modulated on the Tor client, and observed on a remote server. Very cool! This paper seems to be part of Nick Hopper’s group new strategy, of presenting an attack paper at WPES, and a solution at CCS (this is understandable as he has been penalised, by not being able to publish at PETS 2010, as he is aPC chair). Part of the solution against this attack is presented in “Membership-concealing overlay networks” later in the conference.
- XPay: Practical anonymous payments for Tor routing and other networked services (not on-line!)
Yao Chen, Radu Sion and Bogdan Carbunar (WPES)
A controversial payment mechanism for paying routers that relay traffic in Tor. More details will follow when I manage to get a copy of the paper, as the proceedings of WPES are not available yet (*cough*cock-up*cough*).
- Hashing it out in public: Common failure modes of DHT-based anonymity schemes
Andrew Quoc Tran, Nicholas J. Hopper and Yongdae Kim (WPES)
Many have tried to build anonymity systems over DHTs, providing some of us an endless supply of systems to perform traffic analysis. This work looks at pretty much most of the DHT based anonymity systems, and points out that queries in DHTs are easy to observe, manipulate and DoS given only a small number of corrupt nodes. This is the attack paper that motivated the Torsk system, that will be presented later in the CCS conference.
- NISAN: Network Information Service for Anonymization Networks (not on-line?)
Andriy Panchenko, Arne Rache and Stefan Richter
Directory services are a bit of a drag on anonymity systems, and NISSAN is a proposal to use a DHT to distribute the directory functionality. Special care is taken to ensure the DHT routing resolution mechanism is not susceptible to an active adversary that controls a fraction of nodes in the DHT, to get truly random nodes. The authors also controversially argue that the observability of queries is not important, and suggest that bridging and fingerprinting is not a big deal so far. DHTs are scary, and NISSAN is likely to come under close scrutiny in the next few years (as long as the authors put the paper on-line!)
- Certificateless Onion Routing
Dario Catalano, Dario Fiore and Rosario Gennaro
The paper proposes the use of certificateless encryption to build circuits in onion routing (or mix) systems. It is unclear how useful this is, as the list of nodes already has to be signed to avoid directory and sybil attacks, but having different options to do crypto in anonymity networks is always interesting.
- ShadowWalker: Peer-to-peer Anonymous Communication using Redundant Structured Topologies
Prateek Mittal and Nikita Borisov
After many attack papers, Nikita and Prateek decide to jump in the deep end, and propose their own DHT based path construction mechanism for anonymity networks. ShadowWalker is a system that allows secure sampling in a DHT. It relies on commitments on routing tables shared amongst some “shadow” nodes to ensure the adversary cannot adaptively lie about their routing tables, and a tit-for-tat strategy to avoid DoS attacks. Very cool ideas, and a very cool name.
- The Bayesian Traffic Analysis of Mix Networks
Carmela Troncoso and George Danezis
Given a trace of a mix network, what is the probability of Alice talking to Bob? This seemingly simple question turns out to be remarkably hard to answer given the constraints of path selection and a real-world trace of traffic. More details on how to do this are also available in our report: The Application of Bayesian Inference to Traffic Analysis. I will write a post about how I hope this will become the standard way in which to assess the passive information leakage of anonymity networks.
- AS-awareness in Tor Path Selection
Matthew Edman and Paul Syverson
The story so far was that a bigger Tor network is a more secure Tor network. This paper points out it is not so simple: despite having more nodes, Tor still routes most traffic over a small number of AS (Autonomous Systems), that can see both ends of the connection, and should be able to do timing attacks. A method to chose paths to mitigate this problem is proposed — more would be welcome.
- Membership-concealing overlay networks
Eugene Vasserman, Rob Jansen, James Tyra, Nicholas Hopper and Yongdae Kim
Finding out remotely who is using an anonymity network might lead to trouble for some users. This work proposes a method to anonymize traffic, in a peer-to-peer manner, but without ever connecting to many strangers. Its very nice to see the idea of social network being used for infiltration resistance taking off.
Papers to come tomorrow:
- A New Cell Counter Based Attack Against Tor
Zhen Ling, Junzhou Luo, Wei Yu, Xinwen Fu, Dong Xuan and Weijia Jia
- Scalable Onion Routing with Torsk
Jon McLachlan, Andrew Tran, Nicholas Hopper and Yongdae Kim
“Privacy” for the UK 2011 census?
17 August 2009
The UK goes every ten years through a national census, where every household is called to fill in details about their demographics, habits, travel and income. The next one will be the UK 2011 census.
The office for national statistics has a statutory duty to ensure that the data released from this census cannot be used to identify any individual or to infer any of unknown attribute. Techniques for doing so are called statistical disclosure control, and have been the subject of intense study for the last 40 years at least. One could never have guessed by reading the documents on confidentiality for the next UK census.
To make a long story short: the ONS never considered modern well defined notions of privacy, it lacks a reliable evaluation framework to establish the degree of risk of different methods (let alone utility), and has proposed disclosure control measures that fall rather short of the state of the art.
Moving households around (a bit)
The consultation is not totally over yet, but the current favorite after two rounds of evaluation seems to be a technique called “Record Swapping”. How does it work? The technique takes the database of all responses to the census and outputs another database, that is sufficiently different to avoid identification and inference. Record swapping first categorises all records by the household size, sex, broad age, and hard-to-count variables. Then it selects 2-20% of the records, and each of them are paired with a record from the same category. Then the geographical data of each pair of records (yes, right, only the geographical data) are swapped.
This procedure has the effect to disperse geographically the population a bit so that, it is not possible to know whether single cells in tables are indeed providing information about an individual in a region or, whether they are the product of a swap from a different region. The advantage is that the totals are the same (since swapping things around is invariant to addition), the swaps are with “similar” households, and the procedure is simple to implement.
This is in-line with the definition of privacy of the census office, namely that:
“The Registrars General concluded that the Code of Practice statement can be met in relation to census outputs if no statistics are produced that allow the identification of an individual (or information about an individual) with a high degree of confidence. The Registrars General consider that, as long as there has been systematic perturbation of the data, the guarantee in the Code of Practice would be met.”
Problems with “Record Swapping”
So far a whole process has been followed to evaluate a list of proposed disclosure control measures, present a methodolody to evaluate them, shortlist some, and perform more in-depth research about their utility and privacy. There is a lot of repetition in these documents, a few ad-hoc indicators of quality and privacy, and no security analysis what-so-ever about inference attacks on the proposed schemes. The subject of ” disclosure by differencing” is left as a suggestion for future work in the latest interim report, while the only method left on the list is Record Swapping, as well as ABS, that has apparently not been tested yet at all.
Why is that a problem? Records include many other potentially identifying fields aside from location. Since the rest of the record stand as it is, and is aggregated into tables, with a secret small cell adjustment technique, we cannot really be sure at all that there are no re-identification attacks. (Apparently revealing the details of the technique cannot be divulged for confidentiality reasons, violating even the most basic principle of security engineering! See page 3).
The utility measures used to assess how acceptable these disclosure control measures will be to data users (Shlomo et al.), are themselves very simplistic and do not offer very tight bounds on possible errors but I will leave this matter for the statisticians to blog about.
To make the problem worse, this time the ONS, is seriously thinking of allowing data users to submit their own queries to the database of statistics. The queries are not likely to be full SQL any time soon, but tables on 3 categories (called cubes) are likely to be allowed. This leaves wide open quite a range of attacks in the literature on inference in statistical databases.
At this point there is absolutely no evidence that the disclosure control scheme is actually secure, which in security engineering means that it is probably not.
How did we get to this situation?
It seems the bulk of the work on disclosure control has been done by the ONS, in conjunction with researchers from the University of Southampton. None of the authors of any of the evaluations has a substancial research experience in privacy technology or theoretical computer security that deals with these privacy matters in a systematic way.
What is revealing is the fact that the most relevant related work is never mentioned. It includes:
- The work of Denning on trackersand inference in statistical databases (1980). Instead the archaic term “differencing” is used.
- The work of Sweeney and Samarati on linkage attacks and k-anonymity (1997).
- The work of Dwork on Differential Privacy (2007), which is the most current and strongest definition of privacy for statistical databases.
These works show repeatedly that ad-hoc inference control measures, that only aim to suppress a handful of known and obvious attacks, are systematically bypassed.
Dwork in her work on Differential Privacy (that won the 2009 year’s PET Award) provides clear arguments on why simpler ad-hoc techniques cannot provide the same guarantee of privacy: their results can be aggregated with side information known to the adversary to facilitate inference. Differential privacy on the other hand guarantees that the results of a query to the database, or published table, reveals no more information when composed with other such queries or any side information.
This is a hot topic in research today, and all the details may not be ready for a census in 2 years time. This does not justify the ONS’s ignorance of this field.
Micah Sherr presented at PETS a few days ago his work on ”Scalable Link-Based Relay Selection for Anonymous Routing“. The key idea is that paths are generated by taking into account the network performance of each link to be used. The overhead of distributing performance information can be reduced by associating with each server a network coordinate, that allows to estimate the latency between pairs of nodes.
This is a pure path selection proposal, as quite a few have appeared in the past year to reduce latency, or increase node utilization in Tor. The question with all those proposals is: how much anonymity would these path selection strategies provide?
The methodology we present in “The Bayesian Traffic Analysis of Mix Networks” provides a way of answering such questions, by carefully modelling the path selection strategy. Applying the same methodology to these path selection proposals would be of clear benefit, and an excellent project for anyone interested in understanding better how to apply inference based techniques to traffic analysis.
Today morning at PETS 2009, the paper on “Physical Layer Attacks on Unlinkability in Wireless LANs” was presented. The idea is that despite all anonymization techniques at the logical layers, though IP address modulation, the physical location of the IEEE802.11 transmitters can be localised, and thus unlinkable packets emanating from it linked together.
The approach used to do this, uses signal strength and triangulation techniques, with a machine learning twist, to cluster together emissions and link them to the same transmitter. A set of countermeasures is also presented, where transmitters modulate their signal strength to foil this clustering.
The attacker model was restricted to using commodity hardware, so physical device fingerprinting attacks were not considered.
Bayesian traffic analysis
5 August 2009
In the last year, we have been developping a set of systematic techniques to analyse anonymity systems, to perform traffic analysis. These cast the problem of traffic analysis as a Bayesian inference problem, where the adversay observes some traces, according to a threat model, and then has to infer the hidden state of the system, that is equivalent to tracing who is talking to whom.
So far we have looked at the analysis of mix networks, the analysis of Crowds, and a Bayesian approach to long term intersection attacks. The papers describing each of these are available online:
- Carmela Troncoso and George Danezis.
The Bayesian Traffic Analysis of Mix Networks. (Draft)
ACM CCS 2009, Chicago, USA. - George Danezis, Claudia Diaz, Emilia Kasper, and Carmela Troncoso.
The wisdom of Crowds: attacks and optimal constructions.
ESORICS 2009, St Malo, France. - George Danezis and Carmela Troncoso.
Vida: How to use Bayesian inference to de-anonymize persistent communications.
PETS 2009, Seattle, USA.
In real time: privacy policies @ PET 2009
5 August 2009
I just sat thought the first session of PET2009, that was about privacy policies where two really interesting pieces of research were presented.
Ram presented a work on “Capturing Social Networking Privacy Preferences” [pdf], where he proposes to infer automatically privacy policies for social networks, and present them as templates or starting points for users to define their own policies. The methodology used is really neat: they record the location of a number of users, and every night they ask the users whether they would be happy to share their locations with different circles of theirs. Then they try to extract a set of standard policies, based on time, location, and the type of contact that can see your location.
The second study, presented by Aleecia, is on how easy and pleasing is to read privacy policies (“A Comparative Study of Online Privacy Policies and Formats“). They find that privacy policies in different formats are more or less easy to read and understand, but across the board privacy policies are difficult to understand, easy to misunderstand, and totally unpleasant to read.
A traffic analysis exercise
3 August 2009
In the 1950’s Lambros D. Callimahos built the Zendian Problem [zip] as an integrated exercise in traffic analysis, cryptanalysis, and communications intelligence operations. The state of cryptology today has moved on, beyond the point where an analyst can rely on plaintext to drive operations. The state of traffic analysis techniques, and the availability of more computing power, requires a new generation of exercises to sharpen the tools and minds of those in the field.
Steven Murdoch and myself have been developing, on (and mostly) off, over the past year an exercise in traffic analysis, and in particular the long term disclosure attacks. The exercise was first presented and used at the Brno FIDIS summer school, and we are now using it as part of an industrial training curriculum.
The exercise consists of an anonymized trace of communications, that were mediated by an anonymity system, that a group of people used to message each other. The message traces are synthetic, but generated based on a real-world social network. Users have favorite communication partners, talk more or less according to the type of relationship and time of day, and may reply to each others messages.
The goal is to apply any disclosure attacks, and de-anonymize and trace as many messages as possible. An oracle is provided that outputs the success rate, and the instructor’s pack includes the original messages as well as the scripts used to simulate the messaging behaviours and the anonymization layer. We tried to keep the exercise and success rates realistic, so do not expect to ever get 100% — significantly better than random is already quite good.
The richness of the messaging behaviour is designed to stress the most advanced statistical disclosure techniques, that make use of social network, replies, and perfect matchings. The literature on statistical disclosure can be found on the exercise page, and an example implementing the simple SDA is provided in the bundle. The family of Disclosure Attacks (devised by Kesdoganet al.) might also be modified and applied to the exercise. Our new attack soon to be presented at PET, using Bayesian Inference, could also be applied:
- George Danezis and Carmela Troncoso. Vida: How to use Bayesian inference to de-anonymize persistent communications. Privacy Enhancing Technologies Symposium (PETS 2009), Seattle, USA.
A couple of caveats: this is an exercise, to help people learn about long term traffic analysis attacks, and allow them to implement the attacks on a rich, but safe, dataset. The objective is to learn.
- It is not a benchmarking tool between attacks. We are not sure that the traffic patterns are typical enough to make sure that when an attack performs better in the setting of our exercise it would perform better on real data.
- It is also not a competition or test. We publish all of the hidden state for instructors, and the random number generator used was not cryptographically strong. The point is not who can get the highest score, but the quality of understanding of the attacks.
Caveats aside, I do hope that the exercise opens a discussion about how we can exchange training or live datasets, formats for evaluating traffic analysis attacks, and a level of standardisation to the interfaces of attack scripts. These will probably be topics for debate over the Privacy Enhancing Technologies Symposium 2009, next week.
Both myself and Steven would be very interested to hear your experiences with the exercise, either if you take it yourself, or give it to a class as an instructor. If you extend the exercise, or generate particular bundles of anonymized datasets, we would also be happy to host them.
A closer look at this year’s surveillance reports
28 July 2009
The annual reports from the Chief Surveillance Commissioner (2008-2009) and the Interception of Communications Commissioner (2008)just came out. They contain some interesting statistics, buried in the mist of boring self-congratulations on how wonderful the surveillance regime is in the UK.
First of all we get a bit of an idea on how, and how often, the RIPA part III powers to compel decryption or request keys, are to be used. It seems, from both reports, that any such request has to be approved by NTAC first, before anyone is served. Then a judge rubber-stamps the request that is served to an individual. These individual comply or go to jail, the theory goes. In the period 2008-2009:
- NTAC approved 26 applications to serve a decryption notice (and declined 1).
- A judge approved 17 notices (and zero were declined).
- 15 notices were served.
- 11 individuals failed to comply (the assumption is that 4 of them complied)
- 7 individuals were charged as a result of their failure to comply
- 2 individuals were convicted
What does all this add to? About 10% or less conviction rate for failing to comply with a notice (2 / 22, assuming 4 complied). It would of course be of interest to find out if any of those who complied were charged and convicted with any offences, or whether the requests are just keeping honest people honest.
It is a real pity more qualitative information is not provided about the specific cases that reached court, aside the fact that the powers were used to investigate counter terrorism, child indecency and domestic extremism. Finding how each case went would be quite worth while.
The appendix B of the Surveillance Commissioner has a rough breakdown of the authorisations for property interference as well as surveillance, by types of offence investigated. The trends, and changes, between this period (2008-2009) and the previous period (2007-2008) are very interesting, and again totally unexplained in the text of the report. Some highlights:
- Most of the authorisations for property interference are related to drugs offenses (63% in 2008-2009, and 60% in 2007-2008). That seems pretty stable, and is the single biggest category by an order of magnitude.
- We used to have a terrorism problem, with about 4.8% of property interference related to it in 2007-2008. It seems we have ran out of terrorism to investigate in 2008-2009, and now it only accounts for 0.6% of all cases of property interference. That is nearly an order of magnitude reduction.
- While terrorism is down, conspiracy investigations are up: 2.8% of authorisations related to it in 2008-2009, versus only 1.5% for the previous year. That may not be unrelated to the shift of looking at “domestic terrorism”, with the usual silly “conspiracy to cause a nuisance” charges.
- It is unclear where child indicency fits in any of these categories, despite requiring some property interference, presumably to raid people and seize their computers.
Similar trends are observed when it comes to intrusive surveillance authorised under RIPA Part II. Drugs are biger than anything else, terrorism is no more a pretext for surveillance (1 case!) and conspiracy is becoming popular with a serious increase of surveillance. The investigations of burglaries and robberies using surveillance and property interference is also up. About 2681 property interference authorisations were issued, and 384 intrusive surveilance authorisations were served in 2008-2009. (There were also 16118 directed surveillance authorisations.)
The interception of communication figures look relatively similar. In 2008 about the same number of warrants were issued or active under RIPA (2599 RIPA warrants) for intercepting communications. The fact that the numbers are of the same order of magnitude may suggest that the different authorisations are used as a “bundle” for particular cases. It might also be just a coincidence.
There are no specific figures about access to traffic data (under traffic data retention regimes) but it is estimated that out of all requests 80% concern subscriber information, e.g. who is behind this telephone number? This is in-line with previous statistics.
What about CHIS, the euphemism for Covert Human Intelligence Source, or more commonly known as a “snitch“? There were 3722 CHIS at the end of March 2009, and 4278 recruited in the year. This means that on average each CHIS is used for a bit less than a year. The variance can of course be significant.
Overall the pictured offered is that the UK is a really quiet place. With about 60 Million people and only about 3000-4000 cases requiring surveillance authorisations, let alone the laughable 26 applications to coerce decryption, there seems to be more rhetoric about serious crime, than there is serious crime. Of course there statistics exclude warrants obtained by MI5 and SIS, who are subject to a different oversight body, that is much less keen on publishing statistics. It is not unlikely that a lot of the terrorism and political crimes are investigated there.