Americans. Attitudes About Internet Behavioral Advertising Practices
Aleecia M. Mcdonald and Lorrie Faith Cranor (Carnegie Mellon University)

This is a very interesting paper on people’s attitudes to behavioural advertising. Researchers used a mix of a small-scale (14 people) study and a larger (100s of people) statistical study. A few findings are remarkable:

  • First, they see that users apply their intuition of off-line ads to the experience of on-line ads — many see on-line ads as a push mechanism and do not realise that data about themselves are collected. They seem to not object in general to the idea of advertising, and consider it as a fact of life, and even see it as ‘ok’ to support services.
  • The landscape of attitudes to behavioural advertising is fascinating. When faced with a description of what behavioural advertising collects, as a hypothetical scenario, and how it functions, a large percentage of users said this is not possible, and some of them even claimed it would be illegal. When it comes to attitudes towards receiving ‘better’ ads only 18% of them liked the idea for web-based services, and 4% for email based services (like hotmail & gmail). In general the authors found that a lot of extremely common practices cause “surprise”.
  • The researchers also looked at the formulation of the text of the NAI site, that offers an opt out from behavioural advertising. They find that what the system does is unclear, even after reading the page where the operation is described.

In general people prefer random ads rather than personal ads, with the exception of contextual ads (like books on on-line book stores). There is still a lot of ignorance about how technical systems work, and education when it comes to privacy and the ability to self-help themselves to protect privacy is clearly not working.

This research is pointing in the direction that the presumed tolerance of users to privacy invasion is due to ignorance of common practices. Once those practices are revealed it produces surprise, and even feeling of betrayal that will not be beneficial to any company and customer confidence.

The potential for abuse is a key challenge when it comes to deploying anonymity systems, and the privacy technology community has been researching solutions to this problem for a long time. Nymble systems allow administrators to blacklist anonymous accounts, without revealing or even knowing their identity.

What is the model: a user registers an account with a service, such as wikipedia. Then the user can use an anonymous channel like Tor, to do operations, like edit encyclopedia articles. This prevents identification of the author, and also bypasses a number of national firewalls that prevent users accessing the service (China for example blocks Wikipedia for some reason). If abuse it detected then the account can be blacklisted, but without revealing which one it was! The transcript of the edit operation is sufficient for preventing any further edits, but without tracing back the original account or network address of the user.

Nymble systems had some limitations. They either required trusted third parties for registration, or they were slow. A new generation of Nymble systems, including Jack, is now addressing these limitations: they use modern cryptographic accumulator constructions that have proofs of non-membership in O(1) time, to prove a hidden identity is not blacklisted. Jack can do authentication in 200ms, and opening a Nymble address in case of abuse in less than 30ms. This is getting real practical, and it is time that Wikipedia starts using this system instead of blacklisting Tor nodes out of fear of abuse.

Other Nymble systems: The original nymble | Newer Nymble | BLAC | Nymbler with VERBS | PEREA. Each of them offers a different trade-off of efficiency and security.

  • Using Social Networks to Harvest Email Addresses by Iasonas Polakis, Georgios Kontaxis, Eleni Gessiou, Thanasis Petsas, Evangelos P. Markatos and Spiros Antonatos (Institute of Computer Science, Foundation for Research and Technology Hellas)
  • Turning Privacy Leaks into Floods: Surreptitious Discovery of Social Network Friendships and Other Sensitive Binary Attribute Vectors by Arthur Ascuncion and Michael Goodrich (University of California, Irvine) (not on-line yet).

The first work by Polakis et al, looks at how easy it is to massively harvest email addresses using social networks and search engines to further use them as targets for spam. Furthermore, they attach to each email address contextual social information and personal information to produce more convincing spam and phishing emails. For this they used different techniques on three target platforms.

On facebook, they use the facility that allows users to find other by email to harvest personal information. This acts as an oracle to map harvested email addresses to real world names and demographic information. Once a facebook profile is linked, a very convincing fishing email can be crafted – including an email to ask to befriend them. (About 30% of users would befriend a stranger in that manner – a result not reported in the paper.)

A second vector of information is the use of nicknames that are constant across different sites. They use twitter to harvest pairs of (nicknames, email) and then further use the Facebook email to name oracle to link them to real world addresses. Finally, the authors use a google buzz feature to extract emails: every Buzz user ID is also their gmail address – this means that by searching buzz for particular words you can harvest gmail addresses as well as personal information of the users.

But how effective are the email harvesting techniques? How do you even assess this? The authors check the name and address they have harvested against an exact match with the name extracted from Facebook. The first technique yields about 0.3% correct addresses, the second 7%, and the final one 40%, showing that the techniques are practical when linking email to real names.

The second paper by Ascuncion et al. looks at how to aggregate information leaked by social networks to construct complete profiles of users on Social Networks. The aim is to reconstruct the friendship network as well as recovering attributes even if the privacy settings are used.

The techniques assume you can use an oracle to ask group queries against the social network site to check for a particular attribute. The objective is then to find a scheme that minimises the number of queries. It turns out there is a body of work on combinatorial group testing, including adaptive variants, that are readily applicable to this problem. This is not unlike our work on prying data out of a social network. Applying these techniques to social networks is even narrower allowing a lower number of queries to extract attributes (a logarithmic number of queries in the size of possible profiles, and linear in the number of profiles with a certain attribute to be extracted).

The attack is applied and validated by applying it to extract friends in Facebook, DNA sequences in mitochondrial databases, and movie preferences in the NetFlix database. These techniques are interesting as they are very general. At the same time it is likely that faster ways exist to extract specific attributes of users in real-world social networks, as there are strong correlation between attributes and the social structure of users.

I am just sitting in the first WPES10 talk:

Balancing the Shadows by Max Schuchard, Alex Dean, Victor Heorhiadi, Yongdae Kim, and Nicholas Hopper (University of Minnesota)

ShadowWalker is a peer-to-peer anonymity system designed by Prateek Mittal (who was our intern in 2008) and Nikita Borisov to prevent corrupt peers jeopardising the network. The authors of this new paper “Balancing the shadows” present an attack on the system, where a malicious coalition of nodes can compromise routing security and can bias the probability of choosing a malicious node as a relay. It turns out that a naïve fix opens the system instead to selective denial-of-service attack.

How does the eclipse attack on ShadowWalker work? The adversary controls a full neighbourhood of the network, i.e. a sequence of peers in the distributed hash table (DHT). This allows an adversary to corrupt the “shadow” mechanism in shadow walker. When Alice asks a malicious node in this neighbourhood about another node in the network, they can provide a false ID, along with a set of false shadows. This attack is not too bad on its own, except that the same mechanism is used during the construction of the routing tables of the DHT. As a result an adversary that controls about 10% of the nodes can corrupt about 90% of the circuits, after a few rounds of the protocol (this was backed by simulations).

How to fix the attack? Can we increase the number of shadows of each node that can testify of the correctness of its ID? It turns out this is not a good idea: the more shadows the higher the probability one of them is malicious. In that case they can maliciously refuse to attest honest nodes, effectively taking them out of the protocol. The authors propose to change the protocol to only require a fraction of shadows providing signatures to attest an ID-node relationship — time will show if this withstands attacks.

What do we learn from this: first the level of security in peer-to-peer anonymity systems is still questionable, as designs keep being proposed and broken on a yearly basis. Second, it highlights that DHT based designs inherit the characteristic that routing tables are designed as part of the protocol. This offers the adversary an opportunity of amplify their attacks. Designs should therefore not consider that the DHT is in an honest steady-state, but instead consider attacks at the time of network formation. Finally, it is worth keeping in mind that these systems try to prevent adversaries using a small fraction of malicious nodes (5%-20%) to compromise the security of a large fraction of the network. This is still far from our hope that peer-to-peer anonymity could withstand large Sybil attacks where the adversary controls a multiple of honest nodes.

I have been teaching a 4 hour course as part of the FOSAD 2010 summer school in Bertinoro, Italy, on high latency anonymous communications. Hence, I have developed my course from 3 years ago, and integrated newer material on Bayesian traffic analysis (including slides contributed by Carmela Troncoso). I also prepared a fifth lecture on low-latency anonymity system, such as Tor and Crowds, but my colleague Paul Syverson gave a separate 4 hour course just on this topic.

The structure of the 5 lectures is:

  1. Basic definitions & unconditional anonymity with DC-networks.
  2. Long-term attacks on anonymity systems (Statistical / Disclosure) and their Bayesian formulation.
  3. Mix networks and anonymity metrics.
  4. The Bayesian traffic analysis of mix networks.
  5. Low-latency anonymity with onion routing and crowds.

Slides are available in Office 2010 format, Office 97-03 format and PDF (but without the cute animations).

An article in the Greek newspaper “Eleftherotypia”on 11 May 2011, covers a worrying trend in surveillance practices of the Greek anti-terrorist squad since 2007. In multiple occasions the police has “lifted” the confidentiality of communications legal provisions, and has requested information about communications taking place within a whole region, for a window of time up to 12 hours. For example:

  • On 31-12-2008 they requested data for 3 hours (4am-7am) for a region of Athens, covering the polytechnic school (that is also covered by special privileges / asylum when it comes to freedom of speech) following an attack on a riot police van.
  • On 9-3-2009 a large region in Koukaki was targeted for 12 hours. (Map of two first regions: http://s.enet.gr/resources/2010-05/21-thumb-large.jpg)
  • Traffic data were collected for whole regions on 12-5-2009, 9-3-2009 (for 90 minutes), 25-11-2009, 18-2-2009 and 7-5-2007.

The article mentions telecommunications (voice and sms) in the first two cases (that might include content), while only mentioning traffic data for the last cases. Furthermore it points out that the selection of time, regions and targets and processing of the information collected happens in an unaccountable manner by officers. The blanket lifting of confidentiality is done under provisions for “state security”, but the article further points out, has now become routine. These practices are also linked with the Data Retention Directive (2006), that has not yet been translated into Greek law, making the legal context for surveillance requests and providers uncertain.

(Original in Greek: “Είμαστε όλοι ύποπτοι…” by ΧΡΗΣΤΟΣ ΖΕΡΒΑΣ)
http://www.enet.gr/?i=issue.el.home&date=11/05/2010&id=160930 

This comes as a surprise to me, since I always through that the criteria applied for conducting surveillance have to be tied to a network endpoint, or at least a person’s identity. 

This is the title of the paper resulting from the interdisciplicary collaboration between computer scientists and social scientists, last November in Dagstuhl. The full version is available on SSRN at:

The topic of the seminar was “Network Democracy” and for five days, we discussed tools for representation, direct democracy, power, trasnparency and democratic institutions. This was a refreshing break form the traditional “e-voting = e-democracy” caricature.

The gap between computer and social scientists was initialy wide, and for a few days we concentrated on formulating questions that communities want to ask each other (see appendix 1). A few examples include:

  • Computer to social scientists about Conflicting Values. What are prime examples where democratic values come in conflict with each other? What types of conflicts are inherent in democratic systems? Is the integrity of technical systems the key requirement for edemocracy solutions? Is it more important than privacy? Is availability more important than both? What are the social dangers for democracy in a network society?
  • Social to computer scientists about Privacy and Surveillance. How will future technologies enable all branches of government to discover what citizens and other residents are doing, thinking and saying? To what extent can existing and new privacy and security technologies limit the government’s ability to know more about the public than the public wants to reveal? Can privacy technologies help both enhance and protect the democratic process (e.g. by preventing widespread disclosure of the names of persons signing petitions in a way that could lead to subsequent harassment because of their support of a controversial measure – at the same time as allowing dissemination of information that the wider public would like to know, such as how many people signed the petition and their broad demographic characteristics, but not their individual identities)?

One of the most insightul remarks, and by far my favorite:

“Technologies may be used to cement existing power relations or offer merely an ineffectual ‘play democracy’. Technologies may disadvantage certain groups and worsen power imbalances (e.g. some types of surveillance technologies). Political forces may seek widespread deployment of such technologies or try to limit their use.”




I am currently listening to the presentation of:

  • Dominik Herrmann, Rolf Wendolsky, Hannes Federrath, “Website Fingerprinting: Attacking Popular Privacy Enhancing Technologies with the Multinomial Naive-Bayes Classifier” ACM CCSW 2009. Chicago, US.

This work looks at the classic problem of identifying browsed pages through SSL encryption or low-latency anonymization networks like Tor. The authors cast the problem of identifying web pages as a supervised learning and classification problem. Instead of applying a jacard coefficient or naive bayes classifier, the authors use a multinomial naive Bayes classifier to learn and identify the pages from their sizes. Furthermore they use filtering on the raw page sizes inspired from information retrieval — such as using the logarithms of the sizes to avoid large bias, or cosine normalisation to make the overall size of the website irrelevant.

That is yet another fine example of casting a traffic analysis problem in terms of machine learning and Bayesian inference.

A very interesting comparison is presented on how different single-hop and multi-hop anonymity systems compare. Single Hop systems are broken with probability over 95%. Multi-hop systems are pretty robust with JonDoNym traffic only being identified 20% of the time, and Tor about 3% of the time. Why is Tor doing so well? It turns out that the cell quantization does add noise, and mess up the attack. It is not clear if this is a fundamental finding, or whether a better / more specific attack could do better.

Datasets should be available soon. Sadly I cannot find the paper on-line. Put your work on the web people!

Anonymity at CCS’09

11 November 2009

I am currently sitting in the anonymity techniques session of ACM CCS 2009 in Chicago, and thought I might as well provide a roadmap to what is being presented here regarding traffic analysis and anonymity. Aside the main event, WPES was also hosted on Monday and contained quite a few relevent papers.

It turns out that altruism does not pay: if you become a relay when also using an anonymity network, your attack surface grows and novel traffic analysis attacks are possible to guess what you are surfing. This paper presents a long term intersection attack, where the time you are on-line can be used to correlate observed browsing events (like tweets), as well as an extension of what has now been named clogging attacks, where traffic can be modulated on the Tor client, and observed on a remote server. Very cool! This paper seems to be part of Nick Hopper’s group new strategy, of presenting an attack paper at WPES, and a solution at CCS (this is understandable as he has been penalised, by not being able to publish at PETS 2010, as he is aPC chair). Part of the solution against this attack is presented in “Membership-concealing overlay networks” later in the conference.

  • XPay: Practical anonymous payments for Tor routing and other networked services (not on-line!)
    Yao Chen, Radu Sion and Bogdan Carbunar (WPES)

A controversial payment mechanism for paying routers that relay traffic in Tor. More details will follow when I manage to get a copy of the paper, as the proceedings of WPES are not available yet (*cough*cock-up*cough*).

Many have tried to build anonymity systems over DHTs, providing some of us an endless supply of systems to perform traffic analysis. This work looks at pretty much most of the DHT based anonymity systems, and points out that queries in DHTs are easy to observe, manipulate and DoS given only a small number of corrupt nodes. This is the attack paper that motivated the Torsk system, that will be presented later in the CCS conference.

  • NISAN: Network Information Service for Anonymization Networks (not on-line?)
    Andriy Panchenko, Arne Rache and Stefan Richter

Directory services are a bit of a drag on anonymity systems, and NISSAN is a proposal to use a DHT to distribute the directory functionality. Special care is taken to ensure the DHT routing resolution mechanism is not susceptible to an active adversary that controls a fraction of nodes in the DHT, to get truly random nodes. The authors also controversially argue that the observability of queries is not important, and suggest that bridging and fingerprinting is not a big deal so far. DHTs are scary, and NISSAN is likely to come under close scrutiny in the next few years (as long as the authors put the paper on-line!)

The paper proposes the use of certificateless encryption to build circuits in onion routing (or mix) systems. It is unclear how useful this is, as the list of nodes already has to be signed to avoid directory and sybil attacks, but having different options to do crypto in anonymity networks is always interesting.

After many attack papers, Nikita and Prateek decide to jump in the deep end, and propose their own DHT based path construction mechanism for anonymity networks. ShadowWalker is a system that allows secure sampling in a DHT. It relies on commitments on routing tables shared amongst some “shadow” nodes to ensure the adversary cannot adaptively lie about their routing tables, and a tit-for-tat strategy to avoid DoS attacks. Very cool ideas, and a very cool name.

Given a trace of a mix network, what is the probability of Alice talking to Bob? This seemingly simple question turns out to be remarkably hard to answer given the constraints of path selection and a real-world trace of traffic. More details on how to do this are also available in our report: The Application of Bayesian Inference to Traffic Analysis. I will write a post about how I hope this will become the standard way in which to assess the passive information leakage of anonymity networks.

The story so far was that a bigger Tor network is a more secure Tor network. This paper points out it is not so simple: despite having more nodes, Tor still routes most traffic over a small number of AS (Autonomous Systems), that can see both ends of the connection, and should be able to do timing attacks. A method to chose paths to mitigate this problem is proposed — more would be welcome.

Finding out remotely who is using an anonymity network might lead to trouble for some users. This work proposes a method to anonymize traffic, in a peer-to-peer manner, but without ever connecting to many strangers. Its very nice to see the idea of social network being used for infiltration resistance taking off.

Papers to come tomorrow:

  • A New Cell Counter Based Attack Against Tor
    Zhen Ling, Junzhou Luo, Wei Yu, Xinwen Fu, Dong Xuan and Weijia Jia

The UK goes every ten years through a national census, where every household is called to fill in details about their demographics, habits, travel and income. The next one will be the UK 2011 census.

The office for national statistics has a statutory duty to ensure that the data released from this census cannot be used to identify any individual or to infer any of unknown attribute. Techniques for doing so are called statistical disclosure control, and have been the subject of intense study for the last 40 years at least. One could never have guessed by reading the documents on confidentiality for the next UK census.

To make a long story short: the ONS never considered modern well defined notions of privacy, it lacks a reliable evaluation framework to establish the degree of risk of different methods (let alone utility), and has proposed disclosure control measures that fall rather short of the state of the art.

Moving households around (a bit)

The consultation is not totally over yet, but the current favorite after two rounds of evaluation seems to be a technique called “Record Swapping”. How does it work? The technique takes the database of all responses to the census and outputs another database, that is sufficiently different to avoid identification and inference. Record swapping first categorises all records by the household size, sex, broad age, and hard-to-count variables. Then it selects 2-20% of the records, and each of them are paired with a record from the same category. Then the geographical data of each pair of records (yes, right, only the geographical data) are swapped.

This procedure has the effect to disperse geographically the population a bit so that, it is not possible to know whether single cells in tables are indeed providing information about an individual in a region or, whether they are the product of a swap from a different region. The advantage is that the totals are the same (since swapping things around is invariant to addition), the swaps are with “similar” households, and the procedure is simple to implement.

This is in-line with the definition of privacy of the census office, namely that: 

“The Registrars General concluded that the Code of Practice statement can be met in relation to census outputs if no statistics are produced that allow the identification of an individual (or information about an individual) with a high degree of confidence. The Registrars General consider that, as long as there has been systematic perturbation of the data, the guarantee in the Code of Practice would be met.”

Problems with “Record Swapping”

So far a whole process has been followed to evaluate a list of proposed disclosure control measures, present a methodolody to evaluate them, shortlist some, and perform more in-depth research about their utility and privacy. There is a lot of repetition in these documents, a few ad-hoc indicators of quality and privacy, and no security analysis what-so-ever about inference attacks on the proposed schemes. The subject of ” disclosure by differencing” is left as a suggestion for future work in the latest interim report, while the only method left on the list is Record Swapping, as well as ABS, that has apparently not been tested yet at all.

Why is that a problem? Records include many other potentially identifying fields aside from location. Since the rest of the record stand as it is, and is aggregated into tables, with a secret small cell adjustment technique, we cannot really be sure at all that there are no re-identification attacks. (Apparently revealing the details of the technique cannot be divulged for confidentiality reasons, violating even the most basic principle of security engineering! See page 3).

The utility measures used to assess how acceptable these disclosure control measures will be to data users (Shlomo et al.), are themselves very simplistic and do not offer very tight bounds on possible errors but I will leave this matter for the statisticians to blog about.

To make the problem worse, this time the ONS, is seriously thinking of allowing data users to submit their own queries to the database of statistics. The queries are not likely to be full SQL any time soon, but tables on 3 categories (called cubes) are likely to be allowed. This leaves wide open quite a range of attacks in the literature on inference in statistical databases.

At this point there is absolutely no evidence that the disclosure control scheme is actually secure, which in security engineering means that it is probably not.

How did we get to this situation?

It seems the bulk of the work on disclosure control has been done by the ONS, in conjunction with researchers from the University of Southampton. None of the authors of any of the evaluations has a substancial research experience in privacy technology or theoretical computer security that deals with these privacy matters in a systematic way.

What is revealing is the fact that the most relevant related work is never mentioned. It includes:

  • The work of Denning on trackersand inference in statistical databases (1980). Instead the archaic term “differencing” is used.
  • The work of Sweeney and Samarati on linkage attacks and k-anonymity (1997).
  • The work of Dwork on Differential Privacy (2007), which is the most current and strongest definition of privacy for statistical databases.

These works show repeatedly that ad-hoc inference control measures, that only aim to suppress a handful of known and obvious attacks, are systematically bypassed.

Dwork in her work on Differential Privacy (that won the 2009 year’s PET Award) provides clear arguments on why simpler ad-hoc techniques cannot provide the same guarantee of privacy: their results can be aggregated with side information known to the adversary to facilitate inference. Differential privacy on the other hand guarantees that the results of a query to the database, or published table, reveals no more information when composed with other such queries or any side information. 

This is a hot topic in research today, and all the details may not be ready for a census in 2 years time. This does not justify the ONS’s ignorance of this field.

Follow

Get every new post delivered to your Inbox.