Boing Boing just released a classified GCHQ document that was meant to act as the Sept 2011 guide to open research problems in Data Mining. The intended audience, Heilbronn Institute for Mathematical Research (HIMR), is part of the University of Bristol and composed of mathematicians working for half their time on classified problems with GCHQ.
First off, a quick perusal of the actual publication record of the HIMR makes a sad reading for GCHQ: it seems that very little research on data mining was actually performed post-2011-2014 despite this pitch. I guess this is what you get trying to make pure mathematicians solve core computer science problems.
However, the document presents one of the clearest explanations of GCHQ’s operations and their scale at the time; as well as a very interesting list of open problems, along with salient examples.
Overall, reading this document very much resembles reading the needs of any other organization with big-data, struggling to process it to get any value. The constrains under which they operate (see below), and in particular the limitations to O(n log n) storage per vertex and O(1) per edge event, is a serious threat — but of course this is only for un-selected traffic. So the 5000 or so Tor nodes probably would have a little more space and processing allocated to them, and so would known botnets — I presume.
Secondly, there is clear evidence that timing information is both recognized as being key to correlating events and streams; and it is being recorded and stored at an increasing granularity. There is no smoking gun as of 2011 to say they casually de-anonymize Tor circuits, but the writing is on the wall for the onion routing system. GCHQ at 2011 had all ingredients needed to trace Tor circuits. It would take extra-ordinary incompetence to not have refined their traffic analysis techniques in the past 5 years. The Tor project should do well to not underestimate GCHQ’s capabilities to this point.
Thirdly, one should wonder why we have been waiting for 3 years until such clear documents are finally being published from the Snowden revelations. If those had been the first published, instead of the obscure, misleading and very non-informative slides, it would have saved a lot of time — and may even have engaged the public a bit more than bad powerpoint.
3 February 2016
(This is an extract from my contribution to Harper, Richard. “Introduction and Overview”, Trust, Computing, and Society. Ed. Richard H. R. Harper. 1st ed. New York: Cambridge University Press, 2014. pp. 3-14. Cambridge Books Online. Web. 03 February 2016. http://dx.doi.org/10.1017/CBO9781139828567.003)
Cryptography has been used for centuries to secure military, diplomatic, and commercial communications that may fall into the hands of enemies and competitors (Kahn 1996). Traditional cryptography concerns itself with a simple problem: Alice wants to send a message to Bob over some communication channel that may be observed by Eve, but without Eve being able to read the content of the message. To do this, Alice and Bob share a short key, say a passphrase or a poem. Alice then uses this key to scramble (or encrypt) the message, using a cipher, and sends the message to Bob. Bob is able to use the shared key to invert the scrambling (or “decrypt”) and recover the message. The hope is that Eve, without the knowledge of the key, will not be able to unscramble the message, thus preserving its confidentiality.
It is important to note that in this traditional setting we have not removed the need for a secure channel. The shared key needs to be exchanged securely, because its compromise would allow Eve to read messages. Yet, the hope is that the key is much shorter than the messages subsequently exchanged, and thus easier to transport securely once (by memorizing it or by better physical security). What about the cipher? Should the method by which the key and the message are combined not be kept secret? In “La Cryptographie Militaire” in 1883, Auguste Kerckhoffs stated a number of principles, including that only the key should be considered secret, not the cipher method itself (Kerckhoffs 1883). Both the reliance on a small key and the fact that other aspects of the system are public is an application of the minimization principle we have already seen in secure system engineering. It is by minimizing what has to be trusted for the security policy to hold that one can build and verify secure systems – in the context of traditional cryptography, in principle, this is just a short key.
Kerckhoffs argues that only the key, not the secrecy of the cipher is in the trusted computing base. But a key property of the cipher is relied on: Eve must not be able to use an encrypted message and knowledge of the cipher to recover the message without access to the secret key. This is very different from previous security assumptions or components of the TCB. It is not about the physical restrictions on Eve, and it is not about the logical operations of the computer software and hardware that could be verified by careful inspection. It comes down to an assumption that Eve cannot solve a somehow difficult mathematical problem. Thus, how can you trust a cipher? How can you trust that the adversary cannot solve a mathematical problem?
To speak the truth, this was not a major concern until relatively recently, compared with the long history of cryptography. Before computers, encoding and decoding had to be performed by hand or using electromechanical machines. Concerns such as usability, speed, cost of the equipment, and lack of decoding errors were the main concerns in choosing a cipher. When it comes to security, it was assumed that if a “clever person” proposes a cipher, then it would take someone much cleverer than them to decode it. It was even sometimes assumed that ciphers were of such complexity that there was “no way” to decode messages without the key. The assumption that other nations may not have a supply of “clever” people may have to do with a colonial ideology of nineteenth and early twentieth centuries. Events leading to the 1950s clearly contradict this: ciphers used by major military powers were often broken by their opponents.
In 1949, Claude Shannon set out to define what a perfect cipher would be. He wanted it to be “impossible” to solve the mathematical problem underlying the cipher (Shannon 1949). The results of this seminal work are mixed. On the positive side, there is a perfect cipher that, no matter how clever an adversary is, cannot be solved – the one-time pad. On the down side, the key of the cipher is as long as the message, must be absolutely random, and can only be used once. Therefore the advantage of short keys, in terms of minimizing their exposure, is lost and the cost of generating keys is high (avoiding bias in generating random keys is harder than expected). Furthermore, Shannon proves that any cipher with smaller keys cannot be perfectly secure. Because the one-time pad is not practical in many cases, how can one trust a cipher with short keys, knowing that its security depends on the complexity of finding a solution? For about thirty years, the United States and the UK followed a very pragmatic approach to this: they kept the cryptological advances of World War II under wraps; they limited the export of cryptographic equipment and know-how through export regulations; and their signal intelligence agencies – the NSA and GCHQ, respectively – became the largest worldwide employers of mathematicians and the largest customers of supercomputers. Additionally, in their roles in eavesdropping on their enemies’ communications, they evaluated the security of the systems used to protect government communications. The assurance in cryptography came at the cost of being the largest organizations that know about cryptography in the world.
The problem with this arrangement is that it relies on a monopoly of knowledge around cryptology. Yet, as we have seen with the advent of commercial telecommunications, cryptography becomes important for nongovernment uses. Even the simplest secure remote authentication mechanism requires some cryptography if it is to be used over insecure channels. Therefore, keeping cryptography under wraps is not an option: in 1977, the NSA approved the IBM design for a public cipher, the Data Encryption Standard (DES), for public use. It was standardized in 1979 by the US National Institute for Standards and Technology (NIST).
The publication of DES launched a wide interest in cryptography in the public academic community. Many people wanted to understand how it works and why it is secure. Yet, the fact that the NSA tweaked its design, for undisclosed reasons, created widespread suspicion in the cipher. The fear was that a subtle flaw was introduced to make decryption easy for intelligence agencies. It is fair to say that many academic cryptographers did not trust DES!
Another important innovation in 1976 was presented by Whitfield Diffie and Martin Hellman in their work “New Directions in Cryptography” (Diffie & Hellman 1976). They show that it is possible to preserve the confidentiality of a conversation over a public channel, without sharing a secret key! This is today known as “Public Key Cryptography,” because it relies on Alice knowing a public key for Bob, shared with anyone in the world, and using it to encrypt a message. Bob has the corresponding private part of the key, and is the only one that can decode messages used with the public key. In 1977, Ron Rivest, Adi Shamir, and Leonard Adleman proposed a further system, the RSA, that also allowed for the equivalent of “digital signatures” (Rivest et al. 1978).
What is different in terms of trusting public key cryptography versus traditional ciphers? Both the Diffie-Hellman system and the RSA system base their security on number theoretic problems. For example, RSA relies on the difficulty of factoring integers with two very large factors (hundreds of digits). Unlike traditional ciphers – such as DES – that rely on many layers of complex problems, public key algorithms base their security on a handful of elegant number theoretic problems.
Number theory, a discipline that G.H. Hardy argued at the beginning of the twentieth century was very pure in terms of its lack of any practical application (Hardy & Snow 1967), quickly became the deciding factor on whether one can trust the most significant innovation in the history of cryptology! As a result, a lot of interest and funding directed academic mathematicians to study whether the mathematical problems underpinning public key cryptography were in fact difficult and how difficult the problems were.
Interestingly, public key cryptography does not eliminate the need to totally trust the keys. Unlike traditional cryptography, there is no need for Bob to share a secret key with Alice to receive confidential communications. Instead, Bob needs to keep the private key secret and not share it with anyone else. Maintaining the confidentiality of private keys is simpler than sharing secret keys safely, but it is far from trivial given their long-term nature. What needs to be shared is Bob’s public key. Furthermore, Alice need to be sure she is using the public key associated with the Bob’s private key; if Eve convinces Alice to use an arbitrary public key to encrypt a message to Bob, then Eve could decrypt all messages.
The need to securely associate public keys with entities has been recognized early on. Diffie and Hellman proposed to publish a book, a bit like the phone register, associating public keys with people. In practice, a public key infrastructure is used to do this: trusted authorities, like Verisign, issue digital certificates to attest that a particular key corresponds to a particular Internet address. These authorities are in charge of ensuring that the identity, the keys, and their association are correct. The digital certificates are “signed” using the signature key of the authorities that anyone can verify.
The use of certificate authorities is not a natural architecture in many cases. If Alice and Bob know each other, they can presumably use another way to ensure Alice knows the correct public key for Bob. Similarly, if a software vendor wants to sign updates for their own software, they can presumably embed the correct public key into it, instead of relying on public key authorities to link their own key with their own identity.
The use of public key infrastructures (PKI) is necessary in case Alice wants to communicate with Bob without them having any previous relationship. In that case Alice, given only a valid name for Bob, can establish a private channel to Bob (as long as it trusts the PKI). This is often confused: the PKI ensures that Alice talks to Bob, but not that Bob is “trustworthy” in any other way. For example, a Web browser can establish a secure channel to a Web service that is compromised or simply belong to the mafia. The secrecy provided by the channel does not, in that case, provide any guarantees as to the operation of the Web service. Recently, PKI services and browsers have tried to augment their services by only issuing certificates to entities that are verified as somehow legitimate.
Deferring the link between identities and public keys to trusted third parties places this third party in a system’s TCB. Can certification authorities be trusted to support your security policy? In some ways, no. As implemented in current browsers, any certification authority (CA) can sign a digital certificate for any site on the Internet (Ellison & Schneier 2000). This means that a rogue national CA (say, from Turkey) can sign certificates for the U.S. State Department, that browsers will believe. In 2011, the Dutch certificate authority Diginotar was hacked, and their secret signature key was stolen (Fox-IT 2012). As a result, fake certificates were issued for a number of sensitive sites. Do CAs have incentives to protect their key? Do they have enough incentives to check the identity of the people or entities behind the certificates they sign?
Cryptographic primitives like ciphers and digital signatures have been combined in a variety of protocols. One of the most famous is the Secure Socket Layer SSL or TLS, which provides encryption to access encrypted Web sites on the Internet (all sites following the https:// protocol). Interestingly, once secure primitives are combined into larger protocols, their composition is not guaranteed to be secure. For example a number of problems have been identified against SSL and TLS that are not related to the weaknesses of the basic ciphers used (Vaudenay 2002).
The observation that cryptographic schemes are brittle and could be insecure even if they rely on secure primitives (as did many deployed protocols) led to a crisis within cryptologic research circles. The school of “provable security” proposes that rigorous proofs of security should accompany any cryptographic protocol to ensure it is secure. In fact “provable security” is a bit of a misnomer: the basic building blocks of cryptography, namely public key schemes and ciphers cannot be proved secure, as Shannon argued. So a security proof is merely a reduction proof: it shows that any weakness in the complex cryptographic scheme can be reduced to a weakness in one of the primitives, or a well-recognized cryptographic hardness assumption. It effectively proves that a complex cryptographic scheme reduces to the security of a small set of cryptographic components, not unlike arguments about a small Trusted Computing Base. Yet, even those proofs of security often work at a certain level of abstraction and often do not include all details of the protocol. Furthermore, not all properties can be described in the logic used to perform the proofs. As a result, even provably secure protocols have been found to have weaknesses (Pfitzmann & Waidner 1992).
So, the question of “How much can you trust cryptography?” has in part itself been reduced to “How much can you trust the correctness of a mathematical proof on a model of the world?” and “How much can one trust that a correct proof in a model applies to the real world?” These are deep epistemological questions, and it is somehow ironic that national, corporate, and personal security depends on them. In addition to these, one may have to trust certificate authorities and assumptions on the hardness of deep mathematical problems. Therefore, it is fair to say that trust in cryptographic mechanisms is an extremely complex social process.
23 January 2016
The petlib library exposes basic elliptic curve (EC), big number and crypto functions. As an example, I also implemented genzkp.py, a simple non-interactive zero-knowledge proof engine. This is a short tutorial on how to use genzkp, in case you need a proof of knowledge and do not wish to write all details by hand.
We will use as a running example proving knowledge of the opening and the secret of a Pedersen commitment over an elliptic curve field. In a nutshell the prover defines a commitment of the form:
The variables h and g are publicly known points on an elliptic curve. The commitment Cxo commits to the secret value x, using the secret opening value o. One can show that this commitment scheme provides perfect hiding and computational binding. The creator of this commitment may with to prove it knows values x and o, without revealing them. For this we can use an non-interactive zero-knowledge proof.
The second session is on “Equipment Interference”, Hacking or “Computer Network Exploitation”. There is little mention of this in previous legislation, and as a result there was much confusion about oversight according to Eric King, who chairs the session. This changes earlier this year with the publication of the code of practice, since it allows us to talk about these issues publicly, and now these powers are also in the bill.
7 January 2016
Once again we have to thank her majesty’s government for an opportunity to get together and discuss encryption and surveillance policy. Here are my notes from the first section. Since they are in real-time they are not a very faithful record, and probably mistakes are due to me rather than the speakers…
5 December 2015
As many in the UK are fighting a rear-guard action to prevent the most shocking provisions of the IP Bill becoming law (incl. secrecy and loose definitions), I was invited to provide three public policy recommendations for strengthening IT security in the EU. Instead of trying to limit specific powers (such as backdoors) here are some more radical options, more likely to resolve the continuous tug-of-war cyber civil liberties and the security services have been engaging in a while.
26 September 2014
I am today attending the first Internet Privacy Engineering Network (IPEN), where the issue of translating Data Protection principles into requirements has been raised a number of times. While this exercise needs to be repeated for each given service or application, it reminded me that I had drafted a number of generic Technical Requirements for Processing PII. These need to be reviewed and validated, but I hope they offer at least proof that the problem can be made tractable.
31 October 2013
The latest revelations about the NSA attacking some of the largest US cloud providers’ communications, are also accompanied by Cambridge Member of Parliament, Julian Huppert, call to revise the oversight of UK intelligence agencies. Similar calls were made in the US about better oversight of their security agencies. Julian concludes in this Guardian “Comment is Free” piece that:
“Who can read this, and how do we want to protect this? We need to agree the rules now, before we completely lose control.”
While better oversight is in itself a good thing, the over-reliance on “oversight” or privacy regulation, such as data protection regimes, is a typical example of what I call the “liberal fallacy”. The liberal fallacy is the belief that privacy is a complex social technical issue, and as a result it needs to be addressed first and foremost by better regulation, since it cannot be addressed by technical means alone.
The argument is extremely appealing for a number of reasons, and when put so reasonably I would be surprised if most privacy and security professionals, as well as policy makers and civil society advocates would not agree with it. After all, privacy in indeed both complex, and not merely a technical property. Privacy is not an absolute right, and regulation can “balance” the rights of the individual against the collective needs to revoke this right in certain circumstances. In a liberal democracy both the state and companies operate within the rule of the law, therefore proper regulation seems a light weight mechanism to solve the privacy problem.
The problem is that the “better regulation and oversight” argument is just non-sense in the context of the NSA and GCHQ spying allegations. The reason for thi, is that the national regulations do not affect the willingness, legality or ability of other states to conduct mass surveillance operations. Better German privacy legislation would not have protected the German head of state’s telephone conversation against US agencies. Similarly, better UK oversight of GCHQ will not extend any protections the US afford to US persons only to the UK population. For any national legislation offering you strong privacy guarantees and good oversight, there are about 205 other jurisdictions in which spying on you is not only legal, but highly ethical, patriotic, in the national interest, and rather well funded by tax payers.
National legislation works best in the context of territorial matters, where proximity and ability to harm is related to physical distance and location, and an army ensures territorial integrity. The internet is not like that: a US, Russian or Chinese router is as close to your UK web-site or switch as one in the UK. Benefiting from strong protections by UK entities, does nothing to protect you from other dangers that are just as close. It is shocking that US agencies were targeting cloud providers, but now we know they were not doing so only using their legal authority, but also just intercepting their communications. Even given perfect US oversight, better regulation will not prevent other countries doing the same — only better encryption systems can do that.
I am quite delighted that Julian Huppert does also mention that that placing back doors into security products makes everyone less safe, in line with the statement many UK security researchers made earlier this year. Yet, the focus on regulation is misplaced: against nation state level threats, sadly, only better security and privacy technologies can provide a credible defense for privacy.
27 July 2011
I am currently sitting at the PETS 2011 symposium in Waterloo, CA. I will be blogging about selected papers (depending on the sessions I attend) over the next couple of days — authors and other participants are welcome to comment!
The first session is about data mining and privacy.
“How Unique and Traceable are Usernames?”
Daniele Perito, Claude Castelluccia, Mohamed Ali Kaafar, and Pere Manils (INRIA)
The first paper looks at the identifiably of on-line usernames. The authors looked at user names from different sites and assess the extent to which they can be linked together, as well as link them to a real person. Interestingly they used Google Profiles as ground truth, since they allow users to provide links to other accounts. First they assess the uniqueness of pseudonyms based on a probabilistic model: a k-th order markov chain is used to compute the probability of each pseudonym, and its information content (i.e. -log_2 P(username)). The authors show that most of the usernames observed have “high entropy” and should therefore be linkable.
The second phase of the analysis looks at usernames from different services, and attempts to link them even given small modifications to the name. The key dataset used was 300K google profiles, that list (sometimes — they used 10K tuples of usernames) other accounts as well. They then show that the Levenshtein distance (i.e. edit distance) of usernames from the same person is small compared to the distance of two random user names. A classifier is built, based on a threshold of the probabilistic Levenshtein distance, to assess whether a pair of usernames belongs to the same or a different user. The results seem good: about 50% of usernames are linkable with no mistakes.
So what are the interesting avenues for future work here? First, the analysis used is a simple threshold on the edit distance metric. It would be surprising if more advanced classifiers could not be applied to the task. I would definitely try to use random forests for the same task. Second, the technique can be used for good not evil: as users try to migrate between services, they need an effective way to find their contacts — maybe the proposed techniques can help with that.
“Text Classification for Data Loss Prevention” (any public PDF?)
Michael Hart (Symantec Research Labs), Pratyusa Manadhata (HP Labs), and Rob Johnson (Stony Brook University)
The paper looks at the automatic classification of documents as sensitive or not. This is to assist “data loss prevention” systems, that raise an alarm when personal data is about to be leaked (i.e. when it is about to be emailed or stored on-line — mostly by mistake). Traditionally DLP try to describe what is confidential through a set of simple rules, that are not expressive enough to describe and find what is confidential — thus the authors present a machine learning approach to automatically classify documents using training data as sensitive or not. As with all ML techniques there is a fear of mistakes: the technique described tries to minimise errors when it comes to classifying company media (ie. public documents) and internet documents, to prevent the system getting on the way of day to day business activities.
The results were rather interesting: the first SVN classifier looked at unigrams with binary weights to classify documents. Yet, it first had a very high rate of false positives for public documents. It seems the classifiers also had a tendency to classify documents as “secret”. A first solution was to supplement the training set with public documents (from wikipedia), to best described “what is public”. Second, the classifier was tweaked to (in a rather mysterious way to me) by “pushing the decision boundary closer to the true negative”. A further step does 3-category classification as secret, public and non-enterprise, rather than just secret and not-secret.
Overall: They manage to get the false positive / false negative rate down to less than 2%-3% on the largest datasets evaluated. That is nice. The downside of the approach, is common to most work that I have seen using SVNs. It requires a lot of manual tweaking, and further it does not really make much sense — it is possible to evaluate how well the technique performs on a test corpus, but difficult to tell why it works, or what makes it good or better than other approaches. As a resut, even early positive resutls should be considered as preliminary until more extensive evaluation is done (more like medicine rather than engineering). I would personally like to see a probabilistic model based classifier on similar features that integrates the two-step classification process into one model, to really understand what is going on — but then I tend to have a Baysian bias.
P3CA: Private Anomaly Detection Across ISP Networks
Shishir Nagaraja (IIIT Delhi) and Virajith Jalaparti, Matthew Caesar, and Nikita Borisov (University of Illinois at Urbana-Champaign)
The final paper in the session looks at privacy preserving intrusion detection to enable cooperation between internet service providers. ISPs would like to pool data from their networks to detect attacks: either because the volume of communications is abnormal at certain times, or because some frequency component is odd. Cooperation between multiple ISPs is important, but this cooperation should not leak volumes of traffic at each IPS since they are a commercial secret.
Technically, privacy of computations is achieved by using two semi-trusted entities, a coordinator and key holder. All ISPs encrypt their traffic under an additive homomorphic scheme (Paillier) under the keyholder key, and send it to the coordinator. The coordinator is using the key-holder as an oracle to perform a PCA computation to output the top-n eighen vectors and values of traffic. The cryptographic techniques are quite but standard, and involve doing additions, subtraction, multiplication, comparison and normalization of matrices privately though a joint private two-party computation.
Surprisingly, the performance of the scheme is quite good! Using a small cluster, can process a few tens of time slots from hundresds of different ISPs in tens of minutes. A further incremental algorithm allows on-line computations of eighen vector/value pairs in seconds — making real-time use of the privacy preserving algorithm possible (5 minutes of updates takes about 10 seconds to process).
This is a surprising result: my intuition would be that the matrix multiplication would make the approach impractically slow. I would be quite interested to compare the implementation and algorithm used here with a general MPC compiler (under the same honest-but-curious model).
4 October 2010
- Using Social Networks to Harvest Email Addresses by Iasonas Polakis, Georgios Kontaxis, Eleni Gessiou, Thanasis Petsas, Evangelos P. Markatos and Spiros Antonatos (Institute of Computer Science, Foundation for Research and Technology Hellas)
- Turning Privacy Leaks into Floods: Surreptitious Discovery of Social Network Friendships and Other Sensitive Binary Attribute Vectors by Arthur Ascuncion and Michael Goodrich (University of California, Irvine) (not on-line yet).
The first work by Polakis et al, looks at how easy it is to massively harvest email addresses using social networks and search engines to further use them as targets for spam. Furthermore, they attach to each email address contextual social information and personal information to produce more convincing spam and phishing emails. For this they used different techniques on three target platforms.
On facebook, they use the facility that allows users to find other by email to harvest personal information. This acts as an oracle to map harvested email addresses to real world names and demographic information. Once a facebook profile is linked, a very convincing fishing email can be crafted – including an email to ask to befriend them. (About 30% of users would befriend a stranger in that manner – a result not reported in the paper.)
A second vector of information is the use of nicknames that are constant across different sites. They use twitter to harvest pairs of (nicknames, email) and then further use the Facebook email to name oracle to link them to real world addresses. Finally, the authors use a google buzz feature to extract emails: every Buzz user ID is also their gmail address – this means that by searching buzz for particular words you can harvest gmail addresses as well as personal information of the users.
But how effective are the email harvesting techniques? How do you even assess this? The authors check the name and address they have harvested against an exact match with the name extracted from Facebook. The first technique yields about 0.3% correct addresses, the second 7%, and the final one 40%, showing that the techniques are practical when linking email to real names.
The second paper by Ascuncion et al. looks at how to aggregate information leaked by social networks to construct complete profiles of users on Social Networks. The aim is to reconstruct the friendship network as well as recovering attributes even if the privacy settings are used.
The techniques assume you can use an oracle to ask group queries against the social network site to check for a particular attribute. The objective is then to find a scheme that minimises the number of queries. It turns out there is a body of work on combinatorial group testing, including adaptive variants, that are readily applicable to this problem. This is not unlike our work on prying data out of a social network. Applying these techniques to social networks is even narrower allowing a lower number of queries to extract attributes (a logarithmic number of queries in the size of possible profiles, and linear in the number of profiles with a certain attribute to be extracted).
The attack is applied and validated by applying it to extract friends in Facebook, DNA sequences in mitochondrial databases, and movie preferences in the NetFlix database. These techniques are interesting as they are very general. At the same time it is likely that faster ways exist to extract specific attributes of users in real-world social networks, as there are strong correlation between attributes and the social structure of users.