Boing Boing just released a classified GCHQ document that was meant to act as the Sept 2011 guide to open research problems in Data Mining. The intended audience, Heilbronn Institute for Mathematical Research (HIMR), is part of the University of Bristol and composed of mathematicians working for half their time on classified problems with GCHQ.
First off, a quick perusal of the actual publication record of the HIMR makes a sad reading for GCHQ: it seems that very little research on data mining was actually performed post-2011-2014 despite this pitch. I guess this is what you get trying to make pure mathematicians solve core computer science problems.
However, the document presents one of the clearest explanations of GCHQ’s operations and their scale at the time; as well as a very interesting list of open problems, along with salient examples.
Overall, reading this document very much resembles reading the needs of any other organization with big-data, struggling to process it to get any value. The constrains under which they operate (see below), and in particular the limitations to O(n log n) storage per vertex and O(1) per edge event, is a serious threat — but of course this is only for un-selected traffic. So the 5000 or so Tor nodes probably would have a little more space and processing allocated to them, and so would known botnets — I presume.
Secondly, there is clear evidence that timing information is both recognized as being key to correlating events and streams; and it is being recorded and stored at an increasing granularity. There is no smoking gun as of 2011 to say they casually de-anonymize Tor circuits, but the writing is on the wall for the onion routing system. GCHQ at 2011 had all ingredients needed to trace Tor circuits. It would take extra-ordinary incompetence to not have refined their traffic analysis techniques in the past 5 years. The Tor project should do well to not underestimate GCHQ’s capabilities to this point.
Thirdly, one should wonder why we have been waiting for 3 years until such clear documents are finally being published from the Snowden revelations. If those had been the first published, instead of the obscure, misleading and very non-informative slides, it would have saved a lot of time — and may even have engaged the public a bit more than bad powerpoint.