A technical reading of the “HIMR Data Mining Research Problem Book”
3 February 2016
Boing Boing just released a classified GCHQ document that was meant to act as the Sept 2011 guide to open research problems in Data Mining. The intended audience, Heilbronn Institute for Mathematical Research (HIMR), is part of the University of Bristol and composed of mathematicians working for half their time on classified problems with GCHQ.
First off, a quick perusal of the actual publication record of the HIMR makes a sad reading for GCHQ: it seems that very little research on data mining was actually performed post-2011-2014 despite this pitch. I guess this is what you get trying to make pure mathematicians solve core computer science problems.
However, the document presents one of the clearest explanations of GCHQ’s operations and their scale at the time; as well as a very interesting list of open problems, along with salient examples.
Overall, reading this document very much resembles reading the needs of any other organization with big-data, struggling to process it to get any value. The constrains under which they operate (see below), and in particular the limitations to O(n log n) storage per vertex and O(1) per edge event, is a serious threat — but of course this is only for un-selected traffic. So the 5000 or so Tor nodes probably would have a little more space and processing allocated to them, and so would known botnets — I presume.
Secondly, there is clear evidence that timing information is both recognized as being key to correlating events and streams; and it is being recorded and stored at an increasing granularity. There is no smoking gun as of 2011 to say they casually de-anonymize Tor circuits, but the writing is on the wall for the onion routing system. GCHQ at 2011 had all ingredients needed to trace Tor circuits. It would take extra-ordinary incompetence to not have refined their traffic analysis techniques in the past 5 years. The Tor project should do well to not underestimate GCHQ’s capabilities to this point.
Thirdly, one should wonder why we have been waiting for 3 years until such clear documents are finally being published from the Snowden revelations. If those had been the first published, instead of the obscure, misleading and very non-informative slides, it would have saved a lot of time — and may even have engaged the public a bit more than bad powerpoint.
Some interesting points in the document in order:
- It turns out that GCHQ has a written innovation strategy, reference [I75]. If someone has a copy it would be great to see it, and also understand where the ACE program fits in it.
- GCHQ, at the time used heavily two families of technologies: Hadoop, for bulk processing of raw collected data (6 months for meta-data apparently), and IBM Streams (DISTILLERY) for stream, real-time, processing. A lot of the discussion, and open problems, relate to the fact that bulk collection can only provide a limited window of visibility, and intelligence related selection, and processing has to happen within this window. Hence the interest in processing streaming data.
- Section 2 is probably the clearest explanation of how modern SIGINT works. I would like to congratulate the (anonymous) author, and will be setting it as the key reading for my PETS class. It summarizes well what countless crappy articles on the Snowden leaks have struggled to piece together. I wish journalists just released this first, and skipped the ugly slides.
- The intro provides a hit at the scale of cable interception operations as of 2011. It seems that 200 “bearers” of 10 Gigabit / sec were being collected at any time; it makes clear that many more sources were available to switch to depending on need. That is about 2 Terabit / sec, across 3 sites (Cheltenham, Bude, and LECKWITH).
- Section 2.1.2 explains that a lot (the majority) of data is discarded very early on, by special hardware performing simple matching on internet packets. I presume this is to filter out bulk downloads (from CDNs), known sources of spam, youtube videos, etc.
- The same section (2.1.2) also explains that all meta-data is pulled from the bearer, and provides an interpretation of what meta-data is.
- Finally (2.1.2) there is a hint at indexing databases (Query focused databases / QFD) that are specialized to store meta-data, such as IP traffic flow data, for quick retrival based on selectors (like IP addresses).
- Section 2.1.3 explains the problem of “target development”, namely when no good known selectors exist for a target, and it is the job of the analyst to match it though either contact chaining or modus-operandi (MO) matching. It is a very instructive section, which is the technical justification underpinning a lot of the mass surveillance going on.
- The cute example chosen to illustrate it (Page 12, end of 2.1.3): Apparently GCHQ developed many of those techniques to spy on foreign delegations during the 2009 G20 meeting. Welcome to London!
- Section 2.2.2 provides a glimpse at the cybersecurity doctrine and world-view at GCHQ, already in 2011. In particular, there is a vision that CESG will act as a network security service for the nation, blocking attacks at the “firewalls”, and doing attribution (as if the attacks will be coming “from outside”). GCHQ would then counter-attack the hostile sources, or simply use the material they intercepted from others (4th party collection, the euphemism goes).
- Section 2.2.3 provides a glimpse of the difficulties of running implants on compromised machines: something that is openly admitted. Apparently ex-filtrating traffic and establishing command-and-control with implants is susceptible to passive SIGINT, both a problem and an opportunity.
- Section 3 and beyond describes research challenges that are very similar with any other large organization or research group: the difficulty of creating labelled data sets for training machine learning models; the challenges on working on partial or streaming data; the need for succinct representations of data structures; and the problem of inferring “information flow” — namely chains of communications that are related to each other.
- It seems the technique of choice when it comes to machine learning is Random Decision Forrests. Good choice, I also prefer it to others. They have an in-house innovation: they weight the outputs of each decision tree. (Something that is sometimes called gradual learning in the open literature, I believe).
- Steganography detection seems to be a highlight: however there is no explanation if steganography is a real problem they encountered in the field, or if it was an easy dataset to generate synthetically.
- Section 4, deals with research problems of “Information Flow in Graphs”. This is the problem of associated multiple related connections together, including across types of channels, detecting botnet Command and Control nodes, and also tracing Tor connections. Tracing Tor nodes is in fact a stated problem, with a stated solution.
- Highlights include the simple “remit” algorithm that developed by Detica (page 26, Sect. 4.2.1); PRIME TIME that looks at chains of length 2; and finally HIDDEN OTTER, that specifically targets Tor, and botnets. (Apparently an internal group codenamed ICTR-NE develop it).
- Section 4.2.2 looks at communications association through temporal correlation: one more piece of evidence that timing analysis, at a coarse scale, is on the cards for mining associations. What is cute is the example used is how to detect all GCHQ employees: they are the ones with phones not active between 9am and 5pm when they are at work.
- Beside these, they are interested in change / anomaly detection (4.4.1), spread of information (such as extremest material), etc. Not that dissimilar from say an analysis Facebook would perform.
- Section 5 poses problem relating to algorithms on streaming graph data. It provides a definitions of the tolerable costs of analysis algorithms (the semi-streaming paradigm): for a graph of n vertices (nodes), they can store a bit of info per vertex, but not all edges, or even process all edges. So they have O(n log n) storage and can only do O(1) processing per event / edge. That could be distilled to a set of security assumptions.
- Section 5.2.2 / 5.2.3 is an interesting discussion about relaxations of cliques and also point of that very popular nodes (the pizza delivery line) probably are noise and should be discarded.
- As of 2011 (sect 5.2.4) it was an open problem how far contact chaining is required. This is set as an open problem, but states that analysis usually use 2-hops from targets. Note that other possible numbers are 3, 4, and 5 since after 6 you probably have included nearly everyone in the world. So it is not that exciting a problem and cannot blame the pure mathematicians for not tackling it.
- Section 5.5.1 asks the question on whether there is an approximation of the correlation matrix, to avoid storing and processing an n x n matrix. It generally seems that matching identifiers with identifiers is big business.
- Section 6 poses problems relating to the processing, data mining, and analysis of “expiring” graphs, namely graphs with edges that disappear after a deadline. This is again related to the constraint that storage for bulk un-selected data is limited.
- In section 6.3.2 the semi-streaming model where only O(n log n) storage per vertex is allowed, and O(1) processing per incoming event / edge is re-iterated.
- Appendix A deals with models of academic engagement. I have to say it is very enlightened: it recognizes the value of openly publishing the research, after some sanitization. Nice.
- Appendix B and C discuss the technical details and size of the IBM Streams and Hadoop clusters. Section D presents the production clusters (652 nodes, total 5216 cores, and 32 GB memory for each node).
- Section E discusses the legalities of using intercepted data for research, and bless them they do try to provide some logging and Human Rights Justification (bulk authorization for research purposes).