Piercing through encryption with traffic analysis (PETS2014 in real time)
16 July 2014
The paper “I Know Why You Went to the Clinic: Risks and Realization of HTTPS Traffic Analysis” is revisiting the vulnerability of HTTPS encryption to traffic analysis.
The authors note that most secure web connections access directly a specific website that is requested by a user (as opposed to the setting where a user is observed using Tor, where the site is not known). There are challenges in doing this: different websites are usually quite distinct, while many sub-pages in the same website may share a common layout that makes them hard to distinguish through encryption. Additional complications are added by cache, cookies, and the use of Flash. However, this setting may be easier than the Tor case, in that the adversary could realistically enumerate all pages on a website.
The paper chose 10 sites, including health, finance, video rental, and civil liberties. Traces are collected from encrypted requests and replies, and some simple pre-processing is done. In particular, the traffic volumes are collated in bursts and only their sizes are stored. This is done to add robustness to jitter in the traffic. Traces are then clustered using k-means and Gaussians are used to model each of the clusters. The use of the Gaussian distribution provides some further resilience to noise in the size of resources.
A machine learning approach is then followed to select which features distinguish pages: a logistic regression model is used to learn weights that best separate pages. A Hidden Markov Model is then used to track full traces of used navigation through an encrypted web-site.
To evaluate the approach traces were collected using virtual machines. This allows for the full state of a VM to be frozen, to reset the machine state perfectly between trace collection. Traces were collected in sessions where user state was meant to be preserved, and otherwise a fresh VM could be spun to clear the state. This way different conditions of cache on/off, cookies and browsing different sites were used for the collection. Interestingly the effect of cache being on systematically reduced the accuracy of the attack by 10%-20%.
Finally, the paper presents a defense against traffic analysis. It is based on padding the bursts of traffic to some threshold in order to confuse the feature extraction stage. Since many thresholds are possible, and the maximum is chosen, subject to a size “inflation” budget. Other defenses were also tried and compared, and there is a substantial reduction in the accuracy of this attack.