## ACM CCS 2011: Under the hood of the reviewing process

### 23 October 2011

ACM CCS 2011 just took place this week, so I decided to give a bit more insight into a few processes the program chairs used behind the scenes to manage what is the largest security conference to date. Vitaly Shmatikov (CCS11 Program co-chair) has already given a short introduction to this year’s process: we received 429 full papers that we had to review with 54 PC members. While no hard target was set at the start of the process, we expected to accept around the 60 papers that now form the program of CCS 2011. These are my views and opinions on the process, and they are not automatically shared by anyone else, including Vitaly.

**Note: This post describes automated statistics we used to interpret review scores, to guide us in assigning more reviews or steering discussion. All final acceptance decisions were taken the old-fashioned way, through qualitative assessment of the reviews and discussion in the PC.**

A naïve reviewing strategy would have been straightforward: traditionally each paper has to receive 3 reviews, leading to a load of 23-24 reviews per PC member. This is a high, but not impossible load. Yet it would not have led to the best outcome for a number of reasons:

- Papers with conflicting reviews are best discussed by requesting additional reviews. This leads to a large fraction of papers in fact needing 4 or more reviews, further increasing the pressure on the members of the PC.
- Author rebuttals to reviews require careful consideration from independent members of the PC, to judge the validity of the authors’ response against the previous reviews and the paper. This further increases the need for reviews on some papers – authors need at least a couple of reviews to respond to, and a further round of reviewing increases the load for papers with conflicting reviews.
- Finally, it would be nice to assign most papers that have a good chance of being accepted more reviews than the baseline three. This helps make sure that accepted papers are of high quality, identifies subtle overlaps with existing work that should be mentioned, and provides authors of viable papers with as much feedback as possible to improve their work should it be accepted.

Given the above, even if only 200 papers needed 4 reviews, we would be asking our PC to review about 27 papers each. This would be getting too heavy, and likely to impact the quality of all reviews. In practice we knew that many papers would need even more reviews. Since the straightforward naïve strategy was not possible, here is a description of the strategy followed, and my personal rationale behind it. Hopefully this will lead to discussion on its effectiveness, fairness and possible improvements for future years.
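As a quick check of the load figures above, here is a minimal sketch of the arithmetic. It assumes the load is spread evenly over the 54 PC members and ignores external reviewers; the function name is purely illustrative.

```python
SUBMISSIONS = 429
PC_MEMBERS = 54

def reviews_per_member(base_reviews, extra_reviews=0):
    """Average review load if every paper gets `base_reviews` reviews
    and `extra_reviews` additional reviews are needed overall."""
    return (SUBMISSIONS * base_reviews + extra_reviews) / PC_MEMBERS

baseline = reviews_per_member(3)         # three reviews for every paper
stretched = reviews_per_member(3, 200)   # plus a fourth review for 200 papers

print(f"baseline: {baseline:.1f}, with 200 extra reviews: {stretched:.1f}")
# prints "baseline: 23.8, with 200 extra reviews: 27.5"
```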

Broadly speaking, the strategy was to concentrate reviewing resources and discussion, at any given time, where they were most needed to decide whether a paper should be accepted or not. This was done in an adaptive manner, given the information available at the time.

First of all, the decision was taken to give some papers only two reviews, or even just one, in the first phase. This was a time-consuming manual task: PC chairs looked at all submissions and flagged some to receive fewer reviews. Factors taken into account were possible lack of an evaluation section, no mention of security or privacy, no references to the security / cryptography literature, no security argument, the number of PC members willing to review the paper, etc. These issues were all considered as a whole when making this decision. Over 100 papers were manually flagged for a reduced number of reviews initially.

In the first round of reviews each PC member received about 20 papers to review. We also allocated a large number of reviews to external reviewers, many of whom agreed to do 3 or 4 reviews. Unlike previous conferences I have managed, no attempt was made to keep the load strictly fair in terms of numbers of papers: some reviewers got as few as 15 papers while others did over 25 reviews in the first phase. The rationale is that it is less onerous, and gives better results, to review many papers in one’s own field than even a handful of papers in an unfamiliar field.

By the time reviews were sent to authors most papers had already received at least 2 reviews – sadly many late reviews were not given to authors to avoid delaying the rebuttal deadlines. Once rebuttals came in, the second phase of reviewing started, involving further reviews and active discussion.

Controversially, we asked the second phase reviewers to have a look at the previous reviews and author rebuttals. Their job was not only to assess the paper, but also to engage with the previous reviewers and with the answers the authors had provided in their rebuttal. At the same time, this invalidates the usual assumption that reviews are independent. In practice reviews and scores get changed so often during the discussion phase (and this is a good thing!) that this is unlikely to severely distort the results.

This is where things get tricky: after rebuttals we had a collection of 429 papers, some with 1, some with 2, many with 3 and some with more reviews. How do we know which papers need more reviews? How do we know if a paper with one bad review may simply have been unlucky? How do we know whether a paper with a certain number of reviews is likely to make it in the top 60 that are to be accepted, particularly when the reviewers’ confidences are different?

Of course, reading the reviews gives a lot of qualitative information to answer the above questions. Some reviews and comments explicitly requested more reviews. At all stages of the process any PC member could simply request more reviews for any paper. In the end, qualitative factors and the view of the PC decided which papers needed more reviews, as well as the final outcome for each paper.

At the same time, it is genuinely hard to judge where a paper (out of many hundreds) sits in comparison with the others given a set of often contradictory scores. It is also genuinely hard to get intuitions about how likely it is that a paper has received a few bad (or good) reviews by chance or by mistake, i.e. the natural variance of a small set of reviews. Taking into account reviewer confidence when judging paper ranking is difficult and poorly supported in conference management systems.

To help with this task, along with the quantitative feedback, we devised a system to estimate the ranking of papers – for the purpose of assigning more reviews (not acceptance!). Broadly, data sets of reviewer scores and confidences were used from this year’s CCS as well as past years to build a model of the range of scores we expect each paper to be taking if more reviews were to be assigned. Then we used a Monte-Carlo approach to get an estimate of the rank of the paper as well as confidence intervals on this estimate (and any other desirable statistic).

A few more words on the model used, as well as its rationale:

- We consider each paper in turn and sample sets of scores it would have, given the scores so far, up to 5 scores.
- To do this we first apply a stage of bootstrapping: we sample, with replacement, as many scores as the paper already has from its own list of scores. Scores include the confidence of the reviewer (i.e. each score is a tuple (score, confidence)). This stage is meant to smooth out the influence of any individual atypical review.
- Then we take the re-sampled scores, and assign new scores until we have 5 scores. The assignment is done by taking two random existing scores (and their confidences), and conditioning the new score on these scores. The dataset of the last 3 CCS conferences was used to achieve this.
- Rejection sampling was used to ensure we do not end up with atypical triplets (i.e. we only keep tuples of scores that we have seen a few times before in real data).
- This procedure was repeated 2000 times for each paper, creating 2000 different “scenarios” of rankings for all papers.
- For each paper we then count how often it would have been in the top 60 papers (per score only). That gives us a measure of the probability of being included in the final program.
- By taking the ranking of a paper in all 2000 scenarios we can calculate the confidence intervals (CI) of the paper rank.

All the above modelling steps are conservative, in the sense that they increase the variance in the distribution of scores of a paper.
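The steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the code used for CCS11: the function names, the toy `past_papers` history, the fallback to a pooled draw when the history has no usable match, the use of the mean score for ranking, and the random tie-breaking are all assumptions made for the example.

```python
import itertools
import random
from collections import Counter, defaultdict

# A review is a (score, confidence) tuple, as in the post.

def pair_key(a, b):
    return tuple(sorted((a, b)))

def build_model(past_papers):
    """For each pair of reviews, count which third reviews accompanied it."""
    cond = defaultdict(Counter)
    for reviews in past_papers:
        for i, j in itertools.combinations(range(len(reviews)), 2):
            for k, third in enumerate(reviews):
                if k not in (i, j):
                    cond[pair_key(reviews[i], reviews[j])][third] += 1
    return cond

def sample_next(reviews, cond, pool, rng, min_seen=2, max_tries=50):
    """Draw one extra review conditioned on two random existing ones."""
    seen = cond.get(pair_key(rng.choice(reviews), rng.choice(reviews)))
    if seen:
        options, weights = zip(*seen.items())
        for _ in range(max_tries):
            candidate = rng.choices(options, weights)[0]
            if seen[candidate] >= min_seen:  # rejection step: skip rare triplets
                return candidate
    return rng.choice(pool)  # assumed fallback: draw from all past reviews

def rank_interval(ranks, level=0.90):
    """Empirical interval covering `level` of the simulated ranks."""
    ordered = sorted(ranks)
    n = len(ordered)
    lo = ordered[int((1 - level) / 2 * n)]
    hi = ordered[min(n - 1, int((1 + level) / 2 * n))]
    return lo, hi

def simulate(papers, cond, pool, n_accept, n_scenarios=2000, target=5, seed=0):
    rng = random.Random(seed)
    in_top, ranks = Counter(), defaultdict(list)
    for _ in range(n_scenarios):
        means = {}
        for pid, reviews in papers.items():
            sample = [rng.choice(reviews) for _ in reviews]  # bootstrap stage
            while len(sample) < target:                      # fill up to 5 scores
                sample.append(sample_next(sample, cond, pool, rng))
            means[pid] = sum(score for score, _ in sample) / len(sample)
        # Rank by mean score only; ties are broken at random.
        order = sorted(means, key=lambda p: (means[p], rng.random()), reverse=True)
        for rank, pid in enumerate(order, start=1):
            ranks[pid].append(rank)
            if rank <= n_accept:
                in_top[pid] += 1
    return {pid: (in_top[pid] / n_scenarios, rank_interval(ranks[pid]))
            for pid in papers}

# Toy data, invented for illustration only.
past_papers = [
    [(2, 3), (2, 3), (1, 3)], [(2, 3), (2, 3), (1, 3)], [(2, 3), (1, 3), (1, 3)],
    [(-2, 3), (-2, 3), (-1, 3)], [(-2, 3), (-2, 3), (-1, 3)], [(-2, 3), (-1, 3), (-1, 3)],
]
pool = [review for reviews in past_papers for review in reviews]
cond = build_model(past_papers)
papers = {"strong": [(2, 3), (2, 3)], "weak": [(-2, 3), (-1, 3)], "single": [(1, 3)]}

results = simulate(papers, cond, pool, n_accept=1, n_scenarios=500)
for pid, (prob, interval) in results.items():
    print(f"{pid}: P(top {1}) = {prob:.2f}, 90% rank interval = {interval}")
```

With real data, `past_papers` would hold the review sets from previous conferences and `n_accept` would be 60; the per-paper probability and rank interval are then what the triage described in the post would consume.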

A set of heuristics was then used to determine which papers needed more reviews. Any paper with probability lower than 1% of being included in the program suffered a sudden death: unless the reviews were ambivalent, it received no further reviews. Any paper with probability between 1% and 10% received at least 3 reviews, and any paper with probability over 10% received more reviews. Of course these were only heuristics – the qualitative information in the reviews, and the express feedback from reviewers, heavily guided the assignment of additional reviews.
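Written out as code, the thresholds look roughly like this. It is only a sketch of the rule of thumb: the function name and the returned labels are made up for illustration, and in practice the qualitative feedback could override any of these outcomes.

```python
def review_plan(p_accept, ambivalent=False):
    """Map a model probability of making the program to a reviewing action."""
    if p_accept < 0.01:
        # "Sudden death", unless the existing reviews are ambivalent.
        return "eligible for further reviews" if ambivalent else "no further reviews"
    if p_accept <= 0.10:
        return "at least 3 reviews"
    return "more reviews"

for p in (0.005, 0.05, 0.5):
    print(p, "->", review_plan(p))
```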

It is informative to give a feel for the distribution of papers according to their respective probabilities of acceptance using the above model. At the very end of the reviewing process, about 50 papers had probability more than 50% of being accepted. About 125 papers had probability higher than 10% of being accepted and 194 papers had a probability greater than 1%. The last number is less than half of the submitted papers.

This is mildly encouraging: it means that the reviewing process actually does yield value – the posterior probability distribution over papers (conditioning on the scores) is more informative than the prior probability of 60 / 429, even once we take into account the variance of reviews.

When one compares the probabilities given by the model with the actual acceptances and rejections of papers, it is clear that there has been a slight bias towards accepting papers with higher probabilities of acceptance (we would have expected to choose about 43 papers from the 60 papers with highest acceptance probability, but in fact 48 were selected). This bias may be due to the qualitative discussion around those papers, or it might be due to reviewers paying too much attention to the actual scores and specific reviews – without contextualizing them in relation to their natural variance.

Since the model gives us a probability of acceptance for each paper, we can use it to estimate the probability of error in the program. Assuming the model is correct, we have made an expected 20 out of 60 mistakes in our selection of papers in the worst case! Reality of course is less tragic, as the model takes no account of the qualitative feedback and discussion, but the potential magnitude of the error is a sobering realization that scores alone are a poor way to choose papers. (Note that if we were selecting papers at random we would have made 52 mistakes on average – so it’s still much better than random). Given just the score information, we would have to accept at least 120 papers to ensure that 55 of the top 60 papers were included in the program on average – an enormous cost.
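The figures in this paragraph can be roughly reproduced with a few lines of arithmetic. The random baseline follows directly from 60 acceptances out of 429 submissions. The model-based estimate below is only a cross-check: it uses the acceptance-probability bands listed at the end of the post and assumes every accepted paper sits at the midpoint of its band, which is an approximation rather than the exact per-paper calculation.

```python
ACCEPTED = 60
SUBMISSIONS = 429

# Random selection: each accepted paper is in the "true" top 60
# with probability 60 / 429, so the rest are mistakes.
random_mistakes = ACCEPTED * (1 - ACCEPTED / SUBMISSIONS)

# (assumed band midpoint, accepted papers in that band), from the band table.
ACCEPTS_BY_BAND = [
    (0.05, 1), (0.15, 2), (0.25, 2), (0.35, 9), (0.45, 3), (0.55, 8),
    (0.65, 6), (0.75, 7), (0.85, 6), (0.95, 13), (1.00, 3),
]
assert sum(n for _, n in ACCEPTS_BY_BAND) == ACCEPTED

# Expected mistakes under the model: sum of (1 - p) over accepted papers.
model_mistakes = sum(n * (1 - p) for p, n in ACCEPTS_BY_BAND)

print(f"random selection: {random_mistakes:.1f} expected mistakes")
print(f"model, band midpoints: {model_mistakes:.1f} expected mistakes")
```

The first figure rounds to the 52 quoted above, and the second comes out at about 21, in line with the roughly 20 expected mistakes mentioned in the text.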

The above really illustrates the importance of deep engagement in the discussion phase. After the second round of reviews came back, most papers were discussed in some way. Discussions went deep into the technical contributions of the paper, and reasons for rejecting papers had to be documented and cross checked by all reviewers. In total 1972 comments were made on papers and the number of comments is correlated with the score as well as the variance of reviews. Given the natural variance of mere scores, this makes all the difference.

It is also interesting to estimate the cost of the process to the community of volunteers. If we assume that each review takes at least 3 hours, and that each comment takes at least 5 minutes, the review process has taken about 1.8 person-years of work (assuming 300 working days a year at 8 hours a day). If we value the opportunity cost of each reviewer's time at $100K a year, the process has cost $177K, or a minimum of $413 per submission. This volunteer commitment is worth keeping in mind when debating the cost of academic publishing, as well as conferences.
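Here is a sketch of that back-of-the-envelope calculation. The total number of reviews is not stated in the post, so it is estimated here from the comments histogram at the end (papers per bucket times the rounded average number of reviews); that estimate, about 1,370 reviews, is an assumption and explains small differences from the figures quoted above.

```python
# (papers, average reviews per paper) for each row of the comments histogram
HISTOGRAM = [
    (112, 2.3), (52, 2.8), (44, 3.0), (42, 3.3), (30, 3.6), (23, 3.7),
    (16, 3.8), (18, 3.8), (17, 3.9), (13, 3.8), (7, 4.1), (15, 3.9),
    (5, 4.0), (5, 4.0), (3, 4.0), (7, 3.7), (20, 4.5),
]
COMMENTS = 1972
SUBMISSIONS = 429

HOURS_PER_REVIEW = 3
MINUTES_PER_COMMENT = 5
HOURS_PER_DAY = 8
DAYS_PER_YEAR = 300
COST_PER_YEAR = 100_000  # assumed opportunity cost of a reviewer, in dollars

reviews = sum(papers * avg_reviews for papers, avg_reviews in HISTOGRAM)
hours = reviews * HOURS_PER_REVIEW + COMMENTS * MINUTES_PER_COMMENT / 60
person_years = hours / HOURS_PER_DAY / DAYS_PER_YEAR
total_cost = person_years * COST_PER_YEAR

print(f"estimated reviews: {reviews:.0f}")
print(f"person-years: {person_years:.2f}")
print(f"total cost: ${total_cost:,.0f}, per submission: ${total_cost / SUBMISSIONS:,.0f}")
```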

In conclusion: the rebuttal phase followed by further reviews has been a positive experiment, and requiring second-round reviewers to comment on previous reviews and rebuttals ensures that the authors' opinions are taken seriously. The decision to focus reviewing effort where the uncertainty was is controversial, and was supported both by a time-consuming qualitative understanding of the papers and the reviews, and by quite advanced quantitative models of the natural variance of papers given a set of scores. The latter can only be used as a guide to commit reviewing effort. Scores on their own do not give a very good indication of which papers should be included in the program and which should be excluded. Concentrating eyeballs where the stronger papers were was overall a very good idea, and a robust and lengthy discussion amongst the PC is absolutely essential. It is still astonishing that more than half the papers submitted had less than 1% probability of finding their way into the program (despite the variance of reviews). Even more astonishing is the potential expected error from the selection process – one more reminder that acceptance (or rejection) from a specific venue, no matter how competitive, is only one hint about the ultimate quality of the work.

Finally, as a reward for reading all the way through the post, here is some data for the geeks! Hopefully it can illuminate the debates about reviewing with some facts:

- A table illustrating the natural variance of scores in the past few CCS conferences.

A simplified table from the CCS11 reviewing process (and previous CCS) showing the distribution of other scores (columns), given a (score, confidence) pair in a review (rows). Entries with less than 4 have been redacted to "x" to illustrate they are not significant at all and preserve privacy. Note: The actual model used to estimate missing scores from CCS11 submissions was taking into account 2 pairs of (score, confidence) at a time and mapping them to another (score, confidence) pair with the appropriate empirical probability. This full table is not shown here.

| Score (conf.) | -3 | -2 | -1 | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|---|---|
| -3(1) | 7 | 10 | 4 | x | x | x | x |
| -3(2) | 22 | 31 | x | x | 10 | x | x |
| -3(3) | 87 | 147 | 34 | 19 | 9 | x | x |
| -3(4) | 84 | 115 | 34 | 11 | 15 | 9 | x |
| -2(0) | x | x | x | x | 5 | x | x |
| -2(1) | 17 | 45 | 26 | x | 18 | 4 | x |
| -2(2) | 50 | 207 | 185 | 68 | 41 | 32 | x |
| -2(3) | 154 | 454 | 419 | 161 | 150 | 70 | 4 |
| -2(4) | 82 | 225 | 180 | 75 | 72 | 40 | 4 |
| -1(0) | x | x | x | 5 | x | x | x |
| -1(1) | 12 | 48 | 50 | 12 | 15 | 5 | x |
| -1(2) | 24 | 310 | 167 | 95 | 128 | 52 | 7 |
| -1(3) | 36 | 350 | 238 | 190 | 197 | 102 | 6 |
| -1(4) | x | 102 | 62 | 65 | 48 | 26 | x |
| 0(1) | 7 | 29 | 23 | 18 | 17 | 11 | x |
| 0(2) | 18 | 137 | 150 | 88 | 85 | 63 | 5 |
| 0(3) | x | 117 | 136 | 91 | 99 | 85 | 5 |
| 0(4) | 4 | 22 | 58 | 32 | 26 | 15 | x |
| 1(0) | x | 4 | x | x | 4 | x | x |
| 1(1) | 9 | 38 | 29 | 14 | 23 | 29 | x |
| 1(2) | 16 | 115 | 159 | 84 | 97 | 76 | 12 |
| 1(3) | 11 | 95 | 154 | 102 | 113 | 105 | 9 |
| 1(4) | x | 34 | 47 | 27 | 35 | 35 | x |
| 2(0) | x | x | x | x | x | x | x |
| 2(1) | x | 17 | 22 | 10 | 12 | 20 | x |
| 2(2) | 7 | 59 | 85 | 56 | 83 | 68 | 17 |
| 2(3) | 4 | 60 | 60 | 81 | 104 | 81 | 15 |
| 2(4) | 4 | 11 | 17 | 27 | 46 | 40 | 6 |
| 3(1) | x | x | x | x | x | x | x |
| 3(2) | x | 7 | 12 | x | x | 11 | 4 |
| 3(3) | x | x | x | 4 | 14 | 22 | 7 |
| 3(4) | x | x | x | 4 | 6 | 6 | x |

- A graph of the ordering of CCS11 papers by rank according to the model, and the 90% CI of the rank. Acceptance / Rejection decision is not given — this information was used to inform assigning reviews, not acceptance.

- A table showing the number of papers within 10% bands of acceptance probabilities according to the model, as well as the number accepted in each band.

The distribution of accept probabilities for CCS11 submissions given the score model used to assign further reviews. Columns:

- Accept prob.: The probability of the paper being in the top-60 papers by score, given the model of scores.
- Sub.: The number of submissions with a probability in this range.
- Accepts: The number of accepted papers with a probability in this range.

| Accept prob. | Sub. | Accepts |
|---|---|---|
| 0% - 9% | 302 | 1 |
| 10% - 19% | 40 | 2 |
| 20% - 29% | 16 | 2 |
| 30% - 39% | 17 | 9 |
| 40% - 49% | 6 | 3 |
| 50% - 59% | 10 | 8 |
| 60% - 69% | 8 | 6 |
| 70% - 79% | 7 | 7 |
| 80% - 89% | 6 | 6 |
| 90% - 99% | 14 | 13 |
| 100% | 3 | 3 |

- A table showing a histogram of the number of comments, the number of papers for a certain number of comments, the average score of the papers, the average difference between the high and low review and the average number of reviews.

Histogram of number of comments during the ACM CCS11 review process. Columns:

- Comments: number of comments
- Papers: number of papers with a certain number of comments
- Av.Sc.: The average score of the papers in this bucket.
- Av.Dif.: The average difference between high and low score of the papers in this bucket.
- Av.Revs: The average number of reviews of the papers in this bucket.

| Comments | Papers | Av.Sc. | Av.Dif. | Av.Revs |
|---|---|---|---|---|
| 0 | 112 | -1.98 | 0.78 | 2.3 |
| 1 | 52 | -1.49 | 1.35 | 2.8 |
| 2 | 44 | -1.08 | 1.48 | 3.0 |
| 3 | 42 | -0.62 | 2.38 | 3.3 |
| 4 | 30 | -0.24 | 2.03 | 3.6 |
| 5 | 23 | -0.21 | 2.00 | 3.7 |
| 6 | 16 | -0.40 | 2.38 | 3.8 |
| 7 | 18 | 0.18 | 2.61 | 3.8 |
| 8 | 17 | 0.28 | 2.29 | 3.9 |
| 9 | 13 | 0.08 | 2.85 | 3.8 |
| 10 | 7 | 0.22 | 2.57 | 4.1 |
| 11 | 15 | 0.06 | 3.07 | 3.9 |
| 12 | 5 | 0.65 | 2.60 | 4.0 |
| 13 | 5 | 0.65 | 2.40 | 4.0 |
| 14 | 3 | XXXX | XXXX | 4.0 |
| 15 | 7 | 0.24 | 2.00 | 3.7 |
| 16 | 20 | 0.62 | 3.00 | 4.5 |

Fascinating post George. Thanks again to both you and Vitaly for your hard work in organizing the careful review of an unprecedented number of papers.

Sorry to suggest more work, but it would be awesome to open source your analytical approach so that other conferences might adopt a more quantitative approach to review assignment.

I especially like your estimate of the $ cost of reviewing, BTW. It really drives home the point that review assignment needs to be done in a more thoughtful way. I would expect that if we put an explicit value on reviewer time, there would be more demand for approaches like the one you describe.