How Blind Is A Blind Peer Review Really?

In academic publishing, double-blind peer review was introduced to conceal the identity of a paper’s authors and thereby reduce bias against minorities. This should ensure an equal and fair evaluation of all authors, regardless of gender, country of origin, or affiliation.

But you might be wondering: isn’t it possible to identify the authors based on other factors such as the content, writing style, or references? Using a self-generated citation network, this article presents several prediction models that identify patterns in the references an author selects and use them to predict the author’s sensitive information.

What is a Peer Review?

Academic peer review, also called scholarly peer review, describes assessing a draft of scientific papers by experts in the same field (peers) to determine the suitability of publication in a scientific journal or the proceedings of an academic conference. Single-blind and double-blind peer reviews have been established as the most common types of academic peer review. The single-blind peer review conceals the reviewer’s identity, enabling them to give an honest critique of the article without the influence exerted by the author. As the author’s identity is exposed in a single-blind peer review, the reviewer may become influenced by the author’s reputation, gender, or affiliation. This bias could impact the quality of the evaluation and ultimately lead to inequality in academic publishing.
The double-blind peer review, widely adopted in the 21st century, addresses this issue by keeping the identities of both parties, the author and the reviewer, hidden. Typically, all information associated with the author is removed. This includes the author’s name, affiliations, and acknowledgments. Further, phrasing that could hint at the author, such as “in our previous work [X],” must be removed.

Effects of double-blind review

Some case studies and experiments investigated the effect of adopting the double-blind review.

  • An experiment conducted in 2017 at the 10th ACM International Conference on Web Search and Data Mining (WSDM) split the reviewers into two groups. One group used a single-blind review, whereas the other applied a double-blind review. The single-blind reviewers were significantly more likely to bid for papers by more prestigious authors, universities, and companies than their counterparts [1].
  • In one study, when the author’s gender was known to the reviewer, men’s work was generally rated higher by both men and women. This was not the case when gender was kept hidden. Also, the journal Behavioral Ecology noticed a significant increase in publications by female first authors after introducing the double-blind review in 2001 [2,3].
  • Previous papers suggest that single-blind peer review is also prone to nepotism. Authors associated with their corresponding peers had a significant advantage over those without personal ties [3].

In general, double-blind reviews are more critical and attempt to be more objective by disregarding a person’s prestige, gender, or relationships, resulting in a lower acceptance rate than single-blind peer-reviewed papers.

However, the double-blind peer review has one key issue: it provides a false sense of security, as reviewers may still be able to identify the authors of a paper that has no personal information attached. A survey by the American Journal of Public Health showed that most reviewers could recognize the author based on their background knowledge and the reference list [4].

Data Collection

We extracted the paper IDs from the arXiv repository and additional information from the Semantic Scholar API. We limited the data to the computer vision field from 2000 to 2022. Semantic Scholar returns the essential information, including the title, publication year, authors, and the Digital Object Identifier (DOI). The DOI gave us access to the publisher’s website, from which a web crawler extracted the affiliation and country of each author. Typically, publishers do not provide information about the author’s gender. Consequently, we used genderize.io, a name-based gender prediction service, to predict the author’s gender for this dataset.

Prediction models

Simple Count Approach (SCA)

The prediction model presented in this section is a self-citation approach introduced by Hill and Provost [5] in 2003. It classifies the authors of a paper based on the most frequently occurring features in the paper’s references. This approach leverages the idea that authors cite authors with similar characteristics. For example, female authors tend to cite other female authors, as women are more present in their network. Also, if somebody works for inovex, the odds are high that they cite somebody else from the same company due to proximity or collaborations. Authors may also cite their own work repeatedly to get a point across.
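
The counting idea can be sketched in a few lines. This is a minimal illustration, not the original implementation: the function name `simple_count_predict` and the example country codes are assumptions made for this sketch.

```python
from collections import Counter


def simple_count_predict(reference_features):
    """Return the most frequent feature value among a paper's references.

    `reference_features` holds one value (e.g. a country) per cited author.
    Unknown values are passed as None and ignored.
    """
    counts = Counter(f for f in reference_features if f is not None)
    if not counts:
        return None  # nothing to count, so no prediction is possible
    return counts.most_common(1)[0][0]


# Hypothetical example: countries of the authors cited by one paper.
cited_countries = ["DE", "US", "DE", "CN", "DE", "US"]
predicted_country = simple_count_predict(cited_countries)
print(predicted_country)
```

The same function can be applied to gender, affiliation, or author IDs by swapping the list of feature values.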

SCA: Evaluation

[Table: SCA accuracy per feature. Row “Including Co-Authors”: -, 0.53, 0.28, 0.21]

The results suggest that the more specific the feature, the lower the accuracy. The gender prediction reaches an accuracy of 63 %, which seems reasonably good. However, a rudimentary model that always predicts “male” would reach an accuracy of 83 % due to the gender imbalance in the dataset. Simply taking the most frequent feature in the references predicts the primary country about 1/3 of the time and the affiliation and the author about 1/5 of the time. If we include the co-authors, the performance increases significantly.
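
To see why the 63 % figure has to be read against the class imbalance, the majority-class baseline can be computed directly. The sketch below uses a toy label list that only mirrors the 83 % share mentioned above; it is not the actual dataset, and the function name is an assumption.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Toy labels with the same imbalance as described in the text (assumption).
labels = ["male"] * 83 + ["female"] * 17
baseline = majority_baseline_accuracy(labels)
print(baseline)
```

Any classifier scoring below this baseline has learned less than the class distribution alone provides.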

Impact of Paper Count

We divide the authors into buckets according to the number of papers they have previously published. The higher an author’s paper count, the easier it is for the SCA to predict the desired feature. One outlier exists for the bucket 60-79 and the author feature, which could be due to the small number of authors within that bucket.

Overall, this model provides a simple and effective way to determine the author’s characteristics. However, it has some limitations, as it assumes that an author (1) has prior publications and (2) has a high tendency to cite these prior publications. Additionally, the Simple Count Approach is biased towards more represented classes, which implicitly disadvantages minority classes.

Random Forests (RF)

Random Forests were chosen for their solid predictive power, which comes from an ensemble of decision trees. Instead of directly taking the references of the test papers, we use previously published papers by the same authors. That way, the model can identify patterns in the training set and then apply its learned parameters to the test set. As a result, authors with a low self-citation bias can also be identified.

Random Forest Process (illustration based on [6])
When building each decision tree, the Random Forest draws a bootstrap sample (a random sample with replacement) of the training set. Additionally, it performs feature bagging, that is, it uses only a random subset of the features from the original training set. As a result, features with high predictive power are not present in all trees, which reduces the risk of overfitting the training set. For an unseen test set, the Random Forest model passes the input to each decision tree, and each tree predicts an individual output. For classification, the majority vote is taken as the model’s overall output. For regression problems, the average can be taken.
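
The three building blocks described above (resampling the rows, sampling the features, and combining the trees’ outputs) can be sketched without any library. The tree training itself is left out, and all function names and toy values are assumptions made for this illustration.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one pseudo sample per tree)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(feature_names, k, rng):
    """Pick k distinct features for one tree (feature bagging)."""
    return rng.sample(feature_names, k)


def majority_vote(predictions):
    """Combine per-tree class predictions into the forest's output."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Combine per-tree numeric predictions for regression."""
    return sum(predictions) / len(predictions)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
rows = list(range(10))  # stand-in for ten training papers
features = ["gender", "country", "affiliation", "author"]

sample = bootstrap_sample(rows, rng)
subset = random_feature_subset(features, 2, rng)
print(len(sample), subset)
print(majority_vote(["DE", "US", "DE"]))
```

In practice one would rely on an existing Random Forest implementation; the sketch only makes the sampling and voting steps explicit.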

RF: Evaluation


The Random Forest model captures relationships between prior publications in the training set and the test data. Compared to the Simple Count Approach, it improves the accuracy by ~20 % in each category. However, we assume that the performance could be further enhanced as Random Forests are not optimized toward graph-structured data:

  • The features were converted to one-hot encoded features for the RF. This results in a large sparse matrix representation, which could lead to overfitting.
  • The matrix representation does not consider edge weights or additional node attributes.
  • The model cannot directly capture structural graph information, making it challenging to detect relationships that depend on the graph’s topology.
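
The sparsity mentioned in the first point is easy to demonstrate. The following sketch one-hot encodes a hypothetical list of affiliations; the affiliation names other than inovex and the helper name `one_hot_encode` are assumptions for illustration.

```python
def one_hot_encode(values):
    """Return (vocabulary, rows) where each row has exactly one 1."""
    vocabulary = sorted(set(values))
    index = {value: i for i, value in enumerate(vocabulary)}
    rows = []
    for value in values:
        row = [0] * len(vocabulary)
        row[index[value]] = 1
        rows.append(row)
    return vocabulary, rows


# Hypothetical affiliations of four authors.
affiliations = ["inovex", "KIT", "MIT", "inovex"]
vocabulary, matrix = one_hot_encode(affiliations)

nonzero = sum(sum(row) for row in matrix)
total = len(matrix) * len(vocabulary)
print(vocabulary)
print(f"{nonzero} of {total} entries are non-zero")
```

Each row contains a single non-zero entry, so the share of non-zero values shrinks as the number of distinct affiliations or authors grows.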

Consequently, a Graph Neural Network might be a more suitable option as this model directly employs Deep Learning in the graph representation of the data.

Graph Neural Network

Neural networks have had a lot of success recently. Researchers made significant progress in object detection or speech recognition with deep learning architectures, such as convolutional neural networks, recurrent neural networks, or auto-encoders. Yet, these paradigms do not work in every use case as they are limited to Euclidean data. By definition, Euclidean data like images or videos can be plotted in an n-dimensional linear space.

However, in this case, scientific papers can reference an arbitrary number of other papers. Therefore, a citation network deals with non-Euclidean data. When dealing with non-Euclidean data, convolutional neural networks and recurrent neural networks reach their limits, as they are not designed to handle arbitrary sizes and complex graph topologies. Additionally, the embeddings in a graph neural network should stay consistent across permuted versions of the same graph. To achieve this, a graph neural network uses a permutation-invariant function that outputs the same vector regardless of the order in which the graph’s nodes are presented.
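
Permutation invariance can be checked with a small example. The sketch below compares a sum aggregation with a plain concatenation of neighbor features; the feature values are made up for illustration.

```python
def sum_aggregate(neighbor_features):
    """Element-wise sum over equally sized feature vectors."""
    return [sum(column) for column in zip(*neighbor_features)]


def concat_aggregate(neighbor_features):
    """Concatenation, shown only as an order-dependent counterexample."""
    return [value for vector in neighbor_features for value in vector]


neighbors = [[1, 0, 2], [0, 3, 1], [4, 1, 0]]
reordered = [neighbors[2], neighbors[0], neighbors[1]]

# The sum does not change when the neighbors are listed in another order.
print(sum_aggregate(neighbors) == sum_aggregate(reordered))
# The concatenation does change.
print(concat_aggregate(neighbors) == concat_aggregate(reordered))
```

Sum, mean, and max all share this property, which is why they are common choices for neighborhood aggregation.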

Graph Convolutional Neural Network (GCN)

Graph convolutional neural networks are a specialization of graph neural networks. GCNs apply convolutions on the graph by learning local feature representations from each node’s neighborhood. This enables the representation of each node to capture information about the graph’s structure.

GCN downstream calculations (Own Illustration based on [7])
The graph convolutional neural network always contains an encoder. The encoder typically consists of multiple layers. One layer corresponds to one message-passing step. In other words, with a single layer, each node aggregates information from its direct neighbors. With two layers, it aggregates the information of nodes up to two hops away, and by stacking more layers, the nodes eventually learn about the complete graph’s structure.

In a single graph convolutional layer, we apply a shared permutation-invariant function \(\phi (x_{u},X_{N_{u}})\) over the local neighborhood of the vertex \(u\). \(N_{u}\) is defined as the set of neighboring vertices of the node \(u\), and \(X_{N_{u}}\) denotes the features of that neighborhood. Each neighbor’s features are potentially transformed by some function \(\psi\) and then aggregated with a permutation-invariant operator \(\bigoplus\). The features of node \(u\) are then updated by some function \(\phi\). The functions \(\phi\) and \(\psi\) are typically learnable, whereas the nonparametric operation \(\bigoplus\), like mean or max, is not. Note that the nonparametric operation ensures that the result of the message propagation stays the same regardless of the permutation of the graph.

\[ h_{u} = \phi (x_{u}, \bigoplus_{v \in N_{u}} c_{uv}\psi (x_{v})) \]

In the convolutional layer, the features of each neighbor are weighted with a constant coefficient \(c_{uv}\), which specifies the importance of the neighbor \(v\) to the final aggregation at \(u\). Typically, it depends on the structural properties of the graph \(G\) (e.g., the number of incoming edges of the neighbor \(v\)). Finally, the GCN maps each node in the graph to its corresponding node embedding. These embeddings can be used for node or graph classification or for link prediction.
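
One message-passing step of the formula above can be written out in plain Python. This is a deliberately simplified sketch under several assumptions: \(\psi\) is the identity, \(\bigoplus\) is a sum, \(c_{uv}\) is set to one over the number of neighbors of \(u\), and \(\phi\) adds the node’s own features and applies a ReLU. No weights are learned, and the toy graph is made up.

```python
def gcn_layer(features, neighbors):
    """One message-passing step: h_u = phi(x_u, sum_v c_uv * psi(x_v))."""
    updated = {}
    for u, x_u in features.items():
        message = [0.0] * len(x_u)
        for v in neighbors[u]:
            c_uv = 1.0 / len(neighbors[u])  # assumption: mean normalization
            for i, value in enumerate(features[v]):  # psi = identity
                message[i] += c_uv * value
        # phi (assumption): add the node's own features, then apply ReLU
        updated[u] = [max(0.0, own + msg) for own, msg in zip(x_u, message)]
    return updated


# Toy graph: node "a" is connected to "b" and "c".
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}

h1 = gcn_layer(features, neighbors)  # each node sees its direct neighbors
h2 = gcn_layer(h1, neighbors)        # stacking: information from two hops
print(h1["a"], h2["b"])
```

After the second call, the embedding of "b" also reflects the features of "c", although the two nodes are not directly connected.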

A graph convolutional neural network can also predict the likelihood of an edge between two nodes in the graph. Unlike the other downstream tasks, link prediction also requires edges that do not exist in the graph (negative edges), as the model needs to learn to assign low scores to negative labels. For this case, the simplest and most common approach to negative sampling is applied: uniform sampling. This involves randomly selecting edges of the negative graph with equal probability.
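
Uniform negative sampling can be sketched as follows. The sketch enumerates all non-existent directed edges, which is only feasible for small graphs; the node names and the function name are assumptions for illustration.

```python
import random


def sample_negative_edges(nodes, positive_edges, num_samples, rng):
    """Uniformly sample directed node pairs that are not edges in the graph."""
    positives = set(positive_edges)
    candidates = [
        (u, v)
        for u in nodes
        for v in nodes
        if u != v and (u, v) not in positives
    ]
    # Every candidate has the same probability of being drawn.
    return rng.sample(candidates, min(num_samples, len(candidates)))


rng = random.Random(42)  # fixed seed so the sketch is reproducible
papers = ["p1", "p2", "p3", "p4"]
cites = [("p1", "p2"), ("p2", "p3"), ("p1", "p4")]

negatives = sample_negative_edges(papers, cites, 3, rng)
labels = [1] * len(cites) + [0] * len(negatives)
print(negatives, labels)
```

For large graphs, one would instead draw random node pairs and reject the ones that already exist as edges.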

Note that link prediction is a binary classification, as the negative edges’ labels are 0 and the positive ones are 1. Further, the likelihood of an edge can be computed from the corresponding pair of node embeddings. Common node pair aggregation functions include the cosine similarity or the dot product. The binary cross entropy loss function is typically used to compute the loss between the predicted likelihood and the labels.
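
The scoring and the loss can be illustrated with plain Python. The sketch assumes that the dot product is squashed by a sigmoid to obtain a probability; the embedding values are made up.

```python
import math


def dot(a, b):
    """Dot product of two equally sized embeddings."""
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(probabilities, labels, eps=1e-12):
    """Mean binary cross entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Hypothetical embeddings of one paper and two candidate authors.
paper = [1.0, 2.0]
true_author = [0.5, 1.0]   # positive edge, label 1
other_author = [-1.0, 0.0]  # sampled negative edge, label 0

probabilities = [sigmoid(dot(paper, true_author)), sigmoid(dot(paper, other_author))]
loss = binary_cross_entropy(probabilities, [1, 0])
print(probabilities, loss)
```

The loss is small when the positive pair receives a high score and the negative pair a low one.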

After calculating the loss during training, the next step is to update the model’s parameters with backpropagation. This involves calculating the gradient of the loss with respect to the model’s parameters. An optimizer such as Adam then updates the parameters in the direction of the negative gradient. Once the parameters have been updated, the next batch of training data is processed. This process is repeated until convergence or until a specified number of epochs is reached. The resulting model should be capable of predicting missing or future links.

Heterogeneous Graph Convolutional Neural Network (HGCN)

Heterogeneous graph convolutional neural networks extend graph convolutional neural networks. The main difference is that HGCNs can operate on heterogeneous graphs. Specifically, they treat each edge and node type differently, applying different convolutional operations. For instance, for the edge tuple (author, writes, paper), the aggregation function uses a convolutional layer that takes the sum of the authors’ features and concatenates it with the paper’s features. In the case of a co-author edge (author, co-author, author), the HGCN could apply a different aggregation function that directly aggregates the features. As a result, it learns different weight matrices for each node and edge type. With the final embeddings of the various node and edge types, it decodes the similarity of the node embeddings as before.
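
The idea of relation-specific aggregation can be sketched without learned weights. In the sketch below, the “writes” relation sums the authors and concatenates the result with the paper’s features, as described above. Using a mean for the co-author relation is an assumption, as are all names and values.

```python
def sum_pool(vectors, size):
    """Element-wise sum of feature vectors; zeros if there are none."""
    pooled = [0.0] * size
    for vector in vectors:
        for i, value in enumerate(vector):
            pooled[i] += value
    return pooled


def mean_pool(vectors, size):
    """Element-wise mean of feature vectors; zeros if there are none."""
    pooled = sum_pool(vectors, size)
    return [value / len(vectors) for value in pooled] if vectors else pooled


def update_paper(paper_features, author_vectors, author_dim):
    """(author, writes, paper): sum the authors, then concatenate."""
    return paper_features + sum_pool(author_vectors, author_dim)


def update_author(coauthor_vectors, author_dim):
    """(author, co-author, author): aggregate the features directly."""
    return mean_pool(coauthor_vectors, author_dim)


authors = {"a1": [1.0, 0.0], "a2": [0.0, 1.0]}
paper = [0.0, 0.0, 1.0]

paper_embedding = update_paper(paper, list(authors.values()), author_dim=2)
a1_embedding = update_author([authors["a2"]], author_dim=2)
print(paper_embedding, a1_embedding)
```

A real HGCN would additionally multiply each relation’s aggregate with its own learned weight matrix.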

HGCN: Implementation

Graph Structure
Abstraction of the graph

Each feature of the author is represented as a separate entity in the graph. The regular arrows visualize the message propagation layers used during the training process. The dashed lines are the edges that the models aim to predict. Thus, we train four models, each corresponding to one feature.

HGCN architecture

Each class of features is converted to a one-hot encoded vector and stored as a feature vector in each node. In the next step, multiple sum pooling layers aggregate the authors’ features into the authors’ vertices. The model then aggregates the authors’ information with another sum pooling layer and concatenates the result with the one-hot encoded paper IDs. A graph convolutional layer is then applied to the “cites” relation. A slight variation of the standard GCN function was used to avoid information leakage: a node’s own feature vector is excluded when its neighboring vectors are aggregated:

\[ h_i^{(l+1)} = \sigma \left( b^{(l)} + \sum_{j \in N(i)} \frac{1}{c_{ji}} h_j^{(l)} W^{(l)} \right) \]

Further, a dropout layer reduces the risk of the model overfitting the training set. Additionally, the ReLU function adds non-linearity. A final dense layer projects the result to the desired output dimension so that the dot product between the paper’s node embedding and the author’s feature vector can be computed.

HGCN: Evaluation


Train and validation metrics during training for author prediction

During training, we used a technique called early stopping. Generally, this refers to stopping the training process when the validation accuracy stops improving or starts to decrease. In this case, we obtained the model at epoch 330 with a validation accuracy of ~50 % for authorship prediction. That procedure is repeated for every feature. Finally, the HGCN performance looks as follows:
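
A common way to implement early stopping is a patience counter. The sketch below is one possible variant and not necessarily the one used here: the `patience` parameter and the accuracy values are assumptions for illustration.

```python
def early_stopping_epoch(validation_accuracies, patience=3):
    """Return (best_epoch, best_accuracy) under patience-based early stopping.

    Training stops once the validation accuracy has not improved for
    `patience` consecutive epochs; the best model seen so far is kept.
    """
    best_epoch, best_accuracy, waited = -1, float("-inf"), 0
    for epoch, accuracy in enumerate(validation_accuracies):
        if accuracy > best_accuracy:
            best_epoch, best_accuracy, waited = epoch, accuracy, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs
    return best_epoch, best_accuracy


# Made-up validation curve; the last value is never reached.
history = [0.30, 0.41, 0.47, 0.50, 0.49, 0.50, 0.48, 0.52]
best = early_stopping_epoch(history, patience=3)
print(best)
```

A larger patience tolerates longer plateaus at the cost of more training epochs.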



The results suggest that blind peer review is, in fact, not blind. A basic counting approach could already identify 1/5 of all authors, assuming that authors cite themselves the most. A Random Forest model significantly improved the performance to 38 %. A deep learning model, the heterogeneous graph convolutional neural network, could identify almost 50 % of all authors. The model’s predictive power increases as the features get more generic, with ~60 % for country prediction and 86 % for gender prediction.


How to solve this issue?

Some publishers, such as Elsevier, keep the double-blind review but have introduced countermeasures to reduce the risk of authorship identification:

  • Anonymize the author when self-referencing: [Tom 2003] -> [Anonymous 2003]
  • Remove and limit non-essential self-references
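
The first countermeasure can be partly automated. The sketch below assumes citations in the “[Name Year]” format shown above and a known list of the submitting authors’ names; both the function name and the sample sentence are made up for illustration.

```python
import re


def anonymize_self_citations(text, author_names):
    """Replace "[Name Year]" with "[Anonymous Year]" for the given names."""
    if not author_names:
        return text
    names = "|".join(re.escape(name) for name in author_names)
    pattern = re.compile(r"\[(?:" + names + r")(\s+\d{4})\]")
    return pattern.sub(r"[Anonymous\1]", text)


sentence = "As shown in [Tom 2003] and [Lee 2010], citation patterns matter."
print(anonymize_self_citations(sentence, ["Tom"]))
```

Citations of other authors are left untouched, so the reference list still reveals which related work the paper builds on.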

Though these measures mitigate bias during the reviewing process, they can negatively affect the quality of the review and are challenging to enforce. The publisher must also instruct the reviewers to avoid looking up personal information.

There are other methodologies in the peer review process that could tackle that issue differently:

For example, in the Open Peer Review, the identities of both the author and the reviewer are disclosed. The reviewer is obliged to return feedback on the article. This improves transparency and reduces potential biases, as the reviewer needs to reevaluate and justify their decision to the public [10].

Elsevier employed the Collaborative Peer Review in an experiment in 2014. This peer review technique allowed authors, editors, and reviewers to openly discuss and interact during the reviewing process. Ultimately, about 94 % believed the discussion led to a better review. On the downside, due to the discussions, this form of review causes the reviewer to put in significantly more effort. Compared to the classical approach, this also takes longer.

In conclusion, all peer review methods have advantages and disadvantages. When selecting a peer review method, editors should carefully evaluate their requirements and goals. At the same time, the chosen method should ensure the quality and integrity of the publication and provide a fair and transparent evaluation of scientific papers.


[1] A. Tomkins, M. Zhang, and W. D. Heavlin. “Reviewer bias in single- versus double-blind peer review.” In: Proceedings of the National Academy of Sciences 114.48 (2017), pp. 12708–12713. doi: 10.1073/pnas.1707323114. eprint: https://www.pnas.org/doi/pdf/10.1073/pnas.1707323114.

[2] R. M. Blank. “The Effects of Double-Blind versus Single-Blind Reviewing: Experimental Evidence from The American Economic Review.” In: The American Economic Review 81.5 (1991), pp. 1041–1067. issn: 00028282.

[3] C. Wennerås and A. Wold. “Nepotism and sexism in peer-review.” In: Nature 387.6631 (May 1997), pp. 341–343. issn: 1476-4687. doi: 10.1038/387341a0.

[4] S. J. Ceci and D. P. Peters. “How blind is blind review.” In: American Psychologist 39 (1984), pp. 1491–1494.

[5] S. Hill and F. Provost. “The Myth of the Double-Blind Review? Author Identification Using Only Citations.” In: SIGKDD Explor. Newsl. 5.2 (Dec. 2003), pp. 179–184. issn: 1931-0145. doi: 10.1145/980972.981001.

[6] Introducing TensorFlow Decision Forests. https://blog.tensorflow.org/2021/05/introducing-tensorflow-decision-forests.html. Accessed: 2023-3-7.

[7] Y. Long, M. Wu, Y. Liu, Y. Fang, C. Kwoh, J. Luo, and X. Li. “Pre-training graph neural networks for link prediction in biomedical networks.” In: Bioinformatics (2022).
