LLMs in academic peer review: the bad good idea, and vice versa — Blog and news on ethical AI and statistics in biomedicine.

“Dear Dr. Gosselin, thank you for your submission to our journal. We have now received the reports from the reviewers; you will find them below. Note that your manuscript was reviewed in 31 and 43 seconds, respectively, by our two automated AI experts as part of our fast-track evaluation process. No human reviewers were involved at any stage of the evaluation workflow, maximizing speed and unbiasedness.”

An extract dreamed up by a fan of Philip K. Dick or Isaac Asimov? I don’t know. An inescapable future ahead of us? Maybe. What is certain, though, is that large language models (LLMs) were quickly adopted by academic peer reviewers, just as in virtually every other corner of our society. The early survey conducted by Frontiers in 2024 (prehistoric times for the recent LLM boom!) showed that about half of scientists were already early LLM adopters for tasks including summarising findings, drafting reports or assessing methodological soundness. These numbers have no doubt evolved since, and it is not surprising that articles and position papers on the topic are now available (see references at the end of this post).

In this blog post, I offer my view of the topic. Note that the article focuses on general-purpose LLMs (such as ChatGPT, Claude, or Copilot) because they are the crux of the issue; throughout, the term LLM refers to such tools. The possible, and perhaps awaited, rise of LLMs specifically designed for peer review will be mentioned occasionally.

No, I am not against LLMs!

Let’s be clear: the problem is not the existence of LLMs or the reality of the help they may provide. Many journals now even accept that referees use them to polish the prose of a report, as long as the manuscript itself is not uploaded. That is defensible, and it may help non-native English-speaking scientists, as well as English-speaking colleagues with time or vocabulary constraints, write clear reviews. It may also help standardize review styles and match them to the specific formats that journals would like to adopt for their peer reviews.

On the journal side, LLMs may also help streamline tasks like reviewer selection, synthesis of multiple reviews, fraud detection (at both the review and manuscript levels), identification of gaps or overlaps across reviews, or even the automated detection of AI use itself in reports. A recent AI agent that uses multiple LLMs was successfully designed to help reviewers write more constructive comments by providing automated feedback on unclear comments, content misinterpretation and unprofessional criticisms. The hope is a faster, more efficient, and cheaper process, and the full replacement of human referees is even heralded as almost within reach (see the recent preprint from Google Research).

Here we catch a glimpse of how LLMs may facilitate peer review qualitatively, quantitatively, and ethically, by strengthening principles such as equity.

However, LLMs also create major tensions. These mostly revolve around the confiscation of peers’ autonomy when the use of an LLM is not transparent, the violation of privacy and trust when someone else's unpublished work is pasted into a chatbot, concerns about the reliability of LLM outputs, and the possible collective deskilling of the scientific community. These are the dimensions we will explore in what follows.

Human peer review is already flawed, so why bother with possible LLM issues?

This blog post has barely started, and I can already hear, from behind my laptop screen, the complaints of those who came to this article hoping for a sweeping criticism of the very principle of the peer review system. Whether the current human-based evaluation is imperfect and globally outdated is one of the most important questions in the sociology of science, and a key reform challenge ahead of us.

Yes, human peer review is flawed and overdue for reform: it is slow, inconsistent, shaped by the reviewer's biases, poor at catching fraud, and increasingly mismatched with modern science, with its open-science practices, massive datasets, and interdisciplinary work combining multiple techniques that no single referee can fully assess. I have been a very active peer reviewer for more than 15 years, an author, and a journal editor; I know.

But conceding the above does not justify handing the task to an LLM, partially or wholly, because the problems specific to LLMs are not the problems of human review in disguise. As we will discuss, in addition to problems qualitatively similar to those of human reviewers, specific issues arise. They pertain to the chatbot’s lack of accountability and to the privacy and intellectual property risks tied to the export of unpublished confidential material to a commercial server abroad.

NB: Readers interested in detailed ethics essays that compare LLM and human peer review may consult the early paper from Schintler, McNeely, and Witte (2023). The paper is a little dated, since it was written just after ChatGPT was launched and predates routine undisclosed reviewer use, journal disclosure rules, and the confidentiality problems now central to the debate, but it remains interesting. The authors ask whether it is legitimate to use AI in peer review and conclude that AI should not be compared to an ideal human reviewer who does not exist. When judged against the reality of slow, biased and secretive human reviews, AI would offer real gains but also carries its own specific risks.

An avalanche of ethical bumps.

The prevailing reviewer’s accountability.

Remember that the human referee who was invited by the journal remains fully and personally accountable for everything the general-purpose LLM does. Whether there is a misjudgment about a manuscript, a data leak, a false claim of fraud, or any other mishap, the reviewer (and, through a knock-on effect, the journal too) is liable. That is because the algorithm was used for a purpose for which it was not specifically designed and trained, which exonerates the LLM provider from legal liability.

NB: This legal liability framework might be slowly evolving, with the recent landmark decision of the regional court in Munich (Germany), which ruled that Google was directly accountable for hallucinated content returned by its AI feature that led to misinformation.

Trust and autonomy are shaken by undisclosed AI use.

On March 18, 2026, the International Conference on Machine Learning (ICML) desk-rejected almost 500 papers (about 2% of all submissions) after catching 506 reviewers violating its no-AI policy. The trap was ingenious: organizers embedded invisible prompt-injection instructions in submission PDFs, telling any LLM to slip specific detectable content into the review. Importantly, the reviewers unmasked by those watermark phrases were those who had explicitly opted into the strict "no-AI" policy and then fed papers to a chatbot. Every flagged review was checked by a human before action was taken. Beyond the debate about the procedure, these events highlight the entrenched modern appetite for non-transparent use of LLMs in peer review.

Peer review runs on a chain of trust. An author entrusts an unpublished manuscript to an editor, the editor entrusts it to a referee, and the referee is trusted to read it carefully, judge it fairly, and respect secrecy. A referee who hands part, or all, of the reviewing task to an LLM without disclosure creates a serious trust issue, since the parties involved should have a legitimate right to refuse. Authors did not tacitly consent to having their work evaluated by a commercial model whose training, performance and data-retention policies are far from transparent or certified. Similarly, readers place trust in the journal's stated process, and the journal itself is the guardian of the quality of its contents, which is not guaranteed by an LLM.

Data leakage in LLMs’ interstices

Perhaps you would ask, “Why would any one of these parties object to the use of an LLM in the first place?” Well, unpublished manuscripts and grant proposals are among the most confidential documents in academic life. They contain unprotected ideas and sometimes the seed or fruit of someone's entire career. Uploading such documents to an LLM means exporting them to a server, virtually always abroad, governed by opaque terms that the referee, like almost everyone else, never reads. The intellectual property risk is particularly acute when the LLM user has not opted out of the reuse of uploaded data for model training.

I believe there is a constructive path here for editors and journals to create a governance service. Rather than banning tools that reviewers will use anyway, they could build and make available secure, in-house platforms where manuscripts and data can be uploaded and analysed by models under contract, on servers the journal controls. This keeps the confidential material inside the walls, and resolves the data-export problem.

LLMs are incompletely “informed” reviewers

I am pretty sure you would be upset by a reviewer who only raises superficial issues and asks for additional experiments that are obviously irrelevant to anyone with genuine bench know-how in the domain. Probably the most important LLM limitation is the lack of hidden domain knowledge. A good reviewer knows what was never written down in publications, books or public websites.

This is particularly true in experimental sciences like biology, where conference posters that never became papers, corridor conversations at a meeting, negative results nobody bothered to submit, the model that never worked, the reagent supplier everyone in the field has learned not to trust, etc., are all important elements the peer review principle is built on. I still remember informal conversations I had 20 years ago at conferences about the unreliability of certain antibody vendors, or the protocol flaws behind (never retracted) landmark papers in my field. An LLM cannot know the limitations of techniques and the recurring data patterns that are seen at the bench but are absent from its training data and from the databases it queries.

The lurking threat of deskilling

The normalization of LLM use creates a fear that domain expertise will erode. The concern is legitimate in my opinion, and of course not restricted to peer review, scientific research or even AI (even word processing programs with auto-correct features did not help maintain our language literacy).

When researchers take on the role of peer reviewers, they develop crucial evaluative skills and critical thinking. They also gain valuable insights into assessing their own research. Automated systems replacing human review of a manuscript are a real threat to this skill acquisition and its career-long consolidation. Learning to write is also learning to structure one's thinking, and systematically delegating writing to an LLM chatbot is an idea I do not commend.

In summary, there is space for synergy.

LLM use in academic peer review sits on a fragile ethical line. As discussed, it raises trust and confidentiality concerns, with reviewers at risk of breaching author confidentiality and data protection obligations by uploading unpublished manuscripts to third-party models, and it may accelerate reviewer deskilling. Conversely, the opportunities are real, with LLMs excelling at formatting, language polishing, and synthesizing dense literature or reviewer reports. These tasks would free reviewers and journals from mechanical overhead. Yet full automation would be a mistake. LLMs remain incompletely informed and blind to tacit, unpublished domain knowledge.

In my opinion, the way forward is division of labor instead of full replacement. LLMs can handle methodological consistency checks, plagiarism screening, formatting, and even drafting review reports from human bullet-point remarks. Humans must retain unique authority over novelty, soundness, methodological gaps, and tacit knowledge no model has access to.

A minimum workable ethics guideline of LLMs in peer review can be stated in a few lines:

Do not upload confidential manuscripts to an LLM;
Do not let a model render judgments you have not personally verified and would not personally sign. In other words, use LLMs to work faster and at a larger scale than you could personally manage, not to fully replace your skills;
Disclose, plainly, whatever assistance you use, so that editors and authors can exercise their rights. Ideally, you may also directly contact the editorial board right after being invited for review to discuss your perimeter of actions.

References

The 2024 Frontiers survey https://www.frontiersin.org/news/2025/12/15/most-peer-reviewers-now-use-ai-and-publishing-policy-must-keep-pace

A Critical Examination of the Ethics of AI-Mediated Peer Review, by Laurie A. Schintler and colleagues (2023) https://arxiv.org/abs/2309.12356

The 2026 publication by Thakkar and colleagues about the design of an LLM agent providing feedback in peer review https://www.nature.com/articles/s42256-026-01188-x

Large language models in peer review: challenges and opportunities by Zhuanlan Sun (2025) https://link.springer.com/article/10.1007/s11192-025-05440-w

LLM-Based Scientific Peer Review: Methods, Benchmarks, and Reliability Challenge by Thi Huyen Nguyen and Zahra Ahmadi (2026) https://arxiv.org/abs/2606.25057

When Your Reviewer is an LLM: Biases, Divergence, and Prompt Injection Risks in Peer Review by Zhu and colleagues (2025) https://arxiv.org/abs/2509.09912

ChatGPT and the Future of Journal Reviews: A Feasibility Study by Som Biswas and colleagues (2023) https://pdfs.semanticscholar.org/bf0e/1affdaa1ee4a6a5aaca350dbbe8c52bcfc34.pdf

The 2026 Munich case about Google liability: https://www.technology.org/2026/06/12/german-court-google-ai-overviews-liable/

The ICML conference reviewer trap https://www.nature.com/articles/d41586-026-00893-2

Banner created by Google Gemini Banana Pro. Draft prepared with the assistance of Claude (Anthropic). Arguments, errors, and final wording developed by RDG.
