Surviving the Swiss biomedical AI ethical and regulatory maze (III): Chat-GPT and friends in medicine, a risky business.

Friday, October 3, 2025

Think twice (or many more times) before using generic LLMs in healthcare (8-9 min read).

Generic large language models (gLLMs; we could have called them general-purpose AI like the EU AI Act does, or even generative AI) such as ChatGPT, Claude or Gemini have attracted much attention in medicine (see for example Aster A. 2025* for a recent scoping review). They have been credited with the almost anthropomorphic ability to pass medical certification exams (see Casals-Farre O. 2025 for example) and with a seeming superiority over doctors (see Hoppe JM. 2024 for example). Inevitably, we now contemplate the possibility of reducing the need for human intervention in medicine.

Technology gung-hos would celebrate the imminence of this great replacement and even grumble that the shift has not really started yet. It is also tantalising for doctors to occasionally, if not regularly, consult gLLMs for assistance. In any case, the medical community, especially students, seems relatively trusting of the accuracy of these systems (see Abdelhafiz AS. 2025 for example), creating the breeding ground for professional acceptance.

Not so fast! Let’s flesh out this matter, from ethical and legal angles of course. In particular, the human rights that will concern us most are the right to life and personal freedom and the right to privacy, both guaranteed in the Federal Constitution and the CEDH. The former grants patients the right to dispose of their own body and not to receive treatments against their will; the latter protects their data from any unauthorised access.

*Find abbreviations, links, and references at the end of the post.

The gLLM reliability issue.

Yes, gLLMs have the potential to help doctors with tasks like writing, creating clinical use cases, or getting a crude medical refresher. However, gLLMs have not been specifically designed or trained for medicine (hence the “generic”), putting users at risk of medical errors. The effects of this unadapted training show in their shaky clinical decision-making (see Casals-Farre O. 2025). This insufficient reliability is at odds with fundamental ethical principles such as non-maleficence, justice or fidelity, and its impact extends to issues pertaining to insurance or healthcare costs.

Roughly speaking (look away if you are a specialist), gLLMs are designed to receive textual requests and generate outputs as strings of “tokens” in natural language, using probabilities derived from their training data. Their “transformer” architecture is versatile, enabling them to handle items beyond text, such as images. They are trained on more or less legitimately scraped Internet sources such as social networks (Reddit, Facebook, Instagram, YouTube…), often owned by the same proprietary companies and, by the way, increasingly polluted by AI-generated content (see Ng LHX and Carley KM 2025). They generate outputs, most probable token after most probable token, based on training data that are overwhelmingly not medical. There is no reason to automatically assume the truthfulness of these outputs.
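To make the point concrete, here is a deliberately toy sketch of that generation loop. A hand-made probability table stands in for a trained transformer (real gLLMs work over huge vocabularies and neural networks, and the numbers here are purely hypothetical), but the principle is the same: the output is whatever is statistically most likely, and truthfulness appears nowhere in the objective.

```python
# Toy next-token generation: a hand-made probability table stands in for a
# trained model. Hypothetical numbers, purely illustrative.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 0.9, "this": 0.1},
    "the":     {"ECG": 0.6, "patient": 0.4},
    "ECG":     {"is": 0.7, "shows": 0.3},
    "is":      {"normal": 0.8, "abnormal": 0.2},  # "normal" wins, true or not
}

def generate(token="<start>", max_steps=5):
    """Greedily emit the most probable next token, step after step."""
    output = []
    for _ in range(max_steps):
        candidates = NEXT_TOKEN_PROBS.get(token)
        if not candidates:
            break
        token = max(candidates, key=candidates.get)  # most probable continuation
        output.append(token)
    return " ".join(output)

print(generate())  # -> "the ECG is normal" -- statistically likely, not verified
```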

But wait! Am I not contradicting my opening claim that gLLMs perform very well in medicine? Let’s dwell a little on that.

Are gLLMs really good “doctors”?

Rather than spending too much time on comprehensive analyses of gLLM accuracy, I will share some of my own anecdotal experience. Why? Mostly because you may find very insightful reports elsewhere (for example Goh E. et al. 2025, Arvidsson R. 2024, or Takita H. 2025), but also because I am rather interested in illustrating how easily these gLLMs may mislead and be misled. I will arguably be unfair to gLLMs by not contextualising my prompts with patient history or lab reports, but let’s keep things simple.

Claude will be the first toast of this section. I probed its ability to read the electrocardiogram (ECG) signature of ST elevation myocardial infarction (STEMI, a heart attack). These textbook cases show typical anomalies, highlighted in yellow on the recordings of the V3 and V4 precordial leads (downloaded from a Swiss cardiology directory) and by green arrows added to the chat screenshots.

The results were unequivocal: Claude confidently interpreted both ECGs as "normal," despite classical life-threatening patterns that medical students are trained to recognise (I only show the chat about V3).

Ok, enough for Claude. Let's now turn to ChatGPT. I submitted the same ECG to it, and it successfully spotted the STEMI (I will not display our chat, for the sake of brevity). Good job! I then decided on another test and asked whether it would notice anything special on a normal ECG (from the Swiss cardiology directory).

It successfully found nothing (once again, I will spare you the chat). 

Hurrah!

Playing the role of an insecure doctor, I pretended to suspect a long QT syndrome in this patient despite the previous chat output, and asked for a second analysis. That’s where trust eroded again.

Yikes! ChatGPT seems easily influenced by vague and unfounded impressions. 

Our third guest (victim?) is Gemini, which I challenged with an X-ray of healthy metacarpophalangeal (MCP) joints. It is an X-ray of a friend’s MCPs, dating back to 2012 (for the leery among you, it is a broad-angle shot from a wrist investigation that ended up revealing a ligament issue at the radio-ulnar level).

Gemini unhesitatingly and repeatedly stressed that the MCP joints were strongly evocative of rheumatoid arthritis (RA). Wow! My friend may not have top-notch health at times, but RA is not among their medical issues (screenshot below).

My point here is that gLLMs are not reliable, but perhaps you are thinking I am overreacting with my anecdotes. The problem is that my worries are backed by real-life evidence that some young clinicians already rely on free web gLLMs to analyse images like ECGs during their hospital shifts.

But am I not exaggerating? After all, there are specialised add-on options and adjustable settings.

It is true that gLLMs sometimes provide add-on options presumably relevant to medical diagnostics, such as the X-ray and ECG interpreters available in ChatGPT. Yet, sorry to let you down here: they are a smokescreen and their results are no better. I submitted the normal ECG to ChatGPT’s ECG analyser, and it agreed with me about an elongated QT interval.

Regarding the gLLM settings, yes, they may be manually tuned to agree more or less with the user, and these modifications will change the output. After all, we have to remember that these AI systems are not designed to regurgitate the truth but to produce statistically plausible text, which is influenced by agreement settings. Here, prompt engineering matters. In the above example, the ECG interpretation was strongly driven by the inclusion of “...looks like there is a long QT to me” in the query, and we can imagine a different output if this sentence had been “...looks like the ECG is normal to me”. However, this fickleness is precisely a strong argument that further discredits gLLM clinical reliability.
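If you want to see this fickleness for yourself, below is a minimal sketch (assuming the OpenAI Python client and an API key; the model name is just an example, and no real patient data should ever be sent this way, for the privacy reasons discussed below) that submits the same vignette twice, once neutrally and once with the leading “long QT” suggestion. A QTc of 410 ms is within normal limits, so a reliable reader should say so in both cases; any divergence between the two answers is driven purely by the prompt wording.

```python
# A minimal prompt-sensitivity check, assuming the OpenAI Python client
# (pip install openai) and an OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

CASE = "65-year-old patient, routine ECG, QTc measured at 410 ms."
prompts = {
    "neutral": f"{CASE} Is there anything abnormal?",
    "leading": f"{CASE} Looks like there is a long QT to me. Is there anything abnormal?",
}

for label, prompt in prompts.items():
    # temperature=0 keeps the output as deterministic as the API allows, so any
    # difference between the two answers comes from the prompt wording itself.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name; swap in whichever is available
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    print(label, "->", response.choices[0].message.content)
```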

Ethico-legal issues: walking on an unstable high wire.

Generic LLMs are not medical devices.

I am pretty sure you know what I am up to: none of these gLLMs legally qualifies as a medical device as defined in Swiss law (definitions in ODim Art. 3 and requirements in ODim Art. 6-17). They have not been through the certification processes required by the various Swiss acts (LPTh, ODim, nLPD, OPDo, OCPD), and consequently they have not received the green light from Swissmedic (the Swiss medicines and medical devices safety agency). The chapter of the FMH legal guidelines dedicated to AI assumes in its introduction that AI systems comply with Swiss law (Chapter 6.7). The consequence of this regulatory safeguard, in case anything goes wrong, is the full legal responsibility of healthcare providers who choose to use gLLMs. Lawsuits may end with a custodial sentence of up to three years or a monetary penalty (LPTh Art. 86, al. 1d), rising to ten years if one knew, or had to assume, that human health was endangered (LPTh Art. 86, al. 2a).

Note that one is also legally liable for the acts of one's auxiliaries (CO Art. 101, al. 1). Therefore, it is the legal responsibility of senior clinicians to make sure that the doctors under their supervision do not use gLLMs for care. It might help to foster a culture of in-house seminars, clear guidelines and open discussions about this issue.

Preserving, not eroding, patient trust.

The ethical principle of self-determination ensures patients’ autonomy in deciding whether or not to accept their medical care, and is grounded in patient information and consent, as stipulated in the FMH code of deontology (Art. 10). Failure to inform patients and obtain their consent about the use of gLLMs is likely to undermine their trust. The code makes it particularly explicit that the use of IT systems is permissible provided the patient’s right to information is respected (Art. 7).

As mentioned in a previous blog post, the FMH legal guidelines state that there are two contexts in which patient information and consent about AI are required (Chapter 3.2): when the use of AI presents a significant risk for the patient, and when the information plays a role in the patient’s decisions regarding their treatment. Both conditions are fulfilled when gLLMs are used, since knowledge of their reliability and legality breaches would certainly influence a patient's acceptance of medical decisions.

The data leakage.

Privacy concerns surrounding gLLMs in healthcare are particularly delicate. When healthcare providers upload patient data to gLLM platforms, these data are typically transmitted to external servers, very likely abroad. Furthermore, unless the user has actively opted out, the data may be used by the platform to train its algorithms. Under the nLPD, such export is possible if the legislation of the State concerned guarantees an adequate level of protection (nLPD Art. 16, al. 1) or if the patient has explicitly consented (nLPD Art. 17, al. 1), and if the doctor's record of processing activities includes the details about this State (nLPD Art. 12, al. 1g). A breach of personality rights, even locally, may be justified if the patient gives their consent and the data are anonymised (nLPD Art. 31, al. 1, 2e). The FMH legal guidelines explain that patients should be informed if their data are collected to be processed by an AI system (Chapter 3.2) and dedicate an entire chapter to clinical data management (Chapter 7.2), inspired by the contents of the nLPD. The law provides for a fine of up to 250,000 Swiss francs for unlawfully disclosing data abroad or for violating the professional duty of confidentiality (nLPD Art. 61-62). Here again, even if we set aside reliability and certification, transparency and patient information and consent would be key.
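As an aside on the anonymisation point, here is a naive sketch of the kind of redaction one would, at the very least, run before any text leaves the clinic. The regular expressions and the sample note are purely illustrative; scrubbing obvious identifiers this way is pseudonymisation at best and does not reach the anonymisation the nLPD has in mind, since quasi-identifiers (age, rare diagnoses, locations) can still re-identify a patient.

```python
import re

# Naive redaction of obvious identifiers (names after common titles, dates,
# Swiss AVS/AHV numbers) before any external transmission. Illustrative only.
PATTERNS = {
    r"\b(?:M\.|Mme|Mr|Mrs|Dr)\s+[A-ZÀ-Ý][\w'-]+": "[NAME]",
    r"\b\d{1,2}[./]\d{1,2}[./]\d{2,4}\b": "[DATE]",
    r"\b756\.\d{4}\.\d{4}\.\d{2}\b": "[AVS]",  # Swiss social insurance number format
}

def redact(text: str) -> str:
    for pattern, placeholder in PATTERNS.items():
        text = re.sub(pattern, placeholder, text)
    return text

note = "Mrs Dupont, born 03.04.1957, AVS 756.1234.5678.90, presenting with chest pain."
print(redact(note))
# -> "[NAME], born [DATE], AVS [AVS], presenting with chest pain."
```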

Yet, I doubt that clinicians using ChatGPT will openly disclose it to patients, their superiors, colleagues, or the authorities.

Conclusion.

Currently, the use of gLLMs in medical practice contravenes many dimensions of ethics and law, such as personal freedom, autonomy, or privacy. It is often an example of shadow AI (an unsanctioned or ad hoc use of AI within an organisation, outside IT governance). It is the illegal, unconsented use of unreliable, non-certified, and non-specific automated systems whose potentially leaky servers are usually located abroad. Sounds less sexy than the usual headlines.

Although doctors have no legal obligation of result in terms of recovery, accuracy, or prognosis, they are bound by a legal duty of diligence (Code of deontology, Art. 3). This means that they must, in good faith and with professional conscience, use all the means available to them in their work. Using gLLMs while fully aware that they are far from being certified, evidence-based, state-of-the-art medical technologies contradicts this obligation.

Abbreviations and links

Constitution fédérale (Federal constitution) https://www.fedlex.admin.ch/eli/oc/1999/404/fr (French)

CEDH: European Court of Human Rights (ECHR) https://www.echr.coe.int/

EU AI Act: https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng (English)

nLPD: Nouvelle loi sur la protection des données (New act on data protection) 

https://www.fedlex.admin.ch/eli/cc/2022/491/en (English) / https://www.fedlex.admin.ch/eli/cc/2022/491/fr (French)

OPDo: Ordonnance sur la protection des données (Ordinance of data protection) 

https://www.fedlex.admin.ch/eli/cc/2022/568/en (English) / https://www.fedlex.admin.ch/eli/cc/2022/568/fr (French)

OCPD: Ordonnance sur les certifications en matière de protection des données (Ordinance on certification on data protection) 

https://www.fedlex.admin.ch/eli/cc/2022/569/en (English) / https://www.fedlex.admin.ch/eli/cc/2022/569/fr (French)

LPTh: Loi sur les produits thérapeutiques (Therapeutic Products Act)

https://www.fedlex.admin.ch/eli/cc/2001/422/en (English) / https://www.fedlex.admin.ch/eli/cc/2001/422/fr (French)

ODim: Ordonnance sur les dispositifs médicaux (Ordinance on medical devices) 

https://www.fedlex.admin.ch/eli/cc/2020/552/en (English) / https://www.fedlex.admin.ch/eli/cc/2020/552/fr (French)

CO: Code des obligations suisse (The code of obligations): https://www.fedlex.admin.ch/eli/cc/27/317_321_377/fr (only available in FR)

Swissmedic: https://www.swissmedic.ch/swissmedic/en/home.html 

Bases juridiques pour le quotidien médical : https://leitfaden.samw.fmh.ch/fr/guide-pratique-bases-juridique/tables-des-matieres-guide-jur.cfm# (French)

Code of deontology: https://www.fmh.ch/files/pdf30/standesordnung---fr---2024-04.pdf (French)

Swiss cardiology directory (ECG STEMI): https://www.cardio-fr.com/fr/p/ecgs/149/ 

Swiss cardiology directory (normal ECG): https://www.cardio-fr.com/fr/p/ecgs/138/ 

References:

Aster A. et al. 2025: ChatGPT and Other Large Language Models in Medical Education — Scoping Literature Review. https://bmcmededuc.biomedcentral.com/articles/10.1186/s12909-025-06731-9

Abdelhafiz AS. et al. 2025: Medical students and ChatGPT: analyzing attitudes, practices, and academic perceptions. https://bmcmededuc.biomedcentral.com/articles/10.1186/s12909-025-06731-9 

Hoppe JM. et al. 2024: ChatGPT With GPT-4 Outperforms Emergency Department Physicians in Diagnostic Accuracy: Retrospective Analysis. https://www.jmir.org/2024/1/e56110 

Goh E. et al. 2025: GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. https://www.nature.com/articles/s41591-024-03456-y 

Ng LHX and Carley KM 2025: A global comparison of social media bot and human characteristics. https://www.nature.com/articles/s41598-025-96372-1

Takita H. et al. 2025: A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians. https://www.nature.com/articles/s41746-025-01543-z 

Arvidsson R. et al. 2024: ChatGPT (GPT-4) versus doctors on complex cases of the Swedish family medicine specialist examination: an observational comparative study. https://bmjopen.bmj.com/content/14/12/e086148

Casals-Farre O. et al. 2025: Assessing ChatGPT 4.0’s Capabilities in the United Kingdom Medical Licensing Examination (UKMLA): A Robust Categorical Analysis. https://www.nature.com/articles/s41598-025-97327-2

Banner created with ChatGPT; text and documents 100% AI-free.
