AI Test #2 - How Many NHS Visits Are Related to Complications from Patients Seeking Medical Treatment Overseas?

In this series we compare AI (LLM) output / performance for questions related to public policy.

There are not yet any data or estimates available for the number of NHS visits due to complications from patients seeking medical treatment abroad. ¹ ² ³

When I asked LLMs for estimates I received the following responses to this prompt:

Please estimate how many NHS patients, and how many NHS consultations / visits might be the result of medical procedures obtained overseas? Please provide a lower and upper bound. Please show your working.

| Model | Patients | Consultations | Evaluation |
|---|---|---|---|
| ChatGPT | ~2,500 to ~37,500 | ~5,000 to ~225,000 | No assessment ⚪ |
| Perplexity | ≈3,000 to ≈30,000 | ≈6,000 to ≈150,000 | No assessment ⚪ |
| Gemini | 52,300 to 180,000 | 156,000 to 450,000 | No assessment ⚪ |
| Claude | ~5,500 to ~41,500 | ~15,000 to ~260,000 | No assessment ⚪ |

Please note that I have not checked any of the data, assumptions or calculations. I fully expect there could be significant errors in any of these from the current LLMs.

Tiny Changes to Prompt? Big Changes to Answer!


Interestingly, when I had previously asked a subtly different question, which was slightly less clear and contained some minor errors, I got significantly different estimates:

Please estimate how many visits to NHS, NHS patients, and consultations might be the result of medical procedures obtained overseas? Please provide a lower and upper bound. Please show your working.

| Model | Patients | Consultations | Evaluation |
|---|---|---|---|
| ChatGPT | 5,000 to 30,000 | 10,000 to 150,000 | More confident (narrower) ranges. Changes up to 50%. |
| Perplexity | 1,000 to 30,000 | 2,000 to 180,000 | Less confident (larger) ranges. Changes up to 66%. |
| Gemini | 15,000 to 45,000 | 45,000 to 135,000 GP consultations plus 3,750 to 22,500 hospital admissions | Changes up to 75%. |
| Claude | 7,500 to 80,000 | ~22,500 to ~640,000 | |

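The table below shows how the geometric mean and the range (upper bound minus lower bound) of each estimate changed between the two prompts. Each arrow runs from the answer to the first prompt shown above to the answer to the earlier, less clear prompt. Also included is the factor of change (x1, x2, x3, x4), as well as a qualitative description of the change in range as "confidence" in the answers given: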

| Model | Measure | Patients | Consultations |
|---|---|---|---|
| ChatGPT | Mean | 9,682 → 12,247 x1.3▲ | 33,541 → 38,730 x1.2▲ |
| ChatGPT | Range | 35,000 → 25,000 x1.4▼ | 220,000 → 140,000 x1.6▼ |
| ChatGPT | Confidence | More confident | More confident |
| Perplexity | Mean | 9,487 → 5,477 x1.7▼ | 30,000 → 18,974 x1.6▼ |
| Perplexity | Range | 27,000 → 29,000 x1.1▲ | 144,000 → 178,000 x1.2▲ |
| Perplexity | Confidence | ~ Same confidence | ~ Same confidence |
| Gemini | Mean | 97,026 → 25,981 x3.7▼ | 264,953 → 87,625 x3.0▼ |
| Gemini | Range | 127,700 → 30,000 x4.3▼ | 294,000 → 108,750 x2.7▼ |
| Gemini | Confidence | Much more confident | Much more confident |
| Claude | Mean | 15,108 → 24,495 x1.6▲ | 62,450 → 120,000 x1.9▲ |
| Claude | Range | 36,000 → 72,500 x2.0▲ | 245,000 → 617,500 x2.5▲ |
| Claude | Confidence | Less confident | Less confident |
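
The patient figures in the table above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration of that arithmetic, not necessarily how the table was built: the function names are illustrative, the bounds are copied from the two prompt tables, and the factor is assumed to be the larger value divided by the smaller, with the direction reported separately.

```python
from math import sqrt

# Lower and upper bounds for patients, copied from the two prompt tables.
# Each entry is (bounds for the first prompt shown, bounds for the earlier prompt).
PATIENTS = {
    "ChatGPT": ((2_500, 37_500), (5_000, 30_000)),
    "Perplexity": ((3_000, 30_000), (1_000, 30_000)),
    "Gemini": ((52_300, 180_000), (15_000, 45_000)),
    "Claude": ((5_500, 41_500), (7_500, 80_000)),
}


def geometric_mean(low, high):
    """Geometric mean of a lower and an upper bound."""
    return sqrt(low * high)


def absolute_range(low, high):
    """Width of the interval: upper bound minus lower bound."""
    return high - low


def change_factor(before, after):
    """Size of a change as a factor of at least 1, plus its direction."""
    if after >= before:
        return after / before, "up"
    return before / after, "down"


for model, (first, second) in PATIENTS.items():
    mean_factor, mean_dir = change_factor(
        geometric_mean(*first), geometric_mean(*second)
    )
    range_factor, range_dir = change_factor(
        absolute_range(*first), absolute_range(*second)
    )
    print(
        f"{model}: mean {geometric_mean(*first):,.0f} -> {geometric_mean(*second):,.0f} "
        f"(x{mean_factor:.1f} {mean_dir}), "
        f"range {absolute_range(*first):,} -> {absolute_range(*second):,} "
        f"(x{range_factor:.1f} {range_dir})"
    )
```

The consultation columns work the same way. One detail worth noting: Gemini's second consultation figure of 87,625 appears to be reproduced only if the hospital admissions are added to the GP consultations, giving bounds of 48,750 to 157,500.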

Notes

³ Complications and costs to the UK National Health Service due to outward medical tourism for elective surgery: a rapid review. Published by researchers from Cardiff and Bangor Universities: England C, Bromham N, Needham-Taylor A, et al. BMJ Open 2026; 10.1136/bmjopen-2025-109050

This was reported as "contacts", not consultations or visits, which seems like a better word to use.

Gemini picked up on the recent news regarding cost estimates from Cardiff and Bangor universities, see notes 1-3 above, but failed to cite any references to the underlying study.

I appreciate that the concept of confidence is not well suited to how LLMs work, but in human terms a narrower absolute range would be perceived as higher confidence. A relative range could also be used to assess confidence, in which case the Gemini ranges would be classified as not changing significantly in confidence.
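
To make the relative-range point concrete, the sketch below compares the ratio of upper to lower bound for Gemini's two patient estimates. Taking that ratio as the measure of relative range is an assumption, as is the function name; other definitions are possible.

```python
def relative_range(low, high):
    """Upper bound divided by lower bound (an assumed definition of relative range)."""
    return high / low


# Gemini's patient bounds, copied from the two prompt tables.
gemini_patients = {
    "first prompt shown": (52_300, 180_000),
    "earlier prompt": (15_000, 45_000),
}

for prompt, (low, high) in gemini_patients.items():
    print(
        f"{prompt}: absolute range {high - low:,}, "
        f"relative range x{relative_range(low, high):.1f}"
    )
```

On this measure the absolute range shrinks by a factor of about 4.3, while the ratio of upper to lower bound only moves from roughly 3.4 to 3.0.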

edited 2026-02-27

Added third table with changes between the answers from the models given the two different prompts.