AI Test #2 - How Many NHS Visits Are Related to Complications from Patients Seeking Medical Treatment Overseas?

Tiny changes in prompts changed some values by a factor of about 4 (a >300% difference!).

2026-02-23

AI tests

In this series we continue to compare AI (LLM) output and performance on questions related to understanding and addressing societal planning and public policy.

There are not yet any data or estimates available for the number of NHS visits due to complications from patients seeking medical treatment abroad. ¹ ² ³

When I asked LLMs for estimates I received the following responses to this prompt:

Please estimate how many NHS patients, and how many NHS consultations / visits might be the result of medical procedures obtained overseas? Please provide a lower and upper bound. Please show your working.

Model	Patients	Consultations	Evaluation
ChatGPT	~2,500 to ~37,500	~5,000 to ~225,000	No assessment ⚪
Perplexity	≈3,000 to ≈30,000	≈6,000 to ≈150,000⁴	No assessment ⚪
Gemini	52,300 to 180,000	156,000 to 450,000	No assessment ⚪⁵
Claude	~5,500 to ~41,500	~15,000 to ~260,000	No assessment ⚪

Please note that I have not checked any of the data, assumptions or calculations. I fully expect there could be significant errors in any of these from the current LLMs.

Tiny Changes to Prompt? Big Changes to Answer!

Interestingly, when I had previously asked a subtly different question, one that was slightly less clear and contained some minor errors, I got significantly different estimates:

Please estimate how many visits to NHS, NHS patients, and consultations might be the result of medical procedures obtained overseas? Please provide a lower and upper bound. Please show your working.

Model	Patients	Consultations	Evaluation
ChatGPT	5,000 to 30,000	10,000 to 150,000	More confident (narrower) ranges. Changes up to 50%.
Perplexity	1,000 to 30,000	2,000 to 180,000	Less confident (larger) ranges. Changes up to 66%.
Gemini	15,000 to 45,000	45,000 to 135,000 GP consultations plus 3,750 to 22,500 hospital admissions	Changes up to 75%
Claude	7,500 to 80,000	~22,500 to ~640,000

The changes in the geometric mean and the range (upper bound minus lower bound) of each estimate, going from the first table to the second, are shown below. Also included is the factor of change (x1, x2, x3, x4), as well as a qualitative description of the change in range as "confidence" in the answers given⁶:

Model	Patients	Consultations
ChatGPT	mean: 9,682 → 12,247 x1.3▲ range: 35,000 → 25,000 x1.4▼ More confident	mean: 33,541 → 38,730 x1.2▲ range: 220,000 → 140,000 x1.6▼ More confident
Perplexity	mean: 9,487 → 5,477 x1.7▼ range: 27,000 → 29,000 x1.1▲ ~ Same confidence	mean: 30,000 → 18,974 x1.6▼ range: 144,000 → 178,000 x1.2▲ ~ Same confidence
Gemini	mean: 97,026 → 25,981 x3.7▼ range: 127,700 → 30,000 x4.3▼ Much more confident	mean: 264,953 → 87,625 x3.0▼ range: 294,000 → 108,750 x2.7▼ Much more confident
Claude	mean: 15,108 → 24,495 x1.6▲ range: 36,000 → 72,500 x2.0▲ Less confident	mean: 62,450 → 120,000 x1.9▲ range: 245,000 → 617,500 x2.5▲ Less confident
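
The figures in this table can be reproduced with a short script. The following is a minimal sketch, not the method actually used for the table: the function and variable names are illustrative, the bounds are copied from the first two tables, and Gemini's second consultation range is assumed to be the sum of its GP consultations and hospital admissions (which matches the 87,625 mean shown above). It also computes the relative range (upper bound divided by lower bound) mentioned in note 6.

```python
import math

# (lower, upper) bounds copied from the first and second tables above.
# Assumption: Gemini's second consultation range is GP consultations plus
# hospital admissions (45,000 + 3,750 to 135,000 + 22,500).
ESTIMATES = {
    "ChatGPT": {
        "patients": ((2_500, 37_500), (5_000, 30_000)),
        "consultations": ((5_000, 225_000), (10_000, 150_000)),
    },
    "Perplexity": {
        "patients": ((3_000, 30_000), (1_000, 30_000)),
        "consultations": ((6_000, 150_000), (2_000, 180_000)),
    },
    "Gemini": {
        "patients": ((52_300, 180_000), (15_000, 45_000)),
        "consultations": ((156_000, 450_000), (48_750, 157_500)),
    },
    "Claude": {
        "patients": ((5_500, 41_500), (7_500, 80_000)),
        "consultations": ((15_000, 260_000), (22_500, 640_000)),
    },
}


def geometric_mean(lower, upper):
    """Geometric mean of the two bounds."""
    return math.sqrt(lower * upper)


def absolute_range(lower, upper):
    """Width of the interval, as used in the table above."""
    return upper - lower


def relative_range(lower, upper):
    """Upper bound as a multiple of the lower bound (see note 6)."""
    return upper / lower


def change_factor(before, after):
    """Return (factor >= 1, direction) for a change from before to after."""
    if after >= before:
        return after / before, "up"
    return before / after, "down"


def compare(first, second):
    """Compare two (lower, upper) estimates on all three measures."""
    return {
        "mean": change_factor(geometric_mean(*first), geometric_mean(*second)),
        "range": change_factor(absolute_range(*first), absolute_range(*second)),
        "ratio": change_factor(relative_range(*first), relative_range(*second)),
    }


for model, measures in ESTIMATES.items():
    for measure, (first, second) in measures.items():
        result = compare(first, second)
        summary = ", ".join(
            f"{name} x{factor:.1f} {direction}"
            for name, (factor, direction) in result.items()
        )
        print(f"{model} {measure}: {summary}")
```

For Gemini's patient estimate, the absolute range shrinks by a factor of about 4.3, while the upper/lower ratio only moves from about 3.4 to 3.0, which is the contrast described in note 6.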

Notes

¹ NHS faces high costs from patients seeking elective surgery abroad; Cardiff University news; 4 February 2026

² Research finds complications of medical tourism may cost NHS up to £20,000 per patient; Bangor University; 10 February 2026

³ Complications and costs to the UK National Health Service due to outward medical tourism for elective surgery: a rapid review. Published by researchers from Cardiff and Bangor Universities: England C, Bromham N, Needham-Taylor A, et al. BMJ Open 2026; 10.1136/bmjopen-2025-109050

⁴ This was reported as "contacts", not consultations or visits, which seems like a better word to use.

⁵ Gemini picked up on the recent news regarding cost estimates from Cardiff and Bangor universities, see notes 1-3 above, but failed to cite any references to the underlying study.

⁶ I appreciate that the concept of confidence is not well suited to the function of LLMs, but in human terms a narrower absolute range would be perceived as higher confidence. It is also worth noting that a relative range could be used to assess confidence instead, in which case the Gemini ranges would be classified as not changing significantly in confidence.

⁷ edited 2026-02-27

Added third table with changes between the answers from the models given the two different prompts.