AI Test #2 - How Many NHS Visits Are Related to Complications from Patients Seeking Medical Treatment Overseas?
Tiny changes in prompts produced >300% (4x!) change in values.
No data or estimates are yet available for the number of NHS visits caused by complications from patients seeking medical treatment abroad. ¹ ² ³
When I asked LLMs for estimates I received the following responses to this prompt:
Please estimate how many NHS patients, and how many NHS consultations / visits might be the result of medical procedures obtained overseas? Please provide a lower and upper bound. Please show your working.
| Model | Patients | Consultations | Evaluation |
|---|---|---|---|
| ChatGPT | ~2,500 to ~37,500 | ~5,000 to ~225,000 | No assessment ⚪ |
| Perplexity | ≈3,000 to ≈30,000 | ≈6,000 to ≈150,000⁴ | No assessment ⚪ |
| Gemini | 52,300 to 180,000 | 156,000 to 450,000 | No assessment ⚪⁵ |
| Claude | ~5,500 to ~41,500 | ~15,000 to ~260,000 | No assessment ⚪ |
Please note that I have not checked any of the data, assumptions or calculations. I fully expect there could be significant errors in any of these outputs from the current LLMs.
Tiny Changes to Prompt? Big Changes to Answer!
Interestingly, when I had previously asked a subtly different question, one that was slightly less clear and contained some minor errors, I got significantly different estimates:
Please estimate how many visits to NHS, NHS patients, and consultations might be the result of medical procedures obtained overseas? Please provide a lower and upper bound. Please show your working.
| Model | Patients | Consultations | Evaluation |
|---|---|---|---|
| ChatGPT | 5,000 to 30,000 | 10,000 to 150,000 | More confident (narrower) ranges. Changes up to 50%. |
| Perplexity | 1,000 to 30,000 | 2,000 to 180,000 | Less confident (larger) ranges. Changes up to 66%. |
| Gemini | 15,000 to 45,000 | 45,000 to 135,000 GP consultations plus 3,750 to 22,500 hospital admissions | Changes up to 75% |
| Claude | 7,500 to 80,000 | ~22,500 to ~640,000 | |
The changes in the geometric mean and the range of each estimate are shown below, with each arrow running from the first prompt above to the earlier, less clear prompt. Also included is the factor of change (x1, x2, x3, x4), as well as a qualitative description of the change in range as "confidence" in the answers given⁶:
| Model | Patients | Consultations |
|---|---|---|
| ChatGPT | mean: 9,682 → 12,247 x1.3▲; range: 35,000 → 25,000 x1.4▼; More confident | mean: 33,541 → 38,730 x1.2▲; range: 220,000 → 140,000 x1.6▼; More confident |
| Perplexity | mean: 9,487 → 5,477 x1.7▼; range: 27,000 → 29,000 x1.1▲; ~ Same confidence | mean: 30,000 → 18,974 x1.6▼; range: 144,000 → 178,000 x1.2▲; ~ Same confidence |
| Gemini | mean: 97,026 → 25,981 x3.7▼; range: 127,700 → 30,000 x4.3▼; Much more confident | mean: 264,953 → 87,625 x3.0▼; range: 294,000 → 108,750 x2.7▼; Much more confident |
| Claude | mean: 15,108 → 24,495 x1.6▲; range: 36,000 → 72,500 x2.0▲; Less confident | mean: 62,450 → 120,000 x1.9▲; range: 245,000 → 617,500 x2.5▲; Less confident |
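For anyone who wants to check the arithmetic, here is a minimal sketch of how the figures in this comparison can be reproduced from a lower and upper bound. The function names `summarize` and `change_factor` are my own illustrative choices, and the upper/lower ratio is included only as one possible way to express the relative range mentioned in note 6.

```python
from math import sqrt


def summarize(lower, upper):
    """Summarise a lower/upper estimate pair."""
    return {
        "mean": sqrt(lower * upper),  # geometric mean of the two bounds
        "range": upper - lower,       # absolute width of the range
        "ratio": upper / lower,       # relative width (assumed measure, see note 6)
    }


def change_factor(before, after):
    """Return the factor of change as a number >= 1, plus a direction arrow."""
    if after >= before:
        return after / before, "▲"
    return before / after, "▼"


# Gemini's patient bounds from the two prompt tables
first = summarize(52_300, 180_000)
second = summarize(15_000, 45_000)

for key in ("mean", "range", "ratio"):
    factor, arrow = change_factor(first[key], second[key])
    print(f"{key}: {first[key]:,.1f} -> {second[key]:,.1f} x{factor:.1f}{arrow}")
```

For Gemini's patient bounds this gives factors of roughly x3.7 for the geometric mean and x4.3 for the absolute range, while the upper/lower ratio only moves from about 3.4 to 3.0.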
Notes
³ Complications and costs to the UK National Health Service due to outward medical tourism for elective surgery: a rapid review. Published by researchers from Cardiff and Bangor Universities: England C, Bromham N, Needham-Taylor A, et al. BMJ Open 2026; 10.1136/bmjopen-2025-109050
⁴ This was reported as "contacts", not consultations or visits, which seems like a better word to use.
⁵ Gemini picked up on the recent news regarding cost estimates from Cardiff and Bangor universities, see notes 1-3 above, but failed to cite any references to the underlying study.
⁶ I appreciate that the concept of confidence is not well suited to the function of LLMs, but in human terms a narrower absolute range would be perceived as higher confidence. It is also worth noting that a relative range could be used to assess confidence instead, in which case the Gemini ranges would be classified as not changing significantly in confidence.
⁷ edited 2026-02-27
Added third table with changes between the answers from the models given the two different prompts.