Coming from a neuroscience background, I always found sample size a tricky but essential part of experimental design. At least in neuroscience, we were aware of how important it was and how getting it wrong could undermine the replicability and generalizability of results.
In AI for healthcare, however, there is often an assumption that complex models can somehow handle limited data. But a recent Lancet Digital Health viewpoint by Riley and colleagues shows that this just isn’t true and explains why small sample sizes in clinical AI are dangerous.
Here is why small datasets are still a big problem in clinical AI:
Even if your data is randomly sampled from the right population, small samples are rarely representative.
When training samples are small, the model becomes unstable: a different sample from the same population can lead to different predictors and different prediction behavior.
That instability creates a lot of uncertainty when making predictions for individual patients.
Predictions often end up poorly calibrated, meaning the risks the model estimates don’t match what actually happens.
When models are trained on too little data, their performance drops, and that can directly harm clinical decision-making.
Evaluating a model’s performance also needs enough data to estimate things like calibration and clinical utility with confidence.
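To make the instability point concrete, here is a minimal simulation sketch. It is not from the viewpoint: the "model" is a toy that predicts risk as the observed event rate among patients with a single risk factor, and the true risks (10% and 30%), the sample sizes and the function names are all assumptions chosen for illustration.

```python
import random
import statistics

# Assumed true event risks without (0) and with (1) a binary risk factor.
TRUE_RISK = {0: 0.10, 1: 0.30}


def simulate_patients(n, rng):
    """Draw n synthetic patients as (risk_factor, outcome) pairs."""
    patients = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        y = 1 if rng.random() < TRUE_RISK[x] else 0
        patients.append((x, y))
    return patients


def fit_predicted_risk(patients, x):
    """Toy 'model': predicted risk is the observed event rate for factor value x."""
    outcomes = [y for (xi, y) in patients if xi == x]
    if not outcomes:
        return None  # no patients with this factor value in the sample
    return sum(outcomes) / len(outcomes)


def prediction_spread(n, repeats=200, seed=0):
    """Refit on many independent training samples of size n and return the
    standard deviation of the predicted risk for the same patient (x = 1)."""
    rng = random.Random(seed)
    preds = []
    for _ in range(repeats):
        p = fit_predicted_risk(simulate_patients(n, rng), x=1)
        if p is not None:
            preds.append(p)
    return statistics.pstdev(preds)


spread_small = prediction_spread(50)
spread_large = prediction_spread(5000)
print(f"SD of predicted risk, n=50:   {spread_small:.3f}")
print(f"SD of predicted risk, n=5000: {spread_large:.3f}")
```

The same patient gets a much wider range of predicted risks when the training sample is small, even though every sample comes from the same population. Real clinical models with many predictors would typically be more unstable than this one-variable toy, not less.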
What can you do?
If you’re developing an AI model for healthcare, make sure your dataset is both large enough and representative. The authors mention tools like the pmvalsampsize package in R or Stata to help you calculate the sample size you need to get reliable performance estimates.
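To give a feel for the kind of calculation involved, here is a rough Python sketch for one evaluation criterion only: estimating the observed/expected (O/E) ratio precisely. It does not reproduce the pmvalsampsize package or its interface. It assumes a binary outcome, a well-calibrated model (O/E close to 1) and the delta-method approximation SE(ln(O/E)) ≈ sqrt((1 - p) / (n * p)), where p is the outcome prevalence. The function name and default target interval are my own.

```python
import math


def n_for_oe_precision(prevalence, ci_lower=0.9, ci_upper=1.1, z=1.96):
    """Approximate evaluation sample size so that the 95% CI for O/E is about
    (ci_lower, ci_upper), assuming the true O/E is 1.

    Uses SE(ln(O/E)) ~ sqrt((1 - p) / (n * p)), solved for n.
    """
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must be between 0 and 1")
    # Target standard error on the log scale from the desired CI width.
    se_target = (math.log(ci_upper) - math.log(ci_lower)) / (2 * z)
    return math.ceil((1 - prevalence) / (prevalence * se_target ** 2))


for p in (0.05, 0.10, 0.30):
    print(f"prevalence {p:.2f}: about {n_for_oe_precision(p)} patients")
```

Rarer outcomes need far more patients for the same precision. A full calculation would also consider other measures, such as the calibration slope, discrimination and clinical utility, and take the largest of the resulting sample sizes, which is what a dedicated tool is for.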
Can you rely on existing datasets?
Developers of clinical AI usually rely on published datasets. These can be helpful and usually have enough samples. But big datasets aren’t always high-quality, and if you’re working with fewer data points than ideal, be transparent about that. Communicate the limitations clearly when you describe how your model was developed.
Thanks for reading.
If you're interested in how AI literacy fits into healthcare regulation, especially under the EU AI Act, check out my recent talk at Doctor to Doctor Talks.