Generating Non-English Synthetic Medical Data Sets

Lenart Dolinar, Erik Calcina and Erik Novak

Abstract
While using synthetic data sets to train medicine-focused machine
learning models has been shown to improve their performance, most
of the research focuses on English texts. In this paper, we explore
generating non-English synthetic medical texts. We propose a
methodology for generating synthetic medical data and showcase it
by generating Greeklish medical texts relating to hypertension. We
test the approach using seven different language models and evaluate
the quality of the resulting data sets by training a classifier to
distinguish examples from the original data set from examples in
the synthetic data sets. We find that Llama-3 performs best for our
task.
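
The classifier-based evaluation can be illustrated with a minimal sketch. It is not the classifier used in this paper: the bag-of-words Naive Bayes model, the function names, and the 30% held-out split are all assumptions chosen to keep the example self-contained. The idea is that a discriminator is trained to separate original from synthetic texts; held-out accuracy close to 0.5 suggests the synthetic texts are hard to tell apart from the originals, while accuracy close to 1.0 suggests they are easy to spot.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Whitespace tokenization; a real pipeline would likely use a proper tokenizer.
    return text.lower().split()


def train_naive_bayes(texts, labels):
    # Count word occurrences per class (0 = original, 1 = synthetic).
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict(model, text):
    # Pick the class with the highest smoothed log-probability.
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in (0, 1):
        score = math.log((class_counts[label] + 1) / (total + 2))
        denom = sum(word_counts[label].values()) + len(vocab) + 1
        for tok in tokenize(text):
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def discriminator_accuracy(original, synthetic, test_fraction=0.3, seed=0):
    # Label the texts, shuffle, hold out a test split, and report the
    # discriminator's accuracy on the held-out examples.
    data = [(t, 0) for t in original] + [(t, 1) for t in synthetic]
    random.Random(seed).shuffle(data)
    n_test = max(1, int(len(data) * test_fraction))
    test, train = data[:n_test], data[n_test:]
    model = train_naive_bayes([t for t, _ in train], [y for _, y in train])
    correct = sum(predict(model, t) == y for t, y in test)
    return correct / len(test)
```

Calling `discriminator_accuracy(original_texts, synthetic_texts)` with two lists of strings returns a value between 0 and 1. Comparing this value across the data sets produced by different language models gives a rough ranking of how closely each model's output resembles the original data.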
