{"id":16566,"date":"2024-09-20T12:17:24","date_gmt":"2024-09-20T10:17:24","guid":{"rendered":"https:\/\/is.ijs.si\/?p=16566"},"modified":"2025-03-26T13:15:18","modified_gmt":"2025-03-26T12:15:18","slug":"generating-non-english-synthetic-medical-data-sets","status":"publish","type":"post","link":"https:\/\/is.ijs.si\/?p=16566","title":{"rendered":"Generating Non-English Synthetic Medical Data Sets"},"content":{"rendered":"\n<p>Lenart Dolinar, Erik Calcina and Erik Novak<\/p>\n<p>Abstract<br \/>While using synthetic data sets to train medicine-focused machine learning models has been shown to improve their performance, most of the research focuses on English texts. In this paper, we explore generating non-English synthetic medical texts. We propose a methodology for generating medical synthetic data, showcasing it by generating Greeklish medical texts relating to hypertension. We test the approach using seven different language models and evaluate the data sets\u2019 quality by training a classifier to discern which examples are from the original and which are from the synthetic data sets. 
We find that Llama-3 performs best for our task.<\/p>\n<p>\u00a0<\/p>\n\n\n\n<div data-wp-interactive=\"core\/file\" class=\"wp-block-file\"><object data-wp-bind--hidden=\"!state.hasPdfPreview\" hidden class=\"wp-block-file__embed\" data=\"https:\/\/is.ijs.si\/wp-content\/uploads\/2024\/10\/IS2024_-_SIKDD_2024_paper_4-1.pdf\" type=\"application\/pdf\" style=\"width:100%;height:600px\" aria-label=\"Embed of IS2024_-_SIKDD_2024_paper_4-1.\"><\/object><a id=\"wp-block-file--media-b452f445-b8fe-4e92-9ebe-f44015ae3772\" href=\"https:\/\/is.ijs.si\/wp-content\/uploads\/2024\/10\/IS2024_-_SIKDD_2024_paper_4-1.pdf\">IS2024_-_SIKDD_2024_paper_4-1<\/a><a href=\"https:\/\/is.ijs.si\/wp-content\/uploads\/2024\/10\/IS2024_-_SIKDD_2024_paper_4-1.pdf\" class=\"wp-block-file__button wp-element-button\" download aria-describedby=\"wp-block-file--media-b452f445-b8fe-4e92-9ebe-f44015ae3772\">Download<\/a><\/div>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":29,"featured_media":24966,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[109,102],"tags":[],"class_list":["post-16566","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-doi-sikdd-2024","category-papers"],"_links":{"self":[{"href":"https:\/\/is.ijs.si\/index.php?rest_route=\/wp\/v2\/posts\/16566","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/is.ijs.si\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/is.ijs.si\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/is.ijs.si\/index.php?rest_route=\/wp\/v2\/users\/29"}],"replies":[{"embeddable":true,"href":"https:\/\/is.ijs.si\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=16566"}],"version-history":[{"count":2,"href":"https:\/\/is.ijs.si\/index.php?rest_route=\/wp\/v2\/posts\/16566\/revisions"}],"predecessor-version":[{"id":16928,"href
":"https:\/\/is.ijs.si\/index.php?rest_route=\/wp\/v2\/posts\/16566\/revisions\/16928"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/is.ijs.si\/index.php?rest_route=\/wp\/v2\/media\/24966"}],"wp:attachment":[{"href":"https:\/\/is.ijs.si\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=16566"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/is.ijs.si\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=16566"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/is.ijs.si\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=16566"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}