Even before the COVID-19 crisis, health systems, medical researchers, and medical institutions grappled with efficient ways of gathering patient data while maintaining patient privacy.
When researching for health innovation or crisis management, healthcare institutions must extract data from a multitude of systems. Answering questions about trends in chronic conditions, the viability of a treatment in a community, the utilization rates of certain procedures, or the rising costs of health care—all of these scenarios require collecting, analyzing, and sharing patient and population data.
Unfortunately, that process is fraught with the risk of data breaches, the need to navigate industry privacy regulations, dependence on healthcare IT specialists, and the loss of precious time. On top of that, compiling and researching patient data means working through massive troves of data that may sit in siloed systems or be frustratingly dispersed across differing archives.
Usage of patient data in clinical research
Most of the time, medical researchers must submit data requests to even access individual and population patient data. It takes time to request and receive data pulls, and even more time and skill to read and manipulate any received data. It also requires incredibly specific queries from the medical professional, researcher, or institution, and those queries may need supplemental queries for clarification. The cherry on top? All patient information must be redacted due to its sensitive nature. Compromising patient security and confidentiality by failing to remove all identifying attributes goes directly against healthcare compliance rules such as the Health Insurance Portability and Accountability Act (HIPAA), the Health Information Technology for Economic and Clinical Health Act (HITECH), and the General Data Protection Regulation (GDPR).
Electronic health records (EHR) are now digitized, but the progress that has improved the storage of and access to a patient’s health records didn’t necessarily translate to a convergence of those records. The transition of legacy health care systems into more nimble, cloud-based systems didn’t immediately erase clunky workflows in clinical communication and collaboration. More than likely, health systems must now contend with duplicate data that must be cleaned and access controls that must be determined on a case-by-case, title-by-title basis.
All of this illustrates that there’s a reason why advancements in health care solutions, digital health, and patient satisfaction haven’t necessarily resulted in the complete and efficient transformation of the healthcare industry. This is a global problem. The U.S. healthcare system is notorious for being inefficient, but the worldwide COVID-19 pandemic has made it clear that there are global issues of data sharing, resource pooling, and research opportunities.
How do we fix this? How can we truly understand and learn from gaps in care and medical research so that we can protect everyone on the planet and possibly prevent another pandemic like COVID-19?
Synthetic data offers a compelling solution.
What is Synthetic Data?
Synthetic data is artificial data that is rooted in actual data but generated by mathematical, machine learning, or artificial intelligence algorithms. It enables researchers in any field to test out scenarios, train models and manipulate data according to specific conditions, and ensure data privacy. The range of possible applications is wide.
Synthetic data in health care
AI Multiple’s guide to synthetic data describes the usefulness of synthetic data in cases where paramount privacy requirements limit data availability, the costs of real-life product testing restrict endeavors, or machine learning models need training data quickly. Synthetic data produces statistically comparable datasets in a quicker, safer setting, allowing companies, institutions, and organizations to become more nimble, innovative, and effective.
Its application in the healthcare industry posits intriguing potential. Regardless of all the information that is entered into and accessed by medical professionals, all patient information is sensitive and requires protection and de-identification before it can be used for any research purposes. The healthcare application of synthetic data allows medical researchers to create and consult those statistically comparable datasets on fictional patients.
To be clear, these datasets are not wild shots in the dark. “Fictional patients” means unattributable patient data: data stripped of all patient and demographic identifiers. The University of Copenhagen nicely sums up the attributes of these fictional patients:
In a nutshell, synthetic health data adds to the scope of existing or “real” data, circumventing the issue of too little data availability.
Protecting patient identity is paramount. However, that stringent protection causes breakdowns in clinical data and clinical research workflows. For example, when a clinical care coordinator contacts hospital administrators for patient documentation, they must fax in forms, follow up with administrators over the phone, and manually input data. This is the procedure for every single patient. Clinical care coordinators must also take care to not request information too early because shared documents have a short lifespan. That is just one scenario that is already rife with bottlenecks.
Now apply that bumpy workflow to clinical researchers or pharmaceutical drug developers, who are trying to make predictions, identify trends, and determine population health initiatives on a larger scale. Sure, larger health systems may have larger databases (or data lakes) to hold all of their patients’ information, but these databases are not structured in a one-to-one way. A patient’s medical record can exist separately from their records of procedures, referrals, and ancillary care history. A patient’s medical data can even exist separately between different entities of the same company. Effectively, this results in data scarcity.
As the youths would say, de-identification walked so that synthetic health data could run. De-identification of patient data is the censoring or removal of identifiable patient attributes for the purposes of population health research. The difference between de-identification and synthetic health data is that the latter is completely removed from patient information. Synthetic data contains zero personal data. In addition, intelligent patient data generators (iPDGs) and EHR generators can be utilized to generate synthetic patient records regardless of the amount of bulk patient data stored in a hospital’s admin system.
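The contrast between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not how any particular iPDG or EHR generator works: the field names, the `DIRECT_IDENTIFIERS` set, and the made-up records are all assumptions. De-identification keeps one row per real patient and strips identifiers; the synthetic step learns only value frequencies and draws brand-new rows from them.

```python
import random
from collections import Counter

# Hypothetical identifier fields; real EHR schemas and regulations list many more.
DIRECT_IDENTIFIERS = {"name", "ssn", "address"}

def deidentify(record):
    """Return a copy of a real record with direct identifiers removed."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

def fit_marginals(records, fields):
    """Count how often each value occurs, per field, across the real records."""
    return {f: Counter(r[f] for r in records) for f in fields}

def synthesize(marginals, n, seed=0):
    """Draw n fictional records by sampling each field from its observed frequencies."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        rec = {}
        for field, counts in marginals.items():
            values = list(counts.keys())
            weights = list(counts.values())
            rec[field] = rng.choices(values, weights=weights, k=1)[0]
        out.append(rec)
    return out

# Made-up example records, not real patient data.
real = [
    {"name": "A. Example", "ssn": "000-00-0001", "address": "1 Main St",
     "age_band": "40-49", "diagnosis": "diabetes"},
    {"name": "B. Example", "ssn": "000-00-0002", "address": "2 Main St",
     "age_band": "60-69", "diagnosis": "hypertension"},
    {"name": "C. Example", "ssn": "000-00-0003", "address": "3 Main St",
     "age_band": "40-49", "diagnosis": "asthma"},
    {"name": "D. Example", "ssn": "000-00-0004", "address": "4 Main St",
     "age_band": "60-69", "diagnosis": "diabetes"},
]

deidentified = [deidentify(r) for r in real]   # still one row per real patient
marginals = fit_marginals(real, ["age_band", "diagnosis"])
synthetic = synthesize(marginals, n=5)         # rows belong to no real patient
```

Sampling each field independently, as done here, discards the relationships between fields, and a drawn row can still coincide with a real one in a dataset this small. Production generators model those relationships and test for re-identification risk, which this sketch does not attempt.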
There’s also the amazingly acronymized FHIR. Fast Healthcare Interoperability Resources, more commonly referred to as FHIR, is an HL7 standard that helped pave the way in terms of data collection and sharing. Rather than a storage system, FHIR gives the healthcare industry a common format and API for representing and exchanging health data electronically, which improves health information exchange (HIE) and data interoperability. FHIR significantly improves clinical communication and collaboration by enabling the tagging and organizing of clinical data within a healthcare organization’s data system.
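To make the idea of tagged, organized clinical data concrete, here is a minimal sketch of a fictional patient expressed as a FHIR-style Patient resource. The helper name `synthetic_patient_resource` and the sample values are assumptions for illustration. Only a handful of Patient fields are shown, and the result is not validated against the full FHIR specification.

```python
import json

# Administrative gender codes defined for the FHIR Patient resource.
ALLOWED_GENDERS = {"male", "female", "other", "unknown"}

def synthetic_patient_resource(patient_id, family, given, gender, birth_date):
    """Build a minimal FHIR-style Patient resource as a plain dict.

    Hypothetical helper: it covers only a few Patient fields and performs
    no validation beyond the gender code.
    """
    if gender not in ALLOWED_GENDERS:
        raise ValueError(f"unsupported gender code: {gender}")
    return {
        "resourceType": "Patient",
        "id": patient_id,
        "name": [{"family": family, "given": [given]}],
        "gender": gender,
        "birthDate": birth_date,  # ISO 8601 date, e.g. "1980-01-01"
    }

# A fictional patient; none of these values describe a real person.
patient = synthetic_patient_resource(
    "synthetic-001", "Example", "Pat", "female", "1980-01-01"
)
patient_json = json.dumps(patient, indent=2)
print(patient_json)
```

Because every record carries its `resourceType` and uses the same field names, a receiving system can parse a synthetic patient the same way it parses a real one.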
Robert Lieberthal, health economics principal at The MITRE Corporation, believes that “Synthetic data is a solution to many of the problems that plague our health IT system…In a way, synthetic data represents current health IT standards while also incorporating the best of what health IT could be.”
Once a synthetic data solution is integrated within the databases of a healthcare organization, it ingests all data points, automating data de-duplication and cleaning, capturing statistical insights and relationships between data points, and facilitating data sharing, delivery, and modeling.
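The pipeline described above (de-duplicate, capture statistical relationships, then generate) can be sketched with the Python standard library. This is a toy under stated assumptions, not any vendor's method: it handles just two numeric columns (a hypothetical age and systolic blood pressure), uses made-up rows, and assumes the data is well described by a bivariate normal distribution.

```python
import math
import random
import statistics

def deduplicate(rows):
    """Drop exact duplicate rows while keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def fit(rows):
    """Capture means, standard deviations, and the Pearson correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def sample(model, n, seed=0):
    """Draw n synthetic rows from a bivariate normal with the fitted parameters."""
    rng = random.Random(seed)
    rho = model["rho"]
    # Guard against tiny negative values caused by floating-point rounding.
    ortho = math.sqrt(max(0.0, 1.0 - rho * rho))
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + ortho * z2)
        out.append((x, y))
    return out

# Made-up (age, systolic blood pressure) rows with two exact duplicates.
real_rows = [
    (34, 118.0), (51, 131.0), (51, 131.0), (67, 142.0),
    (45, 125.0), (72, 150.0), (29, 115.0), (34, 118.0),
]

unique = deduplicate(real_rows)
model = fit(unique)
synthetic = sample(model, n=2000)

# The synthetic rows should show roughly the same age/pressure correlation.
print(round(model["rho"], 2), round(fit(synthetic)["rho"], 2))
```

The final line compares the correlation in the cleaned source rows with the correlation in the generated rows, which is one simple way to check that a synthetic dataset is "statistically comparable" to its source.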
Again, because synthetic data does not contain protected health information, the generated artificial data can be shared between medical and clinical researchers and scientists. They are no longer constrained to utilizing redacted patient information that may or may not adhere to healthcare compliance guidelines when developing new health strategies, payment initiatives, health policies, and digital health products.
Concerns of utilizing synthetic data
While the benefits of generating and applying synthetic data to health care are clear, the approach is still in the early stages of adoption and implementation. Detractors of synthetic data do exist, and for good reason, as with any solution that relies on machine learning and automation to hone and polish its output.
There are limitations to synthetic data in a healthcare setting, and all stakeholders who want to leverage synthetic data must be aware of them.
Players in synthesized health care data
Synthetic data, and particularly synthetic health data, is a relatively new forum in which research is conducted. Correspondingly, the following list of synthetic health data players is short but will grow as this healthcare technology becomes more widely accepted and improved upon.
MDClone
MDClone is an Israel-based health IT vendor with the mission of easing access to health data and improving overall methods of health research and activity. MDClone’s platform intends to democratize data across the healthcare ecosystem by enabling the broad use of data that resides inside health systems.
Synthea
Synthea is an open-source, synthetic patient data generator that can be used to create models of medical history of synthetic patients. Synthea’s free data lake enables health data research while adhering to privacy and security restrictions, regardless of the healthcare industry.
Statice
Statice has developed privacy-compliant data anonymization solutions that can be used by businesses and organizations across all industries. Statice enables healthcare institutions to work faster, safer, and in compliance, while furthering research, development, and delivery of patient care.
MHMD
Consulting firm Lynkeus led the European Union-funded MyHealthMyData (MHMD) project, which aimed to prove the validity and usefulness of making anonymized (read: synthetic) data available for open research, and succeeded in doing so.
Simulacrum
IQVIA, the Human Data Science Company, collaborated with biopharma research company AstraZeneca to develop the synthetic database Simulacrum. Simulacrum consists solely of artificial (read: synthetic) data and is used to conduct research and perform analyses on population cancer care.
Way forward
The potential impact of creating and utilizing synthetic data to improve clinical research and health strategies is huge. As with most things, it takes time for an industry to reap the benefits from a new kind of technology or process before everyone gets on board. However, during a worldwide health crisis, we’re short on time and resources. Both the regional and global medical communities must take cues from the current leaders in synthetic health data to transform how they share and protect patient data, encourage clinical collaboration, and instigate necessary change in their approach to creating and improving health plans, strategies, and initiatives.

Jasmine Lee
Jasmine is a former Senior Market Research Analyst at G2. Prior to G2, she worked in the nonprofit sector and contributed to a handful of online entertainment and pop culture publications.