The main tasks in facilitating, or even enabling, the reuse of medical RWD in a research context are to promote interoperability, harmonization, and data quality, to ensure privacy, to optimize the retrieval and management of patient consent, and to establish rules for data use and access12,13. These measures aim to address the various challenges of scientifically reusing routine clinical data described below.
Challenges in balancing benefits and harms
Personal, i.e. non-anonymized, medical data is inherently sensitive1,17,22. As a result, uncertainties in MDS project preparation and execution arise for all roles involved in performing research on medical RWD, i.e. for patients, researchers and governing entities. Patients may lack trust in research using their personal data. Concerns about data misuse, about becoming completely transparent, and about data leakage, especially in the case of long-term storage, can result in patients overprotecting their own data and not giving their consent for its reuse in research23,24,25. On the other hand, it has also been shown that most EU citizens support secondary use of medical data if it serves the common good24. So, convincing patients of the social benefit of MDS can decrease their ambivalence and avoid overprotection. This can be achieved, for example, by reporting on MDS success stories13. A second important aspect is patient empowerment: informing patients about the processing and use of their data through open scientific communication and enabling their active engagement in the form of dynamic consent management12,23.
However, there are also concerns on the part of researchers, resulting e.g. from a lack of explicit training in a complex landscape of ethical and legal requirements. These could be mitigated by discussions in interdisciplinary team meetings, but differences in the daily work routine make it difficult to arrange such meetings8,9,18,21. As a consequence of unresolved concerns, researchers could delay or even cancel their MDS projects. Moreover, even governing entities such as data protection officers and ethics committees exhibit a certain level of uncertainty regarding permissible practices in MDS. They tend to overprotect the rights of the patients whose medical data is to be used while underestimating the necessity of reusing medical RWD for research purposes9,23,26,27. This leads to restrictive policies that hinder scientific progress.
In general, education is a promising approach to address the uncertainties mentioned above. Technical training for medical researchers and governing entities as well as ethical and legal training for technical experts can increase confidence in project-related decision making1,18,23,24,27,28. The same effect can be achieved by developing MDS guidelines and actionable data protection concepts (DPC)13,14,15,16. A good example is the DPC of the MI-I that was developed in collaboration with the German working group of medical ethics committees (AK-EK)12. Figure 1 summarizes the sources and consequences of the aforementioned uncertainties that lead to significant challenges in the reuse of medical RWD. Each source of uncertainty is associated with the roles it affects and possible measures to mitigate its impact. The challenges posed by these uncertainties are discussed in more detail below.
Uncertainties due to the legal framework
As mentioned above, the complex legal landscape resulting from various intervening laws contributes significantly to the uncertainty surrounding the reuse of medical RWD. At the European level, the General Data Protection Regulation (GDPR) holds substantial influence over the legal framework. In general, it prohibits the processing of health-related personal data (GDPR Art. 9(1)) unless every affected person has given informed consent (GDPR Art. 9(2)(a)) or a scientific exemption applies (GDPR Art. 9(2)(j)). The latter is the case if the processing is in the public interest, secured by data protection measures, and adequately justified by a sufficient scientific goal. However, substantiating the presence of such a scientific exemption poses significant challenges29,30. Similarly difficult, or even more so, is obtaining informed consent from patients after they have left the clinic. As such, both GDPR-based possibilities to justify the secondary use of RWD in research are difficult to implement in practice26,29. If the processing is legally based on the scientific exemption, GDPR Art. 89 further mandates the implementation of appropriate privacy safeguards supported by technical and organizational measures. Additionally, it stipulates that only the data necessary for the project should be utilized (principle of data minimization)30,31. This ensures the protection of sensitive personal data, but also introduces further challenges for the researchers.
The situation is further complicated because the GDPR allows for various interpretations by the data protection laws of EU member states30,31. Moreover, there are country-specific regulations, such as profession-specific laws, that impact the legal framework of MDS31. This complex scenario poses particular challenges for international MDS projects29. As a result, identifying the correct legal basis and implementing appropriate data protection measures becomes exceptionally difficult29,30. This task, which is crucial when preparing the compilation of a clinical data set, necessitates not only technical and medical expertise but also a comprehensive understanding of legal aspects. Thus, a well-functioning interdisciplinary team or researchers with broad training are essential.
Analyses of the current legal framework for data-driven medical research suggest that this framework is remote from practice and thus inhibits scientific progress31,32. To address these limitations, certain legal amendments or substantial infrastructure enhancements are necessary. Particularly, the infrastructure should focus on incorporating components and tools that facilitate semi-automated data integration and data anonymization. Although the current legal framework permits physicians to access, integrate, and anonymize data from their own patients, they often lack the technical expertise and time to effectively carry out these tasks. By implementing an infrastructure that enables semi-automated data integration and anonymization, researchers would be able to legally utilize valuable medical RWD without imposing additional workload on physicians29,30. Attaining a fully automated solution is not feasible since effective data integration and anonymization, leading to meaningful data sets, necessitate manual parameter selection by a domain expert. Nonetheless, by prioritizing maximal automation and specifically assigning domain experts to handle the manual steps in the process, rapid and compliant access to medical RWD, along with reduced uncertainties for researchers, can be achieved.
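The division of labor described above, in which a domain expert selects parameters manually and the remaining steps run automatically, can be illustrated with a minimal sketch. It assumes k-anonymity with age generalization as the anonymization technique, which is chosen here purely for illustration and is not prescribed by the cited works. The field names, the function names and the example records are hypothetical, and passing such a check does not by itself establish anonymity in the legal sense.

```python
from collections import Counter

def generalize_age(age, bin_width):
    """Replace an exact age with a coarse interval such as '40-49'."""
    low = (age // bin_width) * bin_width
    return f"{low}-{low + bin_width - 1}"

def anonymize(records, quasi_identifiers, bin_width, k):
    """Automated step: drop the direct identifier, generalize ages and
    suppress records in groups smaller than k.

    quasi_identifiers, bin_width and k are the manual parameters that a
    domain expert would choose for the data set at hand (assumption).
    """
    generalized = []
    for rec in records:
        # "name" is the only direct identifier in this hypothetical schema.
        out = {key: value for key, value in rec.items() if key != "name"}
        out["age"] = generalize_age(rec["age"], bin_width)
        generalized.append(out)
    # Count how many records share each combination of quasi-identifiers.
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in generalized)
    # Keep only records whose group has at least k members (k-anonymity).
    return [r for r in generalized
            if groups[tuple(r[q] for q in quasi_identifiers)] >= k]

# Hypothetical example records.
records = [
    {"name": "A", "age": 41, "sex": "f", "diagnosis": "I10"},
    {"name": "B", "age": 47, "sex": "f", "diagnosis": "E11"},
    {"name": "C", "age": 45, "sex": "f", "diagnosis": "I10"},
    {"name": "D", "age": 83, "sex": "m", "diagnosis": "I50"},
]
released = anonymize(records, quasi_identifiers=("age", "sex"), bin_width=10, k=2)
```

In this sketch the three records in the age group 40-49 are released without names, while the single record in the group 80-89 is suppressed. A wider bin or a smaller k would retain more records at the cost of precision or privacy, which is the kind of trade-off that requires the expert's judgment.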
Ethical considerations and overprotectiveness
Not only the legal framework, but also ethical considerations can cause uncertainties. These can affect the patients and researchers but, in the context of an MDS project, especially the ethics committees, as they have to judge whether a project is ethically justifiable. There are a variety of ethical principles to be taken into account for such a decision. These principles encompass patient privacy, data ownership, individual autonomy, confidentiality, necessity of data processing, non-maleficence and beneficence1,33. Considered jointly, they result in a trade-off to be made between the preservation of ethical rights of treated patients and the beneficence of the scientific project15,18,26. Criticism often arises concerning the prevailing trade-off in favor of patients’ privacy, where ethics committees tend to overprotect patient data23,27. What is frequently overlooked is the ethical responsibility to share and reuse medical RWD to advance medical progress in diagnosis and treatment. Thus, a consequence of overprotecting data is suboptimal patient care, which is, in turn, unethical1,9,26. Measures to prevent overprotection include raising awareness of its risks through education and developing clear ethical regulations and guidelines28. To facilitate the latter, the data set compilation process for medical RWD should be simplified, e.g. by standardizing processes and data formats, because its current complexity hampers the creation of regulations and guidelines17.
Uncertainties in project planning
Many of the mentioned concerns related to legal and ethical requirements occur during project planning and design. Here, a variety of decisions are made regarding the composition of the RWD set and its processing. These affect all subsequent project steps, but must be determined at an early stage if the project framework necessitates approvals from governing entities. This is because the governing entities require all planned processing steps to be documented in a study plan, which serves as the foundation for their decision-making process. This results in long project planning phases due to uncertainties in a complex multi-player environment13,14,15,16,21. Additionally, creating a strict study plan usually works for clinical trials, but in data science, meaningful results often require more flexibility. For instance, it might be necessary to redesign the project plan during data processing. Therefore, project frameworks that show researchers how to reshape their project in specific cases would be much better suited for secondary use of medical RWD25,34. Taking it a step further, a general guideline or regulation on how to conduct MDS projects would decrease planning time and the risk of errors, both of which are higher if each project is designed individually14. To minimize the uncertainties in project planning, and thereby the duration of the planning phase, even under current conditions, research teams should communicate intensively and plan their tasks collaboratively9,18. Since this is a challenging task in a highly interdisciplinary environment, early definition of structures, binding deadlines, and clear assignment of responsibilities, such as designating a person responsible for timely data provision in each department, are crucial8,14.
The role of patient consent
As mentioned in the introduction to section 3.1, dynamic consent management, which allows patients to effectively give and withdraw their consent at any point in time, is a crucial measure to foster patient empowerment. As a result, it also leads to greater acceptance of MDS by the affected individuals. Furthermore, in section 3.1.1, informed patient consent was mentioned as a possible legal justification for processing sensitive personal data. However, traditional informed consent requires patients to explicitly consent to the specific processing of their data. This means their consent is tied to a specific project35,36. For retrospective projects, such consent cannot be obtained during the patients’ stay at the hospital because the project idea does not exist at that time. Hence, the researcher would have to retrospectively contact all patients whose data is needed for the project, describe the project objective and methodology to them and then ask for their consent. This requires great effort, is itself questionable in terms of data protection, and is not even feasible if the patients are deceased. Making clinical data truly reusable in a research context, therefore, requires a broad consent in which the patients generally agree to the secondary use of their data in ethically approved research contexts. Furthermore, the retrieval of such a broad consent must be integrated into daily clinical routine and the consent management needs to be digitized. Otherwise, the information about the patient consent status might not be easily retrievable for the researcher8,18,21,37.
Previous research has documented that most patients are willing to share their data and even perceive sharing their medical data as a common duty38. Therefore, it is highly likely that extensively introducing a broad consent such as the one developed by the MI-I in Germany into clinical practice, combined with a fully digital and dynamic consent management, would have a significant positive impact on the feasibility of MDS projects39. It would allow patients to actively determine which future research projects may use their data.
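What a digital, dynamic consent record has to provide can be made concrete with a small sketch: every grant or withdrawal is stored as a timestamped event, and the status valid at a given point in time can be queried. The class and method names are hypothetical, and the sketch deliberately ignores aspects such as consent scopes, authentication and audit trails that a real consent management system would need.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ConsentRecord:
    """Hypothetical per-patient record of broad-consent events."""
    patient_id: str
    # List of (timestamp, granted) events, in any insertion order.
    events: list = field(default_factory=list)

    def grant(self, when: datetime) -> None:
        self.events.append((when, True))

    def withdraw(self, when: datetime) -> None:
        self.events.append((when, False))

    def is_active(self, at: datetime) -> bool:
        """Return the consent status valid at time `at`.

        Without any earlier event, no consent is assumed.
        """
        status = False
        for when, granted in sorted(self.events):
            if when <= at:
                status = granted
        return status

consent = ConsentRecord("p-001")
consent.grant(datetime(2023, 1, 10))
consent.withdraw(datetime(2024, 3, 1))

before_consent = consent.is_active(datetime(2022, 1, 1))   # False
during_consent = consent.is_active(datetime(2023, 6, 1))   # True
after_withdrawal = consent.is_active(datetime(2024, 6, 1)) # False
```

Storing events rather than a single flag keeps the history intact, so a researcher can check the status that applied when a data set was compiled as well as the current one.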
When describing the challenges resulting from balancing benefits and harms in MDS projects, some measures were suggested that require technical solutions. One example of this is the implementation of data protection measures like data access control, safe data transfer, encryption, or de-identification20. However, there are not only technical solutions but also technical challenges, as shown in Fig. 2.
One category of technical challenges results from the specificities of medical data outlined in section 2. Medical RWD is characterized by a higher level of heterogeneity regarding data types and feature availability than data from any other scientific field18,19,26. Thus, compiling usable medical data sets from RWD requires technical capabilities for skillful data integration, type conversion and data imputation. However, heterogeneity is not restricted to data formats. A common problem is that differences in the primary purpose of data acquisition or primary care lead to different data formats and standards being used8. As a result, different physicians, clinical departments, or clinical sites do not necessarily use the same data scales or units, syntax, data models, ontologies, or terminologies. Hence, it is difficult to decide which standards to use in an MDS project. A subsequent challenge arising from this lack of interoperability is the conversion between standards, which potentially leads to information loss19,26,40. Last but not least, heterogeneity is also reflected in different identifiers being used at different sites. This complicates the linkage of related medical records, which may even become impossible once the data is de-identified41. Promising and important measures to meet the challenges concerning heterogeneity are the development, standardization, harmonization and, eventually, deployment of conceptual frameworks, data models, formats, terminologies, and interfaces8,13,14,16,42. An example illustrating the feasibility and effectiveness of these measures is the widely used DICOM standard for Picture Archiving and Communication Systems (PACS)18. Similar effects are expected from the deployment of the HL7 FHIR standard for general healthcare-related data, which is currently being developed43. However, besides appreciating the benefits of new approaches, the potential of already existing standards like the SNOMED CT terminology should not be neglected. SNOMED CT still has limitations, such as its complexity, which makes it hard to identify the fitting codes, and its incompleteness, which sometimes requires users to add their own codes. On the other hand, SNOMED CT is already very comprehensive. Once its practical applicability is improved, SNOMED CT could be introduced as an obligatory standard in medical data systems, fostering interoperability13,16,42.
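The harmonization of units and local codes discussed above can be sketched for a single laboratory value. The example assumes two hypothetical sites that report blood glucose under different local codes and in different units. The local codes and the target code "STD:GLUCOSE" are placeholders and not real terminology codes; only the conversion factor follows from the molar mass of glucose (about 180.16 g/mol).

```python
# 1 mmol/L of glucose corresponds to about 18.016 mg/dL.
GLUCOSE_MG_DL_PER_MMOL_L = 18.016

# Hypothetical mapping from site-specific codes to one placeholder target code.
LOCAL_TO_STANDARD = {"GLU": "STD:GLUCOSE", "BZ": "STD:GLUCOSE"}

def harmonize(observation):
    """Map a local glucose observation to the common code and to mmol/L.

    Returns None when the code or unit is unknown, so that the record can be
    reviewed manually instead of being converted incorrectly.
    """
    code = LOCAL_TO_STANDARD.get(observation["code"])
    if code is None:
        return None
    value, unit = observation["value"], observation["unit"]
    if unit == "mg/dL":
        value = value / GLUCOSE_MG_DL_PER_MMOL_L
    elif unit != "mmol/L":
        return None
    return {"code": code, "value": round(value, 2), "unit": "mmol/L"}

site_a = harmonize({"code": "GLU", "value": 5.4, "unit": "mmol/L"})
site_b = harmonize({"code": "BZ", "value": 90, "unit": "mg/dL"})
unknown = harmonize({"code": "XYZ", "value": 1.0, "unit": "mmol/L"})
```

Returning None for unmapped codes or units makes the potential information loss visible: such records are flagged rather than silently dropped or guessed.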
Another significant technical challenge is that the majority of medical RWD is typically available in a semi-structured or unstructured format, while most machine learning algorithms require structured data8,19,42,44. Primary care documentation often relies on free text fields or letters because they can capture all real-world contingencies while structured and standardized data models cannot. Additionally documenting the cases in a structured way is too time-consuming for clinical practice. So, the primary clinical systems mainly contain semi-structured or unstructured RWD7,13,23. To increase the amount of available structured data, automated data structuring using Natural Language Processing (NLP) is a possible solution. However, it is not easy to implement for various reasons. Among them are the already mentioned inconsistent use of terms and abbreviations in medical texts and the need to manually structure some free text data sets to obtain annotated training data13,42.
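To illustrate what structuring free text means in the simplest case, the following sketch extracts a blood pressure reading from a clinical note with a regular expression. This rule-based approach is only a stand-in for the NLP methods mentioned above, not an implementation of them. The pattern, the accepted abbreviations (including "RR", assumed here to denote blood pressure in some notes) and the example sentences are illustrative assumptions.

```python
import re

# Accept a few spellings of the same concept, since notes are inconsistent.
BP_PATTERN = re.compile(
    r"\b(?:RR|BP|blood pressure)\s*:?\s*(\d{2,3})\s*/\s*(\d{2,3})",
    re.IGNORECASE,
)

def extract_blood_pressure(text):
    """Return the first blood pressure reading found in `text`, or None."""
    match = BP_PATTERN.search(text)
    if match is None:
        return None
    return {
        "systolic_mmHg": int(match.group(1)),
        "diastolic_mmHg": int(match.group(2)),
    }

note_1 = extract_blood_pressure("Patient stable. RR 140/90 mmHg, no fever.")
note_2 = extract_blood_pressure("BP: 120 / 80 on admission.")
note_3 = extract_blood_pressure("No vital signs documented.")
```

Each additional abbreviation or notation requires another rule, which hints at why hand-written patterns do not scale to real clinical text and why annotated training data is needed for learned models.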
Workflows in primary care settings not only lead to predominantly semi-structured or unstructured documentation of medical cases, but also greatly influence the design of clinical data management systems. In primary care and administrative contexts, such as accounting, clinical staff typically need a comprehensive overview of all data pertaining to an individual patient or case. As a result, clinical data management systems have been developed with a case- or patient-centric design that presents data in a transaction-oriented manner. However, this design is at odds with the need for query-driven extract-transform-load (ETL) processes when accessing data for MDS projects. These projects typically require only a subset of the available data features, but for a group of patients8,26. Developing a functional ETL pipeline is further complicated by the overall lack of accessible interfaces to the data management systems and the fragmented distribution of data across various clinical departments’ systems8,13.
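The mismatch between patient-centric storage and query-driven access can be sketched as follows. The nested structure mimics a system that holds everything about one patient in one place, whereas the extraction function returns a flat table with a few selected features for a whole cohort. All field names, codes and values are hypothetical.

```python
# Patient-centric view: all data of one patient is nested under that patient.
patients = [
    {"patient_id": "p1", "cases": [
        {"diagnoses": ["I10"], "labs": {"creatinine": 1.1, "hb": 13.2}},
    ]},
    {"patient_id": "p2", "cases": [
        {"diagnoses": ["E11"], "labs": {"creatinine": 0.9}},
    ]},
    {"patient_id": "p3", "cases": [
        {"diagnoses": ["I10", "E11"], "labs": {"hb": 11.8}},
    ]},
]

def extract_cohort(patients, diagnosis, features):
    """Query-driven view: one flat row per matching case, selected features only.

    Features that were not recorded for a case are returned as None.
    """
    rows = []
    for patient in patients:
        for case in patient["cases"]:
            if diagnosis in case["diagnoses"]:
                row = {"patient_id": patient["patient_id"]}
                for feature in features:
                    row[feature] = case["labs"].get(feature)
                rows.append(row)
    return rows

cohort = extract_cohort(patients, diagnosis="I10", features=["creatinine"])
```

The missing creatinine value for the third patient also shows the varying feature availability of RWD, which a real ETL pipeline would have to handle explicitly.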
This means the design of primary clinical systems could be improved significantly if it allowed for more flexibility, i.e. if it supported patient- and case-centricity for primary care as well as data-centricity for secondary use. Moreover, the system design should comply with data specifications and developed standards rather than requiring the data to be created according to system specifications13. However, a complete redesign of primary clinical systems is most likely not feasible. An alternative solution is creating clinical data repositories in the form of data lakes or data warehouses that extract and transform medical RWD from primary systems and make it usable for research45,46. In this context, the use of standardized platforms and frameworks such as OMOP or i2b2 further increases the interoperability of the collected data47. In Germany, the MI-I established DIC and MeDIC, whose goal is the creation of such data repositories for the medical RWD gathered at German university hospitals. As a common standard, they agreed on the HL7 FHIR-based MI-I core data set (CDS)48. Because this is work in progress and the data repositories are populated with data from primary clinical systems, the DIC and MeDIC still need to address the challenges identified in this comment paper to create FAIR data repositories for research.