Basic statistics of HIRA CDM
We extracted, transformed, and loaded (ETL) the HIRA database into the OMOP-CDM version 5.3.1. All tables specified by the OMOP-CDM conversion specifications were created. The converted data comprised 10,098,730,241 claims specifications from 56,579,726 patients (Table 1). Of these patients, 28,439,311 (50.3%) were male and 28,140,325 (49.7%) were female. All records of the source database were converted into CDM format without errors in classification by year, type of visit, and type of claiming medical institution (Table S1 in the Supplements). Among the CDM tables, the death table contained records for 3,804,948 people who had died over 11 years, accounting for 6.7% of the total population (Table 1). The condition, drug, and procedure tables, which hold the main clinical information of the OMOP-CDM, each covered more than 99.0% of patients, and the device and measurement tables each covered more than 90.0% of patients (Table 1).
The results of vocabulary mapping from the Electronic Data Interchange (EDI) codes of Korea to the OMOP standardized vocabulary are shown in Table 2. Table 2 lists, for each OMOP domain, the number of EDI codes, the ratio of codes mapped to standard terminologies, and the ratio of mapped records to source records. The ratio of mapped codes to source codes was high for condition (99.1%), drug (100.0%), observation (99.97%), and procedure (84.5%), but relatively low for device (10.8%) and measurement (31.0%). Nevertheless, the ratio of mapped records to source records was over 85.0% in all domains, including device (87.6%) and measurement (91.5%), which indicates that the mapped codes are the ones that account for most records.
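The gap between the code-level and record-level ratios can be illustrated with a minimal sketch. The function name, the toy codes, and the concept identifier below are hypothetical and are not taken from the HIRA data; the sketch also assumes that the code-level denominator is the set of distinct codes that occur in the records, which may differ from the denominator used for Table 2.

```python
from collections import Counter

def mapping_ratios(source_records, code_to_standard):
    """Return (code-level ratio, record-level ratio) as percentages.

    source_records: iterable of source codes, one entry per record.
    code_to_standard: dict from source code to a standard concept id;
        a missing key or a None value means the code is unmapped.
    Assumes at least one record is supplied.
    """
    counts = Counter(source_records)
    mapped_codes = [c for c in counts if code_to_standard.get(c) is not None]
    code_ratio = 100.0 * len(mapped_codes) / len(counts)
    record_ratio = 100.0 * sum(counts[c] for c in mapped_codes) / sum(counts.values())
    return code_ratio, record_ratio

# Hypothetical toy data: one frequently used code is mapped,
# three rarely used codes are not.
records = ["D001"] * 90 + ["D002"] * 4 + ["D003"] * 3 + ["D004"] * 3
mapping = {"D001": 4000001}  # hypothetical concept id

code_ratio, record_ratio = mapping_ratios(records, mapping)
print(f"codes mapped: {code_ratio:.1f}%, records mapped: {record_ratio:.1f}%")
# prints "codes mapped: 25.0%, records mapped: 90.0%"
```

In this toy case only a quarter of the codes are mapped, yet 90% of the records are covered, which is the same pattern as reported for the device and measurement domains.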
Data quality and reliability
We compared the number of records in the original (source) and converted data for the condition, drug, procedure, and device codes. The record counts in the source and converted data, and the differences between them, are presented for the top 10 codes in each domain in Tables S2–S6 in the Supplements. The differences were due to (1) the mapping of one source code to multiple standard concepts, (2) the assignment of records to a domain table different from the source table, and (3) the absence of a mapping to the OMOP standardized vocabulary.
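How these three causes change a record count can be sketched as follows. This is an illustration under stated assumptions, not the ETL logic used for the HIRA CDM: the function name and the example counts are hypothetical, each mapped standard concept is assumed to produce one converted record per source record, and unmapped records are assumed not to be counted in the converted domain table.

```python
def explain_difference(source_count, target_domains, source_domain):
    """Return (converted minus source record count, list of reasons).

    source_count: number of source records carrying one source code.
    target_domains: domain of each standard concept the code maps to
        (an empty list means the code has no standard mapping).
    source_domain: domain of the table the source records came from.
    """
    # Assumption: only mapped concepts in the same domain are counted there.
    in_domain = sum(1 for d in target_domains if d == source_domain)
    converted_count = source_count * in_domain

    reasons = []
    if not target_domains:
        reasons.append("no standard mapping")
    if len(target_domains) > 1:
        reasons.append("multiple mapping")
    if any(d != source_domain for d in target_domains):
        reasons.append("different domain")
    return converted_count - source_count, reasons

# Hypothetical examples, one per cause, plus a clean one-to-one mapping.
print(explain_difference(1000, ["Drug", "Drug"], "Drug"))    # (1000, ['multiple mapping'])
print(explain_difference(500, ["Measurement"], "Procedure"))  # (-500, ['different domain'])
print(explain_difference(200, [], "Device"))                  # (-200, ['no standard mapping'])
print(explain_difference(300, ["Condition"], "Condition"))    # (0, [])
```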
Patients with T2DM were identified using the same definition in the source and converted data, yielding 3,031,462 (21.3%) and 3,030,183 (21.3%) patients, respectively (Fig. 1). The incidence of T2DM per 100,000 patients ranged from 550.1 to 650.9 in the source database and from 549.9 to 649.7 in the converted HIRA CDM database (Table 3). The discrepancy was largest in 2012, when the number of patients with T2DM differed by 590 between the source and converted data and the incidence rate differed by 1.2 per 100,000 patients. In 2020, the number of patients differed by 14 and the incidence rate by 0.0. In addition, there were no differences in T2DM incidence by year-gender and year-age groups (Tables S7, S8 in the Supplements).
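For readers unfamiliar with the measure, the comparison of incidence rates can be sketched as below. The denominator used in this study is not stated in this section, so the sketch assumes the conventional definition (new cases divided by the population at risk, scaled to 100,000); the function name and all counts are hypothetical and are not values from Table 3.

```python
def incidence_per_100k(new_cases, population_at_risk):
    """Incidence per 100,000 under the conventional definition (an assumption here)."""
    if population_at_risk <= 0:
        raise ValueError("population_at_risk must be positive")
    return new_cases * 100_000 / population_at_risk

# Hypothetical counts for one calendar year.
population = 1_000_000
source_rate = incidence_per_100k(6_000, population)     # source database
converted_rate = incidence_per_100k(5_990, population)  # converted CDM

print(f"source: {source_rate:.1f}, converted: {converted_rate:.1f}, "
      f"difference: {source_rate - converted_rate:.1f} per 100,000")
# prints "source: 600.0, converted: 599.0, difference: 1.0 per 100,000"
```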
In the HIRA CDM database, by 2020, 32,633 outpatients were diagnosed with COVID-19. We could validate a previously published COVID-19 prediction model (COVER model), which was developed based on the OMOP-CDM15. The performance of the COVER models in predicting hospitalization for pneumonia, admission to the ICU or death from pneumonia, and all-cause death was 0.816, 0.891, and 0.892, respectively (Table S9 in the Supplements). We also validated the models using a newly updated sample database. In the HIRA 20% sample database, which covers data until April 2022, 1,530,350 outpatients were diagnosed with COVID-19, and the performance of the models was 0.748 (hospitalization for pneumonia), 0.879 (admission to the ICU with pneumonia or death due to pneumonia), and 0.891 (all-cause mortality). Through version control of the database, we confirmed that predictive models developed earlier could be easily applied to databases of different versions covering different periods.
Data analytic environment and open policy
We built a Docker-based analytic environment so that open-source tools can be used even within the HIRA intranet, which has no Internet access, and so that statistical tools and frequently updated packages can be installed (Fig. 2)16. For data security, the data officer of the HIRA is responsible for managing access sessions and logs from the database and analytic servers.
Under the open policy of the HIRA CDM, researchers can submit research requests through the healthcare distributed research network (HDRN) platform operated by the Korean government (https://hcdl.mohw.go.kr/). The application procedure is as follows: (1) The researcher must request a review of the research hypotheses and plan for ethical feasibility by an institutional or public review board. (2) The researcher must submit an approval letter from the review board and the research protocol to the HDRN platform. (3) The HIRA reviews the appropriateness of the research and of the data provision and decides whether to provide the data. (4) The researcher writes an analysis query, code, or package based on the open sample data and environment and sends it to the HIRA. (5) The HIRA reviews the queries and expected results and derives the results by running the queries, code, or packages. (6) After the results are reviewed and checked for disclosure of protected health information, they are exported to the researcher.
We followed all FAIR principles, and the results of applying each principle to the HIRA CDM are shown in Table S10 in the Supplements. The metadata, disclosure policy, and sample data of the HIRA CDM have been made available to the public online (https://opendata.hira.or.kr/op/opb/selectNotice.do?sno=13906&ntfcIteDivCd=&searchCnd=&searchWrd=cdm&pageIndex=1).