Reference medical datasets (MosMedData) for independent external evaluation of algorithms based on artificial intelligence in diagnostics

Nikolay A. Pavlov; Павлов Николай Александрович; Anna E. Andreychenko; Андрейченко Анна Евгеньевна; Anton V. Vladzymyrskyy; Владзимирский Антон Вячеславович; Anush A. Revazyan; Ревазян Ануш Артуровна; Yury S. Kirpichev; Кирпичев Юрий Сергеевич; Sergey P. Morozov; Морозов Сергей Павлович

doi:10.17816/DD60635

Reference medical datasets (MosMedData) for independent external evaluation of algorithms based on artificial intelligence in diagnostics

Authors: Pavlov N.A.¹, Andreychenko A.E.¹, Vladzymyrskyy A.V.¹, Revazyan A.A.¹, Kirpichev Y.S.¹, Morozov S.P.¹
Affiliations:
1. Moscow Center for Diagnostics and Telemedicine
Issue: Vol 2, No 1 (2021)
Pages: 49-66
Section: Technical Reports
Submitted: 12.02.2021
Accepted: 23.03.2021
Published: 30.04.2021
URL: https://jdigitaldiagnostics.com/DD/article/view/60635
DOI: https://doi.org/10.17816/DD60635
ID: 60635

Cite item

Full Text

Abstract
Full Text
About the authors
References
Supplementary files
Statistics

Abstract

The article describes a novel approach to creating annotated medical datasets for testing artificial intelligence-based diagnostic solutions. Moreover, there are four stages of dataset formation described: planning, selection of initial data, marking and verification, and documentation. There are also examples of datasets created using the described methods. The technique is scalable and versatile, and it can be applied to other areas of medicine and healthcare that are being automated and developed using artificial intelligence and big data technologies.

Keywords

artificial intelligence, medical data, dataset, marking, computer-assisted learning, big data, verification

Full Text

List of abbreviations

GB, MB, TB ― digital storage capacity: gigabyte, megabyte, terabyte

Dataset; Data set ― a structured set of information united according to certain logical principles, suitable for machine processing by computed methods of data analysis. A dataset is a complex concept characterized by four main stages: the presence of content (observations, values, records, files, etc.); the presence of a goal (for example, a knowledge base, use for a specific task); the presence of groupings (aggregation and organization of content into sets, collections, etc.); and the presence of cohesion (relation to the subject, integration, logical collection of content, etc.)

UMIAS ― Unified Medical Information Analysis System of Moscow

URIS ― Unified Radiological Information Service of Moscow

AI (artificial intelligence) ― the science and technology of creating intelligent computer programs capable of performing tasks for which, as a rule, human intelligence is required

CT ― computed tomography

CT 0–4 ― classification of COVID-19 CT signs developed by the Scientific and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Department of Health in 2020. CT0 is the norm and the absence of CT signs of viral pneumonia. CТ1 — areas of induration with the appearance of frosted glass; involvement of the lung parenchyma is ≤25%. CT2 — areas of induration by the type of frosted glass; involvement of the lung parenchyma is 25%–50%. CT3 — areas of induration by the type of frosted glass and consolidation; involvement of the lung parenchyma is 50%–75%. CT4 — diffuse induration of the lung tissue with the appearance of ground glass and consolidation in combination with reticular changes; involvement of the lung parenchyma is >75%

MIS ― Medical Information System

MMG ― mammography

LDCT ― low-dose computed tomography

Chest ― thoracic organs

X-ray ― X-ray study

FLG ― fluorography

COVID-19 ― an infectious disease caused by the SARS-CoV-2 virus the spread of which in 2020 was characterized by the World Health Organization as a pandemic. According to the International Classification of Diseases of the 10th revision, it is coded as U07.1 or U07.2, depending on the presence/absence of laboratory identification of the virus, respectively

DICOM (Digital Imaging and Communications in Medicine) ― medical industry standard for the creation, storage, transmission, and visualization of digital medical images and documents of examined patients

MeSH (Medical Subject Headings) ― a thesaurus containing key medical terms used to index, catalog, and search articles in an English textual database of medical and biological publications created by the US National Center for Biotechnology Information (PubMed)

README ― in English, «Read me» is a well-established name for a document accompanying an executable code, database, or other software product, usually containing basic information about files in the same directory.

SARS-CoV-2 ― enveloped single-stranded (+)RNA virus of Betacoronavirus

BACKGROUND

Progress in artificial intelligence (AI) technologies and their practical uses in various fields, medicine in particular, demonstrates the potential utility of such technologies in applications such as automated diagnostic systems; systems for recognizing unstructured medical records and understanding natural language, analyzing and predicting events, and automatic classification and verification of information; and automatic chat bots to support patients [1]. In connection with the rapid development of deep machine learning and the associated computer recognition of images and patterns within them, considerable attention among all areas of application of automated diagnostic systems is currently being paid to the analysis of medical images, in particular, radiation studies [2].

In practical healthcare, the task of automating diagnostic processes is one of the top priorities for the aging population, increase the availability and, accordingly, the number of diagnostic procedures which are not compensated for by an increase in the number of qualified personnel necessary to ensure proper interpretation of results and, as a result, provide timely medical care. This problem is particularly acute in radiation diagnostics [3], which is based on the visual analysis of images by a doctor. For most modern methods in radiation diagnostics, the number of two-dimensional images per patient requiring interpretation can reach 1000 or more. In this regard, radiation diagnostics is currently an area of active development of deep learning technologies, which is part of the AI concept, for creating computer vision systems that automate the interpretation of medical images. A distinctive feature of deep learning from other machine learning methods is that the accuracy, reliability, and practical value of the created models depends directly on the quantity and quality of the data used in the learning, validation (fine-tuning), and testing processes [4].

That is why one of the main barriers to the development of AI-based solutions in medical diagnostics is the absence of verified (free from incomplete and erroneous) and high-quality (unified, prepared for automatic machine processing) data sets [5]. Annotated datasets [6] are necessary not only for “training” AI, particularly for machine learning of computer neural networks, but also for testing networks trained on other data.

The requirements for datasets do not allow the use of simple unloading from a medical information system but require that a number of manipulations be carried out with data before they become an annotated dataset suitable for effective use by AI models. The difference between medical data and data in other areas in which machine learning is actively used (for example, banking and other services) lies in the historically established culture of medical records, the absence of structure or minimal structuring, and the limited comparison of different studies of the same patient with each other. At the moment, the literature on the preparation of medical datasets is represented by few publications [7–9]. With this publication, the authors aim to expand the understanding of the problem and the features of preparing datasets based on medical data among medical specialists related to or involved in the development or testing of AI as well as programmers and data scientists to improve the process of independent evaluation of algorithms for AI-based applications.

This article presents a unified approach (methodology) to the development of datasets for objective (as far as possible in each specific case) testing of solutions using AI technologies in the field of radiation diagnostics. In the course of describing the stages of our proposed methodology, we give practical examples of datasets developed by us in the period from September 2019 to December 2020 using data from the departments of radiological diagnostics of medical outpatient and inpatient institutions in Moscow deposited in the Unified Radiological Information Service (URIS UMIAS) [10]. The basic principles described in the article can be used to form medical datasets in other areas of medicine.

METHODS AND RESULTS

A dataset differs from a simple collection of medical data in that it is endowed with special properties: data unification and structuring; a lack of gross inaccuracies or erroneous research; the presence of additional information (categories and values of attributes or characteristics of data items); and the presence of accompanying documentation. In the Russian Federation, a dataset is equated to a database and is subject to voluntary state registration as a result of intellectual activity. In foreign practice, datasets are often published not only as datasets available for download but also as scientific publications in journals. Each dataset is unique not only in the composition of the studies but also in the way they are classified and the approaches to markup, and the process of creating a dataset is exploratory in nature. Even in the presence of a structured method of dataset formation, at certain stages, departures, exceptions, and changes to the original dataset are possible, depending on its purpose.

The whole process can be divided into four major stages: planning, selection of initial data, markup and verification, and documentation (Fig. 1).

Fig. 1. Stages of forming a medical dataset.

1. Planning stage

The preparation of a dataset, as in scientific research, begins with the planning stage, which consists of the following steps:

formulation of a clinical and/or practical problem in the field of medicine, which is (potentially) subject to automation by intelligent systems;
compilation of a list of features and/or characteristics of the initial data, information about which will be received from the intelligent system in the process of solving the problem and by which it is possible to assess the correctness of the solution adopted by the system;
determination of the verification methodology for the values of the selected features and/or characteristics of the elements of the generated data set;
definition of data sources;
description of the steps planned for data anonymization;
determination of criteria for inclusion and exclusion of a study from the dataset;
determination of significant data characteristics necessary to assess not only the accuracy but also the limits of the reliability and scalability of an intelligent system.

Setting a clinical task is one of the most important tasks facing the creator of a dataset. Insufficient attention to it leads to sudden pop-up questions both in the process of preparing a dataset and when introducing a diagnostic algorithm based on AI into clinical practice. (Fig. 2).

Fig. 2. Relationships among the clinical task, dataset, and success in the implementation of a solution based on artificial intelligence (AI) in routine clinical practice.

In order for the task to correspond to the class of tasks in which AI has established itself as a promising technology and, at the same time, has an important socioeconomic component from the point of view of clinical specialists, a working group of professionals of various profiles, namely clinicians, specialists on medical data processing, research engineers (machine learning or validating AI solutions), and administrators who access and upload raw data, should participate in task definition.

The clinical task should allow the creators of the dataset to answer the following questions:

1) What modalities, procedures, clinical, demographic, and similar information should be taken as input to the algorithm to solve it, and what should be taken as one data unit?

2) What features should be determined using AI technologies?

3) What nosology or groups of nosology are the desired signs?

4) How does the solution to the problem help the clinical specialist?

5) How many data units are necessary and sufficient for the purpose of using the created dataset (AI validation, machine learning, etc.)?

An important criterion for the selection of the number of data units and characteristics of the study is the purpose of applying the dataset in relation to AI. The following classification of datasets can be given by their purpose:

1) general sets:

a self-test to check the AI for technical compliance;
a clinical test to assess the metrics of the accuracy and productivity of AI;
«additional training» for tweaking the already trained AI model;
machine learning for learning new models underlying AI and solving new clinical problems;

2) specialized kits:

dynamic sets for assessing changes over time (linking several data items to one subject);
technological defects to assess the stability and reliability of diagnostic solutions based on AI when attempting to analyze a defective study.

The number of research units required for a self-test is usually calculated individually for each type or model of the diagnostic device; the number of research units in dynamic sets and datasets for a clinical test is usually between 10 and 100; datasets for training and «additional training» can contain from several hundred to several tens of thousands of studies. The indicated quantities are rough estimates and can vary widely depending on the availability of studies in the data source, the complexity of the clinical task, the detail and laboriousness of annotating the data, and other factors.

After the clinical task is defined, the criteria by which the intelligent system decides whether to assign a particular study or the area found in the image to a group of interest logically follows from it (basic diagnostic requirements for the work of AI). Diagnostic requirements include a formal description of the desired features of the study and also make up a list of features and/or characteristics, on the basis of which the data will be marked up in the dataset in the future. This information allows developers to more accurately customize solutions to determine the required features and for dataset preparation specialists to draw up instructions for marking and verifying data.

The balance of classes, namely in what proportion the studies in the dataset are distributed related to various features and/or characteristics, is of key importance for the value and significance of the obtained analysis of systems based on AI technologies using a dataset. In the simplest case, to assess the performance of intelligent diagnostic systems that provide dichotomous responses, an equal division is used between the two categories (for example, 50% of studies with signs of pathology according to the basic diagnostic requirements for the work of AI and 50% of studies without signs of pathology). In more complex cases, the division between several classes may be uneven and depend on the comparison method that will be used at a subsequent time.

Studies divided into classes according to a significant trait may have other differences, both in clinical (for example, the prevalence of female patients in the category with signs of pathology, due to the age and sex pattern of morbidity) and in technical (for example, artificial sampling bias due to the preference for directing patients with an already-identified pathology to a study performed on a device with a higher resolution) aspects. In order to avoid systematic errors, it is necessary to identify signs, even though to do so will not contribute significantly to the solution to a clinical problem but will affect the operation of the diagnostic intelligent system, and when selecting studies in a dataset, we should strive to present different examples in each of the classes. The question of systematizing such signs and characteristics for a wide range of clinical tasks (that is, the issue of class balance in datasets) remains open and is being actively investigated at the present time [9].

At the end of the planning stage, the sources of the initial data are determined, as well as the criteria for inclusion, non-inclusion, and exclusion of studies from the dataset.

To create the most representative dataset, data sources should, if possible, be either the same or relevant to those information systems in which the implementation of AI-based solutions is planned in the future. For Moscow healthcare, an example of such a source is URIS UMIAS, which unites storage systems for the departments of radiological diagnostics of dozens of outpatient and inpatient medical institutions in Moscow.

Inclusion and exclusion criteria are often determined by a clinical and/or practical task, while exclusion criteria are usually supplemented in the course of working with primary data, since certain criteria are found that negatively affect the structure and unification of the data set. These criteria can be both medical (for example, age from 18 to 99 years; presence of intact structure of the target organ, etc.), and technical (CT filter — soft tissue, convolution core — FC51, etc.). Data unification is necessary for the reliable operation of tools for evaluating the work of an AI-based solution (see Section 3 “Markup and verification”).

Take as an example the dataset “MosMedData: results of ultra-low-dose computed tomography with lesions in the lungs.”¹ The purpose of creating a database is to ensure the possibility of verifying the readiness of automated systems (including those using AI) to work in ERIS UMIAS. The clinical task is the search for and identification of pulmonary foci during lung cancer screening. For the dataset, anonymized computed tomography (CT) studies in DICOM format were selected, carried out in a special ultra-low-dose tomography mode (effective dose of radiation exposure less than 1 mSv at an increased voltage of 135 kV). One unit is one chest CT (CT) scan that meets the criteria below.

Inclusion criteria:
The patient’s age is over 55 years and under 75 years.
Experience of smoking more than 30 packs per year (at least 1 pack per day for 30 years or 2 packs per day for 15 years, etc.).
Current smoking or smoking cessation no more than 15 years ago.
The study was carried out in the mode of ultra-low-dose CT in the first round of screening for lung cancer.
Criteria for non-inclusion:
Lung cancer detected within 2 years after the first round of lung cancer screening using ultra-low-dose CT.
History of lung cancer and/or lung surgery (not including percutaneous lung biopsy).
History of cancer diagnosed less than 5 years ago, with the exception of skin cancer and cervical cancer in situ.
The presence of pronounced pathology of the cardiovascular, immune, respiratory, or endocrine systems, as well as a life expectancy of less than 5 years.
Acute disease of the respiratory system.
Antibiotic treatment in the past 12 weeks.
Presence of hemoptysis or weight loss of more than 10 kg in the last year.
Criteria for exclusion:
Absence of pulmonary foci in the first round of Moscow lung cancer screening.

The target value of the number of studies in the final dataset (300) is sufficient for testing AI-based automated diagnostic systems (the total number is 312 units).

2. Stage of selection of initial data

After access is gained to the source of the initial data, the stage of selecting the initial (“raw”) data begins. The approach to obtaining (unloading) the data depends on the source and method of data storage.

Medical data can be accumulated during the routine diagnostic process in a medical institution (MeSH: Routinely Collected Health Data), by direct data collection from the patient and/or his relatives and social workers (MeSH: Patient Generated Health Data), or as a result of targeted data collection, for example, during a clinical trial. Data collected on a routine basis usually has wide variability in parameters and allows the user to create the most representative dataset. When analyzing the data collected in the course of a clinical trial, attention is drawn to (1) the criteria for inclusion, non-inclusion, and exclusion of subjects from the study, set by its design and limiting the possibilities for preparing the dataset, as well as (2) the amount of data, which is limited by the power of the study.

Digitizing documents that are not primarily electronic makes little sense; documents stored on external media are often poorly structured, and digitizing and/or transferring data from other media can be costly (for example, transferring a radiation imaging database stored on CD-ROMs). The presence of a medical information system (MIS; MeSH: Health Information Systems) simplifies unloading, since it allows the user to apply filters and select the necessary studies by such criteria as, for example, the presence of a particular study or diagnosis. However, the necessary information is not contained in electronic medical records for all clinical tasks: lists of patients suitable for the criteria of a clinical task can be generated separately from the MIS, and the selection of studies for patients from these lists takes a significant amount of time.

The general principles for selecting “raw” data are the following:

1) Choose the largest possible range of studies of the modality and procedure of interest.

2) Preserve the amount of accompanying information necessary for solving the clinical problem (including text documents describing the results of the study, the clinical diagnosis of the patient who ended the medical case, etc.).

3) If possible, depersonalize the research “on the spot,” without leaving the information circuit of the institution in which the data is selected.

At the selection stage, the criteria for including and exclusion of the study in the future dataset are also applied. This operation can be carried out both directly, during the selection of studies in the MIS, and immediately after unloading (already outside the information circuit of a medical organization). It should be borne in mind that this step can lead to a a 10-fold or greater decrease in the size of the dataset.

During study selection, the class balance identified in Step 1 should be borne in mind.

For example, for the dataset “MosMedData: Results of ultralow-dose computed tomography with lesions in the lungs” mentioned in the description of stage 1, stage 2 can consist of the following steps:

1) selection of patients in the MIS who underwent a study of low-dose chest CT in order to screen for malignant lung tumors;

2) analysis of electronic medical records of selected patients (life history, history of previous diseases, data from previous studies) to select patients in accordance with the inclusion and exclusion criteria formed at stage 1;

3) decision-making to include studies in the dataset in accordance with the desired balance of classes.

3. Markup and verification stage

Markup is the process of determining the value of attributes or characteristics for a data item in a dataset. Based on the markup, it becomes possible to classify elements and assign them to a particular group. For markup, both the information already available at the time of selection of the initial data (retrospective markup) and markup made by a specialist with medical education and/or work experience after the selection stage (prospective markup) can be used [9].

For retrospective markup, data from accompanying documents (such as, for example, the texts of conclusions for the results of instrumental studies), MIS, electronic medical records, etc. can be used. An example is the metadata generated automatically by the device during the study and stored in the initial data. The obvious advantage of retrospective markup is that it takes significantly less time on the part of healthcare professionals, since most of the preparatory work is performed by the data scientist.

Prospective markup involves the active involvement of medical professionals in the process of “saturation” of the dataset with additional information, for example, allowing the user to effectively divide the elements of the dataset into classes and categories. In radiation diagnostics, markup is most often understood as the classification of studies by classes (the presence or absence of radiological signs of the selected disease) as well as the graphic designation of the area of interest corresponding to the desired signs (for example, foci of demyelination in multiple sclerosis on MR images of the brain). The degree of involvement can be divided into more or less costly: in the first case, experts are asked to outline the contour of the area of interest and in the second, to designate its coordinates with a simple geometric figure.

In cases where expert opinion is the most significant factor in determining the values of features or characteristics of the data, it is reasonable to conduct a simultaneous reading of the study by two independent experts. In case of inconsistency between two experts, the disputed research is sent to a third, more qualified expert (based on practical experience, degree or other criteria). Studies that remain controversial after reading by three experts may be considered controversial and excluded from the dataset. From our practice of preparing a dataset consisting of 100 chest CT with signs of various pathologies of the respiratory system, up to one quarter of the studies may be controversial after two independent readings; up to 4% of studies may remain controversial after being read by a third, more qualified expert (who has more than 5 years of medical experience).

Before proceeding with prospective markup, it is necessary to determine the scope of research of each specialist; the criteria for markup signs; and software that allows textual, graphic, or other designation of the desired features and prepare a markup of physician’s instructions. In the process of preparing such instructions, if possible, the same working group that defined the clinical task at the planning stage should be involved.

Markup verification provides a degree of “trust” in markup on the part of developers or evaluators of intelligent systems. Markup verification can be divided into:

low (the fact of the presence of a find) – based on the documentation;
average (classification of finds) – based on expert opinion;
high (confirmed diagnosis) – based on the results of a more sensitive research method or dynamic observation (repeated performance of the same method after a certain time interval).

The classification of markup types is shown in Fig. 4. Part of the dataset can have one class, while the other has a different class; a combination of retro- and prospective markup is allowed in the same dataset. An important part of the markup process is its correct documentation in the accompanying documentation (see clause 4 “Documentation stage”).

Fig. 4. Classification of markup by labor costs and degree of verification

For both retro- and prospective markup, various data automation tools can be used (for example, viewing medical imaging results and creating binary masks, analyzing databases) using various technologies and programming languages (C / C ++, Python, Kotlin, Java, etc.) [11].

4. Documentation stage

After the dataset has passed all the previous stages and is ready for transfer to third parties, it is considered “ready for publication.” The publication of the dataset is accompanied by the release of the first major version (1.0.0), as well as the preparation and publication of accompanying documentation (README file).

In the process of preparing a dataset, certain criteria are inevitably overlooked that pop up when end users work directly with the dataset (specialists in validation of AI-based solutions or researchers using machine learning). Making adjustments to the dataset should be transparent to all process participants and users. Dataset versioning keeps track of such changes.

We have proposed the following original approach to solving the described problem as a variation of semantic versioning [12]:

Major version (Major): increases when significant parameters of the dataset change, related to the clinical task, purpose, and principles of data marking and verification.
Minor version (Minor): increases when replacing, adding, or deleting data units in the dataset without changing other significant parameters of the dataset; in this case, the learning or validation algorithms can use the new minor version without changing the code. When a new major version is released, the minor version is set to 0.
Patch version (Patch): increases when making adjustments to the accompanying documentation, correcting typos and other errors in markup files, while the quantity and quality of data units in the dataset does not change. When a new major and/or minor version is released, the patch version is set to 0.

For ease of use of the dataset, a file named README.md in Markdown format and the generated README.pdf in Adobe PDF format are placed in the root directory. A unified approach to the structure of the README file will allow future organization of convenient searching and filtering of all published datasets. The basic structure of the README file is shown in Fig. 5; however, other sections can be added to the file if necessary.

Fig. 5. Basic structure of the README file.

For convenience of reporting, a single register of prepared data sets is of practical value, an example of which is given in Table 1 [13].

Table 1. List of medical DATAsets developed using the described experimental approach

No.	Internal code	Purpose	Modality/Procedure	Desired signs and/ or target nosology	Data unit	Number of data units	Markup classes (number of data units)
1	DS_FT-I_CT_OGK_CANCER	FT-I	CHEST CT	Lung cancer	CT	5	Without signs of pathology (2), with signs of pathology (2), with technical defect (1)
2	DS_FT-I_CT_OGK_COVID	FT-I	CHEST CT	Viral pneumonia (COVID-19)	CT	5
3	DS_FT-II_CT_OGK_COVID	FT-II	CHEST CT	Viral pneumonia (COVID-19)	CT	4	Without signs of pathology (2), with signs of pathology (2)
4	DS_FT-I_LDCT_OGK_CANCER	FT-I	CHEST LDCT	Lung cancer	CT	5	Without signs of pathology (2), with signs of pathology (2), with technical defect (1)
5	DS_FT-I_MMG_CANCER	FT-I	MMG	Breast cancer	MMG	5
6	DS_FT-II_MMG_CANCER	FT-II	MMG	Breast cancer	MMG	4	Without signs of pathology (2), with signs of pathology (2)
7	DS_FT-I_DX_OGK_PAT	FT-I	CHEST X-RAY	Respiratory pathology	X-ray study	5	Without signs of pathology (2), with signs of pathology (2), with technical defect (1)
8	DS_FT-II_DX_OGK_PAT	FT-II	CHEST X-RAY	Respiratory pathology	X-ray study	5
9	DS_FT-I_DX_OGK_COVID	FT-I	CHEST X-RAY	Viral pneumonia (COVID-19)	X-ray study	4	Without signs of pathology (2), with signs of pathology (2)
10	DS_FT-II_DX_OGK_COVID	FT-II	CHEST X-RAY	Viral pneumonia (COVID-19)	X-ray study	4
11	DS_FT-I_FLG_OGK_PAT	FT-I	CHEST FLG	Respiratory pathology	FLG	4
12	DS_FT-II_FLG2_OGK_PAT	FT-II	CHEST FLG	Respiratory pathology	FLG	4
13	DS_CT-I_CT_OGK_CANCER	ClT-I	CHEST CT	Lung cancer	CT	100	Without signs of pathology (50), with signs of pathology (50)
14	DS_CT-I_CT_OGK_COVID	ClT-I	CHEST CT	Viral pneumonia (COVID-19)	CT	100
	DS_CT-II_CT_OGK_COVID	ClT-II	CHEST CT	Viral pneumonia (COVID-19)	CT	100	Without signs of pathology (50), with signs of pathology (50)
	DS_CT-II_CT_OGK_COVID_2	ClT-I	CHEST CT	Viral pneumonia (COVID-19)	CT	125	CT0 (25), CT1 (25), CT2 (25), CT3 (25), CT4 (25)
	DS_CT-II_CT_OGK_COVID_3	ClT-I	CHEST CT	Viral pneumonia (COVID-19)	CT	200	CT0 (100), CT1 (25), CT2 (25), CT3 (25), CT4 (25)
	DS_CT-I_LDCT_OGK_CANCER	ClT-I	CHEST LDCT	Lung cancer	CT	100	Without signs of pathology (50), with signs of pathology (50)
	DS_CT-I_MMG_CANCER	ClT-I	MMG	Breast cancer	MMG	100
	DS_CT-II_MMG_CANCER	ClT-II	MMG	Breast cancer	MMG	100
	DS_CT-I_DX_OGK_CANCER	ClT-I	CHEST X-RAY	Respiratory pathology	X-ray study	100
	DS_CT-II_DX_OGK_CANCER	ClT-II	CHEST X-RAY	Respiratory pathology	X-ray study	100
	DS_CT-I_DX_OGK_COVID	ClT-I	CHEST X-RAY	Viral pneumonia (COVID-19)	X-ray study	100
	DS_CT-I_FLG_OGK_CANCER	ClT-I	CHEST FLG	Respiratory pathology	FLG	100
	DS_CT-II_FLG_OGK_CANCER	ClT-II	CHEST FLG	Respiratory pathology	FLG	100

Note. FT-I ― primary functional testing; FT-II ― repeated functional testing; ClT-I ― primary calibration testing; ClT-II ― repeated calibration testing; CT ― computed tomography; MMG ― mammography; X-ray ― X-ray study; FLG ― fluorography; Chest ― thoracic organs; CТ1 ― areas of induration by the type of frosted glass; involvement of the lung parenchyma is ≤25%. CT2 ― areas of induration by the type of frosted glass; involvement of the lung parenchyma is 25-50%. CT3 ― areas of induration by the type of frosted glass and consolidation; involvement of the lung parenchyma is 50–75%. CT4 ― diffuse induration of the lung tissue like ground glass and consolidation in combination with reticular changes; involvement of the lung parenchyma is >75% [13].

The minimum set of recommended registry fields is the following:

The sequential number of the registry entry.
An internal code unique to the dataset in the current registry and/or institution.
The purpose and scope of the dataset.
Modality/procedure (characteristics of studies, suitable for their search and selection in the IIA).
Searched signs and/or target pathology (if possible, indicating the code of the International Classification of Diseases).
The definition of a data unit.
The number of data units (if possible, indicating the output volume of data in MB, GB, or TB).
Markup classes indicating the number of records in each class.

DISCUSSION

This paper presents an experimental approach to the formation of sets of medical data (datasets) for use in the development and evaluation of intelligent medical diagnostic systems using AI technologies.

The use of a large-scale MIS (URIS UMIAS) as a data source for a dataset is a certain guarantee of its representativeness. The performance parameters of the AI algorithm after implementation in the diagnostic process will most likely correspond to the parameters obtained during validation on such a dataset. At the same time, it is necessary to account for the variability of the fleet of diagnostic devices, as well as variations in the physical parameters of the studies being carried out, while attempting to present the widest range of studies in the dataset. The value of the variability of devices from different manufacturers presented in datasets can be of practical importance for fine-tuning the threshold of AI systems in order to ensure their reliable operation [14].

Another advantage of working with URIS UMIAS is practically unlimited access to hundreds of thousands of beam studies of various modalities, which permits the creation of datasets with extremely diverse sets of technical, demographic, and clinical characteristics. Such variations ensure the value of the generated datasets for assessing not only the accuracy but also the scalability and reliability of the AI systems being developed and tested.

The proposed approach was developed and tested during the creation of 25 datasets in seven directions in radiation diagnostics with a total of more than 1400 data units (studies), including during the implementation of the Moscow experiment on the use of innovative technologies in the field of computer vision for the analysis of medical images and further application in the healthcare system in Moscow [15] (see Fig. 3). A complete list of datasets is given in the table 1. The provisions described in this article are consistent with the criteria for reference datasets included in the guidelines for clinical trials of software based on intelligent technologies in radiation diagnostics [16].

Fig. 3. Datasets of the Moscow experiment on the use of innovative technologies in the field of computer vision for the analysis of medical images and further use in the healthcare system of Moscow, prepared according to this method.

Over the course of the Moscow experiment, an independent external assessment of AI algorithms is provided in two stages (functional and calibration testing, respectively): at the first stage, relatively small datasets (up to five data units) are used to check the technical feasibility of reading and processing studies; at the second stage, medium-sized datasets are used (on average, from 100 to 200 data units) to compare the results of processing AI studies with a verified markup. In cases where, as a result of initial testing, the developer of an AI-based solution receives recommendations for finalizing their solution, it is possible to retest said solution on a different dataset.

An important part of the life cycle of a dataset in the post-publication phase is the scientific presentation of the work in relevant publications and manuscripts. One of the portals that support free placement of information on public datasets is medRxiv² ― service of preprints on biomedical topics. The advantage of the service is the absence of external peer review of publications, which allows the community to be informed about the results of their work as soon as possible. An example of a publication about a dataset on the medRxiv portal is presented in [17].

It should be noted that datasets generated by this method are successfully used by domestic and foreign research teams, as evidenced by recent publications [18, 19]. The use of the work product in practice confirms the timeliness and adequacy of the formulated approaches and methodology.

When the necessary changes are made, the technique can be fully or partially used not only for other areas in radiation diagnostics but also outside of it, in other areas of practical medicine, in which primary electronic information is accumulated in the course of medical activity (electroencephalograms, electrocardiograms, and other records of physiological signals, records from bedside resuscitation monitors, log records of modern laboratory equipment, such as chemical analyzers, etc.). In particular, the principles of formulating a clinical and/or practical problem, working with MIS for unloading initial data, general principles of marking and documenting in an experimental mode were successfully tested in the formation of a data set of electrocardiograms with signs of cardiovascular diseases. In the future, this technique can be included in the state standard, thereby ensuring the continuity and unification of medical datasets for teaching and testing AI technologies at the national level.

A hotly debated issue is the problem of depersonalization of medical data, especially the results of radiation studies. There is currently no generally accepted standard for anonymizing medical images. Professionals working with this kind of data must follow a sound logic to prevent the disclosure of the patient’s confidential medical information and personal data. It should be remembered that the results of a radiation study in and of themselves can serve as a source of personal data: for example, it is possible to reconstruct a three-dimensional image of the soft tissues of the facial skull from head cuts, which in turn makes it possible to sufficiently identify a person. Despite the absence of explicit legislative norms or standards for depersonalization in such situations, the author of the dataset should make the decision to remove the record of soft tissues of the head from the studies, starting from the clinical and/or practical task and continuing with the purpose of the dataset.

To maintain the growth rates of the market for AI technologies in medicine, one should, if possible, consider providing free access to datasets, subject to all the anonymization conditions described above. Portals such as arXiv (https://arxiv.org), medRxiv (https://medrxiv.org), and Zenodo (https://zenodo.org) are used to publish articles describing datasets. There are a large number of public repositories of open datasets, as well as integral search on them, for example, Google’s Dataset Search³. One of the ways to not only ensure legal access to datasets but also to attract attention from the AI developer community is to conduct online competitions among AI developers on platforms such as Kaggle⁴.

A promising direction of development is the use of “digital twins of the disease,” extensive sets of information about patients of various profiles (social, demographic, behavioral, etc.) for the formation of statistical signs characteristic of patients suffering from a specific disease. The use of such information can make it possible to create more representative medical datasets, including the widest range of signs and factors of the disease that are significant for the clinical and/or practical task. The basis for the creation of a “digital twin of a disease” is, first of all, the analysis and processing of impersonal information obtained from “digital twins of patients” containing the widest possible set of diverse information about a patient.

The approach presented in this article makes it possible to systematize and standardize the preparation of datasets and their life cycle for subsequent use in the testing of intelligent systems (including those based on AI) and registering tested systems for their further use in the healthcare sector. Such a step-by-step and detailed methodology for dataset formation will allow developers to objectively evaluate their products and regulators to ensure the objectivity and transparency of the assessment process using datasets created on the basis of the proposed methodology.

CONCLUSION

The task of forming sets of medical data for training and validating diagnostic systems based on AI technologies is gaining critical importance in connection with active development of this field. The original approaches described in the article can serve as a starting point for the creation of a full-fledged methodology for the preparation and standardization of medical datasets of various modalities and types of data. Moreover, they can be used to determine the conditions and factors necessary for the successful practical application of this methodology.

ADDITIONAL INFORMATION

Funding source. This study was not supported by any external sources of funding.

Competing interests. The authors declare that they have no competing interests.

Author contribution. N.A. Pavlov ― manuscript design and writing, development of an approach for dataset formation, formation of dataset examples; S.P. Morozov ― concept of research; A.E. Andreychenko ― study design, manuscript curation and editing; A.V. Vladzymyrskyy ― scientific rationale for dataset formation; A.A. Revazyan ― literature review, formation of dataset examples; Yu.S. Kirpichev ― formation of dataset examples. All authors made a substantial contribution to the conception of the work, acquisition, analysis, interpretation of data for the work, drafting and revising the work, final approval of the version to be published and agree to be accountable for all aspects of the work.

Footnotes

¹Morozov SP, Gonchar AP, Nikolaev AE, et al. MosMedData: results of ultralow-dose computed tomography studies with foci in the lungs (database). Certificate of state registration of the database No. 2020622727 from 21.12.2020

²Access: https://medrxiv.org. Date of circulation: 15.01.2021.

³Access: https://datasetsearch.research.google.com. Date of access: 15.01.2021.

⁴ Access: https://kaggle.com. Date of access: 15.01.2021