Artificial Intelligence and Data Mining in Healthcare: A Review of Current Processes and Tools
With the rapid increase in electronic open-source publications, understanding published unstructured textual data using traditional text mining approaches and tools is becoming increasingly challenging. The application of data mining techniques in the medical sciences is an emerging trend; however, traditional text-mining methods are insufficient to cope with the current surge in the volume of published data. Therefore, artificial intelligence-based text mining tools are being developed and used to process large volumes of data and to explore the hidden features and correlations within the data. This review provides a clear and insightful understanding of how artificial intelligence and data mining are being utilized in healthcare to analyze medical data. We also describe a standard process of data mining based on the Cross-Industry Standard Process for Data Mining (CRISP-DM) and the most common tools and libraries available for each step of medical data mining.
1. Introduction
The rapid growth in online medical literature makes it challenging for readers to obtain desired information without significant time investment. For example, during the COVID-19 pandemic, publications related to COVID-19 increased dramatically. In the first two years, databases like PubMed and PMC saw hundreds of thousands of articles, and clinical trials listed on ClinicalTrials.gov also surged. These data, often characterized by high heterogeneity, irregularity, and timeliness, are frequently underutilized. This exponential growth in scientific literature makes it difficult for researchers to (i) retrieve relevant information, (ii) present unstructured literature concisely, and (iii) fully grasp the current state and developmental direction of a research field.
Managing or processing this rapidly increasing literature within an acceptable timeframe is beyond the capabilities of traditional technologies and methods. The sheer volume of this data complicates exploration, analysis, visualization, and the extraction of concise outcomes. The process of extracting hidden, meaningful, and interesting patterns from unstructured text literature is known as text mining. Traditional text mining techniques are inadequate for handling the current large volumes of published literature. Consequently, there is a rapid increase in the development of new data mining techniques based on artificial intelligence, aimed at benefiting patients and physicians. The integration of artificial intelligence (which includes machine learning (ML), deep learning (DL), and natural language processing (NLP) as subsets) enhances the data mining process with multiple advantages: gaining new insights into decision-making, processing large datasets with improved accuracy and efficiency, and the ability to learn and continuously improve from new data. This review explores the role of different AI-based methods, such as NLP and neural networks (NN), in medical text mining, current data mining processes, database sources, and various AI-based tools and algorithms used in the text mining process. We review recent text mining approaches, highlight key differences between medical and non-medical data mining, and present tools and techniques currently used for medical literature text mining steps. Additionally, we discuss the past, present, and future role of artificial intelligence in healthcare and the role of machine learning in medical data mining, identifying associated challenges, difficulties, and opportunities.
1.1. Medical vs. Non-Medical Literature Text Mining
Human medical data is unique and presents challenges for mining and analysis. Humans, being the most studied species, provide rich sensory input [2]. However, medical data mining faces significant challenges, primarily due to the heterogeneity and verbosity of data from non-standardized patient records. Data quality issues are also prevalent in medical science and require careful handling for data mining. Standardizing patient selection, data collection, storage, annotation, and management processes can address these challenges [3]. However, this may limit the use of existing data or data collected at multiple centers without proper coordination and Standard Operating Procedures (SOPs). A major difference between medical and non-medical data mining lies in ethical and legal aspects. Using information traceable to individuals involves privacy risks and potential legal issues. Federal regulations, such as the Common Rule in the US, govern the protection of human subjects, though de-identified or anonymized information is typically not subject to this framework [4].
The ownership of medical data is another critical concern. Data is acquired by various entities where individuals receive treatment or diagnosis. These entities collect and store data based on patient authorization at the time of acquisition. However, patients can withdraw this consent, or the consent may be valid only for a limited period, requiring data erasure thereafter [5]. Clinical text, such as electronic patient records, poses a further problem: it is written in a telegraphic, information-rich style intended for communication between clinical staff and colleagues, often containing incomplete sentences and abbreviations, and its highly specialized language can be processed by only a few available tools [6]. There is no standard dictionary against which its grammar and spelling can be checked. Additionally, doctors and medical staff frequently use basic sentences and omit the patient as the subject, assuming it is implied (e.g., “Arrived with 38.3 fever and a pulse of 132”).
1.2. Use of Artificial Intelligence and Machine Learning in Medical Literature Data Mining
The digital era has shown increasing confidence in machine learning techniques to improve quality of life across almost every field. In healthcare and precision medicine, a continuous flow of medical data from heterogeneous sources is a key enabler for AI/ML-assisted treatments and diagnoses. Today, AI can help doctors achieve better patient outcomes through early diagnosis, treatment planning, and improved quality of life. Healthcare organizations and authorities also aim for timely AI execution for outbreak and pandemic prognosis at national and international levels. Healthcare is also seeing the use of AI-aided procedures for operational management, including automated documentation, appointment scheduling, and virtual assistance for patients. AI and ML tools are currently used across many areas of the medical sciences, as illustrated in Table 1.
Table 1.
AI/ML products and research prototypes from some leading organizations in healthcare.
Products/Research Prototypes | Treatment/Field of Study | Company/Institution | Reference |
---|---|---|---|
MergePACS™ | Clinical Radiology Imaging | IBM Watson | Merge PACS—Overview, IBM |
BiometryAssist™ | Diagnostic Ultrasound | Samsung Medison | https://www.intel.com/content/www/us/en/developer/tools/oneapi/application-catalog/full-catalog/diagnostic-ultrasound.html (accessed on 17 February 2022) |
LaborAssist™ | Diagnostic Ultrasound | Samsung Medison | |
Breast Cancer Detection Solution | Ultrasound, mammography, MRI | Huiying’s solution | https://builders.intel.com/ai/solutionscatalog/breast-cancer-detection-solution-657 (accessed on 17 February 2022) |
CT solution | Early detection of COVID-19 | Huiying’s solution | https://builders.intel.com/ai/solutionscatalog/ct-solution-for-early-detection-of-covid-19-704 (accessed on 17 February 2022) |
Dr. Pecker CT Pneumonia CAD System | Classification and quantification of COVID-19 | Jianpei Technology | https://www.intel.com/content/www/us/en/developer/tools/oneapi/application-catalog/full-catalog/dr-pecker-ct-pneumonia-cad-system.html (accessed on 17 February 2022) |
Before delving deeper, it is important to note that data mining and machine learning are closely related concepts that overlap to some extent, but differ clearly in their overall outcomes. Data mining is the process of discovering correlations, anomalies, and new patterns in large datasets from experiments or events in order to predict results [7]. It relies on statistical modeling techniques to represent data mathematically and then uses these models to establish relationships and patterns among variables. Machine learning, conversely, is an advancement of data mining in which ML algorithms enable computers to understand data (using statistical models) and make their own predictions. Data mining techniques always require human interaction to find interesting patterns, whereas machine learning is a more modern technique that allows computer programs to learn from data automatically and provide predictions without human intervention. Keeping these definitions distinct is foundational to the discussion of AI technologies that follows.
Natural Language Processing
Natural Language Processing (NLP) is a field within artificial intelligence that converts human language into a machine-readable format. With increased computer technology usage over the past two decades, this field has grown significantly [8]. Popular applications of NLP in healthcare include clinical documentation, speech recognition, computer-assisted coding, data mining research, automated registry reporting, clinical decision support, clinical trial matching, prior authorization, AI chatbots and virtual scribes, risk adjustment models, computational phenotyping, review management and sentiment analysis, dictation and EMR implementations, and root cause analysis [9]. The literature demonstrates a wide range of NLP applications.
Liu et al. [10] applied word embedding (WE)-skipgram and long short-term memory (LSTM) techniques to clinical text for entity recognition, achieving high accuracy for de-identification, event detection, and concept extraction. Deng et al. [11] used concept embedding (CE)–continuous bag of words (CBOW), skip-gram, and random projection to generate code and semantic representations from clinical text. Afzal et al. [12] developed a pipeline for question generation, evidence quality recognition, ranking, and summarization from biomedical literature, achieving 90.97% accuracy. Pandey et al. [13] listed numerous papers published between 2017 and 2019 that utilized NLP techniques across various text sources, including clinical text, EHR inputs, cancer pathology reports, biomedical text, and EMR text-radiology reports, among others.
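The word-embedding techniques used in these studies learn vector representations from (target, context) word pairs. The following toy sketch (not the cited authors' implementation; the clinical sentence is hypothetical) shows how skip-gram training pairs are generated from a tokenized sentence before being fed to an embedding model:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs within a +/- window of each token."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical clinical sentence; real systems train on large corpora,
# typically via a library such as Gensim.
sentence = "patient presented with acute chest pain".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs[:3])
# → [('patient', 'presented'), ('presented', 'patient'), ('presented', 'with')]
```

The CBOW variant used by Deng et al. inverts this relationship, predicting the target word from its surrounding context words instead.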
2. Standard Process for Data Mining
In response to the need for a standardized data mining method, industry leaders collaborated with practitioners and experts to develop the free, well-documented, and non-proprietary Cross-Industry Standard Process for Data Mining (CRISP-DM) model [14]. While other methods exist (e.g., ASUM, KDD, SEMMA), CRISP-DM is a complete and comprehensive approach. The CRISP-DM consortium developed this generic process model in 1997 to guide beginners and experts, adaptable for specific needs. For instance, it was modified to CRISP-TDM to handle multidimensional time-series data in neonatal intensive care units (NICU) [16]. The CRISP-DM reference model comprises six phases in a data mining lifecycle: Business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Figure 1 illustrates the cyclical nature and dependencies of these phases. The following sections detail available tools and technologies for each phase in medical data mining.
Figure 1.
Diagram illustrating the Cross-Industry Standard Process for Data Mining (CRISP-DM) in healthcare data analysis.
Cross-Industry Standard Process for Data Mining (CRISP-DM)—adapted from the webpage of the Data Science Process Alliance [17] (www.datascience-pm.com/crisp-dm-2/, accessed on 16 April 2022). The circular nature of the data mining process is symbolized by the outer circle, while the arrows that connect the phases show the most essential and common dependencies.
2.1. Business Understanding
The first and most critical part of data mining is business understanding, which involves setting project objectives, targets, assessing the situation, planning execution, and evaluating risks [14]. Setting objectives requires fully grasping the project’s genuine goal to define associated variables. The steps in the business understanding phase of CRISP-DM are: (1) Determine business objectives (comprehend project goal, identify stakeholders, establish success criteria), (2) Assess the situation (identify resources, project risks, solutions, cost-benefit ratio), (3) Clarify data mining goals (establish project goals and success criteria), (4) Produce a project plan (develop detailed plans, timeline, technology/tool selection).
For example, Martins et al. [18] used a data mining approach with RapidMiner and Weka software to predict cardiovascular diseases. The project’s main question was how to detect cardiovascular disease early in high-risk individuals to prevent premature death. Thus, the goals were to create a solution for predicting cardiovascular diseases using patient data, shorten diagnosis time, and facilitate immediate, adequate treatment.
2.2. Data Understanding
This second phase, according to CRISP-DM, focuses on identifying data sources, acquiring data, initial data collection, familiarizing with the data, and identifying data problems [14]. The steps are: (1) Acquire initial data (gather from sources, load into analysis program, integrate), (2) Describe data (study surface properties: field identities, format, quantity, record count, etc.), (3) Explore data (query, visualize, identify relationships, generate report), and (4) Verify data quality (inspect and document quality and issues). In this phase, focus is on identifying data sources, acquisition processes, and handling access restrictions. The healthcare industry and medical institutions generate vast amounts of data daily from imaging, patient monitoring, and records [7]. Common types include experimental data, medical literature, clinical text, medical records, images/videos, and omics data. For instance, Martins et al. [18] used a dataset from Kaggle for cardiovascular disease prediction, including data from 70,000 patients with 12 attributes from medical examinations.
2.2.1. Literature Extraction/Data Gathering
Identifying data sources, acquiring data, and addressing acquisition problems like restrictions and privacy policies are key initial tasks in data understanding [14]. Text/data mining often utilizes public internet sources like the World Wide Web, a process known as “web scraping” or “web crawling”. While manual scraping is possible, automatic methods with web crawlers are essential for large databases like PubMed, which contain millions of publications and add new ones yearly [19]. Automated processing provides necessary quality, response time, and homogeneity for analysis. For example, Guo et al. [20] collected COVID-19 data using a Python-based web crawler connected to a MySQL database.
Web scraping and web crawling, though sometimes used interchangeably, are distinct processes [21],[22]. Web crawling broadly means downloading a website's information, extracting its hyperlinks, and following them (Figure 2); the downloaded information is saved or indexed for searching. Search engines are essentially crawlers: a bot scans each page and follows every link to the end, indexing entire pages. Crawlers are mainly used by major search engines and aggregators to collect general information, while scrapers collect specific datasets [23],[24]. Web scraping (Figure 2) extracts specific data from a webpage, which can then be saved anywhere. A scraper is similar to a crawler in how it locates content, but it targets specific identifiers, such as the page's HTML structure, rather than following links broadly. Scraping uses bots to extract specific datasets for comparison, checking, and analysis based on organizational demands and objectives [25].
Figure 2.
Visual comparison showing the distinct processes of web crawling and web scraping for collecting healthcare information.
Comparison between web crawling and web scraping.
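The crawling process in Figure 2 can be sketched as a breadth-first traversal of hyperlinks. The snippet below runs over a mocked-up, in-memory link graph (the URLs are placeholders), whereas a production crawler would fetch each URL over HTTP and parse the hyperlinks out of the returned HTML:

```python
from collections import deque

# Mocked-up link graph standing in for real pages and their hyperlinks.
MOCK_LINKS = {
    "https://example.org/": ["https://example.org/a", "https://example.org/b"],
    "https://example.org/a": ["https://example.org/b"],
    "https://example.org/b": [],
}

def crawl(seed, link_graph, max_pages=100):
    """Visit pages breadth-first, following hyperlinks and skipping duplicates."""
    seen, queue, visited = {seed}, deque([seed]), []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # a real crawler would download and index the page here
        for link in link_graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

print(crawl("https://example.org/", MOCK_LINKS))
# → ['https://example.org/', 'https://example.org/a', 'https://example.org/b']
```

The `seen` set is what keeps a crawler from revisiting pages, and `max_pages` bounds the traversal, two concerns every real crawler must handle.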
Several text mining tools are available. Kaur and Chopra [26] compared 55 popular tools, categorizing them as proprietary (39), open source (13), and online (3). Table 2 contrasts four currently popular Python-based tools not examined in that review. All serve the broad purpose of collecting web data, but each is suited to different tasks: 'Requests' is user-friendly for simple scraping; Scrapy suits large-scale projects, whereas 'Requests', 'Beautiful Soup', and 'Selenium' fit small-scale tasks; 'Beautiful Soup' is easy to learn and use and handles disorganized sites; and Selenium excels at scraping JavaScript-heavy websites. Table 2 provides detailed comparisons.
Table 2.
Comparison between four text mining tools.
| Requests | Scrapy | Beautiful Soup | Selenium |
---|---|---|---|---|
What is it? | HTTP library for Python | Open-source web framework written in Python | Python library | Browser automation framework |
Goal | Sending HTTP/1.1 requests using Python | – Can crawl or scrape websites, extract the structured data, and save it – Can also be used for a wide range of tasks, monitoring, and automated testing | – Can parse the data and scrape the web pages – Extract information from XML and HTML documents | – |
Ideal usage | Used for simple, low-complexity web scraping tasks | – Framework used for complex web scraping or web crawling tasks – Used for large-scale projects | – Used for smaller web scraping tasks – Toolkit for searching through a document (XML or HTML) and extracting important information | – |
Advantage | – A simple way to retrieve data from a URL – Scraping data from the web – Allows reading, writing, posting, deleting, and updating data for the given URL – Extremely easy to deal with cookies and sessions | – Portable library – Runs on Linux, Windows, and Mac – One of the faster scraping libraries – Can extract websites much faster than other tools – Consumes less memory and CPU – Supports building robust, flexible applications with different functions | – Learning and mastering it is easy – Community support is readily available to resolve issues | – |
Selectors | None | CSS and XPath | CSS | – |
Documentation | Detailed and simple to understand | Detailed and simple to understand | Detailed and simple to understand | – |
GitHub stars | 46.8 k | 42.7 k | – | – |
Reference | Chandra and Varanasi [27] | Kouzis-Loukas [28] | Richardson [29] | – |
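As a minimal illustration of the scraping workflow these tools support, the following sketch parses a hypothetical results page with Beautiful Soup; in practice the HTML string would come from a live request (e.g., `requests.get(url, timeout=10).text`), and the element/class names would depend on the target site:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical snippet of a search-results page; in a real scraper this
# HTML would be fetched with the Requests library.
html = """
<div class="results">
  <article><h2>Deep learning for radiology</h2><a href="/article/1">link</a></article>
  <article><h2>NLP for clinical notes</h2><a href="/article/2">link</a></article>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
titles = [h2.get_text() for h2 in soup.select("article h2")]
links = [a["href"] for a in soup.select("article a")]
print(titles)  # → ['Deep learning for radiology', 'NLP for clinical notes']
print(links)   # → ['/article/1', '/article/2']
```

The CSS selectors passed to `select()` are exactly the "specific identifiers" that distinguish a scraper from a general-purpose crawler.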
Access Restriction
Some pages or entire websites impose access restrictions when a web crawler visits. These restrictions exist mainly for data confidentiality, integrity, and quality, as well as for legal reasons; crawlers performing multiple requests per second can also overwhelm servers. Methods such as canonical tags, robots.txt files, the x-robots-tag header, and meta robots tags guide scrapers. A robots.txt file tells scraping bots which parts of a site they may crawl, although malicious bots simply ignore this "do not enter" sign, as explained in Figure 3.
Figure 3.
Diagram illustrating website access restrictions, such as robots.txt files, relevant to data mining medical literature.
Layout for access restrictions.
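Python's standard library can check a robots.txt policy before crawling. The sketch below parses an in-line example policy; a real crawler would instead load the file from the target site with `rp.set_url("https://example.org/robots.txt")` followed by `rp.read()`:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt rules supplied in-line for illustration.
rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
Allow: /
""".splitlines())

print(rp.can_fetch("MyCrawler", "https://example.org/articles/1"))  # → True
print(rp.can_fetch("MyCrawler", "https://example.org/private/x"))   # → False
```

A well-behaved crawler consults `can_fetch()` for every URL before requesting it; as the text notes, nothing technically prevents a malicious bot from skipping this check.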
Data Collection from Different Sources
Medical data generation is accelerating rapidly due to the information explosion in healthcare [31],[32]. Medical data includes administrative records, biometric data, clinical registrations, diagnostics, X-rays, electronic health records, patient reports, treatments, results, etc. Its massive and complex nature makes handling difficult for meaningful outcomes. Healthcare centers propose systems to manage growing data and provide services [32]. Management software is a common method for collecting and storing records (e.g., eHospital Systems, DocPulse).
For text mining, data collection is key. In medical science, various data types and trends emerge rapidly, classifiable into five categories:
- Hospital management software (Patient data/Clinical narratives).
- Clinical trials.
- Research data in Medicine.
- Publication platforms for Medicine (PubMed, etc.).
- Pharmaceuticals and regulatory data.
Tables 3, 4, and 5 provide details on data sources. Patient data from clinical trials is available from sources in Table 3. Open-access databases benefit researchers with large volumes, rich content, broad coverage, and cost-effectiveness. Many public datasets exist for various medical fields with numerous record variables (Table 4). Textual information grows rapidly, making concise extraction difficult. Published literature is the most abundant source of textual info in healthcare (Table 5).
Table 3.
Databases and registries for clinical trials.
Databases/Registries | Trial Numbers | Provided by | Location | Founded Year | URL |
---|---|---|---|---|---|
ClinicalTrials.gov | 405,612 | U.S. National Library of Medicine | Bethesda, MD, USA | 1997 | https://clinicaltrials.gov/ (accessed on 11 April 2022) |
Cochrane Central Register of Controlled Trials (CENTRAL) | 1,854,672 | a component of Cochrane Library | London, UK | 1996 | https://www.cochranelibrary.com/central (accessed on 11 April 2022) |
WHO International Clinical Trials Registry Platform (ICTRP) | 353,502 | World Health Organization | Geneva, Switzerland | – | https://trialsearch.who.int/ (accessed on 11 April 2022) |
The European Union Clinical Trials Database | 60,321 | European Medicines Agency | Amsterdam, The Netherlands | 2004 | https://www.clinicaltrialsregister.eu/ctr-search/search (accessed on 11 April 2022) |
CenterWatch | 50,112 | – | Boston, MA, USA | 1994 | http://www.centerwatch.com/clinical-trials/listings/ (accessed on 11 April 2022) |
German Clinical Trials Register (Deutsches Register Klinischer Studien—DRKS) | >13,000 | Federal Institute for Drugs and Medical Devices | Cologne, Germany | – | https://www.bfarm.de/EN/BfArM/Tasks/German-Clinical-Trials-Register/_node.html (accessed on 11 April 2022) |
Table 4.
Research data in Medicine.
Databases | No. of Datasets | Owned by | Domains | Available Resources | URL | Ref |
---|---|---|---|---|---|---|
Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC) | 262 | National Institute of Health, Calverton, MD, USA | Cardiovascular, pulmonary, and hematological | Specimens and Study Datasets | https://biolincc.nhlbi.nih.gov/studies/ (accessed on 4 April 2022) | [[33]](#B33-jpm-12-01359) |
Biomedical Translational Research Information System (BTRIS) | Five billion rows of data | National Institutes of Health, Bethesda, MD, USA | Multiple subjects | Study Datasets | https://btris.nih.gov/ (accessed on 4 April 2022) | [[34]](#B34-jpm-12-01359) |
Clinical Data Study Request | 3135 | The consortium of clinical study Sponsors | Multiple subjects | Study Datasets | https://www.clinicalstudydatarequest.com/ (accessed on 4 April 2022) | [[35]](#B35-jpm-12-01359) |
Surveillance, Epidemiology, and End Results (SEER) | – | National Cancer Institute, Bethesda, MD, USA | Cancer (All types)—Stage and histological details | Study Datasets | https://seer.cancer.gov/ (accessed on 4 April 2022) | [[36]](#B36-jpm-12-01359) |
Medical Information Mart for Intensive Care (MIMIC) MIMIC-III | 53,423 patients | MIT Laboratory for Computational Physiology, Cambridge, MA, USA | Intensive Care | Patient data (vital signs, medications, laboratory measurements, observations and notes charted by care providers, survival data, hospital length of stay, imaging reports, diagnostic codes, procedure codes, and fluid balance) | https://mimic.mit.edu/ (accessed on 4 April 2022) | [[37]](#B37-jpm-12-01359),[[38]](#B38-jpm-12-01359) |
MIMIC-CXR | 65,379 patients (377,110 images of chest radiographs) | [[39]](#B39-jpm-12-01359) | ||||
National Health and Nutrition Examination Survey (NHANES) | – | Centers for disease control and prevention, Hyattsville, MD, USA | Dietary assessment and other nutrition surveillance | data nutritional status, dietary intake, anthropometric measurements, laboratory tests, biospecimens, and clinical findings. | https://www.cdc.gov/nchs/nhanes/index.htm (accessed on 4 April 2022) | [[40]](#B40-jpm-12-01359) |
Global Burden of Disease (GBDx) | – | Institute for Health Metrics and Evaluation, Seattle, WA, USA | Epidemic patterns and disease burden | Surveys, censuses, vital statistics, and other health-related data | https://ghdx.healthdata.org/ (accessed on 4 April 2022) | [[41]](#B41-jpm-12-01359) |
UK Biobank (UKB) | 0.5 million | Stockport, UK | In-depth genetic and health information | Genetic, biospecimens, and health data | https://www.ukbiobank.ac.uk/ (accessed on 4 April 2022) | [[42]](#B42-jpm-12-01359) |
The Cancer Genome Atlas (TCGA) | molecularly characterized over 20,000 cancer samples spanning 33 cancer types | National Cancer Institute, NIH, Bethesda, MD, USA | Cancer genomics | over 2.5 petabytes of epigenomic, proteomic, transcriptomic, and genomic data | https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga (accessed on 4 April 2022) | [[43]](#B43-jpm-12-01359) |
Gene Expression Omnibus (GEO) | 4,981,280 samples | National Center for Biotechnology Information (NCBI), NIH, Bethesda, MD, USA | Sequencing and gene expression | 4348 datasets available | https://www.ncbi.nlm.nih.gov/geo/ (accessed on 4 April 2022) | [[44]](#B44-jpm-12-01359) |
Table 5.
Biomedical literature sources.
Source | Articles (Million) | Launched by | Publication Type | Topic | Online | Link |
---|---|---|---|---|---|---|
PubMed | 33 | National Center for Biotechnology Information (NCBI) | Abstracts | Biomedical and life sciences | 1996 | https://www.ncbi.nlm.nih.gov/pubmed/ (accessed on 4 April 2022) |
PubMed Central (PMC) | 7.6 | National Center for Biotechnology Information (NCBI) | Full text | Biomedical and life sciences | 2000 | https://www.ncbi.nlm.nih.gov/pmc/ (accessed on 4 April 2022) |
Cochrane Library | – | Cochrane | Abstracts and full text | Healthcare | – | https://www.cochranelibrary.com/search (accessed on 4 April 2022) |
bioRxiv | – | Cold Spring Harbor Laboratory (CSHL) | Unpublished preprints | Biological sciences | 2013 | https://www.biorxiv.org/ (accessed on 4 April 2022) |
medRxiv | – | Cold Spring Harbor Laboratory (CSHL) | Unpublished manuscripts | Health sciences | 2019 | https://www.medrxiv.org/ (accessed on 4 April 2022) |
arXiv | 2.05 | Cornell Tech | Non-peer-reviewed | Multidisciplinary | 1991 | https://arxiv.org/ (accessed on 4 April 2022) |
Google Scholar | 100 (in 2014) | Google | Full text or metadata | Multidisciplinary | 2004 | https://scholar.google.com/ (accessed on 4 April 2022) |
Semantic Scholar | 205.25 | Allen Institute for Artificial Intelligence | Abstracts and full text | Multidisciplinary | 2015 | https://www.semanticscholar.org/ (accessed on 4 April 2022) |
Elsevier | 17 (as of 2018) | Elsevier | Abstracts and full text | Multidisciplinary | 1880 | https://www.elsevier.com/ (accessed on 4 April 2022) |
Springer Nature | – | Springer Nature Group | Abstracts and full text | Multidisciplinary | 2015 | https://www.springernature.com/ (accessed on 4 April 2022) |
Springer | – | Springer Nature | Abstracts and full text | Multidisciplinary | 1842 | https://link.springer.com/ (accessed on 4 April 2022) |
2.3. Data Preparation
The third phase (data preparation) of CRISP-DM involves creating the final dataset from raw data for the modeling tool. This phase constitutes the majority (approx. 80%) of a text/data mining project. Steps include: (1) Data selection (choose dataset/attributes based on project goals, quality, type, volume), (2) Data cleaning (handle missing data, correct/impute/remove errors), (3) Data construction (create derived/new attributes/records, transform data), (4) Data integration (combine data from sources), (5) Data formatting (remove inappropriate characters, change format for modeling compatibility) [14].
2.3.1. Data Cleaning/Data Transformation
The primary goal of data cleaning is to identify and remove duplicate or erroneous data to create a reliable dataset. Cleaning involves locating and removing corrupt, incorrect, duplicated, incomplete, or improperly formatted entries (Figure 4). Data cleaning is necessary for analyzing information from multiple sources [45],[46],[47].
Figure 4.
Steps for data cleaning.
Various tools and Python libraries for data cleaning are discussed in the following sections. After cleaning, data is transformed into the proper format (Excel, JSON, XML). Data transformation simplifies preprocessing and makes data more structured and organized, improving usability for humans and computers and integration into systems [46]. Relevant tools include:
- GROBID (GeneRation Of BIbliographic Data): A machine-learning library for extracting metadata from PDF-formatted technical and scientific documents. It aims to reconstruct the logical structure of documents to support advanced digital library processes and text analysis. GROBID uses ML models and connects to services such as ResearchGate and Mendeley. Its output is the PDF transformed into XML TEI format, supplemented with online information [51],[52].
In summary, data cleansing improves dataset consistency, while transformation simplifies data processing. Both enhance the training dataset quality for model construction.
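The cleaning and transformation steps above can be sketched with pandas on a few hypothetical patient records: duplicate removal, imputation of a missing value with the column mean, and conversion to a JSON-ready structure (field names and values are illustrative only):

```python
import pandas as pd

# Hypothetical patient records containing one duplicated row and one
# missing measurement.
raw = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "temp_c":     [38.3, 37.1, 37.1, None],
    "diagnosis":  ["fever", "healthy", "healthy", "fever"],
})

clean = (raw.drop_duplicates()                      # remove duplicated records
            .assign(temp_c=lambda d: d["temp_c"]
                    .fillna(d["temp_c"].mean()))    # impute the missing value
            .reset_index(drop=True))

records = clean.to_dict(orient="records")           # transform to a JSON-ready structure
print(records)
```

In a real project the imputation strategy (mean, median, carry-forward, or removal) should be chosen per variable; `clean.to_json(...)` or `clean.to_xml(...)` would likewise produce the JSON or XML formats mentioned above.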
2.3.2. Feature Engineering
Feature engineering, or feature extraction, is the process of choosing, modifying, and converting raw data into features usable in supervised learning. This ML technique creates new variables not in the training set. It streamlines and accelerates data transformations, improves model accuracy, and generates features for both supervised and unsupervised learning. Feature engineering is essential for ML models; poor features directly impact the model. Numerous tools automate this process, generating many features quickly for classification and regression (e.g., FeatureTools, AutoFeat, TsFresh) [54],[55].
Vijithananda et al. [56] extracted features from brain tumor MRI ADC images of 195 patients, including demographic and Grey Level Co-occurrence Matrix (GLCM) features. GLCM homogeneity and skewness were excluded based on ANOVA f-test. A Random Forest classifier outperformed others (Decision Trees, Naive Bayes, etc.) and was chosen, achieving 90.41% accuracy in predicting malignant/benign neoplasms.
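The general workflow described by Vijithananda et al., ANOVA F-test feature selection followed by a Random Forest classifier, can be sketched with scikit-learn. Note that the synthetic data below merely stands in for extracted image features; this is not the authors' pipeline, data, or parameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a table of extracted features (e.g., GLCM statistics).
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

# Keep the 4 features with the highest ANOVA F-scores, discarding the rest.
X_sel = SelectKBest(f_classif, k=4).fit_transform(X, y)

X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.25,
                                          random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
```

Comparing several classifiers on the same selected features, as the authors did with Decision Trees and Naive Bayes, is a matter of swapping the estimator in the final step.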
2.3.3. Searching for Keywords
Keyword extraction identifies keywords or key phrases from text documents that describe the topic. Several automated methods exist, selecting frequently used and significant words or phrases. This falls under natural language processing, important for machine learning and artificial intelligence [57]. Extractors find words or groups of words (phrases).
FlashText is a free, open-source Python package for keyword search and replacement, based on the Aho-Corasick algorithm and a trie data structure [58]. Traditional keyword matching scans the corpus once per search term, which becomes computationally intensive for large keyword lists; FlashText instead matches all keywords in a single pass. Table 6 compares four Python-based tools for keyword and phrase extraction, highlighting their features, benefits, and supported NLP tasks.
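A simplified, pure-Python sketch of the single-pass, dictionary-based matching idea behind FlashText (the library itself exposes a `KeywordProcessor` class with `add_keyword` and `extract_keywords` methods); the clinical keywords here are illustrative:

```python
def extract_keywords(text, keywords):
    """Return known keywords/phrases found in text, scanning it only once."""
    vocab = {k.lower() for k in keywords}
    longest = max((len(k.split()) for k in vocab), default=1)
    tokens = text.lower().split()
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then progressively shorter ones.
        for n in range(min(longest, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in vocab:
                found.append(phrase)
                i += n
                break
        else:
            i += 1
    return found

text = "Patient with chronic kidney disease and hypertension on dialysis"
print(extract_keywords(text, ["chronic kidney disease", "hypertension", "dialysis"]))
# → ['chronic kidney disease', 'hypertension', 'dialysis']
```

Because the corpus is traversed once regardless of how many keywords are registered, runtime stays roughly linear in the text length, which is the property that makes this approach attractive for large document collections.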
Table 6.
Searching for relevant content.
Natural Language Toolkit | SpaCy | Scikit-Learn NLP Toolkit | Gensim | |
---|---|---|---|---|
What is it? | open-source python platform for handling human language data | open-source python library for advanced natural language processing | machine learning software library for the Python programming language | fastest python library for the training of vector embedding |
Features | – | – | – Based on NumPy, SciPy, and Matplotlib – An easy and efficient way to analyze predictive data – Easily accessible and reusable in different contexts | – Provides ready-to-use models and corpora – Models pre-trained for specific areas such as health care – Processes large amounts of data using streaming data |
Advantage | – Most well-known and comprehensive NLP libraries with many extensions – offers support in the largest number of languages | – easy to use – fully integrated with Python – compatible with other deep learning frameworks – many already trained statistical models available – applicable to many different languages – high speed and performance – freely available – able to process long texts – platform-independent usable | – Simple and efficient tools for machine learning, data mining, and data analysis – Freely available for everyone – Applicable to different application areas, like natural language processing | – Provides ready-to-use models and corpora – Models pre-trained for specific areas such as health care – Processes large amounts of data using streaming data |
NLP Tasks | – Classification – Tokenization – Stemming – Tagging – Parsing | – Classification – Tokenization – Stemming – Tagging – Parsing – Named Entity recognition – Sentiment Analysis | – Classification – Topic Modeling – Sentiment Analysis | – Text similarity – Text summarization – Topic Modeling |
GitHub stars | 10.4 k | 22.4 k | 49 k | 12.9 k |
Website | nltk.org (accessed on 16 March 2022) | spacy.io (accessed on 16 March 2022) | scikit-learn.org (accessed on 16 March 2022) | radimrehurek.com/gensim/ (accessed on 16 March 2022) |
Reference | Bird et al. [59] | Honnibal [60] | Pedregosa et al. [61], Pinto et al. [62] | Rehurek and Sojka [63] |
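The single-pass trie matching that FlashText implements can be illustrated with a small pure-Python sketch (this shows the idea only, not the library's actual implementation; the keywords and sample text are made up):

```python
import re

class TrieKeywordMatcher:
    """Single-pass multi-keyword matcher over a trie of word sequences."""
    def __init__(self, keywords):
        self.root = {}
        for kw in keywords:
            node = self.root
            for word in kw.lower().split():
                node = node.setdefault(word, {})
            node["_end_"] = kw            # marks a complete keyword

    def extract(self, text):
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        found, i = [], 0
        while i < len(tokens):            # corpus is scanned only once,
            node, j, match = self.root, i, None   # whatever the keyword count
            while j < len(tokens) and tokens[j] in node:
                node = node[tokens[j]]
                j += 1
                if "_end_" in node:
                    match = (node["_end_"], j)    # remember longest match
            if match:
                found.append(match[0])
                i = match[1]              # continue past the match
            else:
                i += 1
        return found

matcher = TrieKeywordMatcher(["brain tumor", "MRI", "random forest"])
matcher.extract("An MRI scan revealed a brain tumor; a random forest model was used.")
# → ['MRI', 'brain tumor', 'random forest']
```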
2.4. Modeling
In the fourth phase (modeling) of CRISP-DM, various modeling techniques are tested and calibrated by adjusting parameters to achieve optimal results [14]. Steps include: (1) Choose modeling technique (select models/algorithms), (2) Create test designs (evaluate model quality/validity), (3) Build models (use tool to build from dataset, adjust parameters, describe model), and (4) Evaluate models (explain outcomes based on subject knowledge, criteria, test design; rank models, adjust parameters if needed).
Model selection depends on the purpose (e.g., forecasting) and the data type (structured or unstructured). A model summarizes a dataset in terms of patterns and statistics, and models fall into predictive and descriptive categories. Descriptive models identify human-interpretable patterns; predictive models use known results to forecast unknown or future values. Predictive tasks include classification, prediction, regression, and time series analysis; descriptive tasks include clustering, association rules, sequence discovery, and summarization (Figure 5). Algorithm selection depends on whether the dependent variables are labeled: supervised learning (e.g., Decision Trees, Random Forest, SVM) is used for labeled data, and unsupervised learning (e.g., clustering, PCA) for unlabeled data [64],[65].
Figure 5.
Overview illustrating the categories of predictive and descriptive data mining tasks used in analyzing healthcare data.
Predictive and descriptive data mining tasks.
The dataset type is the main distinction between supervised and unsupervised ML. Supervised learning trains on labeled input/output data under external supervision, aiming to predict outcomes for new data. Unsupervised learning works on unlabeled data without supervision, aiming to uncover hidden patterns and insights in large data volumes. Supervised models are generally simpler; unsupervised models require large datasets and are computationally more complex. Applications of supervised learning include diagnosis, fraud detection, and image classification; applications of unsupervised learning include anomaly detection, big data visualization, and recommender systems [64],[66].
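The contrast between the two paradigms can be made concrete with a tiny example: the same one-dimensional measurements (hypothetical tumor sizes in cm) handled first with labels, then without (a sketch only):

```python
# --- supervised: nearest-centroid classifier fit on labeled data ---
labeled = [(1.1, "benign"), (1.3, "benign"), (4.0, "malignant"), (4.4, "malignant")]
centroids = {}
for label in {lbl for _, lbl in labeled}:
    vals = [x for x, lbl in labeled if lbl == label]
    centroids[label] = sum(vals) / len(vals)     # per-class mean

def classify(x):
    # predict the class whose centroid is nearest
    return min(centroids, key=lambda lbl: abs(x - centroids[lbl]))

classify(1.0)   # -> "benign"
classify(3.8)   # -> "malignant"

# --- unsupervised: 2-means clustering on the same values, no labels ---
values = [x for x, _ in labeled]
c1, c2 = min(values), max(values)        # crude initialization
for _ in range(10):                      # Lloyd iterations
    g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
    g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
    c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
# the two clusters recover the benign/malignant grouping without any labels
```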
As an example, Zowalla et al. [67] assessed a web crawler (StormCrawler) for acquiring health-related web content from the German Health Web. They trained an SVM classifier to distinguish health-related pages using a dataset drawn from this web, testing the model on an 80/20 train/test split and against a crowd-validated dataset. In a cardiovascular disease prediction study, a decision tree performed best among nine techniques, including deep learning, k-NN, and random forest, after the parameters were optimized [18].
2.5. Data Model Validation and Testing
This step validates and tests the selected model. Validation ensures the model is accurate for its intended use [68]. Model validation is important because a model that fits the training data well is not guaranteed to be accurate on unseen data. Validation involves predicting outcomes for scenarios outside the training set and computing statistical measures of fit. Testing then uses a held-out test set to compare accuracy with the validation results; a model is considered ready when its performance on the test data shows a satisfactory statistical match. Dong et al. [69] used training (194 samples) and test (58 samples) datasets of MS raw data to classify tumor and non-tumor samples. Comparing CNN, GBDT, SVM, PCA+SVM, LR, and RF models, they found that the CNN achieved the highest accuracy. Tools for ML model validation include Apache Spark, Python, R, RapidMiner, and Tableau.
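The hold-out evaluation pattern described above (e.g., an 80/20 split) can be sketched in a few lines of plain Python; the dataset and the stand-in model below are hypothetical:

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=42):
    """Shuffle and split a dataset into train and test portions."""
    rng = random.Random(seed)            # fixed seed for reproducibility
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

def accuracy(model, test_set):
    """Fraction of held-out samples the model predicts correctly."""
    correct = sum(1 for x, y in test_set if model(x) == y)
    return correct / len(test_set)

# hypothetical dataset of (feature, label) pairs
data = [(i, "tumor" if i >= 50 else "non-tumor") for i in range(100)]
train, test = train_test_split(data)
model = lambda x: "tumor" if x >= 50 else "non-tumor"   # stand-in model
acc = accuracy(model, test)   # a perfect stand-in scores 1.0 here
```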
2.6. Evaluation
In the fifth phase (evaluation) of CRISP-DM, a thorough review assesses if the model achieves business objectives [14]. Steps: (1) Assess outcomes (evaluate goal achievement, identify constraints, present final statement), (2) Review process (in-depth project review, quality assurance), and (3) Decide next steps (determine deployment or improvement changes).
After text data analysis, visualizing data meaningfully is key for interpretation and communication. Text visualization uses charts, graphs, maps, timelines, networks, word clouds, etc., to highlight important aspects of large information volumes. Tools make identifying patterns, outliers, trends, and insights straightforward. Effective data visualization offers easy understanding, prompt decision-making, and higher engagement than other methods. Principles for successful visualization: (1) Select style based on purpose, (2) Choose style appropriate for audience, (3) Accompany with effective design [70]. Style selection depends on data and aim. Line/bar charts compare data points. Diverse styles exist (typographic, graph, chart, 3D, maps). Table 7 lists styles and exemplary tools.
Table 7.
Data visualization style with exemplary tools.
Visualization Style | Tool [Reference] |
---|---|
Text marking/highlighting | cite2vec [71], TopicLens [72], SurVis [73], Poemage [74], Overview [75] |
Tags or word cloud | SentenTree [76], InfoVis [77], VisOHC [78], IncreSTS [79], Word storms [80] |
Bar charts | TextTile [81], SentiCompass [82], NewsViews [83], WeiboEvents [84], CatStream [85] |
Scatterplot | PhenoLines [86], SocialBrands [87], TopicPanorama [88], #FluxFlow [89], PEARL [90] |
Line chart | Vispubdata.org [91], GameFlow [92], MultiConVis [93], Contextifier [94], Google+Ripples [95] |
Node-link | NEREx [96], iForum [97], NameClarifier [98], DIA2 [99], Information Cartography [100] |
Tree | OpinionFlow [101], Rule-based Visual Mappings [102], HierarchicalTopics [103], Whisper [104], The World’s Languages Explorer [105] |
Matrix | Interactive Ambiguity Resolution [106], Fingerprint Matrices [107], Conceptual recurrence plots [108], The Deshredder [109], Termite [110] |
Stream graph timeline | VAiRoma [111], CiteRivers [112], ThemeDelta [113], EvoRiver [114], LeadLine [115] |
Flow timeline | TimeLineCurator [116], Interactive visual profiling [117] |
Radial visualization | ConToVi [118], ConVis [119] |
3D visualization | Two-stage Framework [120] |
Maps/Geo chart | Can Twitter save lives? [121], Visualizing Dynamic Data with Maps [122], Spatiotemporal Anomaly Detection [123] |
Beyond these, powerful software packages visualize data, such as Microsoft Excel (PivotTables), R, Tableau, Power BI, Datawrapper, and Google Charts. Their interactive interfaces support clear, dynamic displays. Major data visualization challenges include massive data volume, data complexity, and missing or duplicate entries [124].
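Whatever the chosen style, text visualizations such as word clouds and bar charts typically start from term frequencies; a minimal counting step in plain Python (the stopword list and sample text are illustrative):

```python
from collections import Counter
import re

# small illustrative stopword list; real pipelines use larger ones
STOPWORDS = {"the", "of", "and", "in", "a", "to", "for", "is", "are"}

def term_frequencies(text, top_n=5):
    """Count content-word frequencies, the raw input of a word cloud."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top_n)

abstract = ("Text mining of medical literature supports medical research. "
            "Mining tools visualize mining results for medical staff.")
term_frequencies(abstract)
# the most frequent terms ("mining", "medical") would dominate the cloud
```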
2.7. Deployment
In the sixth and final phase (deployment) of CRISP-DM [14], knowledge from the project is organized and presented for the project, company, and customer. Complexity varies. Steps: (1) Create deployment plan (formulate strategy), (2) Plan monitoring and maintenance (plan to avoid issues during operational phase), (3) Produce final report (prepare written/verbal report), and (4) Review project (evaluate successes/failures, improvement areas).
3. Conclusions and Future Outlook
The volume of medical text data is rapidly increasing. Data mining, particularly when combined with artificial intelligence, can extract new and useful information and knowledge from this data. The CRISP-DM process presented here details each step of data mining, illustrated with medical examples. The authors plan future work to develop an AI-based web crawling system with 4D data visualization that presents information concisely for researchers, patients, and medical staff education.
Acknowledgments
The authors acknowledge support from the German Research Foundation (DFG) and the Open Access Publication Fund of the University of Göttingen.
Author Contributions
Conceptualization, A.Z., M.A., I.P., S.A.K., A.F.H. and A.R.A.; writing—original draft preparation, A.Z., M.A., I.P. and A.R.A.; writing—review and editing, A.Z., M.A. and A.R.A.; Artificial Intelligence and Machine Learning in Medical Literature Data Mining, M.A., S.A.K. and A.F.H.; Natural Language Processing, A.Z., S.A.K. and A.F.H.; funding acquisition, M.A. and A.R.A. All authors have read and agreed to the published version of the manuscript.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
Funding Statement
This work was funded by the German Bundesministerium für Bildung und Forschung (BMBF) under the CELTIC-NEXT AI-NET-PROTECT project.
Footnotes
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
[1] Fan, J.; Xiao, M.; Li, S.; Zhao, H.; Li, X. A Survey of Text Mining in Biomedical Literature. J. Healthc. Eng. 2022, 2022, 6064570.
[2] Senger, P.L. Pathways to Pregnancy and Parenthood; WSU Printing, Pullman, WA, USA, 2005.
[3] Ristl, R.; Dehmer, M.; Grünhage, F.; Maier, J.; Höss, P.; Eiter, T.; Thalhammer, M. Managing Multidimensional Data in Translational Medicine. In Advances in Conceptual Modeling. ER 2012 Workshops ACMD, WISM, BHMD, SeCoGIS, and MREBA; Eder, J., Fliedl, G., Morimoto, I., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; Volume 7608, pp. 93–102.
[4] Office for Human Research Protections (OHRP). Protection of Human Subjects, 45 CFR 46 (2018); U.S. Department of Health and Human Services (HHS): Washington, DC, USA, 2018. Available online: https://www.hhs.gov/ohrp/regulations-and-policy/regulations/45-cfr-46/index.html (accessed on 19 December 2021).
[5] Ristl, R.; Dehmer, M.; Emmert-Streib, F.; Grünhage, F.; Thalhammer, M. Data Protection in Data Mining. In Information Theory and Statistical Learning; Dehmer, M., Emmert-Streib, F., Eds.; Springer: New York, NY, USA, 2013; pp. 231–242.
[6] Medlock, S.; W Wiggins, M.; Névéol, A.; Mercer, R.; Brew, C. Text mining the electronic medical record: Prospects and challenges. Yearbook of Medical Informatics 2011, 20, 182–191.
[7] Bhatia, V.; Batra, M.; Singh, H. Artificial Intelligence and Data Science in Healthcare. In 2021 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS); IEEE: Ghaziabad, India, 2021; pp. 428–432.
[8] Khan, A.; ur Rehman, S.; Khan, Z.A.; Saeed, N. Machine learning in healthcare: A guide for beginners. Int. J. Biomater. 2020, 2020, 8890137.
[9] Press, G. 12 Real-World AI and Natural Language Processing Use Cases in Healthcare. Available online: https://www.forbes.com/sites/gilpress/2019/07/30/12-real-world-ai-and-natural-language-processing-use-cases-in-healthcare/?sh=60f3b32c57e0 (accessed on 16 March 2022).
[10] Liu, S.; Chen, Z.; Chen, W.; Zong, Z. A hybrid method for medical information extraction based on word embeddings and LSTM. BMC Med. Inf. Decis. Mak. 2019, 19, 1–9.
[11] Deng, Y.; Xu, J.; Mi, J.; Wu, Z.; Zhang, J.; Li, Y.; Shen, X.; Wang, F. Generating code and semantic representations from clinical text via concept embedding methods. BMC Med. Inf. Decis. Mak. 2020, 20, 1–13.
[12] Afzal, W.; Amin, A.; Khan, T.; Javed, U.; Masud, M.; Alshamari, M.; Nawaz, R. A machine learning-based framework for evidence extraction, summarization and ranking from biomedical literature. BMC Med. Inf. Decis. Mak. 2020, 20, 1–18.
[13] Pandey, A.; Vo, V.; Koopman, B.; Hogan, W.R.; Liu, H. Automated Methods to Extract Information from Cancer Pathology Reports: A Narrative Review. JCO Clin. Cancer Inform. 2020, 4, 642–657.
[14] Shearer, C. The CRISP-DM Model: The New Blueprint for Data Mining. J. Data Warehous. Manag. 2000, 5, 13–22.
[15] Martins, M.; Pereira, L. The application of data mining in the prediction of cardiovascular diseases: A comparative study. Int. J. Inf. Syst. Serv. Sci. 2020, 12, 48–63.
[16] Delen, D.; Wu, D.; Arias, O.; Zolbanin, H.M.; Liu, S.; Tavana, M. Real-Time Business Intelligence: Leveraging Data into Decision-Making; Springer Nature: Berlin/Heidelberg, Germany, 2021.
[17] The Data Science Process Alliance. Available online: www.datascience-pm.com/crisp-dm-2/ (accessed on 16 April 2022).
[18] Martins, M.; Pereira, L. The application of data mining in the prediction of cardiovascular diseases: A comparative study. Int. J. Inf. Syst. Serv. Sci. 2020, 12, 48–63.
[19] About PubMed. Available online: https://pubmed.ncbi.nlm.nih.gov/about/ (accessed on 11 May 2022).
[20] Guo, W.; Wang, F.; Hu, D. Coronavirus: Detection and analysis from open-source information. J. Med. Internet Res. 2020, 22, e17301.
[21] What is Web Scraping—An Introduction. Available online: https://oxylabs.io/blog/what-is-web-scraping-introduction (accessed on 13 May 2022).
[22] Web Scraping vs. Web Crawling: What’s the Difference? Available online: https://brightdata.com/blog/industry-trends/web-scraping-vs-web-crawling (accessed on 13 May 2022).
[23] What Is Web Crawling and What Are Web Crawlers? Available online: https://octoparse.com/blog/what-is-web-crawling (accessed on 13 May 2022).
[24] What is a Web Scraper? Available online: https://octoparse.com/blog/what-is-web-scraper (accessed on 13 May 2022).
[25] How Does Web Scraping Work? Available online: https://octoparse.com/blog/how-does-web-scraping-work (accessed on 13 May 2022).
[26] Kaur, A.; Chopra, D. A comparative analysis of various text mining tools. Int. J. Eng. Res. Dev. 2012, 3, 50–53.
[27] Chandra, T.; Varanasi, N. Learning Request Module in Python. In Proceedings of the 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI); IEEE: Tirunelveli, India, 2019; pp. 1011–1016.
[28] Kouzis-Loukas, M. Getting Started with Scrapy; Packt Publishing Ltd.: Birmingham, UK, 2013.
[29] Richardson, L. Beautiful soup documentation. Available online: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ (accessed on 17 February 2022).
[30] Sharma, D. Selenium, Python, And Testing. In Selenium WebDriver with Java: Navigate, Automate, and Test Web Applications; Springer: Berlin/Heidelberg, Germany, 2019; pp. 401–417.
[31] Jayashree, V.; Vanitha, S. A study on the various data sources in healthcare industry. Int. J. Manag. Inf. Technol. Eng. 2017, 5, 13–18.
[32] Ray, P.; Singh, S.; Sharma, B. Survey of different data sources in healthcare. In 2020 International Conference on Electrical and Electronics Engineering (ICEEE); IEEE: Gorakhpur, India, 2020; pp. 157–162.
[33] About BioLINCC. Available online: https://biolincc.nhlbi.nih.gov/about/ (accessed on 4 April 2022).
[34] About BTRIS. Available online: https://btris.nih.gov/about/ (accessed on 4 April 2022).
[35] Request Study Data from Sponsors. Available online: https://www.clinicalstudydatarequest.com/ (accessed on 4 April 2022).
[36] About SEER. Available online: https://seer.cancer.gov/about/ (accessed on 4 April 2022).
[37] MIMIC-III Clinical Database. Available online: https://mimic.mit.edu/ (accessed on 4 April 2022).
[38] Johnson, A.E.; Pollard, T.J.; Shen, L.; Lehman, H.-L.H.; Feng, M.; Ghassemi, M.; Moody, B.; Villarroel, P.; DunnMon, L.A.; et al. MIMIC-III, a freely accessible critical care database. Sci. Data 2016, 3, 1–9.
[39] Johnson, A.E.; Pollard, T.J.; Berkowitz, S.J.; Greenbaum, S.; Lungren, M.P.; Mark, R.G.; Horng, S. MIMIC-CXR, a large publicly available dataset of chest radiographs with free-text reports. Sci. Data 2019, 6, 1–8.
[40] NHANES—National Health and Nutrition Examination Survey. Available online: https://www.cdc.gov/nchs/nhanes/index.htm (accessed on 4 April 2022).
[41] GBDx—Global Burden of Disease Data Exchange. Available online: https://ghdx.healthdata.org/ (accessed on 4 April 2022).
[42] UK Biobank. Available online: https://www.ukbiobank.ac.uk/ (accessed on 4 April 2022).
[43] The Cancer Genome Atlas Program (TCGA). Available online: https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga (accessed on 4 April 2022).
[44] Gene Expression Omnibus (GEO) Fact Sheet. Available online: https://www.ncbi.nlm.nih.gov/geo/info/geo_fact.html (accessed on 4 April 2022).
[45] What Is Data Cleansing—Precisely Explained. Available online: https://www.xenonstack.com/insights/what-is-data-cleansing (accessed on 21 May 2022).
[46] Data Cleaning: Best Practices & Tools. Available online: https://towardsdatascience.com/data-cleaning-best-practices-and-tools-f86d14b7a727 (accessed on 21 May 2022).
[47] What is Data Cleaning? Definition, Best Practices, and Examples. Available online: https://www.talend.com/resources/what-is-data-cleaning/ (accessed on 21 May 2022).
[48] What is NumPy? Available online: https://www.tableau.com/learn/articles/what-is-numpy (accessed on 23 May 2022).
[49] Matplotlib. Available online: https://matplotlib.org/ (accessed on 23 May 2022).
[50] What is Pandas in Python and How to Use It. Available online: https://www.techtarget.com/searchdatamanagement/definition/Pandas-in-Python (accessed on 23 May 2022).
[51] About GROBID. Available online: https://grobid.readthedocs.io/en/latest/Grobid-machine-learning/ (accessed on 21 May 2022).
[52] Wikipedia. GROBID. Available online: https://en.wikipedia.org/wiki/GROBID (accessed on 21 May 2022).
[53] Data Transformation. Available online: https://www.tableau.com/learn/articles/data-transformation (accessed on 23 May 2022).
[54] DataCamp. Feature Engineering. Available online: https://www.datacamp.com/blog/feature-engineering (accessed on 23 May 2022).
[55] Krishnan, M. Feature Engineering Tools for Machine Learning. Available online: https://www.upgrad.com/blog/feature-engineering-tools/ (accessed on 23 May 2022).
[56] Vijithananda, V.; Abed, A.; Zowalla, A.; Zowalla, M.; Maqsood, A.; Abdul-Sater, A.; Ahmed, N.; Raza, A. Evaluating Machine Learning Approaches for Non-Invasive Grading of Brain Tumor Utilizing Feature Engineering of ADC-Map. Diagnostics 2022, 12, 246.
[57] What is keyword extraction? Available online: https://www.smokeball.com/blog/what-is-keyword-extraction/ (accessed on 23 May 2022).
[58] FlashText. Available online: https://flashtext.readthedocs.io/en/latest/ (accessed on 23 May 2022).
[59] Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python: Analyzing, Parsing, and Interpreting Text and Speech; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2009.
[60] Honnibal, M. SpaCy: Industrial-strength Natural Language Processing in Python. Available online: https://spacy.io/ (accessed on 16 March 2022).
[61] Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
[62] Pinto, L.S.; Marujo, L.; Martins, A.M. The role of machine learning in natural language processing. Advances in Natural Language Processing, Artificial Intelligence and Knowledge Engineering: Selected Papers from PROPOR 2012 and RANLP 2013 Workshops; Springer: Berlin/Heidelberg, Germany, 2014; pp. 233–245.
[63] Rehurek, R.; Sojka, P. Software for semantic representation from texts and its application to information retrieval. In Proceedings of the ITAT 2010, Information Technologies—Applications and Theory; Springer: Berlin/Heidelberg, Germany, 2010; pp. 46–51.
[64] Jha, A. Difference Between Supervised and Unsupervised Learning—An Overview. Available online: https://www.geeksforgeeks.org/difference-between-supervised-and-unsupervised-learning/ (accessed on 23 May 2022).
[65] Supervised vs Unsupervised Learning. Available online: https://www.javatpoint.com/supervised-vs-unsupervised-learning (accessed on 23 May 2022).
[66] Supervised vs. Unsupervised Learning: What’s the Difference? Available online: https://www.ibm.com/cloud/blog/supervised-vs-unsupervised-learning (accessed on 23 May 2022).
[67] Zowalla, M.; Degeling, M.; Görlich, J.; Rehberg, R.; Zowalla, A.; Stüber, M.; Ziemssen, T.; Kümpfel, T.; Emmert, M.Y.; et al. Identification and Evaluation of Health-Related Web Content in the German Health Web Using a Web Crawler, Natural Language Processing, Machine Learning, and Crowd Validation. J. Med. Internet Res. 2021, 23, e27916.
[68] The CRISP-DM Methodology. Available online: https://www.datascience-pm.com/crisp-dm-methodology/ (accessed on 27 May 2022).
[69] Dong, D.; Zhang, J.; Chen, X.; Ge, J.; Tang, L.; Wang, J.; Luo, W.; Cao, S.; Zhou, J.; et al. Development and validation of an interpretable deep learning model for tumor classification based on whole-slide pathological images. Clin. Cancer Res. 2020, 26, 1386–1396.
[70] Top 5 Data Visualization Best Practices. Available online: https://www.tableau.com/learn/articles/data-visualization-best-practices (accessed on 27 May 2022).
[71] Ye, S.; Song, Y.; Zhuang, Y.; Chen, Z.; Li, H. Cite2vec: Citation sequence embedding based on citation contexts. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR); IEEE: Kyoto, Japan, 2019; pp. 1270–1275.
[72] Sun, S.; Chen, W.; Maciejewski, R.; Ma, K.-L.; Cleveland, W.S. Topiclens: An analytical visualization framework for topic models. IEEE Trans. Vis. Comput. Graph. 2015, 21, 102–115.
[73] Bach, B.; Wang, X.; Shao, J.; Lee, B.; Henry Riche, N. Survis: A visual survey of survey visualizations. IEEE Trans. Vis. Comput. Graph. 2018, 25, 779–789.
[74] Xia, Y.; Mao, H.; Xu, J.; Chang, R.; Cui, W.; Qu, H. Poemage: Visualizing the sonic structure of poetry. IEEE Trans. Vis. Comput. Graph. 2017, 24, 678–687.
[75] Van Ham, F.; Wattenberg, M.; Viegas, F.B. Mapping text with keywords. IEEE Trans. Vis. Comput. Graph. 2009, 15, 1137–1144.
[76] Rohrer, T. SentenTree: Visualizations of sentence mining for text collections. In International Conference on Information Visualisation; IEEE: London, UK, 2009; pp. 18–25.
[77] Ren, D.; Shi, L.; Yang, Z.; Wen, W.; Shen, B. Infovis: Interactive visualization of knowledge space based on text mining. In International Conference on Human-Computer Interaction; Springer: Berlin/Heidelberg, Germany, 2014; pp. 479–489.
[78] VisOHC. Available online: https://www.visohc.org/ (accessed on 27 May 2022).
[79] Wang, J.; Dou, W.; Wang, X.; Li, F.; Zhang, X. IncreSTS: Incremental text stream visualization. IEEE Trans. Vis. Comput. Graph. 2019, 26, 862–871.
[80] Cui, W.; Wu, Y.; Liu, M.; Su, G.; Ren, K.; Zhou, M.X.; Liang, F. Word-sized graphics: Using small visualizations to represent words in text. IEEE Trans. Vis. Comput. Graph. 2013, 19, 2456–2465.
[81] Liu, M.; Shen, B.; Liang, F. TextTile: An interactive visual analytic system for exploring text collections. IEEE Trans. Vis. Comput. Graph. 2013, 19, 2436–2445.
[82] Wu, Y.; Liu, M.; Cui, W.; Liu, S. SentiCompass: Visualizing and navigating sentiment in large text corpus. IEEE Trans. Vis. Comput. Graph. 2013, 19, 2052–2061.
[83] Cui, W.; Liu, S.; Wu, Y.; Wei, F.; Zhou, M.X. NewsViews: Visualizing news with diverse opinions. IEEE Trans. Vis. Comput. Graph. 2012, 18, 2577–2586.
[84] Xu, S.; Yuan, X.; Guan, Q.; Cai, H.; Qu, H. WeiboEvents: Visual Summarization of Topical Events on Weibo. IEEE Trans. Vis. Comput. Graph. 2015, 21, 124–133.
[85] Dou, W.; Wang, X.; Zhou, M.X.; Wang, J.; Li, F.; Zhang, X. CatStream: Visual analysis of categorical data over time. IEEE Trans. Vis. Comput. Graph. 2019, 26, 648–658.
[86] Zhou, M.X.; Zhang, D.J.; Xu, S.; Qu, H.; Ren, K.; Liang, F. PhenoLines: Quantitative phenotype comparison visualization. IEEE Trans. Vis. Comput. Graph. 2012, 18, 2425–2434.
[87] Wu, Y.; Wei, F.; Liu, S.; Shen, B.; Cui, W.; Zhou, M.X. SocialBrands: Visual analysis of consumer opinions on social media. IEEE Trans. Vis. Comput. Graph. 2011, 17, 2417–2426.
[88] Choo, J.; Lee, S.; Park, S.; Jang, Y.; Han, J.; Seo, J. TopicPanorama: Visual analytic system for monitoring and understanding topical evolution in text streams. IEEE Trans. Vis. Comput. Graph. 2013, 19, 2033–2042.
[89] Thom, D.; Bögl, M.; Mittlboeck, M.; Leitner, P.; Gröller, E.L.; Miksch, S. #FluxFlow: Visual analysis of anomalous information spreading on Twitter. IEEE Trans. Vis. Comput. Graph. 2012, 18, 2435–2444.
[90] Pearl, C. PEARL: A particle model for visualizing keyword associations over time. IEEE Trans. Vis. Comput. Graph. 2009, 15, 1091–1098.
[91] Börner, K.; Penumarthy, S.; Fortunato, S.; Spirin, M.; Duhon, M.; Zoss, A. Vispubdata.org: The information visualization community dataset. IEEE Trans. Vis. Comput. Graph. 2015, 22, 1181–1185.
[92] Hu, K.; Zhang, Y.; Wang, J.; Cao, N.; Wei, F.; Ren, K.; Liang, F.; Zhou, M.X. GameFlow: Considering multiple user interaction aspects for temporal event sequence analysis. IEEE Trans. Vis. Comput. Graph. 2017, 24, 1575–1584.
[93] Dou, W.; Wang, X.; Ma, X.; Wang, R.; Zhou, M.X.; Wang, J.; Li, F.; Zhang, X. MultiConVis: Visual consensus tracking in online multi-modal discussions. IEEE Trans. Vis. Comput. Graph. 2017, 24, 648–657.
[94] Gotz, D.; Gorlatova, N. Contextifier: Interactive visual context modeling for text analysis. IEEE Trans. Vis. Comput. Graph. 2011, 17, 2407–2416.
[95] Wattenberg, M.; Viegas, F.B.; Hollenbach, T. Visualizing ambiguous networks: An example using Google+Ripples. IEEE Trans. Vis. Comput. Graph. 2011, 17, 2294–2301.
[96] Xu, S.; Yuan, X.; Guan, Q.; Qu, H.; Cui, W. NEREx: Named entity relationship extraction and visualization. IEEE Trans. Vis. Comput. Graph. 2015, 22, 449–458.
[97] Wang, F.; Dou, W.; Wang, X.; Zhou, M.X.; Wang, J.; Chen, J.; Li, F.; Zhang, X. iForum: Visual analysis of heterogeneous information in online forums. IEEE Trans. Vis. Comput. Graph. 2018, 25, 33–42.
[98] Cao, N.; Cui, W.; Wu, Y.; Wei, F.; Liu, S.; Ren, K.; Qu, H. NameClarifier: A visual analytic system for author name disambiguation. IEEE Trans. Vis. Comput. Graph. 2011, 17, 2387–2396.
[99] Xu, S.; Yuan, X.; Guan, Q.; Qu, H. DIA2: A visual analytic system for discovering, understanding, and assisting ambiguous identity resolution. IEEE Trans. Vis. Comput. Graph. 2014, 20, 2055–2064.
[100] Skupin, A.; Fabrikant, S.I. From geographic to geographic knowledge visualization: Deep foundations and new directions. Cartogr. Geogr. Inf. Sci. 2008, 35, 81–85.
[101] Dou, W.; Wang, X.; Ma, X.; Wang, R.; Zhou, M.X.; Wang, J.; Li, F.; Zhang, X. OpinionFlow: Visual analysis of opinion evolution in online discussions. IEEE Trans. Vis. Comput. Graph. 2016, 23, 2401–2410.
[102] Satyanarayan, A.; Wong, B.; Heer, J. Trifacta: Interactive visual analysis of tabular data. ACM SIGMOD Rec. 2016, 45, 107–112.
[103] Cui, W.; Liu, S.; Wu, Y.; Wei, F.; Zhou, M.X. HierarchicalTopics: Visual analysis of topic evolution in a text corpus. IEEE Trans. Vis. Comput. Graph. 2011, 17, 2417–2426.
[104] Heer, J.; boyd, d. Whisper: Tracing networked conversations by combining text and social context. In Proceedings of the Human Factors in Computing Systems (CHI); ACM: New York, NY, USA, 2010; pp. 2401–2404.
[105] Stasko, J.; Gribov, A.; Gupta, A.; Kim, B.; O’Dowd, M.; Shen, M.; Zhou, J. The world’s languages explorer: A tool for analyzing and visualizing language data. In International Conference on Human-Computer Interaction; Springer: Cham, Switzerland, 2018; pp. 495–505.
[106] Shen, B.; Wu, Y.; Ren, K.; Liang, F.; Cui, W.; Qu, H. Interactive ambiguity resolution in tag clouds. IEEE Trans. Vis. Comput. Graph. 2012, 18, 2597–2606.
[107] Shneiderman, B. The eyes have it: A task by data type taxonomy for information visualizations. In Proceedings of the IEEE Symposium on Visual Languages; IEEE: Washington, DC, USA, 1996; pp. 336–343.
[108] Feng, X.; Guo, Z.; Xu, J.; Zhang, C.; Qu, H.; Wu, Y.; Zhang, X. Conceptual recurrence plots: Visualization and analysis of document similarity. IEEE Trans. Vis. Comput. Graph. 2016, 23, 1701–1710.
[109] Al-Musalmi, M.; Elmqvist, N. The Deshredder: A visualization tool for the reconstruction of shredded documents. IEEE Trans. Vis. Comput. Graph. 2010, 16, 1139–1148.
[110] Choo, J.; Jang, Y.; Park, S.; Chung, S.; Kim, J.; Lee, S.; Seo, J. Termite: Visualizing term co-occurrence networks. In Proceedings of the 2011 IEEE Symposium on Visual Analytics Science and Technology (VAST); IEEE: Providence, RI, USA, 2011; pp. 261–262.
[111] Zhao, S.; Chen, Z.; Li, B.; Cao, Y.; Sun, X.; Cui, W. VAiRoma: Visual interactive roadmap for complex problem analysis. IEEE Trans. Vis. Comput. Graph. 2014, 20, 1753–1762.
[112] Dou, W.; Wang, X.; Zhou, M.X.; Zhang, X.; Li, F. CiteRivers: Visual exploration of citation patterns. IEEE Trans. Vis. Comput. Graph. 2013, 19, 2426–2435.
[113] Heer, J.; Viégas, F.B.; Wattenberg, M. Guided exploration of text collections. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems; ACM: New York, NY, USA, 2010; pp. 615–624.
[114] Cui, W.; Wu, Y.; Liu, S.; Qu, H.; Zhou, M.X. EvoRiver: Visual analysis of topic evolution. IEEE Trans. Vis. Comput. Graph. 2011, 17, 2427–2436.
[115] Cui, W.; Wu, Y.; Liu, S.; Qu, H.; Zhou, M.X.; Liang, F. LeadLine: Interactive visual analysis of text data with evolving topics. IEEE Trans. Vis. Comput. Graph. 2014, 20, 923–932.
[116] Jo, J.; Seo, J. TimeLineCurator: Interactive visual exploration of event sequences. IEEE Trans. Vis. Comput. Graph. 2017, 24, 688–697.
[117] Ruddle, R.A.; Lessels, S.; Jones, M. An evaluation of interactive visual profiling methods for classifying behavior in large data sets. IEEE Trans. Vis. Comput. Graph. 2012, 18, 2052–2061.
[118] Cui, W.; Wu, Y.; Liu, S.; Qu, H.; Zhou, M.X. ConToVi: Visualizing text with semantic contours. IEEE Trans. Vis. Comput. Graph. 2010, 16, 1108–1117.
[119] Dou, W.; Wang, X.; Ren, K.; Qu, H.; Zhou, M.X. ConVis: Visual consensus analysis on controversy. IEEE Trans. Vis. Comput. Graph. 2012, 18, 2517–2526.
[120] Lu, J.; Yan, H.; Chen, G.; Ye, Y.; Feng, X.; Zhu, X.; Xu, J.; Guo, X. A two-stage framework for 3D spatial text visualization based on clustering and feature point selection. IEEE Trans. Vis. Comput. Graph. 2017, 23, 1951–1961.
[121] Lampe, C.; Wash, R.; Velasquez, A.; Ozdivek, B. Can twitter save lives? A study of the use of twitter during the 2010 Russian wildfires. In Proceedings of the International Conference on Information Technologies and Emergency Management; Springer: Berlin/Heidelberg, Germany, 2011; pp. 256–265.
[122] Wallgrün, J.O.; Fabrikant, S.I.; Laube, P. Visualizing dynamic data with maps and landscape metaphors. J. Spat. Inf. Sci. 2010, 1, 19–41.
[123] Maciejewski, R.; Hafen, R.; Rudolph, S.; Wang, X.; Gooch, B.; Paton, N.; Otto, C.; Blasch, E.; Zhao, S.; et al. Spatiotemporal anomaly detection for geo-located Twitter data. IEEE Trans. Vis. Comput. Graph. 2011, 17, 2521–2528.
[124] Top 10 Challenges of Data Visualization and How to Overcome Them. Available online: https://www.datapine.com/blog/data-visualization-challenges/ (accessed on 27 May 2022).