
  1. (Medical Research Institute, College of Medicine, Chungbuk National University, Korea.)
  2. (Dept. of Big Data, Chungbuk National University, Korea.)
  3. (Dept. of Urology, College of Medicine, Chungbuk National University and Chungbuk National University Hospital, Korea.)
  4. (Dept. of Biomedical Engineering, College of Medicine, Chungbuk National University, Korea.)
  5. (Institute for Trauma Research, College of Medicine, Korea University, Korea.)



Kidney tumor, Stage classification, Deep learning, Feature extraction

1. Introduction

Analyzing genomic data on gene expression is challenging because the number of genes far exceeds the number of patients. Recently, various studies have been conducted using such biological data. In particular, AI technology is enhancing the efficiency, accuracy, and speed of research in biotechnology, from data storage and cleaning through to analysis (1).

We aim to contribute to the early diagnosis, prognosis, and prediction of cancer by extracting significant genes from kidney cancer gene expression data and developing a classification model based on the extracted genes. Kidney cancer is a rapidly increasing cancer and is often referred to as a silent cancer: the classic triad of symptoms, flank pain, hematuria (blood in the urine), and a palpable abdominal mass, appears together in only 10-15% of cases. Kidney cancer typically presents with no noticeable symptoms, and in 3 out of 10 cases it has already metastasized to other organs at the time of diagnosis. Therefore, implementing appropriate treatment based on the tumor stage of kidney cancer patients is a crucial task that demands a strategic approach.

Kidney cancer is a primary tumor that occurs in the kidney, and renal cell carcinoma, a malignant tumor, accounts for more than 90% of cases. It typically shows no symptoms in the early stages and has often reached an advanced stage by the time of diagnosis. According to data released by the National Cancer Information Center in 2022, kidney cancer accounted for 2.4% of all new cancer cases in Korea in 2020, ranking 10th in incidence (2). Incidence is higher in men than in women and peaks in people in their 60s. Kidney cancer also imposes a significant disease burden through the decline in quality of life caused by disease symptoms, treatment-related adverse events, and the resulting increase in medical costs. Risk factors include environmental and lifestyle factors, genetic predisposition, and pre-existing kidney disease; among lifestyle factors, smoking, obesity, high blood pressure, and dietary habits can be contributing causes (3).

Recently, domestic researchers developed an algorithm to predict kidney cancer recurrence, and ongoing research focuses on extracting features and implementing classification algorithms using neighborhood component analysis and genomic data (4)-(6). Machine learning algorithms are being applied to various biodata analyses, including RNA sequencing and DNA methylation data for breast invasive carcinoma, thyroid carcinoma, and kidney renal papillary cell carcinoma from The Cancer Genome Atlas (TCGA) (7); a data mining algorithm was employed to extract cancer-related genes by integrating these data. Another study predicted the risk of 20 cancers by applying machine learning techniques to genetic big data (8). A Bayesian classifier has been utilized to classify proteins based on sequence and structure information, and a Bayesian network that integrates diverse protein- and gene-related information has improved the functional prediction of genes (9). Research has also been directed toward accurately predicting major mutations responsible for spinal muscular atrophy, hereditary nasal polyposis, colorectal cancer, and autism by applying deep learning to mutations in gene sequences to predict the patient's disease state (10). Various methods for extracting features from gene expression data are being studied, recently including deep learning and statistical techniques (11)-(13). Machine learning-based deep learning algorithms are also widely used for feature extraction from biomedical images (14).

In this study, we extracted significant genes from gene expression datasets using two algorithms, the autoencoder (AE) and the variational autoencoder (VAE), and then compared and analyzed their kidney cancer tumor stage classification performance. Stage-based classification analysis makes complex data such as gene expression data tractable and improves classification accuracy. This approach can serve as a foundation for analyzing other gene expression datasets, and various machine learning algorithms can be employed to analyze medical data.

2. Materials and methods

This section describes the dataset as well as all of the techniques applied in this study. Fig. 1 depicts the overall research flow, from dataset collection to model evaluation.

Fig. 1. An end-to-end experimental flow of our deep learning framework used for staging the kidney tumor

2.1 Dataset

This study used a gene expression dataset obtained from the TCGA website, which provides access to a wide range of biomedical datasets, including mRNA data. The dataset contained information from 1,157 kidney cancer patients and was carefully prepared for analysis. Although the original dataset included both gene expression and clinical data, only the gene expression data was used in this study. To ensure accuracy and consistency, the dataset was cleaned and preprocessed to remove missing, duplicate, and invalid values, as well as the clinical data. Tables 1 and 2 provide detailed statistics about the dataset, such as the number of samples for each stage and information on the data features. The "Before Cleaning" row refers to the original dataset, downloaded from the TCGA website without any preprocessing, whereas the "After Cleaning" row refers to the dataset after the cleaning and preprocessing steps were applied. This careful preparation of the gene expression dataset helped the subsequent analyses produce reliable and meaningful results. A sketch of the cleaning step is given below.
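As an illustration, the cleaning step can be sketched with pandas as follows. This is a minimal sketch under our assumptions; the file name, the "ENSG" gene-column prefix, and the stage-label column are hypothetical, not the authors' actual script.

```python
import pandas as pd

# Hypothetical TCGA export: rows = patients, columns = gene expression
# features (Ensembl "ENSG..." IDs) plus clinical columns and a stage label.
df = pd.read_csv("tcga_kidney.csv")

VALID_STAGES = {"stage i", "stage ii", "stage iii", "stage iv"}
df = df[df["tumor_stage"].str.lower().isin(VALID_STAGES)]  # drop invalid labels
df = df.drop_duplicates()                                  # drop duplicate rows
df = df.dropna()                                           # drop missing values

# Separate the labels, then keep only gene expression columns
# (i.e., drop all clinical features).
y = df["tumor_stage"]
gene_cols = [c for c in df.columns if c.startswith("ENSG")]
X = df[gene_cols]
print(X.shape)  # e.g., (973, 58722) after cleaning
```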

Table 1. Tumor stage data statistics before and after data cleaning

Condition | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Invalid | Total
Before Cleaning | 528 | 183 | 261 | 146 | 39 | 1,157
After Cleaning | 477 | 153 | 228 | 115 | 0 | 973

Table 2. Gene expression data statistics before and after data cleaning

Condition | Number of samples | Gene expression features | Clinical data features | Total features
Before Cleaning | 1,157 | 60,483 | 29 | 60,512
After Cleaning | 973 | 58,722 | 0 | 58,722

2.2 Correlation test

Naturally, a gene expression dataset consists of a large number of features, and each feature represents a unique gene of a patient. Among this massive number of features, some are highly correlated, while others are weakly correlated or not correlated at all. Running a correlation test on the dataset can significantly improve the performance of the classification models (15), since classification models can be hampered by redundant features. Accordingly, a method based on the Pearson correlation coefficient was used to identify and remove redundant features: after filtering the data using Pearson correlation values, we kept only the 1,000 least redundant features. The Pearson correlation coefficient is defined as follows:

(1)

$r=\frac{\sum\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{\sqrt{\sum\left(x_i-\bar{x}\right)^2 \sum\left(y_i-\bar{y}\right)^2}}$
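A minimal sketch of this redundancy filter is given below. The greedy rule of dropping one feature from each highly correlated pair and the 0.9 threshold are our assumptions; the paper states only that the 1,000 least redundant features were kept.

```python
import numpy as np
import pandas as pd

def drop_redundant(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature of every pair with |Pearson r| above the threshold."""
    corr = X.corr().abs()                                   # pairwise |r|
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return X.drop(columns=to_drop)

# X_filtered = drop_redundant(X)        # remove highly correlated genes
# X_top1000 = X_filtered.iloc[:, :1000] # keep 1,000 features (illustrative)
```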

2.3 Feature selection

After the features were filtered using the correlation test, a second step reduced the dimensionality further by applying feature selection. Feature selection can help remove irrelevant data and deal with noise in the dataset (16), (17). Two feature selection techniques were chosen and combined to select a list of highly relevant features from thousands of features: the Least Absolute Shrinkage and Selection Operator (LASSO) and Analysis of Variance (ANOVA). LASSO is a linear regression method that uses a shrinkage penalty to drive some regression coefficients toward zero (18). ANOVA is a well-known statistical hypothesis test used to determine whether there are significant differences among the means of two or more groups (19).

Three experiments, using only LASSO, using only ANOVA, and combining LASSO and ANOVA, were conducted to find the feature selection strategy that gave the best classification result. Evaluating the results of these three strategies showed that combining LASSO with ANOVA improves classification performance. We combined the two techniques by selecting the first 500 relevant features from the intersection of their results, as sketched below.
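A hedged scikit-learn sketch of this combination is shown below; the LASSO regularization strength, the ANOVA candidate-pool size, and the exact intersection rule are assumptions on our part.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def lasso_anova_select(X, y, k=500, alpha=0.01):
    """Return indices of features chosen by both LASSO and ANOVA."""
    Xs = StandardScaler().fit_transform(X)

    lasso = Lasso(alpha=alpha, max_iter=10000).fit(Xs, y)
    lasso_idx = set(np.flatnonzero(lasso.coef_))            # nonzero coefficients

    anova = SelectKBest(f_classif, k=min(2 * k, X.shape[1])).fit(Xs, y)
    anova_idx = set(np.flatnonzero(anova.get_support()))    # top ANOVA features

    common = sorted(lasso_idx & anova_idx)                  # intersection
    return common[:k]                                       # first 500 features

# selected = lasso_anova_select(X_top1000.to_numpy(), y_encoded)  # illustrative
```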

2.4 Feature extraction

Two popular families of methods can be used for dimensionality reduction when a dataset contains many features. Besides the feature selection techniques mentioned earlier, feature extraction is the other option. The main goal of feature extraction is to create a smaller version of the original dataset while preserving its meaning (20). Besides dimension reduction, improved training speed and cheaper computation are well-known advantages of feature extraction (21)-(23). This research used two feature extraction techniques: AE and VAE.

2.4.1 AE

The AE is an unsupervised artificial neural network architecture that compresses and encodes the data and then reconstructs the original data from the encoded representation (24). It is widely used for dimension reduction and noise removal. The network consists of three components: the encoder, the bottleneck, and the decoder (25). The encoder compresses the original dataset into a low-dimensional version while trying to preserve its meaning; the bottleneck stores this compressed representation; and the decoder expands the data from the bottleneck to reconstruct the original input. The AE learns by observing the reconstruction error and minimizing it so that the original and reconstructed data are as similar as possible. A minimal sketch is given after Fig. 2.

Fig. 2. The architecture of the AE model (26)
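To make the encoder-bottleneck-decoder structure concrete, here is a minimal Keras sketch. The framework, layer sizes, and training settings are assumptions; only the 500-to-50 reduction mirrors the pipeline described in this paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 500, 50          # 500 selected genes -> 50 features

inputs = keras.Input(shape=(input_dim,))
h = layers.Dense(128, activation="relu")(inputs)         # encoder
bottleneck = layers.Dense(latent_dim, activation="relu")(h)
h = layers.Dense(128, activation="relu")(bottleneck)     # decoder
outputs = layers.Dense(input_dim, activation="linear")(h)

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, bottleneck)    # used to extract the 50 features
autoencoder.compile(optimizer="adam", loss="mse")        # reconstruction error

# autoencoder.fit(X, X, epochs=100, batch_size=32, validation_split=0.2)
# X_extracted = encoder.predict(X)           # (n_samples, 50)
```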

2.4.2 VAE

The VAE follows the same encoder-bottleneck-decoder structure but modifies the bottleneck to make the model generative. Instead of mapping each input to a single point, the VAE learns the probability distribution of the data (27): the bottleneck learns two latent vectors, a mean and a variance, so new data can be generated by sampling from the learned distribution. A sketch of this stochastic bottleneck is given after Fig. 3.

Fig. 3. The architecture of the VAE model (26)
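The sketch below shows the reparameterization step that makes the bottleneck stochastic; as with the AE sketch, the framework and layer sizes are assumptions.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 500, 50

class Sampling(layers.Layer):
    """Sample z ~ N(mean, exp(log_var)): the bottleneck encodes a distribution."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        eps = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

inputs = keras.Input(shape=(input_dim,))
h = layers.Dense(128, activation="relu")(inputs)
z_mean = layers.Dense(latent_dim)(h)         # learned mean
z_log_var = layers.Dense(latent_dim)(h)      # learned log-variance
z = Sampling()([z_mean, z_log_var])          # sampled latent features
vae_encoder = keras.Model(inputs, [z_mean, z_log_var, z])
```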

2.5 Synthetic minority over-sampling technique

Our dataset is skewed, or imbalanced: the number of observations per class is not equal or even close. Table 1, which describes the data statistics, shows that tumor stage 1 covers around 50% of the dataset, leaving the other 50% for the remaining three tumor stages. Imbalanced data can cause problems for machine learning models by biasing the model toward the classes with more samples (28)-(30). To solve this problem, we applied an oversampling technique called the Synthetic Minority Over-sampling Technique (SMOTE) to generate synthetic samples so that all classes contain the same number of observations (31). Before SMOTE, the training set was unevenly distributed across the kidney tumor stages: Stage 1 had the highest count with 312 records, followed by Stage 3 with 189, Stage 2 with 105, and Stage 4 with only 76. To rectify this imbalance, we applied SMOTE with the "auto" sampling strategy, which oversamples every minority class up to the size of the majority class. After SMOTE, the training set contained 1,248 records in total, with each stage, Stage 1 through Stage 4, represented by a balanced 312 records. A small sketch follows.
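The oversampling step with imbalanced-learn can be sketched as below; the stand-in data merely reproduces the class counts quoted above.

```python
from collections import Counter
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(42)
counts = {1: 312, 2: 105, 3: 189, 4: 76}    # train-split counts from the text
X_train = rng.normal(size=(sum(counts.values()), 50))  # stand-in features
y_train = np.repeat(list(counts), list(counts.values()))

smote = SMOTE(sampling_strategy="auto", random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
print(Counter(y_res))   # every stage now has 312 records, 1,248 in total
```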

2.6 Classifiers

We applied nine widely used classification algorithms, allowing us to assess how well they performed against one another: logistic regression (LR) (32), support vector machine (SVM) (33), decision tree (DT) (34), random forest (RF) (35), k-nearest neighbor (KNN) (36), naïve Bayes (NB) (37), AdaBoost (ADA) (38), XGBoost (XGB) (39), and the stochastic gradient descent classifier (SGD) (40). The hyperparameter configurations are as follows: LR uses C=1, penalty="l2", and solver="liblinear"; SVM uses C=1, kernel="rbf", and gamma="scale"; DT uses max_depth=5; RF uses n_estimators=100 and max_depth=5; KNN uses n_neighbors=5; NB uses its default settings; ADA uses n_estimators=50; XGB uses n_estimators=100 and learning_rate=0.1; and SGD is wrapped in a probability-calibration step and uses max_iter=1000 and tol=1e-3. These configurations are collected in the sketch below.
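A sketch collecting these configurations with scikit-learn and xgboost is given below; the probability flag on SVM and the calibration wrapper around SGD reflect our reading of the text rather than a published script.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

classifiers = {
    "LR": LogisticRegression(C=1, penalty="l2", solver="liblinear"),
    "SVM": SVC(C=1, kernel="rbf", gamma="scale", probability=True),  # probability for AUC
    "DT": DecisionTreeClassifier(max_depth=5),
    "RF": RandomForestClassifier(n_estimators=100, max_depth=5),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "NB": GaussianNB(),                       # default settings
    "ADA": AdaBoostClassifier(n_estimators=50),
    "XGB": XGBClassifier(n_estimators=100, learning_rate=0.1),
    "SGD": CalibratedClassifierCV(SGDClassifier(max_iter=1000, tol=1e-3)),
}
```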

2.7 Model evaluation metrics

This section discusses the evaluation metrics for the classifiers listed above. Seven popular evaluation metrics were used to compare their performance: accuracy, recall, precision, F1-score, sensitivity, specificity, and the area under the curve (AUC) (41). To balance model performance across classes, we use macro-averaged precision, recall, and F1-score, i.e., the unweighted mean of each metric over the four classes. Specificity and sensitivity were also calculated to gain a deeper understanding of the true negative and true positive rates, respectively. In the formulas below, TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.

(2)

$\begin{aligned} & \text { Accuracy }=\frac{T P+T N}{T P+T N+F P+F N} \\ & \text { Precision }=\frac{T P}{T P+F P} \\ & \text { Recall }=\frac{T P}{T P+F N} \\ & \text { F1-score }=\frac{2 \times \text { Recall } \times \text { Precision }}{\text { Recall }+ \text { Precision }} \\ & \text { Sensitivity }=\frac{T P}{T P+F N} \\ & \text { Specificity }=\frac{T N}{T N+F P}\end{aligned}$
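The macro-averaged metrics above can be computed as in the following sketch; deriving per-class specificity from the confusion matrix and using one-vs-rest AUC are our assumptions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

def evaluate(y_true, y_pred, y_proba):
    """Return the seven metrics, macro-averaged over the four stages."""
    cm = confusion_matrix(y_true, y_pred)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = cm.sum() - tp - fp - fn
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
        "sensitivity": float(np.mean(tp / (tp + fn))),   # = macro recall
        "specificity": float(np.mean(tn / (tn + fp))),
        "auc": roc_auc_score(y_true, y_proba,
                             multi_class="ovr", average="macro"),
    }
```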

3. Experiment results

This section presents the result of the experiment from the beginning to the end, including feature extraction with deep learning techniques and classification results with multiple classification algorithms.

3.1 Feature extraction with deep learning

We first reduced the number of features from tens of thousands to 1,000 by filtering with the correlation test, then reduced the dimension from 1,000 to 500 using the feature selection techniques. From these 500 features, we applied feature extraction to reduce the dataset to 50 features, using the two feature extraction algorithms, AE and VAE. In the experiment, the reconstructed dataset was observed to be similar to the original, which shows the effectiveness of the feature extraction. Fig. 4 shows the models' loss values, and Fig. 5 visualizes reconstructed data points against the original data points for the AE and VAE models. Of the two algorithms, the AE produced the reconstruction most similar to the original dataset.

Fig. 4. Training and validation loss of (a) AE and (b) VAE as feature extraction techniques

Fig. 5. Sample data reconstruction using (a) AE and (b) VAE

3.2 Classification results

This section shows the results of the nine machine learning classifiers used to classify patients' kidney tumor stages. Before training the classifiers, we preprocessed the data with correlation analysis, feature selection, feature extraction, and oversampling to improve classification performance. Tables 3 and 4 report the evaluation metrics of all classifiers when using the AE and VAE, respectively, for feature extraction, with and without the oversampling technique applied. In these two tables, the "Sampling" column indicates whether the oversampling technique was applied.

Table 3 shows the performance of all classifiers when the AE model is used to extract the features. In this case, the support vector machine outperformed the other classifiers on most evaluation metrics, followed by XGBoost, naïve Bayes, and the decision tree. SVM's performance improved significantly after the oversampling technique was applied to the dataset.

Table 3. Evaluation of prediction models using AE as feature extraction

Classifier | Sampling | Accuracy | Recall | Precision | F1-Score | Sensitivity | Specificity | AUC
LR | Yes | 0.890 | 0.860 | 0.863 | 0.858 | 0.860 | 0.964 | 0.968
LR | No | 0.873 | 0.814 | 0.844 | 0.825 | 0.814 | 0.957 | 0.955
SVM | Yes | 0.984 | 0.974 | 0.987 | 0.980 | 0.974 | 0.992 | 0.985
SVM | No | 0.959 | 0.936 | 0.963 | 0.948 | 0.936 | 0.984 | 0.973
DT | Yes | 0.964 | 0.974 | 0.987 | 0.980 | 0.974 | 0.992 | 0.983
DT | No | 0.959 | 0.936 | 0.963 | 0.948 | 0.936 | 0.984 | 0.960
RF | Yes | 0.952 | 0.945 | 0.945 | 0.944 | 0.945 | 0.983 | 0.983
RF | No | 0.904 | 0.847 | 0.903 | 0.860 | 0.847 | 0.966 | 0.958
KNN | Yes | 0.846 | 0.829 | 0.814 | 0.817 | 0.829 | 0.950 | 0.948
KNN | No | 0.822 | 0.762 | 0.794 | 0.769 | 0.762 | 0.939 | 0.925
NB | Yes | 0.973 | 0.974 | 0.987 | 0.980 | 0.974 | 0.992 | 0.983
NB | No | 0.959 | 0.936 | 0.963 | 0.948 | 0.936 | 0.984 | 0.960
ADA | Yes | 0.753 | 0.728 | 0.587 | 0.630 | 0.728 | 0.925 | 0.930
ADA | No | 0.829 | 0.726 | 0.639 | 0.669 | 0.726 | 0.942 | 0.912
XGB | Yes | 0.963 | 0.974 | 0.987 | 0.980 | 0.974 | 0.992 | 0.983
XGB | No | 0.959 | 0.936 | 0.963 | 0.948 | 0.936 | 0.984 | 0.963
SGD | Yes | 0.870 | 0.838 | 0.839 | 0.831 | 0.838 | 0.958 | 0.953
SGD | No | 0.884 | 0.826 | 0.867 | 0.838 | 0.826 | 0.959 | 0.950

Table 4 shows a different picture, obtained when the VAE was used to extract the features. Performance changed with the feature extraction algorithm, decreasing in this case compared to the AE model. The table shows that SVM still outperforms most of the classifiers in terms of AUC, followed by XGBoost, naïve Bayes, and the decision tree.

Table 4. Evaluation of prediction models using VAE as feature extraction

Classifier | Sampling | Accuracy | Recall | Precision | F1-Score | Sensitivity | Specificity | AUC
LR | Yes | 0.863 | 0.810 | 0.865 | 0.833 | 0.810 | 0.949 | 0.928
LR | No | 0.849 | 0.812 | 0.834 | 0.815 | 0.812 | 0.946 | 0.918
SVM | Yes | 0.918 | 0.886 | 0.929 | 0.904 | 0.886 | 0.967 | 0.943
SVM | No | 0.918 | 0.886 | 0.929 | 0.904 | 0.886 | 0.967 | 0.945
DT | Yes | 0.918 | 0.903 | 0.929 | 0.903 | 0.903 | 0.968 | 0.935
DT | No | 0.918 | 0.903 | 0.929 | 0.903 | 0.903 | 0.968 | 0.935
RF | Yes | 0.870 | 0.844 | 0.888 | 0.830 | 0.844 | 0.952 | 0.935
RF | No | 0.901 | 0.877 | 0.912 | 0.880 | 0.877 | 0.961 | 0.938
KNN | Yes | 0.678 | 0.590 | 0.654 | 0.581 | 0.590 | 0.885 | 0.843
KNN | No | 0.634 | 0.610 | 0.611 | 0.588 | 0.610 | 0.880 | 0.818
NB | Yes | 0.918 | 0.903 | 0.929 | 0.903 | 0.903 | 0.968 | 0.935
NB | No | 0.918 | 0.903 | 0.929 | 0.903 | 0.903 | 0.968 | 0.935
ADA | Yes | 0.791 | 0.653 | 0.619 | 0.610 | 0.653 | 0.927 | 0.875
ADA | No | 0.702 | 0.653 | 0.559 | 0.553 | 0.653 | 0.906 | 0.875
XGB | Yes | 0.918 | 0.903 | 0.929 | 0.903 | 0.903 | 0.968 | 0.915
XGB | No | 0.918 | 0.903 | 0.929 | 0.903 | 0.903 | 0.968 | 0.935
SGD | Yes | 0.860 | 0.794 | 0.869 | 0.810 | 0.794 | 0.949 | 0.915
SGD | No | 0.829 | 0.772 | 0.808 | 0.774 | 0.772 | 0.941 | 0.901

Tables 3 and 4 report the evaluation results of all classifiers averaged over the four tumor classes. Fig. 6 adds detail to the area under the curve (AUC) values listed in those tables by showing the receiver operating characteristic (ROC) curve and the AUC for each individual tumor stage, without averaging. We show only the ROC curves obtained with AE as the feature extraction method, since AE outperformed VAE.

Fig. 6. ROC and AUC for each classifier when using AE as the feature extraction method and SMOTE as the oversampling technique

4. Discussion and conclusion

The survival rate of kidney cancer patients who receive treatment early is higher than that of patients treated at an advanced stage (42). This is why technology is widely used to help patients receive health information and treatment as soon as possible, using patient data such as clinical and biomedical data (43), (44).

In this study, we used different algorithms to detect and classify the stage of kidney tumors. An open dataset published on the TCGA portal was cleaned, preprocessed, and used to build classification models. A correlation analysis, feature extraction, feature selection, and oversampling technique were also used for the purpose of boosting the classifiers’ performance.

With this gene expression dataset, we observed that the AE outperformed the VAE as a feature extraction technique. Classifier performance improved significantly on most evaluation metrics, including accuracy, recall, precision, and F1-score. Feature extraction and feature selection can thus improve classification models by reducing the dimensionality of the input data, removing irrelevant and redundant features, and highlighting the features most important for the task at hand. This leads to more efficient and accurate models and faster training and inference times. The results above show how the classification model is enhanced when these techniques, feature selection, feature extraction, and resampling, are applied together.

In conclusion, classifying kidney tumor stages using feature extraction techniques such as AE and VAE in conjunction with classifiers such as LR, RF, DT, SVM, and others has been shown to be an effective approach. The combination of feature extraction and classification was used to extract the most important features from the gene expression data and to build models that can accurately assign tumors to stages.

The results obtained from these methods have been promising, with high accuracy and precision in classifying tumor stages. This is particularly important for the early detection and treatment of kidney tumors, which can improve patient outcomes and reduce healthcare costs. Furthermore, using multiple classifiers such as LR, RF, DT, and SVM allows us to compare the performance of the different models and select the best one for a specific dataset.

To continue contributing to the medical sector, we plan to improve classification performance by combining patients' clinical and biomedical data when training classifiers. Moreover, we plan to detect and identify the list of genes that contribute the most to kidney cancer. As a future study, we intend to utilize explainable artificial intelligence (XAI) to explain which features of the gene expression data contributed to the predictive models, so that users can understand and trust the results of the machine learning analysis. XAI can be used to describe AI models, their expected impact, and potential biases, and it will be useful for characterizing model accuracy, fairness, transparency, and outcomes in AI-based decision making. With an accurate list of essential genes, a doctor could pay more attention to those genes rather than spending time on less essential ones.

Acknowledgements

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (RS-2023-00245300, No. 2020R1I1A1A01065199, 2020R1I1A3062508) and by the "Regional Innovation Strategy (RIS)" program through the NRF funded by the MOE (2021RIS-001).

References

1. L. A. Gottlieb, A. Kontorovich, R. Krauthgamer, 2016, Adaptive metric dimensionality reduction, Theoretical Computer Science, Vol. 620, pp. 105-118.
2. National Cancer Center. Available online: https://ncc.re.kr/index (accessed 17 August 2023).
3. H. Chi, I. H. Chang, 2018, The Overdiagnosis of Kidney Cancer in Koreans and the Active Surveillance on Small Renal Mass, Korean J Urol Oncol, Vol. 16, No. 1, pp. 15-24.
4. A. M. Ali, H. Zhuang, A. Ibrahim, O. Rehman, M. Huang, A. Wu, Nov. 2018, A machine learning approach for the classification of kidney cancer subtypes using miRNA genome data, Appl Sci, Vol. 8, No. 2422, pp. 1-14.
5. H. M. Kim, S. J. Lee, S. J. Park, I. Y. Choi, S. Hong, 2021, Machine Learning Approach to Predict the Probability of Recurrence of Renal Cell Carcinoma After Surgery: Prediction Model Development Study, JMIR Med Inform, Vol. 9, No. 3.
6. A. J. Peired, R. Campi, M. L. Angelotti, G. Antonelli, C. Conte, E. Lazzeri, F. Becherucci, L. Calistri, S. Serni, P. Romagnani, 2021, Sex and Gender Differences in Kidney Cancer: Clinical and Experimental Evidence, Cancers, Vol. 13, No. 18, 4588.
7. Genomic Data Commons. Available online: https://portal.gdc.cancer.gov (accessed 17 August 2023).
8. B. J. Kim, S. H. Kim, 2018, Prediction of inherited genomic susceptibility to 20 common cancer types by a supervised machine-learning method, Proc Natl Acad Sci USA, Vol. 115, No. 6, pp. 1322-1327.
9. O. G. Troyanskaya, K. Dolinski, A. B. Owen, R. B. Altman, D. Botstein, 2003, A Bayesian framework for combining heterogeneous data sources for gene function prediction (in S. cerevisiae), Proc Natl Acad Sci USA, Vol. 100, No. 14, pp. 8348-8353.
10. N. E. M. Khalifa, M. H. N. Taha, D. E. Ali, A. Slowik, A. E. Hassanien, 2020, Artificial intelligence technique for gene expression by tumor RNA-Seq data: a novel optimized deep learning approach, IEEE Access, Vol. 8, pp. 22874-22883.
11. H. S. Shon, K. O. Kim, E. J. Cha, K. A. Kim, 2020, Classification of Kidney Cancer Data based on Feature Extraction Methods, The Transactions of the Korean Institute of Electrical Engineers, Vol. 69, No. 7, pp. 1061-1066.
12. H. S. Shon, E. Batbaatar, E. J. Cha, T. G. Kang, S. G. Choi, K. A. Kim, 2022, Deep Autoencoder based Classification for Clinical Prediction of Kidney Cancer, The Transactions of the Korean Institute of Electrical Engineers, Vol. 71, No. 10, pp. 1393-1404.
13. H. S. Shon, E. Batbaatar, K. O. Kim, E. J. Cha, K. A. Kim, 2020, Classification of kidney cancer data using cost-sensitive hybrid deep learning approach, Symmetry, Vol. 12, No. 1, pp. 1-21.
14. Y. Bengio, E. Laufer, G. Alain, J. Yosinski, 2014, Deep generative stochastic networks trainable by backprop, Proceedings of the 31st International Conference on Machine Learning, Vol. 32, pp. 226-234.
15. B. Kalaiselvi, M. Thangamani, 2020, An efficient Pearson correlation based improved random forest classification for protein structure prediction techniques, Measurement, Vol. 162.
16. I. Jain, V. K. Jain, R. Jain, 2018, Correlation feature selection based improved-binary particle swarm optimization for gene selection and cancer classification, Applied Soft Computing, Vol. 62, pp. 203-215.
17. Z. M. Hira, D. F. Gillies, 2015, A review of feature selection and feature extraction methods applied on microarray data, Adv Bioinformatics, Vol. 2015.
18. R. Tibshirani, 1996, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society B, Vol. 58, No. 1, pp. 267-288.
19. E. R. Girden, 1992, ANOVA: Repeated measures, Sage.
20. I. Guyon, M. Nikravesh, S. Gunn, L. A. Zadeh, 2008, Feature extraction: foundations and applications, Springer.
21. A. Nakra, M. Duhan, 2020, Feature Extraction and Dimensionality Reduction Techniques with Their Advantages and Disadvantages for EEG-Based BCI System: A Review, IUP Journal of Computer Sciences, Vol. 14.
22. X. Zhang, W. Yang, X. Tang, J. Liu, 2018, A fast learning method for accurate and robust lane detection using two-stage feature extraction with YOLO v3, Sensors, Vol. 18, No. 12, 4308.
23. M. Oravec, 2014, Feature extraction and classification by machine learning methods for biometric recognition of face and iris, Proceedings of ELMAR-2014.
24. P. Baldi, 2011, Autoencoders, unsupervised learning, and deep architectures, Proceedings of Machine Learning Research, Vol. 27.
25. M. Sewak, S. K. Sahay, H. Rathore, 2020, An overview of deep learning architecture of deep neural networks and autoencoders, Journal of Computational and Theoretical Nanoscience, Vol. 17, No. 1, pp. 182-188.
26. L. Weng, From Autoencoder to Beta-VAE. Available from: https://lilianweng.github.io/posts/2018-08-12-vae/
27. D. P. Kingma, M. Welling, 2013, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114.
28. S. M. A. Elrahman, A. Abraham, 2013, A review of class imbalance problem, Journal of Network and Innovative Computing, Vol. 1, pp. 332-340.
29. K. M. Hasib, M. Iqbal, F. M. Shah, J. A. Mahmud, M. H. Popel, M. Showrov, S. Ahmed, O. Rahman, 2020, A survey of methods for managing the classification and solution of data imbalance problem, arXiv preprint arXiv:2012.11870.
30. D. Li, C. Liu, S. C. Hu, 2010, A learning method for the class imbalance problem with medical data sets, Computers in Biology and Medicine, Vol. 40, No. 5, pp. 509-518.
31. N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, 2002, SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, pp. 321-357.
32. D. G. Kleinbaum, K. Dietz, M. Gail, M. Klein, 2002, Logistic regression, Springer.
33. W. S. Noble, 2006, What is a support vector machine?, Nature Biotechnology, Vol. 24, No. 12, pp. 1565-1567.
34. B. Charbuty, A. Abdulazeez, 2021, Classification based on decision tree algorithm for machine learning, Journal of Applied Science and Technology Trends, Vol. 2, No. 1, pp. 20-28.
35. Y. Qi, 2012, Random forest for bioinformatics, in Ensemble Machine Learning: Methods and Applications, Springer, pp. 307-323.
36. N. S. Altman, 1992, An introduction to kernel and nearest-neighbor nonparametric regression, The American Statistician, Vol. 46, No. 3, pp. 175-185.
37. K. P. Murphy, 2006, Naive Bayes classifiers, University of British Columbia, Vol. 18, No. 60, pp. 1-8.
38. T. Hastie, S. Rosset, J. Zhu, H. Zou, 2009, Multi-class AdaBoost, Statistics and Its Interface, Vol. 2, No. 3, pp. 349-360.
39. 2016, XGBoost: A scalable tree boosting system, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
40. S. Ruder, 2016, An overview of gradient descent optimization algorithms, arXiv preprint arXiv:1609.04747.
41. 2020, Performance evaluation of supervised machine learning algorithms in prediction of heart disease, 2020 IEEE International Conference for Innovation in Technology, IEEE.
42. M. Viscaino, J. T. Bustos, P. Muñoz, C. A. Cheein, F. A. Cheein, 2021, Artificial intelligence for the early detection of colorectal cancer: A comprehensive review of its advantages and misconceptions, World Journal of Gastroenterology, Vol. 27, No. 38, pp. 6399.
43. A. N. Richter, T. M. Khoshgoftaar, 2018, A review of statistical and machine learning methods for modeling cancer risk using structured clinical data, Artificial Intelligence in Medicine, Vol. 90, pp. 1-14.
44. S. R. Stahlschmidt, B. Ulfenborg, J. Synnergren, 2022, Multimodal deep learning for biomedical data fusion: a review, Briefings in Bioinformatics, Vol. 23, No. 2, bbab569.

About the Authors

손호선 (Ho Sun Shon)

2010 : Ph.D. in Computer Science, Chungbuk National University, Korea.

2012 to present : Visiting professor in Medical Research Institute, School of Medicine, Chungbuk National University, Korea.

Kong Vungsovanreach

2022 to present : Ph.D. student in Big Data, Chungbuk National University, Korea.

Research interest : AI-driven techniques in object detection, recognition, segmentation, classification, and data analytics.

윤석중 (Seok Joong Yun)

2004 : Ph.D. in Medicine, Chungbuk National University, Korea.

2005 to present : Professor in Department of Urology, College of Medicine, Chungbuk National University, Korea.

2008-2009: Visiting Professor, Department of Cancer Biology, MD Anderson Cancer Center, Houston, Texas.

오진우 (Jin Woo Oh)

2020 to present : Graduate student in Department of Biomedical Engineering, College of Medicine, Chungbuk National University, Korea.

강태건 (Tae Gun Kang)

2000 : Ph.D. in Industrial Engineering, Dongguk University, Korea.

2021 to present : Research professor in Institute for Trauma Research, College of Medicine, Korea University, Korea.

김경아 (Kyung Ah Kim)

2001 : Ph.D. in Biomedical Engineering, Chungbuk National University, Korea.

2005 to present : Professor in Department of Biomedical Engineering, College of Medicine, Chungbuk National University, Korea.