Pneumonia Classification of Thorax Images using Convolutional Neural Networks

Digital image processing is a product of advances in computing technology, and computer-based processing of medical images can help doctors diagnose and monitor patients. This study aimed to classify thorax images using a Convolutional Neural Network (CNN). The data are thorax X-ray images of the lungs that were previously diagnosed by a doctor into two classes, normal and pneumonia. The data set contains 2,200 images: 1,760 for training and 440 for testing. Image pre-processing has three stages, namely scaling, grayscaling, and contrast stretching. The CNN architecture used is ResNet-50 (Residual Network), one of several available CNN architectures. CNNs perform strongly in object recognition because they learn the features of the object image themselves through the convolution operations applied during training. ResNet-50 was selected to reduce the loss of gradients at greater network depths during training, since chest X-ray images show high visual similarity between some pathologies and are affected by several other visual factors, so good accuracy requires a sufficiently deep network. Training was optimized with Adaptive Moment Estimation (Adam) because its bias correction provides better approximations and thereby improves accuracy. The resulting thorax image classification reached an accuracy of 97.73%.
Keywords: pneumonia; pneumonia classification; thorax image; CNN.


I. INTRODUCTION
Pneumonia is an acute respiratory infection of the alveoli caused by bacteria, viruses, fungi, or parasites [1]. The disease can cause high fever, shortness of breath or difficulty breathing, and pallor when the patient lacks oxygen. Many factors contribute to pneumonia, namely individual condition, lifestyle, and, most importantly, the state of the environment. Pneumonia is the biggest cause of infant mortality in Indonesia, at 20% (Riskesdas, 2019). The highest incidence of pneumonia occurs in the 56-65 age group. This is due to age-related changes in anatomy and physiology that markedly reduce pulmonary functional reserve and endurance. Pneumonia is generally diagnosed clinically, from physical symptoms, by a doctor. A further test is a thorax X-ray, which images the parts of the lungs that show abnormalities. A chest X-ray examination is an imaging test that uses a type of electromagnetic radiation [2]. The examination produces an image, also called a thorax image, that displays the human internal organs, especially those in the chest cavity.
The thorax image is an efficient means of revealing pathological changes [3]. Any change to the organs in the chest cavity, especially the lungs, calls for a more thorough examination. However, manual observation of the images is less effective for determining a diagnosis because visual analysis of the organ or object is difficult. The difficulty stems from several factors, such as the high visual similarity between some pathologies [4], sensor noise, electronic interference, and patient positioning, which can alter the appearance of the thorax image [5]. Such complex reasoning requires careful observation and specialist knowledge of anatomy, physiology, and pathology [6]. As a result, it takes a long time for medical personnel or a doctor to diagnose the patient's illness.
Given these problems, a technique is needed that helps doctors examine an abnormality quickly and precisely. Digital image processing is a product of the development of computing technology. Computer-based medical image processing, or a system that can classify thorax images into certain classes, will greatly assist doctors in diagnosing disease and observing the patient's condition. Machine learning can be used to solve this problem. Machine learning applies the concept of computers that learn from data [7]. The method used to process the data in this study is the Convolutional Neural Network (CNN). CNN is a development of the Multi-Layer Perceptron (MLP) and belongs to the Deep Neural Network family because of its network depth; it is one of the most widely used neural network models for image processing [8].
Classification aims to find a model or function that explains and characterizes a concept or class for a particular purpose [9]. Several popular CNN architectures are widely used for classification, including AlexNet, DenseNet, GoogLeNet, VGGNet, and ResNet. Several previous studies have applied these models to classification problems. Wang et al. [10] compared four CNN models, namely AlexNet, GoogLeNet, VGGNet, and ResNet, for the classification of 14 pathology classes on ChestX-ray images. Varshni et al. [11] studied pneumonia detection using the DenseNet CNN model and reported 75% accuracy. The ResNet CNN model has been used for pneumonia detection on X-ray images with an accuracy of 94.23% [12].
However, the accuracy of the CNN architectures used in previous studies still needs to be improved. In this study, several techniques are used to improve the accuracy of a CNN architecture. One of them is Adaptive Moment Estimation (Adam), an optimization algorithm with adaptive learning rates [13]. The CNN model to be optimized is the ResNet architecture because in [10] and [11] the ResNet architecture model shows the highest average accuracy.
This study aims to classify thorax images using a CNN optimized with Adam. Classification is performed on chest images with two classes, pneumonia and normal. Two standard CNN architectures (AlexNet and ResNet) are used as a comparison to assess the performance of the optimized ResNet architecture.

II. RESEARCH METHODOLOGY
The research framework in Figure 1 explains the workflow of this study. In outline, the research consists of three stages: image pre-processing, the learning process, and classification with Adam optimization.
The data set used in this study consists of 2,200 digital chest X-ray images that have been diagnosed by a doctor. The training data consist of 1,760 samples: 880 normal thorax images and 880 pneumonia thorax images. The testing data consist of 440 samples: 220 normal thorax images and 220 pneumonia thorax images. The distribution of the data set by condition label is presented in Table I.
The thorax images are obtained as raw digital data, so an image pre-processing stage is needed to prepare the data set before the learning process with the CNN. The pre-processing stages in this study are scaling, grayscaling, and contrast stretching, as presented in Figure 2. Scaling adjusts the pixel dimensions of the image; the more pixels an image has, the longer it takes to process. After scaling, the image is converted to grayscale so that all images to be processed share the same gray-level representation. The last stage is contrast stretching, which aims to produce an image with better contrast than the original. Contrast stretching is a point operation on the original image; in other words, the process depends only on the intensity value (gray level) of a single pixel and not on the pixels around it. It works by widening the dynamic range of gray levels in the image being processed. Image pre-processing modeling is presented in Figure 3.
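As an illustration of the contrast stretching stage, the following is a minimal Python sketch under stated assumptions. The study itself was implemented in Matlab and does not give its stretching formula, so the linear min-max mapping, the function name contrast_stretch, and the 0-255 output range are assumptions made here for illustration only.

```python
def contrast_stretch(gray_image, out_min=0, out_max=255):
    """Linearly map the gray levels of an image onto [out_min, out_max].

    gray_image is a list of rows of integer gray levels. The min-max
    mapping is an assumed variant of contrast stretching, not
    necessarily the one used in the study.
    """
    pixels = [p for row in gray_image for p in row]
    in_min, in_max = min(pixels), max(pixels)
    if in_max == in_min:
        # Flat image: there is no range to stretch, return a copy.
        return [row[:] for row in gray_image]
    scale = (out_max - out_min) / (in_max - in_min)
    # Each output pixel is computed from its own gray level only.
    return [[round((p - in_min) * scale + out_min) for p in row]
            for row in gray_image]


# Hypothetical low-contrast 2 x 2 image with gray levels 100..150.
low_contrast = [[100, 110], [120, 150]]
stretched = contrast_stretch(low_contrast)
print(stretched)  # [[0, 51], [102, 255]]
```

The narrow input range 100 to 150 is spread over the full 0 to 255 range, which is the widening of the gray-level dynamic range described above.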
The images that enter the training process are those whose quality has been improved by pre-processing. The training model for feature extraction is built on the ResNet-50 and AlexNet architectures. Based on the ResNet-50 architecture modeling in Figure 4, the first step in the CNN architecture is the convolution layer, which applies kernels (filters) of a certain size. The features produced for each object depend on the number of kernels used. This part of the CNN architecture is also called feature learning. Feature learning translates an input image into features, represented as numbers in a vector, based on the characteristics of the input. For example, in the ResNet-50 architecture the first convolution layer uses 64 kernels, which produces 64 feature maps, as shown in Figure 6.
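The relation between kernels and feature maps can be sketched in a few lines of Python. This is an illustrative sketch, not the study's Matlab implementation: the function convolve2d, the tiny image, and both hand-made kernels are hypothetical, and the sliding-window sum is computed without flipping the kernel, as CNN libraries commonly do.

```python
def convolve2d(image, kernel):
    """Slide a kernel over a 2D image (no padding, stride 1).

    Returns one feature map: each entry is the sum of element-wise
    products between the kernel and the image patch under it.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Hypothetical 3 x 4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Two hypothetical kernels; in a CNN these weights are learned.
kernels = [
    [[-1, 1], [-1, 1]],  # responds to vertical edges
    [[1, 1], [1, 1]],    # sums the patch
]
# One feature map per kernel, so 64 kernels would give 64 feature maps.
feature_maps = [convolve2d(image, k) for k in kernels]
print(feature_maps[0])  # [[0, 18, 0], [0, 18, 0]]
```

The first feature map is non-zero only where the edge lies, which shows how each kernel extracts a different feature from the same input.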
In this study, ResNet-50 is chosen as the standard architecture because it is a deep CNN classification architecture whose residual network design uses skip connections, or bypass paths, to jump over several layers [14]. These connections help avoid the vanishing gradient problem by reusing activations from an earlier layer until the adjacent layer has learned its weights. ResNet-50 also performs well in several image classification cases [10]. ResNet-50 architecture modeling is presented in Figure 4. The AlexNet architecture is used as a comparison for the standard ResNet-50 architecture; its modeling is presented in Figure 5.
In the training phase, convolution is first carried out on the input image, and the result is 64 feature maps, one per filter. The convolution operation is presented in equation (1). For an input image of size W1 x H1 x D1, the output of the layer is a new "image" of size W2 x H2 x D2,

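$$W_2 = \frac{W_1 - F + 2P}{S} + 1, \qquad H_2 = \frac{H_1 - F + 2P}{S} + 1, \qquad D_2 = K \tag{1}$$
where K is the number of filters, F is the spatial size of the filter (width/height), S is the stride, or the size of the filter shift in the convolution, and P is the padding, the number of zeros added to the edges of the image.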
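Equation (1) can be checked numerically with a short Python sketch. The function name conv_output_size is hypothetical, and the example values (a 224 x 224 x 3 input with F = 7, S = 2, P = 3, K = 64, as in the commonly published ResNet-50 first layer) are assumptions; the paper does not state the input size it used. Floor division is used when the stride does not divide evenly, which is what deep learning frameworks commonly do.

```python
def conv_output_size(w1, h1, f, s, p, k):
    """Return (W2, H2, D2) of a convolution layer following equation (1).

    w1, h1: input width and height; f: filter size; s: stride;
    p: zero padding per edge; k: number of filters.
    """
    w2 = (w1 - f + 2 * p) // s + 1  # floor division, assumed convention
    h2 = (h1 - f + 2 * p) // s + 1
    d2 = k                          # one output channel per filter
    return w2, h2, d2


# Assumed example: 224 x 224 input, 7 x 7 filters, stride 2, padding 3.
print(conv_output_size(224, 224, f=7, s=2, p=3, k=64))  # (112, 112, 64)
```

With these assumed values the spatial size halves while the depth becomes the number of filters, matching D2 = K.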
At this stage, the model has not yet entered the residual modules. Once the stage is complete, the image enters the pooling layer, which is responsible for reducing the resolution of the processed image and also serves to reduce noise in the image. There are two types of pooling, namely max pooling and average pooling; the operation is illustrated in Figure 7. In the residual stage, three residual modules are stacked on top of each other, and each module learns 64 filters per convolution. The spatial dimensions of the image are then reduced and another three residual modules are stacked, each learning 128 filters. The next step stacks three more residual modules, each learning 256 filters. In the last step, the spatial dimensions are reduced again and three residual modules are stacked, each learning 512 filters. The outputs of these filters pass through average pooling and enter the fully connected layer, whose softmax activation function yields the classification result.
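The two pooling types mentioned above can be compared on a small example. The following Python sketch is illustrative only: the function pool2d, the 2 x 2 window, and the sample feature map are assumptions, since the paper does not specify the pooling window it used.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping size x size pooling over a 2D list (stride = size)."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            if mode == "max":
                row.append(max(window))                # keep the strongest response
            else:
                row.append(sum(window) / len(window))  # keep the mean response
        pooled.append(row)
    return pooled


# Hypothetical 4 x 4 feature map.
fmap = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]
max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="avg")
print(max_pooled)  # [[7, 8], [9, 6]]
print(avg_pooled)  # [[4.0, 5.0], [4.5, 3.0]]
```

Both variants reduce the 4 x 4 map to 2 x 2, which is the resolution reduction the pooling layer is responsible for.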
In the fully connected stage, classification is carried out according to the declared classes. At this stage, all neurons in the layer are connected to the outputs of the previous layer. Softmax converts the output of the last fully connected layer into a probability distribution over the classes. Its advantage is that each output value lies between 0 and 1 and all the probabilities sum to one.
The ResNet-50 classification is optimized using Adam (adaptive moments). Adam combines RMSProp and momentum, with several important differences. First, momentum is incorporated directly as an estimate of the first-order moment (with exponential weighting) of the gradient. Second, Adam applies bias corrections to the estimates of the first-order moment (the momentum term) and the (uncentered) second-order moment to account for their initialization at the origin. Choosing the number of optimization epochs appropriately increases accuracy and decreases the loss on the learning data, even with the same CNN.
Testing uses K-fold cross-validation. The data are divided into k subsets, where k is the number of folds. Each subset in turn serves as test data for the classifier trained on the other k-1 subsets. This study uses k = 4, so the 2,200 images are divided into 4 blocks, each trained for the same number of epochs, namely 15. Each block serves as test data once and as training data 3 times (k-1).
Model evaluation measures the accuracy of the CNN architecture in classification. The overall classification performance of the CNN architecture was evaluated using a confusion matrix to obtain accuracy, precision, and recall values. The research framework was implemented in Matlab 2018a.
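The Adam update described above, with its two moment estimates and their bias corrections, can be sketched for a single scalar parameter. This is a minimal Python sketch and not the Matlab training code used in the study: the function adam_step, the default hyperparameters (the commonly published Adam defaults), the learning rate of 0.05, and the toy objective f(x) = (x - 3)^2 are all assumptions chosen for illustration.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter.

    m, v: running first and second moment estimates (start at 0).
    t: 1-based step counter used for the bias corrections.
    """
    m = beta1 * m + (1 - beta1) * grad         # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * grad * grad  # second moment (uncentered)
    m_hat = m / (1 - beta1 ** t)               # correct the bias from m = 0 at start
    v_hat = v / (1 - beta2 ** t)               # correct the bias from v = 0 at start
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy problem (assumed): minimize f(x) = (x - 3)^2, whose gradient is 2 (x - 3).
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2 * (x - 3)
    x, m, v = adam_step(x, grad, m, v, t, lr=0.05)
print(round(x, 2))
```

Without the bias corrections, the first steps would be scaled by moment estimates that are still close to their zero initialization; with them, the first step has a magnitude close to the learning rate regardless of the gradient scale.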

III. RESULT AND DISCUSSION
The training process uses 2,200 thorax images of patients at the Jemursari Islamic Hospital in Surabaya who had been diagnosed by a doctor. The data are split into 1,760 training samples (880 normal thorax images and 880 pneumonia thorax images) and 440 testing samples (220 normal thorax images and 220 pneumonia thorax images). In the first stage, the learning data are processed by the CNN with the standard ResNet-50 and AlexNet architecture models.
The ResNet-50 architecture produces an average class precision of 94.1% and an average class recall of 94.3%. As shown in Table II, the accuracy obtained with the ResNet-50 architecture model is 94.1%, with a computation time of 35 minutes 40 seconds.
A comparison of the accuracy of the ResNet-50 and AlexNet architecture models is presented in Table IV. The AlexNet architecture model gives better results than ResNet-50, reaching an accuracy of 96.8% in 10 minutes 15 seconds. This is not in line with [12], which reports that ResNet-50 is superior to AlexNet, with an accuracy of 94.23% for ResNet-50 against 92.86% for AlexNet. That study classified X-ray images into two classes, pneumonia and normal. The ResNet-50 architecture model achieves good accuracy because it is deep and uses residual blocks, which avoid the vanishing gradient problem at greater network depths and can thereby improve accuracy. In comparison, AlexNet is a shallow network with eight layers. AlexNet has the advantage of shorter computation time because it can work on two different GPUs. The ResNet-50 classification results are lower than those of AlexNet, as shown in Table IV. This is in line with [10]; a possible cause is that ResNet-50 is less accurate on classes with very large visual variation and a small amount of data. In that study, for the "mass" pathology class with a total of 2,139 images, ResNet-50 has lower accuracy than AlexNet. Therefore, in this research the ResNet-50 architecture model is optimized using the Adam algorithm, comparing several epoch counts to obtain maximum accuracy. Adam is used because its bias correction provides a better approximation than other optimization algorithms.
The second stage applies optimization to the standard ResNet-50 architecture model. Here, applying optimization means determining the number of epochs used during learning while taking the computation time into account. Choosing the optimal number of epochs yields the ideal optimization for this problem. To find the best number of epochs, optimization is tested with 5, 10, and 15 epochs. The first test uses the Adam optimization algorithm, while the second test, as a comparison, uses another optimization algorithm, namely RMSProp.
The results of the first test on the ResNet-50 architecture model with Adam optimization are presented in Table V. The best result is obtained with 15 epochs in scheme 1: an accuracy of 97.73% with a computation time of 64 minutes 35 seconds. Computation time increases with the number of epochs, although the accuracy does not always increase.
The results of the second test on the ResNet-50 architecture model with RMSProp optimization are presented in Table VI. The best result, with 15 epochs in scheme 3, reaches 97.0% accuracy with a computation time of 56 minutes 31 seconds. This is still below the accuracy of the ResNet-50 architecture model with Adam optimization. The confusion matrix of the ResNet-50 architecture model using Adam optimization with the best number of epochs is presented in Table VII.
Based on the evaluation of the first and second optimization tests using K-fold cross-validation with k = 4, Adam optimization improves the accuracy of the ResNet-50 architecture model when combined with the right number of epochs, although the computation time is longer. This is in line with [15], which states that Adam optimization can improve accuracy because its bias correction provides better approximations. The ResNet-50 architecture model with Adam optimization improves the accuracy of thorax image classification by 3.63 percentage points over the 94.1% accuracy of the standard ResNet-50 architecture model.
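The K-fold scenario with k = 4 on 2,200 images can be sketched as follows. This Python sketch is an assumption-laden illustration: the function k_fold_indices is hypothetical, and the folds are taken as contiguous index blocks without shuffling or class stratification, because the paper does not state how the folds were drawn.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    fold_size = n_samples // k
    indices = list(range(n_samples))
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder when n_samples % k != 0.
        stop = n_samples if fold == k - 1 else start + fold_size
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx


# 2,200 images and k = 4, as in the study.
splits = list(k_fold_indices(2200, 4))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold, len(train_idx), len(test_idx))  # 1650 training, 550 test per fold
```

Each image index appears in exactly one test block and in the training data of the other three folds, which matches the description of every block being test data once and training data k-1 times.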
Based on Table VII, the precision and recall results meet the desired criterion: the classification in this study should produce false positives rather than false negatives, or at least equal numbers of both. The reason is that, when deciding between pneumonia and normal, it is preferable to classify a normal image as pneumonia than to miss a pneumonia case, so that clinical treatment or further examination is given.
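The relation between the confusion matrix and the three reported metrics can be made concrete with a short Python sketch. The function binary_metrics is hypothetical and the counts below are invented for illustration; they are not the values of Table VII. Pneumonia is treated as the positive class, which is an assumption consistent with the discussion above.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion matrix counts.

    tp: pneumonia predicted as pneumonia; fp: normal predicted as pneumonia;
    fn: pneumonia predicted as normal;   tn: normal predicted as normal.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts (not from the paper): more false positives than
# false negatives, the direction the study prefers.
metrics = binary_metrics(tp=90, fp=10, fn=5, tn=95)
print(metrics)
```

With these made-up counts recall (about 0.947) exceeds precision (0.9), which is the pattern expected when false negatives are rarer than false positives.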
A comparison of the confusion matrices of the standard architecture models (ResNet-50 and AlexNet) and the best optimization method (ResNet-50 + Adam) is presented in Table VIII, based on Table II, Table III, and Table VII. From Table VIII, a comparison chart of the classification accuracy of the architecture models is drawn in Figure 8. Figure 8 shows that, during training, all three models (AlexNet, ResNet-50, and ResNet-50 with Adam optimization) experience a significant increase in accuracy at epoch 2. In the following epochs, the accuracy of all networks increases slowly; for ResNet-50 the increase is not significant, and only epochs 11 and 12 reach high accuracy. Its final average accuracy is 94.1%. For AlexNet, the largest increase in accuracy occurs at epoch 3, after which accuracy tends to fluctuate. Nevertheless, AlexNet remains above ResNet-50, with a final average accuracy of 96.8%. ResNet-50 with Adam optimization shows the most consistent improvement in accuracy across epochs among the three models, with an average accuracy of 97.7%. These results indicate that the ResNet-50 architecture model with Adam optimization achieves higher accuracy, and more consistent accuracy in each epoch, than the standard AlexNet and ResNet-50 architecture models.
The test results of the ResNet-50 architecture with Adam optimization are also compared with the results of the architecture models tested in several previous studies, namely [10], [11], and [12], using the same data. The comparison is presented in Table IX.
DOI: http://dx.doi.org/10.25139/inform.v0i1. 2707
TABLE IX
THE COMPARISON RESULT OF RESNET-50+ADAM WITH INCEPTIONV3, AND DENSENET201