Design of Expert System for Digestive Diseases Identification Using Naïve Bayes Methodology for iOS-Based Application

Digestive diseases often present with similar symptoms, so the disease suspected before a diagnostic workup may differ from the disease that is eventually diagnosed. This paper presents the design of an expert system for digestive disease identification using the Naïve Bayes methodology for iOS-based applications. The system is intended to help medical interns predict a patient's suspected digestive disease more accurately. A precise prediction of the suspected disease can minimize unnecessary diagnostic attempts, which saves time and reduces cost. Naïve Bayes was chosen because it has been reported to achieve a higher accuracy level than other classification methods. The research comprises collecting data on digestive diseases and their symptoms through literature review, processing the data into a knowledge base for the expert system, and calculating disease probabilities with Naïve Bayes in the designed expert system application. In the accuracy test, the suspected disease produced by the expert system application was correct in 84% of cases (21 out of 25), compared with 72% (18 out of 25) for the internist's estimate.
Keywords: digestive disease; expert system; naïve Bayes; iOS; knowledge-based


I. INTRODUCTION
Symptoms such as nausea, vomiting, diarrhea, and flatulence are common in digestive diseases [1]. They may appear in patients who suffer from Crohn's disease, gallstone, irritable bowel syndrome (IBS), gastritis, gastroesophageal reflux disease (GERD), peptic ulcer, or ulcerative colitis. Nausea and vomiting are likely to appear in Crohn's disease, gallstone, IBS, gastritis, and GERD [1], while diarrhea and flatulence are likely to appear in three or more types of digestive disease [1]. Because of this overlap, the initially suspected disease may turn out to differ from the digestive disease the patient actually has; making a clinical diagnosis is a great challenge that involves complex, fast, and accurate decision-making [2]. A wrong initial suspicion may lead to unnecessary diagnostic procedures. Hence, an expert system is needed to help medical interns determine the suspected disease at an early stage of examination so that the diagnosis can proceed effectively. With the help of an expert system, imprecise medical diagnoses can be minimized, which also saves time and reduces costs.
Expert systems are a branch of Artificial Intelligence (AI) [3], in the form of a system that adopts an expert's ability to solve problems [4]. An expert is someone who has proficient skills and knowledge in a certain field that laypeople do not have [3]. The function of an expert system is to solve a problem or to serve as a support tool in the decision-making process [5].
Various expert systems have been developed and implemented in medical fields since 1961 for different healthcare purposes [2]. It has been claimed that both logic rules (set theory and Boolean algebra) and inference calculations (Bayes' rule) are required to help doctors model diagnosis processes [2]. The expert system application designed in this research identifies the digestive disease a patient is suspected to have. The digestive diseases included in the research scope are irritable bowel syndrome (IBS), ulcerative colitis, Crohn's disease, peptic ulcer, gallstone, gastroesophageal reflux disease (GERD), and gastritis.
II. RESEARCH METHODOLOGY
The research started with a literature review to discover problems and define the research aim. The problem identified was that imprecise diagnosis attempts may occur for digestive diseases because patients show similar symptoms. A difference between the initially suspected digestive disease and the disease the patient actually has may lead to unnecessary diagnostic procedures. Hence, this research aims to build an expert system application that helps medical interns determine a patient's suspected digestive disease more precisely using probability calculation. Details of the research methodology are shown in Figure 1 in the form of a flowchart.

A. Datasets
The dataset used as the knowledge base of the designed expert system application consists of relations between symptoms and the seven digestive diseases included in the scope, which are:
• Irritable Bowel Syndrome (IBS), a functional gastrointestinal disorder that affects 10-20% of the population, characterized by pain or discomfort in the abdomen accompanied by altered bowel habits [6].
• Ulcerative colitis, a chronic inflammatory disorder of the colorectum [7].
• Crohn's disease, a chronic inflammatory disease of the gastrointestinal tract that is progressive and may cause bowel damage and disability [8].
• Peptic ulcer, an acid peptic injury of the digestive tract that leads to a mucosal break reaching the submucosa [9].
• Gallstone, a disease affecting 10-15% of the population in developed countries, in which solid material, commonly consisting of cholesterol, forms in the gallbladder [10].
• Gastroesophageal reflux disease (GERD), a chronic condition in which gastric content refluxes into the oesophagus, mostly causing heartburn [11].
Each record contains a disease term, a symptom term, and the frequency with which both terms appear together in PubMed citations. PubMed is a database of citations and abstracts developed and maintained by the United States National Library of Medicine at the National Institutes of Health, and it provides more than 30 million citations for biomedical literature and life science journals. PubMed is free and can be accessed through its official website at https://pubmed.ncbi.nlm.nih.gov/.
The file retrieved from Supplementary Data 3 was then filtered, resulting in 573 records that associate 188 symptom terms with the seven digestive diseases included in the research scope. The 573 records obtained from the filtering process can be seen in the attached Supplementary Data A file.
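The filtering step can be sketched as follows. This is a minimal illustration, not the procedure actually used: the record layout (symptom, disease, occurrences), the disease labels (taken from Table III rather than from the source file), and the out-of-scope sample record are assumptions made for the example.

```python
# Disease labels as they appear in Table III; the labels used in the
# source file may differ (assumption).
IN_SCOPE_DISEASES = {
    "Crohn's disease", "Gallstone", "Gastritis", "GERD",
    "IBS", "Peptic ulcer", "Ulcerative colitis",
}


def filter_records(records, diseases=IN_SCOPE_DISEASES):
    """Keep only (symptom, disease, occurrences) records whose disease is in scope."""
    return [record for record in records if record[1] in diseases]


# Three records with counts from Table III plus one made-up placeholder
# record that is outside the research scope.
sample_records = [
    ("Nausea", "Crohn's disease", 24),
    ("Vomiting", "GERD", 180),
    ("Chest pain", "GERD", 226),
    ("Placeholder symptom", "Out-of-scope disease", 1),  # hypothetical
]

filtered = filter_records(sample_records)
symptom_terms = {symptom for symptom, _, _ in filtered}
print(len(filtered), "records,", len(symptom_terms), "symptom terms")
```

Applied to the full source file, the same kind of filter would yield the 573 records and 188 symptom terms reported above.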
An expert system has two crucial parts: the knowledge base, which contains an expert's knowledge, and the inference engine, which stores the expert's concept of reasoning [13]. To obtain the right solution, users need to explain the problem to the expert system [13]. Hence, the designed expert system application has the following workflow:
• The Application shows options of symptoms related to the digestive diseases based on the dataset.
• The medical intern (user) selects the patient's symptoms from the shown symptom options.
• The Application calculates the probability based on the selected symptoms using the Multinomial Naïve Bayes method and shows the suspected disease the patient might have.
• The calculation result can be saved as patient data in the Application.
The designed expert system application was developed using the Swift programming language in the Xcode Integrated Development Environment. The resulting Application then went through two stages of tests, the accuracy test and the user acceptance test. Lastly, conclusions were drawn. The resulting Application runs on iOS-based devices and can be downloaded through the App Store.

B. Multinomial Naïve Bayes Method
The probability calculation used in the designed expert system application is based on the Multinomial Naïve Bayes method. Multinomial Naïve Bayes was chosen because it performs adequately in text classification [14]. It calculates probability based on the frequency with which words appear in a document [15]. The general Multinomial Naïve Bayes equation is shown by Equation (1). The prior probability of class c, P(c), is obtained through the calculation shown by Equation (2).

$$P(c) = \frac{N_c}{N} \tag{2}$$
where:
$N_c$: Number of class c documents
$N$: Total documents
The probability of term i in class c, $P(w_i \mid c)$, is obtained through the calculation shown by Equation (3).
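For orientation, Multinomial Naïve Bayes with add-one (Laplace) smoothing is commonly written as below. This is a sketch of the standard textbook form, assumed here to correspond to Equations (1) and (3); the symbols $\mathrm{count}(w_i, c)$, $\mathrm{count}(c)$, and $|V|$ follow the quantities computed in Section III.

```latex
% Standard form, assumed to correspond to Equation (1):
% the posterior of class c given terms w_1, ..., w_n
P(c \mid w_1, \dots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)

% Standard add-one smoothed term probability, assumed to correspond to Equation (3)
P(w_i \mid c) = \frac{\mathrm{count}(w_i, c) + 1}{\mathrm{count}(c) + |V|}
```

Here $\mathrm{count}(w_i, c)$ is the number of occurrences of term $w_i$ in class c, $\mathrm{count}(c)$ is the number of words in class c, and $|V|$ is the number of unique words. The added 1 keeps a term with zero occurrences in a class from forcing the whole product to zero.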

C. Business Process
The Application shows symptoms related to the seven digestive diseases in the research scope. A medical intern, as the application user, selects the symptoms that the patient has. The Application calculates probabilities based on the selected symptoms. The suspected disease from the calculation result can then be saved as patient data, along with the selected symptoms and a supplementary note if needed. The details of the business process are shown in Figure 2 in the form of a flowchart. Users open the Application, and the Application shows the symptom options. Users select from the provided options based on the symptoms shown by the patient. The Application receives the selected symptoms as input and performs the calculation on the stored dataset using Multinomial Naïve Bayes. The calculation results, namely the suspected digestive diseases the patient might have and their probabilities, are then shown to the users as a consideration in making a diagnosis. Users can save the results as patient data in the Application.
To save the results for a new patient, a new patient data form must be filled in with the patient's name, age, and gender. To save the results to an already stored patient record, the user selects the patient's name from the Patient List. The stored results contain the symptoms the patient had, the date on which the calculation was conducted, and the suspected digestive disease the patient might have based on the probability calculation.

III. RESULT AND DISCUSSION
The filtered dataset of 573 records covering 188 symptoms is used as the knowledge base of the designed expert system application to calculate the probability of suspected digestive diseases using Multinomial Naïve Bayes. The dataset used for this research can be seen in the attached Supplementary Data A file. The obtained dataset was adapted for use with Multinomial Naïve Bayes.

A. Prior Probability (P(c))
The prior probability is calculated as $P(c) = N_c / N$, where the number of documents for class c ($N_c$) is the sum of PubMed occurrences in class c, and the total number of documents ($N$) is the sum of PubMed occurrences over all classes. The prior probability for each class is shown in Table I.

B. Number of Words in Class (count(c))
In the dataset used, every document consists of two words: a symptom term and a disease term. The number of words for each class is shown in Table II.

C. Number of Unique Words (|V| )
A word that is not a conjunction is regarded as a unique word. The dataset used in this research does not contain any conjunctions. Hence, the number of unique words is the sum of the number of disease terms and the number of symptom terms. The number of disease terms based on the research scope is 7, and the number of symptom terms after the filtering process is 188. Hence, the number of unique words in the dataset is 7 + 188 = 195.
The following example illustrates the calculation for a patient with symptoms of nausea, vomiting, and chest pain. The data for each symptom are given in Table III.

TABLE III
DATA OF NAUSEA, VOMITING, AND CHEST PAIN IN DATASET (PUBMED OCCURRENCES)

Symptom      Crohn's disease   Gallstone   Gastritis   GERD   IBS   Peptic ulcer   Ulcerative colitis
Nausea       24                15          17          19     7     9              17
Vomiting     22                37          37          180    5     15             13
Chest pain   3                 5           6           226    0     5              4

Probability of Crohn's Disease as the Suspected Disease. The probability of Crohn's disease as the suspected disease of a patient with symptoms of nausea, vomiting, and chest pain is calculated using Multinomial Naïve Bayes.
$P(\text{Crohn's disease} \mid \text{nausea}, \text{vomiting}, \text{chest pain})$
Probability of Gallstone as the Suspected Disease. The probability of gallstone as the suspected disease of a patient with symptoms of nausea, vomiting, and chest pain is calculated using Multinomial Naïve Bayes in the same way.
Based on the seven calculations, gastroesophageal reflux disease (GERD) has the highest probability. Hence, the patient with symptoms of nausea, vomiting, and chest pain might suffer from GERD.
Two tests were conducted during this research, one with an internist as the expert and one with medical interns as users. The first test measured the accuracy of the suspected diseases calculated by the designed Application by comparing them with the internist's data. The second test measured the acceptance of the designed Application by medical interns as its users.
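The worked example can be reproduced in outline with the sketch below. It is a simplified illustration rather than the Application's implementation: it treats the three rows of Table III as if they were the whole dataset, so the priors, word counts, and vocabulary size (3 instead of 195) are assumptions that differ from Tables I and II, and it assumes add-one smoothing. The function name `rank_diseases` is hypothetical.

```python
import math

DISEASES = ["Crohn's disease", "Gallstone", "Gastritis", "GERD",
            "IBS", "Peptic ulcer", "Ulcerative colitis"]

# PubMed occurrences copied from Table III, in the column order of DISEASES.
TABLE_III = {
    "nausea": dict(zip(DISEASES, [24, 15, 17, 19, 7, 9, 17])),
    "vomiting": dict(zip(DISEASES, [22, 37, 37, 180, 5, 15, 13])),
    "chest pain": dict(zip(DISEASES, [3, 5, 6, 226, 0, 5, 4])),
}


def rank_diseases(counts, selected_symptoms):
    """Return (disease, normalized probability) pairs, most probable first."""
    # Assumption: the vocabulary is just the symptom terms present in `counts`.
    vocab_size = len(counts)
    # N_c: summed occurrences per class; N: summed occurrences over all classes.
    class_totals = {d: sum(row[d] for row in counts.values()) for d in DISEASES}
    grand_total = sum(class_totals.values())

    log_scores = {}
    for disease in DISEASES:
        score = math.log(class_totals[disease] / grand_total)  # log P(c)
        for symptom in selected_symptoms:
            # Add-one smoothed P(w_i | c); count(c) is taken as N_c here.
            numerator = counts[symptom][disease] + 1
            denominator = class_totals[disease] + vocab_size
            score += math.log(numerator / denominator)
        log_scores[disease] = score

    # Convert log scores to probabilities that sum to 1.
    highest = max(log_scores.values())
    weights = {d: math.exp(s - highest) for d, s in log_scores.items()}
    total_weight = sum(weights.values())
    ranked = [(d, w / total_weight) for d, w in weights.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


ranking = rank_diseases(TABLE_III, ["nausea", "vomiting", "chest pain"])
for disease, probability in ranking:
    print(f"{disease}: {probability:.3f}")
```

Even under these simplifications GERD ranks first, in line with the result reported above. Working in log space avoids numerical underflow when many symptoms are selected, and the smoothing term keeps the zero count for chest pain in IBS from eliminating that class outright.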

D. Accuracy Test
The accuracy test uses patient symptom data, suspected disease data estimated by an internist, and diagnosis results. The test compares the suspected diseases, both those estimated by the internist and those calculated by the designed Application, with the diagnosis results. The test was conducted with the help of Dr. Didiet Pratignyo, SpPD, FINASIM, an internist at RSUD Kota Cilegon, Banten, Indonesia. The comparison results are shown in Table IV. Based on Table IV, 72% of the suspected diseases estimated by the internist were correct, which is 18 out of 25. In contrast, 84% of the suspected diseases calculated by the designed expert system application were correct, which is 21 out of 25. Hence, in this test, the suspected diseases produced by the designed expert system application were more accurate than the internist's initial estimates.
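The reported percentages follow directly from the counts in Table IV, as the short check below shows. The helper name `accuracy` is chosen for illustration only.

```python
def accuracy(correct, total):
    """Fraction of cases in which the suspected disease matched the diagnosis."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


internist_accuracy = accuracy(18, 25)    # internist's estimates
application_accuracy = accuracy(21, 25)  # Application's calculation results

print(f"Internist: {internist_accuracy:.0%}")
print(f"Application: {application_accuracy:.0%}")
```

With 25 cases, the 12-percentage-point difference corresponds to three cases.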

E. User Acceptance Test
User Acceptance Test (UAT) is the process of ensuring that a solution works for its users by gathering input from those who have experience with the business processes and will be using the designed system [16]. UAT was conducted to measure the level of user acceptance of the designed expert system application. A simple experiment with 10-20 samples can be successful under strict control [17]. The test was done with the help of 13 medical interns, who were asked to carry out certain user scenarios and then fill in the provided questionnaire. The questionnaire uses the Likert scale as its rating system. The Likert scale is a popular five-point scale for measuring attitudes, ranging over strongly agree, agree, neutral, disagree, and strongly disagree [18]. The questionnaire uses a numbered scale from 1 to 5, in which 1 represents strongly disagree and 5 represents strongly agree. The interval for each score was calculated using Equation (4). The total number of points used in the questionnaire for this test is 5. Hence, the interval is 100 / 5 = 20. Details of the score intervals are shown in Table V.

No.   Statement                                                                         Score   Category
      The how-to-use information of the Application is precise and easy to understand   85%     Strongly Agree
4     The user interface is attractive                                                  86%     Strongly Agree
5     The Application is useful                                                         89%     Strongly Agree
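The score interval and the mapping from a percentage score to a Likert category can be sketched as follows. This assumes that the interval is 100 divided by the number of scale points and that the lowest interval maps to Strongly Disagree and the highest to Strongly Agree; the exact boundaries of Table V, including how a score of exactly 80% is classified, are assumptions.

```python
# Category labels ordered from the lowest to the highest interval (assumed order).
LABELS = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]


def score_interval(total_points=5):
    """Width of each score interval in percentage points."""
    return 100 / total_points


def likert_category(percent, total_points=5):
    """Map a percentage score (0-100) to a Likert category label."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if total_points != len(LABELS):
        raise ValueError("labels are defined for a five-point scale only")
    interval = score_interval(total_points)
    # A score of exactly 100 falls into the top interval.
    index = min(int(percent // interval), total_points - 1)
    return LABELS[index]


for score in (85, 86, 89):
    print(score, likert_category(score))
```

Under this mapping, the scores of 85%, 86%, and 89% listed above all fall in the top interval, which matches the Strongly Agree category reported for them.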