About this sample
Words: 2799 | Pages: 6 | 14 min read
Published: Feb 13, 2024
Chronic kidney disease (CKD) is a condition characterized by a gradual loss of kidney function over time. It increases the risk of cardiovascular disease and end-stage renal disease. The approximate prevalence of CKD is 800 per million population (pmp) [1]. In this paper we use a machine learning approach to predict CKD, presenting a comparative analysis of seven machine learning algorithms. The study uses 24 parameters in addition to the class attribute, and 25% of the data set is held out to test the predictions. The data are evaluated using fivefold cross-validation, and the performance of the system is assessed using classification accuracy, the confusion matrix, specificity, and sensitivity.
Chronic kidney disease (CKD) is a permanent reduction in kidney function that can progress to end-stage renal disease (ESRD), requiring either ongoing dialysis or a kidney transplant to maintain life. CKD also affects how many medications are eliminated from the body [2]. In routine practice, a laboratory serum creatinine value is used to estimate kidney function by incorporating it into a formula that estimates the glomerular filtration rate and establishes whether a patient has CKD. CKD is becoming a major threat in developing and underdeveloped countries. Its main causes are diseases such as diabetes and high blood pressure; other risk factors include heart disease, obesity, and a family history of chronic kidney disease. Its treatments, dialysis or kidney transplantation, are very costly, which makes early detection essential. In the United States [3], about 117,000 patients developed ESRD requiring dialysis in 2013, while more than 663,000 prevalent patients were on dialysis; 5.6% of the total medical budget, about $28 billion, was spent on ESRD in 2012. In India, the prevalence of CKD is about 800 per million population and that of ESRD is 150–200 per million population. We consider seven machine learning classifiers: Logistic Regression, Support Vector Machine, K-Nearest Neighbour, Naïve Bayes, Stochastic Gradient Descent classifier, Decision Trees, and Random Forest. Finally, a set of standard performance metrics is used to evaluate each classifier in the computer-aided diagnosis system: the confusion matrix, classification accuracy, specificity, and sensitivity.
Our research uses a CKD dataset [4] that is openly accessible from the UCI machine learning repository. The data set consists of 24 attributes (i.e., predictors) in addition to the binary class attribute. Of the 24 attributes, 11 are numerical, two are categorical with five levels, and the remaining attributes are binary, coded as zero for abnormal instances and one for normal ones. In the class attribute, one codes the presence of CKD and zero its absence. The dataset contains 400 instances: 150 samples without kidney disease (not present) and 250 samples with kidney disease (present). Of these 400 instances, 300 are used to train the classification algorithms and 100 are used to test the resulting models. The attributes in the dataset are age, blood pressure, specific gravity, albumin, sugar, red blood cells, pus cell, pus cell clumps, bacteria, blood glucose random, blood urea, serum creatinine, sodium, potassium, haemoglobin, packed cell volume, white blood cell count, red blood cell count, hypertension, diabetes mellitus, appetite, pedal edema, anaemia, and class.
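The 300/100 train/test partition described above can be sketched as follows. This is a minimal illustration using placeholder records standing in for the 400 UCI instances (the record contents and the fixed seed are assumptions, not part of the original study):

```python
import random

# Placeholder stand-ins for the 400 CKD records from the UCI repository;
# in practice each record would hold the 24 attributes plus the class label.
records = [{"id": i} for i in range(400)]

random.seed(0)          # fixed seed so the split is reproducible
random.shuffle(records)

# 75% (300 instances) for training, 25% (100 instances) for testing
train, test = records[:300], records[300:]
```

Any comparable shuffled split would serve; the key point is that the held-out 100 instances are never seen during training.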
Logistic Regression (LR) is a linear classification model [ref-1]. LR models the conditional distribution of the boolean class label Y given the example X, P(Y|X), and classifies Y as follows:
P(Y=1|X) = 1 / (1 + exp(w_0 + Σ_{i=1}^{n} w_i X_i))
P(Y=0|X) = exp(w_0 + Σ_{i=1}^{n} w_i X_i) / (1 + exp(w_0 + Σ_{i=1}^{n} w_i X_i))
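The two class probabilities above can be computed directly. A minimal sketch, with illustrative weights w_0, w (not fitted to the CKD data):

```python
import math

def p_y1(x, w0, w):
    # P(Y=1|X) = 1 / (1 + exp(w0 + sum_i w_i * x_i)), as in the text
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(z))

def p_y0(x, w0, w):
    # P(Y=0|X) = exp(z) / (1 + exp(z)); the two probabilities sum to 1
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return math.exp(z) / (1.0 + math.exp(z))
```

Classification then simply picks whichever of P(Y=1|X) and P(Y=0|X) is larger.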
For classification problems, the support vector machine (SVM) is a popular data mining method used to predict the category of data [ref-2]. The main idea of SVM is to find the optimal hyperplane separating the two classes in the training data; SVM finds this hyperplane by solving an optimization problem.
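Once the optimal hyperplane (w, b) has been found, prediction reduces to checking which side of it a point falls on. A sketch of that decision rule, assuming an already-trained hyperplane (the training optimization itself is beyond this snippet):

```python
def svm_predict(x, w, b):
    # Linear SVM decision rule: the sign of w·x + b decides the class
    # (+1 on one side of the hyperplane, -1 on the other).
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1
```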
K-nearest neighbours (KNN) is a classification method that classifies unknown examples by searching for the closest data in the pattern space [ref-3]. KNN predicts the class using the Euclidean distance, defined as follows:
d(x, y) = √( Σ_{i=1}^{k} (x_i − y_i)² )
The Euclidean distance d(x, y) is used to find the k closest examples in the pattern space, and the class of the unknown example is identified by a majority vote among its neighbours.
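The distance-plus-majority-vote procedure above can be sketched in a few lines (the toy training points are illustrative, not CKD data):

```python
import math
from collections import Counter

def euclidean(x, y):
    # d(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, label) pairs.
    # Sort by distance to the query, keep the k nearest, and
    # return the majority label among them.
    neighbours = sorted(train, key=lambda p: euclidean(p[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```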
Naïve Bayes classifiers are probabilistic classifiers based on Bayes' theorem [ref-4]. In Naïve Bayes, each feature is treated as independent of the others, and each contributes independently to the probability; the higher the probability, the higher the chance that a data point belongs to that class. The algorithm uses maximum likelihood estimation for prediction. It is fast and can be used for real-time predictions such as sentiment analysis.
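For the binary attributes in the CKD data, a Bernoulli variant of this idea is a natural fit. A minimal sketch (with Laplace smoothing, which the paper does not specify but which keeps unseen values from zeroing the product):

```python
def train_nb(X, y):
    # Bernoulli Naive Bayes: estimate the class prior P(c) and the
    # per-feature likelihood P(x_i = 1 | c) from binary training data.
    classes = set(y)
    priors, likelihoods = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        priors[c] = len(rows) / len(y)
        likelihoods[c] = [
            (sum(r[i] for r in rows) + 1) / (len(rows) + 2)  # Laplace smoothing
            for i in range(len(X[0]))
        ]
    return priors, likelihoods

def predict_nb(priors, likelihoods, x):
    # Pick the class maximising P(c) * prod_i P(x_i | c),
    # i.e. the naive-independence form of Bayes' theorem.
    best, best_p = None, -1.0
    for c, prior in priors.items():
        p = prior
        for i, xi in enumerate(x):
            p *= likelihoods[c][i] if xi else 1 - likelihoods[c][i]
        if p > best_p:
            best, best_p = c, p
    return best
```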
The SGD classifier is a logistic regression classifier trained with stochastic gradient descent optimisation. In contrast to batch gradient descent, stochastic gradient descent (SGD) performs a parameter update for each training example x^(i) and label y^(i) [ref-5]:
θ ← θ − η · ∇_θ J(θ; x^(i); y^(i))
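One such per-example update can be sketched as follows. This sketch assumes the conventional parameterization P(Y=1|X) = sigmoid(θ·x) with logistic loss; the learning rate η = 0.1 is an illustrative choice:

```python
import math

def sgd_step(theta, x, y, eta=0.1):
    # One stochastic gradient descent update on a single example (x, y):
    # theta <- theta - eta * grad_theta J(theta; x, y) for logistic loss.
    z = sum(t * xi for t, xi in zip(theta, x))
    pred = 1.0 / (1.0 + math.exp(-z))        # sigmoid(theta · x)
    grad = [(pred - y) * xi for xi in x]     # gradient of the log-loss
    return [t - eta * g for t, g in zip(theta, grad)]
```

Iterating this step over shuffled training examples is what distinguishes SGD from batch gradient descent, which averages the gradient over the whole training set before each update.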
A decision tree is a classification method frequently used in data mining tasks [ref-6]. It is a structure consisting of a root node, branches, and leaf nodes that divides the data into classes based on the attribute values found in the training samples.
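The attribute-based splitting described above can be illustrated in its simplest form, a depth-1 tree (decision stump) that tries every observed value of one attribute as a threshold and keeps the split with the fewest training errors. This is a sketch of the splitting idea, not the full recursive algorithm:

```python
def fit_stump(X, y, feature):
    # Try each observed value of one attribute as a threshold and keep
    # the (threshold, left_label, right_label) split that misclassifies
    # the fewest training samples.
    best = None
    for t in sorted({x[feature] for x in X}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if x[feature] <= t else right) != label
                for x, label in zip(X, y)
            )
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    return best[1:]

def stump_predict(x, feature, threshold, left_label, right_label):
    # Route the sample down the left or right branch of the split.
    return left_label if x[feature] <= threshold else right_label
```

A full decision tree applies this search recursively, choosing the best attribute at each node until the leaves are (nearly) pure.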
Random Forest (RF) is an ensemble classifier consisting of a collection of tree-structured classifiers h(x, y_k), where each tree is grown from a random vector sampled independently and with the same distribution for all trees in the forest. Randomization is introduced by randomly selecting the input attributes used to grow each individual base decision tree [ref-7]. Random forests differ from other tree-based methods in that a modified tree-learning algorithm chooses, at each candidate split, a random subset of the features. The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are strong predictors of the response variable (target output), those features will be selected in many of the decision trees, causing the trees to become correlated.
The confusion matrix is a performance measurement for machine learning classification problems where the output can be two or more classes. For binary classification it is a table with four combinations of predicted and actual values.
| Predicted Negative | Predicted Positive |
Negative cases | TN | FP |
Positive cases | FN | TP |
Table 1. Confusion matrix (CM)
Then we may define some evaluation measures.
(A) Accuracy = (TN + TP) / (TN + TP + FN + FP)
(R) Recall (Sensitivity) = TP / (TP + FN)
(S) Specificity = TN / (TN + FP)
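The three measures above translate directly into code. A minimal sketch, checked against the logistic regression counts reported in Table 2 (TN = 38, FP = 0, FN = 0, TP = 62):

```python
def accuracy(tn, fp, fn, tp):
    # (A) fraction of all predictions that are correct
    return (tn + tp) / (tn + tp + fn + fp)

def sensitivity(tp, fn):
    # (R) recall / true positive rate: correctly detected CKD cases
    return tp / (tp + fn)

def specificity(tn, fp):
    # (S) true negative rate: correctly cleared non-CKD cases
    return tn / (tn + fp)

print(accuracy(38, 0, 0, 62))  # → 1.0 for the logistic regression matrix
```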
The results of seven machine learning algorithms are compared in the experiments. All machine learning techniques are trained and tested by the proposed method. The confusion matrix of each algorithm is shown in Table 2.
Model | Actual Not CKD | Actual CKD |
Logistic Regression | 38 (TN) 0 (FP) | 0 (FN) 62 (TP) |
Support Vector Machine | 36 (TN) 2 (FP) | 2 (FN) 60 (TP) |
K-Nearest Neighbor | 38 (TN) 0 (FP) | 2 (FN) 60 (TP) |
Naïve Bayes | 38 (TN) 0 (FP) | 3 (FN) 59 (TP) |
Stochastic Gradient Descent | 38 (TN) 0 (FP) | 0 (FN) 62 (TP) |
Decision Trees | 38 (TN) 0 (FP) | 3 (FN) 59 (TP) |
Random Forest | 38 (TN) 0 (FP) | 0 (FN) 62 (TP) |
Table 2. Confusion matrices of all the algorithms.
Figure 2 shows the average accuracy of the seven classifiers. From the experimental results, it can be seen that the logistic regression, random forest, and SGD classifiers give the highest accuracy, at 100%, while the decision tree, SVM, naïve Bayes, and KNN classifiers produce average accuracies of 97.0%, 96.0%, 97.0%, and 98.0% respectively. The accuracy on each class is also important, because an incorrect prediction may be a detriment to the patient; therefore, the sensitivity and specificity values are used in the experiments to evaluate the performance of the proposed methods.
We trained seven different machine learning models to predict the presence of chronic kidney disease. Of all the models compared, logistic regression, the SGD classifier, and random forest provide the best results; they surpass the other classifiers and detect chronic kidney disease more precisely. If these models are trained on a more varied and extensive range of attributes, they may yield more accurate predictions, and the results would be more reliable with a larger dataset. Hospitals and diagnostic centres can use this approach for faster, digitized analysis when predicting chronic kidney disease.