close
test_template

A Comparative Analysis of Seven Algorithms Using a Comprehensive Dataset

AI-Generated
download print

About this sample

About this sample

close
AI-Generated

Words: 2799 |

Pages: 6|

14 min read

Published: Feb 13, 2024

Words: 2799|Pages: 6|14 min read

Published: Feb 13, 2024

Table of contents

  1. Methodology
  2. Dataset
    Machine Learning Techniques
  3. Evaluation Parameters
  4. Confusion Matrix
    Result
  5. Conclusion
  6. References

Chronic kidney disease (CKD) is a condition characterized by a gradual loss of kidney function over time. It includes risk of cardiovascular disease and end-stage renal disease. The approximate prevalence of CKD is 800 per million population (pmp) [1]. In this paper we have used a Machine Learning approach for predicting CKD. In this paper we have presented a comparative analysis of 7 different machine learning algorithms. This study starts with 24 parameters in addition to the class attribute and 25% of the data set is used to test the predictions. Data is evaluated by using fivefold cross-validation and performance of the system is assessed using classification accuracy, confusion matrix, specificity matrix and sensitivity matrix.

'Why Violent Video Games Shouldn't Be Banned'?

Chronic kidney disease (CKD) is a permanent reduction in kidney function that can progress to end stage renal disease (ESRD) requiring either ongoing dialysis or a kidney transplant to maintain life. CKD also affects how many medications are eliminated from the body [2]. In routine practice, a laboratory serum creatinine value is used to estimate kidney function by incorporating it into a formula to estimate the glomerular filtration rate and establish whether a patient has CKD. It is becoming a major threat in the developing and undeveloped countries. Its main cause for occurrence is diseases like diabetes, high blood-pressure. Other risking circumstances causing chronic kidney disease include heart disease, obesity, and a family history of chronic kidney disease. Its medications which are dialysis or kidney transplant are very costly and so we need an early detection. In the United States (US) [3], about 117,000 patients developed end-stage renal disease (ESRD) requiring dialysis, while more than 663,000 prevalent patients were on dialysis in 2013. 5.6% of the total medical budget was spent for ERDS in 2012 which is about $28 billion. In India, CKD is widespread among 800 per million populations and ESRD is 150–200 per million populations. We consider seven machine learning classifiers, namely Logistic Regression, Support Vector Machine, K-nearest Neighbour, Naïve Bayes, Stochastic Gradient Descent classifier, Decision Trees Random Forest. Finally, a set of standard performance metrics is used to design the computer-aided diagnosis system for estimating the performance of each machine learning and artificial intelligence classifier. The metrics we used include confusion matrix, classification accuracy, specificity and sensitivity.

Methodology

Dataset

Our research uses a CKD dataset [4], which is openly accessible at UCI machine learning laboratory. The CKD data set consists of 24 attributes (i.e. predictors) in addition to the binary class attribute. Out of 24 attributes, 11 are numerical attributes, two categorical with five levels, while the remaining parameters are binary and been coded as zero for abnormal instances and one for normality. In class attribute one is coded for presence of CKD and zero represents CKD is not present. This dataset contains 400 instances with 150 samples without kidney disease (not present) and 250 samples with kidney disease (present). Total 400 instances of the dataset in which 300 of them are used for the training of classification algorithms and 100 instances are used to test the result of these models. The attributes in the dataset are Age, blood pressure, specific gravity, albumin, sugar, red blood cells, pus cell, pus cell clumps, bacteria, blood glucose random, blood urea, serum creatinine, sodium, potassium, haemoglobin, Packed cell volume, White Blood Cell Count, Red Blood Cell Count, hypertension, Diabetes Mellitus, appetite, Pedal Edema, Anaemia, and Class.

Machine Learning Techniques

  • Logistic Regression

Logistic Regression (LR) is the linear regression model [ref-1]. LR computes the distribution between the example X and boolean class label Y by P(X|Y). Logistic regression classifies boolean class label Y as follows:

P(Y=1│X)=1/(1+exp⁡(w_0+∑_(i=1)^n▒〖w_i X_i)〗)

P(Y=0│X)=1/(1+exp⁡(w_0+∑_(i=1)^n▒〖w_i X_i)〗)

  • Support Vector Machine

For the classification problem, the support vector machine (SVM) is the popular data mining method used to predict the category of data [ref-2]. The main idea of SVM is to find the optimal hyperplane between data of two classes in the training data. SVM finds the hyperplane by solving optimization problem.

  • K-Nearest Neighbors

K-nearest neighbors (KNN) is the classification method for classifying unknown examples by searching the closest data in pattern space [ref-3]. KNN predicts the class by using the Euclidean distance defined as follows:

d(x,y)=√(∑_(i=1)^k▒〖〖(x〗_i-y_i)〗^2 )

The Euclidean distance d(x,y) is used to measure the distance for finding the k closest examples in the pattern space. The class of the unknown example is identified by a majority voting from its neighbors.

  • Naïve Bayes

Naïve Bayes are probabilistic classifiers, which are based on Bayes Theorem [ref-4]. In Naïve Bayes each value is marked independent of the other values and features. Each value contributes independently to the probability. The higher the probabilistic value the higher are the chances of data point belonging to that class or category. Naïve Bayes algorithm uses the concept of Maximum Likelihood for prediction. This algorithm is fast and can be used for making real time predictions such as sentiment analysis.

  • SGD Classifier

It is a Logistic Regression Classifier based on Stochastic Gradient Descent Optimisation.Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example x(i) and label y(i): [ref-5]

θ= θ- η .∇_θ J(θ;x^i;y^i)

  • Decision Tree

Decision tree is the classification method frequently used in data mining task [ref-6]. A decision tree is a structure that includes a root node, branches, and leaf nodes. It divides the data into classes based on the attribute value found in training sample.

  • Random Forest

Random Forest (RF) is a variant of ensemble classifier consisting of a collection of tree-structured classifiers h (x, yk), which is defined as multiple tree predictors yk such that each tree relies upon the estimations of an arbitrary vector inspected independently and with a similar distribution for all trees in the forest. The randomization is done by randomselectionofinput attributes for producing individual base decision trees [ref-7]. Random forests become different in a way from other methods that a modified tree learning algorithm is utilized that chooses the differentiable candidate in the learning procedure, a random subset of the features. The cause for doing this is the relationship of the trees in a standard bootstrap sample. For example, if one or a couple of features are extreme indicators for the response variable (target output), these features will be chosen in a considerable lot of the decision trees, reasoning them to end up correlated.

Evaluation Parameters

Confusion Matrix

It is a performance measurement for machine learning classification problem where output can be two or more classes. It is a table with 4 different combinations of predicted and actual values.

Predicted Negative Predicted Positive
Negative cases TN FP
Positive cases FN TP

Table 1 Confusion Matrix (CM)

Then we may define some evaluation measures.

(A) Accuracy = TN + TP

TN + TP + FN + FP

(R) Recall, Sensitivity = TP

TP + FN

(S) Specificity = TN

TN + FP

Result

The results of seven machine learning algorithms are compared in the experiments. All machine learning techniques are trained and tested by the proposed method. The confusion matrix of each algorithm is shown in Table 2.

Model Not CKD CKD
Logistic Regression 38 (TN) 0 (FP) 0 (FN) 62 (TP)
Support Vector Machine 36 (TN) 2 (FP) 2 (FN) 60 (TP)
K-Nearest Neighbor 38 (TN) 0(FP) 2 (FN) 60 (TP)
Naïve Bayes 38 (TN) 0 (FP) 3 (FN) 59 (TP)
Stochastic Gradient Descent 38 (TN) 0 (FP) 0 (FN) 62 (TP)
Decision Trees 38 (TN) 0 (FP) 3 (FN) 59 (TP)
Random Forest 38 (TN) 0 (FP) 0 (FN) 62 (TP)

Table 2. Confusion matrices of all the algorithms.

Figure 2 shows the average accuracy of seven classifiers. From the experimental results, it can be seen that the logistic regression classifier, random forest and SGD classifier gives the highest accuracy than the others with 100% while Decision Tree, SVM classifier, Naive Bayes and KNN can produce the average accuracy of 97.0%,96.0 % ,97.0% and 97.0% respectively. The accuracy of each class is also important because if the classifier predicts incorrectly, it may be a detriment to the patient. Therefore, the sensitivity and specificity value is used in the experiments for evaluating the performance of the proposed methods.

Get a custom paper now from our expert writers.

Conclusion

We trained 7 seven different models of machine learning to predict the presence of chronic kidney disease. Of all the other models compared, Logistic regression, SGD Classifier and Random forest provide the best results. These have surpassed other classifiers and are able to detect the chronic kidney disease more precisely. If these models are trained using a varied and extensive range of attributes, they may result in more accurate predictions. The results would be more assuring with an increase in dataset. Hospitals and diagnostic centres can use this for faster and digitized analysis for predicting the chronic kidney disease.

References

  1. R. Xi, N. Lin and Y. Chen, “Compression and Aggregation for Logistic Regression Analysis in Data Cubes,” IEEE Transl. Knowledge and Data Engineering, vol. 21, pp. 479 - 492, April 2009.
  2. R. G. Brereton, and G. R. Lloyd, “Support Vector Machines for classification and regression,” Analyst, vol. 135, no. 2, pp. 230-267, 2010.
  3. S. Galit, R. P. Nitin, and C. B. Peter, Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner: Wiley Publishing, 2010.
  4. http://www.statsoft.com/textbook/naive-bayes-classifier
  5. Sebastian Ruder, “An overview of gradient descent optimisation algorithms” arXiv:1609.04747 , June 2017.
  6. J. R. Quinlan, C4.5: programs for machine learning: Morgan Kaufmann Publishers Inc., 1993.
  7. Anbarasi,M.S.,&Janani,V(2017).EnsembleclassifierwithRandomForestalgorithmtodeal with imbalanced healthcare data. In International Conference on Information Communication and Embedded Systems (ICICES) (pp. 1–7), Chennai.
Image of Prof. Linda Burke
This essay was reviewed by
Prof. Linda Burke

Cite this Essay

A Comparative Analysis of Seven Algorithms using a Comprehensive Dataset. (2024, February 13). GradesFixer. Retrieved April 30, 2024, from https://gradesfixer.com/free-essay-examples/a-comparative-analysis-of-seven-algorithms-using-a-comprehensive-dataset/
“A Comparative Analysis of Seven Algorithms using a Comprehensive Dataset.” GradesFixer, 13 Feb. 2024, gradesfixer.com/free-essay-examples/a-comparative-analysis-of-seven-algorithms-using-a-comprehensive-dataset/
A Comparative Analysis of Seven Algorithms using a Comprehensive Dataset. [online]. Available at: <https://gradesfixer.com/free-essay-examples/a-comparative-analysis-of-seven-algorithms-using-a-comprehensive-dataset/> [Accessed 30 Apr. 2024].
A Comparative Analysis of Seven Algorithms using a Comprehensive Dataset [Internet]. GradesFixer. 2024 Feb 13 [cited 2024 Apr 30]. Available from: https://gradesfixer.com/free-essay-examples/a-comparative-analysis-of-seven-algorithms-using-a-comprehensive-dataset/
copy
Keep in mind: This sample was shared by another student.
  • 450+ experts on 30 subjects ready to help
  • Custom essay delivered in as few as 3 hours
Write my essay

Still can’t find what you need?

Browse our vast selection of original essay samples, each expertly formatted and styled

close

Where do you want us to send this sample?

    By clicking “Continue”, you agree to our terms of service and privacy policy.

    close

    Be careful. This essay is not unique

    This essay was donated by a student and is likely to have been used and submitted before

    Download this Sample

    Free samples may contain mistakes and not unique parts

    close

    Sorry, we could not paraphrase this essay. Our professional writers can rewrite it and get you a unique paper.

    close

    Thanks!

    Please check your inbox.

    We can write you a custom essay that will follow your exact instructions and meet the deadlines. Let's fix your grades together!

    clock-banner-side

    Get Your
    Personalized Essay in 3 Hours or Less!

    exit-popup-close
    We can help you get a better grade and deliver your task on time!
    • Instructions Followed To The Letter
    • Deadlines Met At Every Stage
    • Unique And Plagiarism Free
    Order your paper now