By clicking “Check Writers’ Offers”, you agree to our terms of service and privacy policy. We’ll occasionally send you promo and account related email
No need to pay just yet!
About this sample
About this sample
Words: 450 |
Page: 1|
3 min read
Published: Mar 28, 2019
Words: 450|Page: 1|3 min read
Published: Mar 28, 2019
The treatment of missing values (MVs) is an important issue in data pre-processing in data mining. One reason is that attributes from data can be aggregated from different sources. Cases may not exist in all the data sources. The other reason is because of reporting omission. The simplest way of dealing with MVs is to discard the cases that contain at least one MV. However, this is practical only when the data contain a small number of cases with MVs and when the analysis of the complete cases will not lead to serious bias results for inference. For example, in our study, 10%-30% students are missing their high school GPA or SAT scores. It is impossible to simply discard these students, as most of them are international students or transfer students which constitute an important subset of the populations. It is also not practical to discard these variables, as they are proved to be important predictors for predicting students’ performance. Thus, it is important to apply appropriate imputation strategy on the data.
There are also a variety of data mining methods. Unlike traditional explanatory models where the goal is to explore the relationship between an outcome variable and explanatory variables, the goal of data mining model is to make predictions on a new data set. There is a target variable, which can be either continuous or categorical. There are also predictors, called features, which measure a set of characteristics of the sample members. By applying different data mining models, a prediction model can be built based on current data. The model can be applied to new data, where a new set of characteristics values are used to make predictions. Different data mining methods have different algorithms and thus will result in different prediction performance. Based on Luengo, imputation methods can improve data mining methods for different categories, as there may be an interaction between imputation strategies and data mining methods. We would like to explore how this works on our data. In this chapter, we will first introduce the imputation strategies applied in this dissertation. Then, we will introduce the data mining methods applied on our data.
Third, a commonly used over-sampling method SMOTE will be introduced to deal with the imbalanced data issue. Imbalanced data typically refers to a problem with classification problems where the classes are not represented equally. For example, in our data set, there are around 3000 students in total, with 90% of them are labeled as pass students and the remaining 10% of them are labeled as failure students. Most machine learning methods do not work well on an imbalanced data. Thus, techniques need to be used to tackle imbalanced data issue. SMOTE is one of them.
Browse our vast selection of original essay samples, each expertly formatted and styled