About this sample
Published: Sep 18, 2018
Words: 1227|Pages: 3|7 min read
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
In this paper we carry out a predictive analysis of what sorts of people were likely to survive, and use machine learning tools to predict accurately which passengers survived the tragedy.
Index Terms: Machine learning.
Machine learning means the application of any computer-enabled algorithm to a data set in order to find a pattern in the data. This encompasses basically all types of data science algorithms: supervised, unsupervised, segmentation, classification, and regression.
A few important areas where machine learning can be applied are:
Handwriting recognition: convert written letters into digital letters
Language translation: translate spoken and/or written languages (e.g. Google Translate)
Speech recognition: convert voice snippets to text (e.g. Siri, Cortana, and Alexa)
Image classification: label images with appropriate categories (e.g. Google Photos)
Autonomous driving: enable cars to drive (e.g. NVIDIA and Google Car)
Some points about the features used by machine learning algorithms:
Features are the observations that are used to form predictions.
For image classification, the pixels are the features.
For voice recognition, the pitch and volume of the sound samples are the features.
For autonomous cars, data from the cameras, range sensors, and GPS are the features.
Extracting relevant features is important for building a model.
The source of a mail is an irrelevant feature when classifying images.
The source is relevant when classifying emails, because spam often originates from reported sources.
Every machine learning algorithm works best under a given set of conditions. Making sure your algorithm fits its assumptions and requirements helps ensure good performance. You can't use any algorithm in any condition. When the outcome to be predicted is categorical, such as whether a passenger survived, you should try algorithms such as logistic regression, decision trees, SVM, random forest, etc.
Logistic regression:
Logistic regression is a classification algorithm. It is used to predict a binary outcome given a set of independent variables. To represent a binary categorical outcome, we use dummy variables. You can also think of logistic regression as a special case of linear regression for a categorical outcome variable, where the log of odds is used as the dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function.
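The relationship between the log of odds and the predicted probability can be sketched in a few lines of Python. This is only an illustration: the function names, the coefficients, and the two feature values are made up and are not taken from the model fitted in this paper.

```python
import math

def logit(p):
    """Log of odds for a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def predict_probability(intercept, coefficients, features):
    """Probability of the event under a logistic regression model."""
    # The log of odds is modeled as a linear function of the features.
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, features))
    # The logistic (sigmoid) function is the inverse of the logit.
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical intercept, coefficients, and feature values, for illustration only.
p = predict_probability(-1.0, [2.0, -0.5], [1.0, 2.0])
print(p)         # probability of the event
print(logit(p))  # recovers the linear predictor (the log of odds)
```

A passenger would then be classified as a survivor when the predicted probability exceeds a chosen cutoff; 0.5 is a common default, but the cutoff is a modeling choice.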
Performance of the logistic regression model:
AIC (Akaike Information Criterion): The analogous metric to adjusted R² in logistic regression is AIC. AIC is a measure of fit that penalizes the model for the number of model coefficients. Therefore, we always prefer the model with the minimum AIC value.
Null deviance and residual deviance: Null deviance indicates the response predicted by a model with nothing but an intercept. The lower the value, the better the model. Residual deviance indicates the response predicted by a model after adding independent variables. The lower the value, the better the model.
Confusion matrix: This is a tabular representation of actual vs. predicted values. It helps us find the accuracy of the model and avoid overfitting, where accuracy = (true positives + true negatives) / (true positives + true negatives + false positives + false negatives).
McFadden R²: This is also called pseudo R². When analyzing data with a logistic regression, an equivalent statistic to R-squared does not exist. However, to evaluate the goodness of fit of logistic models, several pseudo R-squareds have been developed.
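As a rough sketch of how these fit measures relate to each other, the Python snippet below computes them from the log-likelihood of a binary model. The outcomes and predicted probabilities are invented toy values, not results from this study, and the helper names are hypothetical. For ungrouped binary outcomes the deviance equals minus two times the log-likelihood, which is the assumption used here.

```python
import math

def log_likelihood(y_true, p_pred):
    """Bernoulli log-likelihood of 0/1 outcomes given predicted probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for y, p in zip(y_true, p_pred))

def deviance(ll):
    """Deviance of a model for ungrouped binary data."""
    return -2.0 * ll

def aic(ll, n_coefficients):
    """AIC = 2k - 2 ln(L), where k counts every estimated coefficient."""
    return 2.0 * n_coefficients - 2.0 * ll

def mcfadden_r2(ll_model, ll_null):
    """McFadden pseudo R-squared: 1 - ln(L_model) / ln(L_null)."""
    return 1.0 - ll_model / ll_null

# Toy data (made up): eight outcomes and the probabilities from a hypothetical model.
y = [1, 0, 1, 1, 0, 0, 1, 0]
p_model = [0.9, 0.2, 0.7, 0.8, 0.3, 0.1, 0.6, 0.4]
# The intercept-only (null) model predicts the overall event rate for everyone.
p_null = [sum(y) / len(y)] * len(y)

ll_model = log_likelihood(y, p_model)
ll_null = log_likelihood(y, p_null)
print("null deviance:", deviance(ll_null))
print("residual deviance:", deviance(ll_model))
print("AIC (intercept + 1 predictor):", aic(ll_model, 2))
print("McFadden R2:", mcfadden_r2(ll_model, ll_null))
```

A residual deviance well below the null deviance, and a McFadden R² closer to 1, both indicate that the predictors add information beyond the intercept.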
A decision tree is a hierarchical tree structure that can be used to divide a large collection of records into smaller sets of classes by applying a sequence of simple decision rules. A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous (mutually exclusive) classes. The attributes can be variables of any type (binary, nominal, ordinal, or quantitative), while the classes must be qualitative (categorical, binary, or ordinal). In short, given data consisting of attributes together with their classes, a decision tree produces a sequence of rules (or series of questions) that can be used to recognize the class. One rule is applied after another, resulting in a hierarchy of segments within segments. The hierarchy is called a tree, and each segment is called a node. With each successive division, the members of the resulting sets become more and more similar to each other. Hence, the algorithm used to construct a decision tree is referred to as recursive partitioning.
Decision tree applications:
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Classifying buyers and non-buyers
Deciding whether or not to approve a loan
Diagnosing various diseases based on symptoms and profiles
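Recursive partitioning can be illustrated with a small sketch in Python. The paper does not state which splitting criterion was used, so Gini impurity is assumed here purely for illustration; the toy records (a female indicator and a passenger class) and all function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly homogeneous)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature index, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Apply one rule after another until nodes are pure or max_depth is reached."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the rules from the root down to a leaf."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Made-up records: (is_female, passenger_class) with a made-up survival label.
rows = [(1, 1), (1, 2), (1, 3), (0, 1), (0, 2), (0, 3)]
labels = [1, 1, 1, 0, 0, 0]
tree = build_tree(rows, labels)
print(tree)
print(predict(tree, (1, 3)))
```

Each dictionary in the printed tree is a node holding one decision rule, and each plain value is a leaf holding the predicted class.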
Our approach to the problem:
The data we collected is still raw data, which is very likely to contain mistakes, missing values, and corrupt values. Before drawing any conclusions from the data, we need to do some data preprocessing, which involves data wrangling and feature engineering. Data wrangling is the process of cleaning and unifying messy and complex data sets for easy access and analysis. Feature engineering attempts to create additional relevant features from the existing raw features in the data, in order to increase the predictive power of the learning algorithms.
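The two preprocessing steps can be sketched as follows. The field names ("age", "sex", "sibsp", "parch"), the sample records, the median fill for missing ages, and the derived features are all assumptions chosen for illustration; the paper does not list the exact cleaning rules or engineered features it used.

```python
from statistics import median

# Made-up raw records with a missing value and inconsistent text formatting.
raw = [
    {"age": 22.0, "sex": "male", "sibsp": 1, "parch": 0},
    {"age": None, "sex": "female", "sibsp": 0, "parch": 0},
    {"age": 38.0, "sex": " Female", "sibsp": 1, "parch": 2},
    {"age": 4.0, "sex": "MALE", "sibsp": 3, "parch": 1},
]

def wrangle(records):
    """Data wrangling: unify text values and fill in missing ages."""
    fill_age = median(r["age"] for r in records if r["age"] is not None)
    cleaned = []
    for r in records:
        row = dict(r)  # copy so the raw data stays untouched
        row["sex"] = r["sex"].strip().lower()
        if row["age"] is None:
            row["age"] = fill_age
        cleaned.append(row)
    return cleaned

def add_features(records):
    """Feature engineering: derive new features from the existing ones."""
    enriched = []
    for r in records:
        row = dict(r)
        # Siblings/spouses plus parents/children plus the passenger.
        row["family_size"] = r["sibsp"] + r["parch"] + 1
        row["is_female"] = 1 if r["sex"] == "female" else 0
        enriched.append(row)
    return enriched

prepared = add_features(wrangle(raw))
for row in prepared:
    print(row)
```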
Data set description: The original data has been split into two groups: a training data set (70%) and a test data set (30%). The training set is used to build the machine learning models. The test set is used to see how well a model performs on unseen data. The ground truth for the test set is not shown to the model; for each passenger in the test set, the trained model is used to predict whether or not they survived the sinking of the Titanic.
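A random 70/30 split of this kind might be produced as in the sketch below. The function name, the fixed seed, and the use of integer record ids are assumptions for illustration; the paper does not say how its split was drawn.

```python
import random

def split_train_test(records, train_fraction=0.7, seed=42):
    """Shuffle the records reproducibly and cut them into two groups."""
    shuffled = list(records)  # copy so the caller's list is not reordered
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Ten made-up record ids stand in for passenger rows.
records = list(range(10))
train, test = split_train_test(records)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so that different algorithms are trained and evaluated on exactly the same passengers.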
Results: After training the algorithms, we validate them with the test data set and measure their goodness of fit using a confusion matrix. 70% of the data is used as the training data set and 30% as the test data set.
Confusion matrix for decision tree:
Training data set
             References
Predictions    0    1
          0  395   71
          1   45  203
Test data set
             References
Predictions    0    1
          0   97   20
          1   12   48
Confusion matrix for logistic regression:
Training data set
             References
Predictions    0    1
          0  395   12
          1   21  204
Test data set
             References
Predictions    0    1
          0   97   12
          1   21   47
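The counts reported in the four confusion matrices above can be turned into accuracy figures with a short Python sketch. The matrix layout (one row per predicted class, one column per reference class) is an assumption about the flattened output; accuracy itself does not depend on that orientation, since it only uses the diagonal and the total.

```python
def accuracy(matrix):
    """Share of predictions on the diagonal (true negatives + true positives)."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

# Counts copied from the confusion matrices reported above.
matrices = {
    "decision tree, training data": [[395, 71], [45, 203]],
    "decision tree, test data": [[97, 20], [12, 48]],
    "logistic regression, training data": [[395, 12], [21, 204]],
    "logistic regression, test data": [[97, 12], [21, 47]],
}

for name, matrix in matrices.items():
    print(f"{name}: {accuracy(matrix):.3f}")
```

Comparing training and test accuracy for the same algorithm is one way to check for overfitting, since a model that has memorized the training data tends to score noticeably lower on unseen data.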
Enhancements and reasoning: Predicting survival with other machine learning algorithms, such as random forests and various support vector machines, may improve the accuracy of prediction for the given data set.
Conclusion: The analyses revealed interesting patterns across individual-level features. Factors such as socioeconomic status, social norms, and family composition appeared to have an impact on the likelihood of survival. These conclusions, however, were derived from findings in the data. For the given data set, the accuracy of predicting survival with the decision tree algorithm (83.7%) is higher than with logistic regression (81.3%).