Abstract: The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. In this paper we analyze what sorts of people were likely to survive and use machine learning tools to predict, as accurately as possible, which passengers survived the tragedy.

Index Terms: Machine learning.

________________________________________________________________________________________________________

1. Introduction

Machine learning is the application of any computer-enabled algorithm to a data set in order to find a pattern in the data. This encompasses essentially all types of data science algorithms: supervised, unsupervised, segmentation, classification, and regression. A few important areas where machine learning can be applied are:

Handwriting Recognition: convert written letters into digital letters

Language Translation: translate spoken and/or written languages (e.g. Google Translate)

Speech Recognition: convert voice snippets to text (e.g. Siri, Cortana, and Alexa)

Image Classification: label images with appropriate categories (e.g. Google Photos)

Autonomous Driving: enable cars to drive (e.g. NVIDIA and Google Car)

Some key points about the features used by machine learning algorithms are:

Features are the observations that are used to form predictions.

For image classification, the pixels are the features.

For voice recognition, the pitch and volume of the sound samples are the features.

For autonomous cars, data from the cameras, range sensors, and GPS are the features.

Extracting relevant features is important for building a model.

The source of an email is an irrelevant feature when classifying images.

The source is relevant when classifying emails, because spam often originates from previously reported sources.

2. Literature survey

Every machine learning algorithm works best under a given set of conditions. Making sure the algorithm fits the assumptions and requirements of the problem ensures superior performance; no algorithm can be used under every condition. When the outcome to be predicted is categorical, as it is here, suitable algorithms include Logistic Regression, Decision Trees, SVM, and Random Forest.

Logistic Regression

Logistic Regression is a classification algorithm. It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent a binary or categorical outcome, we use dummy variables. Logistic regression can also be thought of as a special case of linear regression for a categorical outcome variable, in which the log of the odds is used as the dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting the data to a logit function.
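As a minimal sketch of the idea (not the model fitted in this paper), the link between log-odds and probability can be written in a few lines of Python. The intercept, the coefficient, and the single predictor `is_female` are hypothetical values chosen only for illustration.

```python
import math

def sigmoid(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of sigmoid: the log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients, for illustration only (not fitted values).
intercept = -1.0
coef_female = 2.5

def predict_probability(is_female):
    # Linear predictor on the log-odds scale, then converted to a probability.
    return sigmoid(intercept + coef_female * is_female)

p_female = predict_probability(1)
p_male = predict_probability(0)
```

With a positive coefficient, the predicted probability for `is_female = 1` is higher than for `is_female = 0`; a fitted model would estimate these coefficients from the training data.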

Performance of Logistic regression model:

AIC (Akaike Information Criterion): The analogue of adjusted R² in logistic regression is AIC. AIC is a measure of fit that penalizes the model for the number of model coefficients. Therefore, we always prefer the model with the minimum AIC value.
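For illustration, AIC is commonly defined as 2k - 2 ln(L), where k is the number of estimated coefficients and L is the maximized likelihood. The sketch below compares two hypothetical models; the log-likelihood values and coefficient counts are assumptions, not results from this study.

```python
def aic(log_likelihood, n_coefficients):
    """Akaike Information Criterion: 2k - 2*ln(L)."""
    return 2 * n_coefficients - 2 * log_likelihood

# Hypothetical fits of two candidate models to the same data.
aic_small = aic(log_likelihood=-320.0, n_coefficients=3)
aic_large = aic(log_likelihood=-318.5, n_coefficients=8)

# The model with the lower AIC is preferred.
preferred = "small" if aic_small < aic_large else "large"
```

Here the larger model fits slightly better but pays a higher penalty for its extra coefficients, so the smaller model would be preferred.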

Null Deviance and Residual Deviance: Null deviance indicates how well the response is predicted by a model with nothing but an intercept. The lower the value, the better the model. Residual deviance indicates how well the response is predicted by a model after adding the independent variables. The lower the value, the better the model.
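A rough sketch of how the two deviances can be computed for 0/1 outcomes is given below. It assumes ungrouped binary data, for which the deviance equals -2 times the model log-likelihood; the outcomes and fitted probabilities are toy values, not taken from the Titanic data.

```python
import math

def log_likelihood(y_true, p_pred):
    """Bernoulli log-likelihood of 0/1 outcomes under predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(y_true, p_pred)
    )

def deviance(y_true, p_pred):
    # For ungrouped binary data the saturated model has log-likelihood 0,
    # so the deviance reduces to -2 times the model log-likelihood.
    return -2.0 * log_likelihood(y_true, p_pred)

# Toy outcomes and hypothetical fitted probabilities, for illustration only.
y = [1, 0, 0, 1, 0, 1, 0, 0]
fitted = [0.9, 0.2, 0.1, 0.7, 0.3, 0.6, 0.2, 0.4]

# Intercept-only model: every observation gets the overall rate of 1s.
base_rate = sum(y) / len(y)
null_dev = deviance(y, [base_rate] * len(y))
resid_dev = deviance(y, fitted)
```

A residual deviance well below the null deviance suggests that the independent variables improve on the intercept-only model.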

Confusion Matrix: It is a tabular representation of actual vs. predicted values. It helps us find the accuracy of the model, and comparing the matrices for the training and test data helps reveal overfitting.
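A confusion matrix for a binary outcome can be tabulated as in the sketch below. The label lists are toy values, and the layout (rows for predicted labels, columns for actual labels) is an assumption made for this example.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (predicted, actual) pairs for binary labels 0 and 1.

    Returns a nested dict: matrix[predicted_label][actual_label] -> count.
    """
    counts = Counter(zip(predicted, actual))
    return {p: {a: counts[(p, a)] for a in (0, 1)} for p in (0, 1)}

# Toy labels, for illustration only.
actual = [0, 0, 1, 1, 0, 1, 0, 1]
predicted = [0, 1, 1, 0, 0, 1, 0, 1]
matrix = confusion_matrix(actual, predicted)
```

The diagonal entries `matrix[0][0]` and `matrix[1][1]` count correct predictions, and the off-diagonal entries count the two kinds of error.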

McFadden R²: McFadden's R² is a pseudo R². When analyzing data with a logistic regression, an equivalent statistic to R-squared does not exist. However, to evaluate the goodness-of-fit of logistic models, several pseudo R-squareds have been developed.
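McFadden's pseudo R² is usually computed as 1 - ln(L_model) / ln(L_null), where L_null is the likelihood of the intercept-only model. The sketch below uses hypothetical log-likelihood values, not values from this study.

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo R-squared: 1 - ln(L_model) / ln(L_null)."""
    return 1.0 - ll_model / ll_null

# Hypothetical log-likelihoods, for illustration only.
r2 = mcfadden_r2(ll_model=-320.0, ll_null=-475.0)
```

A value of 0 means the fitted model is no better than the intercept-only model, and values closer to 1 indicate a better fit.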

accuracy = (true positives + true negatives) / (true positives + true negatives + false positives + false negatives)
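As a worked example of the accuracy formula, the sketch below applies it to the counts of the decision tree confusion matrix for the training data reported in the results section. Treating class 1 (survived) as the positive class is an assumption; accuracy itself does not depend on that choice.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Counts from the decision tree training confusion matrix in the results,
# with class 1 (survived) assumed to be the positive class.
acc = accuracy(tp=203, tn=395, fp=45, fn=71)
```

This gives 598 correct predictions out of 714, or roughly 0.8375.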

Decision Trees

A decision tree is a hierarchical tree structure that can be used to divide a large collection of records into smaller sets of classes by applying a sequence of simple decision rules. A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous (mutually exclusive) classes. The attributes can be variables of any type (binary, nominal, ordinal, or quantitative), while the classes must be qualitative (categorical, binary, or ordinal). In short, given data consisting of attributes together with their classes, a decision tree produces a sequence of rules (or series of questions) that can be used to recognize the class.

One rule is applied after another, resulting in a hierarchy of segments within segments. The hierarchy is called a tree, and each segment is called a node. With each successive division, the members of the resulting sets become more and more similar to each other. Hence, the algorithm used to construct a decision tree is referred to as recursive partitioning.
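Recursive partitioning can be sketched in plain Python as follows. This is an illustrative toy, not the implementation used in this paper: it assumes numeric features, uses Gini impurity as the splitting criterion (one common choice), and the small data set is made up.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Find the (feature index, threshold) that minimizes weighted Gini impurity."""
    best = None
    best_score = gini(labels)  # a split must improve on the unsplit node
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively partition the data; leaves store the majority class."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, threshold = split
    left_idx = [i for i, row in enumerate(rows) if row[f] <= threshold]
    right_idx = [i for i, row in enumerate(rows) if row[f] > threshold]
    return {
        "feature": f,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left_idx],
                           [labels[i] for i in left_idx], depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right_idx],
                            [labels[i] for i in right_idx], depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the rules from the root down to a leaf."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree

# Toy data: [is_female, pclass] -> survived. Values are made up for illustration.
rows = [[1, 1], [1, 2], [1, 3], [0, 1], [0, 2], [0, 3], [1, 1], [0, 3]]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
tree = build_tree(rows, labels)
```

Each internal node holds one decision rule, and each recursive call works on a smaller, more homogeneous subset of the records.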

Decision tree applications:

predicting tumor cells as benign or malignant

classifying credit card transactions as legitimate or fraudulent

classifying buyers versus non-buyers

deciding whether or not to approve a loan

diagnosing various diseases based on symptoms and profiles

3. Methodology:

Our approach to solving the problem:

1. Collect the raw data needed to solve the problem.

2. Import the dataset into the working environment.

3. Preprocess the data, which includes data wrangling and feature engineering.

4. Explore the data and prepare a model for performing analysis using machine learning algorithms.

5. Evaluate the model and iterate until we get satisfactory model performance.

6. Compare the results and select the model that gives the most accurate result.

The data we collected is still raw data, which is very likely to contain mistakes, missing values, and corrupt values. Before drawing any conclusions from the data we need to do some data preprocessing, which involves data wrangling and feature engineering.

Data wrangling is the process of cleaning and unifying messy and complex data sets for easy access and analysis.

Feature engineering attempts to create additional relevant features from the existing raw features in the data in order to increase the predictive power of the learning algorithms.
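The two preprocessing steps might look like the sketch below. The specific choices (filling a missing age with the median, filling a missing port with the most frequent one, and deriving `family_size` and `is_female`) are assumptions made for illustration; the paper does not state which transformations were applied. The records are toy values.

```python
from statistics import median

# Toy passenger records; None marks a missing value. Values are illustrative only.
passengers = [
    {"sex": "male", "age": 22.0, "sibsp": 1, "parch": 0, "embarked": "S"},
    {"sex": "female", "age": None, "sibsp": 1, "parch": 0, "embarked": "C"},
    {"sex": "female", "age": 26.0, "sibsp": 0, "parch": 0, "embarked": None},
    {"sex": "male", "age": 35.0, "sibsp": 0, "parch": 2, "embarked": "S"},
]

def wrangle(records):
    """Data wrangling: fill missing values so every record is complete."""
    known_ages = [r["age"] for r in records if r["age"] is not None]
    fill_age = median(known_ages)
    ports = [r["embarked"] for r in records if r["embarked"] is not None]
    fill_port = max(set(ports), key=ports.count)  # most frequent port
    cleaned = []
    for r in records:
        r = dict(r)  # copy so the raw data is left untouched
        if r["age"] is None:
            r["age"] = fill_age
        if r["embarked"] is None:
            r["embarked"] = fill_port
        cleaned.append(r)
    return cleaned

def engineer(records):
    """Feature engineering: derive new features from existing ones."""
    for r in records:
        r["family_size"] = r["sibsp"] + r["parch"] + 1  # passenger plus relatives
        r["is_female"] = 1 if r["sex"] == "female" else 0
    return records

prepared = engineer(wrangle(passengers))
```

Keeping the raw records unchanged and working on copies makes it easy to try a different imputation rule later and compare the resulting models.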

4. Experimental Analysis and Discussion

a) Data set description:

The original data has been split into two groups: a training dataset (70%) and a test dataset (30%). The training set is used to build the machine learning models.

The test set is used to see how well the model performs on unseen data. The ground truth for each test passenger is not provided to the model; for each passenger in the test set, the trained model is used to predict whether or not they survived the sinking of the Titanic.
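A 70/30 split of this kind can be sketched as below. The function name `train_test_split`, the fixed seed, and the stand-in passenger IDs are assumptions for the example; the paper does not describe how its split was drawn.

```python
import random

def train_test_split(records, train_fraction=0.7, seed=42):
    """Shuffle the records and split them into training and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Stand-in passenger IDs; the real records would be used in practice.
passenger_ids = list(range(1, 11))
train, test = train_test_split(passenger_ids)
```

Shuffling before cutting avoids any ordering in the file leaking into the split, and the test records are never used while fitting the model.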

b) Measures

Data Dictionary

| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

Variable Notes

pclass: A proxy for socio-economic status (SES)

1st = Upper

2nd = Middle

3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.

sibsp: The dataset defines family relations in this way…

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way…

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.

c) Results

After training the algorithms, we validate the trained models on the test data set and measure their performance, using the confusion matrix as the measure of goodness of fit. 70% of the data is used as the training data set and 30% as the test data set.

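Confusion matrix for decision tree

Training data set:

| Prediction | Reference 0 | Reference 1 |
| --- | --- | --- |
| 0 | 395 | 71 |
| 1 | 45 | 203 |

Test data set:

| Prediction | Reference 0 | Reference 1 |
| --- | --- | --- |
| 0 | 97 | 20 |
| 1 | 12 | 48 |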

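Confusion matrix for logistic regression

Training data set:

| Prediction | Reference 0 | Reference 1 |
| --- | --- | --- |
| 0 | 395 | 12 |
| 1 | 21 | 204 |

Test data set:

| Prediction | Reference 0 | Reference 1 |
| --- | --- | --- |
| 0 | 97 | 12 |
| 1 | 21 | 47 |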

d) Enhancements and reasoning

Predicting survival with other machine learning algorithms, such as random forests and various support vector machines, may improve the accuracy of prediction for the given data set.

5. Conclusion:

The analyses revealed interesting patterns across individual-level features. Factors such as socioeconomic status, social norms, and family composition appeared to have an impact on the likelihood of survival. These conclusions, however, were derived from findings in the data. The accuracy of predicting survival using the decision tree algorithm (83.7%) is higher than that of logistic regression (81.3%) for the given data set.