Surviving The Titanic Disaster

Starting Our Journey

A.I Hub
8 min readJun 19, 2024

In this article, we will explore the Titanic dataset and analyze the Titanic incident from every angle. In 1912, the world’s biggest ship, the Titanic, sank. In this guide, we will try to find out the reasons why and examine every element of the ship up close.


Table of Contents

  • A closer look at the Titanic
  • Conducting data inspection
  • Understanding the data

A Closer Look At The Titanic


The Titanic was a British passenger ship that sank on its first voyage in the North Atlantic in April 1912. The tragic event, caused by striking an iceberg, resulted in more than 1,500 fatalities: the estimate by US officials was 1,517, and by the British investigating committee it was 1,503, out of the 2,224 crew and passengers on board. Most of the casualties were members of the crew, followed by third-class passengers. How was this possible? The Titanic was considered an unsinkable vessel when it was built, using state-of-the-art technology, in the early 20th century. This confidence was a recipe for disaster. As we know, it did sink, as the contact with the iceberg damaged enough watertight compartments to compromise its integrity. The ship was originally designed to carry 48 lifeboats, but only 20 were present on board, and most of those were carrying less than 60% of their full capacity when they were lowered into the water.

The Titanic was 269 meters in length and had a maximum breadth of 28 meters. It had seven decks, identified with the letters A to G: A and B were for first-class passengers, C was mostly reserved for crew, and D to G were for second- and third-class passengers. It also had two additional decks: the boat deck, from where the boats were lowered into the water, and the Orlop deck, below the waterline. Although third-class and second-class amenities were not as luxurious and comfortable as those in first class, all classes had common leisure facilities, like a library, smoking rooms, and even a gymnasium. Passengers could also use open-air or indoor promenade areas. The Titanic was advanced in terms of comfort and amenities compared to other liners of the era.

The Titanic started its voyage from Southampton and had two other stops scheduled, one in Cherbourg, France, and one in Queenstown, Ireland. The passengers were shuttled with special trains from London and Paris to Southampton and Cherbourg, respectively. The crew on the Titanic consisted of around 885 people for this first trip. The majority of the crew were not sailors but stewards, who took care of the passengers, and firemen, stokers, and engineers, who were in charge of the engines of the ship.

Conducting Data Inspection


The story of the Titanic is fascinating. For those interested in data exploration, the data about the tragedy is also captivating. Let’s start with a short introduction to the competition data. The dataset from Titanic - Machine Learning from Disaster contains three CSV (comma-separated values) files, as do many Kaggle competitions that you will encounter.

  • train.csv
  • test.csv
  • sample_submission.csv

We will start by loading these files into a new notebook. You learned how to do this in the previous sections, in the Basic capabilities section. You can also create a notebook by forking one that already exists. In our case, we will start a new notebook from scratch. Usually, notebooks start with a cell in which we import packages. We will do the same here. In one of the next cells, we would like to read the train and test data. In general, the CSV files that you need live in similar directories from one competition to the next.

After we load the data, we will manually inspect it, looking at what each column contains, that is, at samples of the data. We will do this for each file in the dataset, but for now, we will mostly focus on the train and test files.
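A minimal loading-and-inspection sketch might look as follows. The ../input/titanic/ paths in the comments are the usual Kaggle layout, but they are an assumption here; to keep the snippet runnable anywhere, we read a tiny in-memory sample (the first rows of the train file) instead:

```python
import io
import pandas as pd

# On Kaggle, you would typically load the competition files like this:
# train_df = pd.read_csv("../input/titanic/train.csv")
# test_df = pd.read_csv("../input/titanic/test.csv")

# A tiny in-memory sample so the snippet runs anywhere
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
    '2,1,1,"Cumings, Mrs. John Bradley",female,38,1,0,PC 17599,71.2833,C85,C\n'
)
train_df = pd.read_csv(sample_csv)

# Manually inspect the first rows and the column names
print(train_df.head())
```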

Understanding The Data


In Figures 1.1 and 1.2, we get a glimpse of a selection of values. From this
visual inspection, we can already see some characteristics of the data.

Let’s try to summarize them. The following columns are common to both train
and test files.

  • PassengerId — A unique identifier for each passenger.
  • Pclass — The class in which each passenger was traveling. We know
    from our background information that the possible values are 1, 2, or 3.
    This can be considered a categorical data type; because the class order
    conveys meaning, we can also treat it as ordinal or numerical.
  • Name — This is a text type of field. It is the full name of the passenger,
    with their family name, first name and in some cases, their name
    before marriage, as well as a nickname. It also contains their title
    regarding social class, background, profession or in some cases,
    royalty.
  • Sex — This is also a categorical field. We can assume that this was
    important information at the time, considering that they prioritized
    saving women and children first.
  • Age — This is a numerical field. Also, their age was an important feature
    since children were prioritized for saving.
  • SibSp — This field gives the number of siblings or spouses traveling
    with each passenger. It is an indicator of the size of the family or
    group with which the passenger was traveling. This is important
    information, since we can safely assume that one would not board a
    lifeboat without their brothers, sisters, or partner.
  • Parch — This is the number of parents for child passengers or children
    for parent passengers. Considering that parents would wait for all
    their children before boarding a lifeboat, this is also an important
    feature. Together with SibSp, Parch can be used to calculate the size of
    the family for each passenger.
  • Ticket — This is a code associated with the ticket. It is an
    alphanumerical field, neither categorical nor numerical.
  • Fare — This is a numerical field. From the sample we see, we can
    observe that Fare values varied considerably (with one order of
    magnitude from class 3 to class 1), but we can also see that some of
    the passengers in the same class had quite different Fare values.
  • Cabin — This is an alphanumerical field. From the small sample that we
    see in Figures 1.1 and 1.2, we can see that some of the values are
    missing. In other cases, there are multiple cabins reserved for the same
    passenger, presumably a well-to-do passenger traveling with their
    family. A cabin name starts with a letter (C, D, E, or F in the
    sample). We remember that there are multiple decks on the Titanic, so
    we can guess that the letter represents the deck and is followed by the
    cabin number on that deck.
  • Embarked — This is a categorical field. In the sample here, we only see
    the letters C, S, and Q, and we already know that the Titanic started
    from Southampton and had a stop at Cherbourg, France, and one at
    Queenstown (today called Cobh, the port for Cork), Ireland. We can
    infer that S stands for Southampton (the starting port), C for
    Cherbourg, and Q for Queenstown.
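The SibSp, Parch, and Embarked observations above can be sketched in code. This is a hypothetical mini-example: the FamilySize column and the port-name mapping are our own additions for illustration, not part of the competition files:

```python
import pandas as pd

# Hypothetical sample with the relevant columns
df = pd.DataFrame({
    "SibSp": [1, 0, 3],           # siblings/spouses aboard
    "Parch": [0, 2, 1],           # parents/children aboard
    "Embarked": ["S", "C", "Q"],  # embarkation port codes
})

# Family size = siblings/spouses + parents/children + the passenger themselves
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Expand the embarkation codes into the port names we inferred
ports = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}
df["EmbarkedPort"] = df["Embarked"].map(ports)

print(df)
```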

The train file contains a Survived field as well, which is the target feature.
This has either a value of 1 or 0 where 1 means the passenger survived and
0 means they sadly didn’t.

Figure 1.1 - Sample of the train data file

The test file does not include the target feature, as you can see in the
following sample.

Figure 1.2 - Sample of the test data file

Once we have had a look at the columns in the train and test files, we can continue with a few additional checks to find the dimensions of the datasets and the feature distribution.

  1. Check the shape of each dataset (train_df and test_df) using the
    shape attribute. This will give us the dimensions of the train and test
    files: the number of rows and columns.
  2. Run the info() function for each dataset. This will give us more
    detailed information, such as the number of non-null values per column
    and the type of the data.
  3. Run the describe() function for each dataset. This only applies to
    numerical data and will produce statistics on the data distribution,
    including the minimum, maximum, and 25th, 50th, and 75th percentile
    values, as well as the average value and standard deviation.
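The three checks above can be sketched as follows. Since we don't have the real files here, a tiny stand-in DataFrame (hypothetical values) takes the place of train_df; with the actual competition data, the calls are identical:

```python
import pandas as pd

# Tiny stand-in for the loaded train data (hypothetical values)
train_df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Age": [22.0, 38.0, None],   # one missing value, as in the real data
    "Fare": [7.25, 71.28, 8.05],
})

# 1. Dimensions: (number of rows, number of columns)
print(train_df.shape)

# 2. Column types and non-null counts per column
train_df.info()

# 3. Summary statistics for the numerical columns
print(train_df.describe())
```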

The preceding checks give us preliminary information on the data distribution for the numerical values in the train and test datasets. We can continue later in our analysis with more sophisticated and detailed tools, but for now, you may consider these steps a general preliminary approach for investigating any tabular dataset that you get your hands on.

Conclusion

In this section, besides the competition approach, we introduced a systematic approach to exploratory data analysis and applied it to get familiar with the data, understand it in more detail, and extract useful insights. We also provided a short introduction to the process of using the results of data analysis to build model training pipelines. Before diving into the actual data, it is useful to understand the context and, ideally, define the possible objectives of the analysis.
