On many conferences where AI, ML or data science is mentioned, a simple dataset is used to underpin the storyline of the sessions. The Titanic dataset is such a relatively small and simple example. Until now a Fact Oriented Model was absent. So, let's see how we can build one out of the dataset by verbalizing it.

On many conferences where AI, ML or data science is mentioned, a simple dataset is used to underpin the storyline of the sessions. The Titanic dataset is such a relatively small and simple example. Until now a Fact Oriented Model was absent. So, let's see how we can build one out of the dataset by verbalizing it.

Here's the first Data Science example page we'll use to find the data, some description and build our information model.

titanic data

When opening the data file (CSV or XLS) there's something very obvious that jumps out. The passengers are numbered by Passenger ID, and they all have a name. That'll be the start of the first set of fact expressions:

Passenger Name
"Passenger 1252 is called Sage, Master. William Henry."
"Passenger 160 is called Sage, Master. Thomas Henry."
"Passenger 558 is called Robbins, Mr. Victor."
"Passenger 951 is called Chaudanson, Miss. Victorine."

Apparently, not all passengers are registered to embark in the same port as the ticket is registered for. Therefor the single letter "C" which is short for Cherbourg, we state separate facts:

Embarking
"Passenger 1034 embarked in Cherbourg."

Related to these passengers, we can easily spot some trivial data which is obviously related to them:

Passenger Age
"Passenger 1252 has the age of 14,5 years."

Passenger Gender "Passenger 160 is male."

Passenger Hometown
"Passenger 160 is from the hometown Peterborough, Cambridgeshire, England."

And as we all know by the history books, a lot of people lost their lives. The dataset has this fact listed as a number where 0 means they've died, and 1 means they survived. For readability we simply verbalize these facts as:

Passenger Survived
"Passenger 160 survived? No."

Then the data becomes a little nebulous, hence the difficulties in dataset cleansing and understanding. We find fare prices. The data doesn't explain in which currency this is listed. Neither does the dataset explain if this is the price per passenger or per full ticket. Studying the numbers, we can derive the fact that this must be a price per person listed on a ticket. Similarly, tickets seem to be listing a single class value per individual ticket. Here are the facts we can find for tickets:

Fare
"Fare for ticket PC 17608 is 262.375."

Passage Class
"Ticket CA. 2343 is for class 3."

Ticket Boarding "Ticket CA. 2343 boards in Southampton." Ticket Destination "Ticket CA. 2343 is set for destination Jacksonville, Florida, US.

Ticket for Cabin "Passengers with ticket PC 17608 are assigned to cabin B57." "Passengers with ticket PC 17608 are assigned to cabin B59." "Passengers with ticket PC 17608 are assigned to cabin B61." "Passengers with ticket PC 17608 are assigned to cabin B63." "Passengers with ticket PC 17608 are assigned to cabin B66."

Tickets are registered for one or more passengers:

Ticket for Passenger
"Ticket CA. 2343 lists passenger 1252

The data also lists the family constellations, by listing two numbers for parents/children and sibling/spouse. Also the total amount of passengers per ticket is mentioned:

Parents/Children
"Passenger 1034 has 3 parents/children on board."

Sibling/Spouce
"Passenger 1034 has 1 spouse or sibling on board."

Ticket Count "Ticket PC 17608 is for 7 passengers.

The following information is derivable or added using the wiki pages to list the ports in the order the Titanic traveled and where the ports are. These are not from the dataset themselves, but make it easier to comprehend:

Port Country
"Ports Queenstown lies in Ireland."

Titanic Port Order
"Boarding 1 happened in Southampton."
"Boarding 2 happened in Cherbourg."
"Boarding 3 happened in Queenstown."

Family Size
; this is calculated by counting the passenger themselves plus the sibling/spouce, plus parents/children "The family size of passenger 1034 is 5.

 Now when classified and qualified into an information model, a visualization of can be made:

titanic

For those who appreciate a more condensed visualization, an  ERD diagram may be more helpful. CaseTalk will generate this automatically. To many, this illustrates how nebilous a simple dataset can be. There are many relations which are not instantly obvious from just looking at rows and columns full of data:

titanic.erd

Finally, the most high level diagram which shows barely any details of any meaning illustrates the broad information in the raw data file:

titanic.map

Having a solid Fact Oriented Model may clarify the structures in the dataset, and by having this data verbalized makes everyone can verify if assumptions made on this data are accurate or not. Without facts, any AI/ML model will be based on information which is only in the mind of the data scientist. To make AI/ML transparent and achieve responsible AI, we need to make raw source data verified and document it in an way everyone can understand.

Download the source model here from our Github account.

Add comment