Learning data science - a beginner project

Even though I'm an experienced software developer, there are still lots of things I want to learn more about. Game design, AI, security, data science, image recognition, software optimization, ... There's just no end to the things to learn about!

One concrete thing I've wanted to get better at is data science and statistics. I'm planning to take a few statistics courses at an open university at some point, but before that, I've decided to just sit down and try to approach data science with a beginner mindset - I know nothing, am starting from scratch, and just want to reproduce some basic things and learn about the common pitfalls and problems in data science and statistics.

This post isn't a tutorial on data science; rather, it's documentation of the things I've been doing, the approach I've taken and the tools I've chosen to learn the basics of data science. Still, I hope that this text can be useful as a general description of how one might approach learning a new subject on their own.

What's.. Data science and how do I start?

So, I started from the very beginning. "What is data science", I typed to Google, and read through a bunch of articles as well as the Wikipedia page on data science.

My own summary at this stage would be that data science is the art of extracting useful, actionable information from sets of data.

I googled for "data science basics", "tutorial to data science", "visualizing data", and so on - and again, I read through a lot of articles.

After that initial reading, I decided to start doing some sort of a mock project.

First, I needed to decide on the language and frameworks I would use, so - you guessed it - I embarked on more googling!

The most popular languages used for data science appear to be Python and R. Since I'm already familiar with Python, I decided to go with it. I was already aware that NumPy is probably used a lot, but I still googled for other frameworks people use.

I ended up also using Pandas and Seaborn to further help me visualize and analyze the data.

Next I needed a data set. I wanted something where the results would be verifiable and where the data and the correlations in that data are already decently well established.

I decided to go with heart health. I wanted to establish the correlation between common variables such as age, weight and blood pressure and the existence of a heart condition.

I googled for data banks and I found Kaggle and ended up going with this particular data set: https://www.kaggle.com/sulianova/eda-cardiovascular-data/notebook

I decided on the scope of the project: I wanted to establish which variables have the highest correlation with heart disease, and to create a small program that would predict the likelihood of heart disease in a patient, given the measurements available in the data set.

Starting to work with the data

From my earlier reading, I figured that the first thing I'd need to establish was the quality of the data, so the first piece of code I wrote simply visualized the data points, one by one.

import pandas as pd
df = pd.read_csv('cardio_train.csv', sep=';')

Here I am visualizing the label "ap_hi" which stands for "Systolic blood pressure".
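The plotting code itself isn't shown above; here's a minimal sketch of how such a raw scatter of one column could be drawn with Seaborn. The values below are made-up stand-ins (with a couple of absurd outliers mixed in on purpose), since the real ones come from cardio_train.csv:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the ap_hi column, with two impossible values
# mixed in to mimic the kind of outliers the real data contains.
rng = np.random.default_rng(0)
ap_hi = pd.Series(np.concatenate([rng.normal(130, 20, 500), [15000, -70]]))

# Plot every value against its row index to eyeball outliers.
ax = sns.scatterplot(x=ap_hi.index, y=ap_hi)
```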

I am not a doctor, but I am quite sure that neither the systolic nor the diastolic blood pressure can be 15 000.

Next I did df.describe() (`df` stands for DataFrame, a Pandas concept; a DataFrame holds the data points from a particular data range - in this case, the whole data set) and it showed this:

(Too lazy to sort that as a text paste, sorry)
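To give an idea of what describe() reports, here's a tiny made-up illustration (these are not the actual cardio numbers):

```python
import pandas as pd

# A miniature synthetic column with a couple of impossible values.
demo = pd.DataFrame({'ap_hi': [120, 140, 16020, -100, 110]})
stats = demo['ap_hi'].describe()
# stats holds count, mean, std, min, the quartiles and max;
# the min/max rows are where the impossible values jump out.
```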

So not only are the maximum blood pressures way off, but also the minimums go into the negatives.

Also the minimum age is roughly 30 (the ages there are in days) while the minimum weight is 10kg.

The shortest adult human alive weighs 15kg, according to Google.

I googled for the highest and lowest recorded blood pressures, and I figured that I'd also just cut off the shortest/lightest and tallest/heaviest 0.1% of the people in the sample.

outliers = ((df['ap_hi'] > 370) | (df['ap_hi'] < 60)
            | (df['ap_lo'] > 360) | (df['ap_lo'] < 30)
            | (df['weight'] < 40)
            | (df['height'] < df['height'].quantile(0.001))
            | (df['weight'] < df['weight'].quantile(0.001))
            | (df['height'] > df['height'].quantile(0.999))
            | (df['weight'] > df['weight'].quantile(0.999)))
df.drop(df.index[outliers], inplace=True)

After this, my data appeared more sensible:

I do have some other concerns about the quality of this data but I'll write more about that later..

Baby's first regression

Time to see if we can spot any correlations with the variables and the existence of heart disease.

I started to work with a subset of the data, namely the first thousand rows. The data appears to already be in a random order so it should be alright to just take the first thousand elements.. Okay, I suppose I shouldn't be trusting that, so I randomized the data:

df = df.sample(frac=1).reset_index(drop=True)
df.to_csv("cardio_train_cleaned.csv", sep=";", index=False)

Now I could also skip removing the outliers every time I ran the code.

I decided I wanted to actually learn the math behind linear regression. I do feel that when you're trying to learn something new, it's important to look "under the hood", so to speak, and try to understand how the things you want to use actually work.

To Google it is, once again!

I read the Wikipedia page on regressions and also found this very helpful explanation: https://towardsdatascience.com/linear-regression-explained-1b36f97b7572 and this implementation in Python: https://www.geeksforgeeks.org/linear-regression-python-implementation/

Armed with this knowledge, I went on to code my own linear regression. And then I scrapped it, after testing that it gave the same results as Seaborn's linear regression functionality. What I've found is that it really is very, very easy to get even simple mathematical formulas wrong in code, so you shouldn't use your own formulas unless there's no well-tested alternative.

You can still check the linear regression out in the custom_linear_regression.py file if you want to, though I didn't really bother to clean it up or make it properly generic.
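For a flavor of the math involved, a closed-form simple linear regression fits in a few lines. This is a generic sketch, not the code from custom_linear_regression.py:

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least squares for y = a * x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Slope: covariance of x and y divided by the variance of x.
    a = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    # Intercept: the line passes through the point of means.
    b = y.mean() - a * x.mean()
    return a, b

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # exactly y = 2x + 1
```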

I then ran linear regression for all the labels.

Weight, age and blood pressure clearly correlated with the presence of cardiovascular disease:

Interestingly, smoking apparently had little correlation:

This was rather unexpected - smoking is a well-known predictor of heart disease, so I wonder what's up?

Normalizing is very, very important - and our first paradox

I figured that maybe smokers were, on average, lighter, and that might explain it, so I fitted a linear regression for smoking and weight:

Nope, actually in this data it's the opposite??

Oh wait a minute.

Maybe this is a form of Simpson's paradox, which I read about earlier.. For example, perhaps women are both less likely to smoke and less likely to have heart disease?

So I removed women from the sample and.. Nope.

A big problem here is that the original source of the data is not made evident in the Kaggle kernel. I don't know where the data was actually collected from. Since the data is mostly older people - with only a few people under 30 and most people above 40 - perhaps it comes from a hospital. It might make sense that for people already in a hospital, smoking is not a predictor of heart disease.

But really it's unfortunately impossible to tell..

Oh well, let's ignore this for now and continue on - but let's try to remember that it's always worth trying to normalize the data when looking at the effect a single variable has.

Predicting with logistic regression

Again, I started by googling logistic regression and looking at some Python examples. I wrote a simple logistic regression on top of my earlier linear regression code, but it only predicted a single value Y from a single variable X.

This time the code was so messy that I'll just leave it out altogether. Instead, we'll use the tools from scikit-learn.
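In spirit, though, such a single-variable logistic regression boils down to gradient descent on the cross-entropy loss. A generic sketch (not the messy code in question):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, y, lr=0.1, epochs=2000):
    """Single-variable logistic regression via plain gradient descent
    on the binary cross-entropy loss."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        p = sigmoid(w * x + b)
        # Gradients of the cross-entropy loss w.r.t. w and b.
        w -= lr * ((p - y) * x).mean()
        b -= lr * (p - y).mean()
    return w, b

# Toy separable data: negatives below zero, positives above.
w, b = fit_logistic([-2, -1, 1, 2], [0, 0, 1, 1])
```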

First, I fitted a single variable to a logistic regression model:

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('cardio_train_cleaned.csv', sep=';')

X = df['ap_hi'].values.reshape(-1, 1)
y = df['cardio'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
clf = LogisticRegression(random_state=0).fit(X_train, y_train)

score = clf.score(X_test, y_test)


The score was 0.710, which indicates that systolic blood pressure has some predictive power for heart disease.

I started to look for the cutoff point, where the model starts predicting the existence of heart disease via, for example, print(clf.predict([[127]])).

I found the cutoff between 126 and 127; a systolic blood pressure at or above 127 predicts the existence of heart disease, and below 127 predicts no heart disease.
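The scan itself can be written as a short loop. Here's a sketch on synthetic stand-in data, since the model fitted above depends on the real CSV (the labels below are generated to flip around 127 on purpose):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data: labels flip around ap_hi ~ 127, plus noise.
rng = np.random.default_rng(0)
X = rng.normal(130, 20, size=1000).reshape(-1, 1)
y = (X.ravel() + rng.normal(0, 10, size=1000) > 127).astype(int)

clf = LogisticRegression().fit(X, y)

# Scan upward until the model first predicts "heart disease".
cutoff = next(v for v in range(60, 250) if clf.predict([[v]])[0] == 1)
```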

This sounds reasonable, given that the recommended blood pressure is below 120/80.

I then ran logistic regression for all the available variables by defining X as:

X = df.drop(['cardio', 'id'],axis=1).values

Interestingly enough, the score only improved to 0.712.

I tried randomly removing various features, but there was no real change in the score; the only thing that changed it was removing ap_hi, which reduced the score to under 0.7.

Predicting with .. a neural network cuz why not!

Okay okay, I know, neural networks are kinda overhyped.

But they're really cool!

I've previously built my own very simple neural network in Rust, which I think eventually did work, though I am not entirely sure.. I quickly learned that doing these things by hand needs a lot of troubleshooting and verifying.

I've also toyed, mostly a bit tongue-in-cheek, with some super basic neural networks like here.

I decided to go with Keras, which I've previously used and which I find very intuitive and easy to use.

So first I tried to predict with a single variable:

from keras.models import Sequential
from keras.layers import Dense

X = df['ap_hi'].values.reshape(-1, 1)
y = df['cardio'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = Sequential()
model.add(Dense(20, input_dim=1, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, verbose=1)

score = model.evaluate(X_test, y_test, verbose=0)
print('Accuracy:', score[1])

This actually gave me a clearly worse result than with logistic regression..

And then I remembered - standardize your data! Uff! I had forgotten to standardize the data, so let's quickly do that too:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler().fit(X)
X = scaler.transform(X)

There we go - accuracy is now 0.71, matching that of the logistic regression.
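One detail worth flagging here, since it's easy to get wrong: ideally the scaler is fitted on the training split only and then applied to both splits, so nothing about the test set leaks into preprocessing. A minimal illustration with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[100.0], [120.0], [140.0], [180.0]])
X_test = np.array([[130.0], [200.0]])

# Fit on the training split only, then apply the same transform to both;
# test values outside the training range simply land outside [0, 1].
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```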

Next I did the same for all the variables:

X = df.drop(['cardio', 'id'],axis=1).values

And in a mere 10 epochs, we got an accuracy score of 0.73. That's a bit better than the logistic regression - cool.

I tried playing around with the number of neurons, adding a hidden layer, changing the optimizer, changing the activation functions etc., but the results didn't really improve from the initial ones.

model = Sequential()
model.add(Dense(30, input_dim=11, activation='tanh'))
model.add(Dense(15, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

With the above and 100 epochs, I could reliably get a score of 0.735, mildly better than with logistic regression.

Next I wanted to see the number of false positives and false negatives, so I generated a confusion matrix:

from sklearn.metrics import confusion_matrix

predicted = model.predict(X_test)
predicted = (predicted > 0.5).astype(np.int_)
print(confusion_matrix(y_test, predicted))

The neural network outputs a number between 0 and 1, indicating how likely it is that a person has heart disease, so I first binarized it with a 0.5 threshold.

The output of the matrix was:

[[7388 1305]
 [3535 4915]]

If I read that right, it's 3535 false negatives and 1305 false positives.

That's quite awkward. The ratio of false positives to true positives is 0.265 while the ratio of false negatives to true negatives is 0.47.

The system is a bit too likely to misdiagnose a person who might have a heart problem as a person who doesn't.

What I'd really want to do is to minimize false negatives - in other words, I'd like for the sensitivity to be optimized for more aggressively than the specificity.
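For concreteness, sensitivity and specificity can be read straight off the matrix above (scikit-learn puts actual classes on the rows and predicted classes on the columns):

```python
import numpy as np

# The confusion matrix from above: rows = actual (no disease, disease),
# columns = predicted (no disease, disease).
cm = np.array([[7388, 1305],
               [3535, 4915]])
tn, fp = cm[0]
fn, tp = cm[1]

sensitivity = tp / (tp + fn)  # fraction of sick people caught
specificity = tn / (tn + fp)  # fraction of healthy people cleared
```

That works out to a sensitivity of roughly 0.58 against a specificity of roughly 0.85, which quantifies the imbalance.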

I set weights for the training samples such that no heart disease has a weight of 1 and heart disease has a weight of 2:

sample_weight = (1 + y_train)
model.fit(X_train, y_train, epochs=10, verbose=1, sample_weight=sample_weight)
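As an aside, the same trick works outside Keras too - for example, scikit-learn's LogisticRegression takes a class_weight argument. A small sketch on synthetic data (not the cardio set) of how upweighting the positive class trades specificity for sensitivity:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two overlapping groups; the positive class sits slightly higher.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(120, 15, 300),
                    rng.normal(140, 15, 300)]).reshape(-1, 1)
y = np.array([0] * 300 + [1] * 300)

plain = LogisticRegression().fit(X, y)
# Counting each positive example twice pushes the decision
# boundary down, catching more of the positives.
weighted = LogisticRegression(class_weight={0: 1, 1: 2}).fit(X, y)

recall_plain = plain.predict(X)[y == 1].mean()
recall_weighted = weighted.predict(X)[y == 1].mean()
```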

This significantly improved the sensitivity at the cost of overall accuracy, with the confusion matrix now looking like so:

[[4448 4245]
 [1127 7323]]

Overall though I would say that this isn't good enough.

Final remarks

I would say that using the above methodology, I was able to extract a few useful clues about the data; blood pressure and weight clearly correlate with heart disease.

But a lot of correlations that are nowadays established were missing from the data - e.g. for smoking.

I am quite doubtful about the quality of the data. The kernel does not specify where the data comes from and I do have a sneaking suspicion that it might not even be real data; it could be synthesized.

Or the data might be from a very specific subpopulation. The age range is odd (30-65 years, mostly 40-50 year olds) and completely lacks young and old people.

Also, there's a huge difference between how many of the women smoke and how many of the men smoke - much larger than I would expect from Europe or the US - so it's also possible the data isn't from either.

The predictions were not good enough. When the model was optimized to limit the number of false negatives, the overall accuracy got so low that it wasn't really a very useful model anymore.

There's a few things I learned about while writing this that I won't be placing into the main text anymore, such as that I could use the heatmap from Seaborn to generate this:

Which would very quickly show me a lot of the possible correlations in the data.
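A heatmap like that takes only a couple of lines. A sketch on made-up data, since the real figure came from the full cardio data set (here one correlation, weight -> ap_hi, is built in on purpose):

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in frame with one deliberate correlation.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'age': rng.normal(50, 7, 200),
    'weight': rng.normal(75, 12, 200),
})
demo['ap_hi'] = 90 + 0.5 * demo['weight'] + rng.normal(0, 8, 200)

# Correlation matrix of every column pair, rendered as a heatmap.
ax = sns.heatmap(demo.corr(), annot=True, cmap='coolwarm')
```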

Overall there's a lot more for me to learn. At some point I'll go through those basic statistics courses.

There are a lot of pitfalls around gathering and analyzing data.

The data source might be bad; the data might be noisy; you may have a lot of outliers; the data might not actually have any real correlations..

Anyhow, this was a pretty fun project. If you're interested in the code, go check it out at Github: https://github.com/tzaeru/Data-science-learning-project---predicting-heart-disease

Oh and!

I used Spyder as my IDE, which is honestly pretty awesome. That's why my code has no plt.show() calls - Spyder can show me all the plots in a nice scrollable window, in action appearing like this:

Cool stuff.

Jalmari Ikävalko
