Wheat Seeds Analysis¶
0. Description¶
Dataset repository: https://archive.ics.uci.edu/datasets (tasks: classification, clustering)
We have taken the dataset from Kaggle (data source: Wheat Seed). The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High-quality visualization of the internal kernel structure was obtained using a soft X-ray technique, which is non-destructive and considerably cheaper than more sophisticated imaging techniques such as scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine-harvested wheat grain originating from experimental fields, explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin. The objective is to analyze all relevant wheat seed data and cluster the kernels into groups.
We are importing the Wheat seed (kernels) data and performing preliminary analysis.
Seeds classification and cluster analysis. Outline:
- Data preparation and data dictionary
- EDA
- K-means clustering
- Hierarchical clustering
- Gaussian Mixture Model clustering
https://rstudio-pubs-static.s3.amazonaws.com/606831_876e370c02ec44fd9e338182aca4897c.html
1. Environment set up¶
Configuration¶
Libraries imports¶
2. Data Dictionary¶
3. Data Preparation & Pre-processing¶
Changing the Shape of Variables¶
Many machine learning algorithms require that the input variables be approximately normally distributed. We already covered the normal distribution in the statistics module, so you know what that means.
The reason for this assumption is often related to the symmetry of a normal distribution, which can be a desirable property. We also saw that skewness is a measure of symmetry and that a perfectly symmetric distribution has a skewness of 0.
We can ask: is there anything we can do to change the distribution so that a variable approximately resembles a normal distribution? The answer is yes, and we can do that by applying a transformation.
Most real-world data are right-skewed; this is very common, for example, in financial data.
from pandas import read_csv
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
# Read the data
filename = "./data/pima-indians-diabetes.data"
orig_df = read_csv(filename, sep=',', decimal='.',header = 0)
#Converting zeros into NaN and replace missing values with the mean
na_df = orig_df[['glucose','dBP','skinfold','insuline','BMI']].replace(0,np.NaN)
na_df[['tpregnant','dpf','age','isDiabetic']] = orig_df[['tpregnant','dpf','age','isDiabetic']]
df = na_df.fillna(na_df.mean())
Changing Shape to Achieve 'Normality'¶
Display the skewness of each variable
df.skew()
Insuline and dpf are highly skewed; maybe we should also consider age.
sns.displot(df['dpf'],bins=20)
You can see how right-skewed the variable is (skew = 1.92). Let's try to apply a log transformation.
ax = sns.displot(np.log(df['dpf']),bins=20)
ax.set(xlabel='log(dpf)')
The transformation seems to have worked. The 'almost' dual peak is a bit of a concern, since we would like the variable to be unimodal, but it is probably fine as the second peak is not very well defined.
Let's derive the new variable **log_dpf**; this is the version of the variable you would want to feed into a machine learning algorithm.
df['log_dpf']=np.log(df['dpf'])
print('Skewness of log_dpf: ', df['log_dpf'].skew())
Checking Normality¶
If the new variable log_dpf is approximately normal, then its normal probability plot should exhibit a linear relationship. Let's verify that.
To this end, we will use a Q-Q plot, which is a scatter plot in which two sets of quantiles, one from a perfectly normal distribution and one from the series of data we want to test, are plotted against each other.
If the two series of quantiles come from the same (normal) distribution, they should form an approximately straight line.
# Get the number of points to generate from the normal distribution
numOfPoints = len(df['log_dpf'])
# Get the mean and standard deviation of the transformed variable
mean = df['log_dpf'].mean()
sdev = df['log_dpf'].std()
# Generate the specified number of points from a 'true' normal distribution with the same mean and standard deviation
ncurve = np.random.normal(mean,sdev,numOfPoints)
# Generate a sample of 100 percentiles to compare
percs = np.linspace(0,100,100)
# Compute the corresponding quantiles for both samples
qn_transformed = np.percentile(df['log_dpf'], percs)
qn_normal = np.percentile(ncurve, percs)
# Generate the Quantile-Quantile plot (Q-Q plot)
plt.plot(qn_transformed,qn_normal, ls="", marker="o")
# Display the ideal line for reference
x = np.linspace(np.min((qn_transformed.min(),qn_normal.min())), np.max((qn_transformed.max(),qn_normal.max())))
plt.plot(x,x, color="k", ls="--")
plt.show()
If the df['log_dpf'] variable were distributed normally, all points would line up with the dashed line. We can see that toward the two ends the points start to diverge, showing non-normal behavior. However, most of the points are within an acceptable distance from the line, and therefore we can accept this transformation, as we don't need to be that precise.
Comparison with the original variable¶
Just out of curiosity, let's generate the Q-Q plot for the original variable.
# Get the number of points to generate from the normal distribution
numOfPoints = len(df['dpf'])
# Get the mean and standard deviation of the original variable
mean = df['dpf'].mean()
sdev = df['dpf'].std()
# Generate the specified number of points from a 'true' normal distribution with the same mean and standard deviation
ncurve = np.random.normal(mean,sdev,numOfPoints)
# Generate a sample of 100 percentiles to compare
percs = np.linspace(0,100,100)
# Compute the corresponding quantiles for both samples
qn_transformed = np.percentile(df['dpf'], percs)
qn_normal = np.percentile(ncurve, percs)
# Generate the Quantile-Quantile plot (Q-Q plot)
plt.plot(qn_transformed,qn_normal, ls="", marker="o")
# Display the ideal line for reference
x = np.linspace(np.min((qn_transformed.min(),qn_normal.min())), np.max((qn_transformed.max(),qn_normal.max())))
plt.plot(x,x, color="k", ls="--")
plt.show()
We can see how much the original variable deviates from normality: the transformation definitely worked!
Every time you need to check the normality assumption, either for an input variable or for other aspects of the processing (e.g., the normality assumption on residuals, which we will see later), you can use the Q-Q plot to formally test for that assumption.
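As a side note, SciPy provides a ready-made normal probability plot if you prefer not to build the Q-Q plot by hand. A minimal sketch, assuming SciPy is available in the environment:
# Alternative sketch: scipy.stats.probplot draws ordered data against theoretical normal quantiles in one call
from scipy import stats
fig, ax = plt.subplots()
stats.probplot(df['log_dpf'], dist='norm', plot=ax)
plt.show()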
3.3 Transforming Categorical to Numerical¶
Some algorithms, like regression algorithms, only accept numerical variables as input. We should then ask what to do, in this case, with categorical variables. Should we resign ourselves to the fact that we cannot use these variables?
The answer is a definite NO!
In this example we will use the "Auto Imports Database" to transform the categorical variable body_style into numerical flag variables.
Dataset Information:
-- Creator/Donor: Jeffrey C. Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
-- Date: 19 May 1987
-- Sources:
1. 1985 Model Import Car and Truck Specifications, 1985 Ward's
Automotive Yearbook.
2. Personal Auto Manuals, Insurance Services Office, 160 Water
Street, New York, NY 10038
3. Insurance Collision Report, Insurance Institute for Highway
Safety, Watergate 600, Washington, DC 20037
Attribute names located at: https://archive.ics.uci.edu/ml/datasets/automobile
from pandas import read_csv
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
NOTE: This dataset does not contain a header, so we need to specify the column names. Luckily, the author made the list of columns readily available to us.
# Read the data
columns = ["symboling", "norm_losses", "make", "fuel_type", "aspiration",
"num_doors", "body_style", "drive_wheels", "engine_location",
"wheel_base", "length", "width", "height", "curb_weight",
"engine_type", "num_cylinders", "engine_size", "fuel_system",
"bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
"city_mpg", "highway_mpg", "price"]
# Let's specify the delimiter, set the decimal point character, set the columns names and convert the '?' as NaN
filename = "./data/imports-85.data"
df = read_csv(filename, sep=',', decimal='.',names=columns, na_values="?")
df.head(5)
Let's take a look at the variable types and display the categorical variables.
df.dtypes
df.select_dtypes(include=['object'])
For simplicity let's focus on the body_style by displaying the count of unique values.
df['body_style'].value_counts()
It seems we do not have any missing values: let's verify.
df[df['body_style'].isnull()]
Any missing value would have been replaced with the mode, which in this case is 'sedan'; however, there are no missing values for this column.
Pandas supports the conversion of categorical variables into flag (dummy) variables via the get_dummies function.
newdf = pd.get_dummies(df, columns=['body_style'], prefix=['style'])
newdf.head(5)
You can transform as many categorical variables as you want at once.
Notice that Pandas creates as many flag variables as there are categories, instead of k-1 as we saw in the lecture: a little more verbose, but it has no ill effects.
NOTE: The body_style variable is automatically removed from the dataset's columns.
Custom Mapping¶
Sometimes you might want a simple binary encoding so that a car is either a sedan or not. In this case, you only need a single binary variable that takes the value 1 for sedan and 0 otherwise.
df['is_sedan'] = np.where(df['body_style'].str.contains('sedan'),1,0)
df[['body_style','is_sedan']].head(10)
Another way is to define a mapping dictionary in which we specify the exact mapping we want.
df['num_doors'].value_counts()
doors_map={'two':2,
'four':4}
df['num_doors'] = df['num_doors'].map(doors_map)
NOTE: This mapping makes sense because the categorical variable has a natural ordering of its elements, so we can simply map that information using integer values that maintain the same ordering. You can apply a similar mapping to the num_cylinders field, for example (a sketch follows below).
df.head(5)
df['num_doors'] = df['num_doors'].fillna(df['num_doors'].mode()[0])
df['num_doors'].isnull().sum()
# astype returns a new dataframe, so assign it back to keep num_doors as an integer
df = df.astype({'num_doors': 'int32'})
df.dtypes
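As noted above, the same ordinal-mapping idea applies to num_cylinders. A minimal sketch, assuming the spelled-out values used in this dataset (two, three, four, five, six, eight, twelve); verify them with value_counts() before applying the map:
# Hypothetical ordinal mapping for num_cylinders (any value missing from the map would become NaN)
cylinders_map = {'two':2, 'three':3, 'four':4, 'five':5, 'six':6, 'eight':8, 'twelve':12}
df['num_cylinders'] = df['num_cylinders'].map(cylinders_map)
df['num_cylinders'].value_counts()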
Numeric to Categorical - Equal Width Binning¶
If we look at the highway_mpg variable, we have cars with a highway mileage from 16 to 54 mpg. Let's create a four-bin categorical variable using an equal-width strategy:
- Low
- Average
- High
- Very High
df['highway_mpg'].describe()
# Because we have 4 bins we need 5 bin edges
categories = ['low','average','high','very high']
# You can specify your own bin edges; I used linspace to be more precise
bins = np.linspace(df['highway_mpg'].min(), df['highway_mpg'].max(), len(categories)+1)
print(bins)
df['mpg_bins'] = pd.cut(df['highway_mpg'], bins=bins, labels=categories, include_lowest=True)
# Let's sort the values so the categories are displayed from 'low' to 'very high'
df.sort_values('highway_mpg', inplace=True)
plt.hist(df['mpg_bins'], bins=4)
Numeric to Categorical - Equal Frequency Binning¶
Let's define a function that generates the interval labels; note that, unlike the equal-width case, these intervals do not all have the same width.
def get_labels(df, column, k=2):
    # Compute the k quantile-based intervals, de-duplicate them, and sort from smallest to largest
    # (sorting matters: the labels must line up with the bin order used by pd.qcut below)
    intervals = sorted(set(pd.qcut(df[column], q=k, precision=1).tolist()))
    labels = []
    for interval in intervals:
        sinterval = ''
        if interval.closed_left:
            sinterval+='['
        else:
            sinterval+='('
        sinterval+=str(interval.left)+','+str(interval.right)
        if interval.closed_right:
            sinterval+=']'
        else:
            sinterval+=')'
        labels.append(sinterval)
    return labels
categories = get_labels(df, 'highway_mpg',4)
print(categories)
df['mpg_bins'] = pd.qcut(df['highway_mpg'], q=4, precision=1, labels=categories)
# Let's sort the values so the bin intervals are displayed from the smallest to the largest
df.sort_values('mpg_bins', inplace=True)
ax = plt.hist(df['mpg_bins'], bins=4, edgecolor = "black")
As you can see, each category has approximately the same number of records.
As a final comment, equal-frequency binning is less common than equal-width binning.
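If you do not need the interval strings as labels, a simpler alternative sketch is to let qcut assign named categories directly (the mpg_qbins column name is only for illustration):
# Alternative sketch: equal-frequency bins with human-readable labels instead of interval strings
df['mpg_qbins'] = pd.qcut(df['highway_mpg'], q=4, labels=['low','average','high','very high'])
df['mpg_qbins'].value_counts()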
Categorical to Single Binary Variable¶
Sometimes a categorical variable has only two values, and deriving two flag variables would be a waste since the information can be encoded in a single variable. By specifying drop_first=True, Pandas will drop the first dummy column (a sketch of the conversion follows the value counts below).
df['aspiration'].value_counts()
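A minimal sketch of the single-flag conversion for aspiration (aspiration_df is just an illustrative name); the resulting column name depends on the prefix Pandas generates from the remaining category, likely aspiration_turbo here, but treat the exact name as an assumption:
# Keep only k-1 = 1 dummy column for the two-level 'aspiration' variable
aspiration_df = pd.get_dummies(df, columns=['aspiration'], drop_first=True)
# Inspect which dummy column remains after dropping the first category
[c for c in aspiration_df.columns if c.startswith('aspiration_')]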
Data Preprocessing¶
We cannot jump directly into mining the data without cleaning it first, as we saw in the CRISP-DM methodology.
Data Pre-Processing differs quite a lot when applied to numerical or textual data, so we will cover both domains separately.
For numerical data you will become proficient in dealing with missing values, identifying outliers, and transforming the data to normalize its values.
Handling Missing Data¶
Missing values are simply values that are not available, leaving "holes" in the cells. Other times, missing values appear as encoded values (e.g., 99999) representing the absence of information.
In this lab we will use the Pima Indians Diabetes Dataset.
Source: Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer Society Press.
This is an interesting case, as the author's page indicates that there are no missing values in the dataset. However, this cannot be true: there are zeros in places where they are biologically impossible, such as the blood pressure attribute. It seems very likely that zero values encode missing data. Consequently, we have to use our best judgement when using this data.
Here is the description of the fields:
tPregnant: Number of times pregnant
glucose: Plasma glucose concentration over 2 hours in an oral glucose tolerance test
dBP: Diastolic Blood Pressure (mm Hg)
skinfold: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)2)
dpf: Diabetes pedigree function (a function which scores likelihood of diabetes based on family history)
age: Age (years)
isDiabetic: Class variable (0 if non-diabetic, 1 if diabetic)
Data cleaning¶
Duplicate records¶
Missing values¶
import pandas as pd
import numpy
filename = "./data/pima-indians-diabetes.data"
dataset = pd.read_csv(filename, sep=',', decimal='.',header = 0)
print(dataset.describe())
# Counting Missing values for each column
dataset.isnull().sum()
# Counting Missing values for each row
dataset.isnull().sum(axis=1)
If we check for null values, this dataset seems to be fine: we cannot find any!
NOTE: The traditional approach to missing values is to check for nulls. While this works well in most situations, you should check for other values as well. In our case the value 0 is used to encode the absence of information, and if you don't check for it your analysis might produce sub-optimal results. As you can see, many variables have a minimum of 0, which does not make biological sense: e.g. glucose, diastolic BP, skinfold, insuline, BMI, etc.
We have to deal with these values, otherwise our analysis and modeling phases will be incorrect.
dataset.head(10)
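To make the note above concrete, we can count how many zeros each of the suspect columns contains; a quick check using the column names from this lab:
# Count the zero-encoded 'missing' values in the biologically implausible columns
(dataset[['glucose','dBP','skinfold','insuline','BMI']] == 0).sum()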
print ("Before removal: "+str(len(na_ds)))
na_ds_removed = na_ds.dropna()
print ("After removal: "+str(len(na_ds_removed)))
print ("Total records removed: "+str(len(na_ds)-len(na_ds_removed)))
Removing = records containing missing values
¶
As we said this is a waste and I don’t recommend it. For small number of record this could be a solution, but this is rarely the case.
#Converting zeros into NaN
na_ds = dataset[['glucose','dBP','skinfold','insuline','BMI']].replace(0,numpy.NaN)
na_ds[['tpregnant','dpf','age','isDiabetic']] = dataset[['tpregnant','dpf','age','isDiabetic']]
na_ds.head(10)
print ("Before removal: "+str(len(na_ds)))
na_ds_removed = na_ds.dropna()
print ("After removal: "+str(len(na_ds_removed)))
print ("Total records removed: "+str(len(na_ds)-len(na_ds_removed)))
We removed a total of 376 records, losing important information. Better to replace the missing values with other strategies.
Inputing missing values with the mean
¶
You can replace the missing value using the field mean, in the case of numerical value, or the mode, for categorical variables.
This method works better compared to the constant method. Always keep in mind that you are fabricating data to fill the holes in the data set. Having said that, this method can work quite well and largely used.
However, don’t use it if the missing value are quite numerous since you might end up with confidence intervals which could be quite over-optimistic.
na_mean_ds = na_ds.fillna(na_ds.mean())
na_mean_ds.head(10)
Looking at the zero counts, insuline and skinfold are the variables most affected by missing values, while dpf and age do not contain any.
In Python it is often easier to deal with missing values when they are encoded as NaN (Not a Number), since many functions handle this particular type of value automatically.
#Converting zeros into NaN
na_ds = dataset[['glucose','dBP','skinfold','insuline','BMI']].replace(0,numpy.NaN)
na_ds[['tpregnant','dpf','age','isDiabetic']] = dataset[['tpregnant','dpf','age','isDiabetic']]
na_ds.head(10)
Removing records containing missing values¶
As we said, this approach is wasteful and I don't recommend it. For a small number of records it could be a solution, but that is rarely the case.
print ("Before removal: "+str(len(na_ds)))
na_ds_removed = na_ds.dropna()
print ("After removal: "+str(len(na_ds_removed)))
print ("Total records removed: "+str(len(na_ds)-len(na_ds_removed)))
We removed a total of 376 records, losing important information. It is better to replace the missing values using other strategies.
Imputing missing values with the mean¶
You can replace missing values using the field mean for numerical variables, or the mode for categorical variables.
This method works better than the constant method. Always keep in mind that you are fabricating data to fill the holes in the dataset. Having said that, this method can work quite well and is widely used.
However, don't use it if the missing values are numerous, since you might end up with confidence intervals that are overly optimistic.
na_mean_ds = na_ds.fillna(na_ds.mean())
na_mean_ds.head(10)
# Let's confirm we don't have any missing values
print(na_mean_ds.isnull().sum())
What is the effect on the mean and standard deviation of the original dataset? (Notice below that the mean stays essentially unchanged, while the standard deviation shrinks, because every imputed value sits exactly at the mean.)
dataset[['dBP','skinfold']].describe()
na_mean_ds[['dBP','skinfold']].describe()
Imputing missing values with random values from the distribution¶
This method can be superior to mean substitution, since the measures of center and spread should remain closer to the original. However, there is no guarantee that the produced value (or, if you use this method for multiple fields, the combination of values) makes sense.
# Function to replace NaNs in a column with random samples drawn from its observed values
def replace_nans_with_samples(column):
    # Values available for sampling (NaNs excluded)
    non_nan_values = column.dropna()
    # Draw one random value per row and use it only where the column is NaN
    random_values = pd.Series(numpy.random.choice(non_nan_values, size=len(column)),
                              index=column.index)
    return column.fillna(random_values)
# Apply the function to each column (apply returns a new, filled dataframe)
na_random_ds = na_ds.apply(replace_nans_with_samples)
na_random_ds.head(5)
dataset[['dBP','skinfold']].describe()
na_random_ds[['dBP','skinfold']].describe()
Outlier Visualization & Detection¶
Part I - Visualizing Outliers¶
Outliers are extreme values outside the range of what is considered normal. They do not necessarily represent errors: for example, a cholesterol level of 400 is extreme but still possible, while a blood pressure of 1500 is definitely an error.
It is important to identify outliers because many statistical methods are very sensitive to them and their output can be over-influenced. In linear regression, for example, the slope of the regression line is highly affected by outliers, which can completely distort the model's predictions.
Outlier detection using histogram¶
With this method, outliers will appear in the far tails; good candidates are bins with low frequencies that are disconnected from the other bins.
from pandas import read_csv
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
# Read the data
filename = "./data/pima-indians-diabetes.data"
orig_df = read_csv(filename, sep=',', decimal='.',header = 0)
#Converting zeros into NaN and replace missing values with the mean
na_df = orig_df[['glucose','dBP','skinfold','insuline','BMI']].replace(0,np.NaN)
na_df[['tpregnant','dpf','age','isDiabetic']] = orig_df[['tpregnant','dpf','age','isDiabetic']]
df = na_df.fillna(na_df.mean())
print(df.describe())
%matplotlib inline
sns.set_style(style='white')
ax = sns.displot(df["skinfold"],bins=20)
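To make the low-frequency tail bins easier to spot, each bar can be annotated with its count. A minimal sketch, assuming matplotlib 3.4+ (for Axes.bar_label) and that seaborn exposes the histogram bars through ax.containers:
# Annotate each histogram bar with its count so sparse tail bins stand out
fg = sns.displot(df["skinfold"], bins=20)
for ax in fg.axes.ravel():
    for container in ax.containers:
        ax.bar_label(container, fontsize=8)
plt.show()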
Outlier Detection using a Scatter Plot¶
Using a scatter plot it is possible to identify a potential outlier candidate for skinfold.
ax = df.plot.scatter(x='age',y='skinfold')
Using the Box Plot¶
Let's turn our attention to a measure that is not sensitive to outliers: the Interquartile Range (IQR).
This is a very robust measure, which I use quite extensively for outlier detection. It is defined as the difference between the third and first quartiles, IQR = Q3 - Q1, and represents the spread of the middle 50% of the data.
Any data point located more than 1.5 times the IQR below Q1 or more than 1.5 times the IQR above Q3 is considered an outlier.
Let's display the distribution of 'skinfold' (or 'dBP') and the corresponding box plot. In the box plot the whiskers are drawn at:
- Lower whisker: Q1 - 1.5*IQR
- Upper whisker: Q3 + 1.5*IQR
Anything below the lower whisker or above the upper whisker is an outlier candidate.
field = 'skinfold'
plt.figure(figsize=(10,8))
plt.subplot(211)
plt.xlim(df[field].min()-1, df[field].max()*1.1)
ax = df[field].plot(kind='kde')
plt.subplot(212)
plt.xlim(df[field].min()-1, df[field].max()*1.1)
ax = sns.boxplot(x=df[field])
You can see that the box plot is quite useful for identifying outlier candidates.
I am also displaying the probability density function estimated by the KDE method; you can see that the probability associated with the points below and above the whiskers is quite low, so they fall outside the normally expected values. Remember, these are candidate outliers and it is up to us to decide what to do with them.
If you want to be aggressive, you can remove all these points from the dataset; otherwise you can remove only the most extreme values. It all depends on the type of method you choose to adopt to solve the business problem.
Data Transformation¶
Numeric variables will usually have different ranges: some small and other much larger.
For example, the GPA score for a student is a variable ranging from 0 to 4, while the salary variable can assume values as large as millions of dollars.
Some algorithms are very sensitive to differences in such ranges, since larger values can have more influence on the output. For example, neural networks are notoriously sensitive to the range of the variables, and so are algorithms that use any distance measure.
Because of this, we have to eliminate this source of distortion and bring all the variables within the same range. If the target range is [0,1] we say that we are normalizing the data.
Another way of controlling for this effect is standardization, which standardizes the scale of the effect each variable has on the output.
The main difference between the two is that a normalization technique results in values within a specified interval, while standardization usually does not; rather, it makes values comparable by expressing them in the same unit of measure.
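The cells below build these transformations by hand; as a side note, scikit-learn offers equivalent ready-made scalers. A minimal sketch on a toy column, assuming scikit-learn is installed:
# Sketch: the same two rescalings using scikit-learn (not used in the rest of this lab)
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
values = np.array([[2.0], [4.0], [6.0], [10.0]])          # toy data, one numeric column
print(MinMaxScaler().fit_transform(values).ravel())       # normalized to [0, 1]
print(StandardScaler().fit_transform(values).ravel())     # standardized to mean 0 and (population) std 1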
Part I - Normalizing & Standardizing your Data¶
Reading the data¶
By now, you should be very familiar with reading data using Pandas. Here, in addition to reading data we are replacing the missing values with the mean as we learned in the previous video.
from pandas import read_csv
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import math
# Will display any number with 4 decimal points instead of the scientific notation
pd.options.display.float_format = '{:,.4f}'.format
# Read the data
filename = "./data/pima-indians-diabetes.data"
orig_df = read_csv(filename, sep=',', decimal='.',header = 0)
#Converting zeros into NaN and replace missing values with the mean
na_df = orig_df[['glucose','dBP','skinfold','insuline','BMI']].replace(0,np.NaN)
na_df[['tpregnant','dpf','age','isDiabetic']] = orig_df[['tpregnant','dpf','age','isDiabetic']]
df = na_df.fillna(na_df.mean())
Let's display the histogram along with the estimated probability density function for the variables skinfold and dBP. Because the skinfold distribution has a long tail, we are filtering out any values larger than 60 (does that sound like outlier filtering?).
print(f"'dBP' statistics: mean ({df['dBP'].mean():.4f}), std ({df['dBP'].std():.4f})") # //! display 4 decimals points / pandas
print(f"'skinfold' statistics: mean ({df['skinfold'].mean():.4f}), std ({df['skinfold'].std():.4f})")
fig, ax = plt.subplots()
sns.histplot(df["dBP"],bins=20, ax=ax)
sns.histplot(df["skinfold"][df['skinfold']<60],bins=20, ax=ax)
As you can see, the two distributions are on different scales: skinfold has a mean and std of 29.15 and 8.79, while dBP has a mean of 72.40 and std of 12.10, which makes them hard to compare directly. So let's apply some transformations and see what happens.
Min-Max Transformation¶
Let's apply the min-max transformation:
df['mMskinfold']=(df['skinfold']-df['skinfold'].min())/(df['skinfold'].max()-df['skinfold'].min())
df['mMdBP']=(df['dBP']-df['dBP'].min())/(df['dBP'].max()-df['dBP'].min())
fig, ax = plt.subplots()
plt1 = sns.histplot(df["mMskinfold"],bins=20, ax=ax)
plt2 = sns.histplot(df["mMdBP"],bins=20, ax=ax)
Z-Score Transformation¶
df['zskinfold'] = (df['skinfold']-df['skinfold'].mean())/df['skinfold'].std()
df['zdBP'] = (df['dBP']-df['dBP'].mean())/df['dBP'].std()
fig, ax = plt.subplots()
sns.histplot(df["zskinfold"][df['zskinfold']<4],bins=20, ax=ax)
sns.histplot(df["zdBP"],bins=20, ax=ax)
As you might have noticed, the Z-score transformation does not change the shape of the original distribution; it only changes its unit of measure. So do not make the mistake of thinking that, because you applied the Z-score transformation, the resulting variable is normally distributed: that is not correct!
Decimal Scaling¶
def get_ndigits(maxvalue):
    # Use the ceiling so that maxvalue / 10**n never exceeds 1 (round() would give 1.22 for dBP's max of 122)
    return math.ceil(math.log10(maxvalue))
n=get_ndigits(df['skinfold'].abs().max())
df['dskinfold'] = df['skinfold']/10**n
n=get_ndigits(df['dBP'].abs().max())
df['ddBP'] = df['dBP']/10**n
fig, ax = plt.subplots()
sns.histplot(df["dskinfold"],bins=20, ax=ax)
sns.histplot(df["ddBP"],bins=20, ax=ax)
Displaying all Transformations¶
df[['skinfold','mMskinfold','zskinfold','dskinfold']].head(10)
Part II - Numeric Method to Identify Outliers¶
Using ZScore to identify Outliers¶
Because a Z-score transformation changes the unit of measure, if your data is normally distributed you also have the extra information that about 99.7% of the values will be within 3 standard deviations of the mean.
This means that values outside 3 standard deviations will be quite rare and far from what we should expect. In other words, these values can be considered outliers as they represent something that we would not expect.
# Let's display candidate outliers with the zscore method
coutliers = df['zskinfold'][(df['zskinfold']<-3) | (df['zskinfold']>3)]
print(coutliers.tolist())
We have identified several candidates, let's take a further look.
What is the original value corresponding to 7.94528970815517?
mean = df['skinfold'].mean()
stdev = df['skinfold'].std()
print('Original_value of 7.94528970815517 is ', (7.94528970815517*stdev)+mean)
df['skinfold'][df['skinfold']>=99]
df['skinfold'].sort_values()
Ok, 99 seems to be a true outlier and should be removed.
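If we decide to act on it, a minimal sketch of the removal; it writes to a new dataframe (df_no_outlier is just an illustrative name) so the rest of the lab is unaffected:
# Drop the record(s) with the implausible skinfold value of 99
df_no_outlier = df[df['skinfold'] < 99]
print(len(df), '->', len(df_no_outlier))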
def standardize(df, column):
    mean = df[column].mean()
    stdev = df[column].std()
    df['z_'+column] = (df[column]-mean)/stdev
    return df

def get_zoutliers(df):
    columns = df.columns
    result = {}
    for column in columns:
        # Standardize Column
        df = standardize(df, column)
        coutliers = df['z_'+column][(df['z_'+column]<-3) | (df['z_'+column]>3)]
        result[column]=coutliers.tolist()
    return result
coutliers = get_zoutliers(df)
print(coutliers)
Using Interquartile Range (IQR)¶
def iqr_outliers(df, column):
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    outliers = df[column][(df[column] < (q1 - 1.5 * iqr)) | (df[column] > (q3 + 1.5 * iqr))]
    return outliers

columns = df.columns
outlier_candidates = {}
for c in columns:
    cos = iqr_outliers(df, c)
    outlier_candidates[c] = {'n':len(cos), 'candidates':list(cos)}
print(outlier_candidates)
4. EDA (Exploratory Data Analysis)¶
4.1 Correlation between Variables¶
4.2 Outliers & Relationship between the target and the other variables¶
5. K-Means Clustering¶
5.1 Finding the Optimal Number of Clusters¶
Elbow Method¶
Silhouette Method¶
Dunn's Method¶
Find K-means Cluster¶
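As a starting point for this section, here is a minimal sketch of the elbow and silhouette checks followed by the k-means fit. The file path, separator and column names are assumptions based on the usual UCI seeds layout (7 kernel measurements plus the variety label); adjust them to however the data is actually stored.
# Minimal k-means sketch (assumed path, column names and layout; scikit-learn required)
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
cols = ['area','perimeter','compactness','kernel_length','kernel_width','asymmetry','groove_length','variety']
seeds_df = pd.read_csv('./data/seeds_dataset.txt', sep=r'\s+', names=cols)
# Standardize the measurements so no single feature dominates the distance computation
X = StandardScaler().fit_transform(seeds_df.drop(columns='variety'))
# Elbow (inertia) and silhouette scores over a range of k values
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"k={k}: inertia={km.inertia_:.1f}, silhouette={silhouette_score(X, km.labels_):.3f}")
# Fit the final model; three clusters are expected since there are three wheat varieties
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
seeds_df['cluster'] = kmeans.labels_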
6. Hierarchical Clustering¶
6.1 Finding the Optimal Number of Clusters¶
Elbow Method¶
Dunn's Method¶
Cluster Dendrogram¶
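A minimal sketch for this section, reusing the standardized matrix X from the k-means sketch above (SciPy assumed available):
# Agglomerative clustering with Ward linkage and the corresponding dendrogram
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt
Z = linkage(X, method='ward')            # hierarchical merge tree on Euclidean distances
plt.figure(figsize=(10, 4))
dendrogram(Z, no_labels=True)
plt.title('Cluster dendrogram (Ward linkage)')
plt.show()
# Cut the tree into three clusters to compare with the known varieties
hier_labels = fcluster(Z, t=3, criterion='maxclust')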
7. Gaussian Mixture Model Clustering¶
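A minimal sketch for the Gaussian Mixture Model section, again reusing X from the k-means sketch above; BIC is one common way to compare component counts (lower is better):
# Gaussian Mixture Model clustering with a BIC scan over the number of components
from sklearn.mixture import GaussianMixture
for k in range(2, 8):
    gmm = GaussianMixture(n_components=k, covariance_type='full', random_state=42).fit(X)
    print(f"components={k}: BIC={gmm.bic(X):.1f}")
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42).fit(X)
gmm_labels = gmm.predict(X)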