When one or more features affects the results disproportionately, normalisation or scaling puts them on a level playing field. In this activity, we will apply different methods for data normalisation and transformation. We first read the dataset used for the first part of the analysis.
import pandas as pd
import numpy as np
df = pd.read_csv('wine_data.csv', header=None, usecols=[0, 1, 2])
df.columns = ['Class label', 'Alcohol', 'Malic acid']
df.head()
df.describe()
As we can see in the tables above, the features Alcohol (percent/volume) and Malic acid (g/l) are measured on different scales, so scaling is necessary before any comparison or combination of the data.
df.Alcohol.mean() / df["Malic acid"].mean() # difference is factor of ~5x
We use the scikit-learn library to standardise the data (mean = 0, SD = 1). The class you are going to use is StandardScaler. More reading material can be found here.
The task here is to standardise the values of Alcohol and Malic acid, and append the standardised variables to the DataFrame "df" as follows:

from sklearn import preprocessing
std_scale = preprocessing.StandardScaler().fit(df[['Alcohol', 'Malic acid']])
df_std = std_scale.transform(df[['Alcohol', 'Malic acid']])
df_std[0:5]
# put it alongside data... to view
df['Ascaled'] = df_std[:,0] # so 'Ascaled' is Alcohol scaled
df['MAscaled'] = df_std[:,1] # and 'MAscaled' is Malic acid scaled
df.head()
Now, compute and display the standardised values for both features, and check that they have a mean of 0 and an SD of 1.
df.describe() # check that μ = 0 and σ = 1... approx
Or you can print out values:
print('Mean after standardisation:\nAlcohol = {:.2f}, Malic acid = {:.2f}'
      .format(df_std[:, 0].mean(), df_std[:, 1].mean()))
print('\nStandard deviation after standardisation:\nAlcohol = {:.2f}, Malic acid = {:.2f}'
      .format(df_std[:, 0].std(), df_std[:, 1].std()))
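Under the hood, the StandardScaler applies the z-score formula column-wise. A minimal numpy sketch on a toy array (values assumed for illustration, not the wine data) shows the same check done by hand:

```python
import numpy as np

# Toy stand-in for the two wine columns (values assumed for illustration)
x = np.array([[14.2, 1.7],
              [13.2, 1.8],
              [13.9, 2.4],
              [12.4, 3.1]])

# z-score each column: subtract the column mean and divide by the
# population (ddof=0) standard deviation -- the convention StandardScaler uses
z = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(z.mean(axis=0), 0))  # True
print(np.allclose(z.std(axis=0), 1))   # True
```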
To investigate how standardisation actually affects the data, we can visualise it by plotting the variable values.
Firstly, plot the original data, i.e., the data before standardisation:
%matplotlib inline
df["Alcohol"].plot(), df["Malic acid"].plot()
Now, we plot the standardized data, and observe the range and the centre of the distribution for the standardised features.
# or split them from the others
df["MAscaled"].plot(), df["Ascaled"].plot()
You can see from the graphs above that the original and standardised data have the same shape, but are shifted and rescaled.
df["Ascaled"].plot(), df["Alcohol"].plot()
df["MAscaled"].plot(), df["Malic acid"].plot()
In this section, we discuss a different type of normalisation, one that reshapes the range of the data. We process the same data used in the previous section, and we can implement this either with scikit-learn or manually.
Please refer to section 4.3.1.1 "Scaling features to a range" for a more detailed discussion. Similar to what you have done with the StandardScaler, here you are going to use the MinMaxScaler.
minmax_scale = preprocessing.MinMaxScaler().fit(df[['Alcohol', 'Malic acid']])
df_minmax = minmax_scale.transform(df[['Alcohol', 'Malic acid']])
df_minmax[0:5]
Of course, you can also implement min-max normalisation yourself, according to the formula discussed in the lecture.
Firstly, find the min and max of "df.Alcohol".
minA = df.Alcohol.min()
maxA = df.Alcohol.max()
minA, maxA
Manually apply the min-max normalization to the first value of "df.Alcohol",
a = df.Alcohol[0] # the first value, for practice
# Write your code here
mma = (a - minA) / (maxA - minA)
mma
and then compare the manually computed value with the one given by the MinMaxScaler above.
df_minmax[0][0]
The two values should be the same. Now, let's look at the normalisation of the max value in "df.Alcohol".
a = df[df.Alcohol == df.Alcohol.max()].Alcohol
mma = (a - minA) / (maxA - minA)
mma
The normalised value of the max must be exactly 1.0; think about the reason! Then, how about the min value of "df.Alcohol"?
print('Min-value after min-max scaling:\nAlcohol = {:.2f}, Malic acid = {:.2f}'
      .format(df_minmax[:, 0].min(), df_minmax[:, 1].min()))
print('\nMax-value after min-max scaling:\nAlcohol = {:.2f}, Malic acid = {:.2f}'
      .format(df_minmax[:, 0].max(), df_minmax[:, 1].max()))
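The manual formula above can also be applied to whole columns at once. A pure-numpy sketch on a toy array (values assumed for illustration) makes it clear why the min maps to 0 and the max to 1:

```python
import numpy as np

# Toy stand-in for the two wine columns (values assumed for illustration)
x = np.array([[14.2, 1.7],
              [13.2, 1.8],
              [13.9, 2.4],
              [12.4, 3.1]])

# (x - min) / (max - min), column-wise -- the same formula MinMaxScaler applies
mm = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

print(mm.min(axis=0))  # [0. 0.]
print(mm.max(axis=0))  # [1. 1.]
```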
# and plot
%matplotlib inline
from matplotlib import pyplot as plt
def plot():
    f = plt.figure(figsize=(8, 6))
    plt.scatter(df['Alcohol'], df['Malic acid'],
                color='green', label='input scale', alpha=0.5)
    # plt.scatter(df_std[:,0], df_std[:,1], color='red',
    #             label='Standardized [$N(\mu=0, \; \sigma=1)$]', alpha=0.3)
    plt.scatter(df_std[:, 0], df_std[:, 1], color='red',
                label='Standardized u=0, s=1', alpha=0.3)  # ASCII stand-in for mu = 0, sigma = 1
    plt.scatter(df_minmax[:, 0], df_minmax[:, 1],
                color='blue', label='min-max scaled [min=0, max=1]', alpha=0.3)
    plt.title('Alcohol and Malic Acid content of the wine dataset')
    plt.xlabel('Alcohol')
    plt.ylabel('Malic Acid')
    plt.legend(loc='upper left')
    plt.grid()
    plt.tight_layout()
    # f.savefig("z_min_max.pdf", bbox_inches='tight')
plot()
plt.show()
fig, ax = plt.subplots(3, figsize=(6, 14))
for a, d, l in zip(range(len(ax)),
                   (df[['Alcohol', 'Malic acid']].values, df_std, df_minmax),
                   ('Input scale',
                    'Standardized [u=0 s=1]',
                    'min-max scaled [min=0, max=1]')):
    for i, c in zip(range(1, 4), ('red', 'blue', 'green')):
        ax[a].scatter(d[df['Class label'].values == i, 0],
                      d[df['Class label'].values == i, 1],
                      alpha=0.5,
                      color=c,
                      label='Class %s' % i)
    ax[a].set_title(l)
    ax[a].set_xlabel('Alcohol')
    ax[a].set_ylabel('Malic Acid')
    ax[a].legend(loc='upper left')
    ax[a].grid()
plt.tight_layout()
plt.show()
Another way to reshape data is to perform a data transformation. We will work through an example of data with a right (positive) skew, where we need to compress large values. We first read the data used for this activity.
import pandas as pd
data = pd.read_csv("bmr.csv")
data.head()
plt.scatter(data["BMR(W)"], data["Mass(g)"]) # before
In Tukey's ladder of powers, we discussed different kinds of transformations. Here you are going to compare the following three kinds of transformations.
The implementation of the root transformation is given as follows. You need to complete the other two kinds of transformations.
import math
data['lmr'] = None
i = 0
for row in data.iterrows():
    # Root transform of the metabolic rate
    data.loc[i, 'lmr'] = math.sqrt(row[1]['BMR(W)'])
    i += 1
data.head()
data['lbm'] = None
i = 0
for row in data.iterrows():
    # Root transform of the body mass
    data.loc[i, 'lbm'] = math.sqrt(row[1]['Mass(g)'])
    i += 1
data.head()
plt.scatter(data.lbm, data.lmr) # and after
import math
data['lmr'] = None
i = 0
for row in data.iterrows():
    # Write your code below
    i += 1
data.head()
data['lbm'] = None
i = 0
for row in data.iterrows():
    # Write your code below
    i += 1
data.head()
plt.scatter(data.lbm, data.lmr)  # and after
import math
data['lmr'] = None
i = 0
for row in data.iterrows():
    # Write your code below
    i += 1
data.head()
data['lbm'] = None
i = 0
for row in data.iterrows():
    # Write your code below
    i += 1
data.head()
plt.scatter(data.lbm, data.lmr)  # and after
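For reference, the root and log rungs of the ladder can also be applied in vectorised form without a loop. A self-contained sketch on toy stand-in values for the bmr.csv columns (the real file is read above; these numbers are assumed for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bmr.csv columns (values assumed for illustration)
data = pd.DataFrame({'Mass(g)': [4200.0, 32000.0, 115000.0, 407000.0],
                     'BMR(W)': [10.075, 49.984, 148.940, 224.779]})

root = np.sqrt(data[['Mass(g)', 'BMR(W)']])  # one rung down: x -> sqrt(x)
log = np.log(data[['Mass(g)', 'BMR(W)']])    # further down: x -> log(x)

# Moving down the ladder compresses large values more aggressively,
# so the spread (max/min ratio) shrinks at each step
print(root['Mass(g)'].max() / root['Mass(g)'].min())
print(log['Mass(g)'].max() / log['Mass(g)'].min())
```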
The best transformation for this data is the log transformation. As the data is positively skewed, we need to compress large values; that means moving down the ladder of powers to spread out the data clustered at the lower values. Therefore, the logarithmic transformation is appropriate in this case.
Some materials used in this tutorial are based on http://sebastianraschka.com/Articles/2014_about_feature_scaling.html
Consider the following dataset:
body_mass = [32000, 37800, 347000, 4200, 196500, 100000, 4290,
32000, 65000, 69125, 9600, 133300, 150000, 407000, 115000, 67000,
325000, 21500, 58588, 65320, 85000, 135000, 20500, 1613, 1618]
metabolic_rate = [49.984, 51.981, 306.770, 10.075, 230.073,
148.949, 11.966, 46.414, 123.287, 106.663, 20.619, 180.150,
200.830, 224.779, 148.940, 112.430, 286.847, 46.347, 142.863,
106.670, 119.660, 104.150, 33.165, 4.900, 4.865]
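A quick numerical check on the body-mass values supports the conclusion above: the raw data has a strong positive skew, and the log transform reduces its magnitude. A sketch using pandas' built-in sample skewness:

```python
import numpy as np
import pandas as pd

body_mass = [32000, 37800, 347000, 4200, 196500, 100000, 4290,
             32000, 65000, 69125, 9600, 133300, 150000, 407000, 115000, 67000,
             325000, 21500, 58588, 65320, 85000, 135000, 20500, 1613, 1618]

mass = pd.Series(body_mass, dtype=float)

# A clearly positive sample skewness confirms the right skew;
# after a log transform the magnitude of the skewness drops
print(mass.skew())
print(np.log(mass).skew())
```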