When one or more features affects the results disproportionately, normalisation or scaling puts them on a level playing field. In this activity, we will apply different methods for data normalisation and transformation. We first read the dataset used for the first part of the analysis.
import pandas as pd
import numpy as np
df = pd.read_csv('wine_data.csv', header=None, usecols=[0, 1, 2])
df.columns = ['Class label', 'Alcohol', 'Malic acid']
df.head()
df.describe()
As we can see in the tables above, the features Alcohol (percent/volume) and Malic acid (g/l) are measured on different scales, so scaling is necessary before any comparison or combination of the data.
df.Alcohol.mean() / df["Malic acid"].mean() # difference is factor of ~5x
We use the scikit-learn library to standardise the data (mean = 0, SD = 1). The class you are going to use is StandardScaler. More reading material can be found here.
The task here is to standardise the values of Alcohol and Malic acid, and append the standardised variables to the DataFrame "df" as follows:

from sklearn import preprocessing
std_scale = preprocessing.StandardScaler().fit(df[['Alcohol', 'Malic acid']])
df_std = std_scale.transform(df[['Alcohol', 'Malic acid']])
df_std[0:5]
# put it alongside data... to view
df['Ascaled'] = df_std[:,0] # so 'Ascaled' is Alcohol scaled
df['MAscaled'] = df_std[:,1] # and 'MAscaled' is Malic acid scaled
df.head()
Now, compute and display the standardised values for both features, and check that they have a mean of 0 and an SD of 1.
df.describe() # check that μ = 0 and σ = 1... approx
Or you can print out values:
print('Mean after standardisation:\nAlcohol = {:.2f}, Malic acid = {:.2f}'
      .format(df_std[:, 0].mean(), df_std[:, 1].mean()))
print('\nStandard deviation after standardisation:\nAlcohol = {:.2f}, Malic acid = {:.2f}'
      .format(df_std[:, 0].std(), df_std[:, 1].std()))
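Under the hood, the StandardScaler applies the z-score formula column-wise. A minimal numpy sketch on a toy array (values assumed for illustration, not the wine data) shows the same check done by hand:

```python
import numpy as np

# Toy stand-in for the two wine columns (values assumed for illustration)
x = np.array([[14.2, 1.7],
              [13.2, 1.8],
              [13.9, 2.4],
              [12.4, 3.1]])

# z-score each column: subtract the column mean and divide by the
# population (ddof=0) standard deviation -- the convention StandardScaler uses
z = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(z.mean(axis=0), 0))  # True
print(np.allclose(z.std(axis=0), 1))   # True
```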
To investigate how standardisation actually affects the data, we can visualise it by plotting the variable values.
Firstly, plot the original data, i.e., the data before standardisation:
%matplotlib inline
df["Alcohol"].plot(), df["Malic acid"].plot()
Now, we plot the standardized data, and observe the range and the centre of the distribution for the standardised features.
# or split them from the others
df["MAscaled"].plot(), df["Ascaled"].plot()
You can see from the graphs above that the original and standardised data have the same shape, but are shifted and rescaled.
df["Ascaled"].plot(), df["Alcohol"].plot()
df["MAscaled"].plot(), df["Malic acid"].plot()
In this section, we discuss a different type of normalisation, one that reshapes the range of the data. We process the same data used in the previous section, and we can implement this either with scikit-learn or manually.
Please refer to section 4.3.1.1 "Scaling features to a range" for a more detailed discussion. Similar to what you have done with the StandardScaler, here you are going to use the MinMaxScaler.
minmax_scale = preprocessing.MinMaxScaler().fit(df[['Alcohol', 'Malic acid']])
df_minmax = minmax_scale.transform(df[['Alcohol', 'Malic acid']])
df_minmax[0:5]
Of course, you can also implement min-max normalisation yourself, according to the formula discussed in the lecture.
Firstly, find the min and max of "df.Alcohol".
minA = df.Alcohol.min()
maxA = df.Alcohol.max()
minA, maxA
Manually apply the min-max normalization to the first value of "df.Alcohol",
a = df.Alcohol[0] # the first value, for practice
# Write your code here
mma = (a - minA) / (maxA - minA)
mma
and then compare the manually computed value with the one given by the MinMaxScaler above.
df_minmax[0][0]
The two values should be the same. Now, let's look at the normalisation of the max value in "df.Alcohol".
a = df[df.Alcohol == df.Alcohol.max()].Alcohol
mma = (a - minA) / (maxA - minA)
mma
The normalised value of the max must be exactly 1.0; think about the reason! Then, how about the min value of "df.Alcohol"?
print('Min-value after min-max scaling:\nAlcohol = {:.2f}, Malic acid = {:.2f}'
      .format(df_minmax[:, 0].min(), df_minmax[:, 1].min()))
print('\nMax-value after min-max scaling:\nAlcohol = {:.2f}, Malic acid = {:.2f}'
      .format(df_minmax[:, 0].max(), df_minmax[:, 1].max()))
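The manual formula above can also be applied to whole columns at once. A pure-numpy sketch on a toy array (values assumed for illustration) makes it clear why the min maps to 0 and the max to 1:

```python
import numpy as np

# Toy stand-in for the two wine columns (values assumed for illustration)
x = np.array([[14.2, 1.7],
              [13.2, 1.8],
              [13.9, 2.4],
              [12.4, 3.1]])

# (x - min) / (max - min), column-wise -- the same formula MinMaxScaler applies
mm = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

print(mm.min(axis=0))  # [0. 0.]
print(mm.max(axis=0))  # [1. 1.]
```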
# and plot
%matplotlib inline
from matplotlib import pyplot as plt
def plot():
    f = plt.figure(figsize=(8, 6))
    plt.scatter(df['Alcohol'], df['Malic acid'],
                color='green', label='input scale', alpha=0.5)
    # plt.scatter(df_std[:,0], df_std[:,1], color='red',
    #             label='Standardized [$N(\mu=0, \; \sigma=1)$]', alpha=0.3)
    plt.scatter(df_std[:, 0], df_std[:, 1], color='red',
                label='Standardized u=0, s=1', alpha=0.3)  # ASCII stand-in for mu = 0, sigma = 1
    plt.scatter(df_minmax[:, 0], df_minmax[:, 1],
                color='blue', label='min-max scaled [min=0, max=1]', alpha=0.3)
    plt.title('Alcohol and Malic Acid content of the wine dataset')
    plt.xlabel('Alcohol')
    plt.ylabel('Malic Acid')
    plt.legend(loc='upper left')
    plt.grid()
    plt.tight_layout()
    # f.savefig("z_min_max.pdf", bbox_inches='tight')
plot()
plt.show()
fig, ax = plt.subplots(3, figsize=(6, 14))
for a, d, l in zip(range(len(ax)),
                   (df[['Alcohol', 'Malic acid']].values, df_std, df_minmax),
                   ('Input scale',
                    'Standardized [u=0 s=1]',
                    'min-max scaled [min=0, max=1]')):
    for i, c in zip(range(1, 4), ('red', 'blue', 'green')):
        ax[a].scatter(d[df['Class label'].values == i, 0],
                      d[df['Class label'].values == i, 1],
                      alpha=0.5,
                      color=c,
                      label='Class %s' % i)
    ax[a].set_title(l)
    ax[a].set_xlabel('Alcohol')
    ax[a].set_ylabel('Malic Acid')
    ax[a].legend(loc='upper left')
    ax[a].grid()
plt.tight_layout()
plt.show()
Another way to reshape data is to perform a data transformation. We will work through an example of data with a right (positive) skew, where we need to compress large values. We first read the data used for this activity.
import pandas as pd
data = pd.read_csv("bmr.csv")
data.head()
plt.scatter(data["BMR(W)"], data["Mass(g)"]) # before
In Tukey's ladder of powers, we discussed different kinds of transformations. Here you are going to compare the following three kinds of transformations.
The implementation of the root transformation is given as follows. You need to complete the other two kinds of transformations.
import math
data['lmr'] = None
i = 0
for row in data.iterrows():
    # Root transform of the metabolic rate
    data.loc[i, 'lmr'] = math.sqrt(row[1]['BMR(W)'])
    i += 1
data.head()
data['lbm'] = None
i = 0
for row in data.iterrows():
    # Root transform of the body mass
    data.loc[i, 'lbm'] = math.sqrt(row[1]['Mass(g)'])
    i += 1
data.head()
plt.scatter(data.lbm, data.lmr) # and after
import math
data['lmr'] = None
i = 0
for row in data.iterrows():
    # Write your code below
    i += 1
data.head()
data['lbm'] = None
i = 0
for row in data.iterrows():
    # Write your code below
    i += 1
data.head()
plt.scatter(data.lbm, data.lmr)  # and after
import math
data['lmr'] = None
i = 0
for row in data.iterrows():
    # Write your code below
    i += 1
data.head()
data['lbm'] = None
i = 0
for row in data.iterrows():
    # Write your code below
    i += 1
data.head()
plt.scatter(data.lbm, data.lmr)  # and after
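For reference, the root and log rungs of the ladder can also be applied in vectorised form without a loop. A self-contained sketch on toy stand-in values for the bmr.csv columns (the real file is read above; these numbers are assumed for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bmr.csv columns (values assumed for illustration)
data = pd.DataFrame({'Mass(g)': [4200.0, 32000.0, 115000.0, 407000.0],
                     'BMR(W)': [10.075, 49.984, 148.940, 224.779]})

root = np.sqrt(data[['Mass(g)', 'BMR(W)']])  # one rung down: x -> sqrt(x)
log = np.log(data[['Mass(g)', 'BMR(W)']])    # further down: x -> log(x)

# Moving down the ladder compresses large values more aggressively,
# so the spread (max/min ratio) shrinks at each step
print(root['Mass(g)'].max() / root['Mass(g)'].min())
print(log['Mass(g)'].max() / log['Mass(g)'].min())
```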
The best transformation for this data is the log transformation. As the data is positively skewed, we need to compress large values; that means moving down the ladder of powers to spread out the data clustered at the lower values. Therefore, the logarithmic transformation is appropriate in this case.
Some materials used in this tutorial are based on http://sebastianraschka.com/Articles/2014_about_feature_scaling.html
Consider the following dataset:
body_mass = [32000, 37800, 347000, 4200, 196500, 100000, 4290,
32000, 65000, 69125, 9600, 133300, 150000, 407000, 115000, 67000,
325000, 21500, 58588, 65320, 85000, 135000, 20500, 1613, 1618]
metabolic_rate = [49.984, 51.981, 306.770, 10.075, 230.073,
148.949, 11.966, 46.414, 123.287, 106.663, 20.619, 180.150,
200.830, 224.779, 148.940, 112.430, 286.847, 46.347, 142.863,
106.670, 119.660, 104.150, 33.165, 4.900, 4.865]
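A quick numerical check on the body-mass values supports the conclusion above: the raw data has a strong positive skew, and the log transform reduces its magnitude. A sketch using pandas' built-in sample skewness:

```python
import numpy as np
import pandas as pd

body_mass = [32000, 37800, 347000, 4200, 196500, 100000, 4290,
             32000, 65000, 69125, 9600, 133300, 150000, 407000, 115000, 67000,
             325000, 21500, 58588, 65320, 85000, 135000, 20500, 1613, 1618]

mass = pd.Series(body_mass, dtype=float)

# A clearly positive sample skewness confirms the right skew;
# after a log transform the magnitude of the skewness drops
print(mass.skew())
print(np.log(mass).skew())
```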