Normalization and Standardization

For any use case, the most important thing is the data. You start by collecting the data, and that data will have many features. These features include independent features and a dependent feature, and in supervised machine learning we use the independent features to predict the dependent feature.

Each of these features has 2 important properties.

1. Unit 

2. Magnitude

Let's take features such as a person's age, height, and weight. If I consider the age feature, the unit is the number of years and the magnitude is the value itself.

For example: if I say 25 years, then 25 is the magnitude and years is the unit.

Every feature is expressed with a unit and a magnitude, so if you have many features they will be measured in different units, and both the unit and the magnitude vary from feature to feature. Many machine learning algorithms, for example distance-based ones, are sensitive to these differences, so it is important to bring the features onto a common scale before giving the data to the algorithm.
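To see why differing magnitudes matter, here is a minimal sketch in plain Python. The three people and their height and weight values are made up for illustration, and Euclidean distance is only an assumed stand-in for whatever distance-based algorithm you might use.

```python
import math

# Hypothetical people described by (height in metres, weight in kilograms).
a = (1.60, 70.0)
b = (1.90, 71.0)  # 30 cm taller than a, almost the same weight
c = (1.61, 75.0)  # almost the same height as a, 5 kg heavier

# Euclidean distance on the raw, unscaled features.
dist_ab = math.dist(a, b)
dist_ac = math.dist(a, c)

# Share of the squared a-c distance that comes from the weight feature alone.
weight_share_ac = (a[1] - c[1]) ** 2 / dist_ac ** 2

print(f"distance a-b: {dist_ab:.3f}")
print(f"distance a-c: {dist_ac:.3f}")
print(f"weight share of a-c distance: {weight_share_ac:.4%}")
```

On the raw values, a 30 cm gap in height contributes far less to the distance than a 5 kg gap in weight, simply because weight is recorded with a larger magnitude. This is the kind of effect that scaling is meant to remove.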

For this type of problem, we use 2 main techniques.

1. Normalization: Normalization rescales a feature to the range 0 to 1.

2. Standardization: Standardization rescales a feature so that it has a mean of 0 and a standard deviation of 1, like a standard normal distribution.

Let us discuss Normalization and Standardization.


Normalization (Min-Max normalization)

In this approach we scale the values of each feature to the range 0 to 1.

X norm = (X - X min) / (X max - X min)

from sklearn.preprocessing import MinMaxScaler

# df is assumed to be a pandas DataFrame with numeric 'age' and 'height' columns
scaling = MinMaxScaler()

scaled = scaling.fit_transform(df[['age', 'height']])

In the code above I am scaling the age and height features to the range 0 to 1.
scikit-learn provides a class called MinMaxScaler, which transforms the data by applying the X norm formula above to every value in age and height, using each column's own minimum and maximum.
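To make the X norm formula concrete, here is a small sketch that applies it by hand in plain Python. The helper name min_max_normalize and the sample ages and heights are made up for illustration; this is not how scikit-learn is implemented, only the same arithmetic on a single feature.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so the formula would divide by zero.
        # Returning zeros is a choice made for this sketch.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


ages = [18, 25, 40, 62]          # hypothetical ages in years
heights = [150, 165, 172, 190]   # hypothetical heights in centimetres

ages_scaled = min_max_normalize(ages)
heights_scaled = min_max_normalize(heights)

print(ages_scaled)
print(heights_scaled)
```

Each feature is scaled separately, so the smallest value of a feature maps to 0 and the largest maps to 1, whatever the original unit was.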


Standardization (Z-score normalization)

Here all the features are transformed so that each one has mean = 0 and standard deviation = 1, the same mean and standard deviation as a standard normal distribution. The shape of each feature's distribution is not changed by this transformation.

Z = (X - Mean) / Standard Deviation

from sklearn.preprocessing import StandardScaler

# df is assumed to be a pandas DataFrame with numeric 'age' and 'height' columns
scaling = StandardScaler()
scaled = scaling.fit_transform(df[['age', 'height']])

scikit-learn provides a class called StandardScaler, which transforms each feature so that it has mean = 0 and standard deviation = 1.

This is one of the most widely used scaling techniques across problem statements.
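The Z formula can also be worked through by hand. The sketch below uses only Python's statistics module; the helper name standardize and the sample ages are made up for illustration. It uses the population standard deviation (dividing by n), which is also what scikit-learn's StandardScaler uses.

```python
import statistics


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mean = statistics.fmean(values)
    # Population standard deviation (divides by n, not n - 1).
    std = statistics.pstdev(values)
    if std == 0:
        # Constant feature, so the formula would divide by zero.
        # Returning zeros is a choice made for this sketch.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


ages = [18, 25, 40, 62]  # hypothetical ages in years
ages_z = standardize(ages)

z_mean = statistics.fmean(ages_z)
z_std = statistics.pstdev(ages_z)

print(ages_z)
print(z_mean, z_std)
```

After the transformation the feature has a mean of 0 and a standard deviation of 1, up to floating-point rounding, while the ordering and relative spacing of the values stay the same.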

