Data Science Thoughts

Posts

Normalization and Standardization Use Case

September 14, 2021

Case study: We have a used cars dataset taken from a website. It contains information about used cars and can be used for many purposes, such as price prediction to exemplify the use of linear regression in machine learning. The columns in the dataset are: name, year, selling_price, km_driven, fuel, distance, seller_type, transmission, Owner. For the used motorcycle dataset, see https://www.kaggle.com/nehalbirla/motorcycle-dataset. Using the above features, we want to predict the selling price of a car. The features km_driven and distance are on different scales; if we load them into a model as they are, the predictions may go wrong because the slopes are misinterpreted. To overcome this, we scale these features down to values between 0 and 1 (below, df is assumed to be the DataFrame that holds the dataset):

    from sklearn.preprocessing import MinMaxScaler
    Minscaler = MinMaxScaler()
    # fit expects the data itself, not the column names
    scaler = Minscaler.fit(df[['distance', 'km_driven']])
    scaler.data_min_
    scaler...
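The scaling step described above can be sketched end to end as follows. This is only a minimal illustration: the four rows are made-up values, not taken from the actual used cars dataset, and the names df and scaled_df are chosen here for the example.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample rows; the real dataset is not reproduced here.
df = pd.DataFrame({
    "km_driven": [70000, 50000, 100000, 46000],
    "distance": [12.5, 30.0, 8.0, 21.0],
})

# Learn the per-column minimum and maximum, then rescale each column to [0, 1].
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df[["km_driven", "distance"]])
scaled_df = pd.DataFrame(scaled, columns=["km_driven", "distance"], index=df.index)

print(scaler.data_min_)  # smallest value seen in each column
print(scaler.data_max_)  # largest value seen in each column
print(scaled_df)
```

After scaling, both columns span the same range, so neither dominates the other purely because of its unit.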

Normalization and Standardization

September 12, 2021

For any use case, the most important thing is the data. You start by collecting data, and the collected data has many features. These include independent features and a dependent feature; in supervised machine learning we use the independent features to predict the dependent feature. Every feature has 2 important properties: 1. Unit 2. Magnitude. Consider features such as a person's age, height, weight, etc. For the age feature, the unit is the number of years and the magnitude is the value. For example, in 25 years, 25 is the magnitude and years is the unit. Each feature is measured with its own unit and magnitude, so when you have many features the computation mixes different units, and both unit and magnitude vary from one feature to another. So for the data we provide to the machine learning algorithm, it is very necessary that we should try t...

Normalization and Standardization (Train Test Split)

September 04, 2021

1. When should you use standardization and MinMaxScaler? In most scenarios, whenever we use a machine learning algorithm that involves Euclidean distance or gradient descent, we need to scale the values down. With gradient descent, the cost curve is basically a parabola on which we look for the minimum point, and scaling helps us reach that point. For most such algorithms we use normalization. 2. Should we perform normalization and standardization before or after the train-test split of the dataset? First, we divide the complete dataset into train and test datasets. The train data is used to train the model; the test data is given to the model to check its accuracy before it is passed unseen data. If we apply normalization and standardization to the entire dataset before the train-test split, we will face issues with interpreting the available slopes, because the slopes are calculated on different units or scaling, so it's wrong interpreting the slopes of o...
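One common way to handle the second question is to split first, fit the scaler on the train data only, and reuse those statistics on the test data. The sketch below assumes randomly generated features standing in for km_driven and distance; the variable names and the 80/20 split are choices made for this example, not something taken from the post.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales, plus a made-up target.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(10_000, 150_000, size=100),  # stands in for km_driven
    rng.uniform(1, 50, size=100),            # stands in for distance
])
y = rng.uniform(50_000, 500_000, size=100)

# Split first, so the test rows play no part in the scaling statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean and std come from train only
X_test_scaled = scaler.transform(X_test)        # test reuses the train mean and std
```

Because the scaler never sees the test rows while fitting, the test set keeps behaving like unseen data.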

Covariance

July 22, 2021

Covariance is one of a family of statistical measures used to analyze the linear relationship between two variables: how do two variables behave as a pair? Terms such as covariance, correlation, and linear regression can be confusing because it is not obvious how they relate to each other. Linear regression is related to correlation, which in turn is related to covariance, so all of these measures look at the linear relationship between two variables. Covariance is a descriptive measure of the linear association between two variables that is very simple to interpret. 1. A positive value indicates a direct or increasing linear relationship. 2. A negative value indicates a decreasing relationship. The key concept here is direction, that is, the sign of the covariance, whether it is positive or negative. For example, take the 4 quadrants (I, II, III, IV): I - (+,+): both the x and y values are positive. II - (-,+): x is negative and y is positive. III - (-...
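The sign interpretation above can be checked with a small calculation. This sketch uses the sample covariance (dividing by n - 1) on made-up numbers; the function name sample_covariance and the data are illustrative only.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of (x - mean_x) * (y - mean_y), divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


xs = [1, 2, 3, 4, 5]
increasing = [2, 4, 6, 8, 10]   # y rises as x rises
decreasing = [10, 8, 6, 4, 2]   # y falls as x rises

cov_pos = sample_covariance(xs, increasing)
cov_neg = sample_covariance(xs, decreasing)
print(cov_pos)  # positive: increasing linear relationship
print(cov_neg)  # negative: decreasing linear relationship
```

Only the sign is easy to read here; the size of the value depends on the units of both variables.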

Feature Scaling

July 18, 2021

Why do we require feature scaling? Whenever we discuss feature scaling, we are talking about features. Suppose I have features such as height and weight, and based on them I want to predict Body Mass Index (BMI). Here height and weight are my independent features and BMI is my dependent feature. Every feature has two properties: 1. Magnitude 2. Unit. Magnitude is the value of the feature, and unit is the measurement, such as kg, cm, feet, etc. In the above example, height is in cm and weight is in kg. If we don't perform feature scaling and use the magnitudes with their given units, then for algorithms that work on distances the values vary over very different ranges, so model accuracy goes down. So we have to scale these features down with techniques such as normalization, which maps the values to between 0 and 1, or standardization. For example, in linear regression the coefficients are found with the help of gradient descent. I...

Map() vs Apply() vs ApplyMap() Functions

July 07, 2021

How do I apply a function to a pandas Series or DataFrame? There are 3 ways to apply a function. 1. Map() Map is a Series method. It lets you map the existing values of a Series to a different set of values. Say you need to create a dummy variable for the column sex (male, female), that is, translate female to 0 and male to 1. Example: Dataframe['sex'] = Dataframe.sex.map({'female':0,'male':1}) map assigns the numerical values in the dictionary to the respective strings. 2. Apply() Apply is both a Series method and a DataFrame method. Apply as a Series method: it applies a function to each element in a Series. Example: to calculate the length of each string in the Name column, Dataframe['Namelength'] = Dataframe.Name.apply(len) Here apply is used as a Series method, calculating the length of each element in the Series. Apply as a DataFrame method: it applies a function along either axis of a DataFrame. Example: Dataframe[: ,'name...
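The three approaches named in the title can be tried side by side on a small frame. This is a sketch on made-up data; the column names height and weight and the variable names are assumptions for the example. Note that DataFrame.applymap was renamed to DataFrame.map in pandas 2.1, so the sketch picks whichever one the installed version provides.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meena"],
    "sex": ["female", "male", "female"],
    "height": [160.5, 175.2, 168.9],
    "weight": [55.4, 72.8, 60.1],
})

# 1. Series.map: translate each value through a dictionary.
df["sex"] = df["sex"].map({"female": 0, "male": 1})

# 2a. Series.apply: call a function on each element of one column.
df["Namelength"] = df["Name"].apply(len)

# 2b. DataFrame.apply: call a function once per column (axis=0) or per row (axis=1).
numeric = df[["height", "weight"]]
column_range = numeric.apply(lambda col: col.max() - col.min(), axis=0)

# 3. Element-wise over a whole DataFrame (applymap, or map in pandas >= 2.1).
elementwise = numeric.map if hasattr(numeric, "map") else numeric.applymap
as_int = elementwise(int)

print(df)
print(column_range)
print(as_int)
```

In short, map and Series.apply work value by value on one column, DataFrame.apply works on whole rows or columns, and applymap works on every cell.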

Difference Between R square and Adjusted R square

July 02, 2021

In supervised machine learning problem statements we basically have two kinds of use cases: 1. Regression 2. Classification. For a regression use case, to check accuracy we usually use measures such as R square and adjusted R square. In this article we discuss the difference between R square and adjusted R square. 1. R square The R square formula is given by $R^2 = 1 - \frac{RSS}{TSS}$, where $R^2$ = coefficient of determination, RSS = sum of squares of residuals (errors), and TSS = total sum of squares. Here RSS is the sum of the squared differences between the actual points and the predicted points. If we had only the target variable, the best fit line would be the average of all its values, so TSS is the sum of the squared differences between the actual points and that average value. From the above formula we usually get a value between 0 and 1; the closer the value is to 1, the better the fit. Can we get an $R^2$ value less than 0? Yes, but only when your best fit line is worse than the average line. If RSS > TSS so ratio beco...
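The R square calculation, including the negative case, can be worked through on a few numbers. The sketch below uses made-up actual and predicted values. The adjusted R square line uses the standard textbook formula, 1 - (1 - R2)(n - 1)/(n - p - 1) with n samples and p features, which is assumed here because the excerpt is cut off before it gives its own version.

```python
def r_squared(actual, predicted):
    """R square = 1 - RSS / TSS."""
    mean_actual = sum(actual) / len(actual)
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    tss = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
    return 1 - rss / tss


def adjusted_r_squared(r2, n_samples, n_features):
    """Standard adjusted R square; penalizes adding features that do not help."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


actual = [3.0, 5.0, 7.0, 9.0, 11.0]
good_fit = [3.5, 4.5, 7.0, 9.5, 10.5]  # close to the actual values
bad_fit = [11.0, 9.0, 7.0, 5.0, 3.0]   # worse than predicting the average

r2_good = r_squared(actual, good_fit)
r2_bad = r_squared(actual, bad_fit)
adj_good = adjusted_r_squared(r2_good, n_samples=len(actual), n_features=1)

print(r2_good)   # close to 1
print(r2_bad)    # below 0, because RSS > TSS
print(adj_good)  # slightly lower than r2_good
```

For the bad fit, RSS is 160 while TSS is 40, so the ratio is 4 and R square drops to -3.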