Posts

Machine Learning Cross Validation

There are various stages in a machine learning pipeline: data collection, feature engineering, feature selection, model creation, and model deployment. Before model creation, remember that whenever we have a dataset, suppose 1000 records, we usually perform a train test split: the training set gets 70% of the data and the test set 30% (or 80% train and 20% test, depending on the size of the dataset). Our model uses the 70% only to train itself, and the remaining 30% is used to check the accuracy. When we do a train test split, both the 70% and the 30% are selected randomly, so the kind of data present in the test set may not be present in the training set, and because of this our model accuracy can go down. That is why, whenever we use a train test split, we usually set a random_state: it controls how the data points are shuffled, so when we take random_state=0 it will shuffle our data the same way every time and provide a reproducible accuracy...
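As a quick illustration, here is a minimal sketch of such a split with scikit-learn; the toy DataFrame and its column names are made up for the example:

import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for a 1000-record dataset; 'feature' and 'target' are hypothetical names
df = pd.DataFrame({'feature': range(1000),
                   'target': [i % 2 for i in range(1000)]})

X = df[['feature']]
y = df['target']

# 70/30 split; random_state=0 fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(len(X_train), len(X_test))  # 700 300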

Normalization and Standardization Use Case

Case study: We have a used-cars dataset from a website. This dataset contains information about used cars, and it can be used for many purposes, such as price prediction to exemplify the use of linear regression in machine learning. The columns in the given dataset are as follows: name, year, selling_price, km_driven, fuel, distance, seller_type, transmission, Owner. For a used-motorcycle dataset, please go to https://www.kaggle.com/nehalbirla/motorcycle-dataset. Using the above features we want to predict the selling price of the cars. The features km_driven and distance are on different scales, so if we load these features into a model as-is, the prediction may go wrong due to a wrong interpretation of the slopes. To overcome this we scale these features down to values between 0 and 1 (assuming the data is loaded into a DataFrame df):

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(df[['distance', 'km_driven']])
scaler.data_min_
scaler...
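A fuller runnable sketch of the same idea, using a few made-up rows in place of the real dataset:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up rows standing in for the used-cars data; column names follow the post
df = pd.DataFrame({'km_driven': [5000, 120000, 45000],
                   'distance': [12.0, 8.5, 10.2]})

scaler = MinMaxScaler()  # rescales each column to the [0, 1] range
scaled = scaler.fit_transform(df[['distance', 'km_driven']])

print(scaler.data_min_)  # per-column minimums learned during fit
print(scaled)            # both features now share the same 0-to-1 range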

Normalization and Standardization

Suppose you have a use case; the most important thing for the use case is data. Initially you collect the data, and that data has many features, which may be independent features and a dependent feature; with the help of the independent features we try to predict the dependent feature in supervised machine learning. Every feature has 2 important properties: 1. Unit 2. Magnitude. Let's take features like a person's age, height, weight, etc. If I consider the feature age, the unit is basically the number of years and the magnitude is the value. For example, if I say 25 years, then 25 is the magnitude and years is the unit. Each feature is measured with a unit and a magnitude, so if you have many features they get computed with different units, and the unit and magnitude vary between features. So it is very necessary that, before providing the data to a machine learning algorithm, we should try to...
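To make the unit/magnitude point concrete, here is a small sketch with made-up age, height, and weight values, showing how standardization puts features measured in different units onto one common, unitless scale:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values: age in years, height in cm, weight in kg
df = pd.DataFrame({'age': [25, 32, 47],
                   'height': [160, 175, 168],
                   'weight': [55, 80, 72]})

# StandardScaler subtracts each column's mean and divides by its standard deviation,
# so every feature ends up with mean 0 and unit variance, whatever its original unit
scaled = StandardScaler().fit_transform(df)
print(scaled)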

Normalization and Standardization (Train Test Split)

1. When should you use standardization and when MinMaxScaler? In most scenarios, whenever we use a machine learning algorithm that involves Euclidean distance or gradient descent (basically a parabolic cost curve where we search for the best minimal point), we need to scale down the values in order to reach that point; for most of these algorithms we use normalization. 2. Should we perform normalization and standardization before or after the train test split of the dataset? First we divide our complete dataset into train and test datasets. The train data is used to train our model; the test data is given to the model to check its accuracy before passing unseen data. If we perform normalization and standardization on the entire dataset before the train test split, we will face issues with interpreting the available slopes, because the slopes are calculated on different units or scalings, so it is wrong to interpret the slopes of o...
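A minimal sketch of the recommended order, with a hypothetical random feature matrix: split first, fit the scaler on the training set only, and reuse that fitted scaler on the test set:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))      # hypothetical features
y = rng.integers(0, 2, size=100)   # hypothetical labels

# 1. Split first, so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 2. Fit the scaler on the training data only; fitting on the full dataset
#    would leak test-set statistics into training
scaler = StandardScaler().fit(X_train)

# 3. Apply the same fitted transformation to both sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)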

Covariance

Covariance is one of a family of statistical measures used to analyze the linear relationship between two variables: how do two variables behave as a pair? Terms like covariance, correlation, linear regression, etc. can make it confusing how they all relate to each other. Here, linear regression is related to correlation, which in turn is related to covariance, so all of these measures work by looking at the linear relationship between two variables. Covariance is a descriptive measure of the linear association between two variables that is very simple to interpret: 1. A positive value indicates a direct, or increasing, linear relationship. 2. A negative value indicates a decreasing relationship. A key concept here is direction, that is, the sign of the covariance, whether it is positive or negative. For example, if we have 4 quadrants (I, II, III, IV): I - (+,+): both x and y values are positive. II - (-,+): x is negative and y is positive. III - (-,-)...
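For reference, the sample covariance is cov(x, y) = sum of (xi - x̄)(yi - ȳ) divided by (n - 1), and the sign interpretation is easy to check numerically. A small sketch with made-up values:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y_up = np.array([2, 4, 5, 8, 10])     # rises as x rises
y_down = np.array([10, 8, 6, 3, 1])   # falls as x rises

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y)
print(np.cov(x, y_up)[0, 1])    # positive: increasing linear relationship
print(np.cov(x, y_down)[0, 1])  # negative: decreasing linear relationship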

Feature Scaling

Why do we require feature scaling? Whenever we discuss feature scaling, we are talking about features. Let me consider that I have features like height and weight, and based on these I want to predict the Body Mass Index (BMI). Here height and weight are my independent features and BMI is my dependent feature. Every feature has two properties: 1. Magnitude 2. Unit. Magnitude is basically the value of the feature, and the unit is the measurement, like kg, cm, feet, etc. In the above example, height is in cm and weight is in kg. Suppose we don't perform feature scaling and we feed the raw magnitudes with their given units into an algorithm that works on distances: the feature with the larger values will dominate the distance, and model accuracy goes down when we use different unit scalings. So we have to scale these features down, with normalization (to values between 0 and 1) or standardization, using different techniques. For example, in linear regression the coefficients are basically found with the help of gradient descent. I...
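A short sketch of the distance problem, with made-up height and weight values (height in metres here to make the scale mismatch stark): in raw units the feature with the larger numbers dominates the Euclidean distance, and standardization removes that imbalance:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Height in metres and weight in kg for three people (made-up values)
X = np.array([[1.70, 70.0],
              [1.71, 90.0],    # ~same height, 20 kg heavier
              [1.90, 71.0]])   # 20 cm taller, ~same weight

# With raw units, weight (the larger numbers) dominates the distance
print(np.linalg.norm(X[0] - X[1]))   # ~20.0: the 20 kg gap
print(np.linalg.norm(X[0] - X[2]))   # ~1.0: the 20 cm gap barely registers

# After standardization both features contribute on a comparable scale
Xs = StandardScaler().fit_transform(X)
print(np.linalg.norm(Xs[0] - Xs[1]))  # ~2.18
print(np.linalg.norm(Xs[0] - Xs[2]))  # ~2.18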

Map() vs Apply() vs ApplyMap() Functions

How do I apply a function to a pandas Series or DataFrame? There are 3 ways to apply a function. 1. map(): map is a Series method. map allows you to map the existing values of a Series to a different set of values. Let's say you need to create a dummy variable for the column sex, containing Male and Female; that is, we need to translate male to 0 and female to 1. Example: Dataframe['sex'] = Dataframe.sex.map({'female': 0, 'male': 1}) map will assign, or map, the numerical values to the respective strings in the dictionary. 2. apply(): apply is both a Series method and a DataFrame method. apply as a Series method: it applies a function to each element in a Series. Example: if I want to calculate the length of each string in the Name column, Dataframe['Namelength'] = Dataframe.Name.apply(len) This is apply as a Series method, applied to each element in the Series to calculate its length. apply as a DataFrame method: it applies a function along either axis of a DataFrame. Example: Dataframe[:, 'name...
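A combined runnable sketch of all three methods, on a tiny made-up frame with Titanic-style columns:

import pandas as pd

df = pd.DataFrame({'sex': ['male', 'female', 'female'],
                   'Name': ['Braund', 'Heikkinen', 'Allen'],
                   'Age': [22.0, 26.0, 35.0],
                   'Fare': [7.25, 7.92, 8.05]})

# map(): Series method, translates existing values via a dictionary
df['sex'] = df.sex.map({'female': 0, 'male': 1})

# apply() as a Series method: runs a function on each element of one column
df['Namelength'] = df.Name.apply(len)

# apply() as a DataFrame method: runs a function along an axis (here down each column)
print(df[['Age', 'Fare']].apply(max, axis=0))

# applymap(): DataFrame method, runs a function on every individual element
print(df[['Age', 'Fare']].applymap(float))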