Posts

Hypothesis Testing - Statistics

1. What is Hypothesis Testing and when do we use it?

Hypothesis testing is a part of statistical analysis in which we test assumptions made about a population parameter. It is generally used when we want to compare:

- a single group with an external standard
- two or more groups with each other

A Parameter is a number that describes the data from the population, whereas a Statistic is a number that describes the data from a sample.

2. Terminology used

Null Hypothesis: The null hypothesis is a statistical theory that suggests no statistical significance exists between the populations.

Alternative Hypothesis: An alternative hypothesis suggests there is a significant difference between the population parameters; the difference could be greater or smaller. Basically, it is the contrast of the null hypothesis.

Note: H0 must always contain equality (=). Ha always contains a difference (≠, >, <). For example, if we...
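To make the null-versus-alternative setup concrete, here is a minimal sketch of a two-sided one-sample t-test using scipy; the sample values and the hypothesized mean of 25 are made up purely for illustration:

    # Minimal one-sample t-test sketch; sample and popmean are made-up values.
    from scipy import stats

    sample = [24.1, 25.3, 26.0, 23.8, 25.9, 24.7, 25.5]

    # H0: population mean = 25 (equality), Ha: population mean != 25
    t_stat, p_value = stats.ttest_1samp(sample, popmean=25)
    print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

    # If p_value falls below the chosen significance level (e.g. 0.05),
    # we reject H0 in favour of Ha.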

Z Score - Normal Distribution

-> We are going to have a deep discussion of the Z score, but right before that we need to understand what a normal distribution and a standard normal distribution are.

What is a distribution? A distribution in statistics is a function that shows the possible values of a variable and how often they occur. Variables such as age, height, and the weight of people each take many different values and so each have a distribution.

What is a normal distribution? The normal distribution is a distribution that is symmetric about the mean (the mean is nothing but the average of all the observations). Most of the observations in a normal distribution are clustered around the mean.

What is a standard normal distribution? The standard normal distribution is a normal distribution whose mean and standard deviation are scaled to 0 and 1 respectively. A Z score can only be calculated for observations that follow a normal distribution.

What is a Z score? A Z-score is a numerical measurement that describes a value's relationship to the mean...
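Since the Z score reappears throughout, here is a minimal sketch of the computation z = (x - mean) / standard deviation, using made-up height values:

    import numpy as np

    # Made-up sample of heights (cm), for illustration only.
    heights = np.array([160, 165, 170, 172, 168, 175, 180])

    # z-score: how many standard deviations a value lies from the mean.
    z_scores = (heights - heights.mean()) / heights.std()
    print(z_scores)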

Machine Learning Cross Validation

A machine learning use case moves through various pipeline stages: data collection, feature engineering, feature selection, model creation, and model deployment. Always remember what we do before model creation: whenever we have a dataset, suppose 1000 records, we usually perform a train-test split, so that the training set has 70% of the data and the test set has 30% (or 80% train and 20% test, depending on the size of the dataset). Our model uses the 70% of the data only to train itself, and we use the remaining 30% to check the accuracy. When we do a train-test split, the 70% and the 30% are both selected randomly, so the kind of data present in the test set may not be present in the train set, and because of this our model accuracy goes down. So whenever we use a train-test split we usually set a random state, which controls how the data points are randomly selected; when we take random_state=0 it will shuffle our data and provide accuracy...
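As a sketch of this step, the snippet below performs a 70/30 split with a fixed random_state and also shows K-fold cross validation, which averages accuracy over several splits so the result depends less on any one random split (the dataset and the LogisticRegression model are made up for illustration):

    import numpy as np
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.linear_model import LogisticRegression

    # Made-up dataset: 1000 records, 5 features, binary labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = rng.integers(0, 2, size=1000)

    # 70/30 split; random_state fixes the shuffle so it is reproducible.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))

    # 5-fold cross validation: train/validate on 5 different splits.
    scores = cross_val_score(LogisticRegression(), X, y, cv=5)
    print("cv accuracy:", scores.mean())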

Normalization and Standardization Use Case

Case study: We have a used-cars dataset from the website. This dataset contains information about used cars, and the data can be used for a lot of purposes, such as price prediction to exemplify the use of linear regression in Machine Learning. The columns in the given dataset are as follows: name, year, selling_price, km_driven, fuel, distance, seller_type, transmission, owner. For a used-motorcycle dataset please go to https://www.kaggle.com/nehalbirla/motorcycle-dataset. Using the above features we should predict the selling price of the cars. The features km_driven and distance are on different scales, so if we load these features into a model the prediction may go wrong due to a wrong interpretation of the slopes. To overcome this we will scale these features down to values between 0 and 1 (assuming the data has been loaded into a pandas DataFrame df):

    from sklearn.preprocessing import MinMaxScaler
    Minscaler = MinMaxScaler()
    scaler = Minscaler.fit(df[['distance', 'km_driven']])
    scaler.data_min_
    scaler...
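A fuller sketch of that scaling step, using fit_transform so the scaled columns can be fed back into the model (the file name cars.csv is an assumption; the columns come from the dataset described above):

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    # Assumed file name for the used-cars data described above.
    df = pd.read_csv('cars.csv')

    scaler = MinMaxScaler()

    # fit learns each column's min and max; transform maps values to [0, 1]:
    # x_scaled = (x - min) / (max - min)
    df[['distance', 'km_driven']] = scaler.fit_transform(
        df[['distance', 'km_driven']])

    print(scaler.data_min_, scaler.data_max_)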

Normalization and Standardization

Suppose you have a use case; the most important thing for the use case is data. Initially you will collect the data, and the collected data will have many features. Those features may contain independent features and a dependent feature, and with the help of the independent features we try to predict the dependent feature in supervised machine learning. When you consider these features, each has 2 important properties: 1. Unit 2. Magnitude. Let's take features like a person's age, height, weight, etc. If I consider the feature age, the unit is basically the number of years and the magnitude is basically the value. For example, if I say 25 years, then 25 is the magnitude and years is the unit. Each feature is measured with a unit and a magnitude, so if you have many features they will be computed with different units, and this unit and magnitude vary between different features. So it is very necessary that, for the machine learning algorithm, with the data we provide we should try to...
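One common way to put features with different units and magnitudes on a comparable footing is standardization, z = (x - mean) / std; below is a minimal sketch with made-up age (years) and height (cm) values:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Made-up values: age in years, height in cm - different units/magnitudes.
    X = np.array([[25, 170.0],
                  [32, 158.0],
                  [47, 181.0],
                  [51, 165.0]])

    # Per column: (x - mean) / std, so each feature becomes unit-free.
    X_scaled = StandardScaler().fit_transform(X)
    print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # ~0 and ~1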

Normalization and Standardization (Train Test Split)

1. When should you use standardization and the MinMaxScaler?

In most scenarios, whenever we use a machine learning algorithm that involves Euclidean distance or gradient descent (which basically means a parabolic curve where you find the best minimum point), we need to scale down the values in order to reach that point. For most of these algorithms we use normalization.

2. Should we perform normalization and standardization before the train-test split of the dataset, or after?

Firstly we divide our complete dataset into train and test datasets. Train data is used to train our model; test data is given to the model to test its accuracy before passing unseen data, as sketched below. If we perform normalization and standardization on our entire dataset before the train-test split, we will face issues with interpreting the available slopes, because the slopes are calculated on different units or scaling, so it is wrong to interpret the slopes of o...
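The usual pattern, sketched here with made-up data, is to split first, fit the scaler on the training set only, and then reuse the same transform on the test set, so the test data never influences the scaling parameters:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Made-up feature matrix and target, for illustration only.
    rng = np.random.default_rng(0)
    X = rng.normal(loc=50, scale=10, size=(100, 2))
    y = rng.normal(size=100)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # mean/std from train only
    X_test_scaled = scaler.transform(X_test)        # reuse train statistics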

Covariance

Covariance is one of a family of statistical measures used to analyze the linear relationship between two variables: how do two variables behave as a PAIR? Terminology like covariance, correlation, linear regression, etc. can be confusing because of how they all relate to each other. Linear regression is related to correlation, which is related to covariance, so all of these measures analyze the data by looking at the linear relationship between two variables. Covariance is a descriptive measure of the linear association between two variables that is very simple to interpret:

1. A positive value indicates a direct or increasing linear relationship.
2. A negative value indicates a decreasing relationship.

A key concept here is the direction, that is, the sign of the covariance, whether it is positive or negative. For example, if we have 4 quadrants (I, II, III, IV):

I - (+,+): represents that both the x and y values are positive.
II - (-,+): represents that x is negative and y is positive.
III - (-...
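As a sketch, the sample covariance is cov(x, y) = sum((x_i - x̄)(y_i - ȳ)) / (n - 1); the made-up values below show the manual formula agreeing with numpy:

    import numpy as np

    # Made-up paired observations.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 2.9, 3.2, 4.8, 5.1])

    # Manual sample covariance: products of deviations, divided by n - 1.
    cov_manual = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

    # np.cov returns the 2x2 covariance matrix; entry [0, 1] is cov(x, y).
    cov_np = np.cov(x, y)[0, 1]

    print(cov_manual, cov_np)  # positive -> increasing linear relationship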