Data Science Thoughts

Posts

Showing posts from July, 2021

Covariance

July 22, 2021

Covariance is one of a family of statistical measures used to analyze the linear relationship between two variables: how do two variables behave as a pair? Terms such as covariance, correlation, and linear regression can be confusing because it is not obvious how they relate to each other. Linear regression is related to correlation, which in turn is related to covariance, so all of these measures describe the linear relationship between two variables. Covariance is a descriptive measure of the linear association between two variables, and its sign is simple to interpret. 1. A positive value indicates a direct, or increasing, linear relationship. 2. A negative value indicates an inverse, or decreasing, linear relationship. The key concept here is direction, that is, whether the sign of the covariance is positive or negative. For example, consider the four quadrants (I, II, III, IV): I - (+,+): both the x and y values are positive. II - (-,+): x is negative and y is positive. III - (-...
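
The sign rule described in this excerpt can be checked with a small sketch. The data below is made up purely for illustration, and `sample_covariance` is a hypothetical helper name, not something from the original post. It uses the usual sample covariance (dividing by n - 1).

```python
import numpy as np

def sample_covariance(x, y):
    """Sample covariance: mean-centered products summed and divided by n - 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviation of each x from the mean of x
    dy = y - y.mean()  # deviation of each y from the mean of y
    return float((dx * dy).sum() / (len(x) - 1))

# Illustrative values only (assumed, not taken from the post).
x = [1, 2, 3, 4, 5]
y_increasing = [2, 4, 5, 4, 6]  # tends to rise as x rises
y_decreasing = [6, 4, 5, 4, 2]  # tends to fall as x rises

cov_up = sample_covariance(x, y_increasing)
cov_down = sample_covariance(x, y_decreasing)

print("increasing pair:", cov_up)
print("decreasing pair:", cov_down)
```

A positive result corresponds to points that mostly fall in the (+,+) and (-,-) quadrants once both variables are centered on their means; a negative result corresponds to the mixed-sign quadrants.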

Feature Scaling

July 18, 2021

Why is feature scaling required? Whenever we discuss feature scaling, we are talking about features. Suppose I have the features height and weight, and I want to use them to predict Body Mass Index (BMI). Here height and weight are my independent features and BMI is my dependent feature. Every feature has two properties: 1. Magnitude 2. Unit. Magnitude is simply the value of the feature, and the unit is the measurement it is expressed in, such as kg, cm, or feet. In the example above, height is in cm and weight is in kg. If we do not perform feature scaling and use the raw magnitudes with their given units, algorithms that rely on distances are dominated by the feature with the largest numeric range, so model accuracy goes down when the features are on different scales. We therefore rescale the features using techniques such as normalization, which maps values into the range 0 to 1, or standardization, which rescales them to zero mean and unit variance. For example, in linear regression the coefficients are often found with the help of gradient descent. I...
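
As a rough sketch of the two rescaling techniques named in this excerpt, the snippet below applies min-max normalization and standardization to made-up height and weight values. The function names and the data are assumptions for illustration only.

```python
import numpy as np

def min_max_scale(values):
    """Min-max normalization: rescale values into the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

def standardize(values):
    """Standardization: rescale values to zero mean and unit variance."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Illustrative values only (assumed, not taken from the post).
height_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight_kg = np.array([50.0, 60.0, 70.0, 80.0, 90.0])

height_scaled = min_max_scale(height_cm)
weight_std = standardize(weight_kg)

print("height, min-max scaled:", height_scaled)
print("weight, standardized:", weight_std)
```

After either transformation the two features no longer differ in scale just because one is measured in cm and the other in kg.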

Map() vs Apply() vs ApplyMap() Functions

July 07, 2021

How do I apply a function to a pandas Series or DataFrame? There are three ways to apply a function. 1. Map() Map is a Series method. It lets you map the existing values of a Series to a different set of values. Say you need to create a dummy variable for the column sex, with the values male and female; that is, we need to translate female to 0 and male to 1. Example: Dataframe['sex'] = Dataframe.sex.map({'female':0,'male':1}) Here map assigns the numerical value in the dictionary to each matching string. 2. Apply() Apply is both a Series method and a DataFrame method. Apply as a Series method: it applies a function to each element in a Series. Example: to calculate the length of each string in the Name column, Dataframe['Namelength'] = Dataframe.Name.apply(len) Here apply is used as a Series method and is applied to each element of the Series to calculate its length. Apply as a DataFrame method: it applies a function along either axis of a DataFrame. Example: Dataframe[: ,'name...
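
The calls described in this excerpt can be tried on a small frame. This is a minimal sketch: the DataFrame contents and the height_cm and weight_kg columns are invented for illustration, and the DataFrame-level example (a per-axis maximum) is an assumed stand-in, since the original example is cut off.

```python
import pandas as pd

# Illustrative data only (assumed, not taken from the post).
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chandra"],
    "sex": ["female", "male", "female"],
    "height_cm": [160, 175, 168],
    "weight_kg": [55, 80, 62],
})

# Series.map: translate each string to the number given in the dictionary.
df["sex_code"] = df["sex"].map({"female": 0, "male": 1})

# Series.apply: call a function on each element of the Series.
df["Namelength"] = df["Name"].apply(len)

# DataFrame.apply: call a function along an axis.
numeric = df[["height_cm", "weight_kg"]]
column_max = numeric.apply(lambda col: col.max(), axis=0)  # one result per column
row_max = numeric.apply(lambda row: row.max(), axis=1)     # one result per row

print(df)
print(column_max)
print(row_max)
```

With axis=0 the function receives each column in turn, and with axis=1 it receives each row.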

Difference Between R square and Adjusted R square

July 02, 2021

In supervised machine learning problems we basically have two kinds of use case: 1. Regression 2. Classification. For a regression use case, if we want to check accuracy we usually use measures such as R square and adjusted R square. In this article we will discuss the difference between R square and adjusted R square. 1. R square The R square formula is given by R^2 = 1 - RSS/TSS, where R^2 = coefficient of determination, RSS = sum of squares of residuals (errors), and TSS = total sum of squares. Here RSS is the sum of the squared differences between the actual and predicted points. If we have only the target variable, the best-fit line is simply the average of all its values, so TSS is the sum of the squared differences between the actual points and that average. The formula usually gives a value between 0 and 1, and the closer the value is to 1, the better the fit. Can we get an R^2 value less than 0? Yes, but only when your best-fit line is worse than the average line. If RSS > TSS so ratio beco...
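
To make the quantities in this excerpt concrete, here is a small sketch that computes R square as 1 - RSS/TSS. The function name `r_squared` and all of the numbers are assumptions chosen for illustration. The third set of predictions is deliberately worse than predicting the average, to show a value below 0.

```python
import numpy as np

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - RSS / TSS."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rss = np.sum((actual - predicted) ** 2)      # squared residuals
    tss = np.sum((actual - actual.mean()) ** 2)  # squared distance from the average
    return float(1.0 - rss / tss)

# Illustrative values only (assumed, not taken from the post).
actual = [3.0, 5.0, 7.0, 9.0]
good_fit = [3.5, 4.5, 7.5, 8.5]  # close to the actual points
mean_fit = [6.0, 6.0, 6.0, 6.0]  # always predicts the average
bad_fit = [9.0, 7.0, 5.0, 3.0]   # worse than predicting the average

r2_good = r_squared(actual, good_fit)
r2_mean = r_squared(actual, mean_fit)
r2_bad = r_squared(actual, bad_fit)

print("good fit:", r2_good)
print("mean fit:", r2_mean)
print("bad fit:", r2_bad)
```

Predicting the average for every point makes RSS equal to TSS, which gives exactly 0; anything worse than that pushes the value negative.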

Continuous vs Categorical Variable

July 02, 2021

What is a variable? A variable is something that needs to be measured. It is a recorded piece of information or a characteristic of a person, case, or unit in our study. For example, we might record the age of everyone in our sample; age changes, or varies, from person to person, so a variable is the opposite of a constant, which is always the same. For the sake of our discussion, a few examples are age, weight, BMI, and whether someone has a disease (yes or no). How we summarize and analyze the data depends on the type of variable we have. Types of variables: 1. Independent variable: also known as an experimental or predictor variable. An independent variable is the presumed cause or reason behind an outcome, and it can be manipulated. 2. Dependent variable: a dependent variable is something that depends on other factors. It is also known as an outcome variable. For example: Time spent studying causes a c...