Machine Learning Cross Validation
A typical machine learning pipeline has several stages: data collection, feature engineering, feature selection, model creation, and model deployment.
Before model creation, we usually perform a train-test split. Suppose I have 1,000 records: we might put 70% of the data in the training set and 30% in the test set (or 80% train and 20% test, depending on the size of the dataset). The model trains only on the 70%, and the remaining 30% is used to check its accuracy. Because both sets are selected randomly, the kinds of data present in the test set may not be present in the training set, and when that happens the model's accuracy goes down.
When we do a train-test split we usually pass a random state, which controls how the data is shuffled before splitting. With random state = 0 the data is shuffled one way and we might get 85% accuracy; change it to random state = 50 and the data is shuffled differently, giving, say, 89%, and so on. Because the accuracy fluctuates as we pick different random states, you cannot tell the business what your model's accuracy really is. To prevent this problem we have a concept called cross-validation.
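The fluctuation described above can be seen with a minimal pure-Python sketch (the helper function and the 70/30 ratio here are illustrative; in practice you would use scikit-learn's train_test_split with its random_state parameter):

```python
import random

def train_test_split_indices(n, test_frac, seed):
    """Shuffle record indices with a fixed seed, then split into train/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # the seed plays the role of random_state
    n_test = int(n * test_frac)
    return indices[n_test:], indices[:n_test]  # (train, test)

# Same 1,000 records, two different seeds -> two different test sets,
# which is why the measured accuracy fluctuates from one seed to another.
train_a, test_a = train_test_split_indices(1000, 0.3, seed=0)
train_b, test_b = train_test_split_indices(1000, 0.3, seed=50)
print(len(train_a), len(test_a))  # 700 300
print(test_a[:5])                 # different seeds select different records
print(test_b[:5])
```

Every record still ends up in exactly one of the two sets; only *which* records land in the test set changes with the seed, and with them the accuracy you measure.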
CROSS-VALIDATION
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.
Types of cross-validation
1. Leave-One-Out Cross-Validation (LOOCV)
Experiment 1: Suppose I have 1,000 records. I take one record as my test set and use all the remaining records as my training set.
Experiment 2: The next record becomes the test set, the rest are the training set, and so on.
So if we hold out 1 record at a time from 1,000 records, we have to perform 1,000 iterations. Consider how much computation that needs: this is the main disadvantage of LOOCV. It does lead to low bias, though the accuracy estimate it produces can have high variance.
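The iteration scheme above can be sketched in a few lines of plain Python (illustrative only; scikit-learn ships this as sklearn.model_selection.LeaveOneOut):

```python
def leave_one_out_splits(n):
    """Yield (train_indices, test_index): one record is held out per iteration."""
    for i in range(n):
        train = list(range(0, i)) + list(range(i + 1, n))
        yield train, i

splits = list(leave_one_out_splits(1000))
print(len(splits))  # 1000 experiments -> the computational cost of LOOCV
train, test = splits[0]
print(len(train), test)  # 999 training records; record 0 is held out
```

One model fit per record is exactly why LOOCV becomes impractical as the dataset grows.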
2. K-Fold Cross-Validation
Suppose I have 1,000 records and we select k = 5, meaning we perform 5 experiments.
For each experiment the test set has 1000 / 5 = 200 records.
Experiment 1: the first 200 records are the test set and the remaining 800 are the training set.
Experiment 2: the next 200 records are the test set and the remaining 200 + 600 around them are the training set.
Experiment 3: the next 200 records are the test set and the remaining 400 + 400 around them are the training set.
This continues until the kth experiment.
Now we have 5 accuracy scores. We take their mean, and that is the accuracy you can report to the business after applying k-fold cross-validation. You can also say the minimum accuracy is the minimum of the cross-validation scores and the maximum accuracy is their maximum.
K-fold CV has a disadvantage, though.
Suppose the first 200 records we take as the test set contain only one class of data, while the remaining 800 contain another class. The split is now imbalanced, and the model will not report the right kind of accuracy. To solve this, we have another technique called stratified cross-validation.
3. Stratified Cross-Validation
In stratified cross-validation, everything works as in k-fold CV: with k = 5 we run 5 experiments, each with its own train and test sets.
The difference is that whenever the train and test sets are selected, stratification makes sure the number of instances of each class is distributed in a proper proportion between them for every experiment.
Suppose I have 1,000 records: 600 of one class and 400 of another, so I consider it an imbalanced dataset.
In this case, stratified cross-validation makes sure that the training set has a good proportion of both classes, and the test set does too, so your model can give a reliable accuracy.
Stratified cross-validation is mainly used to prevent the disadvantage of plain k-fold CV: in each experiment it guarantees that the samples in both the train and test sets contain at least some instances of every class.
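A minimal sketch of the stratification idea, using the 600/400 example above (a round-robin assignment per class; scikit-learn's sklearn.model_selection.StratifiedKFold is the production implementation):

```python
from collections import Counter

def stratified_k_fold(labels, k):
    """Build k folds so each fold keeps roughly the overall class proportions."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    # Deal each class's records round-robin across the folds.
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for exp in range(k):
        test = folds[exp]
        train = [i for f, fold in enumerate(folds) if f != exp for i in fold]
        yield train, test

# 1,000 records: 600 of class "A" and 400 of class "B" (the imbalanced example).
labels = ["A"] * 600 + ["B"] * 400
for train, test in stratified_k_fold(labels, 5):
    counts = Counter(labels[i] for i in test)
    print(counts)  # every test fold keeps the 600:400 ratio, i.e. 120 A / 80 B
```

Each 200-record test fold preserves the 60/40 class ratio, so no fold ends up containing only one class.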
4. Time-Series Cross-Validation
Time-series CV works differently because it targets time-ordered data such as stock prices. Since we are predicting future data, we cannot use a random train-test split to measure accuracy.
Suppose we have data for Day 1 -> Day 2 -> Day 3 -> Day 4 and want to predict Day 5 -> Day 6. To predict Day 5, I have to depend on the Day 1 -> Day 4 data.
Similarly, I predict Day 6 from the Day 2 -> Day 5 data, Day 7 from the Day 3 -> Day 6 data, and so on.
This is how time-series cross-validation works and measures accuracy on a time-ordered dataset.
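The rolling-window scheme above can be sketched like this (a four-day window is assumed, matching the Day 1 -> Day 4 example; scikit-learn offers a related expanding-window splitter as sklearn.model_selection.TimeSeriesSplit):

```python
def time_series_splits(n_days, window):
    """Rolling-window splits: train on `window` past days, test on the next day."""
    for start in range(n_days - window):
        train = list(range(start, start + window))
        test = start + window  # the day being predicted
        yield train, test

# Seven days of data, four-day window (days are 0-indexed here):
# days 0-3 predict day 4, days 1-4 predict day 5, days 2-5 predict day 6.
for train, test in time_series_splits(7, 4):
    print(train, "->", test)
```

Unlike the earlier schemes, the training set always comes strictly *before* the test point, so the model is never evaluated on a day it could not have seen in order.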