This blog is part 5 of our machine learning interview questions and answers series. The previous blog article can be found here.
What is the need for cross validation in machine learning modeling?
Cross validation is used to make better use of the input data and to assess model performance reliably. The input data is divided into several subsets, and the machine learning model is trained on all subsets except one, which is held out as a test set. After training, the held-out subset is used to evaluate the model. In the next iteration, a different subset is held out for evaluation and the remaining subsets are used for training. Cross validation helps us understand whether the model learns the correct patterns from the data, and thereby detects overfitting.
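As an illustration, here is a minimal sketch using scikit-learn's `cross_val_score` with a placeholder model and data set:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross validation: the data is split into 5 subsets; in each
# iteration 4 subsets train the model and the held-out subset evaluates it.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```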
What is the significance of stratified sampling in classification?
Stratified sampling is a topic that often appears in machine learning interview questions.
If the data set is not large enough, random sampling can introduce bias due to sampling error. The strata in stratified sampling are homogeneous subgroups of the population. If we divide the population into homogeneous subgroups and then sample randomly within each subgroup, the sampling error is reduced. Stratified sampling preserves the percentage of each stratum, so the test set remains representative of the actual population. In stratified sampling, every element of the population belongs to exactly one subgroup, and no data point is repeated in the sampling process. Because each stratum gets proper representation in the sample, we can estimate model parameters more accurately than with simple random sampling.
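A minimal sketch with scikit-learn, where passing `stratify=y` to `train_test_split` preserves the class proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 samples of class 0, 10 samples of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 90/10 class ratio in both the train and test splits,
# so the test set stays representative of the population.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("test class counts:", np.bincount(y_test))  # -> [18 2]
```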
Why do you perform pruning in the decision tree?
Pruning is performed to reduce the size of a decision tree. It removes the branches that are redundant or error prone for classification, and in doing so it acts as a regularizer: the pruned tree generalizes better because it is less complex. Hence, pruning helps to reduce overfitting, and by reducing the complexity of the model it can improve prediction accuracy on unseen data.
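As one concrete way to prune, scikit-learn's decision trees expose cost-complexity pruning through the `ccp_alpha` parameter; a minimal sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree grows until its leaves are pure and tends to overfit.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ccp_alpha > 0 applies cost-complexity (weakest-link) pruning: subtrees
# whose complexity is not justified by their impurity reduction are removed.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print("unpruned leaves:", full_tree.get_n_leaves(), "accuracy:", full_tree.score(X_test, y_test))
print("pruned leaves:  ", pruned_tree.get_n_leaves(), "accuracy:", pruned_tree.score(X_test, y_test))
```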
Explain the Chi-square test
The Chi-square test is a statistical procedure used to measure the difference between observed and expected results. The test determines whether the disparity between the observed and expected outputs is due to chance or to an underlying relationship among the variables. Hence, the chi-square test can also be used to test for a relationship between two categorical variables.
The formula for the chi-square test is

${\chi_c}^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$
Here, $c$ is the degrees of freedom, $E$ is the expected value and $O$ is the observed value. The chi-square test can tell us whether our data follows a well-defined probability distribution, such as a Poisson or normal distribution.
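For example, a goodness-of-fit check with SciPy's `chisquare` (the observed counts below are made up for illustration):

```python
from scipy.stats import chisquare

# Observed die rolls over 120 throws vs. the expected uniform counts
observed = [25, 18, 22, 16, 20, 19]
expected = [20, 20, 20, 20, 20, 20]

# chisquare computes sum((O_i - E_i)^2 / E_i) and the corresponding p-value
stat, p_value = chisquare(f_obs=observed, f_exp=expected)

print(f"chi-square statistic = {stat:.3f}, p-value = {p_value:.3f}")
# A large p-value means the disparity is plausibly due to chance alone.
```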
Detail the concept behind $L_1$ and $L_2$ regularization
Regularization penalizes model complexity while the loss function is minimized, which helps prevent overfitting. $L_1$ regularization and $L_2$ regularization are the two best-known regularization methods. In $L_1$ regularization, the absolute magnitude of the coefficients is added as a penalty term to the loss function. $L_1$ regularization drives many coefficients to exactly zero, producing sparse models, and corresponds to placing a Laplacian prior on the weights.
In $L_2$ regularization, the squared magnitude of the coefficients is added as a penalty to the loss function. The $L_2$ regularizer tries to distribute the error among all the terms, shrinking coefficients without zeroing them, and corresponds to a Gaussian prior on the weights.
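A short sketch using scikit-learn's `Lasso` ($L_1$) and `Ridge` ($L_2$) on synthetic data to show the sparsity difference:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression data where only 5 of 20 features are informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# L1 penalty (sum of |w|) drives many coefficients exactly to zero
lasso = Lasso(alpha=1.0).fit(X, y)
# L2 penalty (sum of w^2) shrinks all coefficients but rarely zeroes them
ridge = Ridge(alpha=1.0).fit(X, y)

print("zero coefficients with L1:", np.sum(lasso.coef_ == 0))
print("zero coefficients with L2:", np.sum(ridge.coef_ == 0))
```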
Differentiate between ranking SVM and SVR
The ranking SVM is a variant of the support vector machine (SVM) used to learn rankings. It adopts a pairwise ranking strategy to sort results based on their relevance to a query. A mapping function projects each query-result pair into a feature space that captures how well the result matches the query, and results are then ranked based on that match. For instance, to learn how relevant a clicked web page is to a specific query, the mapping function projects the pair of search query and clicked page into the feature space; these features, combined with the corresponding click data, are used to train the ranking SVM. A sketch of the pairwise idea follows below.
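A minimal sketch of the pairwise transform behind ranking SVM (not the full RankSVM optimization): within each query group, documents are turned into difference vectors labelled by which document should rank higher, and a linear SVM is trained on those pairs. The data here is synthetic.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))             # feature vectors for 30 documents
relevance = rng.integers(0, 3, size=30)  # graded relevance label per document
query = np.repeat([0, 1, 2], 10)         # which query each document belongs to

pairs, labels = [], []
for q in np.unique(query):
    idx = np.where(query == q)[0]
    for i, j in combinations(idx, 2):
        if relevance[i] == relevance[j]:
            continue  # ties give no ordering information
        # Pairwise transform: label the difference vector x_i - x_j by
        # the sign of the relevance difference
        pairs.append(X[i] - X[j])
        labels.append(1 if relevance[i] > relevance[j] else -1)

svm = LinearSVC().fit(np.array(pairs), np.array(labels))
# The learned weights give a scoring function: a higher score means a higher rank
scores = X @ svm.coef_.ravel()
```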
SVR stands for Support Vector Regression. It is quite different from ranking SVM: it is a regression method that predicts a real-valued output using support vector concepts, ignoring errors that fall within an $\epsilon$-tube around the prediction.
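For contrast, a quick SVR sketch with scikit-learn, where the `epsilon` parameter defines the tube within which errors are ignored:

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine wave as a simple regression target
X = np.linspace(0, 5, 100).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.default_rng(0).normal(size=100)

# RBF-kernel SVR: predictions are real numbers, and points inside the
# epsilon tube around the fit contribute no loss
model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print(model.predict([[2.5]]))  # a real-valued prediction near sin(2.5)
```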
List the advantages of using Naive Bayes for classification purposes
- When the conditional independence assumption of Naive Bayes holds, it converges faster than discriminative models such as logistic regression
- The Naive Bayes classifier requires very little training data
- Suitable for binary classification tasks as well as multi-class classification tasks
- The Naive Bayes classifier can handle both continuous and binary data
- Easy and fast to implement
- The Naive Bayes classifier is not sensitive to irrelevant features
Gaussian Naive Bayes is the same as binary Naive Bayes! Do you agree with this statement? Explain your reasoning
No, they are not the same. They are two algorithms that handle different types of data distributions using the Naive Bayes model.
Gaussian Naive Bayes assumes the data is normally distributed and is therefore used when all the features are continuous. Binary (Bernoulli) Naive Bayes, on the other hand, assumes each feature is binary: for example, 0 meaning a word is absent from a document and 1 meaning the word is present.
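To make the distinction concrete, a small sketch with scikit-learn's `GaussianNB` and `BernoulliNB` on synthetic data:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

# Continuous features: GaussianNB models each feature as a normal
# distribution per class.
X_cont = rng.normal(loc=y[:, None], scale=1.0, size=(200, 3))
print(GaussianNB().fit(X_cont, y).score(X_cont, y))

# Binary features (e.g. word present/absent in a document): BernoulliNB
# models each feature as a per-class coin flip.
X_bin = (rng.random((200, 3)) < (0.3 + 0.4 * y[:, None])).astype(int)
print(BernoulliNB().fit(X_bin, y).score(X_bin, y))
```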
Explain selection bias in machine learning
Selection bias is not only one of the important machine learning interview questions but also important from an engineering perspective.
Selection bias occurs when the sample data set is not representative of its real-world distribution. It arises from a gap in the data collection process or from improper randomization, which results in sample data that does not reflect the actual data distribution.
Explain the difference between entropy and information gain in data science
Entropy is the randomness or uncertainty associated with data: the more random the data, the higher its entropy. For a set with class proportions $p_i$, the entropy is $H = -\sum_i p_i \log_2 p_i$.
Information gain is used in algorithms such as decision trees and random forests to find the best split of the data. The information gain of a split is calculated by subtracting the weighted entropy of each branch from the entropy of the entire data set.
The split with the highest information gain is the best split, and the branches it produces have the lowest weighted entropy.
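A worked sketch of the calculation in plain NumPy, using made-up labels:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy: -sum(p * log2(p)) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # 5 of each class
left   = np.array([0, 0, 0, 0, 1])                 # one branch of a split
right  = np.array([0, 1, 1, 1, 1])                 # the other branch

# Information gain = parent entropy - weighted entropy of the branches
weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
gain = entropy(parent) - weighted
print(f"information gain = {gain:.3f}")
```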