This article is part of a series of discussions on interview questions frequently asked for machine learning job roles.
Explain the concept of a confusion matrix in machine learning.
A confusion matrix is a table that summarizes the prediction results of a classification problem.
Measuring accuracy alone can be misleading when the classification task involves more than two classes or when the data set contains an unequal number of observations per class. In such scenarios, the confusion matrix is a better choice: it is an error matrix illustrating the performance of a classification task on test data whose true values are known. The confusion matrix reports the value counts of each outcome class, namely true positives, true negatives, false positives and false negatives.
In binary classification, the following terms are needed to understand the confusion matrix.
True Positive (TP): These are observations that are actually positive and are also predicted positive.
True Negative (TN): These are observations that are actually negative and are also predicted negative.
False Positive (FP): These are observations that are actually negative but are predicted positive.
False Negative (FN): These are observations that are actually positive but are predicted negative.
Accuracy: Accuracy is the ratio of correct predictions to the total number of predictions.
$$Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$$
Sensitivity: Sensitivity is also called recall or true positive rate (TPR). As the name suggests, it is the ratio of true positive instances to the total number of actual positive instances.
$$Sensitivity = \frac{TP}{TP+FN}$$
Specificity: Specificity, or true negative rate (TNR), is the ratio of true negative instances to the total number of actual negative instances.
$$Specificity = \frac{TN}{TN+FP}$$
Precision: Precision is the ratio of true positive instances to the total number of predicted positive instances.
$$Precision= \frac{TP}{TP+FP}$$
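As a minimal sketch, assuming scikit-learn is available and using a small set of made-up binary labels, the confusion matrix counts and the metrics above can be computed as follows:

```python
# Minimal sketch with made-up labels, only to illustrate the definitions above.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # actual classes (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # model predictions (hypothetical)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)               # recall / true positive rate
specificity = tn / (tn + fp)               # true negative rate
precision   = tp / (tp + fp)

print(tn, fp, fn, tp, accuracy, sensitivity, specificity, precision)
```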
What is F1 score and how is it related to precision and sensitivity?
The F1 score is an evaluation metric that combines precision and sensitivity (recall) into a single measure of model performance: it is their harmonic mean.
$$F1\ score = 2 \times \frac{precision \times sensitivity}{precision + sensitivity}$$
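Continuing with the same made-up labels, a quick sketch (again assuming scikit-learn) shows that the F1 score is the harmonic mean of precision and sensitivity:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # hypothetical labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # hypothetical predictions

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)          # sensitivity
print(2 * p * r / (p + r))                # harmonic mean, computed by hand
print(f1_score(y_true, y_pred))           # same value from scikit-learn
```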
Explain the ROC curve and AUC score.
The receiver operating characteristic (ROC) curve is a graph that plots the true positive rate (TPR) against the false positive rate (FPR), thereby displaying the performance of a classification model across classification thresholds.
Lowering the classification threshold leads to more items being classified as positive. The area under the curve (AUC) can be used to summarize the points of the ROC curve: it measures the two-dimensional area under the curve and therefore captures the aggregate performance across all classification thresholds. In other words, AUC evaluates a model irrespective of the particular classification threshold chosen.
The AUC score ranges from 0 to 1: a model whose predictions are completely wrong scores 0, while a model whose predictions are 100% correct scores 1.
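A short sketch of how these quantities might be computed with scikit-learn, using hypothetical labels and predicted probabilities:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1, 0, 1]                    # hypothetical labels
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]   # hypothetical predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)     # points of the ROC curve
print(roc_auc_score(y_true, y_scores))                 # area under that curve
```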
What is meant by the “Curse of dimensionality” in machine learning?
The computational effort required increases with the dimensionality of the data. When the number of input features is very large, it becomes difficult to optimize the problem with grid search or brute force. More features mean higher dimensions, and in a high-dimensional problem every observation appears roughly equidistant from every other observation, so clustering the observations becomes computationally harder.
In theory, a greater number of features and the associated increase in dimension should add more information to the data and improve its quality. In practice, however, increasing the dimensionality raises the noise level and adds the burden of redundancy during data analysis. This difficulty caused by dimensionality, despite its expected benefit, is called the curse of dimensionality.
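A small illustrative sketch (assuming NumPy and SciPy) of the "everything looks equidistant" effect: as the dimension grows, the ratio between the farthest and nearest pairwise distances among random points shrinks toward 1.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# As the dimension grows, the ratio of the farthest to the nearest pairwise
# distance shrinks toward 1, i.e. the points look increasingly equidistant.
for dim in (2, 10, 100, 1000):
    points = rng.random((200, dim))           # 200 random observations
    distances = pdist(points)                 # all pairwise Euclidean distances
    print(dim, round(distances.max() / distances.min(), 2))
```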
What is principal component analysis?
Principal Component Analysis (PCA) is a dimensionality reduction technique that belongs to the class of unsupervised learning algorithms in machine learning. PCA is a statistical method that applies an orthogonal transformation to extract linearly uncorrelated features from correlated observations. The transformed features are called the principal components. The advantage of PCA is that it can retain most of the information of the high-dimensional data in the low-dimensional space after the transformation. PCA also helps to find hidden patterns in high-dimensional data. Some real-world applications of PCA are movie recommendation systems and image processing.
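A minimal sketch of PCA with scikit-learn on made-up correlated data, keeping the top two principal components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical correlated data: 5 features, three of them nearly redundant.
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(100, 3))])

pca = PCA(n_components=2)                 # keep the top 2 principal components
X_reduced = pca.fit_transform(X)          # orthogonal transformation to the components

print(X_reduced.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)      # share of variance retained by each component
```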
Explain the difference between normalization and regularization
The key difference between normalization and regularization is that normalization is performed during the data preprocessing stage, whereas regularization is applied while fitting the model in order to improve the prediction function.
If the columns of our data are on very different scales, we have to rescale them so that they have comparable basic statistics such as mean and standard deviation. This process of transforming each column to obtain compatible basic statistics, which helps the model maintain accuracy, is called normalization.
Regularization constrains the model by adding a penalty term to the loss function. It controls the prediction function by favouring simpler fitting functions over complex ones. Regularization has no effect on the data itself or on its distribution.
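A brief sketch contrasting the two, assuming scikit-learn: normalization rescales the columns during preprocessing, while regularization adds a penalty (here an L2 penalty via Ridge) while fitting the model.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(0, 1, (50, 1)), rng.normal(0, 1000, (50, 1))])  # very different scales
y = X @ np.array([2.0, 0.003]) + rng.normal(0, 0.1, 50)

# Normalization: a preprocessing step that rescales each column to mean 0, std 1.
X_scaled = StandardScaler().fit_transform(X)

# Regularization: a penalty added to the loss while fitting (L2 penalty, strength alpha).
model = Ridge(alpha=1.0).fit(X_scaled, y)
print(model.coef_)
```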
Differentiate between classification and regression
Classification and regression are supervised learning tasks in machine learning. The table below displays the key differences.
| Classification | Regression |
| --- | --- |
| Classification deals with finding a function that can classify data points into different classes | Regression deals with finding the correlation between the dependent and independent variables |
| Classification algorithms try to find mapping functions that map the input to a discrete output | Regression algorithms try to find mapping functions that map the input to a continuous output |
| The predicted values are unordered in classification | The predicted values are ordered in regression |
| Examples of classification algorithms are K nearest neighbors, decision trees, etc. | Examples of regression algorithms are linear regression, random forest, etc. |
| In real-world scenarios, classification algorithms are used to identify cancer cells, filter spam email, etc. | In real-world scenarios, regression algorithms are used for weather prediction, real estate price prediction, etc. |
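A small sketch of the discrete-versus-continuous output difference, using hypothetical data generated with scikit-learn utilities:

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression

# Classification: the predicted values are discrete class labels.
Xc, yc = make_classification(n_samples=100, n_features=4, random_state=0)
clf = KNeighborsClassifier().fit(Xc, yc)
print(clf.predict(Xc[:5]))        # e.g. [0 1 1 0 1]

# Regression: the predicted values are continuous numbers.
Xr, yr = make_regression(n_samples=100, n_features=4, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print(reg.predict(Xr[:5]))        # continuous outputs
```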
If you encounter target imbalance in your data, how would you fix it?
Target imbalance occurs when the target variable is categorical and, on taking the frequency count, some categories are found to be significantly more numerous than others. For instance, if the target column has 10% 0s, 20% 1s and 70% 2s, the 2s are in the majority. We can fix this by resampling the observations, up-sampling the minority classes or down-sampling the majority class, and then using more sophisticated algorithms such as support vector machines or gradient boosting.
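One possible sketch of up-sampling the minority classes, assuming scikit-learn's resample utility and a made-up imbalanced target:

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced target: 70% of class 2, 20% of class 1, 10% of class 0.
y = np.array([2] * 70 + [1] * 20 + [0] * 10)
X = np.random.default_rng(0).normal(size=(100, 3))

majority = y == 2
X_parts, y_parts = [X[majority]], [y[majority]]
for label in (0, 1):
    mask = y == label
    X_up, y_up = resample(X[mask], y[mask], replace=True,
                          n_samples=majority.sum(), random_state=0)  # up-sample minority class
    X_parts.append(X_up)
    y_parts.append(y_up)

X_balanced = np.vstack(X_parts)
y_balanced = np.concatenate(y_parts)
print(np.bincount(y_balanced))    # now 70 observations of each class
```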
What is linear regression?
Linear regression is a supervised machine learning method in which the algorithm finds the best linear relationship between the dependent and independent variables. Simple linear regression and multiple linear regression are the two subclasses of linear regression.
Simple linear regression
In a simple linear regression model, only one independent variable is present, and the algorithm has to find its linear relationship with the dependent variable.
Mathematically,
$$Y= L_1X+L_0$$
Here,
$$X$$ – independent variable, $$Y$$ – dependent variable, $$L_1$$ – linear regression coefficient (slope), and $$L_0$$ – intercept of the line
Multiple linear regression
If more than one independent variable is required to find the value of the dependent variable, the linear regression model is called multiple linear regression.
Mathematically,
$$Y = L_1X_1 + L_2X_2 + \dots + L_nX_n + L_0$$
Here $$L_1, L_2, \dots, L_n$$ are the slopes or linear regression coefficients of the independent variables $$X_1, X_2, \dots, X_n$$, and $$L_0$$ is the intercept.
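A minimal sketch, assuming scikit-learn and made-up data generated from known coefficients, showing how the fitted model recovers $$L_1, L_2$$ and the intercept $$L_0$$:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data generated from Y = 3*X1 + 2*X2 + 5 plus a little noise.
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + 5 + rng.normal(0, 0.1, 100)

model = LinearRegression().fit(X, y)
print(model.coef_)        # estimates of L1, L2 (close to [3, 2])
print(model.intercept_)   # estimate of L0 (close to 5)
```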
What are the most popular distribution curves used in machine learning algorithms? Give a brief scenario-based description
The most popular distribution curves used in machine learning are Uniform distribution, Bernoulli distribution, Binomial distribution, Normal distribution, Poisson distribution, Exponential distribution, Gamma distribution and Weibull distribution.
Uniform distribution
The uniform distribution represents events in which the probability of every outcome remains constant. A coin toss is an example of a uniform distribution.
Bernoulli distribution
The Bernoulli distribution describes a probabilistic event that occurs only once and has two possible outcomes. Problems such as whether you will pass or fail a particular exam, or whether a team will win a championship or not, can be represented with a Bernoulli distribution.
Binomial distribution
The binomial distribution generalizes the Bernoulli distribution: it represents the number of successes in a fixed number of repeated Bernoulli trials, with the Bernoulli distribution being the special case of a single trial. A repeated coin toss can be represented with a binomial distribution.
Normal distribution
The normal distribution is also called the Gaussian distribution. The probability of events is distributed symmetrically around a central (average) tendency such as the mean or median; values close to the centre are more likely to occur than values far from it. The weight or height of people in a country can be represented by a normal distribution.
Poisson distribution
The Poisson distribution describes the number of events occurring in a particular time frame. The number of bus services between two places in 30 minutes, or in any selected fixed time interval, can be represented with a Poisson distribution.
Exponential distribution
The exponential distribution is closely related to the Poisson distribution: while the Poisson distribution deals with the number of events in a particular time interval, the exponential distribution describes the waiting time between two events. The time interval between two rechargings of a battery can be represented with an exponential distribution.
Gamma distribution
The gamma distribution describes the waiting time until a fixed number of events has occurred. It generalizes the exponential distribution, which is the single-event case, and typically has a skewed shape. For instance, imagine a show whose organizers need at least 20 people in the gallery before it can begin, and it is observed that 4 people arrive every 5 minutes. The waiting time until the show can begin can be represented with a gamma distribution.
Weibull distribution
The Weibull distribution is another generalization of the exponential distribution: it describes the waiting time for a single event when the event rate is allowed to change over time. Analysis of the lifetime of a radioactive isotope can be represented with a Weibull distribution.
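For a hands-on feel, a short sketch drawing samples from several of these distributions with NumPy's random generator (the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Drawing samples from several of the distributions above via NumPy's random API.
uniform     = rng.uniform(0, 1, size=1000)           # constant probability over [0, 1)
bernoulli   = rng.binomial(n=1, p=0.5, size=1000)    # single trial, two outcomes
binomial    = rng.binomial(n=10, p=0.5, size=1000)   # successes in 10 repeated trials
normal      = rng.normal(loc=0, scale=1, size=1000)  # symmetric around the mean
poisson     = rng.poisson(lam=4, size=1000)          # event counts per interval
exponential = rng.exponential(scale=1.0, size=1000)  # waiting time between events
gamma       = rng.gamma(shape=20, scale=1.25, size=1000)  # waiting time for 20 events
weibull     = rng.weibull(a=1.5, size=1000)          # waiting time with a varying event rate

print(normal.mean(), poisson.mean(), gamma.mean())
```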
The next part in this blog series is available here.