This is another addition to the series of machine learning (ML) interview questions and answers that have been published on this website.

## What are the sigmoid and softmax functions? Explain the difference

Sigmoid and softmax are functions used for classification tasks in machine learning and deep learning. Both take real-valued input and produce output between 0 and 1 that can be interpreted as a probability.

*Fig.1. Sigmoid function (source: wikipedia)*

The key difference between sigmoid and softmax functions is that the sigmoid function is used for binary classification whereas the softmax function is used for multiclass classification.

The sigmoid function takes a single real number as input and outputs a single probability that the input belongs to the positive class. The softmax function, on the other hand, takes a vector with one component per class in the dataset and outputs another vector of the same length, in which each component is the probability of the corresponding class and the components sum to 1.
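As a minimal sketch of the two functions, using only the Python standard library (the function names and the example scores are illustrative choices, not part of any particular framework):

```python
import math

def sigmoid(z):
    """Map a single real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    # Subtracting the maximum score avoids overflow in exp() and does not
    # change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))              # 0.5
print(softmax([2.0, 1.0, 0.1]))  # three probabilities, one per class
```

For two classes the functions agree: `softmax([z, 0.0])[0]` gives the same value as `sigmoid(z)`, which is one way to see softmax as the multiclass generalization of sigmoid.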

## Is it possible to use logistic regression for more than two classes?

By definition, logistic regression is a binary classifier. Classification tasks involving more than two classes are called multiclass classification. A generalized version of logistic regression, called multinomial logistic regression, can be used for datasets with more than two classes.
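For reference, one common way to write the multinomial model is as a softmax over per-class linear scores. The symbols here are notation assumed for illustration: $x$ is the feature vector, $w_k$ and $b_k$ are the weights and bias of class $k$, and $C$ is the number of classes.

$$
P(y = k \mid x) = \frac{\exp(w_k^\top x + b_k)}{\sum_{j=1}^{C} \exp(w_j^\top x + b_j)}, \qquad k = 1, \dots, C
$$

With $C = 2$ this reduces to the familiar sigmoid form of binary logistic regression.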

## What do you understand from the term recommendation system?

A recommendation system is an artificial intelligence-based algorithm which gives suggestions to users about the items most suitable for them. It filters out redundant or insignificant information from the database to enable accurate recommendations.

The data required for recommender systems are collected from user interactions on websites, such as web searches, user ratings, comments, purchase history etc. Based on these data, the recommendation system draws inferences to recommend suitable music, products, films etc. to the user.

## How do you select K for K-means clustering?

The elbow method and silhouette method are two direct methods to find optimal $K$ for K-means clustering.

In the elbow method, we calculate the within-cluster sum of squared errors (WSS) for each value of $K$. We then plot WSS against $K$, which gives a curve with an elbow shape. WSS generally keeps decreasing as $K$ increases, so the point at which the decrease slows down sharply (the elbow) is selected as the optimal value of $K$.
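The WSS computation behind the elbow plot can be sketched as follows. This is a bare-bones Lloyd's k-means written for illustration; the function names, the toy `points`, the fixed iteration count and the seeded random initialization are all assumptions, and a real project would normally use a library implementation.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_wss(points, k, iterations=20, seed=0):
    """Run a basic Lloyd's k-means and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct points as initial centroids
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)

# Toy data: three visually separate groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
          (9.0, 1.0), (9.2, 0.9), (8.8, 1.2)]

wss_by_k = {k: kmeans_wss(points, k) for k in range(1, 6)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 2))
```

Plotting these values against $K$ gives the elbow curve. Because the initial centroids are random, a single run can land in a poor local minimum, so in practice the procedure is usually repeated with several seeds.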

The silhouette method measures how similar a point is to the other points in its own cluster compared with the points in other clusters. Silhouette values range from $-1$ to $+1$. High values indicate that a point fits well in the cluster it has been assigned to.

The silhouette scores can be plotted against the number of clusters $K$ and the point at which the silhouette score reaches its global maximum is treated as the optimal $K$.
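A minimal sketch of the silhouette score for a given cluster assignment is shown below. The function names and the four example points are illustrative; the code assumes at least two clusters and Euclidean distance, and it scores a single-point cluster as 0 by convention.

```python
import math

def mean_distance(p, others):
    return sum(math.dist(p, q) for q in others) / len(others)

def silhouette_score(points, labels):
    """Mean silhouette value over all points, for a given cluster assignment."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    values = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other points in the same cluster
        # (the point's distance to itself is 0, hence len(own) - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster.
        b = min(mean_distance(p, other)
                for key, other in clusters.items() if key != label)
        values.append((b - a) / max(a, b))
    return sum(values) / len(values)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # clusters match the two groups
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters cut across the groups
print(round(good, 3), round(bad, 3))
```

The sensible assignment scores close to $+1$ and the mismatched one scores below zero, which is the behaviour the method relies on when comparing values of $K$.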

## What is the key difference between loss function and cost function?

A loss function calculates the error associated with a single instance: it measures the difference between the predicted and actual value of a single observation. The absolute error ($L_1$ loss) and the squared error ($L_2$ loss) of a single observation are examples of loss functions.

A cost function is an aggregate, often the average, of the loss function over multiple instances. Using the cost function, we calculate the total or mean error over the instances in the training data. Mean absolute error (MAE) and mean squared error (MSE) are examples of cost functions.
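The distinction can be made concrete with a short sketch. The function names and the example numbers are made up for illustration:

```python
def absolute_loss(y_true, y_pred):
    """Loss: absolute error of a single observation."""
    return abs(y_true - y_pred)

def squared_loss(y_true, y_pred):
    """Loss: squared error of a single observation."""
    return (y_true - y_pred) ** 2

def cost(loss, targets, predictions):
    """Cost: the average of a per-observation loss over a dataset."""
    losses = [loss(t, p) for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses)

targets = [3.0, 5.0, 2.0, 7.0]
predictions = [2.0, 5.0, 4.0, 8.0]

mae = cost(absolute_loss, targets, predictions)  # (1 + 0 + 2 + 1) / 4 = 1.0
mse = cost(squared_loss, targets, predictions)   # (1 + 0 + 4 + 1) / 4 = 1.5
print(mae, mse)
```

The loss functions operate on one pair of values at a time, while `cost` is the only place where the whole dataset is involved.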

## Explain the general principle of ensemble methods

The principle of an ensemble method is to build a strong model with versatile learning traits by combining several learning models, many of which may be weak on their own. To overcome the limited robustness of a single model, an ensemble method combines the outputs of many models built using a given algorithm. The models whose outputs are combined are called base learners. The outputs of the base learners are combined, for example by taking their average or median or by voting, to produce the output of the ensemble model.

Bagging, stacking and boosting are three commonly used ensemble methods in machine learning.

## What are bagging and boosting in ensemble methods?

Bagging and boosting are two ensemble methods that differ in the way they combine the decisions made by base learners. Bagging trains its base learners in parallel, whereas boosting trains them sequentially.

Bagging is short for bootstrap aggregation. Bootstrapping creates several samples from a dataset by drawing instances from it at random with replacement. Each base learner is trained on one such random sample of the actual dataset. As the base learners are trained in parallel and independently, the resulting models differ slightly from one another. Bagging combines the outputs of these base learners, by averaging them for regression or by majority vote for classification, and treats the result as the final output.
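A minimal bagging sketch for a one-feature, two-class problem is given below. The base learner (a threshold "stump"), the function names, the toy data and the number of learners are all assumptions chosen to keep the example short.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Base learner: choose the threshold on x that misclassifies the fewest
    points. The stump predicts class 1 when x > threshold, else class 0."""
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x > threshold) != bool(y) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(data, n_learners=25, seed=0):
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_learners):
        # Bootstrap sample: draw len(data) instances with replacement.
        sample = [rng.choice(data) for _ in data]
        thresholds.append(fit_stump(sample))
    return thresholds

def bagging_predict(thresholds, x):
    # Aggregate the base learners by majority vote.
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Toy data as (x, class) pairs: small x is class 0, large x is class 1.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0),
        (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]

thresholds = bagging_fit(data)
print(sorted(set(thresholds)))  # the base learners do not all agree
print(bagging_predict(thresholds, 0.5), bagging_predict(thresholds, 9.5))
```

Each stump sees a different bootstrap sample, so the learned thresholds vary; the vote smooths out those differences. Since the stumps are independent, the loop in `bagging_fit` could be run in parallel.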

In the boosting method, the base learners are trained one after another, and each base learner takes the errors of the previous one into account. More weight is assigned to the training instances that the previous base learner misclassified, so that the next base learner can rectify those errors.
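One concrete boosting variant is AdaBoost, in which weights are attached to the training instances and updated after every round. The sketch below shows a single reweighting step; it assumes labels in $\{-1, +1\}$, weights that sum to 1, and a weighted error strictly between 0 and 1. The function name and the example values are illustrative.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost-style reweighting step for labels in {-1, +1}."""
    # Weighted error of the current base learner.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Say of this learner in the final vote: larger when the error is smaller.
    alpha = 0.5 * math.log((1 - error) / error)
    # t * p is +1 for a correct prediction and -1 for a wrong one, so wrong
    # instances are scaled up and correct ones are scaled down.
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]  # only the last instance is misclassified

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update the misclassified instance carries half of the total weight, so the next base learner is pushed to get it right.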

## What are Bayesian networks?

A Bayesian network is a graphical model, specifically a directed acyclic graph, that represents the probabilistic relationships among a set of variables. For example, Bayesian networks can be used to model the probabilistic relation between diseases and their symptoms.
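The disease and symptom example can be sketched as the smallest possible network, with one edge from a disease node to a symptom node. All probabilities below are hypothetical numbers chosen for illustration.

```python
# Hypothetical probabilities for a two-node network: Disease -> Symptom.
p_disease = 0.01                  # P(D = true)
p_symptom_given = {True: 0.90,    # P(S = true | D = true)
                   False: 0.10}   # P(S = true | D = false)

def joint(disease, symptom):
    """P(D = disease, S = symptom), factorised along the edge D -> S."""
    p_d = p_disease if disease else 1 - p_disease
    p_s = p_symptom_given[disease] if symptom else 1 - p_symptom_given[disease]
    return p_d * p_s

def posterior_disease_given_symptom():
    """P(D = true | S = true), summing out D in the denominator."""
    evidence = joint(True, True) + joint(False, True)
    return joint(True, True) / evidence

print(round(posterior_disease_given_symptom(), 4))  # 0.0833
```

Even with a symptom that is nine times more likely under the disease, the posterior stays near 8% because the disease itself is rare. Larger networks apply the same factorisation over more nodes and edges.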

## Distinguish between heuristic for decision trees and heuristic for rule learning

In decision trees, the heuristics, that is, the rules set to reach the solution, are fixed. In rule learning, on the other hand, the rules can be changed and updated based on what is learned.

## What is the advantage of incremental learning algorithms over ensemble modeling?

Incremental learning is the ability of an algorithm to keep learning from new data even after a classifier has been generated from a given dataset. In an ensemble model, no further learning is possible once the final classifier is generated. Incremental models are flexible, so we can debug them during iteration.

Incremental models accept training instances over time and adjust what they have learned in accordance with the nature of the new data and features. They can therefore meet requirements, such as adapting to data that arrives continuously, that are hard to meet with traditional machine learning methods trained once on a fixed dataset.
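As a minimal illustration of learning one instance at a time, the sketch below keeps a running mean per class and classifies by the nearest mean. The class name, the `learn_one` and `predict` methods and the example data are hypothetical, not the API of any specific library.

```python
class IncrementalCentroidClassifier:
    """Keeps one running mean per class and updates it one instance at a time."""

    def __init__(self):
        self.counts = {}
        self.centroids = {}

    def learn_one(self, x, label):
        n = self.counts.get(label, 0) + 1
        self.counts[label] = n
        old = self.centroids.get(label, [0.0] * len(x))
        # Running-mean update: earlier instances do not need to be stored.
        self.centroids[label] = [m + (xi - m) / n for m, xi in zip(old, x)]

    def predict(self, x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))

model = IncrementalCentroidClassifier()
model.learn_one([0.0, 0.0], "a")
model.learn_one([2.0, 0.0], "a")
model.learn_one([10.0, 10.0], "b")
before = model.predict([0.0, 9.0])  # only classes "a" and "b" are known

# A new class shows up later; the model is updated without retraining.
model.learn_one([0.0, 10.0], "c")
after = model.predict([0.0, 9.0])
print(before, after)
```

The same query point is assigned to a different class once the new instance has been learned, without revisiting any of the earlier data.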

The previous set of discussion can be found here