
Top Interview Questions for AI Engineer

Fundamental Basics for Machine Learning

John Lu
6 min read · Apr 27, 2024


Outline

  • 1.) Why do we need to normalize numerical features?
  • 2.) The choice between precision and recall
  • 3.) How to tackle imbalanced data?
  • 4.) Why can Dropout suppress overfitting?
  • 5.) Briefly describe the KNN (k-nearest neighbors) algorithm

1.) Why do we need to normalize numerical features?

1. Equalizing Scales

  • Numerical features often have different scales (ranges of values). For example, consider age (ranging from 0 to 100) and income (ranging from 0 to 100,000+).
  • Many algorithms, especially distance-based and gradient-based ones, implicitly treat features with larger scales as more influential. This can lead to biased results.
  • Normalization ensures that all features contribute equally by scaling them to a common range.

2. Gradient Descent Convergence

  • Many machine learning algorithms (e.g., linear regression, neural networks) use gradient descent for optimization.
  • If features are not normalized, the gradient steps may be uneven across different features.
  • Normalization helps gradient descent converge faster and more consistently.

3. Avoiding Numerical Instabilities

  • Some algorithms (e.g., PCA, SVM) rely on matrix operations.
  • Large-scale differences in feature values can cause numerical instability (e.g., overflow or underflow).
  • Normalization mitigates these issues.

4. Interpretability and Comparisons

  • Normalized features are easier to interpret and compare.
  • After standardization, coefficients in linear models represent the impact of a one-standard-deviation change in the corresponding feature.
  • Without normalization, coefficients may not be directly comparable.

In summary, normalizing numerical features improves algorithm performance, stability, and interpretability. It ensures fair treatment of all features, regardless of their original scales.
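
As a quick sketch, here is how the two most common approaches, min-max scaling and standardization (z-score), might look. It assumes scikit-learn and made-up age/income values; the article itself does not prescribe a library.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales: age (0-100) and income (0-100,000+).
X = np.array([[25, 40_000],
              [52, 95_000],
              [37, 60_000],
              [61, 120_000]], dtype=float)

# Min-max normalization: rescale each feature to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization (z-score): zero mean, unit variance per feature.
X_standard = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_standard)
```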

2.) The choice between precision and recall

If you forget the definitions, you can still derive them on the whiteboard using the following method:

There are four cases:

  • TP (True Positive): The model predicts positive, and the sample is actually positive.
  • TN (True Negative): The model predicts negative, and the sample is actually negative.
  • FP (False Positive): The model predicts positive, but the sample is actually negative.
  • FN (False Negative): The model predicts negative, but the sample is actually positive.
(Figure source: Wikipedia)
  • Precision: How many retrieved items are relevant? Precision = TP / (TP + FP), so it focuses on minimizing false positives (FP).
  • Recall: How many relevant items are retrieved? Recall = TP / (TP + FN), so it focuses on minimizing false negatives (FN).

The choice between them depends on the specific problem and its impact on users or stakeholders; a small numeric sketch follows.
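
Here is that derivation in code. The labels are made up, and scikit-learn is assumed only to cross-check the hand-computed values.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = positive (relevant), 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# From the four cases above:
#   Precision = TP / (TP + FP)  ->  "how many retrieved items are relevant?"
#   Recall    = TP / (TP + FN)  ->  "how many relevant items are retrieved?"
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

print(tp / (tp + fp), tp / (tp + fn))                                 # 0.75 0.75
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))  # same values
```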

Case 1: Training an Image Classifier with a Balanced Dataset

  • In this case, a balanced dataset means that the number of positive and negative examples is roughly equal.
  • Precision is important because we want to minimize false positives (i.e., avoid classifying non-relevant images as relevant).
  • With a balanced dataset, it is usually feasible to aim for both high precision and high recall.

Case 2: Face Recognition in Google Photos

  • For face recognition, recall is more critical.
  • We want to ensure that we don’t miss any actual faces in the photos.
  • A false positive (mistakenly recognizing a non-face as a face) is less problematic than a false negative (missing a face).

Case 3: Search Photos in Google Photos

  • When searching for photos, precision is more important.
  • Users expect accurate search results without irrelevant photos.
  • A false positive (showing irrelevant photos) can be frustrating for users.

Case 4: Rare Disease Detection

  • In rare disease detection, recall is crucial.
  • Missing a rare disease case can have severe consequences.
  • A false positive (flagging a non-disease case) is less harmful than a false negative.

Case 5: Credit Card Fraud Detection

  • For fraud detection, both precision and recall can be emphasized.
  • Precision matters if we want to minimize false alarms (flagging legitimate transactions as fraud).
  • Recall matters because a false negative (missing a fraudulent transaction) can lead to financial losses (a sketch of weighting the two metrics follows).
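
When a case calls for leaning toward one metric, the F-beta score is one way to encode that preference: beta < 1 weights precision more, beta > 1 weights recall more. A rough sketch with made-up fraud labels, assuming scikit-learn:

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical fraud labels: 1 = fraud, 0 = legitimate.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]

print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))  # 0.6 0.75
print(fbeta_score(y_true, y_pred, beta=0.5))  # leans toward precision
print(fbeta_score(y_true, y_pred, beta=2.0))  # leans toward recall
```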

3.) How to tackle imbalanced data?

In what scenario is accuracy high but the F1-score low?

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • F1-score = 2 * (Precision * Recall) / (Precision + Recall)

The model is good at capturing true negatives (TN) but not true positives (TP). This typically happens with an imbalanced dataset, where one class dominates the other, as the sketch below illustrates.
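
A tiny example with made-up numbers, assuming scikit-learn for the metrics, makes the gap concrete:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced dataset: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A majority-class-leaning model: it catches only 1 of the 5 positives.
y_pred = [0] * 95 + [0, 0, 0, 0, 1]

print(accuracy_score(y_true, y_pred))  # 0.96, looks great
print(f1_score(y_true, y_pred))        # ~0.33, exposes the poor TP rate
```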

1. Collect More Data

  • If possible, collect additional data for the minority class.
  • Consider historical data or non-subscriber records (if applicable) to augment the dataset.

2. Combine Minority Classes

  • Instead of classifying each type of anomaly separately, consider combining them into a single class (e.g., “Abnormal Heartbeat”).
  • This simplifies the problem and ensures better representation.

3. Change Performance Metrics

  • Accuracy can be misleading for imbalanced data.
  • Focus on other metrics like precision (minimizing false positives) or recall (minimizing false negatives).

4. Resampling Techniques

  • Under-sampling: Reduce the majority class samples to balance the distribution.
  • Over-sampling: Increase the minority class samples.
  • SMOTE (Synthetic Minority Over-sampling Technique): Generate synthetic samples for the minority class (a resampling sketch follows this list).
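
A minimal resampling sketch, assuming the imbalanced-learn (imblearn) package and a synthetic dataset; none of this is prescribed by the article:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic data with a roughly 95% / 5% class split.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("original:", Counter(y))

# Under-sampling: shrink the majority class.
X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)
# Over-sampling: duplicate minority samples.
X_o, y_o = RandomOverSampler(random_state=0).fit_resample(X, y)
# SMOTE: synthesize new minority samples by interpolating between nearest neighbors.
X_s, y_s = SMOTE(random_state=0).fit_resample(X, y)

print("under:", Counter(y_u), "over:", Counter(y_o), "smote:", Counter(y_s))
```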

5. Cost-Sensitive Learning

  • Assign different misclassification costs to different classes.
  • Penalize misclassifying the minority class more heavily (see the sketch below).
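
One common way to do this in practice is the class_weight option of many scikit-learn estimators; a rough sketch, reusing the synthetic-data idea from above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# class_weight="balanced" scales each class's errors inversely to its frequency,
# so mistakes on the rare class are penalized more heavily during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# An explicit cost dictionary also works, e.g. a 10x penalty on the minority class.
clf_costly = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X, y)
```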

6. Ensemble Techniques

  • Use ensemble models (e.g., Random Forest, XGBoost), which often handle class imbalance better, especially when combined with class weights or resampling.

7. One-Class Classification

  • Treat the minority class as an anomaly detection problem.
  • Train a model on the normal class only and flag anything that deviates from it as the rare class (see the sketch below).
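
A short sketch of the idea, using scikit-learn's IsolationForest as one possible anomaly detector (the article does not name a specific model):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Fit on "normal" samples only; the rare class is treated as an anomaly.
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

detector = IsolationForest(contamination=0.05, random_state=0).fit(X_normal)

# predict() returns +1 for inliers (normal) and -1 for outliers (the rare class).
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])
print(detector.predict(X_new))  # expected: [ 1 -1]
```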

8. Hierarchical Classification

  • Apply multi-stage classification to the original task.
  • We can mitigate the imbalanced data distribution by breaking the original problem into sub-problems.

4.) Why can Dropout suppress overfitting?

What is Dropout?

  • Dropout is a regularization technique used during training in neural networks.
  • It randomly drops (sets to zero) a fraction of neurons during each forward and backward pass.
  • The dropped neurons do not contribute to the computation, effectively creating a smaller network for that pass (a minimal sketch of the mechanism follows this list).
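
A minimal NumPy sketch of the mechanism, using the common "inverted dropout" formulation (an assumption; frameworks differ in details):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, p=0.5, training=True):
    """Inverted dropout: zero each unit with probability p during training and
    scale the survivors by 1 / (1 - p) so the expected activation is unchanged."""
    if not training:
        return activations                       # inference uses the full network
    mask = rng.random(activations.shape) >= p    # keep a unit with probability 1 - p
    return activations * mask / (1.0 - p)

h = np.ones((2, 8))          # toy hidden-layer activations
print(dropout_forward(h))    # about half the units zeroed, the rest scaled to 2.0
```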

How Does Dropout Suppress Overfitting?

Reducing Co-Adaptation

  • Dropout prevents neurons from relying too much on specific features or co-adapting with other neurons.
  • By randomly dropping neurons, the network learns more robust and independent representations.

Ensemble Effect

  • During training, each forward pass uses a different subset of neurons.
  • This is akin to training multiple models (an ensemble) and averaging their predictions.
  • Dropout acts as a form of model averaging, reducing overfitting.

Regularization

  • Dropout introduces noise into the network, acting as a form of regularization.
  • It discourages complex co-adaptations and encourages simpler, more generalizable models.

Generalization

  • Dropout helps the network generalize better to unseen data.
  • It prevents the network from fitting the training data too closely, which can lead to overfitting.

Implementation

  • Dropout is typically applied to hidden layers during training.
  • The dropout rate (probability of dropping a neuron) is a hyperparameter.
  • Common values for the dropout rate are around 0.2 to 0.5 (see the sketch below).
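
A short sketch of typical usage, assuming PyTorch; the layer sizes and dropout rates here are arbitrary:

```python
import torch
from torch import nn

# A small classifier with dropout applied after the hidden-layer activations.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(p=0.3),
    nn.Linear(128, 10),
)

x = torch.randn(32, 784)

model.train()           # dropout active: random units are zeroed on each forward pass
out_train = model(x)

model.eval()            # dropout disabled: the full network is used at inference time
out_eval = model(x)
```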

5.) Briefly describe the KNN (k-nearest neighbors) algorithm

Basic Idea

  • KNN is a non-parametric and supervised learning algorithm.
  • It uses proximity (distance) to make predictions or classifications about individual data points.
  • The assumption is that similar points tend to be close to each other.

How It Works

Given a new data point, KNN finds the k nearest neighbors to that point based on a distance metric (usually Euclidean distance).

For classification problems

  • It assigns a class label based on a majority vote among the k neighbors.
  • The most frequent class label among the neighbors is used.

For regression problems

  • It predicts a value by taking the average of the target values of the k neighbors (see the sketch below).
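
A minimal sketch of both cases, assuming scikit-learn and toy data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Classification: majority vote among the k nearest neighbors (Euclidean distance by default).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(clf.score(X_test, y_test))

# Regression: average the target values of the k nearest neighbors (toy 1-D data).
X_r = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_r = np.array([1.2, 1.9, 3.1, 4.0, 5.2])
reg = KNeighborsRegressor(n_neighbors=2).fit(X_r, y_r)
print(reg.predict([[2.5]]))  # mean of the two nearest targets: (1.9 + 3.1) / 2 = 2.5
```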

Lazy Learning

  • KNN is part of the lazy learning family.
  • It stores the entire training dataset (no explicit training phase).
  • Computation occurs during prediction or classification.

Pros and Cons

Pros

  • Simple and easy to understand.
  • Works well with small datasets.
  • Suitable for recommendation systems, pattern recognition, and more.

Cons

  • Inefficient for large datasets.
  • Sensitive to irrelevant features.
  • Requires careful tuning of the hyperparameter k (a tuning sketch follows this list).
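
One common way to tune k is cross-validation; a rough sketch, assuming scikit-learn and the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale features first (KNN is distance-based), then search for k with 5-fold cross-validation.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = {"kneighborsclassifier__n_neighbors": list(range(1, 21))}
search = GridSearchCV(pipe, grid, cv=5).fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```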

Use Cases

  • Simple recommendation systems.
  • Data mining.
  • Financial market predictions.
  • Intrusion detection.

--

John Lu

AI Engineer. Deeply motivated by challenges and excited by breaking conventional ways of thinking and doing. He builds fun and creative apps.