
Top Interview Questions for AI Engineer

Fundamental Basics for Machine Learning

John Lu
6 min read · Apr 27, 2024


Outline

  • 1.) Why do we need to normalize numerical features?
  • 2.) The choice between precision and recall
  • 3.) How to tackle imbalanced data?
  • 4.) Why can Dropout suppress overfitting?
  • 5.) Briefly describe the KNN (k-nearest neighbors) algorithm

1.) Why do we need to normalize numerical features?

1. Equalizing Scales

  • Numerical features often have different scales (ranges of values). For example, consider age (ranging from 0 to 100) and income (ranging from 0 to 100,000+).
  • Many algorithms, especially distance-based and gradient-based ones, implicitly treat features with larger scales as more influential. This can lead to biased results.
  • Normalization ensures that all features contribute equally by scaling them to a common range.

2. Gradient Descent Convergence

  • Many machine learning algorithms (e.g., linear regression, neural networks) use gradient descent for optimization.
  • If features are not normalized, the gradient steps may be uneven across different features.
  • Normalization helps gradient descent converge faster and more consistently.

3. Avoiding Numerical Instabilities

  • Some algorithms (e.g., PCA, SVM) rely on matrix operations.
  • Large-scale differences in feature values can cause numerical instability (e.g., overflow or underflow).
  • Normalization mitigates these issues.

4. Interpretability and Comparisons

  • Normalized features are easier to interpret and compare.
  • After standardization, coefficients in linear models represent the impact of a one-standard-deviation change in the corresponding feature.
  • Without normalization, coefficients may not be directly comparable.

In summary, normalizing numerical features improves algorithm performance, stability, and interpretability. It ensures fair treatment of all features, regardless of their original scales.
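
As a quick sketch, here is how the two most common approaches, min-max scaling and standardization (z-score), might look. It assumes scikit-learn and made-up age/income values; the article itself does not prescribe a library.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales: age (0-100) and income (0-100,000+).
X = np.array([[25, 40_000],
              [52, 95_000],
              [37, 60_000],
              [61, 120_000]], dtype=float)

# Min-max normalization: rescale each feature to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization (z-score): zero mean, unit variance per feature.
X_standard = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_standard)
```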

2.) The choice between precision and recall

If you forget the definitions, you can still derive them on the whiteboard using the following method:

There are four cases:

  • TP (True Positive): The model predicts positive, and the sample is actually positive.
  • TN (True Negative): The model predicts negative, and the sample is actually negative.
  • FP (False Positive): The model predicts positive, but the sample is actually negative.
  • FN (False Negative): The model predicts negative, but the sample is actually positive.
(Figure source: Wikipedia)
  • Precision: How many retrieved items are relevant? Precision = TP / (TP + FP), so it focuses on minimizing false positives (FP).
  • Recall: How many relevant items are retrieved? Recall = TP / (TP + FN), so it focuses on minimizing false negatives (FN).

The choice between them depends on the specific problem and its impact on users or stakeholders; a small numeric sketch follows.
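
Here is that derivation in code. The labels are made up, and scikit-learn is assumed only to cross-check the hand-computed values.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = positive (relevant), 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# From the four cases above:
#   Precision = TP / (TP + FP)  ->  "how many retrieved items are relevant?"
#   Recall    = TP / (TP + FN)  ->  "how many relevant items are retrieved?"
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

print(tp / (tp + fp), tp / (tp + fn))                                 # 0.75 0.75
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))  # same values
```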

Case 1: Training an Image Classifier with a Balanced Dataset

  • In this case, a balanced dataset means that the number of positive and negative examples is roughly equal.
  • Precision is important because we want to minimize false positives (i.e., avoid classifying non-relevant images as relevant).
  • With a balanced dataset, it is usually feasible to aim for both high precision and high recall.

Case 2: Face Recognition in Google Photos

  • For face recognition, recall is more critical.
  • We want to ensure that we don’t miss any actual faces in the photos.
  • A false positive (mistakenly recognizing a non-face as a face) is less problematic than a false negative (missing a face).

Case 3: Search Photos in Google Photos

  • When searching for photos, precision is more important.
  • Users expect accurate search results without irrelevant photos.
  • A false positive (showing irrelevant photos) can be frustrating for users.

Case 4: Rare Disease Detection

  • In rare disease detection, recall is crucial.
  • Missing a rare disease case can have severe consequences.
  • A false positive (flagging a non-disease case) is less harmful than a false negative.

Case 5: Credit Card Fraud Detection

  • For fraud detection, both precision and recall can be emphasized.
  • Precision matters if we want to minimize false alarms (flagging legitimate transactions as fraud).
  • Recall matters because a false negative (missing a fraudulent transaction) can lead to financial losses (a sketch of weighting the two metrics follows).
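
When a case calls for leaning toward one metric, the F-beta score is one way to encode that preference: beta < 1 weights precision more, beta > 1 weights recall more. A rough sketch with made-up fraud labels, assuming scikit-learn:

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical fraud labels: 1 = fraud, 0 = legitimate.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]

print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))  # 0.6 0.75
print(fbeta_score(y_true, y_pred, beta=0.5))  # leans toward precision
print(fbeta_score(y_true, y_pred, beta=2.0))  # leans toward recall
```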

3.) How to tackle imbalanced data?

In what scenario is accuracy high but the F1-score low?

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • F1-score = 2 * (Precision * Recall) / (Precision + Recall)

The model is good at capturing true negatives (TN) but not true positives (TP). This typically happens with an imbalanced dataset, where one class dominates the other, as the sketch below illustrates.
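
A tiny example with made-up numbers, assuming scikit-learn for the metrics, makes the gap concrete:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced dataset: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A majority-class-leaning model: it catches only 1 of the 5 positives.
y_pred = [0] * 95 + [0, 0, 0, 0, 1]

print(accuracy_score(y_true, y_pred))  # 0.96, looks great
print(f1_score(y_true, y_pred))        # ~0.33, exposes the poor TP rate
```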

1. Collect More Data

  • If possible, collect additional data for the minority class.
  • Consider historical data or non-subscriber records (if applicable) to augment the dataset.

2. Combine Minority Classes

  • Instead of classifying each type of anomaly separately, consider combining them into a single class (e.g., “Abnormal Heartbeat”).
  • This simplifies the problem and ensures better representation.

3. Change Performance Metrics

  • Accuracy can be misleading for imbalanced data.
  • Focus on other metrics like precision (minimizing false positives) or recall (minimizing false negatives).

4. Resampling Techniques

  • Under-sampling: Reduce the majority class samples to balance the distribution.
  • Over-sampling: Increase the minority class samples.
  • SMOTE (Synthetic Minority Over-sampling Technique): Generate synthetic samples for the minority class (a resampling sketch follows this list).
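
A minimal resampling sketch, assuming the imbalanced-learn (imblearn) package and a synthetic dataset; none of this is prescribed by the article:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic data with a roughly 95% / 5% class split.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("original:", Counter(y))

# Under-sampling: shrink the majority class.
X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)
# Over-sampling: duplicate minority samples.
X_o, y_o = RandomOverSampler(random_state=0).fit_resample(X, y)
# SMOTE: synthesize new minority samples by interpolating between nearest neighbors.
X_s, y_s = SMOTE(random_state=0).fit_resample(X, y)

print("under:", Counter(y_u), "over:", Counter(y_o), "smote:", Counter(y_s))
```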

5. Cost-Sensitive Learning

  • Assign different misclassification costs to different classes.
  • Penalize misclassifying the minority class more heavily (see the sketch below).
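
One common way to do this in practice is the class_weight option of many scikit-learn estimators; a rough sketch, reusing the synthetic-data idea from above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# class_weight="balanced" scales each class's errors inversely to its frequency,
# so mistakes on the rare class are penalized more heavily during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# An explicit cost dictionary also works, e.g. a 10x penalty on the minority class.
clf_costly = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X, y)
```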

6. Ensemble Techniques

  • Use ensemble models (e.g., Random Forest, XGBoost), which often handle class imbalance better, especially when combined with class weights or resampling.

7. One-Class Classification

  • Treat the minority class as an anomaly detection problem.
  • Train a model on the normal class only and flag anything that deviates from it as the rare class (see the sketch below).
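
A short sketch of the idea, using scikit-learn's IsolationForest as one possible anomaly detector (the article does not name a specific model):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Fit on "normal" samples only; the rare class is treated as an anomaly.
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

detector = IsolationForest(contamination=0.05, random_state=0).fit(X_normal)

# predict() returns +1 for inliers (normal) and -1 for outliers (the rare class).
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])
print(detector.predict(X_new))  # expected: [ 1 -1]
```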

8. Hierarchical Classification

  • Apply multi-stage classification to the original task.
  • We can mitigate the imbalanced data distribution by breaking the original problem into sub-problems.

4.) Why can Dropout suppress overfitting?

What is Dropout?

  • Dropout is a regularization technique used during training in neural networks.
  • It randomly drops (sets to zero) a fraction of neurons during each forward and backward pass.
  • The dropped neurons do not contribute to the computation, effectively creating a smaller network for that pass (a minimal sketch of the mechanism follows this list).
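
A minimal NumPy sketch of the mechanism, using the common "inverted dropout" formulation (an assumption; frameworks differ in details):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, p=0.5, training=True):
    """Inverted dropout: zero each unit with probability p during training and
    scale the survivors by 1 / (1 - p) so the expected activation is unchanged."""
    if not training:
        return activations                       # inference uses the full network
    mask = rng.random(activations.shape) >= p    # keep a unit with probability 1 - p
    return activations * mask / (1.0 - p)

h = np.ones((2, 8))          # toy hidden-layer activations
print(dropout_forward(h))    # about half the units zeroed, the rest scaled to 2.0
```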

How Does Dropout Suppress Overfitting?

Reducing Co-Adaptation

  • Dropout prevents neurons from relying too much on specific features or co-adapting with other neurons.
  • By randomly dropping neurons, the network learns more robust and independent representations.

Ensemble Effect

  • During training, each forward pass uses a different subset of neurons.
  • This is akin to training multiple models (an ensemble) and averaging their predictions.
  • Dropout acts as a form of model averaging, reducing overfitting.

Regularization

  • Dropout introduces noise into the network, acting as a form of regularization.
  • It discourages complex co-adaptations and encourages simpler, more generalizable models.

Generalization

  • Dropout helps the network generalize better to unseen data.
  • It prevents the network from fitting the training data too closely, which can lead to overfitting.

Implementation

  • Dropout is typically applied to hidden layers during training.
  • The dropout rate (probability of dropping a neuron) is a hyperparameter.
  • Common values for the dropout rate are around 0.2 to 0.5 (see the sketch below).
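
A short sketch of typical usage, assuming PyTorch; the layer sizes and dropout rates here are arbitrary:

```python
import torch
from torch import nn

# A small classifier with dropout applied after the hidden-layer activations.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(p=0.3),
    nn.Linear(128, 10),
)

x = torch.randn(32, 784)

model.train()           # dropout active: random units are zeroed on each forward pass
out_train = model(x)

model.eval()            # dropout disabled: the full network is used at inference time
out_eval = model(x)
```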

5.) Briefly describe the KNN (k-nearest neighbors) algorithm

Basic Idea

  • KNN is a non-parametric and supervised learning algorithm.
  • It uses proximity (distance) to make predictions or classifications about individual data points.
  • The assumption is that similar points tend to be close to each other.

How It Works

Given a new data point, KNN finds the k nearest neighbors to that point based on a distance metric (usually Euclidean distance).

For classification problems

  • It assigns a class label based on a majority vote among the k neighbors.
  • The most frequent class label among the neighbors is used.

For regression problems

  • It predicts a value by taking the average of the target values of the k neighbors (see the sketch below).
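
A minimal sketch of both cases, assuming scikit-learn and toy data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Classification: majority vote among the k nearest neighbors (Euclidean distance by default).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(clf.score(X_test, y_test))

# Regression: average the target values of the k nearest neighbors (toy 1-D data).
X_r = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_r = np.array([1.2, 1.9, 3.1, 4.0, 5.2])
reg = KNeighborsRegressor(n_neighbors=2).fit(X_r, y_r)
print(reg.predict([[2.5]]))  # mean of the two nearest targets: (1.9 + 3.1) / 2 = 2.5
```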

Lazy Learning

  • KNN is part of the lazy learning family.
  • It stores the entire training dataset (no explicit training phase).
  • Computation occurs during prediction or classification.

Pros and Cons

Pros

  • Simple and easy to understand.
  • Works well with small datasets.
  • Suitable for recommendation systems, pattern recognition, and more.

Cons

  • Inefficient for large datasets.
  • Sensitive to irrelevant features.
  • Requires careful tuning of the hyperparameter k (a tuning sketch follows this list).
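
One common way to tune k is cross-validation; a rough sketch, assuming scikit-learn and the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale features first (KNN is distance-based), then search for k with 5-fold cross-validation.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = {"kneighborsclassifier__n_neighbors": list(range(1, 21))}
search = GridSearchCV(pipe, grid, cv=5).fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```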

Use Cases

  • Simple recommendation systems.
  • Data mining.
  • Financial market predictions.
  • Intrusion detection.

--

John Lu

AI Engineer. Deeply motivated by challenges and excited by breaking conventional ways of thinking and doing. He builds fun and creative apps.