Introduction to Machine Learning
FAQs

What are continuous target variables?

A continuous target variable is a variable that can take on any value within a certain range. It's not limited to specific, separate categories like in classification.

For example, when predicting house prices, the price can be any number within a range, like $100,000 to $500,000. The price is a continuous variable because it can take on any value in between, like $235,798 or $401,532.

In contrast, a categorical variable in classification has distinct categories, like "expensive" or "affordable", in the case of house prices. The categories are separate and don't have in-between values.

So, "continuous target variable" means the predicted value is a number that can vary continuously within a range, rather than being limited to specific categories.
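
A minimal sketch of predicting a continuous target with scikit-learn (the features and prices below are made-up illustrations, not a real dataset):

```python
from sklearn.linear_model import LinearRegression

# Each row: [square_feet, num_bedrooms]; prices are continuous dollar values.
X = [[1400, 3], [1600, 3], [1700, 4], [1875, 4], [2350, 5]]
y = [245000, 312000, 279000, 308000, 450000]

model = LinearRegression().fit(X, y)

# The prediction can land anywhere in a range (e.g. $335,612.47),
# not in one of a fixed set of categories.
print(model.predict([[2000, 4]]))
```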

What is overfitting?

Overfitting is a common problem in machine learning where a model learns the noise and specific patterns in the training data too well, to the point that it negatively impacts the model's ability to generalize to new, unseen data.

Imagine you're teaching a student to solve math problems. You give them the same set of problems to practice over and over again. They memorize the solutions to those specific problems perfectly, but when you give them a new problem, they struggle to solve it. This is like overfitting in machine learning.

When a model overfits:

  • It performs exceptionally well on the training data, often achieving high accuracy.
  • It fails to generalize well to new, unseen data, leading to poor performance on test data or real-world scenarios.
  • It becomes overly complex, trying to capture every minor detail and noise in the training data.
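
A minimal sketch of overfitting in action, assuming scikit-learn and NumPy; the noisy sine data and the polynomial degrees are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)  # noisy sine curve

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (3, 15):  # a modest model vs. an overly complex one
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          model.score(X_train, y_train),   # fit on training data
          model.score(X_test, y_test))     # fit on unseen data
```

The degree-15 model typically scores higher on the training data but lower on the held-out test data: it has memorized the noise rather than the underlying trend, which is exactly the overfitting pattern described above.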

What does feature selection mean?

Feature selection is the process of choosing the most relevant and informative features (input variables) from a larger set of features to use in a machine learning model. The goal is to simplify the model, improve its performance, and reduce overfitting.

You can think of it like packing for a trip. You have a bunch of items you could bring, but you want to select only the essentials that will be most useful for your trip. Similarly, in machine learning, you have a dataset with many features, but you want to select only the most important ones that will contribute to making accurate predictions.

The main reasons for performing feature selection are:

  • Dimensionality reduction: Reducing the number of features helps simplify the model and makes it easier to interpret. It also reduces the computational complexity and memory requirements.
  • Improved model performance: By removing irrelevant or redundant features, feature selection can improve the model's accuracy and generalization ability. It helps the model focus on the most informative features.
  • Avoiding overfitting: Including unnecessary features can lead to overfitting, where the model learns noise and specific patterns in the training data that don't generalize well to new data. Feature selection helps mitigate this problem.
  • Faster training and inference: With fewer features, the model requires less time and resources for training and making predictions.
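
A minimal sketch of one common feature-selection approach, univariate selection with scikit-learn's SelectKBest (other methods, such as recursive feature elimination or L1 regularization, exist as well):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
print(X.shape)  # (150, 4): four features to start with

# Keep the 2 features with the strongest statistical relationship to y.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)        # (150, 2): the reduced feature set
print(selector.get_support())  # boolean mask showing which features were kept
```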

What are dependent and independent variables in machine learning?

In machine learning, the terms "dependent variable" and "independent variable" are borrowed from statistics and are used to describe the relationship between the input features and the target variable in a dataset.

Independent Variables (Features):

  • Independent variables, also known as features or input variables, are the variables that are used to predict or explain the dependent variable.
  • These variables are assumed to have an influence on the outcome or the dependent variable.
  • In a dataset, independent variables are typically denoted as X or a matrix of features.
  • For example, in a spam email classification problem, the independent variables could be the presence of certain keywords, the length of the email, the sender's domain, etc.

Dependent Variable (Target):

  • The dependent variable, also known as the target variable or output variable, is the variable that we aim to predict or explain using the independent variables.
  • It is the outcome or the result that depends on the independent variables.
  • For example, in a spam email classification problem, the dependent variable would be the label indicating whether an email is spam or not (e.g., 1 for spam, 0 for not spam).

The goal of machine learning is to learn a function or model that accurately maps the independent variables to the dependent variable, capturing the patterns and relationships between the features and the target.
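
A minimal sketch of the conventional X/y split, assuming pandas; the column names are illustrative stand-ins for the spam example above:

```python
import pandas as pd

df = pd.DataFrame({
    "num_keywords": [5, 0, 3, 1],         # independent variable (feature)
    "email_length": [120, 800, 95, 640],  # independent variable (feature)
    "is_spam":      [1, 0, 1, 0],         # dependent variable (target)
})

X = df[["num_keywords", "email_length"]]  # conventionally denoted X
y = df["is_spam"]                         # conventionally denoted y
```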

What are discrete or categorical variables in machine learning?

In machine learning, discrete or categorical variables are variables that take on a limited number of distinct values or categories. They are not continuous: each value represents a separate category or group, and depending on the variable the categories may or may not have a natural order (see the nominal/ordinal distinction below).

Examples of discrete or categorical variables include:

  • Color: Red, Green, Blue, Yellow
  • Educational Level: High School, Bachelor's, Master's, PhD
  • Product Category: Electronics, Clothing, Home Appliances, Books

Categorical variables can be further classified into two types:

  • Nominal Variables:
    • Categories have no inherent order or ranking
    • Examples: Color, Product Category
  • Ordinal Variables:
    • Categories have a natural order or ranking
    • The difference between categories is not necessarily uniform or measurable
    • Examples: Educational Level (High School < Bachelor's < Master's < PhD), Rating (Poor < Average < Good < Excellent)
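
A minimal sketch of how the two kinds are often encoded before modeling, using pandas; the categories mirror the examples above:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["Red", "Green", "Blue"],                # nominal
    "education": ["High School", "Master's", "PhD"],  # ordinal
})

# Nominal: one-hot encode, since the categories have no inherent order.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal: map to integers that respect the natural ranking.
order = {"High School": 0, "Bachelor's": 1, "Master's": 2, "PhD": 3}
df["education_rank"] = df["education"].map(order)

print(one_hot)
print(df[["education", "education_rank"]])
```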

What are hyperparameters in machine learning?

In machine learning, a hyperparameter is a parameter whose value is set before the learning process begins. It's a configuration that is external to the model and whose value cannot be estimated from data. The term "hyperparameter" is used to distinguish them from the parameters of the model itself, which are learned during training.

Key points about hyperparameters:

  • They are set prior to training: Hyperparameters are specified by the practitioner before the learning algorithm starts.
  • They control the learning process: Hyperparameters guide the learning algorithm and have a significant impact on the performance of the resulting model.
  • They require tuning: Finding the optimal values for hyperparameters often involves a search process called hyperparameter tuning or optimization.

Examples of hyperparameters in different machine learning algorithms:

  • In neural networks: The learning rate, number of hidden layers, number of neurons per layer, activation functions, etc.
  • In decision trees and random forests: The maximum depth of the tree, minimum number of samples required to split a node, number of trees in a forest, etc.
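
A minimal sketch of hyperparameter tuning with scikit-learn's GridSearchCV; the parameter grid values are illustrative, not recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Hyperparameters are fixed before training; here we search over a few
# candidate values rather than picking them by hand.
param_grid = {
    "n_estimators": [50, 100],  # number of trees in the forest
    "max_depth": [3, None],     # maximum depth of each tree
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)  # the combination that scored best in cross-validation
```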