Introduction to Classification

Introduction

Classification is a fundamental concept in machine learning, where the goal is to predict which category or class a new observation belongs to based on a training dataset of observations whose category is known. Classification has numerous real-world applications, such as spam email detection, image recognition, sentiment analysis, and medical diagnosis.

Consider building a system that automatically categorizes a dataset of emails as spam or not spam. The model learns the patterns and characteristics that distinguish the two classes by being trained on a labeled dataset, where each email is marked as spam or not spam. Once trained, the model can classify new, unseen emails based on the learned patterns.

Definition

Classification is a supervised learning technique that aims to predict the class or category of an input instance based on its features. The goal is to learn a function that maps input features to discrete output labels. The key components of a classification problem are:

  • Input Features: The independent variables used to make predictions, denoted as X.
  • Output Labels: The discrete dependent variable or class labels we aim to predict, denoted as y.
  • Model: The mathematical function that maps the input features to the output labels, denoted as f(X).
  • Decision Boundary: The boundary or hyperplane separating different classes.

Given a labeled dataset D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} consisting of n data points, where x_i represents the input features and y_i represents the corresponding class label, the goal of classification is to learn a function f(X) that accurately predicts the class labels for new, unseen instances.

The performance of a classification model is typically evaluated using metrics such as accuracy, precision, recall, and F1-score. These metrics measure how well the model's predictions align with the true class labels.
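
As a rough illustration, the snippet below computes these four metrics for a small set of hypothetical predictions using scikit-learn; the labels here are invented purely for demonstration:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true labels and model predictions (1 = spam, 0 = not spam)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))   # fraction of correct predictions
print("Precision:", precision_score(y_true, y_pred))  # of predicted positives, how many are truly positive
print("Recall:   ", recall_score(y_true, y_pred))     # of actual positives, how many were found
print("F1-score: ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall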

Types of Classification

There are several types of classification based on the number and nature of the class labels:

  • Binary Classification: Binary classification problems have two possible outcomes. Examples include identifying whether an email is spam or not spam, determining if a tumor is malignant or benign, and predicting whether a customer will buy a product or not. The output variable in binary classification is often represented as 0 or 1, where each number corresponds to a class.

  • Multi-Class Classification: Multi-class classification involves categorizing instances into one of three or more classes. Examples include classifying types of fruits, diagnosing diseases into multiple categories, and identifying the author of a piece of text from several possibilities. Unlike binary classification, multi-class classification deals with more than two classes.

  • Multi-Label Classification: In multi-label classification, an instance can belong to multiple classes simultaneously. An example is tagging an image with multiple relevant labels, such as "beach," "sunset," and "people." The sketch after this list contrasts how labels are typically represented in each of these three settings.
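
To make the distinction concrete, here is one common way the labels look in each setting (a hypothetical sketch; exact encodings vary by library):

# Binary: one label per instance, drawn from exactly two classes
y_binary = [0, 1, 1, 0]            # e.g., 0 = not spam, 1 = spam

# Multi-class: one label per instance, drawn from three or more classes
y_multiclass = ["apple", "banana", "cherry", "banana"]

# Multi-label: a set of labels per instance
y_multilabel = [
    ["beach", "sunset"],           # image 1
    ["people"],                    # image 2
    ["beach", "sunset", "people"], # image 3
]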

Popular Methods for Classification

Logistic Regression

Logistic regression is a widely used algorithm for binary (two-class) classification problems. It models the probability of an instance belonging to a particular class based on the input features. Unlike linear regression, which predicts a continuous value, logistic regression predicts the probability that an instance belongs to the positive class (usually denoted as 1). The logistic regression equation is:

P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + ... + \beta_n x_n)}}

Where:

  • P(y=1|x) is the probability of the instance belonging to the positive class given the input features x,
  • x_1, ..., x_n are the input features,
  • \beta_0, \beta_1, ..., \beta_n are the coefficients the logistic regression algorithm learns to fit the data.

Example: Consider an example of predicting whether a customer will buy a product based on their age and income:

P(\text{buy}=1|\text{age}, \text{income}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \times \text{age} + \beta_2 \times \text{income})}}

In this equation:

  • buy is the target variable (1 if the customer buys the product, 0 otherwise),
  • age is the first input feature (the customer's age),
  • income is the second input feature (the customer's income),
  • \beta_0, \beta_1, \beta_2 are the coefficients learned over the course of training.

Logistic regression learns the best values for \beta_0, \beta_1, and \beta_2 that maximize the likelihood of the observed data. The predicted probability can be converted into a binary class label by setting a threshold (e.g., 0.5).
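
As a minimal sketch of this workflow, the following fits scikit-learn's LogisticRegression on a few made-up (age, income) pairs; the data and the 0.5 threshold are illustrative assumptions, not a real dataset:

from sklearn.linear_model import LogisticRegression

# Hypothetical training data: [age, income] and whether the customer bought (1) or not (0)
X = [[25, 30000], [40, 60000], [35, 45000], [50, 90000], [23, 25000], [45, 75000]]
y = [0, 1, 0, 1, 0, 1]

# Extra iterations help convergence since the features are unscaled
model = LogisticRegression(max_iter=1000)
model.fit(X, y)  # learns beta_0 (intercept_) and beta_1, beta_2 (coef_)

# Predicted probability of buying for a new 30-year-old earning 50,000
p = model.predict_proba([[30, 50000]])[0][1]
label = int(p >= 0.5)  # apply the 0.5 threshold to get a binary label
print(f"P(buy=1) = {p:.2f} -> predicted class {label}")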

Decision Trees

Decision trees are a popular algorithm for both classification and regression tasks. They create a tree-like model of decisions and their possible consequences based on the input features. Each internal node of the tree represents a decision based on a feature, and each leaf node represents a class label or a numerical value. A decision tree's logic can be written as a series of if-then rules:


if feature1 <= threshold1:
    if feature2 <= threshold2:
        return class1
    else:
        return class2
else:
    if feature3 <= threshold3:
        return class3
    else:
        return class4

Example: Consider an example of predicting whether a person will default on a loan based on their credit score and income:

if credit_score <= 650:
    if income <= 50000:
        return high_risk
    else:
        return medium_risk
else:
    if income <= 80000:
        return medium_risk
    else:
        return low_risk

In this example:

  • credit_score and income are the input features,
  • high_risk, medium_risk, and low_risk are the class labels representing the risk of loan default.

Decision trees are intuitive and easy to interpret, but they can be prone to overfitting if the tree becomes too deep. Techniques like pruning and ensemble methods (e.g., random forests) can help mitigate this issue.
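
As a hedged sketch, scikit-learn's DecisionTreeClassifier can learn rules like those above; here max_depth caps the tree's depth as a simple guard against overfitting, and the applicant data is invented for illustration:

from sklearn.tree import DecisionTreeClassifier

# Hypothetical applicants: [credit_score, income] and their risk label
X = [[600, 40000], [620, 70000], [700, 60000], [720, 95000], [580, 30000], [680, 85000]]
y = ["high_risk", "medium_risk", "medium_risk", "low_risk", "high_risk", "low_risk"]

# Limiting the depth keeps the tree small and less prone to overfitting
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

print(tree.predict([[640, 55000]]))  # predicted risk label for a new applicant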

Applications

  • Sentiment Analysis for Crypto Twitter: Classification models can be employed to determine the sentiment (bullish, bearish, or neutral) expressed in tweets about markets or tokens.
  • Email Spam Filtering: Classification can automatically identify and filter out spam emails based on various features such as the sender, subject, content, and presence of specific keywords, as in the sketch following this list.
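
The following sketch shows one common approach to the spam example: a bag-of-words representation fed to a naive Bayes classifier. The tiny corpus is invented; real filters train on far more data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical corpus of emails labeled spam (1) or not spam (0)
emails = [
    "win a free prize now", "limited offer click here",
    "meeting rescheduled to friday", "lunch tomorrow?",
]
labels = [1, 1, 0, 0]

# Turn each email into a vector of word counts, then fit naive Bayes
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
classifier = MultinomialNB()
classifier.fit(X, labels)

new_email = vectorizer.transform(["free prize offer"])
print(classifier.predict(new_email))  # -> [1], i.e., spam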

Conclusion

Classification is a fundamental task in machine learning that enables data to be automatically categorized into predefined classes. It has numerous applications across domains, from spam email detection to medical diagnosis and sentiment analysis.

When approaching a classification problem, it is essential to understand the nature of the task (binary, multi-class, or multi-label), select an appropriate algorithm based on the characteristics of the data and the problem at hand, and evaluate the model's performance using suitable metrics.
