Learning Machine Learning with Examples #1
This tutorial demonstrates the typical lifecycle of a machine learning experiment through an example. We will walk through each stage and note the choices we make while developing the model; these choices may vary depending on the practitioner and the end goal.
Problem Statement: Consider building a paymaster on Ethereum or having to calculate the fees someone might have paid over the year.
Paymasters are a valuable tool for abstracting the payment of transaction fees on the Ethereum network. They allow users to pay for their transactions in various ways without holding Ether or sacrificing custody of their account. They also allow users on one chain to pay for transactions on other chains under the umbrella of account abstraction. We assume we are building a universal paymaster, which is blockchain agnostic.
This problem also works the same way if a person wants to calculate the fees they have paid over the past year, but for clarity, we will move forward considering the paymaster example.
Given the chain and a rough transaction time of day, we want to build a model that outputs the balance the paymaster should maintain. In other words, our model should output the fees in the chain's native asset (like ETH, SOL, or AVAX) or the final amount.
The above diagram showcases the steps in the lifecycle of an ML model.
Step 1: Data Collection
The foundation of any machine learning project lies in the quality and relevance of the data. This step focuses on gathering data from various sources, such as databases, APIs, or web scraping. Ensuring that the collected data aligns with the problem statement and contains sufficient information to train our models effectively is essential.
Our Example:
We want to collect data from the past year on every chain of interest; a collection sketch follows the lists below.
Chains of interest (not limited to):
- Solana
- Avalanche
- Base
- Ethereum
- Scroll
- Arbitrum
- Optimism
Attributes to capture:
- Date
- Time
- Gas fees in native units (gwei for Ethereum, lamports for Solana)
- Liveness data
- Application data (Some applications cost more than others)
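As a minimal sketch of what collection could look like, the snippet below samples historical base fees from an Ethereum node using web3.py. The RPC URL, block range, and sampling step are placeholder assumptions, and each other chain would need its own client (e.g., a Solana RPC client for lamport fees).

from web3 import Web3

# Hypothetical RPC endpoint; substitute a real provider URL.
w3 = Web3(Web3.HTTPProvider("https://eth-mainnet.example.com"))

def sample_base_fees(start_block, end_block, step):
    """Sample block timestamps and base fees (converted to gwei) over a range."""
    rows = []
    for number in range(start_block, end_block, step):
        block = w3.eth.get_block(number)
        rows.append({
            "timestamp": block["timestamp"],                # Unix seconds
            "base_fee_gwei": block["baseFeePerGas"] / 1e9,  # wei -> gwei
        })
    return rows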
Step 2: Data Preparation
Raw data is rarely ready for immediate use in machine learning models. Data preparation involves cleaning, preprocessing, and transforming the collected data into a suitable format. This includes handling missing values, dealing with outliers, normalizing or scaling features, and encoding categorical variables. Data preparation is critical to ensuring the quality and consistency of the input data, as it directly impacts the model's performance.
Our Example:
- We want to clean the data and fill in missing values
- Missing values can occur if a chain is down for maintenance or due to liveness issues
- In these cases, either fill the fees with 0 or copy over the previous day's values
- Handle conversion from gwei and lamports to ETH/SOL and any other native asset for ease of use; a pandas sketch of these steps follows this list.
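Here is a minimal sketch of these steps in pandas, assuming the collected rows were loaded into a DataFrame; the column names carry over from the collection sketch above and are assumptions.

import pandas as pd

# Assumed schema: one row per sampled block per chain.
df = pd.DataFrame(rows)  # rows from the collection step
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s")

# Fill gaps from downtime: carry the previous value forward, else use 0.
df["base_fee_gwei"] = df["base_fee_gwei"].ffill().fillna(0)

# Convert native units to the main asset: 1 ETH = 1e9 gwei (1 SOL = 1e9 lamports).
df["base_fee_eth"] = df["base_fee_gwei"] / 1e9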
Step 3: Exploratory Data Analysis (EDA)
Before diving into model building, gaining insights and understanding of the data through exploratory data analysis is crucial. EDA involves visualizing the data, examining distributions, identifying patterns, and uncovering relationships between variables. This step helps in feature selection and hypothesis generation and guides the choice of appropriate machine learning algorithms. EDA also aids in detecting any data quality issues or anomalies that may require further attention.
Our Example: We perform EDA to gain insights from the data; analysis and visualization help inform how we structure the data and which models we choose.
For example, visualize the gas price on Ethereum for the last year.
This image shows that, except for particular periods around April to June, gas prices usually stayed below 50 gwei.
We can also make a heatmap of gas prices for the past week or longer to gain further insight.
Gas mostly stays normal during the day and spikes between 15:00 and 18:00 UTC.
These are just a few possible exploratory analyses.
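As a sketch, the snippet below reproduces the two views described above with pandas and matplotlib: a daily gas-price line chart and an hour-by-weekday heatmap. It continues from the df built in the preparation step, so the column names are assumptions.

import matplotlib.pyplot as plt

# Daily median gas price over the year.
daily = df.set_index("timestamp")["base_fee_gwei"].resample("D").median()
daily.plot(title="Ethereum gas price (daily median, gwei)")
plt.show()

# Heatmap: median gas by hour of day vs. day of week.
df["hour"] = df["timestamp"].dt.hour
df["weekday"] = df["timestamp"].dt.dayofweek
pivot = df.pivot_table(values="base_fee_gwei", index="hour", columns="weekday", aggfunc="median")
plt.imshow(pivot, aspect="auto")
plt.xlabel("day of week (0 = Monday)")
plt.ylabel("hour (UTC)")
plt.colorbar(label="gwei")
plt.show()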
Step 4: Feature Engineering
Feature engineering is the process of creating new features or transforming existing ones to improve the predictive power of machine learning models. It involves domain knowledge and creativity to extract meaningful information from raw data. Feature engineering is vital in enhancing model performance and capturing complex relationships within the data.
Our Example:
- We only need the approximate amount of gas to maintain in the paymaster
- Assuming gas prices are cyclical, prices observed last year should roughly repeat this year
We can make feature engineering choices as follows:
- Bucket months of the year into quarters
- Note any specific chain-level trends
- For example, the chart above shows that Ethereum gas was higher from April to June 2023.
- Someone could also note that since Solana season began in October, Solana gas prices might have been higher in the last quarter.
- We can also opt to bucket by day.
- We do not need an exact prediction of gas, so we can also bucket gas into ranges like (1-25 gwei), ..., (100+ gwei); a pandas sketch of these transformations follows this list.
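Here is a minimal sketch of these transformations, continuing from the df above; the bucket edges and labels are illustrative assumptions, not tuned values.

import pandas as pd

# Quarter bucket (1-4) to capture the assumed yearly cyclicality.
df["quarter"] = df["timestamp"].dt.quarter

# Coarse gas buckets instead of exact values; edges are assumptions.
df["gas_bucket"] = pd.cut(
    df["base_fee_gwei"],
    bins=[0, 25, 50, 100, float("inf")],
    labels=["1-25", "26-50", "51-100", "100+"],
)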
Step 5: Model Selection
Choosing the correct machine learning algorithm is critical to the success of a machine learning project. The choice depends on various factors, such as the nature of the problem (classification, regression, clustering), the size and complexity of the data, and the interpretability requirements. Popular algorithms include linear regression, logistic regression, decision trees, random forests, support vector machines, and neural networks. When selecting an algorithm, it is essential to consider the trade-offs between model performance, computational complexity, and interpretability.
Our Example:
- Our feature choices and the earlier decision to bucket gas costs instead of predicting granular values convert the problem from a regression problem (predict a value) to a classification problem (predict the bucket, i.e., a range of gas)
- Within classification models, we can choose between logistic regression and decision trees for ease of interpretability and understanding.
- Since we have more than two gas buckets, this is a multiclass classification problem rather than a binary one.
- We can choose decision trees because of their interpretability and if-else-type rule extraction; a quick comparison sketch follows this list.
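One way to compare the two candidates (not strictly required for this walkthrough) is cross-validated accuracy with scikit-learn. The feature matrix X (encoded chain, application, quarter, hour) and bucket labels y are assumed to come from the earlier steps.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5),
}
for name, model in candidates.items():
    # 5-fold cross-validated accuracy on the prepared data.
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")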
Step 6: Model Training
Once the features are engineered and the algorithm is selected, it is time to train the machine learning model. The training involves feeding the prepared data into the chosen algorithm and optimizing the model's parameters to minimize the prediction error. The data is typically split into training and validation sets to assess the model's performance during training. Hyperparameter tuning is performed to find the best model parameter combination that yields optimal results.
Our Example:
- When dealing with large datasets, toy models can be trained on a subset to check whether our feature engineering and model selection were sound.
- In this case, for example, let's say we train a decision tree model and it yields the following rules:
if month in ("Jan", "Feb", "Mar"):
    if hour_utc <= 12:  # up to 12:00 UTC
        return "50 gwei"
    else:
        return "100 gwei"
elif month in ("Apr", "May", "Jun"):
    if hour_utc <= 12:
        return "50 gwei"
    else:
        return "100 gwei"
- These are not all the rules the model would yield, but they provide a rough idea given our chosen subset.
- We also need chain-level granularity, because we only trained the model on Ethereum.
- Once a toy model is trained on the subset of the data, the final model can be trained on the full data; a short training sketch follows this list.
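A minimal sketch of training the toy tree and extracting its rules with scikit-learn; X, y, and feature_names are assumed to come from the feature engineering step above.

from sklearn.tree import DecisionTreeClassifier, export_text

# Shallow tree keeps the extracted rules readable.
tree = DecisionTreeClassifier(max_depth=4)
tree.fit(X, y)  # toy model trained on the Ethereum subset

# Human-readable if/else rules, like the hand-written example above.
print(export_text(tree, feature_names=feature_names))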
Step 7: Model Evaluation
After training the model, evaluating its performance on unseen data is crucial to assess its generalization ability. This step involves using evaluation metrics specific to the problem type, such as accuracy for classification tasks, or mean squared error for regression tasks. Cross-validation techniques are often employed to obtain more robust performance estimates. If the model's performance is unsatisfactory, it may be necessary to iterate back to earlier steps, such as feature engineering or model selection.
Our Example:
- Once we have a model (in this case, a decision tree), we can evaluate it on a test set.
- Since we did not split the data into training, validation, and test sets, we will choose random dates outside of the training set and test our model.
- For example, given the input ("Ethereum", "Blur", "10am UTC", "April"), we would expect the output: 100+ gwei. A sketch of this check follows this list.
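Here is a sketch of that check, assuming the held-out dates were encoded into X_test and y_test the same way as the training data.

from sklearn.metrics import accuracy_score, classification_report

y_pred = tree.predict(X_test)
print("test accuracy:", accuracy_score(y_test, y_pred))

# Per-bucket precision/recall shows which gas ranges the model confuses.
print(classification_report(y_test, y_pred))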
Conclusion
We walked through an example of developing an approximate gas estimator model for a paymaster, keeping the development cycle simple for clarity. We also observed that data collection, data analysis, and feature engineering all have a heavy downstream effect on the types of problems that we solve.
In future examples, we will introduce more advanced methods like holdout sets and challenge sets for better evaluation of models. We will also consider formal evaluation metrics like accuracy, which lead to insights about the quality of trained models and the potential issues arising from overfitting / underfitting.