Submission done by: Ng Hong Jin, Jonathan
Email: ngjonathan2@gmail.com
This project aims to leverage historical same-day forecasted weather data to implement predictive algorithms that identify and learn from patterns associated with varying solar panel efficiency levels. The repository contains a notebook, eda.ipynb, which conducts Exploratory Data Analysis on the provided datasets weather.db and air_quality.db.
An end-to-end machine learning pipeline is then designed and implemented to ingest and process these datasets before feeding them into the machine learning algorithms. For more details, see the attached PDF for the in-depth problem statement and requirements.
- `.github/`: Scripts to execute the end-to-end machine learning pipeline using GitHub Actions
- `data/`: Contains database files (not submitted)
- `results/`: Results of model training
  - `feature_importance.png`: Visualization of feature importances
  - `model_results.csv`: Performance metrics of models
- `src/`: Source code files
  - `constants.py`: Constant variables
  - `data_ingestion.py`: Data loading scripts
  - `data_preprocessing.py`: Data cleaning and preparation
  - `model.py`: ML model implementation
  - `main.py`: Main execution script
- `eda.ipynb`: Exploratory Data Analysis notebook
- `requirements.txt`: Required Python packages
- `run.sh`: Bash script to run the project
- Clone this repository
- Place the `weather.db` and `air_quality.db` files in a `data` folder. The machine learning pipeline retrieves the datasets using the relative paths `data/weather.db` and `data/air_quality.db`.
  Note: GitHub Actions will fail if the `.db` files are not added to the `data/` folder.
- Run the bash script: `./run.sh`
- Data Ingestion
  - Source data from the `weather.db` and `air_quality.db` SQLite databases
  - Use a custom `DataIngestion` class to connect and extract data
  - Convert the extracted data into pandas DataFrames for further processing
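A minimal sketch of how such an ingestion step might look. The `DataIngestion` class name comes from the repository; the `load_table` method, the demo table, and its columns are assumptions for illustration, since the real `.db` files are not submitted:

```python
import os
import sqlite3
import tempfile

import pandas as pd

class DataIngestion:
    """Connect to a SQLite database and extract tables as pandas DataFrames."""

    def __init__(self, db_path: str):
        self.db_path = db_path

    def load_table(self, table_name: str) -> pd.DataFrame:
        # The context manager commits the transaction when the block exits
        with sqlite3.connect(self.db_path) as conn:
            return pd.read_sql_query(f"SELECT * FROM {table_name}", conn)

# Demo on a throwaway database standing in for data/weather.db
path = os.path.join(tempfile.mkdtemp(), "weather.db")
with sqlite3.connect(path) as conn:
    conn.execute("CREATE TABLE weather (date TEXT, max_temp REAL)")
    conn.execute("INSERT INTO weather VALUES ('2024-01-01', 31.5)")

df = DataIngestion(path).load_table("weather")
```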
- Data Preprocessing
  - Handle missing values in numerical columns using mean imputation
  - Clean and standardize categorical data (e.g., wind direction, dew point categories)
  - Remove outliers using the Interquartile Range (IQR) method
  - Encode categorical variables using techniques like one-hot encoding or label encoding
  - Normalize numerical features using StandardScaler so that all features are on the same scale
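The imputation and outlier steps above can be sketched as follows; the helper name and the toy column are my own, while the mean imputation and the 1.5 × IQR rule come from the pipeline description:

```python
import pandas as pd

def impute_and_remove_outliers(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Mean-impute numeric columns, then drop rows outside 1.5 * IQR."""
    out = df.copy()
    for c in cols:
        out[c] = out[c].fillna(out[c].mean())   # mean imputation
        q1, q3 = out[c].quantile([0.25, 0.75])
        iqr = q3 - q1                           # interquartile range
        lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        out = out[out[c].between(lo, hi)]       # keep only in-range rows
    return out

# One missing value and one extreme value in a hypothetical temperature column
cleaned = impute_and_remove_outliers(
    pd.DataFrame({"max_temp": [30.1, 31.2, None, 29.8, 95.0]}), ["max_temp"]
)
```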
- Feature Engineering
  - Select the most important features using techniques like correlation analysis and feature importance from tree-based models
- Model Training
  - Split the preprocessed data into training and testing sets
  - Implement multiple models:
    - Random Forest Classifier
    - Support Vector Machine (SVM)
    - Gradient Boosting (XGBoost)
  - Perform hyperparameter tuning using GridSearchCV with cross-validation
  - Train each model on the training data with the best hyperparameters
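The training steps can be sketched like this for one of the models; the synthetic 3-class dataset and the deliberately tiny parameter grid are stand-ins for the real data and grids:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 3-class problem standing in for the Low/Medium/High efficiency labels
X, y = make_classification(
    n_samples=300, n_classes=3, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Cross-validated grid search, then the refit best estimator scores the test set
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=3,
)
grid.fit(X_train, y_train)
test_accuracy = grid.score(X_test, y_test)
```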
- Model Evaluation
  - Predict solar panel efficiency on the test set using each trained model
  - Calculate and compare performance metrics:
    - Accuracy
    - Precision, Recall, and F1-score for each efficiency class
    - ROC AUC score for multi-class classification
  - Analyze feature importance to understand key predictors of solar panel efficiency
  - Select the best-performing model based on evaluation metrics
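A hedged sketch of the evaluation step, using a logistic regression on synthetic data as a stand-in for the trained models; the metric calls are the ones named above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the efficiency labels
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
# One-vs-rest ROC AUC extends the binary score to the multi-class setting
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test), multi_class="ovr")
```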
- Class Imbalance
  - Finding: Uneven distribution of efficiency labels (Low, Medium, High)
  - Action: Implemented SMOTENC for balanced class representation
- Outlier Detection
  - Finding: Presence of outliers in the dataset
  - Action: Implemented robust scaling and outlier removal techniques in preprocessing, except for rainfall data, whose variance is extremely large and whose distribution is heavily skewed
- Missing Data Patterns
  - Finding: Some features had systematic missing values
  - Action: Developed a custom imputation strategy
- No Observable Temporal Patterns
  - Finding: Temporal patterns do not lead to any distinguishable differences in efficiency
  - Action: Removed the date from the set of potential features
| Feature Type | Processing Method | Rationale |
|---|---|---|
| Numerical Features | StandardScaler (optional, determined by GridSearchCV) | Standardization brings all numerical features to the same scale, which is crucial for algorithms like SVM. For tree-based models (Random Forest, XGBoost), I let the hyperparameter tuning decide if scaling improves performance |
| Categorical Features | OneHotEncoder (with handle_unknown='ignore') | One-hot encoding creates binary columns for each category, allowing models to work with categorical data. The 'ignore' option handles any unknown categories in future data |
| Target Variable (Daily Solar Panel Efficiency) | LabelEncoder | Converts categorical efficiency labels to numerical values |
- Feature Selection:
  - Method: SelectKBest with f_classif
  - Rationale: Selects the top 20 most informative features, reducing dimensionality and potentially improving model performance
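The selection step above can be sketched as follows; the 30 synthetic features stand in for the real weather and air-quality columns, while `k=20` matches the described setup:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 30 synthetic features standing in for the weather and air-quality columns
X, y = make_classification(n_samples=200, n_features=30, n_informative=8, random_state=0)

# Keep the 20 features with the highest ANOVA F-statistic against the target
selector = SelectKBest(f_classif, k=20)
X_selected = selector.fit_transform(X, y)
```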
- Handling Imbalanced Data:
  - Method: SMOTENC (Synthetic Minority Over-sampling Technique for Nominal and Continuous features)
  - Rationale: Addresses class imbalance by creating synthetic examples of the minority classes, helping the model learn to predict all classes effectively
- Hyperparameter Tuning:
  - Method: GridSearchCV with StratifiedKFold cross-validation
  - Rationale: Optimizes model hyperparameters, including preprocessing steps, to find the best combination for each model type
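Tuning a preprocessing step alongside the model can be sketched like this: the grid treats the scaler itself as a searchable parameter, which is how the "optional StandardScaler" decision described earlier can be delegated to GridSearchCV. The synthetic data and the small grid are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_informative=6, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
param_grid = {
    # Let the search decide whether scaling helps at all
    "scale": [StandardScaler(), "passthrough"],
    "clf__C": [0.1, 1, 10],
}
grid = GridSearchCV(
    pipe, param_grid, cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
)
grid.fit(X, y)
```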
After model training, I performed feature importance analysis:
- For tree-based models (Random Forest, XGBoost): I used the built-in feature_importances_ attribute
- For other models (e.g., SVM): I used permutation importance
This analysis helps identify which features are most crucial for predicting solar panel efficiency, providing insights for future data collection and model refinement.
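For the non-tree models, the permutation-importance step can be sketched as below; the synthetic data is an assumption, and the SVM stands in for the fitted best model:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)
model = SVC().fit(X, y)

# Shuffle each feature in turn and measure the resulting drop in score;
# a large drop means the model relied heavily on that feature
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
ranking = result.importances_mean.argsort()[::-1]  # most important first
```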
For my solar panel efficiency prediction task, I have selected three diverse machine learning models, each with its own strengths and characteristics. My goal was to compare different approaches and identify the most effective model for the specific problem.
Rationale for Selection:
- Handles non-linear relationships well, which is crucial for complex environmental data
- Provides feature importance rankings, offering insights into key predictors of solar panel efficiency
- Relatively robust to outliers and non-scaled features
- Performs well with both numerical and categorical data
- Less prone to overfitting due to its ensemble nature
Hyperparameters Tuned:
- `n_estimators`: Number of trees in the forest
- `max_depth`: Maximum depth of the trees
- `min_samples_split`: Minimum number of samples required to split an internal node
- `min_samples_leaf`: Minimum number of samples required to be at a leaf node
Rationale for Selection:
- Effective in high-dimensional spaces, which is relevant given the numerous weather and air quality features
- Often performs well when there's a clear margin of separation between classes
- Robust against overfitting in high dimensional spaces
Hyperparameters Tuned:
- `C`: Regularization parameter
- `kernel`: Kernel type to be used in the algorithm
- `gamma`: Kernel coefficient for 'rbf', 'poly' and 'sigmoid' kernels
Rationale for Selection:
- Known for its high performance and speed
- Implements regularization, helping to prevent overfitting
- Often outperforms other algorithms in structured/tabular data
- Provides feature importance, similar to Random Forest
Hyperparameters Tuned:
- `n_estimators`: Number of gradient boosted trees
- `max_depth`: Maximum tree depth for base learners
- `learning_rate`: Boosting learning rate
- `subsample`: Subsample ratio of the training instances
- `colsample_bytree`: Subsample ratio of columns when constructing each tree
I evaluated three machine learning models for predicting solar panel efficiency: Random Forest, Support Vector Machine (SVM), and XGBoost. Each model was assessed using various metrics to ensure a comprehensive understanding of their performance.
- Accuracy: Overall correctness of predictions across all classes
- ROC AUC Score: Ability to distinguish between classes, accounting for class imbalance
- Precision: Ratio of correct positive predictions to total positive predictions
- Recall: Ratio of correct positive predictions to all actual positives
- F1-score: Harmonic mean of precision and recall
| Model | Accuracy | ROC AUC Score | Macro Avg F1-Score |
|---|---|---|---|
| Random Forest | 0.8667 | 0.9782 | 0.8671 |
| SVM | 0.9200 | 0.9788 | 0.9191 |
| XGBoost | 0.8933 | 0.9839 | 0.8929 |
- Support Vector Machine (SVM)
  - Best overall performance with the highest accuracy (92.00%) and macro average F1-score (0.9191)
  - Excellent balance between precision and recall across all efficiency classes
  - Particularly strong in identifying 'Low' efficiency cases (90.24% precision, 98.67% recall)
- XGBoost
  - Second-best in accuracy (89.33%) with the highest ROC AUC score (0.9839)
  - Strong performance in the 'High' efficiency class (92.75% precision, 85.33% recall)
  - Balanced performance across all classes
- Random Forest
  - Slightly lower overall accuracy (86.67%) but still a high ROC AUC score (0.9782)
  - Strong in identifying 'Low' efficiency cases (89.61% precision, 92.00% recall)
  - Relatively weaker in the 'Medium' efficiency class compared to the other models
All models showed varying performance across different efficiency classes (Detailed breakdown in results/model_results.csv):
- High Efficiency: SVM achieved the best balance (92.11% precision, 93.33% recall)
- Low Efficiency: All models performed well, with SVM having the highest F1-score (0.9427)
- Medium Efficiency: XGBoost and SVM showed similar performance, with SVM having a slight edge in precision
While all models demonstrate strong performance, the Support Vector Machine (SVM) stands out as the best overall model for this task:
- Highest accuracy and macro average F1-score
- Most consistent performance across all efficiency classes
- Balances precision and recall effectively, crucial for practical application
This is not surprising: SVMs, through their use of kernels, handle high feature dimensionalities well, and the input here has already been reduced to the top 20 most important features.
Feature importance analysis using permutation importance on the best-performing SVM model identified the following key features influencing solar panel efficiency:
- Max Wind Speed (km/h): The most important feature by a significant margin
- psi_west: The second most important feature, indicating that air quality in the western region significantly impacts efficiency
- psi_central: Central region air quality is also highly influential, suggesting that air quality, in general, is a key factor
- Maximum Temperature (deg C): Temperature remains a significant factor affecting panel performance
- pm25_south: Particulate matter levels in the southern region round out the top five features
This analysis reveals that a combination of weather conditions (wind, temperature) and air quality measures are critical in predicting solar panel efficiency.
Key factors to consider for model deployment:
- Real-time Predictions: For immediate efficiency forecasts, optimize the model and infrastructure for low-latency responses
- Batch Predictions: For daily or weekly efficiency planning, set up a batch prediction system that can handle large volumes of data efficiently
- Automated Data Collection: Implement systems to automatically collect and preprocess new weather and air quality data
- Data Validation: Develop robust data validation checks to ensure incoming data meets expected formats and ranges
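The data-validation point can be sketched as a simple batch check; the helper name and the range thresholds are my own, while the column names are taken from the feature-importance analysis above:

```python
import pandas as pd

EXPECTED_COLUMNS = {"Max Wind Speed (km/h)", "Maximum Temperature (deg C)"}

def validate_batch(df: pd.DataFrame) -> list:
    """Return a list of problems found in an incoming data batch."""
    problems = []
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    temp_col = "Maximum Temperature (deg C)"
    # Flag physically implausible readings before they reach the model
    if temp_col in df.columns and not df[temp_col].between(-10, 60).all():
        problems.append("temperature outside plausible range")
    return problems

report = validate_batch(
    pd.DataFrame({"Max Wind Speed (km/h)": [12.0], "Maximum Temperature (deg C)": [99.0]})
)
```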
When developing my solar panel efficiency prediction model, I needed to carefully consider the use of potentially synthetic data. The key assumptions that I'm taking are:
- Any synthetic features in the dataset maintain the underlying relationships present in real-world data
- The process of creating synthetic data hasn't accidentally introduced biases that don't exist in real-world scenarios
