Incident solar radiation, building energy consumption, and other environmental analyses rely on computationally expensive simulations. Forecasting a building's energy consumption requires environmental simulations and digital 3D models that represent the building's physical properties. These simulations are typically slow because they involve thermodynamic computations, fluid simulations, or both. In addition, the 3D model needs a minimum level of geometric detail to produce accurate results. As a consequence, energy consumption analysis is expensive and complicated to perform.
Machine Learning has the potential to speed up the process by simplifying the model and learning to approximate the simulation results. It can reduce the need for energy simulation in early design stages, allowing the designer to explore optimal design solutions. In addition, the knowledge gained can be transferred from one model to another to provide accurate results.
In this article, we present our initial findings on predicting cumulative solar radiation with a simple Machine Learning model. The model was trained on synthetic data from different simulations generated with Grasshopper, Ladybug, and the Laga library. The article on how the data was generated can be found here: Synthetic data with Rhino-Grasshopper, Ladybug and Laga library
TRAINING AND TESTING MODELS
A learning curve in Machine Learning is a graph that compares a model's performance on the training and testing sets as the number of training instances grows. Performance should generally improve as the number of training points increases. Nevertheless, this is not always the case.
By separating the training and testing sets and graphing performance on each, it is possible to understand how well the model generalizes to unseen data. A learning curve allows us to verify when a model has learned as much as it can about the data. It also makes it possible to understand whether the model is suffering from variance or bias.
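As a rough illustration of how such curves can be computed, here is a minimal sketch using scikit-learn's learning_curve with a DecisionTreeRegressor. The data is a synthetic stand-in generated on the spot, not the Grasshopper/Ladybug dataset, and the choice of R² as the score, the 5-fold split, and the helper name curves_for_depth are all assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): three input features and a noisy,
# nonlinear target playing the role of cumulative solar radiation.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(2000, 3))
y = 100.0 * X[:, 0] + 50.0 * np.sin(6.0 * X[:, 1]) + rng.normal(0.0, 5.0, 2000)


def curves_for_depth(max_depth):
    """Return training sizes and mean train/test R^2 scores for one tree depth."""
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    sizes, train_scores, test_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 8),  # 10% to 100% of each training fold
        cv=5,
        scoring="r2",
        shuffle=True,
        random_state=0,
    )
    # Average over the cross-validation folds to get one curve per set.
    return sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)


# Same depths as the graphs discussed in this section.
results = {depth: curves_for_depth(depth) for depth in (1, 3, 6, 10)}

for depth, (sizes, train_mean, test_mean) in results.items():
    gap = train_mean[-1] - test_mean[-1]
    print(f"max_depth={depth:2d}  train={train_mean[-1]:.3f}  "
          f"test={test_mean[-1]:.3f}  gap={gap:.3f}")
```

Plotting train_mean and test_mean against sizes for each depth gives one learning curve per tree depth. The printed gap at the largest training size is a quick numeric hint of how far apart the two curves remain.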
max_depth = 1: The training curve decreases as more training points are added, and the testing curve increases as more training points are added. Both curves stabilize around 1,000 training points, which means that adding more data will not make the model perform better. In addition, both curves converge to a low score as the training set grows, so the model will not benefit much from additional training data.
max_depth = 3: The training curve decreases gently and continuously and stays very close to a score of 0.8. The testing curve rises above 0.7 after 100 training points. Nevertheless, as more training points are added, the curve begins to flatten. Both curves converge around a score of 0.8. Increasing the number of training points will have a very small impact on the overall score.
max_depth = 6: The score is above 0.8, but the curves do not converge to a point as in the previous graphs. Most probably, more training points are needed.
max_depth = 10: The training curve is much higher than the validation curve, although they tend to converge at some point. Adding more training points will most likely improve generalization and the algorithm's performance.
The gap between the learning curves indicates that the training dataset is insufficient for the model to do well on the testing set; the training dataset may not be large enough. In principle, the idea is to keep the gap between the curves as small as possible.
This graph is similar to the previous learning curves, but here the X axis compares different decision tree depths (maximum depth). The key point is to understand where the model changes from bias (underfitting) to variance (overfitting). For the final model and test, we selected a depth of 12, although the graph shows some variance.
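A depth comparison of this kind can be sketched with scikit-learn's validation_curve, which scores the model across a range of values for one hyperparameter. This is a minimal example under assumptions: the data is synthetic, the score is R², and the depth range of 1 to 14 is arbitrary; it does not reproduce the depth of 12 chosen in the article.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption), not the simulation dataset.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(1500, 3))
y = 100.0 * X[:, 0] + 50.0 * np.sin(6.0 * X[:, 1]) + rng.normal(0.0, 5.0, 1500)

# Candidate tree depths for the X axis of the complexity graph.
depths = np.arange(1, 15)

train_scores, test_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="r2",
)

# One mean score per depth, averaged over the cross-validation folds.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)

# Depth with the best mean testing score on this synthetic data.
best_depth = int(depths[np.argmax(test_mean)])

for depth, tr, te in zip(depths, train_mean, test_mean):
    print(f"max_depth={depth:2d}  train={tr:.3f}  test={te:.3f}")
print("best depth by testing score:", best_depth)
```

Low scores on both sets at small depths point to bias. A training score that keeps climbing while the testing score stalls or drops points to variance.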
In the left image, we have selected the parameters most strongly correlated with the variable total. These 3 parameters have the greatest influence on the machine learning model's prediction of total solar radiation.
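One simple way to rank parameters against the target is the Pearson correlation coefficient. The sketch below is only an assumption about how such a ranking could be done: the article does not state which correlation measure was used, and the column names (area, height, tilt, noise) and values are hypothetical. Only the target name total comes from the text.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def top_correlated(columns, target, k=3):
    """Return the k column names with the largest absolute correlation to target."""
    scores = {
        name: pearson(values, columns[target])
        for name, values in columns.items()
        if name != target
    }
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)[:k]


# Hypothetical toy table: every column except "total" is invented for illustration.
columns = {
    "area":   [10.0, 20.0, 30.0, 40.0, 50.0],
    "height": [3.0, 3.5, 3.0, 4.0, 3.5],
    "tilt":   [5.0, 12.0, 18.0, 33.0, 38.0],
    "noise":  [5.0, 1.0, 4.0, 2.0, 3.0],
    "total":  [105.0, 198.0, 310.0, 395.0, 502.0],
}

top3 = top_correlated(columns, "total")
print(top3)
```

Ranking by absolute value keeps strongly negative correlations in play, since a parameter that reduces the total can be as informative as one that increases it.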
In the right image, you can see 11 different and independent tests. The diamond represents the actual value from the simulation, and the blue dot represents the prediction from the machine learning model.
Of the 11 tests, 10 are quite close to the actual value, or at least within 10% of it. The model is far from perfect, but it is certainly good progress.
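The 10% criterion can be checked with a relative error computed per test. The following is a minimal sketch; the (simulated, predicted) pairs are hypothetical values invented for illustration, not the 11 results from the article.

```python
def relative_error(actual, predicted):
    """Absolute prediction error as a fraction of the simulated value."""
    return abs(predicted - actual) / abs(actual)


def within_tolerance(pairs, tolerance=0.10):
    """Flag each (actual, predicted) pair whose relative error is below tolerance."""
    return [relative_error(actual, predicted) < tolerance
            for actual, predicted in pairs]


# Hypothetical (simulated, predicted) values, not the article's test results.
pairs = [(1000.0, 960.0), (850.0, 905.0), (1200.0, 1390.0), (640.0, 655.0)]

flags = within_tolerance(pairs)
hits = sum(flags)
print(f"{hits} of {len(pairs)} predictions are within 10% of the simulation")
```

Reporting the count alongside the individual errors makes it easy to spot which tests fall outside the tolerance.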