The new synthetic data informed the ML model and reduced the prediction error by more than half. In the previous test, the mean distance between prediction and real measurement was ~23.70. It has now been reduced to ~8.25.
One of the main advantages of synthetic data is that it can be moulded to the problem. For the revisited exercise, the features
length were replaced with
Area5. These new features represent the facade areas of the building. In addition to these new features,
ang5 were also added; they represent the angle between the normal of each facade surface and the project North.
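As an illustration of how such an angle feature could be derived, here is a minimal sketch. It assumes each facade normal is given as a 2D plan vector and that project North points along +Y; the function name `facade_angle_to_north` and both conventions are hypothetical and not taken from the original exercise.

```python
import math


def facade_angle_to_north(normal_xy, north_xy=(0.0, 1.0)):
    """Return the unsigned angle, in degrees, between a facade normal and North.

    Both vectors are 2D plan vectors. The +Y default for North is an
    assumption made for this sketch.
    """
    nx, ny = normal_xy
    rx, ry = north_xy
    length_product = math.hypot(nx, ny) * math.hypot(rx, ry)
    if length_product == 0:
        raise ValueError("vectors must be non-zero")
    cos_angle = (nx * rx + ny * ry) / length_product
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


# A facade facing east is at 90 degrees from North, one facing south at 180.
east_angle = facade_angle_to_north((1.0, 0.0))
south_angle = facade_angle_to_north((0.0, -1.0))
```

The result is unsigned (0 to 180 degrees), so east and west facades get the same value. If the original features distinguish them, a signed angle would be needed instead.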
A very useful tool for understanding the shape of the data is the correlation matrix. It shows the relationship between each pair of variables in the dataset. This relationship is typically quantified with the Pearson correlation coefficient:
- -1 indicates a perfectly negative linear correlation between two variables
- 0 indicates no linear correlation between two variables
- 1 indicates a perfectly positive linear correlation between two variables
Therefore, the further the correlation coefficient is from 0, the stronger the linear relationship between the two variables.
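The coefficient described above can be computed by hand in a few lines. The sketch below uses only the standard library. The column values are hypothetical toy data, not the project's dataset, and the `inverse` column is invented purely to show a negative correlation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Build a nested dict of pairwise Pearson coefficients."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy data: "total" grows with "volume", "inverse" shrinks.
data = {
    "volume": [100.0, 200.0, 300.0, 400.0],
    "total": [10.0, 20.0, 30.0, 40.0],
    "inverse": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(data)
```

Here `matrix["volume"]["total"]` is 1 (perfectly positive) and `matrix["volume"]["inverse"]` is -1 (perfectly negative). With a pandas DataFrame, `DataFrame.corr()` returns the same kind of matrix, using Pearson by default.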
Correlation matrix comparison between the previous dataset on the left and the new dataset on the right.
The correlation matrix on the left is from the previous test. The prediction of the
total target is mainly based on 3 features:
volume. The features
width have a limited influence on the ML model. The other 3 features do not participate in the prediction, which is why a white mask covers their cells.
The correlation matrix on the right incorporates the new features described previously. For this exercise, the prediction of the
total is based on 6 features:
volume. The other 6 features, related to the angles of the building, have a small impact on the prediction. It is worth mentioning that the angle in the revisited exercise has a correlation coefficient of 0.056, compared with 0.043 in the previous test, so it gains importance in predicting the feature total.
The newly generated synthetic data added useful features for predicting the total solar radiation. In addition, the new data improved the distribution of the feature weights.
The images above show the predictions against the solar radiation measurements from the software. The X axis indicates the different independent tests, from A to K on the left and from A to M on the right. The Y axis indicates the total solar radiation. The diamonds represent the real measurements and the blue circles the predictions.
The tests inside the circles are those where measurement and prediction are very close. It is easy to see how much more accurate the second test (blue circles) was in comparison to the first test, in green. The mean distance between measurement and prediction is 23.7047 in the first graphic and 8.2596 in the revisited version. The new prediction is therefore considerably more accurate than the previous one.
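To make the comparison concrete, here is a minimal sketch of the calculation. It assumes that "mean distance" means the mean absolute difference between measurement and prediction, which the text does not state explicitly. The sample lists are hypothetical; only 23.7047 and 8.2596 come from the results above.

```python
def mean_absolute_distance(measured, predicted):
    """Average of |measurement - prediction| over all tests."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


def relative_reduction(before, after):
    """Fraction by which the error dropped, relative to the earlier value."""
    return (before - after) / before


# Hypothetical values, only to show how the metric is computed.
measured = [100.0, 150.0, 200.0]
predicted = [90.0, 155.0, 215.0]
toy_error = mean_absolute_distance(measured, predicted)  # (10 + 5 + 15) / 3

# Mean distances reported for the first and the revisited test.
reduction = relative_reduction(23.7047, 8.2596)
print(f"error reduced by {reduction:.1%}")
```

With the reported values the reduction comes out at roughly 65%, meaning the revisited test leaves about a third of the original error.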
An interesting point is that the ML model did not change at all. Only the data used to train it is different.