Presented by: Amsyar & Anwar From Group 4
RESEARCH BACKGROUND
Composite materials are increasingly utilized in many engineering applications because they offer several enhanced properties and various advantages.
The Boeing 787 is 50% composites by weight and 80% by volume. The rest of the aircraft is 20% aluminum, 15% titanium, 10% steel, and 5% other materials. Boeing benefits from this structure because of the substantial weight savings: with composites making up the majority of the structure, the total weight is cut by an average of 20%. (Sumit Singh, Dec 2022)
This study uses glass fiber to represent synthetic fibers and flax fiber to represent natural fibers. Each has its own characteristics, such as strength, durability, and environmental friendliness. In a hybrid composite, the resin or adhesive that bonds the fibers together is also crucial.
Silane treatment is a common surface modification technique used to improve the properties of hybrid composites. The energy absorption of hybrid composites is an important factor to consider in many applications, including the aerospace and automotive industries.
WHAT IS SILANE
The image on the right is from the movie 3 Idiots, and the image on the left shows a silane, (3-Aminopropyl)trimethoxysilane. The two are similar in this way: Ranchodas acts like an adhesive that bonds the friends more tightly together, while silane works as a bridging agent between the inorganic and organic substrates and increases the bond strength. After silane treatment, the flexural strength of a composite typically rises by around 40%. (Pape, 2011)
PROBLEM STATEMENT
- What are the limitations of current methods for predicting energy absorption in silane-treated hybrid composites, and how can these limitations be addressed through the development of a machine-learning model?
- What data will be used to train the machine-learning model for predicting energy absorption in silane-treated hybrid composites, and how accurate will the model be in predicting energy absorption at new or unseen silane concentrations?
- In what ways will the development of a machine-learning model for predicting energy absorption in silane-treated hybrid composites provide a more efficient and cost-effective method for optimizing the performance of these composites in various applications?
- Which algorithm is most suitable for the machine learning model to obtain the best prediction result?
OBJECTIVE
- Replace manual testing at every concentration with machine learning as a first screening step before conducting experiments.
- Train the model on existing data attributes and use results to predict energy absorption for new data.
- Build a machine learning model to predict energy absorption in a silane-treated surface, so that prediction results can be obtained without conducting physical experiments first.
- Compare the performance of the regression algorithm with other algorithms to determine the best model.
TYPE OF MACHINE LEARNING TO BE USED
- The type of machine learning to be used is regression, a form of supervised learning. It is chosen because the dataset is numerical, and regression analysis always involves a numeric dependent variable. (van den Berg, S. M., n.d.)
LIST OF ABBREVIATIONS
- CV: Cross-Validation
- DT: Decision Tree
- EA: Energy Absorption
- F: Force
- LS: Lasso Regression
- LR: Linear Regression
- MAE: Mean Absolute Error
- MSE: Mean Squared Error
- OLS: Ordinary Least Squares
- R2: R-Squared
- RF: Random Forest
- RMSE: Root Mean Square Error
- RR: Ridge Regression
- X: Independent Variable/Feature (Attribute)
- Y: Dependent Variable/Target
DATA SET DESCRIPTION
The dataset was obtained from the Aerospace Department at UniKL. There are five samples with different silane concentrations (0%, 2%, 6%, and 8%), and each sample's dataset records the energy absorption together with the force and stroke applied over time. Each dataset contains 5 columns (features) and more than 500 rows. Refer to Table 1.
From Table 2, the target is identified. The target’s column name is “Energy_Absorption”.
The dataset is checked for its shape, column information, and any missing values in each sample.
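The checks above can be sketched roughly as follows. This is a minimal illustration on synthetic stand-in data, not the UniKL measurements; the column names "Time" and "Stroke" and the data-generating formulas are our own assumptions, and the real files would be loaded from disk instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one sample file (assumption: NOT the UniKL data).
rng = np.random.default_rng(0)
time = np.linspace(0.0, 60.0, 600)
df = pd.DataFrame({"Time": time})
df["Stroke"] = 0.5 * time + rng.normal(0.0, 0.05, time.size)
df["Force"] = 2.0 * np.sqrt(time) + rng.normal(0.0, 0.05, time.size)
df["Energy_Absorption"] = 2.4 * df["Force"] + rng.normal(0.0, 0.1, time.size)

print(df.shape)                      # (number of rows, number of columns)
df.info()                            # column names, dtypes, non-null counts
missing_per_column = df.isna().sum() # count of missing values per column
print(missing_per_column)
```

If any column reports missing values, those rows would need to be dropped or imputed before modelling.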
DATA ANALYSIS
Check Correlation between Features
Checking the correlations in the dataset identifies which variables are most strongly related to each other.
Based on Table 3 and the heatmap in Figure 4, we can see that one of the features, Force, has the highest correlation with the target, Energy_Absorption.
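A correlation check of this kind might look like the sketch below. It runs on synthetic stand-in data with assumed column names ("Time", "Stroke"), so the printed values will not match Table 3; a heatmap like Figure 4 could then be drawn from `corr`, for example with seaborn's `heatmap`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption: NOT the UniKL measurements).
rng = np.random.default_rng(0)
time = np.linspace(0.0, 60.0, 600)
df = pd.DataFrame({"Time": time})
df["Stroke"] = 0.5 * time + rng.normal(0.0, 0.05, time.size)
df["Force"] = 2.0 * np.sqrt(time) + rng.normal(0.0, 0.05, time.size)
df["Energy_Absorption"] = 2.4 * df["Force"] + rng.normal(0.0, 0.1, time.size)

# Pearson correlation matrix between all numeric columns.
corr = df.corr()

# Correlation of every feature with the target, strongest first.
target_corr = (
    corr["Energy_Absorption"]
    .drop("Energy_Absorption")
    .sort_values(ascending=False)
)
print(target_corr)
```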
Visualization of Features vs Target
Visualizing the relationship between features and target variables is an important step in data analysis, as it helps to identify patterns and relationships in the data.
Based on the graphs above, Figure 5 shows a direct linear relationship between Force and Energy_Absorption.
Feature Selection (using Ridge Regression)
We perform feature selection with RidgeCV, using 5-fold cross-validation and alpha values of 0.01, 0.1, 1, 10, and 100. We use Ridge regression because it shrinks the coefficients of features that contribute little to the prediction while still using all of the information in the data. As Verma, Y. (2022, April 5) notes, Ridge regression is popular because it uses regularization when making predictions, and regularization is intended to resolve the problem of overfitting. Ridge shrinks the effect of less competent features toward zero, although, unlike Lasso, it does not set their coefficients exactly to zero.
From Table 4, the coefficient of Force is 2.4460 (positive), the coefficient of Stroke is -0.0080 (negative), and the coefficient of time is 0.0091 (positive). We chose Force as our feature because the sign and relative magnitude of the coefficients indicate each feature's impact on the target variable, and Force has by far the largest coefficient.
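The RidgeCV step could be sketched as below, using the alpha grid and 5-fold cross-validation stated above. The data are synthetic stand-ins with assumed column names, so the coefficients will differ from Table 4. Note that raw coefficient magnitudes are only directly comparable when the features are on similar scales; standardizing the features first would be one way to make the comparison fairer.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV

# Synthetic stand-in data (assumption: NOT the UniKL measurements).
rng = np.random.default_rng(0)
time = np.linspace(0.0, 60.0, 600)
df = pd.DataFrame({"Time": time})
df["Stroke"] = 0.5 * time + rng.normal(0.0, 0.05, time.size)
df["Force"] = 2.0 * np.sqrt(time) + rng.normal(0.0, 0.05, time.size)
df["Energy_Absorption"] = 2.4 * df["Force"] + rng.normal(0.0, 0.1, time.size)

X = df[["Time", "Stroke", "Force"]]
y = df["Energy_Absorption"]

# Alpha grid and cv=5 follow the text; everything else is scikit-learn defaults.
alphas = [0.01, 0.1, 1, 10, 100]
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X, y)

# One coefficient per feature; larger magnitude means larger effect per unit.
coefs = pd.Series(ridge_cv.coef_, index=X.columns)
print("selected alpha:", ridge_cv.alpha_)
print(coefs)
```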
MACHINE LEARNING
Linear Regression
Linear regression is a statistical method used to model the relationship between a dependent variable (target) (Y) and one or more independent variables (features) (X).
In this model, one feature and one target are used, so linear regression is a suitable method because it provides a simple and interpretable model for predicting the target from the feature. Here X is Force and Y is Energy_Absorption.
Train-Test Split for Evaluating Machine Learning Algorithms
In 1997, Guyon discussed a new method in a paper called "A scaling law for the validation-set training-set size ratio".
The "train_test_split" function is used to split the data into 80% training data and 20% testing data. According to Detective, T. D. (2020, January 31), who tried to verify the 80/20 rule against Guyon's paper, this split produced a 2% increase in accuracy and improvements in precision, recall, and F1-score.
That is why we chose the 80/20 train-test split.
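An 80/20 split with `train_test_split` might look like this sketch. The data are synthetic stand-ins, and the `random_state` value is an arbitrary choice of ours to make the split repeatable; the source does not state one.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the UniKL measurements).
rng = np.random.default_rng(0)
time = np.linspace(0.0, 60.0, 600)
df = pd.DataFrame({"Force": 2.0 * np.sqrt(time) + rng.normal(0.0, 0.05, time.size)})
df["Energy_Absorption"] = 2.4 * df["Force"] + rng.normal(0.0, 0.1, time.size)

X = df[["Force"]]               # single feature, kept 2-D for scikit-learn
y = df["Energy_Absorption"]     # target

# test_size=0.2 gives the 80/20 split described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```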
Outlier Check
Checking for outliers is an important step in the analysis of a linear regression model. Outliers are data points that fall outside of the expected range of values and can significantly affect the results of the regression analysis.
Based on Figure 12, no outliers are observed, which suggests that the dataset is relatively consistent and uniform in its values.
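One common way to check for outliers numerically is the 1.5 x IQR rule that box plots use. The source does not say which method produced Figure 12, so the sketch below is only an assumed equivalent, run on synthetic stand-in data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption: NOT the UniKL measurements).
rng = np.random.default_rng(0)
time = np.linspace(0.0, 60.0, 600)
force = pd.Series(2.0 * np.sqrt(time) + rng.normal(0.0, 0.05, time.size), name="Force")

# Box-plot rule: points beyond 1.5 * IQR from the quartiles count as outliers.
q1 = force.quantile(0.25)
q3 = force.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = force[(force < lower) | (force > upper)]
print("number of outliers:", len(outliers))
```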
LINEAR REGRESSION MODEL
The 'linear_model.LinearRegression()' function fits a linear regression model to the given training data, which involves finding the coefficients of the linear equation that best fits the data. These coefficients are then used to make predictions on the test data.
However, note [2] in Figure 13 warns that strong multicollinearity may be present.
Based on Figure 14, an R-squared value of 0.999 indicates a near-perfect fit between the model and the data. Overall, the low MAE, MSE, and RMSE indicate that the model is making accurate predictions, and the R-squared close to 1.0 indicates that the model fits the data almost perfectly.
Next, the 'predict' method is used to generate predictions for the testing data, which can be compared to the actual target values to evaluate the performance of the model.
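Fitting, predicting, and scoring can be sketched together as follows. The data are synthetic stand-ins and the `random_state` is our own choice, so the metric values will not match Figure 14.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the UniKL measurements).
rng = np.random.default_rng(0)
time = np.linspace(0.0, 60.0, 600)
df = pd.DataFrame({"Force": 2.0 * np.sqrt(time) + rng.normal(0.0, 0.05, time.size)})
df["Energy_Absorption"] = 2.4 * df["Force"] + rng.normal(0.0, 0.1, time.size)

X_train, X_test, y_train, y_test = train_test_split(
    df[["Force"]], df["Energy_Absorption"], test_size=0.2, random_state=42
)

# Fit on the training data, then predict on the held-out test data.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Performance metrics on the test set.
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
```

Scoring on the test set rather than the training set is what lets the metrics say something about unseen data.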
As Figure 16 shows, the fitted values follow the actual value distribution very closely, which can be a sign of overfitting. In addition, note [2] warns that strong multicollinearity may be present.
We cannot yet confirm whether this model is reliable, so we first compare it with other algorithms.
MODEL COMPARISON USING DT AND RF
Decision Tree (DT) Regressor
A decision tree is a type of machine-learning algorithm used for both classification and regression tasks. The algorithm builds a tree-like model by recursively splitting the data into smaller subsets based on the feature values that lead to the best separation of the target variable.
Based on Figure 17, the decision tree model has an R-squared value of 0.9999968398481558, which indicates that the model explains nearly all of the variation in the response variable. However, the R-squared value can be somewhat misleading for decision tree models, since they do not make the same assumptions as linear regression models and are not as easily interpreted in terms of the strength and direction of the relationship between the predictor and response variables.
Random Forest (RF) Regressor
Random forest is a type of ensemble learning algorithm that combines multiple decision trees to make more accurate predictions. In a random forest, multiple decision trees are built on different subsets of the data and features. During prediction, each decision tree in the forest independently predicts the outcome, and the final prediction is the average (for regression problems) or majority vote (for classification problems) of the predictions from all the decision trees.
In the random forest model, the R2 value is 0.9999684505196771, which indicates that the model explains a very high proportion of the variance in the response variable.
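A comparison of the two tree-based regressors might be set up as in this sketch. It uses synthetic stand-in data and scikit-learn defaults apart from the `random_state` values and `n_estimators=100`, which are our own assumptions; the source does not list the hyperparameters used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: NOT the UniKL measurements).
rng = np.random.default_rng(0)
time = np.linspace(0.0, 60.0, 600)
df = pd.DataFrame({"Force": 2.0 * np.sqrt(time) + rng.normal(0.0, 0.05, time.size)})
df["Energy_Absorption"] = 2.4 * df["Force"] + rng.normal(0.0, 0.1, time.size)

X_train, X_test, y_train, y_test = train_test_split(
    df[["Force"]], df["Energy_Absorption"], test_size=0.2, random_state=42
)

models = {
    "DT": DecisionTreeRegressor(random_state=0),
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model on the training set and score R-squared on the test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
print(scores)
```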
REGULARIZATION TECHNIQUES
The Linear Regression model showed signs of multicollinearity, and one way to address multicollinearity is to use regularization techniques.
As Andrea Perlato (n.d.) explains, to reduce multicollinearity we can use regularization, which means keeping all the features but reducing the magnitude of the coefficients of the model. This is a good solution when each predictor contributes to predicting the dependent variable.
Two techniques can be applied: Ridge and Lasso regularization.
The limitation of Ridge is that it is not good for feature reduction, while Lasso is not suitable when two or more variables are highly collinear. In this case, we use Ridge to handle multicollinearity.
However, we also apply Lasso regularization to check whether it can outperform Ridge.
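For reference, the difference between the two penalties can be written out. This is the standard textbook form rather than anything taken from our figures, and the notation is our own assumption: y_i is the target, x_ij the features, beta the coefficients, and alpha the regularization strength. Libraries may rescale the squared-error term by a constant.

```latex
\begin{aligned}
\text{OLS:}\quad
  & \min_{\beta}\ \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2} \\
\text{Ridge:}\quad
  & \min_{\beta}\ \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2}
    + \alpha \sum_{j=1}^{p} \beta_j^{2} \\
\text{Lasso:}\quad
  & \min_{\beta}\ \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2}
    + \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
\end{aligned}
```

The squared penalty in Ridge shrinks all coefficients smoothly, while the absolute-value penalty in Lasso can drive some coefficients exactly to zero.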
Comparing Coefficient between Ridge and Lasso
- From Figure 19, the Ridge coefficient (0.0033327) is higher than the Lasso coefficient (0.00333251).
- Based on the coefficients, we continue with Ridge Regression and tune its hyperparameter.
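The coefficient comparison could be reproduced along these lines. The alpha values here (1.0 for both, the scikit-learn defaults) are assumptions, since the values behind Figure 19 are not stated, and the data are synthetic stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the UniKL measurements).
rng = np.random.default_rng(0)
time = np.linspace(0.0, 60.0, 600)
df = pd.DataFrame({"Force": 2.0 * np.sqrt(time) + rng.normal(0.0, 0.05, time.size)})
df["Energy_Absorption"] = 2.4 * df["Force"] + rng.normal(0.0, 0.1, time.size)

X_train, X_test, y_train, y_test = train_test_split(
    df[["Force"]], df["Energy_Absorption"], test_size=0.2, random_state=42
)

# Same regularization strength for both models (assumed value).
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
lasso = Lasso(alpha=1.0).fit(X_train, y_train)

ridge_coef = float(ridge.coef_[0])
lasso_coef = float(lasso.coef_[0])
print("Ridge coefficient:", ridge_coef)
print("Lasso coefficient:", lasso_coef)
```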
Ridge Regression (RR)
Based on Figure 20, after applying Ridge regularization, there is no clearly visible difference in the R-squared value.
However, in the plot in Figure 21, we can see that the fitted values follow the actual value distribution without overfitting. Some points deviate from the original curve, but the model is still quite reliable.
COMPARING LR AND RR
Figure 22 shows the results before (LR) and after (RR) applying the regularization technique.
TUNED HYPERPARAMETER
We tune the Ridge model using 5-fold cross-validation, comparing alpha values of 0.01, 0.1, 1, 10, and 100. Based on Figure 23, the best alpha is 0.01, and the tuned R-squared score is 0.9998777.
So do we really need to tune?
Hyperparameter tuning is an essential part of controlling the behavior of a machine-learning model. If we don’t correctly tune our hyperparameters, our estimated model parameters produce suboptimal results, as they don’t minimize the loss function. This means our model makes more errors.
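The tuning described above might be implemented with a grid search, as in the sketch below. We assume `GridSearchCV` with R-squared scoring; the source does not name the exact tool, and the data are synthetic stand-ins, so the selected alpha and score will not necessarily match Figure 23.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: NOT the UniKL measurements).
rng = np.random.default_rng(0)
time = np.linspace(0.0, 60.0, 600)
df = pd.DataFrame({"Force": 2.0 * np.sqrt(time) + rng.normal(0.0, 0.05, time.size)})
df["Energy_Absorption"] = 2.4 * df["Force"] + rng.normal(0.0, 0.1, time.size)

X_train, X_test, y_train, y_test = train_test_split(
    df[["Force"]], df["Energy_Absorption"], test_size=0.2, random_state=42
)

# Alpha grid and cv=5 follow the text; scoring="r2" is an assumption.
param_grid = {"alpha": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

best_alpha = search.best_params_["alpha"]
print("best alpha:", best_alpha)
print("mean cross-validated R2:", search.best_score_)
print("test R2:", search.score(X_test, y_test))
```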
Lasso Regression (LS)
For Lasso Regression, Figure 24 shows that the R-squared is also close to 1, at 0.99987082.
So are there any differences?
COMPARING SKEWNESS AND KURTOSIS
The skewness for a normal distribution is "0" (zero), and any symmetric data should have a skewness near zero. Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right.
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
What is the suitable range?
Both skewness and kurtosis can be analyzed through descriptive statistics. Acceptable values of skewness fall between -3 and +3, and kurtosis is appropriate in a range of -10 to +10 when utilizing SEM (Brown, 2006).
If the skewness and kurtosis are close to 0, then a normal distribution is often assumed. Based on the figure, ridge regression and lasso regression have a balance between skewness and kurtosis.
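Skewness and kurtosis can be computed directly with pandas, as sketched below. We assume the quantity examined is the test-set residuals of each model, which the source does not state explicitly; the alpha values and data are also stand-ins. Here `kurt()` returns excess kurtosis, so a normal distribution scores near 0.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the UniKL measurements).
rng = np.random.default_rng(0)
time = np.linspace(0.0, 60.0, 600)
df = pd.DataFrame({"Force": 2.0 * np.sqrt(time) + rng.normal(0.0, 0.05, time.size)})
df["Energy_Absorption"] = 2.4 * df["Force"] + rng.normal(0.0, 0.1, time.size)

X_train, X_test, y_train, y_test = train_test_split(
    df[["Force"]], df["Energy_Absorption"], test_size=0.2, random_state=42
)

# Residuals (actual minus predicted) on the test set for each model.
residuals = {}
for name, model in {"Ridge": Ridge(alpha=0.01), "Lasso": Lasso(alpha=0.01)}.items():
    model.fit(X_train, y_train)
    residuals[name] = pd.Series(y_test.to_numpy() - model.predict(X_test))

# One row per model: sample skewness and excess kurtosis of its residuals.
shape_stats = pd.DataFrame({
    "skewness": {name: res.skew() for name, res in residuals.items()},
    "kurtosis": {name: res.kurt() for name, res in residuals.items()},
})
print(shape_stats)
```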
COMPARING ALL PERFORMANCE METRICS WITH 5 SAMPLES
For all samples, Linear and Ridge regression give the same results for all performance metrics. However, linear regression cannot handle multicollinearity, so although the metric values may be the same, the behavior of the two models differs. Refer to Table 8.
CONCLUSION
In conclusion, we find that Ridge Regression is the best algorithm for building the machine learning model, compared with Linear Regression, the Decision Tree Regressor, and the Random Forest Regressor.
First, when we plotted the actual versus fitted values for linear regression, the model tended to overfit, whereas the Ridge Regression plot showed no overfitting.
Second, for the skewness and kurtosis of each sample's dataset, only Ridge and Lasso Regression tend to be balanced.
Third, the performance metrics for Ridge and Linear Regression are nearly the same, but linear regression cannot handle multicollinearity.
Lastly, when we rank the models, Ridge Regression gives the most balanced and suitable results, which is why we conclude that this algorithm is the best for this project.
FUTURE RECOMMENDATIONS
- Apply the model to a new target, Total_Energy_Absorption, which has more outliers to handle.
- Perform multiple linear regression on Force and Stroke versus the target to see the result.
- Use a time-series machine learning model, since the dataset includes time data.
REFERENCES
Why the Boeing 787 & Airbus A350 Are Built With Composite Materials. (2021, September 2). Simple Flying. https://simpleflying.com/787-a350-composite/
Pape, P. G. (2011). Adhesion promoters. In Applied Plastics Engineering Handbook. Elsevier. https://doi.org/10.1016/b978-1-4377-3514-7.10029-7
van den Berg, S. M. (n.d.). Chapter 6 Categorical predictor variables. In Analysing Data using Linear Models. https://bookdown.org/pingapang9/linear_models_bookdown/chap-categorical.html
Feature Selection for Ridge Regression with Provable Guarantees. (n.d.). MIT Press Journals & Magazine, IEEE Xplore. https://ieeexplore.ieee.org/abstract/document/7439920
Verma, Y. (2022, April 5). A hands-on guide to ridge regression for feature selection. Analytics India Magazine. https://analyticsindiamag.com/a-hands-on-guide-to-ridge-regression-for-feature-selection/
Detective, T. D. (2020, January 31). Finally: Why We Use an 80/20 Split for Training and Test Data Plus an Alternative Method (Oh Yes. . .). Medium. https://towardsdatascience.com/finally-why-we-use-an-80-20-split-for-training-and-test-data-plus-an-alternative-method-oh-yes-edc77e96295d
Varghese, D. (2019, May 10). Comparative study on Classic Machine learning Algorithms. Medium. https://towardsdatascience.com/comparative-study-on-classic-machine-learning-algorithms-24f9ff6ab222
Perlato, A. (n.d.). Deal Multicollinearity with Ridge Regression. https://www.andreaperlato.com/mlpost/deal-multicollinearity-with-ridge-regression/
What is hyperparameter tuning? (n.d.). Anyscale. https://www.anyscale.com/blog/what-is-hyperparameter-tuning
Chugh, A. (2022, March 16). MAE, MSE, RMSE, Coefficient of Determination, Adjusted R Squared — Which Metric is Better? Medium. https://medium.com/analytics-vidhya/mae-mse-rmse-coefficient-of-determination-adjusted-r-squared-which-metric-is-better-cd0326a5697e
Kurtosis - an overview. (n.d.). ScienceDirect Topics. https://doi.org/10.1016/B978-0-12-396973-6.00010-1
Gawali, S. (2021, May 2). Skewness and Kurtosis: Quick Guide (Updated 2023). Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/05/shape-of-data-skewness-and-kurtosis/