Biodegradation is an important mechanism for the removal of organic chemicals in natural systems. Estimating the biodegradability of chemicals that reach the aquatic environment, whether in significant or negligible quantities, is necessary for assessing the overall hazard associated with their use. A quick and accurate way to assess biodegradability is therefore needed.
Aim: to build an effective and reliable predictive model that accurately classifies chemicals as readily or non-readily biodegradable.
Dataset Description
The QSAR biodegradation dataset was built in the Milano Chemometrics and QSAR Research Group (Università degli Studi Milano-Bicocca, Milano, Italy). The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under Grant Agreement n. 238701 of Marie Curie ITN Environmental Chemoinformatics (ECO) project.
The data have been used to develop QSAR (Quantitative Structure Activity Relationships) models for the study of the relationships between chemical structure and biodegradation of molecules. Biodegradation experimental values of 1055 chemicals were collected from the webpage of the National Institute of Technology and Evaluation of Japan (NITE). Classification models were developed in order to discriminate ready (356) and not ready (699) biodegradable molecules by means of three different modelling methods: k Nearest Neighbours, Partial Least Squares Discriminant Analysis and Support Vector Machines. Details on the attributes (molecular descriptors) selected in each model can be found in the cited reference: Mansouri, K., Ringsted, T., Ballabio, D., Todeschini, R., Consonni, V. (2013). Quantitative Structure - Activity Relationship models for ready biodegradability of chemicals. Journal of Chemical Information and Modeling, 53, 867-878.
The dataset contains values for 41 attributes (molecular descriptors) used to classify 1055 chemicals into 2 classes (ready and not ready biodegradable). The 41 molecular descriptors and the experimental class attribute are listed below:
- SpMax_L: Leading eigenvalue from Laplace matrix
- J_Dz(e): Balaban-like index from Barysz matrix weighted by Sanderson electronegativity
- nHM: Number of heavy atoms
- F01[N-N]: Frequency of N-N at topological distance 1
- F04[C-N]: Frequency of C-N at topological distance 4
- NssssC: Number of atoms of type ssssC
- nCb-: Number of substituted benzene C(sp2)
- C%: Percentage of C atoms
- nCp: Number of terminal primary C(sp3)
- nO: Number of oxygen atoms
- F03[C-N]: Frequency of C-N at topological distance 3
- SdssC: Sum of dssC E-states
- HyWi_B(m): Hyper-Wiener-like index (log function) from Burden matrix weighted by mass
- LOC: Lopping centric index
- SM6_L: Spectral moment of order 6 from Laplace matrix
- F03[C-O]: Frequency of C - O at topological distance 3
- Me: Mean atomic Sanderson electronegativity (scaled on Carbon atom)
- Mi: Mean first ionization potential (scaled on Carbon atom)
- nN-N: Number of N hydrazines
- nArNO2: Number of nitro groups (aromatic)
- nCRX3: Number of CRX3
- SpPosA_B(p): Normalized spectral positive sum from Burden matrix weighted by polarizability
- nCIR: Number of circuits
- B01[C-Br]: Presence/absence of C - Br at topological distance 1
- B03[C-Cl]: Presence/absence of C - Cl at topological distance 3
- N-073: Ar2NH / Ar3N / Ar2N-Al / R..N..R
- SpMax_A: Leading eigenvalue from adjacency matrix (Lovasz-Pelikan index)
- Psi_i_1d: Intrinsic state pseudoconnectivity index - type 1d
- B04[C-Br]: Presence/absence of C - Br at topological distance 4
- SdO: Sum of dO E-states
- TI2_L: Second Mohar index from Laplace matrix
- nCrt: Number of ring tertiary C(sp3)
- C-026: R--CX--R
- F02[C-N]: Frequency of C - N at topological distance 2
- nHDon: Number of donor atoms for H-bonds (N and O)
- SpMax_B(m): Leading eigenvalue from Burden matrix weighted by mass
- Psi_i_A: Intrinsic state pseudoconnectivity index - type S average
- nN: Number of Nitrogen atoms
- SM6_B(m): Spectral moment of order 6 from Burden matrix weighted by mass
- nArCOOR: Number of esters (aromatic)
- nX: Number of halogen atoms
- experimental class: ready biodegradable (RB) and not ready biodegradable (NRB)
Data Analysis
Feature selection is used to identify the features that are most useful for predicting the target variable. The feature selection method used in this project is ANOVA (Analysis of Variance). ANOVA is a good fit because the input variables are numerical and the target is categorical, with the two classes 'RB' and 'NRB'; the response is therefore a binary classification, equivalent to a true/false value. As a result, ANOVA was chosen as the feature selection method for the QSAR dataset for both the predictive Neural Network model and the Logistic Regression model.
A filter technique was used to choose the features: a statistical test measures the association between each input variable and the target variable. For each feature, the QSAR dataset is used to compute an ANOVA score and the corresponding probability (p) value, which together indicate the strength of the relationship between the feature and the target response. The lower the p-value, and the higher the score, the stronger the relationship with the response. Since the p-value for every feature is less than 0.05, all features were retained. The list of features with their p-values and scores is shown below.
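This selection step can be sketched as follows, assuming scikit-learn's SelectKBest with the f_classif (ANOVA F-test) scorer; the file name and column names are placeholders, not taken from the project files.

```python
# Minimal sketch of the ANOVA filter step (file and column names are assumptions).
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

df = pd.read_csv("biodeg.csv")                    # assumed file name
X = df.drop(columns=["experimental_class"])       # 41 molecular descriptors
y = df["experimental_class"]                      # 'RB' / 'NRB' labels

# ANOVA F-test between each numerical feature and the binary class
selector = SelectKBest(score_func=f_classif, k="all").fit(X, y)

scores = pd.DataFrame({
    "feature": X.columns,
    "score": selector.scores_,
    "p_value": selector.pvalues_,
}).sort_values("score", ascending=False)

# Keep every feature whose p-value is below 0.05 (all 41 in this project)
selected = scores.loc[scores["p_value"] < 0.05, "feature"]
print(scores)
print(f"{len(selected)} of {X.shape[1]} features have p < 0.05")
```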
Neural Network vs Logistic Regression
Data Modeling
Two predictive models, a Neural Network and a Logistic Regression, were built. The data were split 80% for training and 20% for testing. The parameters for both predictive models are shown in Table 4.
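The split can be sketched as below, assuming scikit-learn's train_test_split; the label encoding, random seed and stratification shown here are assumptions rather than settings from Table 4.

```python
# Minimal sketch of the 80/20 split (seed and stratification are assumptions).
from sklearn.model_selection import train_test_split

# Encode the class labels as 1 (RB) / 0 (NRB) before the split
y_bin = (y == "RB").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y_bin, test_size=0.20, stratify=y_bin, random_state=42
)
```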
1) Neural Network
The first step in determining how the Neural Network model will perform is finding its optimal hyperparameters. Grid search is the optimization technique used. The hyperparameter search focuses on the batch size, number of epochs, optimizer algorithm, and weight initialization mode that work best for this model. The Adam optimizer was used in this model.
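A minimal sketch of this step is given below, assuming TensorFlow/Keras with the scikeras wrapper; the network architecture and the grid values are illustrative assumptions and are not the settings from Table 4.

```python
# Sketch only: layer sizes and grid values are assumptions (see Table 4 for
# the actual settings). Assumes TensorFlow/Keras and the scikeras wrapper.
from sklearn.model_selection import GridSearchCV
from scikeras.wrappers import KerasClassifier
from tensorflow import keras

def create_model(init_mode="glorot_uniform"):
    # Small feed-forward network over the 41 molecular descriptors
    model = keras.Sequential([
        keras.layers.Input(shape=(41,)),
        keras.layers.Dense(20, activation="relu", kernel_initializer=init_mode),
        keras.layers.Dense(1, activation="sigmoid", kernel_initializer=init_mode),
    ])
    # Optimizer fixed to Adam here, as used in the report
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

nn = KerasClassifier(model=create_model, verbose=0)
param_grid = {
    "batch_size": [16, 32, 64],
    "epochs": [50, 100],
    "model__init_mode": ["random_uniform", "glorot_uniform", "he_uniform"],
}
grid = GridSearchCV(nn, param_grid, cv=3, scoring="accuracy")
grid_result = grid.fit(X_train, y_train)
print(grid_result.best_params_, grid_result.best_score_)
```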
2) Logistic Regression
A Logistic Regression model was built and tuned using Grid Search CV. The Logistic Regression accuracy, precision, recall, f1-score, support and confusion matrix are reported below.
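A sketch of how this model and its report can be produced, assuming scikit-learn; the parameter grid shown here is an assumption, not the grid from Table 4.

```python
# Sketch only: the parameter grid is an assumption (see Table 4 for actual values).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix

param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "penalty": ["l2"],
    "solver": ["lbfgs", "liblinear"],
}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

y_pred = grid.best_estimator_.predict(X_test)
print(grid.best_params_)
print(classification_report(y_test, y_pred, target_names=["NRB", "RB"]))
print(confusion_matrix(y_test, y_pred))
```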
Insights
Precision measures, out of all the cases the model predicted as positive, how many are actually positive. Precision is a good measure to use when the cost of a False Positive is high and we want to minimize false positives.
Recall measures how many of the actual positives the model captures by labelling them as positive (True Positives). It should be the metric used to select the best model when there is a high cost associated with a False Negative and we want to minimize the chance of missing positive cases.
Accuracy is a good measure if the dataset is fairly balanced and all types of outcomes matter equally. It can be inflated by a large number of True Negatives, which in most scenarios are not the main focus.
The F1 score is useful when we want a balance between Precision and Recall. It is a better measure than accuracy when the dataset is imbalanced and the class distribution is uneven (e.g., a large number of actual negatives).
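For reference, with TP, FP, TN and FN denoting the confusion-matrix counts of true positives, false positives, true negatives and false negatives:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1 score = 2 × Precision × Recall / (Precision + Recall)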
Simply put, a model is considered good if it gives high accuracy on test or production data, i.e., if it generalises well to unseen data. In our opinion, accuracy greater than 70% indicates good model performance.
Discussion
Both models have the same accuracy and almost the same Precision, Recall and F1-score. Hence, we can conclude that the two models have comparable performance. However, we think that Logistic Regression is the better choice than the Neural Network because:
1. Although a Neural Network can find more complex patterns in the data, which can lead to better performance, it is more complex and harder to build and maintain.
2. Logistic Regression has significantly lower training time and cost than the Neural Network.
3. Logistic Regression has significantly lower inference time than the Neural Network when running the model and making predictions.
Designed by: Kumara Ruben A/L Kumaravelloo 153053, Derek Gan Kaa Kheng 154738, Mohammad Danial Hakim Bin Zol 152622, Nur Akma Aqishah Binti Mukhtar 152238