Biodegradation is an important mechanism for the removal of organic chemicals in natural systems. Estimating the biodegradability of chemicals that reach the aquatic environment, whether in significant or negligible quantities, is necessary for assessing the overall hazard associated with their use. A quick and accurate way to assess biodegradability is therefore needed.
Aim: to build an effective and reliable predictive model that accurately classifies chemicals as readily or non-readily biodegradable.
Dataset Description
The QSAR biodegradation dataset was built in the Milano Chemometrics and QSAR Research Group (Università degli Studi Milano-Bicocca, Milano, Italy). The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under Grant Agreement n. 238701 of Marie Curie ITN Environmental Chemoinformatics (ECO) project.
The data have been used to develop QSAR (Quantitative Structure Activity Relationships) models for the study of the relationships between chemical structure and biodegradation of molecules. Biodegradation experimental values of 1055 chemicals were collected from the webpage of the National Institute of Technology and Evaluation of Japan (NITE). Classification models were developed in order to discriminate ready (356) and not ready (699) biodegradable molecules by means of three different modelling methods: k Nearest Neighbours, Partial Least Squares Discriminant Analysis and Support Vector Machines. Details on the attributes (molecular descriptors) selected in each model can be found in the cited reference: Mansouri, K., Ringsted, T., Ballabio, D., Todeschini, R., Consonni, V. (2013). Quantitative Structure - Activity Relationship models for ready biodegradability of chemicals. Journal of Chemical Information and Modeling, 53, 867-878.
The dataset contains values for 41 attributes (molecular descriptors) used to classify 1055 chemicals into 2 classes (ready and not ready biodegradable). The 41 molecular descriptors and the experimental class attribute are listed below:
- SpMax_L: Leading eigenvalue from Laplace matrix
- J_Dz(e): Balaban-like index from Barysz matrix weighted by Sanderson electronegativity
- nHM: Number of heavy atoms
- F01[N-N]: Frequency of N-N at topological distance 1
- F04[C-N]: Frequency of C-N at topological distance 4
- NssssC: Number of atoms of type ssssC
- nCb-: Number of substituted benzene C(sp2)
- C%: Percentage of C atoms
- nCp: Number of terminal primary C(sp3)
- nO: Number of oxygen atoms
- F03[C-N]: Frequency of C-N at topological distance 3
- SdssC: Sum of dssC E-states
- HyWi_B(m): Hyper-Wiener-like index (log function) from Burden matrix weighted by mass
- LOC: Lopping centric index
- SM6_L: Spectral moment of order 6 from Laplace matrix
- F03[C-O]: Frequency of C - O at topological distance 3
- Me: Mean atomic Sanderson electronegativity (scaled on Carbon atom)
- Mi: Mean first ionization potential (scaled on Carbon atom)
- nN-N: Number of N hydrazines
- nArNO2: Number of nitro groups (aromatic)
- nCRX3: Number of CRX3
- SpPosA_B(p): Normalized spectral positive sum from Burden matrix weighted by polarizability
- nCIR: Number of circuits
- B01[C-Br]: Presence/absence of C - Br at topological distance 1
- B03[C-Cl]: Presence/absence of C - Cl at topological distance 3
- N-073: Ar2NH / Ar3N / Ar2N-Al / R..N..R
- SpMax_A: Leading eigenvalue from adjacency matrix (Lovasz-Pelikan index)
- Psi_i_1d: Intrinsic state pseudoconnectivity index - type 1d
- B04[C-Br]: Presence/absence of C - Br at topological distance 4
- SdO: Sum of dO E-states
- TI2_L: Second Mohar index from Laplace matrix
- nCrt: Number of ring tertiary C(sp3)
- C-026: R--CX--R
- F02[C-N]: Frequency of C - N at topological distance 2
- nHDon: Number of donor atoms for H-bonds (N and O)
- SpMax_B(m): Leading eigenvalue from Burden matrix weighted by mass
- Psi_i_A: Intrinsic state pseudoconnectivity index - type S average
- nN: Number of Nitrogen atoms
- SM6_B(m): Spectral moment of order 6 from Burden matrix weighted by mass
- nArCOOR: Number of esters (aromatic)
- nX: Number of halogen atoms
- experimental class: ready biodegradable (RB) and not ready biodegradable (NRB)
Data Analysis
Feature selection is used to identify the features that are most useful for predicting the target variable. The feature selection method used in this project is ANOVA (Analysis of Variance). ANOVA is a good fit because the input variables are numerical and the target is categorical, with the two classes 'RB' and 'NRB'; the response is therefore a binary classification, equivalent to a true/false value. As a result, ANOVA was chosen as the feature selection method for the QSAR dataset for both the predictive Neural Network model and the Logistic Regression model.
A filter technique was used to choose the features: a statistical test measures the association between each input variable and the target variable. For each feature, the QSAR dataset is used to compute an ANOVA score and the corresponding probability (p) value, which together indicate the strength of the relationship between the feature and the target response. The lower the p-value, and the higher the score, the stronger the relationship with the response. Since the p-value for every feature is less than 0.05, all features were retained. The list of features with their p-values and scores is shown below.
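This selection step can be sketched as follows, assuming scikit-learn's SelectKBest with the f_classif (ANOVA F-test) scorer; the file name and column names are placeholders, not taken from the project files.

```python
# Minimal sketch of the ANOVA filter step (file and column names are assumptions).
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

df = pd.read_csv("biodeg.csv")                    # assumed file name
X = df.drop(columns=["experimental_class"])       # 41 molecular descriptors
y = df["experimental_class"]                      # 'RB' / 'NRB' labels

# ANOVA F-test between each numerical feature and the binary class
selector = SelectKBest(score_func=f_classif, k="all").fit(X, y)

scores = pd.DataFrame({
    "feature": X.columns,
    "score": selector.scores_,
    "p_value": selector.pvalues_,
}).sort_values("score", ascending=False)

# Keep every feature whose p-value is below 0.05 (all 41 in this project)
selected = scores.loc[scores["p_value"] < 0.05, "feature"]
print(scores)
print(f"{len(selected)} of {X.shape[1]} features have p < 0.05")
```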
Neural Network vs Logistic Regression
Data Modeling
Two predictive models, a Neural Network and a Logistic Regression, were built. The data were split 80% for training and 20% for testing. The parameters for both predictive models are shown in Table 4.
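The split can be sketched as below, assuming scikit-learn's train_test_split; the label encoding, random seed and stratification shown here are assumptions rather than settings from Table 4.

```python
# Minimal sketch of the 80/20 split (seed and stratification are assumptions).
from sklearn.model_selection import train_test_split

# Encode the class labels as 1 (RB) / 0 (NRB) before the split
y_bin = (y == "RB").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y_bin, test_size=0.20, stratify=y_bin, random_state=42
)
```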
1) Neural Network
The first step in determining how the Neural Network model will perform is finding its optimal hyperparameters. Grid search is the optimization technique used. The hyperparameter search focuses on the batch size, number of epochs, optimizer algorithm, and weight initialization mode that work best for this model. The Adam optimizer was used in this model.
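A minimal sketch of this step is given below, assuming TensorFlow/Keras with the scikeras wrapper; the network architecture and the grid values are illustrative assumptions and are not the settings from Table 4.

```python
# Sketch only: layer sizes and grid values are assumptions (see Table 4 for
# the actual settings). Assumes TensorFlow/Keras and the scikeras wrapper.
from sklearn.model_selection import GridSearchCV
from scikeras.wrappers import KerasClassifier
from tensorflow import keras

def create_model(init_mode="glorot_uniform"):
    # Small feed-forward network over the 41 molecular descriptors
    model = keras.Sequential([
        keras.layers.Input(shape=(41,)),
        keras.layers.Dense(20, activation="relu", kernel_initializer=init_mode),
        keras.layers.Dense(1, activation="sigmoid", kernel_initializer=init_mode),
    ])
    # Optimizer fixed to Adam here, as used in the report
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

nn = KerasClassifier(model=create_model, verbose=0)
param_grid = {
    "batch_size": [16, 32, 64],
    "epochs": [50, 100],
    "model__init_mode": ["random_uniform", "glorot_uniform", "he_uniform"],
}
grid = GridSearchCV(nn, param_grid, cv=3, scoring="accuracy")
grid_result = grid.fit(X_train, y_train)
print(grid_result.best_params_, grid_result.best_score_)
```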
2) Logistic Regression
A Logistic Regression model was built and tuned using Grid Search CV. The Logistic Regression accuracy, precision, recall, f1-score, support and confusion matrix are reported below.
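A sketch of how this model and its report can be produced, assuming scikit-learn; the parameter grid shown here is an assumption, not the grid from Table 4.

```python
# Sketch only: the parameter grid is an assumption (see Table 4 for actual values).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix

param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "penalty": ["l2"],
    "solver": ["lbfgs", "liblinear"],
}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

y_pred = grid.best_estimator_.predict(X_test)
print(grid.best_params_)
print(classification_report(y_test, y_pred, target_names=["NRB", "RB"]))
print(confusion_matrix(y_test, y_pred))
```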
Insights
Precision measures, out of all the cases the model predicted as positive, how many are actually positive. Precision is a good measure to use when the cost of a False Positive is high and we want to minimize false positives.
Recall measures how many of the actual positives the model captures by labelling them as positive (True Positives). It should be the metric used to select the best model when there is a high cost associated with a False Negative and we want to minimize the chance of missing positive cases.
Accuracy is a good measure if the dataset is fairly balanced and all types of outcomes matter equally. It can be inflated by a large number of True Negatives, which in most scenarios are not the main focus.
The F1 score is useful when we want a balance between Precision and Recall. It is a better measure than accuracy when the dataset is imbalanced and the class distribution is uneven (e.g., a large number of actual negatives).
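For reference, with TP, FP, TN and FN denoting the confusion-matrix counts of true positives, false positives, true negatives and false negatives:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1 score = 2 × Precision × Recall / (Precision + Recall)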
Simply put, a model is considered good if it gives high accuracy on test or production data, i.e., if it generalises well to unseen data. In our opinion, accuracy greater than 70% indicates good model performance.
Discussion
Both models have the same accuracy and almost the same Precision, Recall and F1-score. Hence, we can conclude that the two models have comparable performance. However, we think that Logistic Regression is the better choice than the Neural Network because:
1. Although a Neural Network can find more complex patterns in the data, which can lead to better performance, it is more complex and harder to build and maintain.
2. Logistic Regression has significantly lower training time and cost than the Neural Network.
3. Logistic Regression has significantly lower inference time than the Neural Network when running the model and making predictions.
Designed by: Kumara Ruben A/L Kumaravelloo 153053, Derek Gan Kaa Kheng 154738, Mohammad Danial Hakim Bin Zol 152622, Nur Akma Aqishah Binti Mukhtar 152238