
QSAR Biodegradation Prediction Group: QSAR_1

Biodegradation is an important mechanism for removing organic chemicals from natural systems. Estimating the biodegradability of chemicals that reach the aquatic environment, whether in significant or negligible quantities, is necessary for assessing the overall hazard associated with their use. A quick and accurate way to assess biodegradability is therefore essential.

Aim: to build an effective and reliable predictive model that accurately classifies chemicals as readily or not readily biodegradable.

Dataset Description

The QSAR biodegradation dataset was built in the Milano Chemometrics and QSAR Research Group (Università degli Studi Milano-Bicocca, Milano, Italy). The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under Grant Agreement n. 238701 of Marie Curie ITN Environmental Chemoinformatics (ECO) project.

The data have been used to develop QSAR (Quantitative Structure Activity Relationships) models for the study of the relationships between chemical structure and biodegradation of molecules. Biodegradation experimental values of 1055 chemicals were collected from the webpage of the National Institute of Technology and Evaluation of Japan (NITE). Classification models were developed in order to discriminate ready (356) and not ready (699) biodegradable molecules by means of three different modelling methods: k Nearest Neighbours, Partial Least Squares Discriminant Analysis and Support Vector Machines. Details on attributes (molecular descriptors) selected in each model can be found in the quoted reference: Mansouri, K., Ringsted, T., Ballabio, D., Todeschini, R., Consonni, V. (2013). Quantitative Structure - Activity Relationship models for ready biodegradability of chemicals. Journal of Chemical Information and Modeling, 53, 867-878.
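The class counts quoted above imply an imbalanced dataset, which is worth keeping in mind when reading accuracy scores later on. The short calculation below works out the class shares and the accuracy of a trivial model that always predicts the majority class. The variable names are illustrative only; the counts are the ones given in the description.

```python
# Class counts as stated in the dataset description.
n_ready = 356        # ready biodegradable (RB)
n_not_ready = 699    # not ready biodegradable (NRB)
n_total = n_ready + n_not_ready

# Share of each class in the full dataset.
share_ready = n_ready / n_total
share_not_ready = n_not_ready / n_total

# Accuracy of a trivial model that always predicts the majority class (NRB).
majority_baseline_accuracy = max(n_ready, n_not_ready) / n_total

print(f"total chemicals: {n_total}")
print(f"RB share:  {share_ready:.3f}")
print(f"NRB share: {share_not_ready:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.3f}")
```

Any classifier trained on this data should be judged against that baseline rather than against 50%.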

Table 1 Sample of the QSAR biodegradation dataset

The data set contains values for 41 attributes (molecular descriptors) used to classify 1055 chemicals into 2 classes (ready and not ready biodegradable). The 41 molecular descriptors and 1 experimental class are listed below:

  1. SpMax_L: Leading eigenvalue from Laplace matrix
  2. J_Dz(e): Balaban-like index from Barysz matrix weighted by Sanderson electronegativity
  3. nHM: Number of heavy atoms
  4. F01[N-N]: Frequency of N-N at topological distance 1
  5. F04[C-N]: Frequency of C-N at topological distance 4
  6. NssssC: Number of atoms of type ssssC
  7. nCb-: Number of substituted benzene C(sp2)
  8. C%: Percentage of C atoms
  9. nCp: Number of terminal primary C(sp3)
  10. nO: Number of oxygen atoms
  11. F03[C-N]: Frequency of C-N at topological distance 3
  12. SdssC: Sum of dssC E-states
  13. HyWi_B(m): Hyper-Wiener-like index (log function) from Burden matrix weighted by mass
  14. LOC: Lopping centric index
  15. SM6_L: Spectral moment of order 6 from Laplace matrix
  16. F03[C-O]: Frequency of C - O at topological distance 3
  17. Me: Mean atomic Sanderson electronegativity (scaled on Carbon atom)
  18. Mi: Mean first ionization potential (scaled on Carbon atom)
  19. nN-N: Number of N hydrazines
  20. nArNO2: Number of nitro groups (aromatic)
  21. nCRX3: Number of CRX3
  22. SpPosA_B(p): Normalized spectral positive sum from Burden matrix weighted by polarizability
  23. nCIR: Number of circuits
  24. B01[C-Br]: Presence/absence of C - Br at topological distance 1
  25. B03[C-Cl]: Presence/absence of C - Cl at topological distance 3
  26. N-073: Ar2NH / Ar3N / Ar2N-Al / R..N..R
  27. SpMax_A: Leading eigenvalue from adjacency matrix (Lovasz-Pelikan index)
  28. Psi_i_1d: Intrinsic state pseudoconnectivity index - type 1d
  29. B04[C-Br]: Presence/absence of C - Br at topological distance 4
  30. SdO: Sum of dO E-states
  31. TI2_L: Second Mohar index from Laplace matrix
  32. nCrt: Number of ring tertiary C(sp3)
  33. C-026: R--CX--R
  34. F02[C-N]: Frequency of C - N at topological distance 2
  35. nHDon: Number of donor atoms for H-bonds (N and O)
  36. SpMax_B(m): Leading eigenvalue from Burden matrix weighted by mass
  37. Psi_i_A: Intrinsic state pseudoconnectivity index - type S average
  38. nN: Number of Nitrogen atoms
  39. SM6_B(m): Spectral moment of order 6 from Burden matrix weighted by mass
  40. nArCOOR: Number of esters (aromatic)
  41. nX: Number of halogen atoms
  42. experimental class: ready biodegradable (RB) and not ready biodegradable (NRB)

Data Analysis

In this project, the chi-squared method is used for feature selection because it measures the association between each feature and a categorical response. The response in this dataset is binary, with the classes 'RB' and 'NRB', which is equivalent to a true or false value. We therefore use chi-squared feature selection for both predictive models, K-Nearest Neighbors and Decision Tree.

Ten of the 41 features, together with the response, were chosen for plotting. These ten have the lowest values of all the features under the chi-squared method, which we take to indicate the strongest association with the response. The ten selected features are:

Table 2 The selected features
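The chi-squared selection described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the code used in the project: the data are synthetic stand-ins for the 41 descriptors, and the features are rescaled to [0, 1] because scikit-learn's chi2 requires non-negative inputs (whether the project rescaled its features is not stated). Note that SelectKBest keeps the features with the highest chi-squared scores, which correspond to the lowest p-values.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): 200 samples, 41 descriptors, binary class.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 41
y = rng.integers(0, 2, size=n_samples)
X = rng.normal(size=(n_samples, n_features))
X[:, :3] += 2.0 * y[:, None]  # make the first three columns informative

# chi2 needs non-negative values, so rescale every feature to [0, 1].
X_scaled = MinMaxScaler().fit_transform(X)

# Keep the ten features with the strongest chi-squared association with y.
selector = SelectKBest(score_func=chi2, k=10)
X_selected = selector.fit_transform(X_scaled, y)

selected_idx = selector.get_support(indices=True)
for i in selected_idx:
    print(f"feature {i}: score={selector.scores_[i]:.3f}, p={selector.pvalues_[i]:.3g}")
```

With the real dataset, the column indices would map back to the descriptor names listed in the dataset description.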

K-Nearest Neighbors (KNN) vs Decision Tree

Table 3 Comparisons between K-Nearest Neighbors (KNN) and Decision Tree

Data Modeling

Two predictive models, K-Nearest Neighbors and Decision Tree, are built. The data are split into 60% training, 20% validation and 20% testing. The parameters for both predictive models are shown in Table 4.

Table 4 Parameters of the predictive models
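One common way to obtain a 60/20/20 split is to call scikit-learn's train_test_split twice: first hold out 40% of the data, then divide that portion in half. The sketch below assumes this approach, along with stratified sampling and a fixed random seed, none of which are stated in the report. The feature matrix is a random placeholder with the same number of rows and the same class counts as the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 1055 rows, 10 selected features,
# 356 RB (label 1) and 699 NRB (label 0) as in the dataset description.
rng = np.random.default_rng(0)
X = rng.normal(size=(1055, 10))
y = np.array([1] * 356 + [0] * 699)

# Step 1: 60% training, 40% held out.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)

# Step 2: split the held-out 40% evenly into validation and test (20% each).
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying keeps the RB/NRB ratio roughly equal across the three subsets, which matters here because the classes are imbalanced.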

1) K-Nearest Neighbors (KNN)

For the K-Nearest Neighbors model, we first need to find the best value of k. We use the held-out validation set to tune k: for each candidate value, the model is trained on the training set and scored on the validation set, and the k with the highest validation score is kept.
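The validation-based search for k can be sketched as a simple loop. The example below is a minimal illustration, assuming synthetic data from make_classification, a candidate range of k = 1 to 30, and accuracy as the score; the actual range and data used in the project are not given here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the ten selected features (assumption).
X, y = make_classification(
    n_samples=1055, n_features=10, n_informative=5, random_state=0
)

# 60% training, 20% validation, 20% testing.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

# Fit one model per candidate k on the training set, score it on validation.
val_scores = {}
for k in range(1, 31):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_scores[k] = model.score(X_val, y_val)

# Keep the k with the highest validation accuracy (smallest k wins ties).
best_k = max(val_scores, key=lambda k: (val_scores[k], -k))
best_val_score = val_scores[best_k]
print(f"best k = {best_k}, validation accuracy = {best_val_score:.3f}")
```

The test set is left untouched during this search so that it can still give an unbiased estimate of the final model's performance.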

Figure 1.1 Result for the best value of k
Figure 1.2 Analysis for KNN model result
Figure 1.3 Decision boundary of the KNN classifier

2) Decision Tree Model

The accuracy, precision, recall, F1-score, support and confusion matrix of the Decision Tree are reported below:

Figure 2.1 Analysis for Decision Tree model result
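A report of this kind can be produced with scikit-learn's classification_report and confusion_matrix. The sketch below uses synthetic data and assumes the label encoding 0 = NRB and 1 = RB; it shows how such a report is generated, not the project's actual numbers.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption): about 66% NRB, 34% RB.
X, y = make_classification(
    n_samples=1055, n_features=10, n_informative=5,
    weights=[0.66, 0.34], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = tree.predict(X_test)

# Per-class precision, recall, F1-score and support, plus overall accuracy.
report = classification_report(y_test, y_pred, target_names=["NRB", "RB"])
print(report)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print(cm)
print(f"accuracy = {accuracy:.3f}")
```

The diagonal of the confusion matrix holds the correctly classified chemicals, so accuracy equals the diagonal sum divided by the total.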

The diagram below shows a decision tree model with a max depth of 11.

Figure 2.2 Decision Tree model diagram
Figure 2.3 Decision boundary of the Decision Tree classifier
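A depth-limited tree like the one in the diagram can be built by passing max_depth=11 to DecisionTreeClassifier, and its top levels can be printed as text rules with export_text. The sketch below assumes synthetic data and hypothetical feature names (x0 to x9); the real model would use the ten selected descriptors.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the ten selected features (assumption).
X, y = make_classification(
    n_samples=1055, n_features=10, n_informative=5, random_state=0
)
feature_names = [f"x{i}" for i in range(X.shape[1])]  # hypothetical names

# Limit the tree to at most 11 levels of splits, as in the report.
tree = DecisionTreeClassifier(max_depth=11, random_state=0).fit(X, y)
print(f"actual depth: {tree.get_depth()}, leaves: {tree.get_n_leaves()}")

# Print only the first few levels of the tree as readable rules.
rules = export_text(tree, feature_names=feature_names, max_depth=2)
print(rules)
```

Capping the depth restricts how finely the tree can partition the training data, which is one common way to limit overfitting.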

Discussion

Table 5 Comparison

Based on the results of each model, we can observe that the accuracy score of the K-Nearest Neighbors model is higher than that of the Decision Tree, that is, closer to the ideal value of one. Both models produced the same values of recall, precision, and F1-score. Therefore, we conclude that K-Nearest Neighbors is the better of the two predictive models for the QSAR Biodegradation dataset.
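A side-by-side comparison of this kind can be computed by fitting both models on the same training data and scoring them on the same test set. The sketch below is illustrative only: the data are synthetic, k = 5 is an assumed value, and the precision, recall and F1-score are weighted averages over the two classes, which may differ from the averaging used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption).
X, y = make_classification(
    n_samples=1055, n_features=10, n_informative=5,
    weights=[0.66, 0.34], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),  # k = 5 is an assumption
    "Decision Tree": DecisionTreeClassifier(max_depth=11, random_state=0),
}

# Evaluate every model on the same held-out test set.
results = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="weighted", zero_division=0
    )
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

for name, metrics in results.items():
    print(name, {m: round(float(v), 3) for m, v in metrics.items()})
```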

Designed by: Kumara Ruben A/L Kumaravelloo 153053, Derek Gan Kaa Kheng 154738, Mohammad Danial Hakim Bin Zol 152622, Nur Akma Aqishah Binti Mukhtar 152238

