Introduction
Water quality has a direct impact on public health and the environment in Malaysia. Water is used for many purposes, such as drinking, agriculture, and industry.
Universiti Sains Malaysia (USM) was appointed by the National Water Research Institute of Malaysia (hereafter referred to as NAHRIM) to conduct research on the NAHRIM Environmental Sensor System (NUESS) for Monitoring Water Quality. Water quality data are among the data that will be extracted and collated for this research.
Objective
The objective of the project is to develop a classification model that predicts water quality using machine learning.
Problem Definition
Data analysis and machine learning algorithms must be applied to create an autonomous model, so that water quality can be assessed without human intervention. The question is which type of analysis and which algorithm produce the best model for predicting water quality from the available features.
Dataset Description
The water samples are taken from selected sites. The water quality tests are conducted with equipment at the Environmental Laboratories at the School of Civil Engineering. The water quality of each sample is characterized by the features shown in Table 1:
Data Cleaning
The dataset consists of ten features and 3276 records from the water sample tests. As shown in Figure 1 and Figure 2, three features contain many missing values: pH, Sulfate and Trihalomethanes. These missing values must be handled before modelling.
Figure 3 shows the medians of pH, Sulfate and Trihalomethanes calculated separately for potable and non-potable water. The medians of the two classes differ only slightly, so the overall median of each feature is used to impute its missing values.
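The imputation step described above could look roughly like the following sketch. The data values are invented for illustration, and the label column name `Potability` is an assumption; only the three feature names come from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real dataset has 3276 records.
df = pd.DataFrame({
    "pH": [7.0, np.nan, 6.5, 8.1, np.nan],
    "Sulfate": [330.0, 310.0, np.nan, 350.0, 340.0],
    "Trihalomethanes": [66.0, np.nan, 70.0, 62.0, 80.0],
    "Potability": [0, 1, 0, 1, 0],  # assumed label column name
})
cols = ["pH", "Sulfate", "Trihalomethanes"]

# Count the missing values in each column before imputation.
missing_before = df.isna().sum()

# Medians per class, to check how much they differ between the two classes.
class_medians = df.groupby("Potability")[cols].median()

# Overall median of each feature (NaN values are skipped by default).
overall_medians = df[cols].median()

# Fill each feature's missing values with that feature's overall median.
df_imputed = df.copy()
df_imputed[cols] = df_imputed[cols].fillna(overall_medians)

print(missing_before)
print(class_medians)
print(df_imputed)
```

If the class medians had differed substantially, imputing per class would be an alternative worth considering; with similar medians the overall median is the simpler choice.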
Standardizing the Data
Data standardization brings data into a uniform format that makes analysis and modelling easier. In statistics, standardization refers to putting different features on the same scale so that values of different types of features can be compared. Figure 4 shows the code used to standardize the data in a Jupyter Notebook.
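As a minimal sketch of what standardization does, the snippet below rescales each column to zero mean and unit standard deviation (a z-score). It is not the code from Figure 4; the two columns and their values are made up for illustration.

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are two features
# measured on very different scales.
X = np.array([
    [7.0, 200.0],
    [6.0, 180.0],
    [8.0, 220.0],
])

# Column-wise mean and (population) standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0)

# z-score: subtract the mean and divide by the standard deviation.
X_std = (X - mean) / std

print(X_std)
```

After this transformation both columns are on the same scale, which matters for distance-based methods such as KNN.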
Data Analysis
A scatter plot matrix and a heatmap were used to examine the correlation between all the features. As can be seen from Figure 5 and Figure 6, there is very little correlation between the features.
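The numbers behind such a heatmap are a correlation matrix. The sketch below computes one with pandas on randomly generated, independent columns; the column names are only illustrative and the data are synthetic, not the project's measurements.

```python
import numpy as np
import pandas as pd

# Synthetic, independent columns standing in for real features.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(200, 3)),
    columns=["pH", "Hardness", "Turbidity"],
)

# Pairwise Pearson correlation coefficients between all columns.
corr = df.corr()

# Mask the diagonal (always 1.0) and find the strongest remaining correlation.
off_diag = corr.where(~np.eye(len(corr), dtype=bool))
max_abs = off_diag.abs().max().max()

print(corr.round(3))
print("Largest absolute off-diagonal correlation:", round(max_abs, 3))
```

The `corr` table can be passed to a plotting function such as seaborn's `heatmap` to produce a figure similar in kind to the one described.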
Data Modelling
Three predictive models are built using the Logistic Regression, Random Forest, and K-nearest Neighbors (KNN) algorithms. The models are evaluated with the hold-out method, using a split of 70% training set and 30% test set. The three models are then compared on their performance on the test set. Figure 7 shows the code used for the data modelling process in a Jupyter Notebook.
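A hold-out comparison of the three algorithms could be sketched with scikit-learn as follows. This is not the code from Figure 7: the data are synthetic, the hyperparameters are library defaults apart from `max_iter`, and the use of accuracy as the score is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the water-quality data (assumption: nine numeric
# features and a binary label). Replace with the real features and labels.
X, y = make_classification(
    n_samples=500, n_features=9, n_informative=5, random_state=42
)

# Hold-out split: 70% training set, 30% test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
}

# Fit each model on the training set and score it on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Fixing `random_state` keeps the split and the Random Forest reproducible, so the three scores are compared on exactly the same test samples.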
Based on the precision scores shown in Figure 8, the Logistic Regression model outperforms the other models. The difference in precision between the two best models is only about 0.13 (0.129591387).
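Because precision and accuracy are easy to confuse, the short sketch below computes both from the same predictions. The labels are invented for illustration and are not results from this project.

```python
# Hypothetical true labels and predictions (1 = potable, 0 = not potable).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

# Count the four outcomes of a binary classifier.
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

# Precision: of the samples predicted potable, how many really are.
precision = tp / (tp + fp) if (tp + fp) else 0.0

# Accuracy: share of all samples that were classified correctly.
accuracy = (tp + tn) / len(pairs)

print(f"precision = {precision:.3f}, accuracy = {accuracy:.3f}")
```

The two metrics can rank models differently, so it helps to state which one a comparison is based on.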
Conclusion
1) The data contains an almost equal number of acidic and basic water samples in terms of pH level.
2) 92% of the water samples were considered Hard.
3) Only 2% of the water samples were safe in terms of Chloramines levels.
4) Only 1.8% of the water samples were safe in terms of Sulfate levels.
5) 90.6% of the water samples had higher Carbon levels than the typical Carbon levels in drinking water (10 ppm).
6) 76.6% of the water samples were safe for drinking in terms of Trihalomethane levels.
7) 90.4% of the water samples were safe for drinking in terms of Turbidity.
8) The correlation coefficients between the features were very low.
9) Logistic Regression performed best among the three models.