Predicting Water Quality by Using Machine Learning

AMIR RIDZUAN BIN AMIR HAMZAH | MOHAMAD AMIRUL ASRAF BIN MOHD ASSAN

Introduction

Water quality has a direct impact on public health and the environment in Malaysia. Water is used for various purposes, such as drinking, agriculture and industry.

Universiti Sains Malaysia (USM) was appointed by the National Water Research Institute of Malaysia (hereafter NAHRIM) to conduct research on the NAHRIM Environmental Sensor System (NUESS) for Monitoring Water Quality. The data to be extracted and collated for this research include data on water quality.

Objective

The objective of the project is to develop a classification model that predicts water quality using machine learning.

Problem Definition

Data analysis and machine learning algorithms must be applied to create an autonomous model, so that water quality can be assessed without human intervention. The question is which type of analysis and which algorithm produce the best predictive model of water quality from the available features.

Dataset Description

Water samples are taken from selected sites. The water quality tests are conducted with equipment at the Environmental Laboratories of the School of Civil Engineering. Each water sample is described by the features shown in Table 1:

Table 1: The features and their respective data types and descriptions

Data Cleaning

The dataset consists of ten features and 3276 observations from water sample testing. As shown in Figure 1 and Figure 2, three features have many missing values: pH, Sulfate and Trihalomethanes. These missing values must be handled before modelling.

Figure 1: Nullity matrix of the dataset; the white segments represent missing values

Figure 2: The number of missing values from the features

Figure 3 shows the medians of pH, Sulfate and Trihalomethanes for potable and non-potable water. For each feature, the two medians differ only slightly, so the overall median of the feature is used to impute its missing values.

Figure 3: The difference of the median values between potable and non-potable water is low.
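The imputation step described above can be sketched as follows. This is a minimal illustration on a tiny synthetic table, not the project's notebook code: the column name `Potability` and all values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Tiny synthetic table standing in for the real dataset (all values are made up).
# "Potability" is an assumed name for the label column (1 = potable, 0 = non-potable).
df = pd.DataFrame({
    "pH": [7.1, np.nan, 6.5, 8.0, np.nan],
    "Sulfate": [330.0, 310.0, np.nan, 350.0, 340.0],
    "Trihalomethanes": [66.0, 70.0, 58.0, np.nan, 62.0],
    "Potability": [0, 1, 0, 1, 0],
})
features = ["pH", "Sulfate", "Trihalomethanes"]

# Number of missing values per feature (the kind of count plotted in Figure 2).
missing_counts = df[features].isna().sum()

# Median of each feature per class (the kind of comparison shown in Figure 3).
class_medians = df.groupby("Potability")[features].median()

# Impute every missing value with the overall median of its feature.
imputed = df.copy()
imputed[features] = imputed[features].fillna(imputed[features].median())

print(missing_counts)
print(class_medians)
print(imputed)
```

If the class medians had differed strongly, imputing per class would be a reasonable alternative; with similar medians, the overall median is the simpler choice.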

Standardizing The Data

Data standardization brings data into a uniform format that makes analysis and modelling easier. In statistics, standardization refers to putting different features on the same scale, so that values of different types of features can be compared. Figure 4 shows the code used to standardize the data in Jupyter Notebook.

Figure 4: Data standardization

Data Analysis

A scatter plot matrix and a heatmap are used to examine the correlation between all the features. As can be seen from Figure 5 and Figure 6, there is very little correlation between the features.

Figure 5: Scatter Plot Matrix to represent the correlation between all features

Figure 6: Heatmap to represent the correlation between all features
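The numbers behind such a heatmap are the pairwise correlation coefficients. The sketch below computes them for synthetic, independently generated columns; the data are random and only illustrate the computation, not the project's results.

```python
import numpy as np
import pandas as pd

# Synthetic, independently drawn columns (random data, for illustration only).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(200, 3)),
    columns=["pH", "Sulfate", "Turbidity"],
)

# Pairwise Pearson correlation coefficients (the pandas default).
corr = df.corr()

# Largest absolute correlation between two *different* features:
# mask the diagonal, which is always 1.
off_diagonal = corr.where(~np.eye(len(corr), dtype=bool))
max_abs_corr = off_diagonal.abs().max().max()

print(corr.round(3))
print(max_abs_corr)
```

A heatmap is a colour rendering of the `corr` table, and a small `max_abs_corr` corresponds to the weak correlation reported for the features.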

Data Modelling

Three predictive models are built using the Logistic Regression, Random Forest and K-nearest Neighbors (KNN) algorithms. The models are evaluated by the hold-out method, with a split of 70% training set and 30% test set. The three models are then compared on their performance. Figure 7 shows the code used for the data modelling process in Jupyter Notebook.

Figure 7: Data modelling by using Logistic Regression, K-nearest Neighbors and Random Forest
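The modelling procedure can be sketched with scikit-learn as below. This is a hedged illustration, not the code in Figure 7: the data are synthetic (`make_classification`), the number of predictors is arbitrary, all hyperparameters are library defaults apart from `max_iter` and `random_state`, and fitting the scaler on the training set only is a choice made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the water samples.
X, y = make_classification(n_samples=500, n_features=9, random_state=42)

# Hold-out method: 70% training set, 30% test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Standardize using statistics from the training set only (assumption of this sketch).
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-nearest Neighbors": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Fit each model on the training set and score its precision on the test set.
scores = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    y_pred = model.predict(X_test_s)
    scores[name] = precision_score(y_test, y_pred, zero_division=0)

for name, score in scores.items():
    print(f"{name}: precision = {score:.3f}")
```

The scores printed here come from synthetic data and say nothing about the real water quality dataset; only the procedure carries over.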

Figure 8: Precision score for each predictive model

From the precision scores shown in Figure 8, we can conclude that the Logistic Regression model performs best among the three models. The difference in precision between the two best models is only 0.129591387.
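Precision and accuracy are different metrics: precision is the share of predicted-potable samples that are truly potable, TP / (TP + FP), while accuracy is the share of all predictions that are correct. The small worked example below computes both on made-up labels, assuming 1 denotes potable and 0 non-potable.

```python
def precision_and_accuracy(y_true, y_pred):
    """Return (precision, accuracy) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, accuracy


# Made-up labels for illustration: 2 true positives, 1 false positive,
# 3 true negatives and 2 false negatives.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]

precision, accuracy = precision_and_accuracy(y_true, y_pred)
print(precision, accuracy)
```

Here precision is 2/3 and accuracy is 5/8, so a ranking of models by one metric need not match the ranking by the other.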

Conclusion

1) The data contains an almost equal number of acidic and basic water samples in terms of pH level.

2) 92% of the water samples were considered hard.

3) Only 2% of the water samples were safe in terms of Chloramines levels.

4) Only 1.8% of the water samples were safe in terms of Sulfate levels.

5) 90.6% of the water samples had higher Carbon levels than the typical Carbon levels in drinking water (10 ppm).

6) 76.6% of water samples were safe for drinking in terms of Trihalomethane levels in water.

7) 90.4% of the water samples were safe for drinking in terms of the Turbidity of water samples.

8) The correlation coefficients between the features were very low.

9) Logistic Regression performed best among the three models.

WATER IS LIFE, AND CLEAN WATER MEANS HEALTH

- AUDREY HEPBURN

Credits:

Created with images by ronymichaud - "drop of water drop impact" • Deyan Georgiev - "Measure water content with digital device. PH meter." • MartinStr - "bubbles water bubbly"