The main objective of this analysis was to continue previous research on fake news classification. Previously, I used clustering algorithms to build a fake news filter and achieved decent results: the models predicted the correct classification 80 percent of the time while retaining good interpretability. In this analysis I aimed to use deep learning to improve on that classification rate, thereby increasing the model's utility. Such a model could be used in applications where filtering untruthful news is pertinent, such as educational settings.
Executive Summary:
First, a labeled dataset with over 70 thousand article observations was compiled and preliminary EDA was performed. A vectorized corpus was built from each article's title and text and then run through PCA. Numerous models were built using K-Means, Random Forests, and Neural Networks. The final model was an 11-layer sequential neural network, which produced an overall classification rate of 96 percent. The model is currently hosted on Streamlit (linked below).
Dataset:
The dataset used in this analysis contains 72 thousand rows and four columns: three feature columns (“Index”, “Title”, and “Text”) containing the row index, the title of the news article, and the article text, respectively. The fourth column, the label, indicates whether the article is fake or real. The dataset was compiled from three other Kaggle datasets [Appendix A].
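One way the three Kaggle sources could be combined into a single table with the columns described above is sketched below. This is a minimal illustration, not the author's actual code; the column mappings and helper name are assumptions, since the source datasets use varying schemas.

```python
import pandas as pd

def combine_sources(frames):
    """Normalize each source DataFrame to (title, text, label) and stack them.

    Assumes each source already has title/text/label columns in some
    capitalization; real sources may need more involved mapping.
    """
    normalized = [
        df.rename(columns=str.lower)[["title", "text", "label"]]
        for df in frames
    ]
    combined = pd.concat(normalized, ignore_index=True)
    combined.insert(0, "index", range(len(combined)))  # the "Index" column
    return combined
```

Using `ignore_index=True` gives the combined table a clean consecutive index rather than repeating each source's row numbers.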
EDA:
Given previous experience with this dataset, initial exploratory data analysis was rather short. A 20 percent subset of the full dataset was investigated first, in the hope of decreasing training times while still revealing general trends; the full dataset was later tested on the final model. The subset was checked for skew and was found to have an even distribution of fake versus real news articles. Before building the models, a corpus had to be generated from each article’s text. To that end, some basic NLP preprocessing was done: all words were lowercased, numbers were excluded, punctuation was removed, and all stop words were removed. The articles were then vectorized and fed into the models.
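The preprocessing steps above can be sketched as follows. This is a minimal stand-in, not the author's actual pipeline: the tiny `STOP_WORDS` set is a placeholder for a full stop-word list (e.g. NLTK's English list).

```python
import re
import string

# Placeholder stop-word list; a real run would use a full English list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was"}

def preprocess(text):
    """Lowercase, drop numbers, strip punctuation, remove stop words."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)  # exclude numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in STOP_WORDS]
```

The resulting token lists can then be vectorized (for example with a bag-of-words or TF-IDF representation) before being fed into the models.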
Final Model and Key Findings:
The final model was an 11-layer sequential neural network composed of one embedding layer, one MaxPooling layer, one convolutional layer, two LSTM layers, a couple of dropout layers, and several dense layers. The model used the Adam optimizer and a binary cross-entropy loss function, and it predicted the correct classification 96 percent of the time. Words such as ‘Trump’, ‘President’, and ‘people’ were commonly found in both real and fake articles, while words such as ‘Clinton’, ‘think’, and ‘reality’ were commonly found in real articles, and words such as ‘Government’, ‘illegal’, and ‘Reuters’ were commonly found in fake articles. Only the text of each article was analyzed here; to further this research, the title and/or the source of publication could be incorporated into the model, and more sophisticated natural language preprocessing techniques could be applied.
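A plausible Keras sketch of the 11-layer architecture described above is shown below. The layer sizes, kernel width, dropout rates, and the exact ordering of the dropout and dense layers are assumptions; the report only specifies the layer types, the Adam optimizer, and the binary cross-entropy loss.

```python
from tensorflow.keras import layers, models

def build_model(vocab_size=20000):
    """Sketch of an 11-layer sequential fake-news classifier.

    Embedding -> Conv1D -> MaxPooling -> 2x LSTM (with dropout)
    -> dense head with a sigmoid output for the fake/real label.
    All hyperparameters here are illustrative assumptions.
    """
    model = models.Sequential([
        layers.Embedding(vocab_size, 64),
        layers.Conv1D(64, 5, activation="relu"),
        layers.MaxPooling1D(4),
        layers.LSTM(64, return_sequences=True),
        layers.Dropout(0.3),
        layers.LSTM(32),
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(16, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # fake vs. real
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

Binary cross-entropy with a single sigmoid output is the standard pairing for a two-class (fake/real) label like the one in this dataset.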
Appendix A: Kaggle Datasets:
https://www.kaggle.com/datasets/hassanamin/textdb3
https://www.kaggle.com/datasets/mrisdal/fake-news
https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification