Our project uses multiple data sources, machine learning models, and AI tools to accurately identify fake information. Here are the main steps and models we used:
For the tweet prediction model, we used the "Truth Seeker 2023" dataset from the Canadian Institute for Cybersecurity at the University of New Brunswick. The dataset was originally gathered from roughly 180,000 tweets posted in 2016-2017, largely about news and politics. For the news prediction model, we used the Kaggle Fake & Real News dataset, which consists of ~45,000 news articles (23k fake, 21k real). We analyzed the data characteristics to guide further cleaning: making the subject column consistent, unifying letter case, tokenizing text, and removing common (stop) words.
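A minimal sketch of these cleaning steps is below; the stop-word set and subject mapping shown here are illustrative placeholders, not the exact lists used in the project:

```python
import re

# Common English stop words (a small illustrative subset, not the full list used)
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "on", "for"}

# Hypothetical mapping to make the dataset's inconsistent "subject" labels uniform
SUBJECT_MAP = {"politicsNews": "politics", "Politics": "politics",
               "worldnews": "world", "News": "news"}

def clean_text(text: str) -> list[str]:
    """Lowercase, tokenize on word characters, and drop common (stop) words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def normalize_subject(subject: str) -> str:
    """Collapse inconsistent subject labels into one canonical form."""
    return SUBJECT_MAP.get(subject, subject.lower())
```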
We explored logistic regression, Random Forest, XGBoost, and BERT. BERT was chosen for tweets because it processes natural language directly: it reads the full text of each tweet and learns meaning from context, sentence structure, and the way ideas are expressed. XGBoost was chosen for news because it performed most reliably on noisy text, nonlinear patterns, and imbalanced subject distributions.
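To sketch the feature side of the news pipeline, the snippet below hand-rolls TF-IDF weighting with the standard library; in the project itself such features would come from a proper vectorizer and feed the XGBoost classifier, and the tiny corpus here is purely illustrative:

```python
import math
from collections import Counter

def tfidf(corpus: list[list[str]]) -> list[dict[str, float]]:
    """Compute TF-IDF weights for each tokenized document in the corpus."""
    n_docs = len(corpus)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({
            # Term frequency scaled by inverse document frequency
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights
```

A term appearing in every document (like "news" below) gets weight 0, while rarer terms score higher, which is the property that makes these features useful to a downstream classifier.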
We developed a Python script and a purpose-built prompt to query ChatGPT whenever a user submits text. ChatGPT returns an additional True/False prediction and a confidence score based on its general language knowledge.
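A hedged sketch of that query flow is below; the prompt wording and response shape are illustrative assumptions, and the actual call to the OpenAI API is stubbed out:

```python
import json

def build_prompt(user_text: str) -> str:
    """Assemble the classification prompt sent to ChatGPT (illustrative wording)."""
    return (
        "Decide whether the following text contains fake information. "
        'Respond with JSON: {"prediction": true or false, "confidence": 0-1}.\n\n'
        f"Text: {user_text}"
    )

def parse_response(raw: str) -> tuple[bool, float]:
    """Extract the True/False prediction and confidence score from the JSON reply."""
    data = json.loads(raw)
    return bool(data["prediction"]), float(data["confidence"])
```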
Framework: The prediction service is built with Python Flask. Model Integration: The API integrates and hosts all three core models: BERT, XGBoost, and the ChatGPT query pipeline. AWS API Gateway forwards requests from our GitHub Pages website to our EC2 instance and returns structured JSON results; to ensure reliable web access, Cross-Origin Resource Sharing (CORS) is configured to allow requests from the GitHub Pages frontend.
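The structured JSON result the API returns can be sketched as follows; the field names and the simple majority vote are illustrative assumptions, and the Flask route, API Gateway, and CORS configuration are omitted:

```python
import json

def build_result(bert_pred: bool, bert_conf: float,
                 xgb_pred: bool, xgb_conf: float,
                 gpt_pred: bool, gpt_conf: float) -> str:
    """Combine the three model outputs into one structured JSON payload."""
    result = {
        "models": {
            "bert":    {"prediction": bert_pred, "confidence": bert_conf},
            "xgboost": {"prediction": xgb_pred,  "confidence": xgb_conf},
            "chatgpt": {"prediction": gpt_pred,  "confidence": gpt_conf},
        },
        # Simple majority vote across the three predictions (illustrative)
        "consensus": [bert_pred, xgb_pred, gpt_pred].count(True) >= 2,
    }
    return json.dumps(result)
```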