This paper is based on a research project from June-August 2023. It uses a variety of machine learning and language processing techniques to detect fake news.


Automated fake news detection through contextual
similarity comparison
Dhruv Agrawal
---@unsw.edu.au
Duke Nguyen
---@unsw.edu.au
Jim Tang
---@unsw.edu.au
August 17, 2023
Contents
1 Introduction
2 Related work
  2.1 Similarity Comparison using Titles
  2.2 Similarity Comparison using Fixed News Database
  2.3 Linguistic and Additional Features
3 Methods
  3.1 Preprocessing and tokenization
  3.2 Feature: BERT embeddings
  3.3 Feature: Non-latent features
  3.4 Feature: Similarity model
    3.4.1 Summary extraction
    3.4.2 Article scraping
    3.4.3 Article vectorisation
    3.4.4 Similarity metric calculation
    3.4.5 Similarity metric selection
  3.5 Feature Normalisation
  3.6 Model: Machine learning
  3.7 Model: Neural networks
4 Experimental setup
  4.1 Dataset
  4.2 Evaluation metrics
5 Results and discussion
  5.1 Feature Analysis
    5.1.1 Non-Latent Feature Selection
    5.1.2 Similarity Metrics Comparison
    5.1.3 Data Analysis Using PCA and KMeans
  5.2 Model Results
6 Conclusion
  6.1 Summary of Contributions
  6.2 Limitations and Future Work
  6.3 Attributions
Appendix A Article scraping
Appendix B Word Count Histogram
Appendix C Non-Latent Features
Appendix D Non-latent Feature Pearson Correlation Matrix
Appendix E ANOVA of Non-Latent Feature against Labels
Appendix F Machine learning
1 Introduction
As the distribution of news shifts towards social media, there is a rise in the dissemination of fake news
[1]. We define fake news as the creation of information presented as legitimate journalism that describes
a fictitious event or fabricates details of an event in an attempt to mislead readers. This phenomenon
became a major focal point in journalism during the 2016 US elections with political parties labelling
many publications and articles as fake news. There are two huge issues with how this occurred in 2016:
1. Many prominent political figures highlighted any news they disagreed with as fake news. This led
to the political isolation of their parties, whereby any news that portrayed them in a negative light
had the potential to be dismissed as fake news. This reduced the accountability of political figures in
the US, a country whose federal legislation has a sweeping impact across the country and the rest of
the world.
2. There was a lack of fake news detection tools on social media, and due to the polarisation of the media
climate, it was extremely difficult for social media platforms to regulate the articles published on them or
remove factually incorrect articles posted or shared by politicians and Americans [2].
Since then, there have been many attempts to address these issues, such as Politifact, which manually
reviews articles and social media posts for factual correctness and publishes its findings on an easily
accessible website. Other similar websites exist, but manual fact-checking tools are not prominent in
spaces with high volumes of fake news because manual review cannot scale to the number of news articles
and journalistic social media posts published every day.
There are also many automated fake news detection algorithms which rely on linguistic features of
the text, comparisons between the title of the article and its content, the medium of transmission, and
any suspicious activity linked with its engagement online. These tools have become more effective since
COVID-19, and Twitter employs its own algorithms to automatically label articles about certain topics
it chooses as fake news [3]. However, these tools are known to be very unreliable, as it is increasingly
common for fake news to read the same way as real news articles and to be shared by humans.
Therefore, in order for fake news detection to become more widespread and effective in combating fake
news, there are a few different criteria it must fulfil:
1. The algorithms need to automatically classify news as real or fake so that they can scale with the
growth of social media and the increase in fake news dissemination.
2. The algorithms need to incorporate current methods of fake news detection as these have been highly
researched and are effective in many situations such as when fake news has been automatically
generated or constructed in a highly polarised manner designed to provoke intense responses from
readers.
3. The algorithms need to examine an article’s content and meaning beyond its writing style to combat
fake news that is well-written and designed to look like real news.
4. The dataset used to train and assess the algorithms must contain real and fake articles that are
written in the same style so that it is not apparent simply from the way an article is written whether
it is real or fake.
Our approach improves upon existing approaches that focus on the first and second criteria, or on the
third. Most models rarely address both the second and third criteria, and ours aims to combine them
to analyse both the content and the style of articles. The model can then make a significantly more
informed classification decision. The model is restricted to a binary classification: it outputs either real or
fake rather than a confidence metric of an article's legitimacy. This is done because the aim is for the tool
to be easily adopted, so the simplicity of its input and output is a priority.
We compiled a list of commonly used linguistic features for fake news detection. Multiple different
pairings of features were formed and analysis was conducted to determine the most effective linguistic
features for the task. This builds on existing research into fake news detection and puts our model in line with
current methods. A new feature, lexicographical similarity, is used to achieve the third criterion above.
At a high level, this feature compares the queried article to other articles at the top of Google News
search results and applies various algorithms based on the individual words shared between the
articles. As these Google News articles have high PageRank scores, the model can be confident that they
are real articles, and it compares the similarity of the content between the queried article and each of these
top-ranked articles. This is a way for the model to infer context before making a judgement on
an article's legitimacy. This approach brings our model in line with the way humans manually fact-check
articles, which usually involves finding known trustworthy sources and comparing their content against the
queried article to determine whether it is consistent with the trustworthy ones.
Section 2 discusses related work in the study of automated fake news detection. It includes research
on existing similarity comparison models and linguistic features useful for this task. Section 3 covers the
methodology of our approach. Specifically, it covers preprocessing the input and queried articles, the
BERT, linguistic and similarity feature calculations, and classification with machine learning and neural
networks. Section 4 analyses the experimental setup and explains the reasoning behind the evaluation
metrics being used. Section 5 discusses the results of feature analysis of linguistic features, PCA and
KMeans analysis on the collected features, and a comparison between machine learning and neural network
approaches for classification. Section 6 covers our contributions, limitations and future work, and
attributions for the work.
Through this research, we have analysed and determined the most effective linguistic features for fake
news detection and shown that the use of similarity as a metric is effective in building upon these current
metrics to increase accuracy. We have also compared the use of the similarity metric with different machine
learning classifiers and discovered that it greatly increases the accuracy of less complex machine learning
methods and brings their performance in line with complex models.
2 Related work
2.1 Similarity Comparison using Titles
Methods of similarity comparison have been attempted in a few fake news detection papers. Antoun et al.
[4] presents an analysis of various fake news detection methods, including an automated model that won an
international competition on this topic. This model used the title of the queried article to search Google
for similar articles and compared the word embeddings of the title against those of the top 5 search
results. It used cosine similarity only and tested the similarity scores along with other features such as
lexicon-based features. Many classifiers were tested, including SVM, XGBoost and random forest classifiers.
This resulted in a model that was effective in detecting articles where there was a lot of emphasis on the
title for fake news classification.
2.2 Similarity Comparison using Fixed News Database
A different approach to similarity detection is used by Alsuliman et al. [5], where similarity scores are
generated between the queried article and every article in a database of news articles. This method takes
the highest similarity score for each queried article and then uses a greedy approach to set a similarity
score threshold for real articles that maximises the overall accuracy. The paper uses three well-studied
techniques for similarity score calculation: cosine similarity, word appearance and TF-IDF. It explains
how these metrics are used within this problem and contrasts the results of each. The most significant
limitation of this paper is that the most similar articles selected were rarely relevant to the topic
of the queried article.
This paper forms the basis of our set of similarity metrics, and we analyse each of the three metrics given.
Our approach improves upon this methodology by capitalising on Google's PageRank
algorithm to only select articles with a high chance of being relevant to the queried topic. Additionally,
our dataset consists of around 10 times as many articles, and our classification is done using machine
learning and deep learning classifiers rather than a greedy approach, to better fit a realistic situation where
fake news detection is required.
2.3 Linguistic and Additional Features
Vijayaraghavan et al. [6] presents a series of different fake news models alongside linguistic features useful for
this task. It addresses preprocessing techniques, including the removal of punctuation and stop words, which
is an effective technique in this domain. It analyses the polarity of news articles and determines that both
real and fake news have similar polarity distributions, making polarity an ineffective way of distinguishing
between them. It also analyses the part-of-speech distribution between fake and real news and finds
that the numbers of adverbs and adjectives are higher in fake news, whereas the numbers of nouns
and pronouns are higher in real news. This is because fake news relies on descriptive language to establish
facts, whereas real news refers to research and experts to support its legitimacy. Zhou & Zafarani [7] produced
a comprehensive survey on non-latent features which this research refers to extensively in Section 3.3,
along with Garg & Sharma [8], and Horne & Adali [9].
3 Methods
Figure 1: Our classification pipeline.
Figure 1 shows our mostly linear classification pipeline. After preprocessing and tokenization, we extract
contextual articles which are fed into a similarity model to form our first feature. Additionally, non-latent
features from raw text and BERT embeddings form the rest of our features. The concatenation of all the
features is fed into our classification models, which infer a binary classification label.
3.1 Preprocessing and tokenization
Before extracting any features, we will preprocess our input and convert the long form text into tokens.
We perform the following preprocessing methods in order:
Remove non-ascii: Our input articles contained unnecessary unicode tokens such as unicode
double quotation marks. These can be removed safely since they do not add any extra semantics
to the input articles and may confuse feature extraction.
Convert to lowercase: In our research, we converted all text to lowercase. However, upon
further analysis, converting all text to lowercase hid acronyms such as “US” which could have
affected the main themes of the text. Further, all proper nouns such as names and places were
also hidden. We will discuss this limitation in Section 6.2.
Lemmatization: We used the nltk [10] library to reduce words down to their lemma in the
hope of reducing the complexity within our text, which may benefit feature extraction.
ID 118 (Real)
Article extract: FBI Director James Comey said Sunday that the bureau won't change the conclusion it made in July after it examined newly revealed emails related to the Hillary Clinton probe. "Based on our review, we have not changed our conclusions that we expressed in July with respect to Secretary Clinton" Comey wrote in a letter to 16 members of Congress. [...]
Tokens: ['fbi', 'director', 'james', 'comey', 'said', 'sunday', 'bureau', 'change', 'conclusion', 'made', 'july', 'examined', 'newly', 'revealed', 'email', 'related', 'hillary', 'clinton', 'probe', '.', '"', 'based', 'review', ',', 'changed', 'conclusion', 'expressed', 'july', 'respect', 'secretary', 'clinton', ...]

ID 15 (Fake)
Article extract: After hearing about 200 Marines left stranded after returning home from Operation Desert Storm back in 1991, Donald J. Trump came to the aid of those Marines by sending one of his planes to Camp Lejuene, North Carolina to transport them back home to their families in Miami, Florida. Corporal Ryan Stickney was amongst the group that was stuck in North Carolina and could not make their way back to their homes. [...]
Tokens: ['hearing', '200', 'marines', 'left', 'stranded', 'returning', 'home', 'operation', 'desert', 'storm', 'back', '1991', ',', 'donald', 'j', '.', 'trump', 'came', 'aid', 'marines', 'sending', 'one', 'plane', 'camp', 'lejuene', ',', 'north', 'carolina', 'transport', 'back', 'home', 'family', 'miami', ...]

Table 1: Examples of preprocessing and tokenization on items in the dataset.
The lemmatizer looks up each word in the WordNet corpus to get its lemma. Later in the research, we realised
that this hypothesis may not have been accurate.
Firstly, the nltk library we were using does not automatically detect the part of speech and will,
by default, only lemmatize nouns. While it is arguably better for us to maintain the tense of
verbs, we are technically not lemmatizing fully. Secondly, from further research, lemmatization
may not be ideal for BERT embeddings since it removes some semantics that could be learnt
by the BERT model. We will discuss these limitations further in Section 6.2.
Remove stopwords: Stopwords were removed from the text in order to reduce complexity.
Apart from the above methods, we also tested removing punctuation. However, this was not used in
the end since we added non-latent features to measure punctuation counts and also to maintain semantics
for BERT.
After preprocessing, tokens are then generated based on any whitespace and punctuation in the re-
maining text. Table 1 shows samples of tokenized input articles.
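As a rough illustration of the steps above, the sketch below uses the nltk library. The function name, the exact ordering of operations, and the ASCII-stripping shortcut are our own simplifications for illustration rather than an exact reproduction of our pipeline.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-off downloads of the required nltk resources.
for resource in ("punkt", "wordnet", "stopwords"):
    nltk.download(resource, quiet=True)

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text: str) -> list[str]:
    text = text.encode("ascii", "ignore").decode()      # drop non-ASCII characters
    text = text.lower()                                 # lowercase (see Section 6.2 for caveats)
    tokens = word_tokenize(text)                        # split on whitespace and punctuation
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]  # noun-only lemmatization by default
    return [t for t in tokens if t not in STOPWORDS]    # remove stopwords
```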
3.2 Feature: BERT embeddings
BERT (Bidirectional Encoder Representations from Transformers) is a language representation model
proposed by Devlin et al. [11]. It is pre-trained on BookCorpus and the English Wikipedia using masked
language modelling (MLM) and next sentence prediction (NSP). MLM masks some of the input tokens with
the training objective of predicting the masked tokens from context alone; for NSP, the model concatenates
two sentences, with a 50% chance of them being neighbouring sentences, and is pre-trained to predict
whether the two are indeed neighbours [11]. BERT obtains SOTA results on several tasks and is
suitable for representing document and textual data. Hence, we will be using BERT as the main feature
to encode our articles. We will use HuggingFace's 'bert-base-uncased' pretrained model, which is trained on
uncased data. To encode an article, we truncate it to the first 512 tokens, pass it through 'bert-base-uncased',
and output the CLS token's vector as the BERT features for our classification model. Due to this
truncation, the BERT encoding won't be able to capture the entire article content. However, we hypothesise
that the ’fake news’ quality of an article is at the minimum visible at a range of 512 tokens (see BERT
Features in Section 6.2 for further discussion). In addition, any information not visible in this range will
be captured by non-latent features or similarity metrics.
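A minimal sketch of this encoding step with the HuggingFace transformers library is shown below; the helper name and tensor handling are illustrative assumptions rather than our exact implementation.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bert_cls_embedding(article: str) -> torch.Tensor:
    # Truncate to BERT's 512-token limit and return the [CLS] vector (768 dimensions).
    inputs = tokenizer(article, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0]
```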
3.3 Feature: Non-latent features
From our literature review and survey, we are able to identify a significant number of non-latent features
[7, 8, 9]. After combining features that are similar, and removing features which we cannot calculate due to
the need for proprietary software (e.g. LIWC), the computational complexity of the algorithms,
or related reasons, we are able to identify 81 numerical features suitable for our experiments. Table 2
shows the 7 main categories of our features.
Type | Description | Examples
Diversity | Number of unique words, or percentage of all unique words, of a particular part of speech | Noun, verb
Quantity | Number of words, or percentage of all words, of a particular part of speech or linguistic unit | Noun, adjective, quote
Sentiment | Number of linguistic features denoting sentiment | Exclamation marks, all-cap words, polarity, subjectivity
Pronoun | Number of pronouns of a specific class | First person singular pronouns: I, me, my, mine
Average | Average number of one linguistic unit per another linguistic unit | Characters per word
Syntax Tree Depth | Median syntax tree depth of a given unit | Median noun phrase syntax tree depth
Readability | Measures of how complex and interpretable a text is | Gunning-Fog, Coleman-Liau

Table 2: The categories that our 81 features fall under, along with a description and examples.
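To make the categories concrete, the sketch below computes a handful of illustrative features from raw text and its tokens; the feature names and exact definitions here are simplified examples and may differ from those in our 81-feature set.

```python
def example_nonlatent_features(text: str, tokens: list[str]) -> dict[str, float]:
    words = [t for t in tokens if t.isalpha()]
    raw_words = text.split()
    return {
        "num_words": len(words),                                               # Quantity
        "pct_unique_words": len(set(words)) / max(len(words), 1),              # Diversity
        "num_exclamation_marks": text.count("!"),                              # Sentiment
        "num_allcap_words": sum(w.isupper() and len(w) > 1 for w in raw_words),
        "num_first_person_sg": sum(w.lower() in {"i", "me", "my", "mine"} for w in raw_words),  # Pronoun
        "chars_per_word": sum(len(w) for w in words) / max(len(words), 1),     # Average
    }
```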
We display all our features, including our internal names, in Table 7. Using these features, we apply
ANOVA (Analysis of Variance, see Section 3.4.5 for an overview) and keep those with a p-value below the
α significance level of 0.05. We remove the features with p-value above α, then apply Pearson correlation
and identify all correlation clusters. From each cluster we keep only the feature with the lowest p-value:
we sort the selected features by p-value in ascending order, then walk through the list and, for each
feature, remove every later feature with which it has a correlation of at least 0.95. The features that remain
are our non-latent features for our classification models.
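A sketch of this selection procedure, assuming the features are held in a pandas DataFrame and the labels in a Series, might look as follows. We use scipy's one-way ANOVA here; this is not necessarily the exact routine used in our experiments.

```python
import pandas as pd
from scipy.stats import f_oneway

def select_features(df: pd.DataFrame, labels: pd.Series,
                    alpha: float = 0.05, corr_cutoff: float = 0.95) -> list[str]:
    # 1. One-way ANOVA of each feature against the class labels.
    pvals = {}
    for col in df.columns:
        groups = [df.loc[labels == c, col] for c in labels.unique()]
        pvals[col] = f_oneway(*groups).pvalue
    kept = [c for c in df.columns if pvals[c] < alpha]

    # 2. Greedy correlation filter: keep only the lowest-p feature of each
    #    highly correlated cluster (|Pearson r| >= corr_cutoff).
    kept.sort(key=lambda c: pvals[c])
    corr = df[kept].corr().abs()
    selected = []
    for col in kept:
        if all(corr.loc[col, s] < corr_cutoff for s in selected):
            selected.append(col)
    return selected
```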
3.4 Feature: Similarity model
As a novel feature, we investigate the similarity of our input articles to contextual articles found online.
In Sections 3.4.1 and 3.4.2 we will discuss our process of gathering three contextual articles from online
sources which we treat as truth. These three articles extend our original dataset and are vectorized then
fed into a similarity model which we describe in Sections 3.4.3 and 3.4.4. We use this similarity as a feature
to ascertain whether our input article contains misinformation.
3.4.1 Summary extraction
To get the context articles, we need to summarize the main topic of our input article in at most
10 keywords. We use the Python gensim [12] library, which provides various topic modelling interfaces
for text inputs. We use its ldamodel, which implements Latent Dirichlet Allocation (LDA), to extract a
single topic. LDA is a probabilistic model which assumes that each document is generated from a set of
latent topics, each characterized by a distribution over words. By feeding in the preprocessed sentences
of our input article as individual documents, we are able to extract the article's main themes. We sort the
output keywords by the probability with which they represent the topic, then cap the number of keywords
at 10. We chose LDA because it is a simple algorithm that outputs reproducible and reliable results and
can be customized in the number of topics and keywords it generates.
For the scope of our research, we were able to manually validate the extracted summaries to check that
each summary represented the article content well. Table 3 shows some samples of items in our
dataset after applying LDA. We see that while the extracted summaries are not perfect, they still represent
the general meaning of the article and were sufficient for the purposes of our research. Two common issues
we saw were:
Unordered words in the summary: the words representing the topics appear in no particular order. A
human reading the summary by itself may recognise that the words are all keywords of
the article, but put together as a sentence they do not completely make sense. We hypothesize that this
could have caused sub-optimal results when we started scraping articles using the summaries.
Appearance of stop words and other meaningless non-topic words in the summary: as a flow-on
issue from our preprocessing, our summaries were left with words such as “wa” (from “was”) or “ha”
(from “has”). This would have impacted the meaning of our summaries and later article scraping.
We will discuss the possibility of extracting better summaries using a more robust model in Section 6.2.
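The extraction can be sketched with gensim as below; the function name and the fixed random seed are our own additions for illustration and not necessarily the settings we used.

```python
from gensim import corpora
from gensim.models import LdaModel

def extract_summary(sentence_tokens: list[list[str]], max_keywords: int = 10) -> list[str]:
    # Each preprocessed sentence of the input article is treated as one "document".
    dictionary = corpora.Dictionary(sentence_tokens)
    bow_corpus = [dictionary.doc2bow(sent) for sent in sentence_tokens]
    lda = LdaModel(bow_corpus, num_topics=1, id2word=dictionary, random_state=0)
    # Keywords of the single topic, sorted by probability, capped at max_keywords.
    return [word for word, _prob in lda.show_topic(0, topn=max_keywords)]
```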
3.4.2 Article scraping
We feed the summary of the input article into Google News and collect the top three articles. We chose
Google News since it is a well-known search engine that returns the most popular articles on the internet.
This PageRank-style ranking is important since we assume that the contextual articles are real and
describe the current word-of-mouth truth from the internet. Therefore, for purposes of comparison,
an input article that is very different to our contextual articles is likely to be of the FAKE class.
For our research, we manually fed in all summaries for our dataset. Our motivation for this
research was to develop a tool that a user could potentially use to figure out whether the news they are
currently reading contains misinformation. We acknowledge there exist APIs that provide either a wrapper
around Google News or their own news search algorithm, which we could have looked into. However,
given the size of the dataset and our scope, this was not necessary to demonstrate our system.
SETUP: We use a virtual machine with a freshly installed, up-to-date version of Google Chrome.
Searches are conducted in “Incognito Mode” tabs. We also use a VPN to the west coast of the
US. These invariants ensure that Google does not give personalized results based on a browser
fingerprint or IP address. We chose the US as the VPN destination since our dataset articles
were extracted from US news sources and we wanted to scrape articles with a similar style of
writing. If you were to use the tool in Australia, Google would usually return articles from local
sources. We restrict our scope to specifically this dataset rather than training on a wide dataset
from all sources.
ID 118 (Real)
Article extract: FBI Director James Comey said Sunday that the bureau won't change the conclusion it made in July after it examined newly revealed emails related to the Hillary Clinton probe. "Based on our review, we have not changed our conclusions that we expressed in July with respect to Secretary Clinton" Comey wrote in a letter to 16 members of Congress. [...]
Summary: email review fbi clinton said july comey news new wa

ID 15 (Fake)
Article extract: After hearing about 200 Marines left stranded after returning home from Operation Desert Storm back in 1991, Donald J. Trump came to the aid of those Marines by sending one of his planes to Camp Lejuene, North Carolina to transport them back home to their families in Miami, Florida. Corporal Ryan Stickney was amongst the group that was stuck in North Carolina and could not make their way back to their homes. [...]
Summary: home marines trump wa stickney way north plane family

Table 3: Examples of summary extraction on items in the dataset.
Another invariant we implement is to append before:2020 to our summary. This forces Google
News to only return articles published before 2020, so that the articles we retrieve are not recent
news. A common discussion topic in our dataset was Donald Trump’s 2016 election campaign,
and news regarding Trump in 2023 is very different to that of 2016. Since our dataset is not
recent, clamping the date of the contextual articles simulates searching for similar articles at
the time the input article was being read, when few future articles would have been available.
PROCESS: We attempt to get the top three articles and save the URL for each input article.
Not all summaries returned three articles, so we perform scraping in three passes:
1. We enter the whole summary without any changes. This is the most ideal approach and
the most machine-replicable. This covered 70% of our dataset.
2. Still performing only generic actions, we remove any bad words or unimportant connectives
and search again. This should still be machine-replicable with further work. This
covered the next 20% of our dataset.
3. For the last 10% of our dataset, we had to manually look at the input article content and
the generated summary to figure out why we still received no results. Our hypothesis was
that this was a combination of our non-tuned summary extraction and the fact that some
outrageous Fake articles simply did not have any similar articles that could be found. We
will discuss this limitation in Section 6.2.
From the above passes, we were not able to find context articles for four input articles, listed
in a table in Appendix A. Furthermore, we were only able to find one or two articles for some
inputs, but we can still continue with our similarity model.
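Although we performed the searches manually, the query construction itself is mechanical. The sketch below shows how a search query, including the before:2020 restriction, could be built; the Google News URL pattern is an assumption and the scraping of the result page is deliberately left out.

```python
from urllib.parse import quote_plus

def build_search_url(summary_keywords: list[str], before_year: int = 2020) -> str:
    # Join the LDA keywords and append the date restriction used in our manual searches.
    query = " ".join(summary_keywords) + f" before:{before_year}"
    return "https://news.google.com/search?q=" + quote_plus(query)

# Example: the summary of article 118 from Table 3.
print(build_search_url(["email", "review", "fbi", "clinton", "said", "july", "comey", "news", "new"]))
```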