This paper is based on a research project from June-August 2023. It uses a variety of machine learning and language processing techniques to detect fake news.


Automated fake news detection through contextual
similarity comparison
Dhruv Agrawal
---@unsw.edu.au
Duke Nguyen
---@unsw.edu.au
Jim Tang
---@unsw.edu.au
August 17, 2023
Contents
1 Introduction
2 Related work
2.1 Similarity Comparison using Titles
2.2 Similarity Comparison using Fixed News Database
2.3 Linguistic and Additional Features
3 Methods
3.1 Preprocessing and tokenization
3.2 Feature: BERT embeddings
3.3 Feature: Non-latent features
3.4 Feature: Similarity model
3.4.1 Summary extraction
3.4.2 Article scraping
3.4.3 Article vectorisation
3.4.4 Similarity metric calculation
3.4.5 Similarity metric selection
3.5 Feature Normalisation
3.6 Model: Machine learning
3.7 Model: Neural networks
4 Experimental setup
4.1 Dataset
4.2 Evaluation metrics
5 Results and discussion
5.1 Feature Analysis
5.1.1 Non-Latent Feature Selection
5.1.2 Similarity Metrics Comparison
5.1.3 Data Analysis Using PCA and KMeans
5.2 Model Results
6 Conclusion
6.1 Summary of Contributions
6.2 Limitations and Future Work
6.3 Attributions
Appendix A Article scraping
Appendix B Word Count Histogram
Appendix C Non-Latent Features
Appendix D Non-latent Feature Pearson Correlation Matrix
Appendix E ANOVA of Non-Latent Feature against Labels
Appendix F Machine learning
1 Introduction
As the distribution of news shifts towards social media, there is a rise in the dissemination of fake news
[1]. We define fake news as the creation of information presented as legitimate journalism that describes
a fictitious event or fabricates details of an event in an attempt to mislead readers. This phenomenon
became a major focal point in journalism during the 2016 US elections with political parties labelling
many publications and articles as fake news. There are two huge issues with how this occurred in 2016:
1. Many prominent political figures highlighted any news they disagreed with as fake news. This led
to the political isolation of the party, whereby any news that had portrayed them in a negative light
had the potential to be dismissed as fake news. This reduced the accountability of political figures in
the US, a country where federal legislation has a sweeping impact across the country and the rest of
the world.
2. There was a lack of fake news detection tools on social media and due to the polarisation of the media
climate, it was extremely difficult for social media to regulate articles published on the platforms or
remove factually incorrect articles posted or shared by politicians and Americans [2].
Since then, there have been many attempts to introduce ways to deal with these issues such as Politifact
which manually reviews articles and social media posts for factual correctness. It posts its findings on its
website and is easily accessible for people. Other similar websites exist but the reason manual fact-checking
tools are not as prominent in spaces with high amounts of fake news is because it is impossible for manual
review tools to scale to the number of news articles and journalistic social media posts published every
day.
There are also many automated fake news detection algorithms which rely on linguistic features of
the text, comparisons between the title of the article and its content, the medium of transmission, and
any suspicious activity linked with its engagement online. These tools have become more effective since
COVID-19 and Twitter employs its own automated algorithms to automatically label articles about certain
topics it chooses as fake news [3]. However, these tools are known to be very unreliable as it is increasingly
common for fake news to read the same way as real news articles and be shared by humans.
Therefore, in order for fake news detection to become more widespread and effective in combating fake
news, there are a few different criteria it must fulfil:
1. The algorithms need to automatically classify news as real or fake so that they can scale with the
growth of social media and the increase in fake news dissemination.
2. The algorithms need to incorporate current methods of fake news detection as these have been highly
researched and are effective in many situations such as when fake news has been automatically
generated or constructed in a highly polarised manner designed to provoke intense responses from
readers.
3. The algorithms need to examine an article’s content and meaning beyond its writing style to combat
fake news that is well-written and designed to look like real news.
4. The dataset used to train and assess the algorithms must contain real and fake articles that are
written in the same style so that it is not apparent simply from the way an article is written whether
it is real or fake.
Our approach improves upon existing approaches that focus on the first and second criteria, or on the
third. Most models rarely address both the second and third criteria, and ours aims to combine them
to analyse both the content and style of articles, allowing the model to make a significantly more informed
decision on its classification. The model is restricted to a binary classification - it outputs either real or
fake rather than giving a confidence metric of an article's legitimacy. This is done because the aim is for the
tool to be easily adopted, so keeping the input and output simple is a priority.
We compiled a list of commonly used linguistic features for fake news detection. Multiple different
pairings of features were formed and analysis was conducted to determine the most effective linguistic
features for the task. This takes existing research into fake news detection and puts our model in line with
current methods. A new feature - lexicographical similarity - is used to achieve the third criterion above.
At a high level, this feature compares the queried article to other articles on Google news which are at the
top of Google’s searches and applies various algorithms based on the individual words shared between the
articles. As these Google News articles have high PageRank scores, the model can be confident that they
are real articles, and it compares the similarity of the content between the queried article and each of these
top searched articles. This is done as a way for the model to infer context before making a judgement on
an article's legitimacy. This approach brings our model in line with the way humans manually fact-check
articles, which usually involves finding known trustworthy sources and comparing the content between the
articles to determine whether the queried one is consistent with the trustworthy ones.
Section 2 discusses related work in the study of automated fake news detection. It includes research
on existing similarity comparison models and linguistic features useful for this task. Section 3 covers the
methodology of our approach. Specifically, it covers preprocessing the input and queried articles, the
BERT, linguistic and similarity feature calculations, and classification with machine learning and neural
networks. Section 4 analyses the experimental setup and explains the reasoning behind the evaluation
metrics being used. Section 5 discusses the results of feature analysis of linguistic features, PCA and
KMeans analysis on the collected features, and a comparison between machine learning and neural network
approaches for classification. Section 6 covers our contributions, limitations and future work, and
attributions for the work.
Through this research, we have analysed and determined the most effective linguistic features for fake
news detection and shown that the use of similarity as a metric is effective in building upon these current
metrics to increase accuracy. We have also compared the use of the similarity metric with different machine
learning classifiers and discovered that it greatly increases the accuracy of less complex machine learning
methods and brings their performance in line with complex models.
2 Related work
2.1 Similarity Comparison using Titles
Methods of similarity comparison have been attempted in a few fake news detection papers. Antoun et al.
[4] presents an analysis of various fake news detection methods, including the automated model that won an
international competition on this topic. This model used the title of the queried article to search Google
for similar articles and compared the word embeddings for the title against those of the top 5 searched
articles. It used cosine similarity only and tested the similarity scores along with other features such as
lexicon based features. Many classifiers were tested including SVM, XGBoost and random forest classifiers.
This resulted in a model that was effective in detecting articles where there was a lot of emphasis on the
title for fake news classification.
2.2 Similarity Comparison using Fixed News Database
A different approach to similarity detection is used in Alsuliman et al. [5] where similarity scores are
generated between the queried article and every article in a database of news articles. This method takes
the highest similarity score for each queried article and then uses a greedy approach to set a similarity
score threshold for real articles that maximises the overall accuracy. This paper uses three well-studied
techniques for similarity score calculation - cosine similarity, word appearance and TF-IDF. It explains
how these metrics are used within this problem and contrasts the results of each. The most significant
limitation of this paper is that the articles selected for similarity comparison were rarely relevant to the topic
of the queried article.
This paper forms the basis of our set of similarity metrics and we analyse each of the three metrics given.
Our approach improves upon this methodology by capitalising on Google's PageRank
algorithm to only select articles with a high chance of being relevant to the queried topic. Additionally,
our dataset consists of around 10 times as many articles and our classification is done using machine
learning and deep learning classifiers rather than a greedy approach to fit a more realistic situation where
fake news detection is required.
2.3 Linguistic and Additional Features
Vijayaraghavan et al. [6] presents a series of different fake news models alongside useful linguistic features for
this task. It addresses preprocessing techniques including the removal of punctuation and stop words which
is an effective technique in this domain. It analyses the polarity of news articles and determines that both
real and fake news have similar polarity distributions making this an ineffective method of distinguishing
between them. It also analyses the part of speech distribution between fake and real news and determines
that features such as the number of adverbs and adjectives is higher in fake news but the number of nouns
and pronouns is higher in real news. This is because fake news relies on descriptive language to establish
facts whereas real news refers to research and experts to infer its legitimacy. Zhou & Zafarani [7] produced
a comprehensive survey on non-latent features which this research refers to extensively in Section 3.3,
along with Garg & Sharma [8], and Horne & Adali [9].
3 Methods
Figure 1: Our classification pipeline.
Figure 1 shows our mostly linear classification pipeline. After preprocessing and tokenization, we extract
contextual articles which are fed into a similarity model to form our first feature. Additionally, non-latent
features from raw text and BERT embeddings form the rest of our features. The concatenation of all the
features is fed into our classification models, which infer a binary classification label.
3.1 Preprocessing and tokenization
Before extracting any features, we will preprocess our input and convert the long form text into tokens.
We perform the following preprocessing methods in order:
Remove non-ascii: Our input articles contained unnecessary unicode tokens such as unicode
double quotation marks. These can be removed safely since they do not add any extra semantics
to the input articles and may confuse feature extraction.
Convert to lowercase: In our research, we converted all text to lowercase. However upon
further analysis, converting all text to lowercase hid acronyms such as “US” which could have
affected the main themes of the text. Further, all proper nouns such as names and places were
also hidden. We will discuss this limitation in Section 6.2.
Lemmatization: We used the nltk [10] library to reduce words down to their lemma in the
hope of reducing the complexity within our text, which may benefit feature extraction. The
lemmatizer looks up the word in the WordNet corpus to get the lemma. Later in the research,
we realised that this hypothesis may not have been accurate.
Firstly, the nltk library we were using does not automatically detect the part of speech and will,
by default, only lemmatize nouns. While it is arguably better for us to maintain the tense of
verbs, we are technically not lemmatizing fully. Secondly, from more research, lemmatization
may not be ideal for BERT embeddings since it removes some semantics that could be learnt
by the BERT model. We will discuss these limitations further in Section 6.2.
Remove stopwords: Stopwords were removed from the text in order to reduce complexity.
Apart from the above methods, we also tested removing punctuation. However, this was not used in
the end since we added non-latent features to measure punctuation counts and also wanted to maintain
semantics for BERT.
After preprocessing, tokens are then generated based on any whitespace and punctuation in the
remaining text. Table 1 shows samples of tokenized input articles.

ID 118 (Real)
Article extract: FBI Director James Comey said Sunday that the bureau won't change the conclusion it made in July after it examined newly revealed emails related to the Hillary Clinton probe. "Based on our review, we have not changed our conclusions that we expressed in July with respect to Secretary Clinton" Comey wrote in a letter to 16 members of Congress. [...]
Tokens: ['fbi', 'director', 'james', 'comey', 'said', 'sunday', 'bureau', 'change', 'conclusion', 'made', 'july', 'examined', 'newly', 'revealed', 'email', 'related', 'hillary', 'clinton', 'probe', '.', '"', 'based', 'review', ',', 'changed', 'conclusion', 'expressed', 'july', 'respect', 'secretary', 'clinton', ...]

ID 15 (Fake)
Article extract: After hearing about 200 Marines left stranded after returning home from Operation Desert Storm back in 1991, Donald J. Trump came to the aid of those Marines by sending one of his planes to Camp Lejuene, North Carolina to transport them back home to their families in Miami, Florida. Corporal Ryan Stickney was amongst the group that was stuck in North Carolina and could not make their way back to their homes. [...]
Tokens: ['hearing', '200', 'marines', 'left', 'stranded', 'returning', 'home', 'operation', 'desert', 'storm', 'back', '1991', ',', 'donald', 'j', '.', 'trump', 'came', 'aid', 'marines', 'sending', 'one', 'plane', 'camp', 'lejuene', ',', 'north', 'carolina', 'transport', 'back', 'home', 'family', 'miami', ...]

Table 1: Examples of preprocessing and tokenization on items in the dataset.
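To make the pipeline concrete, the following is a minimal sketch of the preprocessing and tokenization steps above using nltk. The function name and the exact ordering of tokenization relative to lemmatization are our own illustrative choices, and the nltk data packages (punkt, wordnet, stopwords) are assumed to be available for download.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the required corpora once (punkt for tokenization, wordnet for lemmas, stopwords).
for pkg in ["punkt", "wordnet", "stopwords"]:
    nltk.download(pkg, quiet=True)

LEMMATIZER = WordNetLemmatizer()
STOPWORDS = set(stopwords.words("english"))

def preprocess_and_tokenize(text: str) -> list[str]:
    """Preprocess an article roughly as in Section 3.1 and return its tokens."""
    # 1. Remove non-ASCII characters such as unicode quotation marks.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # 2. Convert to lowercase (see Section 6.2 for the acronym caveat).
    text = text.lower()
    # 3. Tokenize on whitespace and punctuation.
    tokens = word_tokenize(text)
    # 4. Lemmatize (defaults to noun POS, as noted above) and drop stopwords.
    return [LEMMATIZER.lemmatize(tok) for tok in tokens if tok not in STOPWORDS]

# Example: preprocess_and_tokenize("FBI Director James Comey said Sunday ...")
```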
3.2 Feature: BERT embeddings
BERT (Bidirectional Encoder Representations from Transformers) is a language representation model
proposed by Devlin et al [11]. It is pre-trained on BookCorpus and the English Wikipedia using masked
language modelling (MLM) and next sentence prediction (NSP). MLM masks some of the input tokens with
a training objective to predict the masked token simply based on context, and the model also concatenates
two sentences with 50% chance of being neighbouring sentences, and the model is pre-trained in the NSP
layer to predict if the two are indeed neighbours [11]. BERT obtains SOTA results on several tasks and is
suitable for representing document and textual data. Hence, we will be using BERT as the main feature
to encode our articles. We will use HuggingFace’s ’bert-base-uncased’ pretrained model which trains on
uncased data. To encode an article, we truncate it to the first 512 tokens, pass it through 'bert-base-uncased',
and output the CLS token's vector as BERT features for our classification model. Due to this
truncation, the BERT encoding won't be able to capture the entire article content. However, we hypothesise
that the 'fake news' quality of an article is at a minimum visible within a range of 512 tokens (see BERT
Features in Section 6.2 for further discussion). In addition, any information not visible in this range will
be captured by non-latent features or similarity metrics.
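A minimal sketch of extracting these BERT features with the HuggingFace transformers library is shown below. It uses the pooler_output, which corresponds to the CLS token's last hidden state passed through a linear layer with tanh activation (see Section 6.2); the wrapper function is ours and the exact implementation details may differ from those used in the project.

```python
import torch
from transformers import BertModel, BertTokenizer

# Load the pretrained uncased BERT model and its tokenizer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bert_features(article_text: str) -> torch.Tensor:
    """Return a 768-dimensional BERT feature vector for an article."""
    # Truncate to the first 512 tokens, as described above.
    inputs = tokenizer(article_text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # pooler_output is the [CLS] hidden state processed by a linear layer with tanh.
    return outputs.pooler_output.squeeze(0)
```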
3.3 Feature: Non-latent features
From our literature review and survey, we are able to identify a significant number of non-latent features
[7, 8, 9]. After combining features that are similar, and removing features which we cannot calculate due to
the need for proprietary software (e.g. LIWC), the computational complexity of the algorithms,
or related reasons, we are able to identify 81 numerical features suitable for our experiments. Table 2
shows the 7 main categories of our features.
Diversity: Number of unique words, or percentage of all unique words, of a particular part of speech. Examples: noun, verb.
Quantity: Number of words, or percentage of all words, of a particular part of speech or linguistic unit. Examples: noun, adjective, quote.
Sentiment: Number of linguistic features denoting sentiment. Examples: exclamation marks, all-cap words, polarity, subjectivity.
Pronoun: Number of pronouns of a specific class. Example: first person singular pronouns (I, me, my, mine).
Average: Average number of a linguistic unit per other linguistic unit. Example: characters per word.
Syntax Tree Depth: The median syntax tree depth of a given unit. Example: median noun phrase syntax tree depth.
Readability: Measures complexity and how interpretable a text is. Examples: Gunning-Fog, Coleman-Liau.
Table 2: The categories that our 81 features fall under, along with a description and examples.
We display all our features including our internal names in Table 7. Using these features, we apply
ANOVA (Analysis of Variance, see Section 3.4.5 for an overview) and filter for those with a p-value less than the
significance level α = 0.05, removing the features with a p-value above α. We then compute the Pearson
correlation between the remaining features and identify all correlation clusters. Within each cluster we keep
only the feature with the lowest p-value: we sort the selected features by p-value in ascending order and,
walking the list from the front, remove every later feature that has a correlation of at least 0.95 with the
current feature, continuing until the end of the list. The surviving features are our non-latent features
for our classification models. A sketch of this selection procedure is shown below.
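The sketch assumes the features are held in a pandas DataFrame and the labels in a Series aligned to it; the function and argument names are illustrative rather than the project's actual code.

```python
import pandas as pd
from scipy.stats import f_oneway

def select_non_latent_features(features: pd.DataFrame, labels: pd.Series,
                               alpha: float = 0.05, corr_cutoff: float = 0.95) -> list[str]:
    """ANOVA filter followed by correlation-cluster pruning, as described above."""
    # One-way ANOVA of each feature against the REAL/FAKE labels.
    p_values = {}
    for col in features.columns:
        groups = [features.loc[labels == lab, col] for lab in labels.unique()]
        p_values[col] = f_oneway(*groups).pvalue
    # Keep only features significant at the alpha level.
    significant = [c for c in features.columns if p_values[c] < alpha]
    # Sort by ascending p-value and greedily drop highly correlated features.
    significant.sort(key=lambda c: p_values[c])
    corr = features[significant].corr(method="pearson").abs()
    selected = []
    for col in significant:
        if all(corr.loc[col, kept] < corr_cutoff for kept in selected):
            selected.append(col)
    return selected
```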
3.4 Feature: Similarity model
As a novel feature, we investigate the similarity of our input articles to contextual articles found online.
In Sections 3.4.1 and 3.4.2 we will discuss our process of gathering three contextual articles from online
sources which we treat as truth. These three articles extend our original dataset and are vectorized then
fed into a similarity model which we describe in Sections 3.4.3 and 3.4.4. We use this similarity as a feature
to ascertain whether our input article contains misinformation.
3.4.1 Summary extraction
To get the context articles, we need to summarize the main topic of our input article down to at most
10 keywords. We use the Python gensim [12] library which provides various topic modelling interfaces
for text inputs. We use the ldamodel which implements Latent Dirichlet Allocation (LDA) to extract a
single topic. LDA is a probabilistic model which assumes you have a number of documents representing
some latent topics characterized by a distribution over words. By feeding in the preprocessed sentences
of our input article as each document, we are able to get the main themes. We sort the output keywords
by the probability that they represent the topic, then cap the number of keywords at 10. We chose LDA
because it is a simple algorithm that outputs reproducible and reliable results and can be customized in
the number of topics and keywords it generates.
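A minimal sketch of this gensim-based summary extraction is shown below, assuming the article has already been preprocessed into per-sentence token lists; the function name and random_state are our own choices.

```python
from gensim import corpora
from gensim.models import LdaModel

def extract_summary(sentence_tokens: list[list[str]], max_keywords: int = 10) -> str:
    """Summarise an article down to at most 10 keywords with single-topic LDA.

    `sentence_tokens` is the preprocessed article, one token list per sentence,
    so that each sentence acts as a 'document' for LDA.
    """
    dictionary = corpora.Dictionary(sentence_tokens)
    corpus = [dictionary.doc2bow(sent) for sent in sentence_tokens]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=1, random_state=0)
    # Keywords of the single topic, already sorted by probability.
    keywords = [word for word, _prob in lda.show_topic(0, topn=max_keywords)]
    return " ".join(keywords)
```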
For the scope of our research, we are able to perform manual validation of the summaries extracted
to check the summary represented the article content well. Table 3 shows some samples of items in our
dataset after applying LDA. We see that while the summaries extracted are not perfect, they still represent
the general meaning of the article and were sufficient for the purposes of our research. Two common issues
we saw were:
Unordered words in the summary: the words representing the topic seemed to be unordered. A
human reading the summary by itself might be able to see that the words are all keywords of
the article, but put together as a sentence they do not completely make sense. We hypothesize that this
could have caused sub-optimal results when we started scraping articles using the summaries.
Appearance of stop words and other meaningless non-topic words in the summary: as a flow-on
issue from our preprocessing, our summary was left with words such as "wa" (from "was") or "ha"
(from "has"). This would have impacted the meaning of our summary and later article scraping.
We will discuss the possibility of extracting better summaries using a more robust model in Section 6.2.
3.4.2 Article scraping
We feed the summary of the input article into Google News and collect the top three articles. We chose
Google News since it is a well-known search engine that will return the most popular articles on the internet.
This kind of PageRank-based ranking is important since we assume that the contextual articles are real and
describe the current word-of-mouth truth from the internet. Therefore, for purposes of comparison,
an input article that is very different to our contextual articles is likely to belong to the FAKE class.
For our research, we manually fed all summaries for our dataset into Google News. Our motivation for this
research was to develop a tool that a user could potentially use to figure out if the current news they are
reading contains misinformation. We acknowledge there exist APIs that provide either a wrapper around
Google News or implement their own news search algorithm that we could have looked into. However,
given the size of the dataset and our scope, this was not necessary to demonstrate our system.
SETUP: We use a virtual machine with a freshly installed latest version of Google Chrome.
Searches are conducted in “Incognito Mode” tabs. We also use a VPN to the West coast of the
US. These invariants serve the main purpose of ensuring that Google does not give personalized
results based on a browser fingerprint or IP address. We chose the US as the VPN destination
since our dataset articles were extracted from US news sources and we wanted to scrape for
articles with a similar style of writing. If you were to use the tool in Australia, Google would
usually return articles from local sources. We restrict our scope to specifically this dataset
rather than train on a wide dataset from all sources.
ID 118 (Real)
Article extract: FBI Director James Comey said Sunday that the bureau won't change the conclusion it made in July after it examined newly revealed emails related to the Hillary Clinton probe. "Based on our review, we have not changed our conclusions that we expressed in July with respect to Secretary Clinton" Comey wrote in a letter to 16 members of Congress. [...]
Summary: email review fbi clinton said july comey news new wa

ID 15 (Fake)
Article extract: After hearing about 200 Marines left stranded after returning home from Operation Desert Storm back in 1991, Donald J. Trump came to the aid of those Marines by sending one of his planes to Camp Lejuene, North Carolina to transport them back home to their families in Miami, Florida. Corporal Ryan Stickney was amongst the group that was stuck in North Carolina and could not make their way back to their homes. [...]
Summary: home marines trump wa stickney way north plane family

Table 3: Examples of summary extraction on items in the dataset.
Another invariant we implement is to add before:2020 to our summary. This forces Google
News to only find articles published before this year, so that the news we get is not recent.
A common discussion topic in our dataset was Donald Trump's 2016 election campaign,
and we know that the news regarding Trump in 2023 is much different to that of 2016. Since
we are not using a very recent dataset, clamping the date of the contextual articles approximates
the situation of looking for articles at the time of reading the input article, when few future
articles would have been available.
PROCESS: We attempt to get the top three articles and save the URL for each input article.
Not all summaries returned three articles so we perform scraping in three passes:
1. We enter the whole summary without any changes. This is the most ideal approach and
most machine-replicable. This covered 70% of our dataset.
2. Still performing only generic actions, we remove any bad words or unimportant connectives
and then search again. This should still be machine-replicable with further work. This
covered the next 20% of our dataset.
3. For the last 10% of our dataset, we had to manually look at the input article content and
summary generated to figure out why we still received no results. Our hypothesis was
that this was a combination of our non-tuned summary extraction and the fact that some
outrageous Fake articles simply didn’t have any similar articles that could be found. We
will discuss this limitation in Section 6.2.
From the above passes, we were not able to find context articles for four input articles described
in a table in Appendix A. Furthermore, we were only able to find one or two articles for some
inputs but we can still continue with our similarity model.
Figure 2: Sample of articles found in Google after searching an article summary.
After gathering up to three context article URLs for each input article, we use the Python newspaper3k [13] library to
download the article and automatically extract its title and content.
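A sketch of this download step with newspaper3k is shown below; the wrapper function and the context_urls variable are illustrative.

```python
from newspaper import Article

def download_context_article(url: str) -> dict:
    """Download a context article and extract its title and content."""
    article = Article(url)
    article.download()
    article.parse()
    return {"url": url, "title": article.title, "content": article.text}

# Example usage for the up-to-three context URLs gathered per input article:
# context = [download_context_article(u) for u in context_urls]
```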
3.4.3 Article vectorisation
Past research by Alsuliman et al. [5] proposes two different ways to vectorise the articles: TF-IDF and
Word2Vec. In addition to these two vectorisation methods, we propose a third “non-latent vectoriser”.
TF-IDF [14]: The term frequency times inverse document frequency, which is a “common
term weighting scheme in information retrieval”. The formula given as follows:
\[ \mathrm{tfidf}(\mathrm{term}, \mathrm{article}) = \frac{\texttt{article.count(term)}}{\texttt{len(article)}} \times \left( \log_2 \frac{\texttt{len(articles)}}{\texttt{df(articles, term)}} + 1 \right) \]
The first term is the term frequency, and the second term is the inverse document frequency,
where article is the article we are applying TF-IDF on, article.count(term) is the fre-
quency of term in article, len(article) is the number of words in the article, len(articles)
is the number of articles, df(articles, term) is the document frequency of term in our
articles dataset, or the number of articles that contain this term [14]. The formula given
has a '+1' term in the idf so that the algorithm does not ignore terms that appear in all articles;
this is the sklearn implementation and differs from the standard textbook formula, which has
the '+1' inside the denominator of the log2 in the idf [14]. TF-IDF is fitted on the original dataset
of input articles (for the articles in the IDF term), then used as a vectorizer to transform both the
input and context articles. We apply TF-IDF with two n-gram ranges, (1,1) and (1,2).
Word2Vec [15]: A word embedding architecture introduced by Google in 2013. It uses
continuous bag-of-words (CBOW), which uses neighbouring words to predict target words, and
continuous skip-gram which uses target words to predict the neighbouring words, using either
hierarchical softmax or negative sampling [16] [17]. We will be using the gensim implementation
of word2vec using the pre-trained ’word2vec-google-news-300’ model which is trained on the
Google News dataset of about 100 billion words containing 3 million words and phrases in 300
dimensions. We chose this model because the domain of its training data matches our dataset,
namely news. To calculate the article vector, we retrieve the vector of every single
word in an article, excluding those that do not exist in the embeddings, and then take the
average of the list [5].
Non-Latent Vectoriser: The non-latent vectoriser uses the non-latent features selected in
Section 3.3 and applies them to an article to produce a vector. This is the non-latent vector
representation of the respective article.
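The sketch below shows how the TF-IDF and Word2Vec vectorisers could be set up; input_article_texts and context_article_texts are assumed lists of preprocessed article strings. Note that sklearn's TfidfVectorizer uses the natural logarithm and smoothing by default, so its idf differs in detail from the formula given above.

```python
import numpy as np
import gensim.downloader
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF vectoriser fitted on the input articles only, then reused on context articles.
tfidf = TfidfVectorizer(ngram_range=(1, 1))           # or (1, 2) for the second variant
tfidf.fit(input_article_texts)                        # assumed list[str] of input articles
input_vecs = tfidf.transform(input_article_texts)
context_vecs = tfidf.transform(context_article_texts)

# Word2Vec vectoriser: average the embeddings of all in-vocabulary words.
w2v = gensim.downloader.load("word2vec-google-news-300")

def word2vec_article_vector(tokens: list[str]) -> np.ndarray:
    vectors = [w2v[tok] for tok in tokens if tok in w2v]
    return np.mean(vectors, axis=0) if vectors else np.zeros(300)
```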
3.4.4 Similarity metric calculation
Past research by Alsuliman et al. in [5] proposes three different metrics to calculate the similarity between
two documents: cosine distance, word appearance (word app), and matching score. In addition, we also
propose a fourth metric, the harmonic mean of the three, to harmonise any statistical differences and
incorporate all distributional differences between the measures.
Cosine distance: Calculated as one minus the cosine similarity of two vectors u and v.
The cosine similarity is the cosine of the angle between the two vectors (calculated as the dot
product of u and v) divided by the product of the Euclidean L2 norm of the two vectors to
scale the range to [0, 1] [14]. Lower values denote higher similarity between the two vectors,
and vice versa. The formula is given as follows [18]:
\[ 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} \]
Word app: Calculated as the number of unique common words between the prediction and
the context articles divided by the number of unique words in the context article. Given that
input_unique is the set of unique words in the input document and context_unique is the set of
unique words in the context document, the formula is as follows:
\[ \frac{\lvert \mathrm{input}_{\mathrm{unique}} \cap \mathrm{context}_{\mathrm{unique}} \rvert}{\lvert \mathrm{context}_{\mathrm{unique}} \rvert} \]
Matching score: Calculated as the L1 norm of the vector of the unique common words
between the prediction and the context article, divided by L1 norm of vectorized unique words
in the context article. Given that input_unique is the set of unique words in the input document,
context_unique is the set of unique words in the context document, and vec(x) is a function which
vectorises the set of words x, the formula is as follows:
\[ \frac{\lVert \mathrm{vec}(\mathrm{input}_{\mathrm{unique}} \cap \mathrm{context}_{\mathrm{unique}}) \rVert_1}{\lVert \mathrm{vec}(\mathrm{context}_{\mathrm{unique}}) \rVert_1} \]
Harmonic mean [19]: Calculated by the following formula:
\[ \bar{x}_H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} \]
When n = 3, x_1 = c is the cosine distance, x_2 = w is the word app, and x_3 = m is the matching
score, we have the following formula:
\[ \bar{x}_H = \frac{3}{\frac{1}{c} + \frac{1}{w} + \frac{1}{m}} \]
All these metrics are in the range [0, 1]. Higher values for matching score and word app denote
higher similarity, and vice versa; the opposite holds for cosine distance. Since we scrape up to three
context articles per input article, we apply similarity metric smoothing by calculating the similarity
metric for an input article as the average of the similarities between the input article and each of its
context articles, to reduce variance.
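The metrics above can be sketched as follows; the vectorize argument of matching_score stands for whichever fitted vectoriser from Section 3.4.3 is being used, and the helpers assume non-zero, non-empty inputs.

```python
import numpy as np
from scipy.spatial.distance import cosine

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    # 1 - (u . v) / (||u|| ||v||)
    return cosine(u, v)

def word_app(input_tokens: list[str], context_tokens: list[str]) -> float:
    input_unique, context_unique = set(input_tokens), set(context_tokens)
    return len(input_unique & context_unique) / len(context_unique)

def matching_score(input_tokens, context_tokens, vectorize) -> float:
    """`vectorize` maps a set of words to a vector (e.g. via the fitted TF-IDF vectoriser)."""
    input_unique, context_unique = set(input_tokens), set(context_tokens)
    shared = vectorize(input_unique & context_unique)
    return np.abs(shared).sum() / np.abs(vectorize(context_unique)).sum()

def harmonic_mean(c: float, w: float, m: float) -> float:
    # Assumes all three component metrics are strictly positive.
    return 3.0 / (1.0 / c + 1.0 / w + 1.0 / m)

def smoothed_metric(metric_values: list[float]) -> float:
    """Average a metric over the (up to three) context articles of one input article."""
    return float(np.mean(metric_values))
```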
3.4.5 Similarity metric selection
Since we have different similarity metrics, we ought to compare them and select the one that best
differentiates the REAL and the FAKE articles. We will use three methods to aid our selection: δµ,
Jensen-Shannon Divergence, and ANOVA.
δµ, the difference between the means of the REAL and FAKE articles, is a very naive and simplistic measure
of how differentiated the two distributions are. This measure does not account for outliers, which might
significantly shift the mean of a distribution. It also ignores the variance of the distributions. However,
it is still a numerically simple metric which aids us when the distributions are well-formed.
Jensen-Shannon Divergence (JSD) [20] is a symmetric Kullback-Leibler Divergence (KL Diver-
gence) to compute the metric distance between two probability distributions [18, 20]. It is bounded
between 0 and 1 with values closer to 0 denoting more similar distributions and closer to 1 more dissimilar
distributions. This is a value we hope to maximise. We will use the square root of the Jensen-Shannon
divergence to normalise p and q, which is given by the formula [18]:
\[ \sqrt{\frac{D(p \parallel m) + D(q \parallel m)}{2}} \]
D is the KL Divergence and m is the “point-wise mean of p and q” [18].
ANOVA (Analysis of Variance) [21] is a statistical method to determine if there are significant
differences between the means of groups. We will be performing a one-way repeated-measures ANOVA, which
tests for any significant difference between the means of the dependent variables (our non-latent features)
among the groups defined by the independent variable (our REAL and FAKE labels), to reject the null
hypothesis that the group means (FAKE and REAL) are equal. We reject this hypothesis when the p-value
associated with the F-statistic of the respective feature is below the significance level α = 0.05.
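A sketch of computing these three selection measures for one similarity metric is given below; binning the metric values into histograms before taking the Jensen-Shannon distance is our own assumption about how the two distributions could be discretised.

```python
import numpy as np
from scipy.stats import f_oneway
from scipy.spatial.distance import jensenshannon

def compare_metric(real_values: np.ndarray, fake_values: np.ndarray, bins: int = 20):
    """Return (delta_mu, JSD, ANOVA p-value) for one similarity metric."""
    # delta mu: difference between the REAL and FAKE means.
    delta_mu = abs(real_values.mean() - fake_values.mean())
    # JSD: bin both samples into histograms over a common range, then take the
    # base-2 Jensen-Shannon distance so the result lies in [0, 1].
    lo = min(real_values.min(), fake_values.min())
    hi = max(real_values.max(), fake_values.max())
    p, _ = np.histogram(real_values, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(fake_values, bins=bins, range=(lo, hi), density=True)
    jsd = jensenshannon(p, q, base=2)
    # One-way ANOVA of the metric values against the REAL/FAKE labels.
    p_value = f_oneway(real_values, fake_values).pvalue
    return delta_mu, jsd, p_value
```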
3.5 Feature Normalisation
We experimented with different feature normalisation techniques across the entire dataset and for specific
features. Normalisation was conducted column-wise for each feature so that each feature was normalised
individually.
MinMaxScaler: This was used for features where there was a clear range of values in order
to convert these values into the range of 0-1. This was not implemented frequently across our
features as they did not have a fixed range of values making this technique ineffective.
StandardScaler: This was primarily used for linguistic features that did not have a clear
range. This technique works by finding the Z-score for each value within the feature where
Z-score = (X - µ) / σ. This is done by first calculating the mean and standard deviation of the
feature and then transforming each value into its Z-score.
RobustScaler: We experimented with this in place of StandardScaler for values prone to
outliers. RobustScaler uses the interquartile range IQR = Q3 - Q1 instead of the standard deviation
in the transformation calculation from StandardScaler. This is used because the standard deviation
is highly affected by outlier values, which was noticeable in features such as the word count of
context articles. A sketch of applying these scalers column-wise is shown below.
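The sketch assumes the collected features live in a pandas DataFrame named features; the column names are hypothetical and only illustrate which scaler would apply to which kind of feature (recall that, as discussed next, normalisation was ultimately not used in our results).

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Each scaler operates column-wise, so every feature is normalised individually.
bounded_cols = ["polarity", "subjectivity"]          # hypothetical bounded features
count_cols = ["noun_count", "adjective_count"]       # hypothetical count-based features
outlier_cols = ["word_count_context_articles"]       # hypothetical outlier-prone features

features[bounded_cols] = MinMaxScaler().fit_transform(features[bounded_cols])
features[count_cols] = StandardScaler().fit_transform(features[count_cols])
features[outlier_cols] = RobustScaler().fit_transform(features[outlier_cols])
```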
Ultimately, after each scaling technique was applied and the classification was conducted, the results
with each type of normalisation and various combinations of them underperformed compared to the original
dataset without normalisation. There are a few different reasons why this was the case but after testing
various classification models, the most probable reasons are the small size of the dataset used and the lack
of appropriate scaling methods available for the linguistic features that rely on counting. Due to this, we
decided not to use normalisation for our results.
3.6 Model: Machine learning
We used four state-of-the-art machine learning models (commonly used in fake news detection) to perform
our classification. We chose Logistic Regression (LR), Support Vector Machines (SVM), Decision Trees
(DT), and XGBoost (XGB). Due to the small size of our dataset, we needed to tune the regularization
hyperparameters to ensure our models didn’t overfit. In particular, tree-based models such as DT and
XGB should be able to fully segment our classes so we need to control the depth of the tree and splitting
criteria. Models such as LR and SVM will need to control L2 regularization. In SVM, we will test the
type of kernel used.
To find the best hyperparameters, we perform 5-Fold cross validation across 80% of our dataset and
average the validation score. We pick the parameters with the best validation score and test our model
with the remaining 20% of the dataset. The table in the Appendix F shows the hyperparameters tested
for each model and a reason for why the ranges were selected.
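A sketch of this tuning procedure is shown below; X and y are the assumed feature matrix and binary labels, and the hyperparameter grids here are only illustrative, since the actual ranges are listed in Appendix F.

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# 80/20 train/test split; 5-fold cross validation on the training portion.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "LR": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}),
    "SVC": (SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
    "DT": (DecisionTreeClassifier(), {"max_depth": [2, 4, 8], "criterion": ["gini", "entropy"]}),
    "XGB": (XGBClassifier(), {"max_depth": [2, 4, 8], "n_estimators": [50, 100]}),
}

for name, (model, grid) in models.items():
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    print(name, search.best_params_, search.score(X_test, y_test))
```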
3.7 Model: Neural networks
The second approach taken for classification was using deep learning as we hypothesised that it would
be able to find more complex relationships between the large variety of BERT, linguistic, and similarity
features. Initially both convolutional neural networks and simple fully connected layers were considered.
After analysing the features and noting that they did not have any spatial relation, only the second
approach was considered.
We used a very simple structure with 4 hidden layers using the ReLU activation function and an output
layer with one neuron and the sigmoid activation function as the task required binary classification. We
kept the structure simple due to the small size of the dataset and to prevent overfitting. Additionally,
a dropout of 0.2 was added to further prevent overfitting. The dataset was split in a 60:20:20 train,
validation, test split to allow for hyperparameter tuning, and binary cross-entropy loss was used. Multiple
optimisers were used during training, however, based on the validation accuracy we decided to use Adam
as the optimiser with default parameters but a modified learning rate of 0.0001.
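A sketch of this architecture in Keras is shown below; the hidden layer widths and the dropout placement are illustrative, as only the depth, activations, dropout rate, loss and optimiser are fixed by the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(num_features: int) -> tf.keras.Model:
    """Fully connected binary classifier as described above; layer widths are illustrative."""
    model = models.Sequential([
        layers.Input(shape=(num_features,)),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.2),                      # dropout of 0.2 to reduce overfitting
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),    # binary REAL/FAKE output
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_classifier(X_train.shape[1])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100, batch_size=16)
```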
4 Experimental setup
4.1 Dataset
For our research, we use the FakeNewsData dataset collated by Horne & Adali in [9] on research regarding
fake news in the 2016 presidential elections. This dataset contains two subsets, “Buzzfeed Political News”,
and “Random Political News”. We make use of the Buzzfeed subset since this contains long form text
articles that are binary categorized with Fake and Real labels. The Random subset contains an extra
label, Satire, which is out of scope for our research.
The original dataset was collated by Craig Silverman (BuzzFeed News Editor) in an article [22]
analysing fake news. The analysis concentrates on the Facebook engagement on real and fake news articles
shared to the social media website. Various keywords related to events during the election were searched
and articles with highest engagement were collected. A ground truth was assigned by manual analysis
using a list of known fake and hyperpartisan news sites. A detailed description of their process can be found
in their article.
Following BuzzFeed’s analysis, Horne & Adali extracted the content and title from the articles and
formed the dataset. In total, there were 53 REAL and 48 FAKE articles. Notable events during the election
such as Donald Trump's campaign and various Hillary Clinton scandals and rumours were featured in the
articles.
After extending this dataset with our novel context article scraping and similarity methods, we used
a 60/20/20 train/validation/test set. This was stratified and randomized to ensure the best results. An
example of items in our dataset can be found in Table 4.
ID 118 (Real)
Title: FBI Completes Review of Newly Revealed Hillary Clinton Emails Finds No Evidence of Criminality
Content: FBI Director James Comey said Sunday that the bureau won't change the conclusion it made in July after it examined newly revealed emails related to the Hillary Clinton probe. "Based on our review, we have not changed our conclusions that we expressed in July with respect to Secretary Clinton" Comey wrote in a letter to 16 members of Congress. [...]

ID 1 (Fake)
Title: 5 Million Uncounted Sanders Ballots Found On Clinton's Email Server
Content: Hillary in hot water over her email server, again. Sacramento, CA Democratic nominee Hillary Clinton is in hot water again after nearly 5 million uncounted California electronic ballots were found on her email server by the F.B.I. The majority of those ballots cast were by Bernie Sanders supporters. [...]

Table 4: A sample of one REAL and one FAKE article in our dataset. The article ID, title and content are shown.
Both articles concern the scandal of Hillary Clinton using a private server to store emails. The FAKE article
reports on an event that never happened, whereas the REAL article reports the true event that Clinton was
exonerated from criminality.
4.2 Evaluation metrics
To evaluate our classification models, we will use accuracy and F1 score. These metrics are commonly
used for binary classification problems as well as in the misinformation detection domain. When referring
to these metrics, we will label REAL articles as positive and FAKE articles as negative.
Accuracy is measured as the proportion of the total number of correctly classified samples over the
total count of samples. Mathematically this is represented as
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \]
where TP, TN, FP and FN refer to the counts of true positives, true negatives, false positives and false
negatives. Accuracy is an effective metric
for this analysis as it is a holistic measure of the model’s ability to identify a real article as REAL and a
fake article as FAKE, which is the core component of the problem of fake news detection. Our dataset is
quite balanced so this will be a good general first step measure.
Recall refers to how likely the model is to identify an article as real when its ground truth is real.
Precision is a measure of the proportion of all articles the model predicts to be real that have a ground
truth of real. As such, recall is a measure of the quantity of the predictions whereas precision is a measure
of the quality of the predictions. We are focusing on F1-score as it is the harmonic mean of recall and
precision where
\[ F_1 = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}. \]
This allows for a fair assessment of both the quantity and quality of the predictions especially in
imbalanced datasets. While our dataset is mostly balanced, we still output this metric to compare the
model’s performance on our positive class.
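These metrics can be computed directly with scikit-learn, as sketched below; y_true and y_pred are assumed label arrays for the test set, and REAL is assumed to be the positive-class label string.

```python
from sklearn.metrics import accuracy_score, f1_score

# Ground-truth and predicted labels, with REAL treated as the positive class.
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, pos_label="REAL")
```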
5 Results and discussion
5.1 Feature Analysis
5.1.1 Non-Latent Feature Selection
We see that the majority of our features have values that fall into the range of 0.1 to 10. Several quantity
and some diversity features are in the higher range of 100 to 1000. A box plot showing the means and
standard deviations of all our features can be found in Figure 9 in the Appendix.
After applying ANOVA, we find that the removed features account for almost half of all features; they
include the readability indices, the syntax tree depth features, and a few features from other categories,
while the diversity features all have p-values below α. Subsequently, we create a Pearson correlation matrix for features
with p-value below α, resulting in the matrix in Appendix D. We then apply the algorithm to cluster
correlations and keep only the one with lowest p-value as outlined in Section 3.3. Figure 11 shows our
results. After grouping them by their original category, we have 29 features shown in Figure 3.
Figure 3: The number of features we have for each category.
5.1.2 Similarity Metrics Comparison
Figure 4 shows our results with similarity metrics comparison. A good metric should have the FAKE (in
red) and the REAL (in green) distributions be somewhat well separated. The brown area indicates the
overlapping region. Non-latent Cosine Distance is the worst metric, with δµ = 0, JSD = 0.02, and p = 6.10e-01,
where the FAKE and REAL distributions overlap each other almost exactly. Word2vec Cosine Distance and
TF-IDF (1-1) Cosine Distance also perform quite poorly, with δµ of 0.03 and 0.05, JSD of 0.02 and 0.06,
and p-values of 1.74e-03 and 2.38e-02 respectively. Both these plots have the REAL distribution shifted
further to the right, but the distributions still overlap significantly; unlike Non-Latent Cosine Distance,
however, these metrics have p-values below the typical significance level α of 0.05. Although TF-IDF (1-2) Cosine
Distance has a similar δµ and JSD of 0.06 and 0.08 respectively, its p-value is markedly low at 2.81e-05.
Word App follows a similar pattern, where TF-IDF (1-1) and (1-2) produce distinctly different p-values
of 5.17e-05 and 1.63e-02, but a similar δµ of 0.13 and JSD of 0.22 and 0.23 respectively. We
can conclude that the n-gram range of TF-IDF definitely contributes to marked differences in the metric
outputs. The matching score metrics have a similar δµ of 0.12 and a JSD range of [0.23, 0.28], with low p-values
of 2.81e-05 and 6.11e-05 respectively. The most significant features, however, are the harmonic means
combining the previous metrics, with TF-IDF (1-1) at δµ = 0.14, JSD = 0.24, p = 1.08e-05 and TF-IDF
(1-2) at δµ = 0.14, JSD = 0.17, p = 1.11e-05. Since TF-IDF (1-1) yields more difference between the
distributions, we will use this as our definitive similarity metric.
5.1.3 Data Analysis Using PCA and KMeans
In order to observe relationships between data within the entire dataset, the BERT, linguistic and similarity
features for each data point were collected and dimensionality reduction was conducted using Principal
Component Analysis to reduce the dimensions of the data to 2. Plots of these values were constructed
with two different combinations of features: one using only BERT features for PCA and the other using
linguistic and similarity features in addition to BERT features.

Figure 4: Similarity metrics comparison results. Each plot contains the name of the vectoriser, the similarity
metric, and the δµ, JSD (Jensen-Shannon Distance), and p-value when ANOVA is performed with the metric as
the dependent variable against the labels as the independent variable. The red distribution is the dependent
variable when the label is FAKE, and the green distribution is the dependent variable when the label is REAL.

(a) PCA Graph (b) KMeans Graph
Figure 5: Analysis on PCA based on BERT features
Additionally, to observe whether the data was forming clusters based on its reduced dimensions,
KMeans was used with 5 clusters. The reason for using 5 clusters instead of 2 was because both FAKE and
REAL articles about different topics or written differently can form clusters with other articles in a similar
style and we were observing for distinct sets of REAL and FAKE clusters rather than simply one region of
FAKE articles and another with REAL ones.
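A sketch of this analysis is given below; feature_matrix is the assumed array of concatenated features, one row per article.

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Reduce the concatenated feature vectors to two principal components.
pca = PCA(n_components=2)
points_2d = pca.fit_transform(feature_matrix)   # feature_matrix: one row per article

# Cluster the reduced points into 5 groups, as discussed above.
kmeans = KMeans(n_clusters=5, random_state=0, n_init=10)
cluster_ids = kmeans.fit_predict(points_2d)
```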
(a) PCA Graph (b) KMeans Graph
Figure 6: Analysis on PCA based on BERT, linguistic and similarity features
As can be seen in Figure 5a, there is very little relationship between the principal components of
FAKE and REAL articles. When clustering is performed, as shown in Figure 5b, each cluster has an even
distribution of REAL and FAKE articles implying there is no evident pattern in the data.
Whereas in Figure 6a, there is a clear grouping of values for fake data. Interestingly, real data does not
have a clear cluster of values and instead occupies a large region of area to the right of the fake data. One
potential reason for this is that there are a lot of similar features between fake articles such as the number
of adverbs and the types of polarising language. Additionally, the similarity metric could be playing a
strong role in influencing the first principal component, pc1, which would potentially result in fake news
having low scores and real news having a broader range of similarity scores. Regardless, there is a clear
distinction and this is also shown in Figure 6b where the majority of fake values are clustered in the purple
cluster and the majority of real values are present in two separate clusters.
This indicates that the features selected are effective in separating the dataset into distinct FAKE and
REAL clusters.
5.2 Model Results
For our analysis, we will be comparing the top 3 performing machine learning methods alongside the best
performing neural network for each task. There are three combinations of inputs that are compared:
1. Only BERT features,
2. BERT and linguistic features,
3. BERT, linguistic and similarity features.
Figure 7 presents the classification results. As hypothesised from dataset analysis with PCA, only
using BERT does not perform well. With a testing accuracy around 50% for each classifier, it is simply
random guessing on the data. This is used as the baseline for our analysis as it does not incorporate any of
our novel contributions to this knowledge domain.
When incorporating the most effective linguistic features in addition to just BERT features, there is
a significant increase in testing accuracy and F1-score across every model. Evident in the more complex
models of neural networks and XGBoost, it is clear that there is a complex relationship being captured.
There is an increase in test accuracy to around 85% and increase in F1-score to 0.87 for the top performing
Figure 7: Accuracy and F1-score metrics for top performing models across different features
model. However, the decision tree classifier which is a far simpler model than XGBoost underperforms
with this set of features.
Adding our novel similarity feature, the neural network accuracy increases from 85% to 90%.
Additionally, F1-score increases from 0.86 to 0.91. Although SVC and XGBoost have similar accuracy
with and without similarity, their F1-score increases slightly. This is because there is a slight increase
in precision and decrease in recall. This means that fewer articles that have a ground truth of REAL are
classified as REAL but the quality of the REAL predictions increases. Another interesting note is that the
simpler classifier - decision tree - performs much better when the similarity score is added. This indicates
that the use of similarity score results in a split within the decision tree with much higher information
gain than was possible with only BERT and linguistic features.
The reason these combinations of features were chosen for analysis was that BERT and linguistic
features have a significant amount of prior research. These were included to represent the current state-
of-the-art research within this area of long-form journalistic fake news detection. Building on this, we
conducted analysis and refined these features to develop a set of the most effective features, with the final
input combination adding our novel similarity feature. As shown in the results, the linguistic features selected perform extremely
well in this task and the additional similarity metric reduces the complexity of the classification whilst
increasing the accuracy and F1-score for the neural network and decision tree classifiers.
Table 5 shows the results for the four machine learning classifiers used and adds more details than
visible in Figure 7. It displays the training and testing metrics for each model to show the effect of
changing the combination of features.
6 Conclusion
6.1 Summary of Contributions
Through this work, we show that real-time contextual information in the form of similarity scores with
related articles is effective in distinguishing between real and fake news. Additionally, through our analysis
Features                          Model   Train Acc.   Train F1   Test Acc.   Test F1
BERT                              LR      0.99         0.99       0.5         0.5
BERT                              SVC     1.0          1.0        0.4         0.4
BERT                              DT      0.99         0.99       0.45        0.56
BERT                              XGB     1.0          1.0        0.6         0.6
BERT + Linguistic                 LR      0.94         0.94       0.70        0.73
BERT + Linguistic                 SVC     0.79         0.79       0.85        0.87
BERT + Linguistic                 DT      1.0          1.0        0.65        0.70
BERT + Linguistic                 XGB     1.0          1.0        0.85        0.87
BERT + Linguistic + Similarity    LR      0.94         0.94       0.7         0.73
BERT + Linguistic + Similarity    SVC     0.79         0.79       0.85        0.87
BERT + Linguistic + Similarity    DT      1.0          1.0        0.85        0.88
BERT + Linguistic + Similarity    XGB     1.0          1.0        0.85        0.87
Table 5: Table showing all training and test accuracies and F1 scores on our machine learning models with
different feature inputs. The best test accuracy and F1 score have been emboldened.
into a range of linguistic features, we collate the features that provide the highest distinction between fake
and real articles, along with visualisations and classification models based on them.
Our work brings a more in-depth study of linguistic features and the literature to find the most effective
ones. It addresses the limitations of previous similarity models, including the lack of relevance between the
context articles and the queried article, the usage of only the title rather than the full body of text, and the
focus on only one main type of feature rather than a study of linguistic and similarity techniques together.
6.2 Limitations and Future Work
Preprocessing and tokenization
While our preprocessing was quite generic for NLP tasks, we believe there were a few oversights that
caused unreliable results. Two issues were:
Converting everything to lowercase destroyed acronyms such as “US”. This changed the meaning of
some sentences.
The nltk lemmatizer required manually specifying the part of speech to work. This caused some
verbs to be incorrectly lemmatized. We could have used a different library such as spaCy to extract
the POS automatically. Alternatively we could have investigated not lemmatizing at all to maintain
proper structure for non-latent features and BERT embeddings.
This research may have benefited from less preprocessing, or none at all. For all our features, there
is an argument that no preprocessing would have been better; for example, summary extraction
or the BERT features could have learnt meaning from the unfiltered text. This could be investigated in future
research to strengthen our features.
BERT Features
In this research, we have used the last layer hidden state of the CLS token (processed by Linear layer
with Tanh activation) to encode our articles as BERT features, due to its common usage as a feature
for classification [23]. However, the HuggingFace transformer documentation has also mentioned
that this output is not necessarily a ’good summary of the semantic content of the input’, and suggests
applying pooling or averaging the sequence of hidden-states for the entire input [23]. In addition, we can
also use the fully connected layers on top of the frozen BERT, which is the typical approach for textual
classification. These are things that we could experiment with to improve our BERT features input.
The plot in Appendix B shows the length distribution of the articles. We can see that input articles
and context articles have mean word counts of 1090.58 and 1827.29 respectively, with the 50th percentile being 583 and
1032 respectively. This means that the input token size of 512 for BERT cannot encode a significant
amount of textual information in an article. We assume in our research that fake news is always visible
within 512 tokens of an article. However, we could conduct our experiments with LongFormer to test for
our hypothesis and for any further improvement.
Non-Latent Features
Our non-latent feature extraction pipeline is currently applied to the content of the article. However, we could
easily use the same procedure to detect any features which significantly differentiate between
the REAL and the FAKE distributions in the title of the articles. Past research has also applied feature
extraction methods to the title of articles, and our model would benefit from any important features found there.
Summary extraction
Our summary extraction extracted most of the correct meaning from the text. However two main issues
remained causing it to produce imperfect results:
Words would be unordered and if read together in a sentence, wouldn’t make sense to a human.
Junk such as connectives from poor preprocessing or words not too related to the article content
would be left in the summary. This could be a sign of an ineffective topic extraction method.
We believe future research could look into better-tuned models that synthesize higher quality
topic sentences. From a cursory search for “keyword extractor” on HuggingFace, we found various pre-
trained models using modern transformer methods such as BERT to output keywords. We believe that
our method still worked well enough to demonstrate that this could be done but better summaries would
have yielded higher quality contextual articles. In addition, we could have also included the title since
semantically, the title is supposed to tell the reader what the rest of the article is about.
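One possibility along these lines (a sketch only; KeyBERT is not the tool we used, and the parameter values are illustrative) is a BERT-based keyword extractor:

```python
from keybert import KeyBERT

article_text = "Full article body ..."  # placeholder input

kw_model = KeyBERT()  # defaults to a small sentence-transformers backbone
keywords = kw_model.extract_keywords(
    article_text,
    keyphrase_ngram_range=(1, 2),  # allow single words and two-word phrases
    stop_words="english",          # drop connectives and other junk up front
    top_n=8,                       # roughly the size of our current summaries
)
query = " ".join(phrase for phrase, score in keywords)  # search query for context articles
```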
Article scraping
We had intended our pipeline to be fully automatic, using an API to scrape for articles based on the
extracted keywords. For the scope of this project, we settled on manual scraping to show that using
contextual articles would improve results. Unfortunately, after being passed to Google, some input
summaries returned limited or no results and required human intervention to produce any articles. This
was a particular problem for both Real and Fake articles, for different reasons.
For Real-labelled articles, we believe an improved summary extractor that returns less junk would have
improved results. For Fake articles, however, we hypothesize that if an outrageous fake article is introduced,
we may well find no results about the event online at all. One example is the following article
about Trump “snorting cocaine”:
10_Fake: The Internet is buzzing today after white supremacist presidential candidate Donald
Trump was caught by hotel staff snorting cocaine.
Maria Gonzalez an employee at the Folks INN & Suites Hotel in Phoenix brought room service
to his room witnessed it all.
“When I walked in I saw 3 naked prostitutes and maybe 100,000 in hundred dollars bills and
a mountain of white powder on the table, I thought it was a dog on the floor sleep but it was
his hair piece, he was bald and sweating like crazy.” [...]
This event never happened and, consequently, we could not find any contextual articles about it. For our
research, we skipped articles with no context. One way to resolve this would be to include such articles
with a missing or low similarity score. More research is needed into whether articles with and without a
similarity score can be mixed and still perform well, especially considering that in the real world this is
definitely an issue.
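A minimal sketch of the automated scraping we had intended, using newspaper3k [13] for article extraction; the search_news helper is hypothetical and stands in for whichever news-search API is chosen:

```python
from newspaper import Article  # newspaper3k

def search_news(query: str, max_results: int = 5) -> list[str]:
    """Hypothetical: query a news-search API and return candidate article URLs."""
    raise NotImplementedError("plug in a news-search API here")

def scrape_context_articles(summary_keywords: str) -> list[dict]:
    context = []
    for url in search_news(summary_keywords):
        article = Article(url)
        article.download()
        article.parse()
        if article.text:  # skip pages newspaper3k could not extract
            context.append({"url": url, "title": article.title, "text": article.text})
    return context  # may be empty, e.g. for outrageous Fake articles with no online coverage
```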
Article vectorisation
Similar to the non-latent feature suggestion above, future work could construct a non-latent vectoriser from
article titles and perform ANOVA on this vectoriser applied to the input articles against the context articles.
It is likely, however, that the non-latent ‘title’ vectoriser would yield a p-value above the significance
threshold, since this is the case with the non-latent ‘body’ vectoriser.
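For reference, the ANOVA itself is a one-liner with SciPy [18]; the two groups below are illustrative placeholders for a hypothetical title-vectoriser feature computed on input and context articles respectively:

```python
from scipy.stats import f_oneway

# Illustrative values only: one hypothetical title feature per article
input_article_values = [3.0, 2.0, 4.0, 3.0, 5.0]
context_article_values = [3.5, 2.5, 4.5, 3.0, 4.0]

f_stat, p_value = f_oneway(input_article_values, context_article_values)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")  # compare against the significance threshold alpha = 0.05
```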
Similarity metric calculation
For our similarity metric, we could also build a simple classifier that takes an input article and a context
article and outputs a similarity measure, trained to maximise the output when the input article is REAL
and minimise it when the input article is FAKE.
In addition, we hypothesize that REAL articles have a higher level of semantic similarity between the
content and the title. We could vectorise the (title, content) pair, apply the similarity metric measures to
it, and repeat the methodology outlined in Sections 3.4.3, 3.4.4 and 3.4.5.
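A minimal sketch of the title–content similarity check, assuming an off-the-shelf sentence-transformers model rather than the vectorisers from Section 3.4.3:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice of encoder

def title_body_similarity(title: str, body: str) -> float:
    embeddings = model.encode([title, body], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

# Hypothesis: REAL articles score higher than FAKE ones on average
score = title_body_similarity("Senate passes budget bill",
                              "The Senate voted on Tuesday to pass ...")
```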
Dataset
For the scope of our research, we used a fairly small dataset to present our contributions to contextual
article scraping. This dataset concentrates solely on the political events surrounding the 2016 United
States election, which causes two main problems for our model:
• We are prone to overfitting, since such a small dataset is easily segmented by most state-of-the-art
methods.
• Our model will learn specifics of the US election that we do not want it to learn, which could confuse
the models when classifying more recent news.
Future research could provide an automated scraping API, which would allow a much larger dataset
covering multiple world events across different years. Alongside this, event-specific words could be masked
out to reduce the learning of event specifics; a rough sketch follows.
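As one hedged possibility for the masking step (spaCy's small English NER model is an assumption; any NER component would do):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
MASKED_LABELS = {"PERSON", "GPE", "ORG", "DATE", "EVENT"}

def mask_event_specifics(text: str) -> str:
    """Replace event-specific entities with their label so models cannot memorise them."""
    doc = nlp(text)
    out = text
    for ent in reversed(doc.ents):  # iterate in reverse so character offsets stay valid
        if ent.label_ in MASKED_LABELS:
            out = out[:ent.start_char] + f"[{ent.label_}]" + out[ent.end_char:]
    return out

print(mask_event_specifics("Donald Trump campaigned in Phoenix in November 2016."))
# e.g. "[PERSON] campaigned in [GPE] in [DATE]."
```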
6.3 Attributions
First and foremost, we thank Professor Yang Song, Maurice Pagnucco, and Wenbin Wang for providing
various insights and suggestions for our research.
We thank Horne & Adali for providing the base dataset from their research [9], which we extended.
We also acknowledge all related works which we have drawn on to experiment with our features.
Lastly, the team used open-source software as part of developing our software. The licenses and
links to the open-source projects are included in an ATTRIBUTIONS file distributed alongside this report.
References
[1] Hunt Allcott and Matthew Gentzkow. “Social media and fake news in the 2016 election”. In: Journal
of economic perspectives 31.2 (2017), pp. 211–236.
[2] Alexandre Bovet and Hernán A Makse. “Influence of fake news in Twitter during the 2016 US
presidential election”. In: Nature communications 10.1 (2019), p. 7.
[3] Twitter Help Center. How we address misinformation on Twitter. 2023. url: https://help.twitter.com/en/resources/addressing-misleading-info.
[4] Wissam Antoun et al. “State of the art models for fake news detection tasks”. In: 2020 IEEE inter-
national conference on informatics, IoT, and enabling technologies (ICIoT). IEEE. 2020, pp. 519–
524.
[5] Fahad Alsuliman et al. “Social Media vs. News Platforms: A Cross-analysis for Fake News Detection
Using Web Scraping and NLP”. In: Proceedings of the 15th International Conference on PErvasive
Technologies Related to Assistive Environments. 2022, pp. 190–196.
[6] Sairamvinay Vijayaraghavan et al. “Fake news detection with different models”. In: arXiv preprint
arXiv:2003.04978 (2020).
[7] Xinyi Zhou and Reza Zafarani. “A survey of fake news: Fundamental theories, detection methods,
and opportunities”. In: ACM Computing Surveys (CSUR) 53.5 (2020), pp. 1–40.
[8] Sonal Garg and Dilip Kumar Sharma. “Linguistic features based framework for automatic fake news
detection”. In: Computers & Industrial Engineering 172 (2022), p. 108432.
[9] Benjamin Horne and Sibel Adali. “This just in: Fake news packs a lot in title, uses simpler, repetitive
content in text body, more similar to satire than real news”. In: Proceedings of the international AAAI
conference on web and social media. Vol. 11. 1. 2017, pp. 759–766.
[10] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing
text with the natural language toolkit. ”O’Reilly Media, Inc.”, 2009.
[11] Jacob Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Under-
standing”. In: CoRR abs/1810.04805 (2018). arXiv: 1810.04805. url: http://arxiv.org/abs/
1810.04805.
[12] Radim Řehůřek and Petr Sojka. “Software Framework for Topic Modelling with Large Corpora”.
English. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.
Valletta, Malta: ELRA, May 2010, pp. 45–50.
[13] Lucas Ou-Yang. newspaper3k. 2013. url: https://newspaper.readthedocs.io/en/latest/.
[14] F. Pedregosa et al. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning
Research 12 (2011), pp. 2825–2830.
[15] Google Code Archive. word2vec. 2013. url: https://code.google.com/archive/p/word2vec/.
[16] Radim Řehůřek and Petr Sojka. “Software Framework for Topic Modelling with Large Corpora”.
English. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.
http://is.muni.cz/publication/884893/en. Valletta, Malta: ELRA, May 2010, pp. 45–50.
[17] Tomas Mikolov et al. Efficient Estimation of Word Representations in Vector Space. 2013. arXiv:
1301.3781 [cs.CL] .
[18] Pauli Virtanen et al. “SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python”. In:
Nature Methods 17 (2020), pp. 261–272. doi: 10.1038/s41592-019-0686-2.
[19] Jasmin Komić. “Harmonic Mean”. In: International Encyclopedia of Statistical Science. Ed. by Mio-
drag Lovric. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 622–624. isbn: 978-3-642-
04898-2. doi: 10.1007/978-3-642-04898-2_645. url: https://doi.org/10.1007/978-3-642-
04898-2_645.
[20] Frank Nielsen. “On a generalization of the Jensen–Shannon divergence and the Jensen–Shannon
centroid”. In: Entropy 22.2 (2020), p. 221.
[21] Gurchtan Singh. ANOVA: Complete guide to Statistical Analysis & Applications. July 2023. url:
https://www.analyticsvidhya.com/blog/2018/01/anova-analysis-of-variance/.
[22] Craig Silverman. This Analysis Shows How Viral Fake Election News Stories Outperformed Real
News On Facebook. Nov. 2016. url: https://www.buzzfeednews.com/article/craigsilverman/
viral-fake-election-news-outperformed-real-news-on-facebook.
[23] Thomas Wolf et al. “Transformers: State-of-the-Art Natural Language Processing”. In: Association
for Computational Linguistics, Oct. 2020, pp. 38–45. url: https://www.aclweb.org/anthology/
2020.emnlp-demos.6.
A Article scraping
ID: 128_Real
Article extract: [...] I have a prediction. I know exactly what November 9 will bring. Another day of God’s perfect sovereignty. He will still be in charge. His throne will still be occupied. He will still manage the affairs of the world. Never before has His providence depended on a king, president, or ruler. And it won’t on November 9, 2016. “The LORD can control a king’s mind as he controls a river; he can direct it as he pleases” (Proverbs 21:1 NCV). On one occasion the Lord turned the heart of the King of Assyria so that he aided them in the construction of the Temple. On another occasion, he stirred the heart of Cyrus to release the Jews to return to Jerusalem. [...]
Summary: god wa one never every king november still heart

ID: 2_Fake
Article extract: Washington, D.C. South African Billionaire, Femi Adenugame, has released a statement offering to help African-Americans leave the United States if Donald Trump is elected president. According to reports, he is offering $1 Million, a home and car to every Black family who wants to come to South Africa. Concerns about Donald Trump becoming president has prompted a South African billionaire to invest his fortune in helping African-Americans leave the United States to avoid further discrimination and inequality. [...]
Summary: ha adenugame africanamericans south femi united states africa president donald

ID: 10_Fake
Article extract: The Internet is buzzing today after white supremacist presidential candidate Donald Trump was caught by hotel staff snorting cocaine. Maria Gonzalez an employee at the Folks INN & Suites Hotel in Phoenix brought room service to his room witnessed it all. “When I walked in I saw 3 naked prostitutes and maybe 100,000 in hundred dollars bills and a mountain of white powder on the table, I thought it was a dog on the floor sleep but it was his hair piece, he was bald and sweating like crazy.” [...]
Summary: wa room hotel maria told employee gonzalez hit video get

ID: 34_Fake
Article extract: It has been more than fifteen years since Rage Against The Machine have released new music. The members of the band have involved themselves in various other projects during their lengthy hiatus, but one pressing issue has forced the band to team up once again. In a statement posted online, Rage Against The Machine announced they would be releasing a brand new album aimed at spreading awareness about “how awful Donald Trump is”. [...]
Summary: trump rage album machine band ha donald music outside year
Table 6: Articles for which we were not able to find context articles. Articles like 10_Fake describe an event that
plainly did not happen, whereas articles like 128_Real and 34_Fake produce summaries that confuse Google News
and do not return useful articles.
B Word Count Histogram
Figure 8: Word count histogram of input (left) and context articles (right)
C Non-Latent Features
Table 7: Non-latent feature names (far-right column) and their corresponding category, feature type, and calcu-
lation method
Category | Feature Type | Calculation | Feature Name
Diversity | Noun (Unique) | Count | div NOUN sum
Diversity | Verb (Unique) | Count | div VERB sum
Diversity | Adjective (Unique) | Count | div ADJ sum
Diversity | Adverb (Unique) | Count | div ADV sum
Diversity | Lexical Word (Unique) | Count | div LEX sum
Diversity | Content Word (Unique) | Count | div CONT sum
Diversity | Function Word (Unique) | Count | div FUNC sum
Diversity | Noun (Unique) | Percent | div NOUN percent
Diversity | Verb (Unique) | Percent | div VERB percent
Diversity | Adjective (Unique) | Percent | div ADJ percent
Diversity | Adverb (Unique) | Percent | div ADV percent
Diversity | Lexical Word (Unique) | Percent | div LEX percent
Diversity | Content Word (Unique) | Percent | div CONT percent
Diversity | Function Word (Unique) | Percent | div FUNC percent
Quantity | Noun | Count | div NOUN sum
Quantity | Verb | Count | div VERB sum
Quantity | Adjective | Count | div ADJ sum
Quantity | Adverb | Count | div ADV sum
Quantity | Pronoun | Count | div PRON sum
Quantity | Personal Pronoun | Count | div PRP sum
Quantity | Possessive Pronoun | Count | div PRP$ sum
Quantity | Determinant | Count | div DET sum
Quantity | Number | Count | div NUM sum
Quantity | Punctuation | Count | div PUNCT sum
Quantity | Symbol | Count | div SYM sum
Quantity | Wh-Determinant | Count | div WDT sum
Quantity | Cardinal Number | Count | div CD sum
Quantity | Verb (Past Tense) | Count | div VBD sum
Quantity | Stop Word | Count | div STOP sum
Quantity | Lowercase Word | Count | div LOW sum
Quantity | Uppercase Word | Count | div UP sum
Quantity | Negation | Count | div NEG sum
Quantity | Noun | Percent | div NOUN percent
Quantity | Verb | Percent | div VERB percent
Quantity | Adjective | Percent | div ADJ percent
Quantity | Adverb | Percent | div ADV percent
Quantity | Pronoun | Percent | div PRON percent
Quantity | Personal Pronoun | Percent | div PRP percent
Quantity | Possessive Pronoun | Percent | div PRP$ percent
Quantity | Determinant | Percent | div DET percent
Quantity | Number | Percent | div NUM percent
Quantity | Punctuation | Percent | div PUNCT percent
Quantity | Symbol | Percent | div SYM percent
Quantity | Wh-Determinant | Percent | div WDT percent
Quantity | Cardinal Number | Percent | div CD percent
Quantity | Verb (Past Tense) | Percent | div VBD percent
Quantity | Stop Word | Percent | div STOP percent
Quantity | Lowercase Word | Percent | div LOW percent
Quantity | Uppercase Word | Percent | div UP percent
Quantity | Negation | Percent | div NEG percent
Quantity | Quote | Count | div QUOTE sum
Quantity | Noun Phrase | Count | div NP sum
Quantity | Character | Count | div CHAR sum
Quantity | Word | Count | div WORD sum
Quantity | Sentence | Count | div SENT sum
Quantity | Syllable | Count | div SYLL sum
Sentiment | Exclamation Mark | Count | div ! sum
Sentiment | Question Mark | Count | div ? sum
Sentiment | All-Cap Word | Count | div CAPS sum
Sentiment | Polarity | Index | div POL sum
Sentiment | Subjectivity | Index | div SUBJ sum
Pronoun | First Person Singular | Count | div FPS sum
Pronoun | First Person Plural | Count | div FPP sum
Pronoun | Second / Third Person | Count | div STP sum
Pronoun | First Person Singular | Percent | div FPS percent
Pronoun | First Person Plural | Percent | div FPP percent
Pronoun | Second / Third Person | Percent | div STP percent
Average | Character Per Word | Average | div chars per word sum
Average | Word Per Sentence | Average | div words per sent sum
Average | Clause Per Sentence | Average | div claus per sent sum
Average | Punctuation Per Sentence | Average | div puncts per sent sum
Syntax Tree Depth | Median Syntax Tree Depth | Median | div ALL sum
Syntax Tree Depth | Median Noun Phrase Syntax Tree Depth | Median | div NP sum
Readability | Gunning-Fog Index | Index | div gunning-fog sum
Readability | Coleman-Liau Index | Index | div coleman-liau sum
Readability | Flesch Kincaid Grade Level | Index | div flesch-kincaid sum
Readability | Linsear Write | Index | div linsear-write sum
Readability | SPACHE | Index | div spache sum
Readability | Dale Chall Readability | Index | div dale-chall sum
Readability | Automated Readability Index (ARI) | Index | div automatic sum
Readability | Flesch Reading Ease | Index | div flesch sum
Figure 9: Box plots of the non-latent features, displaying the median, interquartile range (IQR), minimum and
maximum. Outliers are not displayed as they would inflate the plot size. Each subplot contains features with mean
µ in the corresponding power of 10.
D Non-latent Feature Pearson Correlation Matrix
Figure 10: Pearson correlation matrix of the selected non-latent features (those shown in red and white in Figure 11).
Brighter areas indicate higher correlation, and vice versa. An explanation of the features is available in Appendix C.
E ANOVA of Non-Latent Feature against Labels
Figure 11: Table of features and their p-values when ANOVA is performed with each feature as the dependent
variable against the labels as the independent variable. Features are sorted by p-value; the table on the right
continues from the last row of the left table. The significance level is α = 0.05, and features with a p-value above
the threshold are shown in blue. Features below the level are subjected to the correlation clustering algorithm
outlined in Section 3.3, which keeps the feature with the lowest p-value and removes all of its correlated features;
selected features are shown in white, and removed features in red.
F Machine learning
Model | Parameter | Selection | Reasoning
Logistic Regression | Inverse L2 coefficient | 0.2:1.2:0.2 | This is the main regularization parameter. We chose a range around the default of 1.0, shifted to bias towards higher regularization.
Logistic Regression | Solver | lbfgs, liblinear | The liblinear solver was suggested by the documentation as an alternative for small datasets.
SVM | Inverse L2 coefficient | 0.2:1.2:0.2 | Same reasoning as the LR regularization.
SVM | Kernel | rbf, poly, sigmoid | Selecting the right kernel for the dataset will make our methods perform better.
SVM | Kernel coefficient | 1/(n_features × var(X)), 0.01, 0.05 | Same reason as above.
Decision Tree | Criterion | gini, entropy | To test different methods of measuring split quality at a node.
Decision Tree | Max depth | no limit, 3:9:2 | Controls how complex the tree is. A shallower tree is more regularized.
Decision Tree | Max features | 0.3 × n_features, √n_features, all features | Standard defaults suggested by the documentation. Controls regularization so that not all features are considered at each split.
Decision Tree | Min samples for splitting a node | 2:4:1 | Reduces the number of leaves representing only one sample, to increase regularization.
XGBoost | Learning rate | 0.1:0.5:0.1 | Smaller learning rates reduce overfitting.
XGBoost | Max depth | 1:6:1 | Same as max depth for decision trees.
XGBoost | L2 coefficient | 0.8:1.6:0.2 | Testing higher regularization. The default is 1.0.
XGBoost | L1 coefficient | 0:0.4:0.2 | Testing higher regularization. The default is 0.0.
Table 8: Table of all the models chosen and the hyperparameters selected for each model. We describe a range
of values in the format start:end:step, where start and end are inclusive. Our main focus was to investigate
regularization parameters that would better fit our smaller dataset.
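As a minimal sketch of how one of these grids can be searched with scikit-learn [14] (the cross-validation and scoring settings here are illustrative assumptions, not a restatement of our setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Logistic Regression grid from Table 8: C = 0.2:1.2:0.2, solver in {lbfgs, liblinear}
param_grid = {
    "C": np.arange(0.2, 1.2 + 1e-9, 0.2),
    "solver": ["lbfgs", "liblinear"],
}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1",   # assumed metric
    cv=5,           # assumed number of folds
)
# search.fit(X_train, y_train)   # X_train, y_train: the normalised feature matrix and labels
# print(search.best_params_)
```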