This paper is based on a research project from June-August 2023. It uses a variety of machine learning and language processing techniques to detect fake news.


Automated fake news detection through contextual
similarity comparison
Dhruv Agrawal
---@unsw.edu.au
Duke Nguyen
---@unsw.edu.au
Jim Tang
---@unsw.edu.au
August 17, 2023
Contents
1 Introduction
2 Related work
  2.1 Similarity Comparison using Titles
  2.2 Similarity Comparison using Fixed News Database
  2.3 Linguistic and Additional Features
3 Methods
  3.1 Preprocessing and tokenization
  3.2 Feature: BERT embeddings
  3.3 Feature: Non-latent features
  3.4 Feature: Similarity model
    3.4.1 Summary extraction
    3.4.2 Article scraping
    3.4.3 Article vectorisation
    3.4.4 Similarity metric calculation
    3.4.5 Similarity metric selection
  3.5 Feature Normalisation
  3.6 Model: Machine learning
  3.7 Model: Neural networks
4 Experimental setup
  4.1 Dataset
  4.2 Evaluation metrics
5 Results and discussion
  5.1 Feature Analysis
    5.1.1 Non-Latent Feature Selection
    5.1.2 Similarity Metrics Comparison
    5.1.3 Data Analysis Using PCA and KMeans
  5.2 Model Results
6 Conclusion
  6.1 Summary of Contributions
  6.2 Limitations and Future Work
  6.3 Attributions
Appendix A Article scraping
Appendix B Word Count Histogram
Appendix C Non-Latent Features
Appendix D Non-latent Feature Pearson Correlation Matrix
Appendix E ANOVA of Non-Latent Feature against Labels
Appendix F Machine learning
1 Introduction
As the distribution of news shifts towards social media, there is a rise in the dissemination of fake news
[1]. We define fake news as the creation of information presented as legitimate journalism that describes
a fictitious event or fabricates details of an event in an attempt to mislead readers. This phenomenon
became a major focal point in journalism during the 2016 US elections with political parties labelling
many publications and articles as fake news. There are two huge issues with how this occurred in 2016:
1. Many prominent political figures highlighted any news they disagreed with as fake news. This led
to the political isolation of their parties, whereby any news that portrayed them in a negative light
had the potential to be dismissed as fake news. This reduced the accountability of political figures in
the US, a country whose federal legislation has a sweeping impact across the country and the rest of
the world.
2. There was a lack of fake news detection tools on social media, and due to the polarisation of the media
climate, it was extremely difficult for social media platforms to regulate the articles published on them or
remove factually incorrect articles posted or shared by politicians and Americans [2].
Since then, there have been many attempts to address these issues, such as Politifact, which manually
reviews articles and social media posts for factual correctness and publishes its findings on an easily
accessible website. Other similar websites exist, but manual fact-checking tools are not prominent in
spaces with high volumes of fake news because manual review cannot scale to the number of news articles
and journalistic social media posts published every day.
There are also many automated fake news detection algorithms which rely on linguistic features of
the text, comparisons between the title of the article and its content, the medium of transmission, and
any suspicious activity linked with its engagement online. These tools have become more effective since
COVID-19, and Twitter employs its own algorithms to automatically label articles about certain topics
it chooses as fake news [3]. However, these tools are known to be very unreliable, as it is increasingly
common for fake news to read the same way as real news articles and to be shared by humans.
Therefore, in order for fake news detection to become more widespread and effective in combating fake
news, there are a few different criteria it must fulfil:
1. The algorithms need to automatically classify news as real or fake so that they can scale with the
growth of social media and the increase in fake news dissemination.
2. The algorithms need to incorporate current methods of fake news detection as these have been highly
researched and are effective in many situations such as when fake news has been automatically
generated or constructed in a highly polarised manner designed to provoke intense responses from
readers.
3. The algorithms need to examine an article’s content and meaning beyond its writing style to combat
fake news that is well-written and designed to look like real news.
4. The dataset used to train and assess the algorithms must contain real and fake articles that are
written in the same style so that it is not apparent simply from the way an article is written whether
it is real or fake.
Our approach improves upon existing approaches that focus on the first and second criteria, or on the
third. Most models rarely address both the second and third criteria, and ours aims to combine them
to analyse both the content and the style of articles. The model can then make a significantly more
informed classification decision. The model is restricted to a binary classification: it outputs either real or
fake rather than a confidence metric of an article's legitimacy. This is done because the aim is for the tool
to be easily adopted, so the simplicity of its input and output is a priority.
We compiled a list of commonly used linguistic features for fake news detection. Multiple different
pairings of features were formed and analysis was conducted to determine the most effective linguistic
features for the task. This builds on existing research into fake news detection and puts our model in line with
current methods. A new feature, lexicographical similarity, is used to achieve the third criterion above.
At a high level, this feature compares the queried article to other articles at the top of Google News
search results and applies various algorithms based on the individual words shared between the
articles. As these Google News articles have high PageRank scores, the model can be confident that they
are real articles, and it compares the similarity of the content between the queried article and each of these
top-ranked articles. This is a way for the model to infer context before making a judgement on
an article's legitimacy. This approach brings our model in line with the way humans manually fact-check
articles, which usually involves finding known trustworthy sources and comparing their content against the
queried article to determine whether it is consistent with the trustworthy ones.
Section 2 discusses related work in the study of automated fake news detection. It includes research
on existing similarity comparison models and linguistic features useful for this task. Section 3 covers the
methodology of our approach. Specifically, it covers preprocessing the input and queried articles, the
BERT, linguistic and similarity feature calculations, and classification with machine learning and neural
networks. Section 4 analyses the experimental setup and explains the reasoning behind the evaluation
metrics being used. Section 5 discusses the results of feature analysis of linguistic features, PCA and
KMeans analysis on the collected features, and a comparison between machine learning and neural network
approaches for classification. Section 6 covers our contributions, limitations and future work, and
attributions for the work.
Through this research, we have analysed and determined the most effective linguistic features for fake
news detection and shown that the use of similarity as a metric is effective in building upon these current
metrics to increase accuracy. We have also compared the use of the similarity metric with different machine
learning classifiers and discovered that it greatly increases the accuracy of less complex machine learning
methods and brings their performance in line with complex models.
2 Related work
2.1 Similarity Comparison using Titles
Methods of similarity comparison have been attempted in a few fake news detection papers. Antoun et al.
[4] presents an analysis of various fake news detection methods, including an automated model that won an
international competition on this topic. This model used the title of the queried article to search Google
for similar articles and compared the word embeddings of the title against those of the top 5 search
results. It used cosine similarity only and tested the similarity scores along with other features such as
lexicon-based features. Many classifiers were tested, including SVM, XGBoost and random forest classifiers.
This resulted in a model that was effective in detecting articles where there was a lot of emphasis on the
title for fake news classification.
2.2 Similarity Comparison using Fixed News Database
A different approach to similarity detection is used by Alsuliman et al. [5], where similarity scores are
generated between the queried article and every article in a database of news articles. This method takes
the highest similarity score for each queried article and then uses a greedy approach to set a similarity
score threshold for real articles that maximises the overall accuracy. The paper uses three well-studied
techniques for similarity score calculation: cosine similarity, word appearance and TF-IDF. It explains
how these metrics are used within this problem and contrasts the results of each. The most significant
limitation of this paper is that the most similar articles selected were rarely relevant to the topic
of the queried article.
This paper forms the basis of our set of similarity metrics, and we analyse each of the three metrics given.
Our approach improves upon this methodology by capitalising on Google's PageRank
algorithm to only select articles with a high chance of being relevant to the queried topic. Additionally,
our dataset consists of around 10 times as many articles, and our classification is done using machine
learning and deep learning classifiers rather than a greedy approach, to better fit a realistic situation where
fake news detection is required.
2.3 Linguistic and Additional Features
Vijayaraghavan et al. [6] presents a series of different fake news models alongside linguistic features useful for
this task. It addresses preprocessing techniques, including the removal of punctuation and stop words, which
is an effective technique in this domain. It analyses the polarity of news articles and determines that both
real and fake news have similar polarity distributions, making polarity an ineffective way of distinguishing
between them. It also analyses the part-of-speech distribution between fake and real news and finds
that the numbers of adverbs and adjectives are higher in fake news, whereas the numbers of nouns
and pronouns are higher in real news. This is because fake news relies on descriptive language to establish
facts, whereas real news refers to research and experts to support its legitimacy. Zhou & Zafarani [7] produced
a comprehensive survey on non-latent features which this research refers to extensively in Section 3.3,
along with Garg & Sharma [8], and Horne & Adali [9].
3 Methods
Figure 1: Our classification pipeline.
Figure 1 shows our mostly linear classification pipeline. After preprocessing and tokenization, we extract
contextual articles which are fed into a similarity model to form our first feature. Additionally, non-latent
features from raw text and BERT embeddings form the rest of our features. The concatenation of all the
features is fed into our classification models, which infer a binary classification label.
3.1 Preprocessing and tokenization
Before extracting any features, we will preprocess our input and convert the long form text into tokens.
We perform the following preprocessing methods in order:
Remove non-ascii: Our input articles contained unnecessary unicode tokens such as unicode
double quotation marks. These can be removed safely since they do not add any extra semantics
to the input articles and may confuse feature extraction.
Convert to lowercase: In our research, we converted all text to lowercase. However, upon
further analysis, converting all text to lowercase hid acronyms such as “US” which could have
affected the main themes of the text. Further, all proper nouns such as names and places were
also hidden. We will discuss this limitation in Section 6.2.
Lemmatization: We used the nltk [10] library to reduce words down to their lemma in the
hope of reducing the complexity within our text, which may benefit feature extraction.
ID 118 (Real)
Article extract: FBI Director James Comey said Sunday that the bureau won't change the conclusion it made in July after it examined newly revealed emails related to the Hillary Clinton probe. "Based on our review, we have not changed our conclusions that we expressed in July with respect to Secretary Clinton" Comey wrote in a letter to 16 members of Congress. [...]
Tokens: ['fbi', 'director', 'james', 'comey', 'said', 'sunday', 'bureau', 'change', 'conclusion', 'made', 'july', 'examined', 'newly', 'revealed', 'email', 'related', 'hillary', 'clinton', 'probe', '.', '"', 'based', 'review', ',', 'changed', 'conclusion', 'expressed', 'july', 'respect', 'secretary', 'clinton', ...]

ID 15 (Fake)
Article extract: After hearing about 200 Marines left stranded after returning home from Operation Desert Storm back in 1991, Donald J. Trump came to the aid of those Marines by sending one of his planes to Camp Lejuene, North Carolina to transport them back home to their families in Miami, Florida. Corporal Ryan Stickney was amongst the group that was stuck in North Carolina and could not make their way back to their homes. [...]
Tokens: ['hearing', '200', 'marines', 'left', 'stranded', 'returning', 'home', 'operation', 'desert', 'storm', 'back', '1991', ',', 'donald', 'j', '.', 'trump', 'came', 'aid', 'marines', 'sending', 'one', 'plane', 'camp', 'lejuene', ',', 'north', 'carolina', 'transport', 'back', 'home', 'family', 'miami', ...]

Table 1: Examples of preprocessing and tokenization on items in the dataset.
The lemmatizer looks up each word in the WordNet corpus to get its lemma. Later in the research, we realised
that this hypothesis may not have been accurate.
Firstly, the nltk library we were using does not automatically detect the part of speech and will,
by default, only lemmatize nouns. While it is arguably better for us to maintain the tense of
verbs, we are technically not lemmatizing fully. Secondly, from further research, lemmatization
may not be ideal for BERT embeddings since it removes some semantics that could be learnt
by the BERT model. We will discuss these limitations further in Section 6.2.
Remove stopwords: Stopwords were removed from the text in order to reduce complexity.
Apart from the above methods, we also tested removing punctuation. However, this was not used in
the end since we added non-latent features to measure punctuation counts and also to maintain semantics
for BERT.
After preprocessing, tokens are then generated based on any whitespace and punctuation in the re-
maining text. Table 1 shows samples of tokenized input articles.
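As a rough illustration of the steps above, the sketch below uses the nltk library. The function name, the exact ordering of operations, and the ASCII-stripping shortcut are our own simplifications for illustration rather than an exact reproduction of our pipeline.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-off downloads of the required nltk resources.
for resource in ("punkt", "wordnet", "stopwords"):
    nltk.download(resource, quiet=True)

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text: str) -> list[str]:
    text = text.encode("ascii", "ignore").decode()      # drop non-ASCII characters
    text = text.lower()                                 # lowercase (see Section 6.2 for caveats)
    tokens = word_tokenize(text)                        # split on whitespace and punctuation
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]  # noun-only lemmatization by default
    return [t for t in tokens if t not in STOPWORDS]    # remove stopwords
```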
3.2 Feature: BERT embeddings
BERT (Bidirectional Encoder Representations from Transformers) is a language representation model
proposed by Devlin et al. [11]. It is pre-trained on BookCorpus and the English Wikipedia using masked
language modelling (MLM) and next sentence prediction (NSP). MLM masks some of the input tokens with
the training objective of predicting the masked tokens from context alone; for NSP, the model concatenates
two sentences, with a 50% chance of them being neighbouring sentences, and is pre-trained to predict
whether the two are indeed neighbours [11]. BERT obtains SOTA results on several tasks and is
suitable for representing document and textual data. Hence, we will be using BERT as the main feature
to encode our articles. We will use HuggingFace's 'bert-base-uncased' pretrained model, which is trained on
uncased data. To encode an article, we truncate it to the first 512 tokens, pass it through 'bert-base-uncased',
and output the CLS token's vector as the BERT features for our classification model. Due to this
truncation, the BERT encoding won't be able to capture the entire article content. However, we hypothesise
that the ’fake news’ quality of an article is at the minimum visible at a range of 512 tokens (see BERT
Features in Section 6.2 for further discussion). In addition, any information not visible in this range will
be captured by non-latent features or similarity metrics.
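A minimal sketch of this encoding step with the HuggingFace transformers library is shown below; the helper name and tensor handling are illustrative assumptions rather than our exact implementation.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bert_cls_embedding(article: str) -> torch.Tensor:
    # Truncate to BERT's 512-token limit and return the [CLS] vector (768 dimensions).
    inputs = tokenizer(article, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0]
```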
3.3 Feature: Non-latent features
From our literature review and survey, we are able to identify a significant number of non-latent features
[7, 8, 9]. After combining features that are similar, and removing features which we cannot calculate due to
the need for proprietary software (e.g. LIWC), the computational complexity of the algorithms,
or related reasons, we are able to identify 81 numerical features suitable for our experiments. Table 2
shows the 7 main categories of our features.
Type | Description | Examples
Diversity | Number of unique words, or percentage of all unique words, of a particular part of speech | Noun, verb
Quantity | Number of words, or percentage of all words, of a particular part of speech or linguistic unit | Noun, adjective, quote
Sentiment | Number of linguistic features denoting sentiment | Exclamation marks, all-cap words, polarity, subjectivity
Pronoun | Number of pronouns of a specific class | First person singular pronouns: I, me, my, mine
Average | Average number of one linguistic unit per another linguistic unit | Characters per word
Syntax Tree Depth | Median syntax tree depth of a given unit | Median noun phrase syntax tree depth
Readability | Measures of how complex and interpretable a text is | Gunning-Fog, Coleman-Liau

Table 2: The categories that our 81 features fall under, along with a description and examples.
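To make the categories concrete, the sketch below computes a handful of illustrative features from raw text and its tokens; the feature names and exact definitions here are simplified examples and may differ from those in our 81-feature set.

```python
def example_nonlatent_features(text: str, tokens: list[str]) -> dict[str, float]:
    words = [t for t in tokens if t.isalpha()]
    raw_words = text.split()
    return {
        "num_words": len(words),                                               # Quantity
        "pct_unique_words": len(set(words)) / max(len(words), 1),              # Diversity
        "num_exclamation_marks": text.count("!"),                              # Sentiment
        "num_allcap_words": sum(w.isupper() and len(w) > 1 for w in raw_words),
        "num_first_person_sg": sum(w.lower() in {"i", "me", "my", "mine"} for w in raw_words),  # Pronoun
        "chars_per_word": sum(len(w) for w in words) / max(len(words), 1),     # Average
    }
```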
We display all our features, including our internal names, in Table 7. Using these features, we apply
ANOVA (Analysis of Variance, see Section 3.4.5 for an overview) and keep those with a p-value below the
α significance level of 0.05. We remove the features with p-value above α, then apply Pearson correlation
and identify all correlation clusters. From each cluster we keep only the feature with the lowest p-value:
we sort the selected features by p-value in ascending order, then walk through the list and, for each
feature, remove every later feature with which it has a correlation of at least 0.95. The features that remain
are our non-latent features for our classification models.
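A sketch of this selection procedure, assuming the features are held in a pandas DataFrame and the labels in a Series, might look as follows. We use scipy's one-way ANOVA here; this is not necessarily the exact routine used in our experiments.

```python
import pandas as pd
from scipy.stats import f_oneway

def select_features(df: pd.DataFrame, labels: pd.Series,
                    alpha: float = 0.05, corr_cutoff: float = 0.95) -> list[str]:
    # 1. One-way ANOVA of each feature against the class labels.
    pvals = {}
    for col in df.columns:
        groups = [df.loc[labels == c, col] for c in labels.unique()]
        pvals[col] = f_oneway(*groups).pvalue
    kept = [c for c in df.columns if pvals[c] < alpha]

    # 2. Greedy correlation filter: keep only the lowest-p feature of each
    #    highly correlated cluster (|Pearson r| >= corr_cutoff).
    kept.sort(key=lambda c: pvals[c])
    corr = df[kept].corr().abs()
    selected = []
    for col in kept:
        if all(corr.loc[col, s] < corr_cutoff for s in selected):
            selected.append(col)
    return selected
```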
3.4 Feature: Similarity model
As a novel feature, we investigate the similarity of our input articles to contextual articles found online.
In Sections 3.4.1 and 3.4.2 we will discuss our process of gathering three contextual articles from online
sources which we treat as truth. These three articles extend our original dataset and are vectorized then
fed into a similarity model which we describe in Sections 3.4.3 and 3.4.4. We use this similarity as a feature
to ascertain whether our input article contains misinformation.
3.4.1 Summary extraction
To get the context articles, we need to summarize the main topic of our input article in at most
10 keywords. We use the Python gensim [12] library, which provides various topic modelling interfaces
for text inputs. We use its ldamodel, which implements Latent Dirichlet Allocation (LDA), to extract a
single topic. LDA is a probabilistic model which assumes that each document is generated from a set of
latent topics, each characterized by a distribution over words. By feeding in the preprocessed sentences
of our input article as individual documents, we are able to extract the article's main themes. We sort the
output keywords by the probability with which they represent the topic, then cap the number of keywords
at 10. We chose LDA because it is a simple algorithm that outputs reproducible and reliable results and
can be customized in the number of topics and keywords it generates.
For the scope of our research, we were able to manually validate the extracted summaries to check that
each summary represented the article content well. Table 3 shows some samples of items in our
dataset after applying LDA. We see that while the extracted summaries are not perfect, they still represent
the general meaning of the article and were sufficient for the purposes of our research. Two common issues
we saw were:
Unordered words in the summary: the words representing the topics appear in no particular order. A
human reading the summary by itself may recognise that the words are all keywords of
the article, but put together as a sentence they do not completely make sense. We hypothesize that this
could have caused sub-optimal results when we started scraping articles using the summaries.
Appearance of stop words and other meaningless non-topic words in the summary: as a flow-on
issue from our preprocessing, our summaries were left with words such as “wa” (from “was”) or “ha”
(from “has”). This would have impacted the meaning of our summaries and later article scraping.
We will discuss the possibility of extracting better summaries using a more robust model in Section 6.2.
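The extraction can be sketched with gensim as below; the function name and the fixed random seed are our own additions for illustration and not necessarily the settings we used.

```python
from gensim import corpora
from gensim.models import LdaModel

def extract_summary(sentence_tokens: list[list[str]], max_keywords: int = 10) -> list[str]:
    # Each preprocessed sentence of the input article is treated as one "document".
    dictionary = corpora.Dictionary(sentence_tokens)
    bow_corpus = [dictionary.doc2bow(sent) for sent in sentence_tokens]
    lda = LdaModel(bow_corpus, num_topics=1, id2word=dictionary, random_state=0)
    # Keywords of the single topic, sorted by probability, capped at max_keywords.
    return [word for word, _prob in lda.show_topic(0, topn=max_keywords)]
```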
3.4.2 Article scraping
We feed the summary of the input article into Google News and collect the top three articles. We chose
Google News since it is a well-known search engine that returns the most popular articles on the internet.
This PageRank-style ranking is important since we assume that the contextual articles are real and
describe the current word-of-mouth truth from the internet. Therefore, for purposes of comparison,
an input article that is very different to our contextual articles is likely to be of the FAKE class.
For our research, we manually fed in all summaries for our dataset. Our motivation for this
research was to develop a tool that a user could potentially use to figure out whether the news they are
currently reading contains misinformation. We acknowledge there exist APIs that provide either a wrapper
around Google News or their own news search algorithm, which we could have looked into. However,
given the size of the dataset and our scope, this was not necessary to demonstrate our system.
SETUP: We use a virtual machine with a freshly installed, up-to-date version of Google Chrome.
Searches are conducted in “Incognito Mode” tabs. We also use a VPN to the west coast of the
US. These invariants ensure that Google does not give personalized results based on a browser
fingerprint or IP address. We chose the US as the VPN destination since our dataset articles
were extracted from US news sources and we wanted to scrape articles with a similar style of
writing. If you were to use the tool in Australia, Google would usually return articles from local
sources. We restrict our scope to specifically this dataset rather than training on a wide dataset
from all sources.
ID 118 (Real)
Article extract: FBI Director James Comey said Sunday that the bureau won't change the conclusion it made in July after it examined newly revealed emails related to the Hillary Clinton probe. "Based on our review, we have not changed our conclusions that we expressed in July with respect to Secretary Clinton" Comey wrote in a letter to 16 members of Congress. [...]
Summary: email review fbi clinton said july comey news new wa

ID 15 (Fake)
Article extract: After hearing about 200 Marines left stranded after returning home from Operation Desert Storm back in 1991, Donald J. Trump came to the aid of those Marines by sending one of his planes to Camp Lejuene, North Carolina to transport them back home to their families in Miami, Florida. Corporal Ryan Stickney was amongst the group that was stuck in North Carolina and could not make their way back to their homes. [...]
Summary: home marines trump wa stickney way north plane family

Table 3: Examples of summary extraction on items in the dataset.
Another invariant we implement is to append before:2020 to our summary. This forces Google
News to only return articles published before 2020, so that the articles we retrieve are not recent
news. A common discussion topic in our dataset was Donald Trump’s 2016 election campaign,
and news regarding Trump in 2023 is very different to that of 2016. Since our dataset is not
recent, clamping the date of the contextual articles simulates searching for similar articles at
the time the input article was being read, when few future articles would have been available.
PROCESS: We attempt to get the top three articles and save the URL for each input article.
Not all summaries returned three articles, so we perform scraping in three passes:
1. We enter the whole summary without any changes. This is the most ideal approach and
the most machine-replicable. This covered 70% of our dataset.
2. Still performing only generic actions, we remove any bad words or unimportant connectives
and search again. This should still be machine-replicable with further work. This
covered the next 20% of our dataset.
3. For the last 10% of our dataset, we had to manually look at the input article content and
the generated summary to figure out why we still received no results. Our hypothesis was
that this was a combination of our non-tuned summary extraction and the fact that some
outrageous Fake articles simply did not have any similar articles that could be found. We
will discuss this limitation in Section 6.2.
From the above passes, we were not able to find context articles for four input articles, listed
in a table in Appendix A. Furthermore, we were only able to find one or two articles for some
inputs, but we can still continue with our similarity model.
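Although we performed the searches manually, the query construction itself is mechanical. The sketch below shows how a search query, including the before:2020 restriction, could be built; the Google News URL pattern is an assumption and the scraping of the result page is deliberately left out.

```python
from urllib.parse import quote_plus

def build_search_url(summary_keywords: list[str], before_year: int = 2020) -> str:
    # Join the LDA keywords and append the date restriction used in our manual searches.
    query = " ".join(summary_keywords) + f" before:{before_year}"
    return "https://news.google.com/search?q=" + quote_plus(query)

# Example: the summary of article 118 from Table 3.
print(build_search_url(["email", "review", "fbi", "clinton", "said", "july", "comey", "news", "new"]))
```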