ID 118 (Real)
Article extract: FBI Director James Comey said Sunday that the bureau won't change the conclusion it made in July after it examined newly revealed emails related to the Hillary Clinton probe. "Based on our review, we have not changed our conclusions that we expressed in July with respect to Secretary Clinton," Comey wrote in a letter to 16 members of Congress. [...]
Tokens: ['fbi', 'director', 'james', 'comey', 'said', 'sunday', 'bureau', 'change', 'conclusion', 'made', 'july', 'examined', 'newly', 'revealed', 'email', 'related', 'hillary', 'clinton', 'probe', '.', '"', 'based', 'review', ',', 'changed', 'conclusion', 'expressed', 'july', 'respect', 'secretary', 'clinton', ...]

ID 15 (Fake)
Article extract: After hearing about 200 Marines left stranded after returning home from Operation Desert Storm back in 1991, Donald J. Trump came to the aid of those Marines by sending one of his planes to Camp Lejuene, North Carolina to transport them back home to their families in Miami, Florida. Corporal Ryan Stickney was amongst the group that was stuck in North Carolina and could not make their way back to their homes. [...]
Tokens: ['hearing', '200', 'marines', 'left', 'stranded', 'returning', 'home', 'operation', 'desert', 'storm', 'back', '1991', ',', 'donald', 'j', '.', 'trump', 'came', 'aid', 'marines', 'sending', 'one', 'plane', 'camp', 'lejuene', ',', 'north', 'carolina', 'transport', 'back', 'home', 'family', 'miami', ...]

Table 1: Examples of preprocessing and tokenization on items in the dataset.
looks up the word in the WordNet corpus to get the lemma. Later in the research, we realised
that this hypothesis may not have been accurate.
Firstly, the nltk library we were using does not automatically detect the part of speech and, by
default, only lemmatizes nouns. While it is arguably useful for us that the tense of verbs is
preserved, we are technically not lemmatizing fully. Secondly, further research suggested that
lemmatization may not be ideal for BERT embeddings, since it removes some semantics that could
be learnt by the BERT model. We will discuss these limitations further in Section 6.2.
Remove stopwords: Stopwords were removed from the text in order to reduce complexity.
Apart from the above methods, we also tested removing punctuation. However, this was not used in
the end, since we added non-latent features that measure punctuation counts and also wanted to
maintain semantics for BERT.
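As a hypothetical illustration of such a non-latent feature (the exact feature set in our implementation may differ), punctuation counts can be computed directly from the raw text:

```python
# Sketch of a simple non-latent punctuation-count feature (illustrative;
# not the exact feature definition used in our implementation).
from collections import Counter
import string

def punctuation_counts(text: str) -> dict[str, int]:
    # Count every character that belongs to string.punctuation.
    return dict(Counter(ch for ch in text if ch in string.punctuation))

print(punctuation_counts("Based on our review, we have not changed our conclusions!"))
# → {',': 1, '!': 1}
```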
After preprocessing, tokens are then generated based on any whitespace and punctuation in the
remaining text. Table 1 shows samples of tokenized input articles.
3.2 Feature — BERT embeddings
BERT (Bidirectional Encoder Representations from Transformers) is a language representation model
proposed by Devlin et al. [11]. It is pre-trained on BookCorpus and the English Wikipedia using
masked language modelling (MLM) and next sentence prediction (NSP). In MLM, some of the input
tokens are masked, and the training objective is to predict the masked tokens based solely on their
context. For NSP, the model concatenates two sentences, with a 50% chance of them being neighbouring
sentences, and is pre-trained to predict whether the two are indeed neighbours [11]. BERT obtains
state-of-the-art results on several tasks and is well suited to representing documents and textual
data. Hence, we will use BERT as the main feature to encode our articles. We will use HuggingFace's
'bert-base-uncased' pretrained model, which was trained on uncased data. To encode an article, we
truncate it to the first 512 tokens, pass it through 'bert-base-uncased', and take the [CLS] token's
vector as the BERT feature for our classification model. Due to this truncation, the BERT encoding
won't be able to capture the entire article's content. However, we hypothesise