Topic Modeling: How the news has shifted from the Coronavirus to Black Lives Matter

Apurva Misra
Published in Analytics Vidhya
5 min read · Jun 28, 2020


The ongoing BLM protests have shifted news coverage from the coronavirus to racism, which made me curious to look into how headlines changed between the period before George Floyd’s death and the period after.

In this article, we will use content from the New York Times, collected from their archives through their API, to build word clouds based on word frequencies and to use topic modeling to recognize and visualize the dramatic change. To accomplish this, we will use a mix of spaCy and scikit-learn.

All the code and *.csv files can be found in my GitHub repository; only snippets of the code are provided here.

Extracting data

The data was extracted using the New York Times API (https://developer.nytimes.com/apis). You will need to create a developer account and choose the API you are going to use; in our case it was the Archive API, which returns NYT articles for a given month. I made a request on June 19, 2020 for both May and June, which gave me data for the entirety of May and the first 19 days of June, a period that includes 24 days before the killing and 25 days after.
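As a sketch, a request to the Archive API can look like the following. The endpoint format follows the NYT developer documentation; the API key is a placeholder you obtain from your own developer account:

```python
import json
import urllib.request

def archive_url(year: int, month: int, api_key: str) -> str:
    """Build the Archive API endpoint for a given year and month."""
    return (f"https://api.nytimes.com/svc/archive/v1/"
            f"{year}/{month}.json?api-key={api_key}")

def fetch_month(year: int, month: int, api_key: str) -> dict:
    """Request one month of NYT article metadata and parse the JSON response."""
    with urllib.request.urlopen(archive_url(year, month, api_key)) as resp:
        return json.loads(resp.read().decode("utf-8"))

# e.g. may_2020 = fetch_month(2020, 5, "YOUR_API_KEY")
```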

The JSON file was dumped into a *.txt file, which was then parsed to extract the data into a CSV file. The features extracted were ‘abstract’, ‘pub_date’, ‘type_of_material’, ‘word_count’, ‘print_page’, ‘headline’, ‘section_name’, ‘subsection_name’.
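A minimal sketch of the parsing step, assuming the Archive API's usual response shape (`response.docs`, with `headline` as a nested object whose display text lives under `main`):

```python
import csv

FIELDS = ["abstract", "pub_date", "type_of_material", "word_count",
          "print_page", "headline", "section_name", "subsection_name"]

def docs_to_rows(archive_json: dict) -> list:
    """Flatten Archive API docs into one dict per article, keeping FIELDS."""
    rows = []
    for doc in archive_json.get("response", {}).get("docs", []):
        row = {f: doc.get(f) for f in FIELDS}
        # 'headline' is a nested object; keep only its 'main' text
        if isinstance(row["headline"], dict):
            row["headline"] = row["headline"].get("main")
        rows.append(row)
    return rows

def write_csv(rows, path):
    """Dump the flattened rows into a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```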

Preprocessing and WordCloud

After obtaining the data in CSV format, it was time to explore! Given below are some of the visualizations (the code can be found here). Most articles have a word count between 700 and 1400. “US” turned out to be the most popular section name, which is not surprising for an American publication. “News” has the highest count for the type of material published. There was no specific pattern in the type of material published with respect to the publication date, so that plot is not included here. The distribution of word count by print page can also be found in the repository.

Distribution of wordcount and number of articles
Distribution of section names
Distribution of type of material.
Word cloud before May 25th, 2020
Word cloud after May 25th, 2020

Apart from the persistent emphasis on “Trump”, the word clouds built from word frequencies highlight the change in popular topics, with “pandemic” and “reopen” giving way to “police”, “protest” and “black”. Of course the coronavirus is still a major issue and talking point, but less so.
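The frequency counts behind such a word cloud can be sketched as below; the rendering itself is typically done with the third-party `wordcloud` package via `generate_from_frequencies`, which is left as a comment here:

```python
from collections import Counter

def word_frequencies(headlines):
    """Count lowercase word frequencies across headlines (naive whitespace split)."""
    counts = Counter()
    for h in headlines:
        counts.update(w.lower().strip('.,!?"\'') for w in h.split())
    return counts

# Rendering (requires the third-party `wordcloud` package):
# WordCloud().generate_from_frequencies(word_frequencies(headlines))
```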

Topic Modelling

Topic modeling is a step above playing with word frequencies in a document: it is a process by which we try to capture the meaning of documents, and it doubles as dimensionality reduction. It is inherently similar to PCA, in which we project data onto new dimensions and can eliminate the dimensions in the new vector space that contribute little to the variance from document to document.

Each dimension becomes a linear combination of word frequencies rather than a single word frequency; these combinations are called “topic vectors”. We will be using truncated SVD, which requires a document-term matrix as input. We will go through the following steps, which might vary based on the kind of dataset being analyzed:

  1. Drop news articles for sections we are less interested in, keeping only the following sections: ‘New York’, ‘Opinion’, ‘Sports’, ‘Your Money’, ‘World’, ‘Science’, ‘Business Day’, ‘Today’s Paper’, ‘U.S.’, ‘Technology’, ‘Reader Center’, ‘Health’, ‘Sunday Review’, ‘Real Estate’, ‘Briefing’, ‘Climate’, ‘Times Insider’.
  2. Use spaCy for lowercasing, tokenization (splitting text into tokens, generally done by splitting on white space), lemmatization (removing inflected forms of words and retaining the base form; for example, “changing”, “changes” and “changed” all have the base word “change”) and dropping stop words (“the”, “a”, etc.).
  3. Drop tokens containing only numbers, as well as days of the week such as “monday” and “tuesday”, which are quite common in news headlines but do not add any relevant information.
  4. Form the BOW (bag-of-words) DataFrame with tokens as columns, each row corresponding to a headline, and values corresponding to the frequency of each token in that headline.
  5. Form the TF-IDF (term frequency–inverse document frequency) document-term matrix. TF-IDF is built on the idea that a word in a document should be given more weight if it is rarer across other documents.
  6. Perform truncated SVD using scikit-learn to find the new dimensions.

The code for the above steps is given below. The steps are shown only for the DataFrame before May 25th, 2020; the same steps were repeated for the DataFrame after May 25th, which can be found in the repository.

Topics Before May 25th:

Topic 0: 
briefing coronavirus happen today test late update
Topic 1:
face test mask bad suicide threat stress
Topic 2:
test people antibody pandemic f.d.a coronavirus positive
Topic 3:
coronavirus late update n.y.c new york pandemic
Topic 4:
new pandemic york home market city end

Topics after May 25th:

Topic 0: 
briefing happen today coronavirus evening protest george
Topic 1:
result primary election district house california texas
Topic 2:
police protest trump new america coronavirus floyd
Topic 3:
briefing police protest floyd george trump definitely
Topic 4:
trump twitter president rally tweet campaign biden

The topics before May 25th are unsurprisingly dominated by the coronavirus pandemic, while topics after the 25th consistently include George Floyd, the protests, and even the battle between Trump and Twitter.

A meaningful next step for this analysis would be to study the change in sentiment between the two periods. We now know that *what* we are writing and talking about has changed; it would be interesting to know whether *how* we are writing and talking is changing as well.

