A method for measuring investigative journalism in local newspapers

Major changes to the operation of local newsrooms—ownership restructuring, layoffs, and a reorientation away from print advertising—have become commonplace in the last few decades. However, there have been few systematic attempts to characterize the impact of these changes on the types of reporting that local newsrooms produce. In this paper, we propose a method to measure the investigative content of news articles based on article text and influence on subsequent articles. We use our method to examine over-time and cross-sectional patterns in news production by local newspapers in the United States over the past decade. We find surprising stability in the quantity of investigative articles produced over most of the time period examined, but a notable decline in the last 2 y of the decade, corresponding to a recent wave of newsroom layoffs.


Supporting Information Text

Extended Materials and Methods
The full data set used for this study, in addition to code that can be used to build the dynamic influence model and the word embedding representations, is publicly available (1). Data Collection. The data set used for the analysis described in this paper was provided by NewsBank, a news database resource that provides a collection of millions of articles and media publications from thousands of different sources. We collected all available articles from 50 US newspapers, chosen to be geographically diverse from among those that have historically won journalistic awards. The articles from these sources were systematically queried, and the resultant data were reformatted and aggregated into the finalized raw data set. After removing advertisements and obituaries, the final data set consisted of 5,926,763 articles, 894 of which were winners or nominees of journalistic awards. We use all articles from Jan. 1, 2010 to Dec. 31, 2017 as our training set, Jan. 1 to Dec. 31, 2018 as our validation set, and Jan. 1 to Dec. 31, 2019 as our test set. The training, validation, and test sets contained 5,005,696, 511,834, and 409,233 articles, with 562, 213, and 119 award-winning articles, respectively.
Data Preprocessing. The first preprocessing step was reducing the words in the article text to their linguistic stems, and tokenizing the text into n-grams [n ∈ {1, 2}] to generate a document-frequency matrix based on the 20,000 most commonly found n-grams within each newspaper.
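This preprocessing step can be sketched as follows. The crude suffix stripper below is only a stand-in for a real stemmer (e.g., Porter), and the toy corpus and vocabulary size are illustrative, not the paper's actual pipeline:

```python
from collections import Counter
from itertools import chain

def crude_stem(word):
    # Crude suffix stripper standing in for a real stemmer (e.g., Porter).
    for suf in ("ing", "ed", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def ngrams(tokens, n_values=(1, 2)):
    # Generate unigrams and bigrams from a token list.
    grams = []
    for n in n_values:
        grams += [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def doc_frequency_matrix(docs, vocab_size=20000):
    # Stem, tokenize into n-grams, and count occurrences per document,
    # restricted to the most common n-grams in the corpus.
    tokenized = [ngrams([crude_stem(w) for w in d.lower().split()]) for d in docs]
    counts = Counter(chain.from_iterable(tokenized))
    vocab = [g for g, _ in counts.most_common(vocab_size)]
    index = {g: j for j, g in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, grams in enumerate(tokenized):
        for g in grams:
            if g in index:
                matrix[i][index[g]] += 1
    return matrix, vocab

docs = ["the mayor investigated the contracts", "sports scores reported daily"]
matrix, vocab = doc_frequency_matrix(docs)
```

In the paper's setting the matrix is built per newspaper, so each newspaper gets its own 20,000-term vocabulary.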
Next, for every newspaper, we split our corpus into monthly groups and trained a dynamic document influence model with 8 topics. This method has previously been used to evaluate the scholarly impact of scientific articles (2, 3). Influence scores, a measure of how an article affected the next month's articles on every topic, were obtained from this model. Each dynamic LDA topic for each of the 50 newspapers in our data set was a 20,000 × 120 matrix giving a probability distribution over the vocabulary terms for each of the 120 months. For every newspaper, these topic distributions were averaged over time, creating 50 distinct average distribution vectors for each of the 8 topics. A greedy optimization algorithm was then used to match these topics across newspapers (4) based on their shared vocabulary, creating 8 topic clusters. The influences of articles on each topic were labeled according to these clusters to achieve consistency across different newspapers.
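The greedy matching step can be sketched as below for a pair of newspapers. The similarity measure (cosine over the shared vocabulary) and the toy dimensions are assumptions for illustration; the paper's exact criterion follows ref. (4):

```python
import numpy as np

def greedy_topic_match(topics_a, topics_b):
    # Greedily pair each averaged topic distribution in `topics_a` with its
    # closest unmatched counterpart in `topics_b`, using cosine similarity
    # over the shared vocabulary. Returns the b-index matched to each a-index.
    a = topics_a / np.linalg.norm(topics_a, axis=1, keepdims=True)
    b = topics_b / np.linalg.norm(topics_b, axis=1, keepdims=True)
    sim = a @ b.T                      # pairwise similarities, shape (K, K)
    match = [-1] * len(topics_a)
    for _ in range(len(topics_a)):
        # Take the globally best remaining (a, b) pair, then retire both.
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        match[i] = int(j)
        sim[i, :] = -np.inf
        sim[:, j] = -np.inf
    return match

# Toy check: newspaper B's topics are a permutation of newspaper A's,
# so greedy matching should recover the inverse permutation.
rng = np.random.default_rng(0)
topics_a = rng.random((4, 6))
topics_b = topics_a[[2, 0, 3, 1]]
match = greedy_topic_match(topics_a, topics_b)
```

Extending this pairwise matching across all 50 newspapers yields the 8 topic clusters used to relabel influence scores consistently.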
A separate step was to perform feature extraction on the article text in order to retrieve additional information for our classification model. We conducted a human-designed feature extraction process, which flagged certain common markers of investigative work: any indication or mention of involvement in a series of articles, words related to investigative acts and court cases, inclusion of statistical, anecdotal, or legislative data, references to contemporary investigations, and the number of words and sentences in the article.
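A minimal sketch of this kind of handcrafted extraction is below. The keyword lists and the series-detection phrases are placeholders, not the lexicons actually used in the paper:

```python
import re

# Illustrative keyword lists; the actual lexicons are not published here.
INVESTIGATIVE_TERMS = {"investigation", "probe", "subpoena", "records"}
ACCOUNTABILITY_TERMS = {"court", "lawsuit", "indictment", "charges"}

def handcrafted_features(text):
    # Extract simple counts and flags from raw article text.
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "n_words": len(words),
        "n_sentences": len(sentences),
        # Placeholder phrases standing in for real series-detection rules.
        "mentions_series": int("part of a series" in text.lower()),
        "investigative_count": sum(w in INVESTIGATIVE_TERMS for w in words),
        "accountability_count": sum(w in ACCOUNTABILITY_TERMS for w in words),
    }

feats = handcrafted_features("The probe uncovered court records. A lawsuit followed.")
```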
Further steps were taken to extract useful information from the article sections. Namely, to capture the fact that some sections in newspapers are more likely than others to produce investigative articles (e.g., metro news versus sports), for every newspaper, we generated a document-frequency matrix from the words used in the section names of every article. We used this document-frequency matrix to train two separate gradient-boosting classification models on the training set. The first was trained to predict whether a given section had any award-winning articles at all (i.e., the ground-truth label was positive for every article belonging to a section that had an award-winning article). The second was trained to predict award-winning articles themselves (i.e., the ground-truth label was positive only for award-winning articles). The output probabilities from these two models were used as additional features for our final classification task.
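The distinction between the two labeling schemes is the crux of this step; a sketch of the label construction (with a toy article list) is below. Fitting two gradient-boosting classifiers on the section-name document-frequency matrix with these two label vectors would then yield the two probability features:

```python
def section_labels(articles):
    # Build the two ground-truth label sets described above.
    # `articles` is a list of (section_name, won_award) pairs.
    winning_sections = {sec for sec, won in articles if won}
    # Model 1: positive for every article in a section with any winner.
    section_level = [int(sec in winning_sections) for sec, _ in articles]
    # Model 2: positive only for the award-winning articles themselves.
    article_level = [int(won) for _, won in articles]
    return section_level, article_level

articles = [("metro", True), ("metro", False), ("sports", False)]
sec_y, art_y = section_labels(articles)
```

Note how the non-winning "metro" article is a positive example for the first model but a negative one for the second.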
Further, we used the pre-trained FastText word embedding model (5-7), which consists of 1 million word vectors trained on Wikipedia, the UMBC WebBase corpus, and the statmt.org news data set, to create 300-dimensional latent representations of the articles in our data set. For every article, we used the tf-idf weighted average of the word embeddings of its constituent words to generate the article embedding (8, 9).
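The tf-idf weighted averaging can be sketched as follows, with 3-dimensional toy vectors and hand-set idf weights standing in for the 300-dimensional FastText vectors and corpus-derived idf values:

```python
from collections import Counter

def article_embedding(tokens, vectors, idf):
    # tf-idf weighted average of word vectors: each word's vector is
    # weighted by (term frequency in the article) x (inverse document
    # frequency in the corpus), then normalized by the total weight.
    tf = Counter(tokens)
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, count in tf.items():
        if word not in vectors:
            continue  # out-of-vocabulary words are skipped in this sketch
        w = count * idf.get(word, 0.0)
        total = [t + w * v for t, v in zip(total, vectors[word])]
        weight_sum += w
    return [t / weight_sum for t in total] if weight_sum else total

# Toy vectors and idf weights, purely illustrative.
vectors = {"probe": [1.0, 0.0, 0.0], "budget": [0.0, 1.0, 0.0]}
idf = {"probe": 2.0, "budget": 1.0}
emb = article_embedding(["probe", "budget"], vectors, idf)
```

Here "probe" has twice the idf weight of "budget", so the article embedding sits closer to the "probe" vector.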
Our final set of 320 features for each news article included the 300-dimensional article embedding; 10 features from the document influence model (8 features representing the article's influence score on each of the 8 topics, the sum of the influence values over all topics, and the maximum of the 8 influence scores); 2 features consisting of the probabilities extracted from the section-name-based classifiers described above; and 8 features from our human-designed feature extraction process (number of words, number of sentences, mention of a series of articles, number of references to other parts of an article series, accountability-related word count, investigation-related word count, mentioned investigation duration, and mention of any investigation documents).
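The assembly of the 320-dimensional feature vector reduces to a concatenation of the four groups; the layout below (embedding first, then influence features, then section probabilities, then handcrafted features) is an assumed ordering for illustration:

```python
import numpy as np

def assemble_features(embedding, influences, section_probs, handcrafted):
    # Concatenate the four feature groups into one vector:
    # 300 embedding dims + (8 influences + their sum + their max)
    # + 2 section-classifier probabilities + 8 handcrafted = 320.
    influences = np.asarray(influences, dtype=float)
    influence_block = np.concatenate(
        [influences, [influences.sum(), influences.max()]]
    )
    return np.concatenate([embedding, influence_block, section_probs, handcrafted])

# Dummy values just to check the dimensions line up.
x = assemble_features(np.zeros(300), np.arange(8), [0.2, 0.7], np.ones(8))
```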

Classifier Model.
We used a three-layer neural network with the focal loss function (10), which has been found useful in classification tasks involving imbalanced data sets. Hyperparameter experiments were run to find the best model architecture using a grid search over the number of nodes in every layer, the regularization parameters, and the parameters of the focal loss function. When choosing the final model to deploy, we took a holistic approach and analyzed performance on the validation data set. We prioritized recall over precision and investigated the false positives belonging to non-investigative sections within every newspaper. Our final neural network architecture is a three-layer model with [256, 128, 128] nodes. On the test data set, using a threshold value of 0.5, our model correctly identifies 92/119 award-winning investigative articles; it also predicts that 8,949 articles that did not win awards are investigative, and classifies the remaining 400,163 articles as not investigative. With a threshold value of 0.9, the model correctly identifies 80/119 award winners. The AUC-ROC achieved by the model is 0.99, and the AUC-PR is 0.02.
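The focal loss of ref. (10) can be written compactly as below. The gamma and alpha values shown are the defaults from the original focal loss paper, not necessarily the values selected by the grid search described above; the toy probabilities are illustrative:

```python
import numpy as np

def focal_loss(y_true, p, gamma=2.0, alpha=0.25):
    # Binary focal loss: cross-entropy scaled by (1 - p_t)^gamma, which
    # down-weights easy, well-classified examples so that the rare positive
    # class (award-winning articles) dominates the gradient.
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y_true == 1, p, 1 - p)          # prob. of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))

y = np.array([1, 0, 0, 0])
# Confident, correct predictions incur far less loss than uncertain ones.
easy = focal_loss(y, np.array([0.9, 0.1, 0.1, 0.1]))
hard = focal_loss(y, np.array([0.6, 0.4, 0.4, 0.4]))
```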

List of Newspapers.
The full list of newspapers in our data set is provided below. For each newspaper, we queried all articles on NewsBank between Jan. 1, 2010 and Dec. 31, 2019. For some newspapers, articles were only available for a subset of this period, which we specify below.