Using Papercheck for MetaScience

library(papercheck)
library(readr) # reading and writing CSV files
library(dplyr) # for data wrangling

In this vignette, we will demonstrate the steps towards creating a module that detects sentences reporting a power analysis. This could be used for metascientific enquiry, or as a first step in a module that aims to give advice about whether a power analysis reports all of the necessary information for interpretation.

Initial Text Search

First, we need to set up the sample of papers we will code for ground truth. See the batch processing vignette for information on how to load multiple PDFs. Here, we will load 250 open access papers from Psychological Science, which have previously been read into papercheck.

papers <- psychsci

If you want to be completely thorough, you can manually code every single sentence in every single paper for your target concept.

text_all <- search_text(papers)

Here, that results in 67844 sentences. However, we can narrow that down a LOT with some simple text searches.

Fixed Terms

Let’s start with a fixed search term: “power analysis”. We’ll keep track of our iteratively developed search terms by naming the resulting table text_#.

text_1 <- search_text(papers, pattern = "power analysis")

Here we have 104 results. We’ll just show the paper id and text columns for the first 10 rows of the returned table, but the table also provides the section type, header, and section, paragraph, and sentence numbers (div, p, and s).

We caught a lot of sentences with that term, but are probably missing a few. Let’s try a more general fixed search term: “power”.

text_2 <- search_text(papers, pattern = "power")

Here we have 772 results. Inspect the first 100 rows to see if there are any false positives.

Regex

After a quick skim through the 772 results, we can see that words like “powerful” or “Power-Point” are never reporting a power analysis, so we should try to exclude them.

We can use regex to make our text search a bit more specific. The following pattern requires that “power” starts at a word boundary (the “G*” alternative is intended to allow “G*Power”), is optionally followed by “ed”, and then ends at a word boundary (such as a space or full stop), so it will match “power” and “powered”, but not “powerful”.

pattern <- "(\\b|G*)power(ed)?\\b"

# test some examples to check the pattern
yes <- c("power",
         "power.",
         "Power",
         "power analysis",
         "powered",
         "G*Power")
no  <- c("powerful",
         "powerful analysis", 
         "empower")
grepl(pattern, yes, ignore.case = TRUE)

#> [1] TRUE TRUE TRUE TRUE TRUE TRUE

grepl(pattern, no, ignore.case = TRUE)

#> [1] FALSE FALSE FALSE

text_3 <- search_text(papers, pattern)

Here we have 653 results. Inspect them for false positives again.
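The yes/no check above can be wrapped in a small helper so that every revised pattern is tested the same way. The sketch below is an illustration only: `check_pattern()` is a hypothetical helper, not part of papercheck, and `pattern_escaped` is an assumed variant of the pattern in which the asterisk is escaped, because an unescaped `*` acts as a quantifier in regular expressions rather than as a literal character. It uses `perl = TRUE`, so results may differ slightly from R's default regex engine.

```r
# Hypothetical helper (not part of papercheck): checks that a pattern matches
# every string in `yes` and none of the strings in `no`.
check_pattern <- function(pattern, yes, no) {
  hit_yes <- grepl(pattern, yes, ignore.case = TRUE, perl = TRUE)
  hit_no  <- grepl(pattern, no,  ignore.case = TRUE, perl = TRUE)
  list(
    missed   = yes[!hit_yes], # should have matched but did not
    unwanted = no[hit_no],    # should not have matched but did
    ok       = all(hit_yes) && !any(hit_no)
  )
}

# Assumed variant of the pattern with the asterisk escaped (literal "G*")
pattern_escaped <- "(\\b|G\\*)power(ed)?\\b"

result <- check_pattern(
  pattern_escaped,
  yes = c("power", "power.", "Power", "power analysis", "powered", "G*Power"),
  no  = c("powerful", "powerful analysis", "empower")
)
result$ok
```

If `result$ok` is `FALSE`, the `missed` and `unwanted` elements show which test strings need attention before the pattern is run on the full sample.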

Refining the search

You can repeat this process of skimming the results and refining the search term iteratively until you are happy that you have probably caught all of the relevant text and don’t have too many false positives.

Let’s also have a quick look at any papers that mention power more than 10 times, as they are probably talking about a different sense of power.

count(text_3, id, sort = TRUE) |>
  filter(n > 10)

#> # A tibble: 6 × 2
#>   id                    n
#>   <chr>             <int>
#> 1 0956797616647519     94
#> 2 09567976211001317    23
#> 3 09567976241254312    16
#> 4 09567976221147259    12
#> 5 09567976211068070    11
#> 6 09567976221094036    11

That first paper is a definite outlier, and indeed, the title is “Researchers’ Intuitions About Power in Psychological Research”. Excluding that one, what sentences are in the other papers?

outliers <- count(text_3, id, sort = TRUE) |>
  filter(n > 10, n < 90) |>
  semi_join(text_3, y = _, by = "id")

It looks like a lot of this text is about alpha/beta/theta oscillations. We can pipe our results to another search_text() call to return only sentences that do not contain the strings “beta”, “theta” or “oscillat” (we won’t exclude “alpha” because specifying your critical alpha threshold is part of good reporting for a power analysis).

# exclude outlier paper from sample
to_exclude <- names(papers) == "0956797616647519"
papers <- papers[!to_exclude]

text_4 <- papers |>
  # search for power sentences
  search_text("(\\b|G*)power(ed)?\\b") |>
  # exclude oscillations sentences
  search_text("^(?!.*\\b(beta|theta|oscillat)).*", perl = TRUE)
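To see how the negative lookahead in the second step behaves, here is a minimal base-R sketch on a few invented sentences. It uses `grepl()` directly rather than `search_text()`, a simplified first-step pattern, and assumes case-insensitive matching; the sentences are made up for illustration.

```r
sentences <- c(
  "We had 80% power to detect the effect at alpha = .05.",
  "Beta power increased over frontal sites.",
  "Power in the theta band was reduced.",
  "Oscillatory power was computed per trial."
)

# Step 1: keep sentences that mention "power" or "powered"
has_power <- grepl("\\bpower(ed)?\\b", sentences,
                   ignore.case = TRUE, perl = TRUE)

# Step 2: the negative lookahead (?!...) fails as soon as an excluded term
# can be found anywhere in the string, so only "clean" sentences match
exclude <- "^(?!.*\\b(beta|theta|oscillat)).*"
is_clean <- grepl(exclude, sentences, ignore.case = TRUE, perl = TRUE)

kept <- sentences[has_power & is_clean]
kept
```

Only the first sentence survives both steps; it mentions “alpha” but none of the excluded terms. Lookaheads are only available with `perl = TRUE`.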

One useful technique is to use dplyr::anti_join() to check which text was excluded when you make a search term more specific, to make sure there are no or few false negatives.

# rows in text_3 that were excluded in text_4
excluded <- anti_join(text_3, text_4, 
                      by = c("id", "div", "p", "s"))
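As a small illustration of what `anti_join()` returns, the sketch below uses two invented tables with the same key columns as the search results. The data and the names `broad`, `narrow` and `dropped` are made up for this example.

```r
library(dplyr)

# Toy stand-ins for two search results, keyed like the tables above
broad <- tibble(
  id   = c("a", "a", "b"),
  div  = c(1, 2, 1),
  p    = c(1, 1, 3),
  s    = c(2, 1, 1),
  text = c("We ran a power analysis.",
           "Theta power decreased.",
           "The study was powered at 90%.")
)

# A more specific search that drops the oscillation sentence
narrow <- filter(broad, !grepl("theta", text, ignore.case = TRUE))

# Rows present in `broad` but absent from `narrow`
dropped <- anti_join(broad, narrow, by = c("id", "div", "p", "s"))
dropped$text
```

Reading through `dropped` is a quick way to confirm that the stricter search removed only sentences you meant to remove.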

Screening

Once you are happy that your search term includes all of the relevant text and not too much irrelevant text (we’ve narrowed our candidate sentences down now to 0.8% of the full text!), the next step is to save this data frame so you can open it in a spreadsheet application and code each row for ground truth.

readr::write_csv(text_4, "power/power_screening.csv")

Be careful opening files in spreadsheet apps like Excel. Sometimes they will garble special characters such as ü, which will make the validation process below inaccurate, since the expected values from your spreadsheet will not exactly match the calculated values from the modules you’re testing. If this has happened, one way to fix it is to read the coded file into R, replace its text column with the text column from the data frame above, and re-save it as a CSV file. This only works if the rows of the coded file are still in the same order as in text_4.

ground_truth <- read_csv("power/power_screening_coded.csv", 
                         show_col_types = FALSE)

# fix problem with excel and special chars
ground_truth$text <- text_4$text
write_csv(ground_truth, "power/power_screening_coded.csv")
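A more defensive variant looks each coded row up by its location columns instead of relying on row order. The sketch below uses invented data and a hypothetical helper, `make_key()`; it assumes that id, div, p and s together identify a sentence uniquely, as they do when used as join keys above.

```r
# Toy stand-ins: `original` plays the role of text_4, `coded` the file
# returned from the spreadsheet (rows reordered, one character garbled)
original <- data.frame(
  id   = c("a", "a", "b"),
  div  = c(1, 1, 2),
  p    = c(1, 2, 1),
  s    = c(1, 1, 3),
  text = c("Power was 80%.",
           "M\u00fcller ran a power analysis.",
           "We used G*Power."),
  stringsAsFactors = FALSE
)
coded <- original[c(3, 1, 2), ]
coded$text[3] <- "M?ller ran a power analysis."
coded$power_computation <- c(0, 1, 1) # invented codes

# Hypothetical helper: build one key string from the location columns
make_key <- function(df) paste(df$id, df$div, df$p, df$s, sep = "|")
idx <- match(make_key(coded), make_key(original))

# Stop if any coded row has no counterpart, then restore the text
stopifnot(!anyNA(idx))
coded$text <- original$text[idx]
```

The `stopifnot()` call guards against rows that were added or had their location columns edited while coding.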

Validating a Module

Module Creation

To validate a module, you need to write your search term into a module. See the modules vignette for details. Creating a module for a text search is very straightforward. Just save the following text in a file called “power0.R” (in the examples below it is stored in the “power” directory). You can omit the description in the roxygen section for now.

#' Power Analysis v0
power0 <- function(paper) {
  table <- paper |>
    # search for power sentences
    search_text("(\\b|G*)power(ed)?\\b") |>
    # exclude oscillations sentences
    search_text("^(?!.*\\b(beta|theta|osscil)).*", perl = TRUE)
  
  summary_table <- dplyr::count(table, id, name = "n_power")
  
  list(
    table = table,
    summary = summary_table,
    na_replace = 0
  )
}

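Now test your module by running it on the papers. The returned table should be identical to text_4, in which case all.equal() returns TRUE. In the output below it lists differences instead (529 rows versus 522), most likely because the exclusion pattern in the module above is spelled “osscil” rather than “oscillat”, so check that the patterns in your module exactly match the ones used to create text_4.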

mod_test <- module_run(papers, "power/power0.R")
all.equal(mod_test$table, text_4)

#>  [1] "Attributes: < Component \"row.names\": Numeric: lengths (529, 522) differ >"   
#>  [2] "Component \"text\": Lengths (529, 522) differ (string compare on first 522)"   
#>  [3] "Component \"text\": 338 string mismatches"                                     
#>  [4] "Component \"section\": Lengths (529, 522) differ (string compare on first 522)"
#>  [5] "Component \"section\": 172 string mismatches"                                  
#>  [6] "Component \"header\": Lengths (529, 522) differ (string compare on first 522)" 
#>  [7] "Component \"header\": 'is.NA' value mismatch: 2 in current 2 in target"        
#>  [8] "Component \"div\": Numeric: lengths (529, 522) differ"                         
#>  [9] "Component \"p\": Numeric: lengths (529, 522) differ"                           
#> [10] "Component \"s\": Numeric: lengths (529, 522) differ"                           
#> [11] "Component \"id\": Lengths (529, 522) differ (string compare on first 522)"     
#> [12] "Component \"id\": 290 string mismatches"

We also returned a summary table, which gives a single row per paper, and the number of matching sentences from the main table.

Set Up Validation Files

Once you have the ground truth coded from your best inclusive search term, you can validate your module and start trying to improve its performance.

First, let’s use the over-inclusive search term. Because the ground truth was coded from its matches, this should have almost no false negatives, but many false positives. Further refining your module will reduce the false positives at the cost of introducing false negatives.

You have to set up two files to match the module output. First, a table of the expected text matches. You can get this by filtering your ground truth table to just the rows that are true positives (hand-coded here as the column power_computation).

ground_truth <- read_csv("power/power_screening_coded.csv", 
                         show_col_types = FALSE)

table_exp <- ground_truth |>
  filter(power_computation == 1) |>
  select(id, text)

Next, determine the expected summary table. Since not all papers are in the expected table above, you need to add them manually with a count of 0. The code below demonstrates one way to do that.

summary_exp <- papers |>
  # gets a table of just the paper IDs
  info_table(c()) |>
  # join in the expected table
  left_join(table_exp, by = "id") |>
  # count rows with text for each id
  summarise(n_power = sum(!is.na(text)), .by = "id")
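The same idea, counting matches per paper while keeping papers with no matches at zero, can be sketched in base R with a factor whose levels are fixed in advance. This is an alternative illustration on invented IDs, not the approach used in the code above.

```r
# Toy stand-ins: all paper IDs in the sample, and the IDs of the
# hand-coded true-positive sentences (one entry per sentence)
all_ids      <- c("a", "b", "c", "d")
positive_ids <- c("a", "a", "c")

# Fixing the factor levels to all_ids makes table() report 0 for
# papers with no true-positive sentences
counts <- table(factor(positive_ids, levels = all_ids))

summary_toy <- data.frame(
  id      = names(counts),
  n_power = as.integer(counts),
  stringsAsFactors = FALSE
)
summary_toy
```

Papers “b” and “d” appear with a count of 0, which is what the expected summary table needs for papers without any coded power analysis sentences.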

Validate

Run a validation using the validate() function. Set the first argument to your sample of papers, the second to the path to your module, and the next arguments to the expected values of any items returned by your module (usually table and/or summary).

v0 <- validate(papers, 
               module = "power/power0.R",
               table = table_exp, 
               summary = summary_exp)

Printing the returned object will give you a summary of the validation.

v0

#>  Validated matches for module `power/power0.R`:
#> 
#> * N in validation sample: 249
#> * table: 
#>   * true_positive: 228
#>   * false_positive: 303
#>   * false_negative: 2
#> * summary: 
#>   * n_power: 0.56

You can access these values directly from the stats item of the list. See the validation vignette for further information about the contents of this list.

v0$stats |> str()

#> List of 3
#>  $ n_papers: int 249
#>  $ table   :List of 3
#>   ..$ true_positive : int 228
#>   ..$ false_positive: int 303
#>   ..$ false_negative: int 2
#>  $ summary :List of 1
#>   ..$ n_power: num 0.558

Refine and Iterate

Refine your module to improve it based on your coding of the ground truth. For example, perhaps we decide that almost all instances of real power analyses contain both the strings “power” and “analys”.

pattern <- "(analys.*power|power.*analys)"
yes <- c("power analysis",
         "power analyses",
         "power has an analysis",
         "analyse power",
         "analysis is powered at")
no  <- c("powered",
         "power",
         "analysis")
grepl(pattern, yes)

#> [1] TRUE TRUE TRUE TRUE TRUE

grepl(pattern, no)

#> [1] FALSE FALSE FALSE

Duplicate the file “power0.R” as “power1.R”, change the search pattern to this new one, and re-run the validation.

v1 <- validate(paper = papers, 
               module = "power/power1.R",
               table = table_exp, 
               summary = summary_exp)

v1

#>  Validated matches for module `power/power1.R`:
#> 
#> * N in validation sample: 249
#> * table: 
#>   * true_positive: 91
#>   * false_positive: 92
#>   * false_negative: 137
#> * summary: 
#>   * n_power: 0.56

This version has the same overall accuracy by paper, but fewer false positives and more false negatives. False positives in the context of a module that informs scientists about a potential problem are not necessarily undesirable. It may be better to be over-sensitive and catch almost all problems, even if you also catch many non-problems. You will need to evaluate the validation results in the context of what you want your module to do.

Iterate again

Let’s try a two-step process for finding sentences with the word power that also contain a decimal number or a percentage. As before, put this search into a new module file, “power2.R”, and validate it.

table <- papers |>
  search_text("(\\b|G*)power(ed)?\\b") |>
  search_text("(\\.[0-9]|[0-9]%)")
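To check what the second-step pattern does and does not catch, here is a short base-R sketch on invented sentences, using `grepl()` directly rather than `search_text()`.

```r
sentences <- c(
  "We had 80% power to detect the effect.",
  "Power to detect d = 0.4 was .95.",
  "A power analysis was conducted.",
  "Study 2 had more power than Study 1."
)

# Matches a decimal point followed by a digit, or a digit followed by %
has_number <- grepl("(\\.[0-9]|[0-9]%)", sentences)

sentences[has_number]
```

The first two sentences match. The last one does not, even though it contains digits, because whole numbers without a decimal point or percent sign are ignored by this pattern.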

v2 <- validate(paper = papers, 
               module = "power/power2.R",
               table = table_exp, 
               summary = summary_exp)

v2

#>  Validated matches for module `power/power2.R`:
#> 
#> * N in validation sample: 249
#> * table: 
#>   * true_positive: 171
#>   * false_positive: 81
#>   * false_negative: 57
#> * summary: 
#>   * n_power: 0.69

It’s definitely doing better than the last version. Can you refine it to do even better?
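To compare versions on a common scale, it can help to convert the table counts into precision, recall and F1. The sketch below computes these by hand from the counts printed in the three validation summaries above; it makes no assumption about whether validate() reports such measures itself.

```r
# Counts copied from the validation summaries above
counts <- data.frame(
  version        = c("power0", "power1", "power2"),
  true_positive  = c(228, 91, 171),
  false_positive = c(303, 92, 81),
  false_negative = c(2, 137, 57)
)

# Precision: share of returned sentences that are true positives
counts$precision <- with(counts,
  true_positive / (true_positive + false_positive))
# Recall: share of ground-truth sentences that were returned
counts$recall <- with(counts,
  true_positive / (true_positive + false_negative))
# F1: harmonic mean of precision and recall
counts$f1 <- with(counts,
  2 * precision * recall / (precision + recall))

round(counts[, c("precision", "recall", "f1")], 2)
```

Which measure matters most depends on the purpose of the module: an over-sensitive screening module may favour recall, while a module used for counting power analyses may favour precision.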