library(papercheck)
library(dplyr) # for data wrangling
library(readr) # reading and writing CSV files
library(ggplot2) # for dataviz

See the batch processing vignette for information on how to load multiple PDFs. Here, we will load 250 open access papers from Psychological Science, which have been previously converted to XML by grobid, read into papercheck, and saved as an Rds object.

# load from RDS for efficiency
papers <- readRDS("psysci_oa.Rds")

Fixed Terms

Let’s start with a fixed search term: “power analysis”. We’ll keep track of our iteratively developed search terms by naming the resulting table text_#.

text_1 <- search_text(papers, pattern = "power analysis")

Here we have 104 results. We’ll just show the paper id and text columns of the returned table, but the table also provides the section type, header, and section, paragraph, and sentence numbers (div, p, and s).
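
For example, one way to skim just those two columns, using the dplyr verbs loaded above:

# show just the paper id and matched sentence text
text_1 |> select(id, text) |> head()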

We caught a lot of sentences with that term, but are probably missing a few. Let’s try a more general fixed search term: “power”.

text_2 <- search_text(papers, pattern = "power")

Here we have 744 results. Inspect them to see if there are any false positives.
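
One quick way to inspect a large result set is to skim a random sample of the matched sentences, for example:

# skim a random sample of matched sentences for false positives
text_2 |> slice_sample(n = 20) |> pull(text)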

Regex

After a quick skim through the 744 results, we can see that words like “powerful” never indicate a power analysis, so we should try to exclude them.

We can use regex to make our text search a bit more specific. The following pattern requires that “power” is optionally followed by “ed” and then by a word boundary (like a space or full stop), so it will match “power” and “powered”, but not “powerful”.

# test some examples to check the pattern
pattern <- "power(ed)?\\b"
yes <- c("power",
          "power.",
          "power analysis",
          "powered")
no  <- c("powerful",
          "powerful analysis")
grepl(pattern, yes)
#> [1] TRUE TRUE TRUE TRUE
grepl(pattern, no)
#> [1] FALSE FALSE
text_3 <- search_text(papers, pattern)

Here we have 651 results. Inspect them for false positives again.

You can repeat this process of skimming the results and refining the search term iteratively until you are happy that you have probably caught all of the relevant text and don’t have too many false positives.

One useful technique is to use dplyr::anti_join() to check which text was excluded when you make a search term more specific, so you can confirm there are no (or few) false negatives.

# rows in text_2 that were excluded in text_3
excluded <- anti_join(text_2, text_3, 
                      by = c("id", "div", "p", "s"))

Screening

Once you are happy that your search term includes all of the relevant text and not too much irrelevant text, the next step is to save this data frame so you can open it in a spreadsheet application and code each row for ground truth.

readr::write_csv(text_3, "power/power_screening.csv")

Be careful opening files in spreadsheet apps like Excel. Sometimes they will garble special characters like ü, which will make the validation process below inaccurate, since the expected values from your spreadsheet will not exactly match the calculated values from the modules you’re testing. If this has happened, one way to fix it is to read the Excel file into R, replace the text column with the text column from the data frame above, and re-save it as a CSV file.
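
For example, a minimal sketch of that repair, assuming the coded spreadsheet was saved as “power/power_screening_coded.xlsx” (the file read in the validation section below):

# re-read the coded spreadsheet, restore the original text, and re-save
coded <- readxl::read_excel("power/power_screening_coded.xlsx")
coded$text <- text_3$text # replace the garbled text column
readr::write_csv(coded, "power/power_screening_coded.csv")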

Validating a Module

Module Creation

To validate a module, you need to write your search term into a module. See the modules vignette for details. Creating a module for a text search is very straightforward. Just save the following text in a file called “power/power0.mod” (matching the path used below). The traffic_light entry returns “green” for a paper if any sentences are found, and “red” if none are found.

{
  "title": "Power Analysis",
  "description": "List all sentences that contain a power analysis.",
  "text": {
    "pattern": "power(ed)?\\b"
  },
  "traffic_light": {
    "found": "green",
    "not_found": "red"
  }
}

Now test your module by running it on the papers. The returned table should be identical to text_3.

mod_test <- module_run(papers, "power/power0.mod")
all.equal(mod_test$table, text_3)
#> [1] TRUE

Set Up Validation Files

Once you have the ground truth coded from your best inclusive search term, you can validate your module and start trying to improve its performance.

First, let’s use the over-inclusive search term. This will, by definition, have no false negatives, but further refining your module will start to produce both false positives and false negatives.

You have to set up two tables to match the module output. First, the sample of papers to check.

sample <- data.frame(
  id = names(papers)
)

Second, a table of the expected text matches. You can get this by filtering your ground truth table to just the rows that are true positives (hand-coded here as the column power_computation).

ground_truth <- readxl::read_excel("power/power_screening_coded.xlsx")
ground_truth$text <- text_3$text # fix problem with excel and special chars

expected <- ground_truth |>
  filter(power_computation == 1) |>
  select(id, text)

Add the traffic light to the sample table by determining if there are any matches in the expected table.

sample$traffic_light <- ifelse(
  sample$id %in% expected$id, 
  "green", "red")

Validate

v0 <- validate(module = "power/power0.mod",
               sample = sample, 
               expected = expected, 
               path = "xml")
v0

Refine and Iterate

Refine your module to improve it based on your coding of the ground truth. For example, perhaps we decide that almost all instances of real power analyses contain both the strings “power” and “analys”:

pattern <- "(analys.*power|power.*analys)"
yes <- c("power analysis",
         "power analyses",
         "power has an analysis",
         "analyse power",
         "analysis is powered at")
no  <- c("powered",
         "power",
         "analysis")
grepl(pattern, yes)
#> [1] TRUE TRUE TRUE TRUE TRUE
grepl(pattern, no)
#> [1] FALSE FALSE FALSE

Duplicate the file “power/power0.mod” as “power/power1.mod”, change the search pattern to this new one, and re-run the validation.

v1 <- validate(module = "power/power1.mod",
               sample = sample, 
               expected = expected, 
               path = "xml")
v1

Now we are only matching 54% of the tables. There are a few ways to investigate this.

Traffic lights can be more complex than “green” and “red”, since they can also return values like “info” or “na”. But if your values map straightforwardly onto yes and no, you can calculate signal detection measures for your module.

tl_accuracy(v1, yes = "green", no = "red") |> str()
#> List of 9
#>  $ hits              : int 77
#>  $ misses            : int 41
#>  $ false_alarms      : int 23
#>  $ correct_rejections: int 109
#>  $ accuracy          : num 0.744
#>  $ sensitivity       : num 0.653
#>  $ specificity       : num 0.174
#>  $ d_prime           : num 1.33
#>  $ beta              : num 1.44
#>  - attr(*, "class")= chr "ppchk_accuracy_measures"

You can plot the number of missing and extra results per paper to see if the problem is in false positives and/or false negatives.

ggplot(v1$sample) +
  geom_freqpoly(aes(x = misses), color = "red", binwidth = 1) +
  geom_freqpoly(aes(x = false_alarms), color = "blue", binwidth = 1) +
  scale_x_continuous(breaks = 0:10, limits = c(0, 10)) +
  labs(x = "Number of Results Missing (red) or Extra (blue) per Paper", 
       y = "Number of Papers")

Both seem to be happening here, so let’s look more specifically at false alarms and misses.

false_alarms <- anti_join(v1$table, expected, 
                          by = c("id", "text"))
misses <- anti_join(expected, v1$table, 
                    by = c("id", "text"))

Sometimes it makes more sense to filter a set down in steps, such as all sentences that contain the word “power”, and then all of those that contain at least one number in the format “.#” or “#%”.

results <- papers |>
  search_text("power(ed)?\\b") |>
  search_text("(\\.[0-9]|[0-9]%)")

You can add chained text searches to the JSON module file like this:

{
  "title": "Power Analysis",
  "description": "List all sentences that contain the string 'power' and a number.",
  "text": {
    "pattern": "power(ed)?\\b"
  },
  "text": {
    "pattern": "(\\.[0-9]|[0-9]%)"
  },
  "traffic_light": {
    "found": "green",
    "not_found": "red"
  }
}
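
Save this as “power/power2.mod” and re-run the validation.
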
v2 <- validate(module = "power/power2.mod",
               sample = sample, 
               expected = expected, 
               path = "xml")
v2
tl_accuracy(v2, yes = "green", no = "red") |> str()
#> List of 9
#>  $ hits              : int 113
#>  $ misses            : int 5
#>  $ false_alarms      : int 11
#>  $ correct_rejections: int 121
#>  $ accuracy          : num 0.936
#>  $ sensitivity       : num 0.958
#>  $ specificity       : num 0.0833
#>  $ d_prime           : num 3.11
#>  $ beta              : num 0.589
#>  - attr(*, "class")= chr "ppchk_accuracy_measures"
false_alarms2 <- anti_join(v2$table, expected, 
                           by = c("id", "text"))
misses2 <- anti_join(expected, v2$table, 
                     by = c("id", "text"))

Compare Modules

data.frame(
  #module = c(v0$module, v1$module, v2$module),
  tables = c(v0$table_matched, v1$table_matched, v2$table_matched),
  traffic_lights = c(v0$tl_matched, v1$tl_matched, v2$tl_matched)
)
#>   tables traffic_lights
#> 1  0.484          0.772
#> 2  0.540          0.744
#> 3  0.808          0.936