library(papercheck)
library(readr) # reading and writing CSV files
library(dplyr) # for data wrangling
In this vignette, we will demonstrate the steps towards creating a module that detects sentences reporting a power analysis. This could be used for metascientific enquiry, or as a first step in a module that aims to give advice about whether a power analysis reports all of the necessary information for interpretation.
Initial Text Search
First, we need to set up the sample of papers we will code for ground truth. See the batch processing vignette for information on how to load multiple PDFs. Here, we will load 250 open-access papers from Psychological Science, which have previously been read into papercheck.
papers <- psychsci
If you want to be completely thorough, you can manually code every single sentence in every single paper for your target concept.
text_all <- search_text(papers)
Here, that results in 67844 sentences. However, we can narrow that down a LOT with some simple text searches.
Fixed Terms
Let’s start with a fixed search term: “power analysis”. We’ll keep track of our iteratively developed search terms by naming each resulting table text_#.
text_1 <- search_text(papers, pattern = "power analysis")
Here we have 104 results. We’ll just show the paper id and text columns for the first 10 rows of the returned table, but the table also provides the section type, header, and section, paragraph, and sentence numbers (div, p, and s).
We caught a lot of sentences with that term, but are probably missing a few. Let’s try a more general fixed search term: “power”.
text_2 <- search_text(papers, pattern = "power")
Here we have 772 results. Inspect the first 100 rows to see if there are any false positives.
Regex
After a quick skim through the 772 results, we can see that words like “powerful” or “PowerPoint” are never reporting a power analysis, so we should try to exclude them.
We can use regex to make our text search a bit more specific. The following pattern requires that “power” is preceded by a word boundary or the literal string “G*” (to catch “G*Power”), and followed optionally by “ed” and then a word boundary (like a space or full stop), so it will match “power”, “powered” and “G*Power”, but not “powerful” or “empower”.
pattern <- "(\\b|G\\*)power(ed)?\\b"
# test some examples to check the pattern
yes <- c("power",
"power.",
"Power",
"power analysis",
"powered",
"G*Power")
no <- c("powerful",
"powerful analysis",
"empower")
grepl(pattern, yes, ignore.case = TRUE)
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE
grepl(pattern, no, ignore.case = TRUE)
#> [1] FALSE FALSE FALSE
text_3 <- search_text(papers, pattern)
Here we have 653 results. Inspect them for false positives again.
Refining the search
You can repeat this process of skimming the results and refining the search term iteratively until you are happy that you have probably caught all of the relevant text and don’t have too many false positives.
Let’s also have a quick look at any papers that mention power more than 10 times, as they are probably talking about a different sense of power.
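One way to get these counts (a sketch using the text_3 table from above; dplyr is loaded):
# papers with more than 10 matching sentences, most frequent first
text_3 |>
  count(id, sort = TRUE) |>
  filter(n > 10)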
#> # A tibble: 6 × 2
#> id n
#> <chr> <int>
#> 1 0956797616647519 94
#> 2 09567976211001317 23
#> 3 09567976241254312 16
#> 4 09567976221147259 12
#> 5 09567976211068070 11
#> 6 09567976221094036 11
That first paper is a definite outlier, and indeed, the title is “Researchers’ Intuitions About Power in Psychological Research”. Excluding that one, what sentences are in the other papers?
outliers <- count(text_3, id, sort = TRUE) |>
  # papers with more than 10 but fewer than 90 matches (i.e., not the outlier paper)
  filter(n > 10, n < 90) |>
  # all matching sentences from those papers
  semi_join(text_3, y = _, by = "id")
It looks like a lot of this text is about alpha/beta/theta oscillations. We can pipe our results to another search_text() call to return only sentences that do not contain the strings “beta”, “theta” or “oscillat” (we won’t exclude “alpha” because specifying your critical alpha threshold is part of good reporting for a power analysis).
# exclude outlier paper from sample
to_exclude <- names(papers) == "0956797616647519"
papers <- papers[!to_exclude]
text_4 <- papers |>
  # search for power sentences
  search_text("(\\b|G\\*)power(ed)?\\b") |>
  # exclude oscillations sentences
  search_text("^(?!.*\\b(beta|theta|oscillat)).*", perl = TRUE)
One useful technique is to use dplyr::anti_join() to check which text was excluded when you make a search term more specific, to make sure there are no or few false negatives.
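For example, a sketch of that check on the tables above (the join columns are an assumption based on the columns search_text() returns; note that text_3 was built before the outlier paper was removed, so its sentences will also appear here):
# sentences matched by the broad search but dropped by the refined one
dropped <- anti_join(text_3, text_4, by = c("id", "div", "p", "s"))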
Screening
Once you are happy that your search term includes all of the relevant text and not too much irrelevant text (we’ve narrowed our candidate sentences down now to 0.8% of the full text!), the next step is to save this data frame so you can open it in a spreadsheet application and code each row for ground truth.
readr::write_csv(text_4, "power/power_screening.csv")
Be careful opening files in spreadsheet apps like Excel. Sometimes they will garble special characters like ü, which will make the validation process below inaccurate, since the expected values from your spreadsheet will not exactly match the calculated values from the modules you’re testing. One way to fix this, if it has happened, is to read the Excel file into R, replace the text column with the text column from the data frame above, and re-save it as a CSV file.
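A sketch of that repair (the readxl package, the .xlsx file name, and the assumption that the rows are still in the same order as text_4 are all illustrative assumptions):
# read the coded spreadsheet (file name assumed for illustration)
coded <- readxl::read_excel("power/power_screening_coded.xlsx")
# restore the un-garbled sentences, assuming row order is unchanged
coded$text <- text_4$text
readr::write_csv(coded, "power/power_screening_coded.csv")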
Validating a Module
Module Creation
To validate a module, you need to write your search term into a module. See the modules vignette for details. Creating a module for a text search is very straightforward. Just save the following text in a file called “power0.R”. You can omit the description in the roxygen section for now.
#' Power Analysis v0

power0 <- function(paper) {
  table <- paper |>
    # search for power sentences
    search_text("(\\b|G\\*)power(ed)?\\b") |>
    # exclude oscillations sentences
    search_text("^(?!.*\\b(beta|theta|oscillat)).*", perl = TRUE)

  summary_table <- dplyr::count(table, id, name = "n_power")

  list(
    table = table,
    summary = summary_table,
    na_replace = 0
  )
}
Now test your module by running it on the papers. The returned table should be identical to text_4.
mod_test <- module_run(papers, "power/power0.R")
all.equal(mod_test$table, text_4)
#> [1] "Attributes: < Component \"row.names\": Numeric: lengths (529, 522) differ >"
#> [2] "Component \"text\": Lengths (529, 522) differ (string compare on first 522)"
#> [3] "Component \"text\": 338 string mismatches"
#> [4] "Component \"section\": Lengths (529, 522) differ (string compare on first 522)"
#> [5] "Component \"section\": 172 string mismatches"
#> [6] "Component \"header\": Lengths (529, 522) differ (string compare on first 522)"
#> [7] "Component \"header\": 'is.NA' value mismatch: 2 in current 2 in target"
#> [8] "Component \"div\": Numeric: lengths (529, 522) differ"
#> [9] "Component \"p\": Numeric: lengths (529, 522) differ"
#> [10] "Component \"s\": Numeric: lengths (529, 522) differ"
#> [11] "Component \"id\": Lengths (529, 522) differ (string compare on first 522)"
#> [12] "Component \"id\": 290 string mismatches"
We also returned a summary table, which gives a single row per paper with the number of matching sentences from the main table.
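You can inspect it from the module output in the same way as the main table (assuming it is exposed as the summary element, mirroring mod_test$table above):
# one row per paper with its n_power count
head(mod_test$summary)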
Set Up Validation Files
Once you have the ground truth coded from your best inclusive search term, you can validate your module and start trying to improve its performance.
First, let’s use the over-inclusive search term. This will, by definition, have no false negatives, but further refinements to your module will start to produce both false positives and false negatives.
You have to set up two tables to match the module output. First, a table of the expected text matches. You can get this by filtering your ground truth table to just the rows that are true positives (hand-coded here as the column power_computation).
ground_truth <- read_csv("power/power_screening_coded.csv",
                         show_col_types = FALSE)

table_exp <- ground_truth |>
  filter(power_computation == 1) |>
  select(id, text)
Next, determine the expected summary table. Since not all papers are in the expected table above, you need to add them manually with a count of 0. The code below demonstrates one way to do that.
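A sketch of one way to do this (it assumes, as in the outlier-exclusion step above, that names(papers) gives the paper ids, and that dplyr is loaded):
summary_exp <- table_exp |>
  # count the true-positive sentences per paper
  count(id, name = "n_power") |>
  # add the papers that have no true positives
  right_join(tibble(id = names(papers)), by = "id") |>
  # papers with no true positives get a count of 0
  mutate(n_power = coalesce(n_power, 0L))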
Validate
Run a validation using the validate() function. Set the first argument to your sample of papers, the second to the path to your module, and the next arguments to the expected values of any items returned by your module (usually table and/or summary).
v0 <- validate(papers,
               module = "power/power0.R",
               table = table_exp,
               summary = summary_exp)
Printing the returned object will give you a summary of the validation.
v0
#> Validated matches for module `power/power0.R`:
#>
#> * N in validation sample: 249
#> * table:
#> * true_positive: 228
#> * false_positive: 303
#> * false_negative: 2
#> * summary:
#> * n_power: 0.56
You can access these values directly from the stats item of the list. See the validation vignette for further information about the contents of this list.
v0$stats |> str()
#> List of 3
#> $ n_papers: int 249
#> $ table :List of 3
#> ..$ true_positive : int 228
#> ..$ false_positive: int 303
#> ..$ false_negative: int 2
#> $ summary :List of 1
#> ..$ n_power: num 0.558
Refine and Iterate
Refine your module to improve it based on your coding of the ground truth. For example, perhaps we decide that almost all reports of real power analyses contain both the strings “power” and “analys”.
pattern <- "(analys.*power|power.*analys)"
yes <- c("power analysis",
"power analyses",
"power has an analysis",
"analyse power",
"analysis is powered at")
no <- c("powered",
"power",
"analysis")
grepl(pattern, yes)
#> [1] TRUE TRUE TRUE TRUE TRUE
grepl(pattern, no)
#> [1] FALSE FALSE FALSE
Duplicate the file “power0.R” as “power1.R”, change the search pattern to this new one, and re-run the validation.
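Following the power0.R template, “power1.R” might look like this (a sketch; it assumes the oscillation-exclusion step is kept unchanged, which is one reading of the instruction above):
#' Power Analysis v1

power1 <- function(paper) {
  table <- paper |>
    # require "power" and "analys" in the same sentence
    search_text("(analys.*power|power.*analys)") |>
    # exclude oscillations sentences
    search_text("^(?!.*\\b(beta|theta|oscillat)).*", perl = TRUE)

  summary_table <- dplyr::count(table, id, name = "n_power")

  list(
    table = table,
    summary = summary_table,
    na_replace = 0
  )
}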
v1 <- validate(paper = papers,
               module = "power/power1.R",
               table = table_exp,
               summary = summary_exp)
v1
#> Validated matches for module `power/power1.R`:
#>
#> * N in validation sample: 249
#> * table:
#> * true_positive: 91
#> * false_positive: 92
#> * false_negative: 137
#> * summary:
#> * n_power: 0.56
This version has the same overall accuracy by paper, but fewer false positives and more false negatives. False positives in the context of a module that informs scientists about a potential problem are not necessarily undesirable. It may be better to be over-sensitive and catch almost all problems, even if you also catch many non-problems. You will need to evaluate the validation results in the context of what you want your module to do.
Iterate again
Let’s try a two-step process for finding sentences with the word power that also contain decimal numbers or percentages.
table <- papers |>
  # sentences mentioning power
  search_text("(\\b|G\\*)power(ed)?\\b") |>
  # that also contain a decimal number or a percentage
  search_text("(\\.[0-9]|[0-9]%)")
v2 <- validate(paper = papers,
               module = "power/power2.R",
               table = table_exp,
               summary = summary_exp)
v2
#> Validated matches for module `power/power2.R`:
#>
#> * N in validation sample: 249
#> * table:
#> * true_positive: 171
#> * false_positive: 81
#> * false_negative: 57
#> * summary:
#> * n_power: 0.69
It’s definitely doing better than the last version. Can you refine it to do even better?