This is a demo of the workflow for module validation. We are still piloting this workflow and it is likely to change.
Validation sample
Set up the papers in your validation sample. You will need a directory of XML files created by pdf2grobid(). In this example, we’ll set everything up in a temporary directory.
# create validation directory in temp dir
valdir <- tempdir() |> file.path("validate")
dir.create(valdir, showWarnings = FALSE)
# copy built-in XML files to xml directory
xmldir <- file.path(valdir, "xml")
dir.create(xmldir, showWarnings = FALSE)
xmls <- list.files(demodir(), "\\.xml$", full.names = TRUE)
file.copy(xmls, xmldir)
#> [1] TRUE TRUE TRUE
Sample data
Create a data frame with info about each paper. One column must be called “id” and contain paths to the xml files (relative to the validation file location). Other possible columns are “table”, “report”, and “traffic_light”, which should contain the expected values of those items from the module you’re testing. If you want to check more than just the first row of the text column from the return table, omit the table column and use the method in the next section.
sample <- data.frame(
  id = file.path("xml", list.files(xmldir)),
  table = c("faceresearch.org", "https://osf.io/mwzuq", "https://osf.io/pwtrh"),
  report = rep("", 3), # this module has no report
  traffic_light = c("info", "info", "red")
)
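Because the id column holds paths relative to the validation directory, it can help to confirm that every path resolves before running a validation. The following is a minimal base R sketch of such a check; the directory name, the dummy file names, and the found column are made up for illustration and are not part of the package.

```r
# Hypothetical stand-in for the validation directory and its xml folder
valdir <- file.path(tempdir(), "validate_sketch")
xmldir <- file.path(valdir, "xml")
dir.create(xmldir, recursive = TRUE, showWarnings = FALSE)

# create two empty dummy files; the third id below deliberately has no file
file.create(file.path(xmldir, c("a.xml", "b.xml")))

sample <- data.frame(
  id = file.path("xml", c("a.xml", "b.xml", "missing.xml"))
)

# resolve each id against the validation directory
sample$found <- file.exists(file.path(valdir, sample$id))

# rows whose path does not resolve
sample[!sample$found, ]
```

Any row listed by the last line points to a file that would not be found relative to the validation directory.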
The code above has one inaccurate traffic light (“red”) for demonstration purposes.
Expected Table
If the tables returned by the module you’re validating can have more than one row, or you want to check columns other than “text”, you will need to add the expected values to a separate data frame. One column must be “id”, which is used to join it to the sample table. The other columns should have the same names as the columns returned by the module. Any columns you omit will not be checked in the validation. Here, we will only validate the text and header columns (making one mistake in the header column for demonstration purposes).
expected <- data.frame(
  id = rep(sample$id, c(2, 3, 2)),
  text = c("faceresearch.org", "stumbleupon.com",
           rep("https://osf.io/mwzuq", 3),
           rep("https://osf.io/pwtrh", 2)),
  header = c("Participants", "Participants",
             "Methods", "Procedure", "Analysis",
             "Intro", "Attitude")
)
expected
#>                 id                 text       header
#> 1 xml/eyecolor.xml     faceresearch.org Participants
#> 2 xml/eyecolor.xml      stumbleupon.com Participants
#> 3   xml/incest.xml https://osf.io/mwzuq      Methods
#> 4   xml/incest.xml https://osf.io/mwzuq    Procedure
#> 5   xml/incest.xml https://osf.io/mwzuq     Analysis
#> 6   xml/prereg.xml https://osf.io/pwtrh        Intro
#> 7   xml/prereg.xml https://osf.io/pwtrh     Attitude
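Since the expected table is joined to the sample by id, a quick consistency check between the two can catch typos in the paths. The sketch below uses base R only and rebuilds the ids from the example above; the variable names orphans, uncovered, and rows_per_paper are illustrative.

```r
# ids as they appear in the sample above
sample_ids <- c("xml/eyecolor.xml", "xml/incest.xml", "xml/prereg.xml")

# only the id column is needed for this check
expected <- data.frame(
  id = rep(sample_ids, c(2, 3, 2))
)

# ids in the expected table that do not occur in the sample
orphans <- setdiff(expected$id, sample_ids)

# papers in the sample that have no expected rows at all
uncovered <- setdiff(sample_ids, expected$id)

# number of expected rows per paper
rows_per_paper <- table(expected$id)
rows_per_paper
```

Both orphans and uncovered should be empty if the two tables refer to the same set of papers.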
Run Validation
If you don’t include the expected results table, the table check will just check the first result in the text column of the module results and match it to the table column of your sample.
v <- validate("all-urls", sample, path = valdir)
v
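As a rough illustration of what this default check amounts to, the base R sketch below takes the first text value per paper from a made-up results table and compares it with the table column of a made-up sample. The data and the logic are assumptions for illustration only, not the internals of validate().

```r
# made-up module results: paper "a" has two rows, paper "b" has one
module_table <- data.frame(
  id   = c("a", "a", "b"),
  text = c("x", "y", "z")
)

# made-up sample with the expected first text value per paper
sample <- data.frame(
  id    = c("a", "b"),
  table = c("x", "w")
)

# match() returns the position of the first matching id,
# so this picks the first text value for each paper
first_text <- module_table$text[match(sample$id, module_table$id)]

# TRUE where the first returned text equals the expected value
sample$table_check <- first_text == sample$table
sample
```

In this toy data, paper "a" passes because only its first row is compared, and paper "b" fails.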
If you include the expected results table, every row of the module results table is compared against it, for all columns present in the expected table.
v <- validate("all-urls", sample, expected, path = valdir)
v
We can further explore any problems by looking at the sample and returned tables.
# show rows where the traffic light check is false
v$sample |>
  dplyr::filter(!tl_check)
#> id table report traffic_light report_ver tl_ver
#> 1 xml/prereg.xml https://osf.io/pwtrh red info
#> misses false_alarms table_check report_check tl_check
#> 1 1 1 FALSE TRUE FALSE
For the same paper, the table check is also false: one expected row is missing from the module results (a miss) and one returned row was not expected (a false alarm).
# show rows where the table check is false
v$sample |>
  dplyr::filter(!table_check)
#> id table report traffic_light report_ver tl_ver
#> 1 xml/prereg.xml https://osf.io/pwtrh red info
#> misses false_alarms table_check report_check tl_check
#> 1 1 1 FALSE TRUE FALSE
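With a larger sample, it may be more convenient to summarise the check columns than to filter them one at a time. The sketch below computes the proportion of papers passing each check from a hand-built data frame that mirrors the check columns shown above. The table_check and tl_check values follow from the filtered output; the report_check values for the first two papers are assumed.

```r
# hand-built stand-in for the check columns of v$sample
checks <- data.frame(
  id = c("xml/eyecolor.xml", "xml/incest.xml", "xml/prereg.xml"),
  table_check  = c(TRUE, TRUE, FALSE),
  report_check = c(TRUE, TRUE, TRUE),  # first two values are assumed
  tl_check     = c(TRUE, TRUE, FALSE)
)

check_cols <- c("table_check", "report_check", "tl_check")

# proportion of papers that pass each check
pass_rate <- vapply(checks[check_cols], mean, numeric(1))
pass_rate

# papers that fail at least one check
failed <- checks$id[!Reduce(`&`, checks[check_cols])]
failed
```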
You can view the validated results table for that paper…
v$table |>
  dplyr::filter(id == "xml/prereg.xml")
#> id text
#> 1 xml/prereg.xml https://osf.io/pwtrh
#> 2 xml/prereg.xml https://osf.io/pwtrh
#> header
#> 1 Are One-tailed Tests and Sequential Analyses Appropriate in Psychology?
#> 2 Attitude
…and compare it with the expected results.
expected |>
  dplyr::filter(id == "xml/prereg.xml")
#> id text header
#> 1 xml/prereg.xml https://osf.io/pwtrh Intro
#> 2 xml/prereg.xml https://osf.io/pwtrh Attitude
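Comparing the two tables by eye works for a couple of rows, but the same comparison can be done programmatically. The following base R sketch rebuilds the two tables shown above and lists the rows that appear in only one of them. The row_key() helper is hypothetical, and this is one plausible way to arrive at the one miss and one false alarm reported for this paper, not necessarily how validate() computes them.

```r
# rows returned by the module for this paper (copied from the output above)
returned <- data.frame(
  id = "xml/prereg.xml",
  text = "https://osf.io/pwtrh",
  header = c(
    "Are One-tailed Tests and Sequential Analyses Appropriate in Psychology?",
    "Attitude"
  )
)

# rows we expected for this paper
expected <- data.frame(
  id = "xml/prereg.xml",
  text = "https://osf.io/pwtrh",
  header = c("Intro", "Attitude")
)

# hypothetical helper: collapse the compared columns into one key per row
row_key <- function(df, cols) {
  do.call(paste, c(unname(as.list(df[cols])), sep = "\t"))
}

cols <- c("id", "text", "header")

# expected rows that the module did not return
misses <- expected[!(row_key(expected, cols) %in% row_key(returned, cols)), ]

# returned rows that were not expected
false_alarms <- returned[!(row_key(returned, cols) %in% row_key(expected, cols)), ]

misses$header
false_alarms$header
```

Here the mismatch comes from the header column: the expected "Intro" does not match the header the module actually returned.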