Skip to contents

This is a demo of the workflow for module validation. We are still piloting this workflow and it is likely to change.

Validation sample

Set up the papers in your validation sample. You will need a directory of XML files created by pdf2grobid(). In this example, we’ll set everything up in a temporary directory.

# create validation directory in temp dir
valdir <- tempdir() |> file.path("validate")
dir.create(valdir, showWarnings = FALSE)

# copy built-in XML files to xml directory
xmldir <- file.path(valdir, "xml")
dir.create(xmldir, showWarnings = FALSE)
xmls <- list.files(demodir(), "\\.xml$", full.names = TRUE)
file.copy(xmls, xmldir)
#> [1] TRUE TRUE TRUE

Sample data

Create a data frame with info about each paper. One column must be called “id” and contain paths to the xml files (relative to the validation file location). Other possible columns are “table”, “report”, and “traffic_light”, which should contain the expected values of those items from the module you’re testing. If you want to check more than just the first row of the text column from the return table, omit the table column and use the method in the next section.

sample <- data.frame(
  id =  file.path("xml", list.files(xmldir)),
  table = c("faceresearch.org", "https://osf.io/mwzuq", "https://osf.io/pwtrh"),
  report = rep("", 3), # this module has no report
  traffic_light = c("info", "info", "red")
)

The code above has one inaccurate traffic light (“red”) for demonstration purposes.

Expected Table

If the tables returned by the module you’re validating can have more than one row, or you want to check columns other than “text”, you will need to add the expected values to a separate data frame. One column must be “xml” to join it to the other table. The other columns should have the same names as the columns returned by the module. You can omit any columns and they will not be checked in the validation. Here, we will only validate the text and header columns (making one mistake in the header column for demonstration purposes).

expected <- data.frame(
  id = rep(sample$id, c(2, 3, 2)),
  text = c("faceresearch.org", "stumbleupon.com",
           rep("https://osf.io/mwzuq", 3),
           rep("https://osf.io/pwtrh", 2)),
  header = c("Participants", "Participants", 
             "Methods", "Procedure", "Analysis",
             "Intro", "Attitude")
)

expected
#>                 id                 text       header
#> 1 xml/eyecolor.xml     faceresearch.org Participants
#> 2 xml/eyecolor.xml      stumbleupon.com Participants
#> 3   xml/incest.xml https://osf.io/mwzuq      Methods
#> 4   xml/incest.xml https://osf.io/mwzuq    Procedure
#> 5   xml/incest.xml https://osf.io/mwzuq     Analysis
#> 6   xml/prereg.xml https://osf.io/pwtrh        Intro
#> 7   xml/prereg.xml https://osf.io/pwtrh     Attitude

Run Validation

If you don’t include the expected results table, the table check will just check the first result in the text column of the module results and match it to the table column of your sample.

v <- validate("all-urls", sample, path = valdir)

v

If you include the expected results table, it will assess all the data for matching the module results table.

v <- validate("all-urls", sample, expected, path = valdir)

v

We can further explore any problems by looking at the sample and returned tables.

# show rows where the traffic light check is false
v$sample |>
  dplyr::filter(!tl_check)
#>               id                table report traffic_light report_ver tl_ver
#> 1 xml/prereg.xml https://osf.io/pwtrh                  red              info
#>   misses false_alarms table_check report_check tl_check
#> 1      1            1       FALSE         TRUE    FALSE

The table check is false, and there is one missing and one extra result.

# show rows where the table check is false
v$sample |>
  dplyr::filter(!table_check)
#>               id                table report traffic_light report_ver tl_ver
#> 1 xml/prereg.xml https://osf.io/pwtrh                  red              info
#>   misses false_alarms table_check report_check tl_check
#> 1      1            1       FALSE         TRUE    FALSE

You can view the validated results table for that paper…

v$table |>
  dplyr::filter(id == "xml/prereg.xml")
#>               id                 text
#> 1 xml/prereg.xml https://osf.io/pwtrh
#> 2 xml/prereg.xml https://osf.io/pwtrh
#>                                                                    header
#> 1 Are One-tailed Tests and Sequential Analyses Appropriate in Psychology?
#> 2                                                                Attitude

…and compare it with the expected results.

expected |>
  dplyr::filter(id == "xml/prereg.xml")
#>               id                 text   header
#> 1 xml/prereg.xml https://osf.io/pwtrh    Intro
#> 2 xml/prereg.xml https://osf.io/pwtrh Attitude