Installation

You can install the development version of papercheck from GitHub with:

# install.packages("devtools")
devtools::install_github("scienceverse/papercheck")

You can launch an interactive shiny app version of the code below with:

Load from PDF

The function pdf2grobid() can read PDF files and save them in the TEI format created by grobid. This requires an internet connection and takes a few seconds per paper, so it should only be done once, with the results saved for later use.

If the server is unavailable, you can use a grobid web interface.

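pdf_file <- demopdf()
# convert the PDF to grobid XML (requires an internet connection)
xml_file <- pdf2grobid(pdf_file)
# alternatively, skip the conversion and use the demo XML file
xml_file <- demoxml()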

Load from XML

The function read_grobid() can read XML files parsed by grobid.

paper <- read_grobid(xml_file)

Full Text

You can access a parsed table of the full text of the paper via paper$full_text, but you may find it more convenient to use the function search_text(). By default, it returns a data table with one row per sentence, along with the section type, header, div, paragraph and sentence numbers, and file name. (The section type is a best guess from the headers, so it may not always be accurate.)

text <- search_text(paper)

| text | section | header | div | p | s | id |
|------|---------|--------|-----|---|---|----|
| This paper demonstrates some good and poor practices for use with the {papercheck} R package and Shiny app. | abstract | Abstract | 0 | 1 | 1 | to_err_is_human.xml |
| Although intentional dishonestly might be a successful way to boost creativity (Gino & Wiltermuth, 2014), it is safe to say most mistakes researchers make are unintentional. | intro | Introduction | 1 | 1 | 1 | to_err_is_human.xml |
| In this study we examine whether automated checks reduce the amount of errors that researchers make in scientific manuscripts. | method | Method and Participants | 2 | 1 | 1 | to_err_is_human.xml |
| All data needed to reproduce these analyses is available from https://osf.io/5tbm9 and code is available from https://osf.io/629bx. | results | Results | 3 | 1 | 1 | to_err_is_human.xml |
| It seems automated tools can help prevent errors by providing researchers with feedback about potential mistakes, and researchers feel the app is useful. | discussion | Discussion | 4 | 1 | 1 | to_err_is_human.xml |

Pattern

You can search for a specific word or phrase by setting the pattern argument. The pattern is a regex string by default; set fixed = TRUE if you want to find exact text matches.

text <- search_text(paper, pattern = "papercheck")

| text | section | header | div | p | s | id |
|------|---------|--------|-----|---|---|----|
| This paper demonstrates some good and poor practices for use with the {papercheck} R package and Shiny app. | abstract | Abstract | 0 | 1 | 1 | to_err_is_human.xml |
| In this study we examine the usefulness of Papercheck to improve best practices. | intro | Introduction | 1 | 1 | 4 | to_err_is_human.xml |
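
The difference between the default regex matching and fixed = TRUE can be sketched in base R. This uses grepl() directly rather than search_text(), and the two sentences are made up for illustration.

```r
# Made-up sentences for illustration (not from the demo paper)
sentences <- c(
  "The mean was 3.5 points.",
  "We recruited 305 participants."
)

# As a regex, "." matches any single character, so "3.5" also matches "305"
regex_hits <- grepl("3.5", sentences)

# With fixed = TRUE, "3.5" only matches the literal text "3.5"
fixed_hits <- grepl("3.5", sentences, fixed = TRUE)

print(regex_hits)  # TRUE TRUE
print(fixed_hits)  # TRUE FALSE
```

If a search returns more rows than expected, an unescaped special character such as "." in the pattern is a likely cause.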

Section

Set section to a vector of the sections to search in.

text <- search_text(paper, "papercheck",
                    section = "abstract")

| text | section | header | div | p | s | id |
|------|---------|--------|-----|---|---|----|
| This paper demonstrates some good and poor practices for use with the {papercheck} R package and Shiny app. | abstract | Abstract | 0 | 1 | 1 | to_err_is_human.xml |

Return

Set return to one of “sentence”, “paragraph”, “section”, or “match” to control what gets returned.

text <- search_text(paper, "papercheck",
                    section = "intro",
                    return = "paragraph")

| text | section | header | div | p | s | id |
|------|---------|--------|-----|---|---|----|
| Although intentional dishonestly might be a successful way to boost creativity (Gino & Wiltermuth, 2014), it is safe to say most mistakes researchers make are unintentional. From a human factors perspective, human error is a symptom of a poor design (Smithy, 2020). Automation can be use to check for errors in scientific manuscripts, and inform authors about possible corrections. In this study we examine the usefulness of Papercheck to improve best practices. | intro | Introduction | 1 | 1 | NA | to_err_is_human.xml |

Regex matches

You can also return just the matched text from a regex search by setting return = "match". The extra ... arguments in search_text() are passed to grep(), so perl = TRUE allows you to use more complex regex, like below.

pattern <- "[a-zA-Z]\\S*\\s*(=|<)\\s*[0-9\\.,-]*\\d"
text <- search_text(paper, pattern, return = "match", perl = TRUE)

| text | section | header | div | p | s | id |
|------|---------|--------|-----|---|---|----|
| M = 9.12 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
| M = 10.9 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
| t(97.7) = 2.9 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
| p = 0.005 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
| M = 5.06 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
| M = 4.5 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
| t(97.2) = -1.96 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
| p = 0.152 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
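
To experiment with a pattern before running it on a paper, the same kind of extraction can be sketched in base R. This uses regmatches() and gregexpr() rather than search_text(), so it only approximates what return = "match" does, and the sentence is made up for illustration.

```r
# Same pattern as above: a letter-led label, "=" or "<", then a number
stat_pattern <- "[a-zA-Z]\\S*\\s*(=|<)\\s*[0-9\\.,-]*\\d"

# Made-up sentence for illustration (not from the demo paper)
sentence <- "The effect was significant, t(97.7) = 2.9, p = 0.005."

# gregexpr() locates every match; regmatches() extracts the matched text
matches <- regmatches(sentence, gregexpr(stat_pattern, sentence, perl = TRUE))[[1]]

print(matches)  # "t(97.7) = 2.9" "p = 0.005"
```

The final \\d in the pattern forces each match to end on a digit, which is why the trailing comma and full stop are not included.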

ChatGPT

You can query the extracted text of papers with ChatGPT.

Setup

Run the setup function:

gpt_setup()

You will need to provide your own API key. You can enter it manually for each call to gpt(), or add the following lines to the .Renviron file for your user or project.

CHATGPT_KEY="sk-proj-abcdefghijklmnopqrs0123456789ABCDEFGHIJKLMNOPQRS"
RETICULATE_PYTHON="~/.virtualenvs/r-reticulate/bin/python"

# useful if you aren't sure where this file is
usethis::edit_r_environ()
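
R reads .Renviron at startup, so the key is only visible after restarting the session. A minimal base R sketch for confirming that it was picked up; has_env_key() is a hypothetical helper written for this example, not part of papercheck.

```r
# Hypothetical helper (not a papercheck function):
# returns TRUE if the named environment variable is set to a non-empty value
has_env_key <- function(name = "CHATGPT_KEY") {
  nzchar(Sys.getenv(name))
}

# Check the variable name used in the .Renviron lines above
print(has_env_key())
```

Checking with nzchar() avoids printing the key itself to the console.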

GPT Queries

You can ask ChatGPT to process text. Use search_text() first to narrow the text down to what you want to query. Below, we limit the search to the introduction sections of the first two papers and return each full section. Then we ask ChatGPT “What is the hypothesis of this study?”. Here, papers is a list of paper objects, such as the one created in the Batch Processing section.

hypotheses <- search_text(papers[1:2], 
                          section = "intro", 
                          return = "section")
query <- "What is the hypothesis of this study?"
gpt_hypo <- gpt(hypotheses, query)

| id | answer | cost |
|----|--------|------|
| eyecolor.xml | The hypothesis of this study is to test the matching hypothesis, sex-linked heritable preference hypothesis, and positive sexual imprinting hypothesis in relation to eye color and partner selection. | 0.000612 |
| incest.xml | The hypothesis of this study is that humans possess adaptations to reduce inbreeding, and that the strong moral opposition to incest plays an important role in preventing inbreeding. | 0.000216 |

Batch Processing

The functions pdf2grobid() and read_grobid() also work on a folder of files, returning a list of XML file paths or paper objects, respectively. The functions search_text() and gpt() also work on a list of paper objects.

grobid_dir <- demodir()

papers <- read_grobid(grobid_dir)

hypotheses <- search_text(papers, "hypothesi", 
                          section = "intro", 
                          return = "paragraph")

Modules

Papercheck is designed modularly, so you can add modules to check for anything. It comes with a set of pre-defined modules, and we hope people will share more modules.

Module List

You can see the list of built-in modules with the function below.

module_list()
#>  * ai-summarise: Generate a 1-sentence summary for each section
#>  * all-p-values: List all p-values in the text, returning the matched text (e.g., 'p = 0.04') and document location in a table.
#>  * all-urls: List all the URLs in the main text
#>  * imprecise-p: List any p-values reported with insufficient precision (e.g., p < .05 or p = n.s.)
#>  * marginal: List all sentences that describe an effect as 'marginally significant'.
#>  * osf-check: List all OSF links and whether they are open, closed, or do not exist.
#>  * ref-consistency: Check if all references are cited and all citations are referenced
#>  * retractionwatch: Flag any cited papers in the RetractionWatch database
#>  * statcheck: Check consistency of p-values and test statistics

Running modules

To run a built-in module on a paper, you can reference it by name.

p <- module_run(paper, "all-p-values")

| text | section | header | div | p | s | id |
|------|---------|--------|-----|---|---|----|
| p = 0.005 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
| p = 0.152 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
| p > .05 | results | Results | 3 | 2 | 2 | to_err_is_human.xml |

Creating modules

You can create your own modules by specifying the arguments to search_text() or gpt() in JSON format and/or including R code. Modules can also contain instructions for reporting, to give “traffic lights” for whether a check passed or failed, and to include appropriate text feedback in a report. See the modules vignette for more details.

Below is an abbreviated example of the module that detects all p-values in the text and returns the matching text.

{
  "title": "List All P-Values",
  "description": "List all p-values in the text, returning the matched text (e.g., 'p = 0.04') and document location in a table.",
  "text": {
    "pattern": "(?<=[^a-z])p-?(value)?\\s*[<>=≤≥]{1,2}\\s*(n\\.?s\\.?|\\d?\\.\\d+e?-?\\d*)",
    "return": "match",
    "perl": true
  }
}
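
Before putting a pattern into a module, it can help to try it on a few strings in base R. The sketch below copies the pattern from the JSON above (with ≤ and ≥ written as \u escapes) and tests it with grepl() rather than through module_run(). The candidate strings are made up, and how search_text() splits and passes text to the pattern may differ from this simplified check.

```r
# Pattern copied from the module above; \u2264 and \u2265 are the
# "less than or equal" and "greater than or equal" signs
p_pattern <- "(?<=[^a-z])p-?(value)?\\s*[<>=\u2264\u2265]{1,2}\\s*(n\\.?s\\.?|\\d?\\.\\d+e?-?\\d*)"

# Made-up candidate strings for illustration
candidates <- c(
  "The effect was reliable, p = 0.04.",
  "No difference (p > .05).",
  "The step = 0.5 setting was used.",
  "p = 0.04 at the very start of a string"
)

# perl = TRUE is needed for the lookbehind (?<=...)
hits <- grepl(p_pattern, candidates, perl = TRUE)

print(hits)  # TRUE TRUE FALSE FALSE
```

The lookbehind stops the "p" at the end of "step" from being read as a p-value. It also requires some character before the "p", so in this isolated test a p-value at the very start of a string is not matched.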

Reports

You can generate a report from any set of modules. The default set is c("imprecise-p", "marginal", "osf-check", "retractionwatch", "ref-consistency").

report(paper, output_format = "qmd")

See the example report.