Installation

You can install the development version of papercheck from GitHub with:

# install.packages("devtools")
devtools::install_github("scienceverse/papercheck")

You can launch an interactive shiny app version of the code below with:

Load from PDF

The function pdf2grobid() reads PDF files and saves them in the TEI format created by grobid. This requires an internet connection and takes a few seconds per paper, so it should only be done once per paper, with the results saved for later use.

If the server is unavailable, you can use a grobid web interface.

pdf_file <- demopdf()
xml_file <- pdf2grobid(pdf_file)

You can set up your own local grobid server following instructions from https://grobid.readthedocs.io/. The easiest way is to use Docker.

docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.1

Then you can set grobid_url to the local address http://localhost:8070.

xml_file <- pdf2grobid(pdf_file, grobid_url = "http://localhost:8070")

Load from XML

The function read_grobid() can read XML files parsed by grobid.

paper <- read_grobid(xml_file)

Load from non-PDF document

For now, the best way to take advantage of grobid’s ability to parse references and other aspects of papers is to convert your papers to PDF. However, papercheck can also read plain text from a character object, or from a text or docx file, with read_text().

text <- "Abstract

This is my very short paper. It has two sentences."
shortpaper <- read_text(text, id = "shortpaper")
shortpaper$full_text
#> # A tibble: 3 × 7
#>   text                         section  header     div     p     s id        
#>   <chr>                        <chr>    <chr>    <int> <dbl> <int> <chr>     
#> 1 Abstract                     abstract Abstract     1     0     1 shortpaper
#> 2 This is my very short paper. abstract Abstract     1     1     1 shortpaper
#> 3 It has two sentences.        abstract Abstract     1     1     2 shortpaper
filename <- system.file("extdata/to_err_is_human.docx", 
                        package = "papercheck")
paper_from_doc <- read_text(filename)

Search Text

You can access a parsed table of the full text of the paper via paper$full_text, but you may find it more convenient to use the function search_text(). The defaults return a data table of each sentence, with the section type, header, div, paragraph and sentence numbers, and file name. (The section type is a best guess from the headers, so may not always be accurate.)

text <- search_text(paper)
| text | section | header | div | p | s | id |
|---|---|---|---|---|---|---|
| This paper demonstrates some good and poor practices for use with the {papercheck} R package and Shiny app. | abstract | Abstract | 0 | 1 | 1 | to_err_is_human.xml |
| Although intentional dishonestly might be a successful way to boost creativity (Gino & Wiltermuth, 2014), it is safe to say most mistakes researchers make are unintentional. | intro | Introduction | 1 | 1 | 1 | to_err_is_human.xml |
| In this study we examine whether automated checks reduce the amount of errors that researchers make in scientific manuscripts. | method | Method and Participants | 2 | 1 | 1 | to_err_is_human.xml |
| All data needed to reproduce the analyses in Table 1 is available from https://osf.io/5tbm9 and code is available from the OSF. | results | Results | 3 | 1 | 1 | to_err_is_human.xml |
| It seems automated tools can help prevent errors by providing researchers with feedback about potential mistakes, and researchers feel the app is useful. | discussion | Discussion | 4 | 1 | 1 | to_err_is_human.xml |

Pattern

You can search for a specific word or phrase by setting the pattern argument. The pattern is a regex string by default; set fixed = TRUE if you want to find exact text matches.

text <- search_text(paper, pattern = "papercheck")
| text | section | header | div | p | s | id |
|---|---|---|---|---|---|---|
| This paper demonstrates some good and poor practices for use with the {papercheck} R package and Shiny app. | abstract | Abstract | 0 | 1 | 1 | to_err_is_human.xml |
| In this study we examine the usefulness of Papercheck to improve best practices. | intro | Introduction | 1 | 1 | 4 | to_err_is_human.xml |
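
The difference between a regex pattern and a fixed pattern can be seen with base R alone. The sketch below uses grepl() on two made-up strings (they are not from the demo paper) and assumes that search_text() interprets pattern and fixed in the same way as the base R matching functions.

```r
# Illustrative strings, not taken from the demo paper
x <- c("p = .03", "p = 903")

# As a regex, "." matches any single character, so both strings match
regex_hits <- grepl("p = .03", x)

# With fixed = TRUE, "." must be a literal full stop, so only the first matches
fixed_hits <- grepl("p = .03", x, fixed = TRUE)

print(data.frame(x, regex_hits, fixed_hits))
```

If you want to search for text that contains regex special characters such as ".", "(" or "?", fixed = TRUE saves you from having to escape them.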

Section

Set section to a vector of the sections to search in.

text <- search_text(paper, "papercheck", 
                    section = "abstract")
| text | section | header | div | p | s | id |
|---|---|---|---|---|---|---|
| This paper demonstrates some good and poor practices for use with the {papercheck} R package and Shiny app. | abstract | Abstract | 0 | 1 | 1 | to_err_is_human.xml |
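
Conceptually, combining a pattern with a section is a row filter on the full text table. The base R sketch below shows the idea on a small hand-made data frame; the example sentences are invented, and the case-insensitive match is an assumption based on the output above, where the pattern "papercheck" also matched "Papercheck". It is not the package's actual implementation.

```r
# A hand-made stand-in for paper$full_text (illustrative text only)
full_text <- data.frame(
  text = c("This paper demonstrates the {papercheck} R package.",
           "We examine the usefulness of Papercheck.",
           "All data are available from the OSF."),
  section = c("abstract", "intro", "results"),
  stringsAsFactors = FALSE
)

# Rows whose text matches the pattern (assumed case-insensitive)
matches_pattern <- grepl("papercheck", full_text$text, ignore.case = TRUE)

# Rows in any of the requested sections
in_section <- full_text$section %in% c("abstract")

hits <- full_text[matches_pattern & in_section, ]
print(hits)
```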

Return

Set return to one of “sentence”, “paragraph”, “section”, or “match” to control what gets returned.

text <- search_text(paper, "papercheck", 
                    section = "intro", 
                    return = "paragraph")
| text | section | header | div | p | s | id |
|---|---|---|---|---|---|---|
| Although intentional dishonestly might be a successful way to boost creativity (Gino & Wiltermuth, 2014), it is safe to say most mistakes researchers make are unintentional. From a human factors perspective, human error is a symptom of a poor design (Smithy, 2020). Automation can be use to check for errors in scientific manuscripts, and inform authors about possible corrections. In this study we examine the usefulness of Papercheck to improve best practices. | intro | Introduction | 1 | 1 | NA | to_err_is_human.xml |
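
Note that the sentence number s is NA in the paragraph output, because several sentences have been joined into one row. A rough base R sketch of this kind of aggregation is shown below. It uses invented sentences and aggregate(), and only illustrates the idea; search_text() may do this differently internally.

```r
# Invented sentence-level table with div, paragraph and sentence numbers
sentences <- data.frame(
  text = c("First sentence.", "Second sentence.", "New paragraph."),
  div = c(1L, 1L, 1L),
  p = c(1L, 1L, 2L),
  s = c(1L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Paste together all sentences that share the same div and paragraph number
paragraphs <- aggregate(text ~ div + p, data = sentences,
                        FUN = paste, collapse = " ")

# A sentence number no longer applies to a whole paragraph
paragraphs$s <- NA_integer_
print(paragraphs)
```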

Regex matches

You can also return just the matched text from a regex search by setting return = "match". The extra ... arguments in search_text() are passed to grep(), so perl = TRUE allows you to use more complex regex, like below.

pattern <- "[a-zA-Z]\\S*\\s*(=|<)\\s*[0-9\\.,-]*\\d"
text <- search_text(paper, pattern, return = "match", perl = TRUE)
| text | section | header | div | p | s | id |
|---|---|---|---|---|---|---|
| M = 9.12 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
| M = 10.9 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
| t(97.7) = 2.9 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
| p = 0.005 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
| M = 5.06 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
| M = 4.5 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
| t(97.2) = -1.96 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
| p = 0.152 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
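
To see what the pattern above picks out of a single sentence, you can try it with base R's gregexpr() and regmatches(). This is only a way to experiment with a regex before passing it to search_text(); the sentence is invented (it reuses values from the table above) and the approach is not necessarily how search_text() extracts matches.

```r
pattern <- "[a-zA-Z]\\S*\\s*(=|<)\\s*[0-9\\.,-]*\\d"

# Invented sentence for illustration
sentence <- paste("The treatment group (M = 9.12) differed from",
                  "control (M = 10.9), t(97.7) = 2.9, p = 0.005.")

# gregexpr() finds every match; regmatches() extracts the matched text
stat_matches <- regmatches(sentence,
                           gregexpr(pattern, sentence, perl = TRUE))[[1]]
print(stat_matches)
#> [1] "M = 9.12"      "M = 10.9"      "t(97.7) = 2.9" "p = 0.005"
```

The final \\d in the pattern forces each match to end on a digit, which is why trailing commas and full stops are not included.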

Expand Text

You can expand the text returned by search_text() or a module with expand_text().

marginal <- search_text(paper, "marginal") |>
  expand_text(paper, plus = 1, minus = 1)

marginal[, c("text", "expanded")]
#> # A tibble: 2 × 2
#>   text                                                                  expanded
#>   <chr>                                                                 <chr>   
#> 1 "The paper shows examples of (1) open and closed OSF links; (2) cita… "All da…
#> 2 "On average researchers in the experimental condition found the app … "On ave…

Large Language Models

You can query the extracted text of papers with LLMs using groq.

Setup

You will need to get your own API key from https://console.groq.com/keys. To avoid having to type it out, add it to the .Renviron file in the following format (you can use usethis::edit_r_environ() to access the .Renviron file).

GROQ_GPT_KEY="sk-proj-abcdefghijklmnopqrs0123456789ABCDEFGHIJKLMNOPQRS"
# useful if you aren't sure where this file is
usethis::edit_r_environ()
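
After editing .Renviron you need to restart R for the change to take effect. The snippet below is a small sanity check that the key is visible to your R session. The helper env_key_is_set() is a hypothetical name introduced here for illustration; it is not part of papercheck.

```r
# Hypothetical helper (not part of papercheck): TRUE if the named
# environment variable is set to a non-empty value
env_key_is_set <- function(name = "GROQ_GPT_KEY") {
  nzchar(Sys.getenv(name, unset = ""))
}

if (!env_key_is_set()) {
  message("GROQ_GPT_KEY is not set; add it to .Renviron and restart R.")
}
```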

LLM Queries

You can ask an LLM to process text. Use search_text() first to narrow the text down to what you want to query. Below, we take the introduction sections of the first two papers in papers (a list of paper objects, created as shown under Batch Processing), return each as a full section, and then ask an LLM “What is the hypothesis of this study?”.

hypotheses <- search_text(papers[1:2], 
                          section = "intro", 
                          return = "section")
query <- "What is the hypothesis of this study? Answer as briefly as possible."
llm_hypo <- llm(hypotheses, query)
| id | answer |
|---|---|
| eyecolor.xml | The hypothesis of this study is that humans exhibit positive sexual imprinting, where individuals choose partners with physical characteristics similar to those of their opposite-sex parent. |
| incest.xml | The hypothesis is that moral opposition to third-party sibling incest is greater among individuals with other-sex siblings than among individuals with same-sex siblings. |

Batch Processing

The functions pdf2grobid() and read_grobid() also work on a folder of files, returning a list of XML file paths or paper objects, respectively. The functions search_text(), expand_text() and llm() also work on a list of paper objects.

grobid_dir <- demodir()

papers <- read_grobid(grobid_dir)

hypotheses <- search_text(papers, "hypothesi", 
                          section = "intro", 
                          return = "paragraph")

Modules

Papercheck is designed modularly, so you can add modules to check for anything. It comes with a set of pre-defined modules, and we hope people will share more modules.

Module List

You can see the list of built-in modules with the function below.

#>  * all-p-values: List all p-values in the text, returning the matched text (e.g., 'p = 0.04') and document location in a table.
#>  * all-urls: List all the URLs in the main text
#>  * imprecise-p: List any p-values reported with insufficient precision (e.g., p < .05 or p = n.s.)
#>  * llm-summarise: Generate a 1-sentence summary for each section
#>  * marginal: List all sentences that describe an effect as 'marginally significant'.
#>  * osf-check: List all OSF links and whether they are open, closed, or do not exist.
#>  * ref-consistency: Check if all references are cited and all citations are referenced
#>  * retractionwatch: Flag any cited papers in the RetractionWatch database
#>  * statcheck: Check consistency of p-values and test statistics

Running modules

To run a built-in module on a paper, you can reference it by name.

p <- module_run(paper, "all-p-values")
| text | section | header | div | p | s | id |
|---|---|---|---|---|---|---|
| p = 0.005 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
| p = 0.152 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
| p > .05 | results | Results | 3 | 2 | 2 | to_err_is_human.xml |

Creating modules

You can create your own modules by specifying the arguments to search_text() or llm() in JSON format and/or including R code. Modules can also contain instructions for reporting, to give “traffic lights” for whether a check passed or failed, and to include appropriate text feedback in a report. See the modules vignette for more details.

Below is an abbreviated example of the module that detects all p-values in the text and returns the matching text.

{
  "title": "List All P-Values",
  "description": "List all p-values in the text, returning the matched text (e.g., 'p = 0.04') and document location in a table.",
  "text": {
    "pattern": "(?<=[^a-z])p-?(value)?\\s*[<>=≤≥]{1,2}\\s*(n\\.?s\\.?|\\d?\\.\\d+e?-?\\d*)",
    "return": "match",
    "perl": true
  }
}
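
The module pattern starts with a lookbehind, (?<=[^a-z]), which is why it needs perl = TRUE: it only accepts a "p" that is not preceded by a lowercase letter. The base R sketch below shows the effect on an invented sentence. It is a simplified copy of the pattern (the ≤ and ≥ symbols are left out to keep the example ASCII-only), and extract_matches() is a hypothetical helper, not a papercheck function.

```r
# Simplified, ASCII-only version of the module's pattern
core <- "p-?(value)?\\s*[<>=]{1,2}\\s*(n\\.?s\\.?|\\d?\\.\\d+e?-?\\d*)"
with_lookbehind <- paste0("(?<=[^a-z])", core)

# Hypothetical helper: return all matches of a pattern in one string
extract_matches <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# Invented sentence for illustration
sentence <- paste("Group overlap = .41; the difference was significant,",
                  "p = .03, but not in Study 2 (p > .05).")

# Without the lookbehind, the "p" at the end of "overlap" is picked up too
without_lb <- extract_matches(core, sentence)
with_lb <- extract_matches(with_lookbehind, sentence)

print(without_lb)
#> [1] "p = .41" "p = .03" "p > .05"
print(with_lb)
#> [1] "p = .03" "p > .05"
```

Because the lookbehind requires a preceding character, a "p" at the very start of a text would not be matched by this pattern.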

Reports

You can generate a report from any set of modules. The default set is c("imprecise-p", "marginal", "osf-check", "retractionwatch", "ref-consistency").

report(paper, output_format = "qmd")

See the example report.