## Installation
You can install the development version of papercheck from GitHub with:

```r
# install.packages("devtools")
devtools::install_github("scienceverse/papercheck")
```
You can also launch an interactive Shiny app version of the code below.
## Load from PDF
The function `pdf2grobid()` can read PDF files and save them in the TEI format created by grobid. This requires an internet connection and takes a few seconds per paper, so should only be done once and the results saved for later use. If the server is unavailable, you can use a grobid web interface.

```r
pdf_file <- demopdf()
xml_file <- pdf2grobid(pdf_file)

# or use a pre-parsed demo XML file
xml_file <- demoxml()
```
## Load from XML

The function `read_grobid()` can read XML files parsed by grobid.

```r
paper <- read_grobid(xml_file)
```
## Full Text
You can access a parsed table of the full text of the paper via `paper$full_text`, but you may find it more convenient to use the function `search_text()`. The defaults return a data table of each sentence, with the section type, header, div, paragraph and sentence numbers, and file name. (The section type is a best guess from the headers, so may not always be accurate.)

```r
text <- search_text(paper)
```
| text | section | header | div | p | s | id |
|---|---|---|---|---|---|---|
| This paper demonstrates some good and poor practices for use with the {papercheck} R package and Shiny app. | abstract | Abstract | 0 | 1 | 1 | to_err_is_human.xml |
| Although intentional dishonestly might be a successful way to boost creativity (Gino & Wiltermuth, 2014), it is safe to say most mistakes researchers make are unintentional. | intro | Introduction | 1 | 1 | 1 | to_err_is_human.xml |
| In this study we examine whether automated checks reduce the amount of errors that researchers make in scientific manuscripts. | method | Method and Participants | 2 | 1 | 1 | to_err_is_human.xml |
| All data needed to reproduce these analyses is available from https://osf.io/5tbm9 and code is available from https://osf.io/629bx. | results | Results | 3 | 1 | 1 | to_err_is_human.xml |
| It seems automated tools can help prevent errors by providing researchers with feedback about potential mistakes, and researchers feel the app is useful. | discussion | Discussion | 4 | 1 | 1 | to_err_is_human.xml |
### Pattern
You can search for a specific word or phrase by setting the `pattern` argument. The pattern is a regex string by default; set `fixed = TRUE` if you want to find exact text matches.

```r
text <- search_text(paper, pattern = "papercheck")
```
| text | section | header | div | p | s | id |
|---|---|---|---|---|---|---|
| This paper demonstrates some good and poor practices for use with the {papercheck} R package and Shiny app. | abstract | Abstract | 0 | 1 | 1 | to_err_is_human.xml |
| In this study we examine the usefulness of Papercheck to improve best practices. | intro | Introduction | 1 | 1 | 4 | to_err_is_human.xml |
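As a minimal sketch of a fixed-text search (assuming the demo paper loaded above): with `fixed = TRUE`, regex special characters like `{` are matched literally, so you can search for text that would otherwise need escaping.

```r
# find the literal text "{papercheck}" (braces are regex
# metacharacters, so fixed = TRUE avoids escaping them)
text <- search_text(paper, pattern = "{papercheck}", fixed = TRUE)
```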
In this study we examine the usefulness of Papercheck to improve best practices. | intro | Introduction | 1 | 1 | 4 | to_err_is_human.xml |
### Section
Set `section` to a vector of the sections to search in.

```r
text <- search_text(paper, "papercheck", section = "abstract")
```
| text | section | header | div | p | s | id |
|---|---|---|---|---|---|---|
| This paper demonstrates some good and poor practices for use with the {papercheck} R package and Shiny app. | abstract | Abstract | 0 | 1 | 1 | to_err_is_human.xml |
### Return
Set `return` to one of "sentence", "paragraph", "section", or "match" to control what gets returned.

```r
text <- search_text(paper, "papercheck",
                    section = "intro",
                    return = "paragraph")
```
| text | section | header | div | p | s | id |
|---|---|---|---|---|---|---|
| Although intentional dishonestly might be a successful way to boost creativity (Gino & Wiltermuth, 2014), it is safe to say most mistakes researchers make are unintentional. From a human factors perspective, human error is a symptom of a poor design (Smithy, 2020). Automation can be use to check for errors in scientific manuscripts, and inform authors about possible corrections. In this study we examine the usefulness of Papercheck to improve best practices. | intro | Introduction | 1 | 1 | NA | to_err_is_human.xml |
### Regex matches
You can also return just the matched text from a regex search by setting `return = "match"`. The extra `...` arguments in `search_text()` are passed to `grep()`, so `perl = TRUE` allows you to use more complex regex, like the pattern below.

```r
pattern <- "[a-zA-Z]\\S*\\s*(=|<)\\s*[0-9\\.,-]*\\d"
text <- search_text(paper, pattern, return = "match", perl = TRUE)
```
| text | section | header | div | p | s | id |
|---|---|---|---|---|---|---|
| M = 9.12 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
| M = 10.9 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
| t(97.7) = 2.9 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
| p = 0.005 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
| M = 5.06 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
| M = 4.5 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
| t(97.2) = -1.96 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
| p = 0.152 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
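To see what this pattern matches outside of `search_text()`, you can try it on a plain string with base R (a minimal sketch; the example sentence is made up for illustration):

```r
pattern <- "[a-zA-Z]\\S*\\s*(=|<)\\s*[0-9\\.,-]*\\d"
x <- "the difference was significant, t(97.7) = 2.9, p = 0.005"

# extract all matches, using perl-compatible regex as above
m <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
m
#> [1] "t(97.7) = 2.9" "p = 0.005"
```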
## ChatGPT
You can query the extracted text of papers with ChatGPT.
### Setup

Run `gpt_setup()`. You will need to provide your own API key. You can enter it manually for each call to `gpt()`, or add the following lines to the .Renviron file for your user or project.

```
CHATGPT_KEY="sk-proj-abcdefghijklmnopqrs0123456789ABCDEFGHIJKLMNOPQRS"
RETICULATE_PYTHON="~/.virtualenvs/r-reticulate/bin/python"
```

```r
# useful if you aren't sure where this file is
usethis::edit_r_environ()
```
### GPT Queries

You can ask ChatGPT to process text. Use `search_text()` first to narrow the text down to what you want to query. Below, we return the full introduction sections of the first two papers (`papers` is a list of paper objects; see Batch Processing below), then ask ChatGPT "What is the hypothesis of this study?".

```r
hypotheses <- search_text(papers[1:2],
                          section = "intro",
                          return = "section")

query <- "What is the hypothesis of this study?"
gpt_hypo <- gpt(hypotheses, query)
```
| id | answer | cost |
|---|---|---|
| eyecolor.xml | The hypothesis of this study is to test the matching hypothesis, sex-linked heritable preference hypothesis, and positive sexual imprinting hypothesis in relation to eye color and partner selection. | 0.000612 |
| incest.xml | The hypothesis of this study is that humans possess adaptations to reduce inbreeding, and that the strong moral opposition to incest plays an important role in preventing inbreeding. | 0.000216 |
## Batch Processing

The functions `pdf2grobid()` and `read_grobid()` also work on a folder of files, returning a list of XML file paths or paper objects, respectively. The functions `search_text()` and `gpt()` also work on a list of paper objects.

```r
grobid_dir <- demodir()
papers <- read_grobid(grobid_dir)

hypotheses <- search_text(papers, "hypothesi",
                          section = "intro",
                          return = "paragraph")
```
## Modules
Papercheck is designed modularly, so you can add modules to check for anything. It comes with a set of pre-defined modules, and we hope people will share more modules.
### Module List

You can see the list of built-in modules with the function below.

```r
module_list()
#> * ai-summarise: Generate a 1-sentence summary for each section
#> * all-p-values: List all p-values in the text, returning the matched text (e.g., 'p = 0.04') and document location in a table.
#> * all-urls: List all the URLs in the main text
#> * imprecise-p: List any p-values reported with insufficient precision (e.g., p < .05 or p = n.s.)
#> * marginal: List all sentences that describe an effect as 'marginally significant'.
#> * osf-check: List all OSF links and whether they are open, closed, or do not exist.
#> * ref-consistency: Check if all references are cited and all citations are referenced
#> * retractionwatch: Flag any cited papers in the RetractionWatch database
#> * statcheck: Check consistency of p-values and test statistics
```
### Running modules

To run a built-in module on a paper, you can reference it by name.

```r
p <- module_run(paper, "all-p-values")
```
| text | section | header | div | p | s | id |
|---|---|---|---|---|---|---|
| p = 0.005 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
| p = 0.152 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
| p > .05 | results | Results | 3 | 2 | 2 | to_err_is_human.xml |
### Creating modules

You can create your own modules by specifying the arguments to `search_text()` or `gpt()` in JSON format, and/or by including R code. Modules can also contain instructions for reporting: to give "traffic lights" for whether a check passed or failed, and to include appropriate text feedback in a report. See the modules vignette for more details.

Below is an abbreviated example of the module that detects all p-values in the text and returns the matching text.
```json
{
  "title": "List All P-Values",
  "description": "List all p-values in the text, returning the matched text (e.g., 'p = 0.04') and document location in a table.",
  "text": {
    "pattern": "(?<=[^a-z])p-?(value)?\\s*[<>=≤≥]{1,2}\\s*(n\\.?s\\.?|\\d?\\.\\d+e?-?\\d*)",
    "return": "match",
    "perl": true
  }
}
```
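If we understand the API correctly, a module saved as a JSON file can then be run in the same way as a built-in module, by passing its path instead of a name (a sketch; the file path below is hypothetical):

```r
# run a custom module from a file path (hypothetical path)
p <- module_run(paper, "modules/my-p-values.json")
```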
## Reports

You can generate a report from any set of modules. The default set is `c("imprecise-p", "marginal", "osf-check", "retractionwatch", "ref-consistency")`.

```r
report(paper, output_format = "qmd")
```
See the example report.