Installation
You can install the development version of papercheck from GitHub with:
# install.packages("devtools")
devtools::install_github("scienceverse/papercheck")
You can launch an interactive shiny app version of the code below with:
Load from PDF
The function pdf2grobid() can read PDF files and save them in the TEI
format created by grobid. This requires an internet connection and takes
a few seconds per paper, so it should be done only once, with the results
saved for later use.
If the server is unavailable, you can use a grobid web interface.
pdf_file <- demopdf()
xml_file <- pdf2grobid(pdf_file)
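Because conversion is slow and depends on a server, it can help to wrap it in a small caching helper. The sketch below is one possible way to do this in base R. The names convert_once() and fake_converter() are hypothetical and used only for illustration; the stand-in converter takes the place of a real call such as pdf2grobid().

```r
# Hypothetical helper: run an expensive conversion only if no cached file exists.
# `converter` stands in for a real conversion call and must return text to cache.
convert_once <- function(input, cache_file, converter) {
  if (!file.exists(cache_file)) {
    writeLines(converter(input), cache_file)
  }
  cache_file
}

# Stand-in converter that counts how often it is actually called
counter <- new.env()
counter$n <- 0
fake_converter <- function(input) {
  counter$n <- counter$n + 1
  paste0("<TEI>", input, "</TEI>")
}

cache_file <- tempfile(fileext = ".xml")
convert_once("paper.pdf", cache_file, fake_converter)  # converts and caches
convert_once("paper.pdf", cache_file, fake_converter)  # reuses the cached file
counter$n
```

The second call returns immediately because the cached file already exists, so the converter runs only once.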
You can set up your own local grobid server following instructions from https://grobid.readthedocs.io/. The easiest way is to use Docker.
Then you can set your grobid_url to the local path http://localhost:8070.
xml_file <- pdf2grobid(pdf_file, grobid_url = "http://localhost:8070")
Load from XML
The function read_grobid() can read XML files parsed by grobid.
paper <- read_grobid(xml_file)
Load from non-PDF document
To take advantage of grobid’s ability to parse references and other
aspects of papers, the best approach for now is to convert your papers
to PDF. However, papercheck can also read plain text from a character
object or a text/docx file with read_text().
text <- "Abstract
This is my very short paper. It has two sentences."
shortpaper <- read_text(text, id = "shortpaper")
shortpaper$full_text
#> # A tibble: 3 × 7
#>   text                         section  header     div     p     s id
#>   <chr>                        <chr>    <chr>    <int> <dbl> <int> <chr>
#> 1 Abstract                     abstract Abstract     1     0     1 shortpaper
#> 2 This is my very short paper. abstract Abstract     1     1     1 shortpaper
#> 3 It has two sentences.        abstract Abstract     1     1     2 shortpaper
filename <- system.file("extdata/to_err_is_human.docx",
                        package = "papercheck")
paper_from_doc <- read_text(filename)
Search Text
You can access a parsed table of the full text of the paper via
paper$full_text, but you may find it more convenient to use the function
search_text(). By default, it returns a data table with one row per
sentence, along with the section type, header, div, paragraph and
sentence numbers, and file name. (The section type is a best guess from
the headers, so it may not always be accurate.)
text <- search_text(paper)
text | section | header | div | p | s | id |
---|---|---|---|---|---|---|
This paper demonstrates some good and poor practices for use with the {papercheck} R package and Shiny app. | abstract | Abstract | 0 | 1 | 1 | to_err_is_human.xml |
Although intentional dishonestly might be a successful way to boost creativity (Gino & Wiltermuth, 2014), it is safe to say most mistakes researchers make are unintentional. | intro | Introduction | 1 | 1 | 1 | to_err_is_human.xml |
In this study we examine whether automated checks reduce the amount of errors that researchers make in scientific manuscripts. | method | Method and Participants | 2 | 1 | 1 | to_err_is_human.xml |
All data needed to reproduce the analyses in Table 1 is available from https://osf.io/5tbm9 and code is available from the OSF. | results | Results | 3 | 1 | 1 | to_err_is_human.xml |
It seems automated tools can help prevent errors by providing researchers with feedback about potential mistakes, and researchers feel the app is useful. | discussion | Discussion | 4 | 1 | 1 | to_err_is_human.xml |
Pattern
You can search for a specific word or phrase by setting the pattern
argument. The pattern is a regex string by default; set fixed = TRUE if
you want to find exact text matches.
text <- search_text(paper, pattern = "papercheck")
text | section | header | div | p | s | id |
---|---|---|---|---|---|---|
This paper demonstrates some good and poor practices for use with the {papercheck} R package and Shiny app. | abstract | Abstract | 0 | 1 | 1 | to_err_is_human.xml |
In this study we examine the usefulness of Papercheck to improve best practices. | intro | Introduction | 1 | 1 | 4 | to_err_is_human.xml |
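The difference between a regex pattern and a fixed pattern can be seen with base R's grepl(), which follows the same convention. This is a small sketch on made-up strings, not output from the demo paper.

```r
# Made-up strings for illustration
x <- c("p = .05", "n = 105")
pattern <- "= .05"

# As a regex, "." matches any single character, so "= 105" also matches
regex_hits <- grepl(pattern, x)

# With fixed = TRUE the pattern is matched literally, character for character
fixed_hits <- grepl(pattern, x, fixed = TRUE)

regex_hits  # TRUE TRUE
fixed_hits  # TRUE FALSE
```

If a search term contains regex metacharacters such as ".", "(" or "?", either escape them or use fixed = TRUE.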
Section
Set section to a vector of the sections to search in.
text <- search_text(paper, "papercheck",
                    section = "abstract")
text | section | header | div | p | s | id |
---|---|---|---|---|---|---|
This paper demonstrates some good and poor practices for use with the {papercheck} R package and Shiny app. | abstract | Abstract | 0 | 1 | 1 | to_err_is_human.xml |
Return
Set return to one of “sentence”, “paragraph”, “section”, or “match” to
control what gets returned.
text <- search_text(paper, "papercheck",
                    section = "intro",
                    return = "paragraph")
text | section | header | div | p | s | id |
---|---|---|---|---|---|---|
Although intentional dishonestly might be a successful way to boost creativity (Gino & Wiltermuth, 2014), it is safe to say most mistakes researchers make are unintentional. From a human factors perspective, human error is a symptom of a poor design (Smithy, 2020). Automation can be use to check for errors in scientific manuscripts, and inform authors about possible corrections. In this study we examine the usefulness of Papercheck to improve best practices. | intro | Introduction | 1 | 1 | NA | to_err_is_human.xml |
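In the paragraph-level result above, the sentence number s is NA because several sentences have been merged into one row. The base R sketch below shows the general idea of collapsing a sentence table into paragraphs. It uses a made-up table and is not papercheck's actual implementation.

```r
# Made-up sentence table with the same location columns as the output above
sentences <- data.frame(
  text = c("First sentence.", "Second sentence.", "Another paragraph."),
  div  = c(1, 1, 1),
  p    = c(1, 1, 2),
  s    = c(1, 2, 1),
  stringsAsFactors = FALSE
)

# Paste together all sentences that share the same div and paragraph number
paragraphs <- aggregate(text ~ div + p, data = sentences,
                        FUN = paste, collapse = " ")

# A paragraph spans several sentences, so a single sentence number no longer applies
paragraphs$s <- NA
paragraphs
```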
Regex matches
You can also return just the matched text from a regex search by setting
return = "match". The extra ... arguments in search_text() are passed to
grep(), so perl = TRUE allows you to use more complex regex, like below.
pattern <- "[a-zA-Z]\\S*\\s*(=|<)\\s*[0-9\\.,-]*\\d"
text <- search_text(paper, pattern, return = "match", perl = TRUE)
text | section | header | div | p | s | id |
---|---|---|---|---|---|---|
M = 9.12 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
M = 10.9 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
t(97.7) = 2.9 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
p = 0.005 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
M = 5.06 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
M = 4.5 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
t(97.2) = -1.96 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
p = 0.152 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
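To see what this pattern captures without loading a paper, you can try it on a single string with base R's gregexpr() and regmatches(). The sentence below is invented for illustration.

```r
# Same pattern as above: a letter, optional non-space characters,
# then "=" or "<", then a number ending in a digit
pattern <- "[a-zA-Z]\\S*\\s*(=|<)\\s*[0-9\\.,-]*\\d"

# Invented example sentence
sentence <- "The groups differed, t(97.7) = 2.9, p = 0.005."

# gregexpr() finds every match; regmatches() extracts the matched text
matches <- regmatches(sentence, gregexpr(pattern, sentence, perl = TRUE))[[1]]
matches
#> [1] "t(97.7) = 2.9" "p = 0.005"
```

The final \\d forces each match to end on a digit, which is why the trailing comma and full stop are not included.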
Expand Text
You can expand the text returned by search_text() or a module with
expand_text().
marginal <- search_text(paper, "marginal") |>
  expand_text(paper, plus = 1, minus = 1)
marginal[, c("text", "expanded")]
#> # A tibble: 2 × 2
#>   text                                                                  expanded
#>   <chr>                                                                 <chr>
#> 1 "The paper shows examples of (1) open and closed OSF links; (2) cita… "All da…
#> 2 "On average researchers in the experimental condition found the app … "On ave…
Large Language Models
You can query the extracted text of papers with LLMs using groq.
Setup
You will need to get your own API key from https://console.groq.com/keys.
To avoid having to type it out, add it to the .Renviron file in the
following format (you can use usethis::edit_r_environ() to access the
.Renviron file).
# useful if you aren't sure where this file is
usethis::edit_r_environ()
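Entries in .Renviron take the form NAME=value, one per line, and are read when R starts, so restart R after editing the file. The sketch below checks whether a key is visible to the current session. The variable name GROQ_API_KEY is an assumption made for illustration; check the papercheck documentation for the name it expects.

```r
# Assumed variable name, for illustration only
key_name <- "GROQ_API_KEY"

# Sys.getenv() returns "" here if the variable is not set
key <- Sys.getenv(key_name, unset = "")
key_is_set <- nzchar(key)

if (key_is_set) {
  message(key_name, " is set (", nchar(key), " characters).")
} else {
  message(key_name, " is not set; edit .Renviron and restart R.")
}
```

Printing only the length of the key avoids exposing it in logs or shared output.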
LLM Queries
You can ask an LLM to process text. Use search_text() first to narrow
down the text to what you want to query. Below, we search the
introduction sections of the first two papers and return each section in
full. Then we ask an LLM “What is the hypothesis of this study?”.
hypotheses <- search_text(papers[1:2],
                          section = "intro",
                          return = "section")
query <- "What is the hypothesis of this study? Answer as briefly as possible."
llm_hypo <- llm(hypotheses, query)
id | answer |
---|---|
eyecolor.xml | The hypothesis of this study is that humans exhibit positive sexual imprinting, where individuals choose partners with physical characteristics similar to those of their opposite-sex parent. |
incest.xml | The hypothesis is that moral opposition to third-party sibling incest is greater among individuals with other-sex siblings than among individuals with same-sex siblings. |
Batch Processing
The functions pdf2grobid() and read_grobid() also work on a folder of
files, returning a list of XML file paths or paper objects, respectively.
The functions search_text(), expand_text() and llm() also work on a list
of paper objects.
grobid_dir <- demodir()
papers <- read_grobid(grobid_dir)
hypotheses <- search_text(papers, "hypothesi",
                          section = "intro",
                          return = "paragraph")
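Conceptually, searching a list of papers means applying the same search to each paper and stacking the results, with an id column recording the source. The base R sketch below illustrates that idea on made-up tables. The names tables and search_one() are hypothetical, and this is not how papercheck implements batch search internally.

```r
# Made-up sentence tables, one per paper, named by file
tables <- list(
  a.xml = data.frame(text = c("We test the hypothesis that X.", "Methods follow."),
                     stringsAsFactors = FALSE),
  b.xml = data.frame(text = "No prediction was made.",
                     stringsAsFactors = FALSE)
)

# Hypothetical helper: search one table and label the hits with the paper id
search_one <- function(id) {
  hits <- tables[[id]][grepl("hypothesi", tables[[id]]$text), , drop = FALSE]
  data.frame(id = rep(id, nrow(hits)), text = hits$text,
             stringsAsFactors = FALSE)
}

# Apply the search to every paper and stack the results into one table
results <- do.call(rbind, lapply(names(tables), search_one))
rownames(results) <- NULL
results
```

Papers with no matches contribute zero rows, so they simply drop out of the combined table.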
Modules
Papercheck is designed modularly, so you can add modules to check for anything. It comes with a set of pre-defined modules, and we hope people will share more modules.
Module List
You can see the list of built-in modules with the function below.
#> * all-p-values: List all p-values in the text, returning the matched text (e.g., 'p = 0.04') and document location in a table.
#> * all-urls: List all the URLs in the main text
#> * imprecise-p: List any p-values reported with insufficient precision (e.g., p < .05 or p = n.s.)
#> * llm-summarise: Generate a 1-sentence summary for each section
#> * marginal: List all sentences that describe an effect as 'marginally significant'.
#> * osf-check: List all OSF links and whether they are open, closed, or do not exist.
#> * ref-consistency: Check if all references are cited and all citations are referenced
#> * retractionwatch: Flag any cited papers in the RetractionWatch database
#> * statcheck: Check consistency of p-values and test statistics
Running modules
To run a built-in module on a paper, you can reference it by name.
p <- module_run(paper, "all-p-values")
text | section | header | div | p | s | id |
---|---|---|---|---|---|---|
p = 0.005 | results | Results | 3 | 1 | 2 | to_err_is_human.xml |
p = 0.152 | results | Results | 3 | 2 | 1 | to_err_is_human.xml |
p > .05 | results | Results | 3 | 2 | 2 | to_err_is_human.xml |
Creating modules
You can create your own modules by specifying the arguments to
search_text() or llm() in JSON format and/or including R code. Modules
can also contain instructions for reporting, to give “traffic lights” for
whether a check passed or failed, and to include appropriate text
feedback in a report. See the modules vignette for more details.
Below is an abbreviated example of the module that detects all p-values in the text and returns the matching text.
Reports
You can generate a report from any set of modules. The default set is
c("imprecise-p", "marginal", "osf-check", "retractionwatch", "ref-consistency").
report(paper, output_format = "qmd")
See the example report.