library(papercheck)
library(dplyr) # for data wrangling
library(readr) # reading and writing CSV files
In this vignette, we will process 250 open access papers from Psychological Science.
Convert PDFs
Read in all of the PDF files from a directory called “pdf”, process them with a local version of grobid, and save the XML files in a directory called “xml”.
pdf2grobid(filename = "pdf",
           save_path = "xml",
           grobid_url = "http://localhost:8070")
Then read the XML files into papercheck and save them in an object called papers.
papers <- read_grobid("xml")
These steps can take some time if you are processing a lot of papers, and they only need to happen once, so it is often useful to save the papers object as an Rds file, comment out the code above, and load papers from this object on future runs of your script.
# load from RDS for efficiency
# saveRDS(papers, "psysci_oa.Rds")
papers <- readRDS("psysci_oa.Rds")
Paper Objects
Now papers is a list of papercheck paper objects, each of which contains structured information about the paper.
paper <- papers[[10]]
Authors
The authors item contains a list of information for each author. For now, CRediT roles are not detected, but this may be added in the future.
paper$authors |> str()
#> List of 2
#> $ :List of 4
#> ..$ orcid: NULL
#> ..$ name :List of 2
#> .. ..$ surname: chr "Genevsky"
#> .. ..$ given : chr "Alexander"
#> ..$ roles: NULL
#> ..$ email: chr "genevsky@stanford.edu"
#> ..- attr(*, "class")= chr [1:2] "scivrs_author" "list"
#> $ :List of 4
#> ..$ orcid: NULL
#> ..$ name :List of 2
#> .. ..$ surname: chr "Knutson"
#> .. ..$ given : chr "Brian"
#> ..$ roles: NULL
#> ..$ email: chr ""
#> ..- attr(*, "class")= chr [1:2] "scivrs_author" "list"
#> - attr(*, "class")= chr [1:2] "scivrs_authors" "list"
Info
The info item lists the filename, title, description (abstract), keywords, doi, and submission info. Grobid sometimes makes mistakes with the DOI, so be cautious about using this.
paper$info
#> $filename
#> [1] "xml/0956797615588467.xml"
#>
#> $title
#> [1] "Affective Mechanisms Predict Microlending"
#>
#> $description
#> [1] "Humans sometimes share with others whom they may never meet or know, in violation of the dictates of pure selfinterest. Research has not established which neuropsychological mechanisms support lending decisions, nor whether their influence extends to markets involving significant financial incentives. In two studies, we found that neural affective mechanisms influence the success of requests for microloans. In a large Internet database of microloan requests (N = 13,500), we found that positive affective features of photographs promoted the success of those requests. We then established that neural activity (i.e., in the nucleus accumbens) and self-reported positive arousal in a neuroimaging sample (N = 28) predicted the success of loan requests on the Internet, above and beyond the effects of the neuroimaging sample's own choices (i.e., to lend or not). These findings suggest that elicitation of positive arousal can promote the success of loan requests, both in the laboratory and on the Internet. They also highlight affective neuroscience's potential to probe neuropsychological mechanisms that drive microlending, enhance the effectiveness of loan requests, and forecast market-level behavior."
#>
#> $keywords
#> [1] "affect" "accumbens" "microlending" "preference" "fMRI"
#> [6] "prosocial" "human"
#>
#> $doi
#> [1] "10.1177/0956797615588467pss.sagepub."
#>
#> $submission
#> [1] "Received 2/2/15; Revision accepted 5/4/15"
References
The references table contains the items in the reference list, including an id to link them to citations (bib_id), the DOI if available (doi), and the full reference text (ref).
ref <- paper$references
dplyr::filter(ref, bib_id == "b5")
#> bib_id doi
#> 1 b5 10.1016/j.jcps.2011.05.001
#> ref
#> 1 A neural predictor of cultural popularity GSBerns SEMoore 10.1016/j.jcps.2011.05.001 Journal of Consumer Psychology 22 2012
Citations
The citations table contains each citation, including an id to link them to references (bib_id) and the sentence they are cited in (text).
cite <- paper$citations
dplyr::filter(cite, bib_id == "b5")
#> bib_id
#> 1 b5
#> 2 b5
#> 3 b5
#> text
#> 1 Stimulus sample size was determined via power analysis of the sole existing similar study, which used neural activity to predict Internet downloads of music (Berns & Moore, 2012).
#> 2 Following the approach of Berns and Moore (2012), we calculated correlations between Internet loan-request success and anticipatory activity in regions drawn from targeted volumes of interest (i.e., NAcc, AIns, and MPFC) as well as whole-brain analyses.
#> 3 For instance, investigators have used group NAcc activity in response to music to predict the aggregate number of song downloads 2 years later (Berns & Moore, 2012) and have used group MPFC activity to predict call volume in response to antismoking advertisements (Falk, Berkman, Whalen, & Lieberman, 2011).
Full Text
The full_text item is a table containing each sentence from the main text (text). The heading text (header) is used to automatically determine if the section is abstract, intro, method, results, or discussion. Each section has a unique sequential div number, and each paragraph (p) within the section and each sentence (s) within each paragraph are also sequentially numbered (e.g., div = 1, p = 2, s = 3 is the third sentence of the second paragraph of the first section after the abstract).
paper$full_text |> names()
#> [1] "text" "section" "header" "div" "p" "s" "id"
Text Search
The search_text() function helps you search the text of a paper or list of papers.
The default arguments give you a data frame containing a row for every sentence in every paper in the set. The data frame has the same column structure as the full_text table above, so that you can easily chain text searches.
all_sentences <- search_text(papers)
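If you want to code these sentences outside of R (e.g., in a spreadsheet), you can write the table to a CSV file with readr; the file name here is just an example:
readr::write_csv(all_sentences, "all_sentences.csv")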
You can customise search_text() to return paragraphs or sections instead of sentences. The section column contains the automatically classified section types from the options "abstract", "intro", "methods", "results", or "discussion" (this can be inaccurate if grobid doesn't detect headers or the header text doesn't obviously fall in one of these categories).
method_paragraphs <- search_text(papers, section = "method", return = "paragraph")
A random paragraph from a method section.
#> [1] "SES. Family SES was assessed by measures at first contact when the TEDS twins were 18 months and when they were 7, 9, and 16 years old. At first contact (SES Index 1), age 7 (SES Index 2), and age 16 (SES Index 4), parents reported their highest educational qualifications and occupation. Educational qualifications were assessed on an 8-point scale from 1 (no formal education) to 8 (postgraduate qualifications). Occupation was inferred on the basis of a standard classification (Office of Population and Census Surveys, 1991) that used reports of employment status, job title, employment type, and whether parents needed special qualifications for their role. At 9 years (SES Index 3) and 16 years, family income was assessed; parents reported their annual household income before tax on an 11-point scale from 1 (under £4,500) to 11 (more than £100,000). For SES Indexes 1 through 3, standardized mean scores were calculated and averaged at each assessment age. Previous analyses of these data showed that correlations (r) between these estimates were .77 for SES Index 1 and 2, .55 for SES Indexes 1 and 3, and .57 for SES Indexes 2 and 3 (Hanscombe et al., 2012). Correlations with SES Index 4 have not been previously reported and were .70 with SES Index 1, .76 with SES Index 2, and .65 with SES Index 6 in the current analysis sample. We summed the four indexes to achieve one SES composite score for each participant."
Pattern
You could retrieve every sentence or paragraph in a set of papers and code them all, but this is usually not very efficient, so we can use a search pattern to filter the text.
search <- search_text(papers, pattern = "Scotland")
Here we have 4 results. We’ll just show the paper id and text columns of the returned table, but the table also provides the section type, header, and section, paragraph, and sentence numbers (div, p, and s).
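A sketch of how you might display just those two columns, assuming the id and text column names shown for the full_text table above:
search |> dplyr::select(id, text)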
Chaining
You can chain together searches to iteratively narrow down results.
search <- papers |>
  search_text("DeBruine") |>
  search_text("2006")
Regex
You can also use regular expressions to refine your search. The pattern below returns every sentence that contains either “Scotland” or “Scottish”.
search <- search_text(papers, pattern = "(Scotland|Scottish)")
Match
You can return just the matching text for a regular expression by setting return to "match". This pattern searches for text like "p > .25" or "p > 0.01".
match <- search_text(papers,
                     pattern = "p\\s*>\\s*0?\\.[0-9]+\\b",
                     return = "match")
You can expand this to the whole sentence, paragraph, or +/- some number of sentences around the match using expand_text().
expand <- expand_text(results_table = match,
                      paper = papers,
                      expand_to = "sentence",
                      plus = 0,
                      minus = 0)
expand$expanded[1]
#> [1] "No main effects or interactions with time were found (p > .29), which indicates that the action-specific effects of TMS on confidence are not specific to its delivery before or after a perceptual decision."