library(papercheck)
library(dplyr) # for data wrangling
library(readr) # reading and writing CSV files

In this vignette, we will process 250 open access papers from Psychological Science.

Convert PDFs

Read in all of the PDF files from a directory called “pdf”, process them with a local version of grobid, and save the XML files in a directory called “xml”.

pdf2grobid(filename = "pdf", 
           save_path = "xml", 
           grobid_url = "http://localhost:8070")

Then read in the XML files to papercheck and save in an object called papers.

papers <- read_grobid("xml")

These steps can take some time if you are processing a lot of papers and only need to happen once, so it is often useful to save the papers object as an Rds file, comment out the code above, and load papers from this file on future runs of your script.

# load from RDS for efficiency
# saveRDS(papers, "psysci_oa.Rds")
papers <- readRDS("psysci_oa.Rds")
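
The save-once, load-later pattern can also be written as a conditional, so that nothing needs to be commented out. The following is a minimal base R sketch under stated assumptions: `slow_step()` is a hypothetical stand-in for the `pdf2grobid()` and `read_grobid()` calls, `load_or_build()` is an illustrative helper that is not part of papercheck, and the cache path is a temporary file rather than "psysci_oa.Rds".

```r
# Hypothetical stand-in for the slow conversion and import steps
slow_step <- function() {
  list(paper_a = "first paper", paper_b = "second paper")
}

# Assumed cache location; a real script might use "psysci_oa.Rds"
cache_file <- tempfile(fileext = ".Rds")

# Illustrative helper (not a papercheck function)
load_or_build <- function(path, build) {
  if (file.exists(path)) {
    readRDS(path)      # fast path: reuse the saved object
  } else {
    obj <- build()     # slow path: run the expensive step once
    saveRDS(obj, path) # cache the result for future runs
    obj
  }
}

papers_demo <- load_or_build(cache_file, slow_step)
```

On the first run the object is built and saved; later calls with the same path read the file and skip the slow step.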

Paper Objects

Now papers is a list of papercheck paper objects, each of which contains structured information about the paper.

paper <- papers[[10]]

ID

The id is taken from the name of the xml file.

paper$id
#> [1] "0956797615588467"

Authors

The authors list contains a list of information for each author. For now, CRediT roles are not detected, but this may be added in the future.

paper$authors |> str()
#> List of 2
#>  $ :List of 5
#>   ..$ orcid      : NULL
#>   ..$ name       :List of 2
#>   .. ..$ surname: chr "Genevsky"
#>   .. ..$ given  : chr "Alexander"
#>   ..$ roles      : NULL
#>   ..$ email      : chr "genevsky@stanford.edu"
#>   ..$ affiliation:List of 1
#>   .. ..$ :List of 1
#>   .. .. ..$ department: chr "Department of Psychology"
#>   ..- attr(*, "class")= chr [1:2] "scivrs_author" "list"
#>  $ :List of 5
#>   ..$ orcid      : NULL
#>   ..$ name       :List of 2
#>   .. ..$ surname: chr "Knutson"
#>   .. ..$ given  : chr "Brian"
#>   ..$ roles      : NULL
#>   ..$ email      : chr ""
#>   ..$ affiliation:List of 2
#>   .. ..$ :List of 1
#>   .. .. ..$ department: chr "Department of Psychology"
#>   .. ..$ :List of 2
#>   .. .. ..$ department : chr "Stanford Neurosciences Institute"
#>   .. .. ..$ institution: chr "Stanford University"
#>   ..- attr(*, "class")= chr [1:2] "scivrs_author" "list"
#>  - attr(*, "class")= chr [1:2] "scivrs_authors" "list"

You can get the authors as a table for a paper object or list of papers.

author_table(psychsci) |> 
  dplyr::filter(grepl("Glasgow", affiliation))
#> # A tibble: 17 × 7
#>    name.surname name.given  email                  affiliation id        n orcid
#>    <chr>        <chr>       <chr>                  <chr>       <chr> <int> <chr>
#>  1 Schyns       Philippe G  ""                     University… 0956…     3 NA   
#>  2 Lages        Martin      "martin.lages@glasgow… School of … 0956…     1 NA   
#>  3 Boyle        Stephanie C ""                     Institute … 0956…     2 NA   
#>  4 Jones        Benedict C  "ben.jones@glasgow.ac… Institute … 0956…     1 NA   
#>  5 Fisher       Claire I    ""                     Institute … 0956…     3 NA   
#>  6 Wang         Hongyi      ""                     Institute … 0956…     4 NA   
#>  7 Kandrik      Michal      ""                     Institute … 0956…     5 NA   
#>  8 Han          Chengyang   ""                     Institute … 0956…     6 NA   
#>  9 Fasolt       Vanessa     ""                     Institute … 0956…     7 NA   
#> 10 Morrison     Danielle    ""                     Institute … 0956…     8 NA   
#> 11 Lee          Anthony J   ""                     Institute … 0956…     9 NA   
#> 12 Holzleitner  Iris J      ""                     Institute … 0956…    10 NA   
#> 13 O'shea       Kieran J    ""                     Institute … 0956…    11 NA   
#> 14 Debruine     Lisa M      ""                     Institute … 0956…    14 NA   
#> 15 Jones        Benedict C  ""                     Institute … 0956…     2 NA   
#> 16 Debruine     Lisa M      ""                     Institute … 0956…     3 NA   
#> 17 Fasolt       Vanessa     ""                     Institute … 0956…     5 NA

Info

The info item lists the filename, title, description (abstract), keywords, doi, and submission info. Grobid sometimes makes mistakes with the DOI, so be cautious about using this.

paper$info
#> $filename
#> [1] "./0956797615588467.xml"
#> 
#> $title
#> [1] "Affective Mechanisms Predict Microlending"
#> 
#> $description
#> [1] "Humans sometimes share with others whom they may never meet or know, in violation of the dictates of pure selfinterest. Research has not established which neuropsychological mechanisms support lending decisions, nor whether their influence extends to markets involving significant financial incentives. In two studies, we found that neural affective mechanisms influence the success of requests for microloans. In a large Internet database of microloan requests (N = 13,500), we found that positive affective features of photographs promoted the success of those requests. We then established that neural activity (i.e., in the nucleus accumbens) and self-reported positive arousal in a neuroimaging sample (N = 28) predicted the success of loan requests on the Internet, above and beyond the effects of the neuroimaging sample's own choices (i.e., to lend or not). These findings suggest that elicitation of positive arousal can promote the success of loan requests, both in the laboratory and on the Internet. They also highlight affective neuroscience's potential to probe neuropsychological mechanisms that drive microlending, enhance the effectiveness of loan requests, and forecast market-level behavior."
#> 
#> $keywords
#> [1] "affect"       "accumbens"    "microlending" "preference"   "fMRI"        
#> [6] "prosocial"    "human"       
#> 
#> $doi
#> [1] "10.1177/0956797615588467"
#> 
#> $submission
#> [1] "Received 2/2/15; Revision accepted 5/4/15"

You can get this as a table for a batch of papers using info_table().

info_table(papers, info = c("doi", "title")) |> 
  head()
#> # A tibble: 6 × 3
#>   id               doi                      title                               
#>   <chr>            <chr>                    <chr>                               
#> 1 0956797613520608 10.1177/0956797613520608 Continuous Theta-Burst Stimulation …
#> 2 0956797614522816 10.1177/0956797614522816 Beyond Gist: Strategic and Incremen…
#> 3 0956797614527830 10.1177/0956797614527830 Serotonin and Social Norms: Tryptop…
#> 4 0956797614557697 10.1177/0956797614557697 Action-Specific Disruption of Perce…
#> 5 0956797614560771 10.1177/0956797614560771 Emotional Vocalizations Are Cross-C…
#> 6 0956797614566469 10.1177/0956797614566469 Conspiracist Ideation as a Predicto…

References

The references item contains the entries in the reference list, including an id to link them to citations (bib_id), the DOI if available (doi), and the full reference text (ref).

ref <- paper$references

dplyr::filter(ref, bib_id == "b5")
#>   bib_id                        doi
#> 1     b5 10.1016/j.jcps.2011.05.001
#>                                                                                                                                                                                                     ref
#> 1 Berns GS, Moore SE (2012). “A neural predictor of cultural popularity.” _Journal of Consumer Psychology_, *22*, 154-160. doi:10.1016/j.jcps.2011.05.001 <https://doi.org/10.1016/j.jcps.2011.05.001>.

Citations

The citations item contains each citation, including an id to link it to the reference list (bib_id) and the sentence in which it is cited (text).

cite <- paper$citations

dplyr::filter(cite, bib_id == "b5")
#>   bib_id
#> 1     b5
#> 2     b5
#> 3     b5
#>                                                                                                                                                                                                                                                                                                                  text
#> 1                                                                                                                                 Stimulus sample size was determined via power analysis of the sole existing similar study, which used neural activity to predict Internet downloads of music (Berns & Moore, 2012).
#> 2                                                       Following the approach of Berns and Moore (2012), we calculated correlations between Internet loan-request success and anticipatory activity in regions drawn from targeted volumes of interest (i.e., NAcc, AIns, and MPFC) as well as whole-brain analyses.
#> 3 For instance, investigators have used group NAcc activity in response to music to predict the aggregate number of song downloads 2 years later (Berns & Moore, 2012) and have used group MPFC activity to predict call volume in response to antismoking advertisements (Falk, Berkman, Whalen, & Lieberman, 2011).
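
Because the references and citations tables share bib_id, each citing sentence can be paired with its full reference. The following is a minimal sketch using base R `merge()` on toy data frames that mimic the columns shown above. The rows are made up for illustration and are not taken from any paper.

```r
# Toy stand-ins for paper$references and paper$citations (made-up rows)
ref_demo <- data.frame(
  bib_id = c("b1", "b2"),
  ref    = c("Author A (2001). First toy reference.",
             "Author B (2002). Second toy reference.")
)
cite_demo <- data.frame(
  bib_id = c("b2", "b1", "b2"),
  text   = c("Sentence citing B.",
             "Sentence citing A.",
             "Another sentence citing B.")
)

# Attach the full reference to every citing sentence via the shared key
cited <- merge(cite_demo, ref_demo, by = "bib_id", all.x = TRUE)

# Count how often each reference is cited
cite_counts <- table(cited$bib_id)
```

Using `all.x = TRUE` keeps citing sentences even when no matching reference was detected, which shows up as NA in the ref column.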

Full Text

The full_text item is a table containing each sentence from the main text (text). The heading text (header) is used to automatically determine if the section is abstract, intro, method, results, or discussion. Each section has a unique sequential div number, and each paragraph (p) within the section and each sentence (s) within each paragraph are also sequentially numbered (e.g., div = 1, p = 2, s = 3 is the third sentence of the second paragraph of the first section after the abstract).

paper$full_text |> names()
#> [1] "text"    "section" "header"  "div"     "p"       "s"       "id"
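
The div, p, and s columns work together as an address for each sentence. The following is a minimal sketch on a toy table that has the same index columns as full_text. The text values are made up, and the other columns (section, header, id) are omitted for brevity.

```r
# Toy table with the same index columns as full_text (text is made up)
ft_demo <- data.frame(
  text = c("A one.", "A two.", "B one.", "B two.", "B three."),
  div  = c(1, 1, 1, 1, 1),
  p    = c(1, 1, 2, 2, 2),
  s    = c(1, 2, 1, 2, 3)
)

# Third sentence of the second paragraph of the first section
target <- subset(ft_demo, div == 1 & p == 2 & s == 3)

# Rebuild each paragraph by pasting its sentences together in order
paragraphs <- aggregate(text ~ div + p, data = ft_demo,
                        FUN = paste, collapse = " ")
```

Here `target$text` is "B three.", and `paragraphs` has one row per paragraph.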

The search_text() function helps you search the text of a paper or list of papers.

The default arguments give you a data frame containing a row for every sentence in every paper in the set. The data frame has the same column structure as the full_text table above, so that you can easily chain text searches.

all_sentences <- search_text(papers)

You can customise search_text() to return paragraphs or sections instead of sentences. The section column contains the automatically classified section types from the options “abstract”, “intro”, “method”, “results”, or “discussion” (this can be inaccurate if grobid doesn’t detect headers or the header text doesn’t obviously fall in one of these categories).

method_paragraphs <- search_text(papers, section = "method", return = "paragraph")

A random paragraph from a method section.

#> [1] "Method"

Pattern

You could manually code every sentence or paragraph in a set of papers, but this is rarely efficient, so we can use a search pattern to filter the text first.

search <- search_text(papers, pattern = "Scotland")

Here we have 8 results. We’ll just show the paper id and text columns of the returned table, but the table also provides the section type, header, and section, paragraph, and sentence numbers (div, p, and s).

Chaining

You can chain together searches to iteratively narrow down results.

search <- papers |>
  search_text("DeBruine") |>
  search_text("2006")
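
Chaining works because each search returns a table with the same structure as its input, so the second pattern is applied only to the rows that matched the first. The following is a minimal base R sketch of that idea. `filter_text()` is a hypothetical helper, not papercheck's implementation, and the sentences are made up.

```r
# Hypothetical helper: keep rows whose text matches a pattern.
# It only mimics the table-in, table-out shape of a text search.
filter_text <- function(tbl, pattern) {
  tbl[grepl(pattern, tbl$text), , drop = FALSE]
}

# Made-up sentences for illustration
sentences <- data.frame(
  id   = c("p1", "p1", "p2"),
  text = c("Smith (2006) reported an effect.",
           "Smith (2002) used a different task.",
           "A 2006 study found no effect.")
)

# Each step narrows the previous result, so the patterns combine as AND
narrowed <- sentences |>
  filter_text("Smith") |>
  filter_text("2006")
```

Only the first sentence survives both filters, because it is the only one that contains both "Smith" and "2006".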

Regex

You can also use regular expressions to refine your search. The pattern below returns every sentence that contains either “Scotland” or “Scottish”.

search <- search_text(papers, pattern = "(Scotland|Scottish)")

Match

You can return just the matching text for a regular expression by setting return to “match”. This pattern searches for text like “p > .25” or “p>0.01”.

match <- search_text(papers, 
                     pattern = "p\\s*>\\s*0?\\.[0-9]+\\b", 
                     return = "match")
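
To check what a pattern like this captures before running it on a whole set of papers, it can be tried on a few strings with base R. The following is a minimal sketch. The test sentences are made up, and `perl = TRUE` is an assumption about the regex flavour, so behaviour inside search_text() may differ in edge cases.

```r
# Same pattern as above: "p", optional spaces, ">", optional spaces,
# an optional leading zero, a decimal point, and one or more digits
pattern <- "p\\s*>\\s*0?\\.[0-9]+\\b"

# Made-up test sentences
x <- c("The effect was not significant (p > .29).",
       "Groups did not differ, p>0.05.",
       "The effect was reliable, p < .001.")

# regexpr() finds the first match per string; regmatches() extracts it.
# Strings without a match are dropped from the result.
m <- regmatches(x, regexpr(pattern, x, perl = TRUE))
m
#> [1] "p > .29" "p>0.05"
```

The third sentence is not returned because the pattern only looks for the greater-than sign.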

You can expand this to the whole sentence, paragraph, or +/- some number of sentences around the match using expand_text().

expand <- expand_text(results_table = match, 
                      paper = papers,
                      expand_to = "sentence",
                      plus = 0,
                      minus = 0)

expand$expanded[1]
#> [1] "No main effects or interactions with time were found (p > .29), which indicates that the action-specific effects of TMS on confidence are not specific to its delivery before or after a perceptual decision."
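
The idea of expanding a match by some number of sentences can be sketched in base R. `expand_window()` is a hypothetical helper, not papercheck's implementation. It assumes that minus counts sentences before the match and plus counts sentences after it, within one paragraph. The sentences are made up.

```r
# Made-up sentences from one paragraph, numbered like the s column
para <- data.frame(
  s    = 1:4,
  text = c("We ran a pilot.",
           "The test was not significant (p > .29).",
           "We therefore pooled the data.",
           "Results follow.")
)

# Hypothetical helper: paste the sentences from (hit - minus) to (hit + plus)
expand_window <- function(tbl, hit, minus = 0, plus = 0) {
  keep <- tbl$s >= hit - minus & tbl$s <= hit + plus
  paste(tbl$text[keep], collapse = " ")
}

# Sentence number of the match
hit <- para$s[grepl("p > ", para$text, fixed = TRUE)]

same_sentence <- expand_window(para, hit)
with_context  <- expand_window(para, hit, minus = 1, plus = 1)
```

With the defaults, only the matching sentence is returned. With `minus = 1` and `plus = 1`, the sentence before and the sentence after are included as well.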