library(papercheck)
library(dplyr) # for data wrangling
library(readr) # reading and writing CSV files

In this vignette, we will process 250 open access papers from Psychological Science.

Convert PDFs

Read in all of the PDF files from a directory called “pdf”, process them with a local version of grobid, and save the XML files in a directory called “xml”.

pdf2grobid(filename = "pdf", 
           save_path = "xml", 
           grobid_url = "http://localhost:8070")

Then read in the XML files to papercheck and save in an object called papers.

papers <- read_grobid("xml")

These steps can take some time if you are processing a lot of papers and only need to happen once, so it is often useful to save the papers object as an Rds file, comment out the code above, and load papers from this file on future runs of your script.

# load from RDS for efficiency
# saveRDS(papers, "psysci_oa.Rds")
papers <- readRDS("psysci_oa.Rds")
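
The save-then-reload workflow can also be written so that nothing has to be commented out. The following is a minimal base R sketch, not part of papercheck: cache_file and build_papers() are hypothetical stand-ins. In a real script, build_papers() would wrap the read_grobid("xml") call and cache_file would point to "psysci_oa.Rds".

```r
# Hypothetical cache location (a temporary file, so the sketch is self-contained)
cache_file <- file.path(tempdir(), "papers_cache.Rds")

# Hypothetical stand-in for the slow step, e.g. read_grobid("xml")
build_papers <- function() {
  list(list(name = "example_paper"))
}

if (file.exists(cache_file)) {
  # Fast path: reuse the object saved on an earlier run
  papers_cached <- readRDS(cache_file)
} else {
  # Slow path: build the object once, then save it for future runs
  papers_cached <- build_papers()
  saveRDS(papers_cached, cache_file)
}
```

Deleting the cache file forces the slow step to run again, which is useful after adding new PDFs.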

Paper Objects

Now papers is a list of papercheck paper objects, each of which contains structured information about the paper.

paper <- papers[[10]]

Name

The name is taken from the name of the XML file.

paper$name
#> [1] "0956797615588467"

Authors

The authors list contains a list of information for each author. For now, CRediT roles are not detected, but this may be added in the future.

paper$authors |> str()
#> List of 2
#>  $ :List of 4
#>   ..$ orcid: NULL
#>   ..$ name :List of 2
#>   .. ..$ surname: chr "Genevsky"
#>   .. ..$ given  : chr "Alexander"
#>   ..$ roles: NULL
#>   ..$ email: chr "genevsky@stanford.edu"
#>   ..- attr(*, "class")= chr [1:2] "scivrs_author" "list"
#>  $ :List of 4
#>   ..$ orcid: NULL
#>   ..$ name :List of 2
#>   .. ..$ surname: chr "Knutson"
#>   .. ..$ given  : chr "Brian"
#>   ..$ roles: NULL
#>   ..$ email: chr ""
#>   ..- attr(*, "class")= chr [1:2] "scivrs_author" "list"
#>  - attr(*, "class")= chr [1:2] "scivrs_authors" "list"
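
Because the authors item is a nested list, it can be handy to flatten it into a table. The sketch below uses only base R and a mock list that mirrors the structure and values in the str() output above (without the class attributes). The names authors_mock and author_table are made up for illustration.

```r
# Mock of the nested structure shown above (values copied from the output)
authors_mock <- list(
  list(orcid = NULL,
       name  = list(surname = "Genevsky", given = "Alexander"),
       roles = NULL,
       email = "genevsky@stanford.edu"),
  list(orcid = NULL,
       name  = list(surname = "Knutson", given = "Brian"),
       roles = NULL,
       email = "")
)

# One row per author; vapply() guarantees one string per author
author_table <- data.frame(
  surname = vapply(authors_mock, function(a) a$name$surname, character(1)),
  given   = vapply(authors_mock, function(a) a$name$given, character(1)),
  email   = vapply(authors_mock, function(a) a$email, character(1)),
  stringsAsFactors = FALSE
)

# Treat empty email strings as missing values
author_table$email[author_table$email == ""] <- NA_character_
```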

Info

The info item lists the filename, title, description (abstract), keywords, doi, and submission info. Grobid sometimes makes mistakes with the DOI (in the output below, it has picked up trailing text from the journal URL), so be cautious about using this.

paper$info
#> $filename
#> [1] "xml/0956797615588467.xml"
#> 
#> $title
#> [1] "Affective Mechanisms Predict Microlending"
#> 
#> $description
#> [1] "Humans sometimes share with others whom they may never meet or know, in violation of the dictates of pure selfinterest. Research has not established which neuropsychological mechanisms support lending decisions, nor whether their influence extends to markets involving significant financial incentives. In two studies, we found that neural affective mechanisms influence the success of requests for microloans. In a large Internet database of microloan requests (N = 13,500), we found that positive affective features of photographs promoted the success of those requests. We then established that neural activity (i.e., in the nucleus accumbens) and self-reported positive arousal in a neuroimaging sample (N = 28) predicted the success of loan requests on the Internet, above and beyond the effects of the neuroimaging sample's own choices (i.e., to lend or not). These findings suggest that elicitation of positive arousal can promote the success of loan requests, both in the laboratory and on the Internet. They also highlight affective neuroscience's potential to probe neuropsychological mechanisms that drive microlending, enhance the effectiveness of loan requests, and forecast market-level behavior."
#> 
#> $keywords
#> [1] "affect"       "accumbens"    "microlending" "preference"   "fMRI"        
#> [6] "prosocial"    "human"       
#> 
#> $doi
#> [1] "10.1177/0956797615588467pss.sagepub."
#> 
#> $submission
#> [1] "Received 2/2/15; Revision accepted 5/4/15"
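
One way to catch a mangled DOI like the one above is to compare it with a value you trust. The sketch below is base R only and rests on a loud assumption: that for this set of files the DOI is the prefix "10.1177/" followed by the file name. That holds for this example but is not guaranteed in general. The variable names are hypothetical.

```r
doi  <- "10.1177/0956797615588467pss.sagepub."  # $doi value shown above
name <- "0956797615588467"                      # paper$name shown earlier

# Assumption: DOI = "10.1177/" + file name (true for this example only)
expected_doi <- paste0("10.1177/", name)

# Does the extracted DOI match exactly?
doi_ok <- identical(doi, expected_doi)

# If not, does it at least start with the expected DOI (i.e. trailing junk)?
doi_has_trailing_junk <- !doi_ok && startsWith(doi, expected_doi)
```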

References

The references item contains the entries in the reference list, including an id to link them to citations (bib_id), the DOI if available (doi), and the full reference text (ref).

ref <- paper$references

dplyr::filter(ref, bib_id == "b5")
#>   bib_id                        doi
#> 1     b5 10.1016/j.jcps.2011.05.001
#>                                                                                                                           ref
#> 1 A neural predictor of cultural popularity GSBerns SEMoore 10.1016/j.jcps.2011.05.001 Journal of Consumer Psychology 22 2012

Citations

The citations item contains each citation, including an id to link it to the reference list (bib_id) and the sentence in which it is cited (text).

cite <- paper$citations

dplyr::filter(cite, bib_id == "b5")
#>   bib_id
#> 1     b5
#> 2     b5
#> 3     b5
#>                                                                                                                                                                                                                                                                                                                  text
#> 1                                                                                                                                 Stimulus sample size was determined via power analysis of the sole existing similar study, which used neural activity to predict Internet downloads of music (Berns & Moore, 2012).
#> 2                                                       Following the approach of Berns and Moore (2012), we calculated correlations between Internet loan-request success and anticipatory activity in regions drawn from targeted volumes of interest (i.e., NAcc, AIns, and MPFC) as well as whole-brain analyses.
#> 3 For instance, investigators have used group NAcc activity in response to music to predict the aggregate number of song downloads 2 years later (Berns & Moore, 2012) and have used group MPFC activity to predict call volume in response to antismoking advertisements (Falk, Berkman, Whalen, & Lieberman, 2011).
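
Since both tables share bib_id, they can be joined to see which reference each citing sentence points to. This is a minimal base R sketch using mock tables with the columns described above. The "b5" DOI is copied from the output, while the "b99" row and the shortened sentences are made up.

```r
# Mock reference table (the "b99" row is invented for illustration)
ref_mock <- data.frame(
  bib_id = c("b5", "b99"),
  doi    = c("10.1016/j.jcps.2011.05.001", NA),
  stringsAsFactors = FALSE
)

# Mock citation table (sentences are invented for illustration)
cite_mock <- data.frame(
  bib_id = c("b5", "b5", "b99"),
  text   = c("First sentence citing b5.",
             "Second sentence citing b5.",
             "A sentence citing b99."),
  stringsAsFactors = FALSE
)

# One row per citing sentence, with the DOI of the cited reference attached
cited <- merge(cite_mock, ref_mock, by = "bib_id")

# How many times each reference is cited
n_cited <- table(cite_mock$bib_id)
```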

Full Text

The full_text item is a table containing each sentence from the main text (text). The heading text (header) is used to automatically determine if the section is abstract, intro, method, results, or discussion. Each section has a unique sequential div number, and each paragraph (p) within the section and each sentence (s) within each paragraph are also sequentially numbered (e.g., div = 1, p = 2, s = 3 is the third sentence of the second paragraph of the first section after the abstract).

paper$full_text |> names()
#> [1] "text"    "section" "header"  "div"     "p"       "s"       "id"
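
To make the div, p, and s numbering concrete, here is a small base R sketch on a mock table that has the same column names as above. All values in full_text_mock are made up.

```r
# Mock table with the column names shown above; every value is invented
full_text_mock <- data.frame(
  text    = c("Sentence A.", "Sentence B.", "Sentence C.", "Sentence D."),
  section = c("intro", "intro", "intro", "method"),
  header  = c("Introduction", "Introduction", "Introduction", "Method"),
  div     = c(1, 1, 1, 2),
  p       = c(2, 2, 2, 1),
  s       = c(1, 2, 3, 1),
  id      = "mock_paper",
  stringsAsFactors = FALSE
)

# Third sentence of the second paragraph of the first section
third <- subset(full_text_mock, div == 1 & p == 2 & s == 3)
third$text

# Number of sentences per section type
section_counts <- table(full_text_mock$section)
```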

The search_text() function helps you search the text of a paper or list of papers.

The default arguments give you a data frame containing a row for every sentence in every paper in the set. The data frame has the same column structure as the full_text table above, so that you can easily chain text searches.

all_sentences <- search_text(papers)

You can customise search_text() to return paragraphs or sections instead of sentences. The section column contains the automatically classified section type, one of “abstract”, “intro”, “method”, “results”, or “discussion” (this can be inaccurate if grobid doesn’t detect headers or the header text doesn’t obviously fall into one of these categories).

method_paragraphs <- search_text(papers, section = "method", return = "paragraph")

A random paragraph from a method section.

#> [1] "SES. Family SES was assessed by measures at first contact when the TEDS twins were 18 months and when they were 7, 9, and 16 years old. At first contact (SES Index 1), age 7 (SES Index 2), and age 16 (SES Index 4), parents reported their highest educational qualifications and occupation. Educational qualifications were assessed on an 8-point scale from 1 (no formal education) to 8 (postgraduate qualifications). Occupation was inferred on the basis of a standard classification (Office of Population and Census Surveys, 1991) that used reports of employment status, job title, employment type, and whether parents needed special qualifications for their role. At 9 years (SES Index 3) and 16 years, family income was assessed; parents reported their annual household income before tax on an 11-point scale from 1 (under £4,500) to 11 (more than £100,000). For SES Indexes 1 through 3, standardized mean scores were calculated and averaged at each assessment age. Previous analyses of these data showed that correlations (r) between these estimates were .77 for SES Index 1 and 2, .55 for SES Indexes 1 and 3, and .57 for SES Indexes 2 and 3 (Hanscombe et al., 2012). Correlations with SES Index 4 have not been previously reported and were .70 with SES Index 1, .76 with SES Index 2, and .65 with SES Index 6 in the current analysis sample. We summed the four indexes to achieve one SES composite score for each participant."

Pattern

You could manually code every sentence or paragraph in a set of papers, but this is rarely efficient, so you can use a search pattern to filter the text first.

search <- search_text(papers, pattern = "Scotland")

Here we have 4 results. We’ll just show the paper id and text columns of the returned table, but the table also provides the section type, header, and section, paragraph, and sentence numbers (div, p, and s).

Chaining

You can chain together searches to iteratively narrow down results.

search <- papers |>
  search_text("DeBruine") |>
  search_text("2006")
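
Conceptually, chaining two searches keeps only the text that matches both patterns. The base R sketch below illustrates that idea on made-up sentences. It does not use papercheck and is not a description of how search_text() is implemented.

```r
# Made-up sentences for illustration
sentences <- c(
  "DeBruine (2006) reported an effect.",
  "DeBruine (2002) reported a different effect.",
  "Another study appeared in 2006."
)

# Step 1: keep sentences matching the first pattern
step1 <- sentences[grepl("DeBruine", sentences)]

# Step 2: narrow those results with the second pattern
step2 <- step1[grepl("2006", step1)]

# Equivalent single step: require both patterns at once
both <- sentences[grepl("DeBruine", sentences) & grepl("2006", sentences)]
```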

Regex

You can also use regular expressions to refine your search. The pattern below returns every sentence that contains either “Scotland” or “Scottish”.

search <- search_text(papers, pattern = "(Scotland|Scottish)")
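
Before running a pattern over a large set of papers, it can help to try it on a few strings in base R. The sketch below checks the alternation pattern with grepl(). The example sentences are made up, and compact is a hypothetical shorter pattern that matches the same two words.

```r
pattern <- "(Scotland|Scottish)"

# Made-up test sentences
examples <- c(
  "Participants were recruited in Scotland.",
  "A Scottish sample was used.",
  "Data were collected in Wales."
)

# TRUE where the sentence contains either word
hits <- grepl(pattern, examples)

# A more compact pattern that shares the common prefix
compact <- "Scot(land|tish)"
hits_compact <- grepl(compact, examples)
```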

Match

You can return just the matching text for a regular expression by setting return to “match”. This pattern searches for text like “p > .25” or “p>0.01”.

match <- search_text(papers, 
                     pattern = "p\\s*>\\s*0?\\.[0-9]+\\b", 
                     return = "match")
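
To see what this pattern extracts, you can test it on a few strings with base R. In the sketch below, the first example is shortened from the expanded result shown in this vignette and the others are made up. perl = TRUE is used here so that \\s and \\b are handled by the PCRE engine. Whether search_text() uses the same engine is not something this sketch assumes.

```r
pattern <- "p\\s*>\\s*0?\\.[0-9]+\\b"

examples <- c(
  "No main effects were found (p > .29).",
  "The effect was significant, p < .001.",
  "There was no difference, p>0.05, between groups."
)

# Position of the first match in each string (-1 where there is none)
m <- regexpr(pattern, examples, perl = TRUE)

# Extract only the matching text; strings without a match are dropped
matches <- regmatches(examples, m)
```

Note that the pattern only matches "greater than" comparisons, so the second example is skipped.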

You can expand this to the whole sentence, paragraph, or +/- some number of sentences around the match using expand_text().

expand <- expand_text(results_table = match, 
                      paper = papers,
                      expand_to = "sentence",
                      plus = 0,
                      minus = 0)

expand$expanded[1]
#> [1] "No main effects or interactions with time were found (p > .29), which indicates that the action-specific effects of TMS on confidence are not specific to its delivery before or after a perceptual decision."
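
The idea behind plus and minus can be illustrated with a small base R sketch: take the sentence containing the match, then add a number of neighbouring sentences before and after it. expand_window() is a hypothetical helper written for this illustration, not papercheck's implementation, and the sentences are made up.

```r
# Made-up sentences from one paragraph, numbered like the s column
para <- data.frame(
  s    = 1:5,
  text = paste("Sentence", 1:5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: join the matched sentence with its neighbours
expand_window <- function(tbl, match_s, plus = 0, minus = 0) {
  keep <- tbl$s >= (match_s - minus) & tbl$s <= (match_s + plus)
  paste(tbl$text[keep], collapse = " ")
}

# Only the sentence containing the match
only_match <- expand_window(para, match_s = 3)

# One sentence of context on each side
with_context <- expand_window(para, match_s = 3, plus = 1, minus = 1)
```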