Papercheck

https://scienceverse.github.io/talks/2025-fdsai-ethics/

Lisa DeBruine

Abstract

Papercheck is a tool that leverages text search, code, and large language models to extract and supplement information from scientific documents (including manuscripts, submitted or published articles, or preregistration documents) and provides automated suggestions for improvement.

Inspired by practices in software development, where automated checks (e.g., CRAN checks for R packages) are used to identify issues before release, Papercheck aims to screen scientific manuscripts to identify potential issues or areas for improvement and guide researchers in adopting best practices. It can also assist with processing large numbers of papers for metascientific enquiry.

The Problem

Best Practices are Rapidly Evolving

Un-FAIR Meta-Data

  • All research outputs should be FAIR
  • PDFs are where data goes to die
  • Meta-data use cases:
    • facilitating meta-analyses
    • improving the re-use of reliable measures
    • meta-scientific research

Solutions

Checklists?

Reporting guidelines, such as CONSORT, PRISMA, and JARS, often provide extensive checklists.

  • Time-consuming
  • Requires expertise
  • Can be vague
  • Who checks the checklist?

Automated Checks

  • Time-efficient
  • Requires less expertise
  • Reproducible
  • Generates machine-readable metadata

Automation Strategies

Grobid: machine-learning software for extracting structured information from scholarly documents
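As a minimal sketch of this extraction step (the helper names pdf2grobid() and read(), and the filename, are assumptions here; check the papercheck documentation for the current interface), a PDF is sent to a Grobid server, which returns structured TEI XML that can then be loaded as a paper object:

library(papercheck)

# Hypothetical helper names: convert a PDF to Grobid TEI XML,
# then load that XML as a paper object for further processing
xml <- pdf2grobid("debruine-child.pdf")
paper <- read(xml)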

And then…

Text Search

Code

LLM
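As a small sketch of the text-search strategy (using the psychsci example papers that appear in the module demos below; the regex pattern is illustrative), a single search_text() call returns the matching text along with its document location:

library(papercheck)

# Find sentences mentioning preregistration in the example papers
# (the pattern is just an illustration)
prereg <- search_text(psychsci, pattern = "preregist")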

R Package

LLM

query <- 'How many subjects were in the studies in total? 
Return your answer in JSON format giving the total and 
any subgroupings by gender, e.g.:
{"total": 100, "men": 42, "women": 58}
Only return valid JSON, no notes.'

llm_subjects <- papers |> 
  search_text("\\d+", section = "method") |>
  search_text(return = "section") |>
  llm(query)

llm_subjects |> json_expand()
id             answer                                 total men women
debruine-child {"total": 71, "men": 32, "women": 39}     71   32    39
debruine-fret  {"total": 48, "men": 24, "women": 24}     48   24    24
debruine-sex   {"total": 136, "men": 86, "women": 50}   136   86    50
debruine-tnl   {"total": 144, "men": 66, "women": 78}   144   66    78

LLM Models

papercheck::llm_model_list()
   id                                            owned_by      created    context_window
1  meta-llama/llama-4-scout-17b-16e-instruct     Meta          2025-04-05         131072
3  llama-3.1-8b-instant                          Meta          2023-09-03         131072
4  meta-llama/llama-prompt-guard-2-22m           Meta          2025-05-30            512
5  meta-llama/llama-4-maverick-17b-128e-instruct Meta          2025-04-05         131072
6  meta-llama/llama-prompt-guard-2-86m           Meta          2025-05-30            512
7  meta-llama/llama-guard-4-12b                  Meta          2025-05-08         131072
9  llama-3.3-70b-versatile                       Meta          2024-12-06         131072
10 qwen/qwen3-32b                                Alibaba Cloud 2025-05-28         131072
11 moonshotai/kimi-k2-instruct                   Moonshot AI   2025-07-13         131072
12 openai/gpt-oss-20b                            OpenAI        2025-08-05         131072
13 playai-tts-arabic                             PlayAI        2025-02-27           8192
14 groq/compound-mini                            Groq          2025-09-04         131072
15 playai-tts                                    PlayAI        2025-02-27           8192
16 openai/gpt-oss-120b                           OpenAI        2025-08-05         131072
17 allam-2-7b                                    SDAIA         2025-01-23           4096
18 moonshotai/kimi-k2-instruct-0905              Moonshot AI   2025-09-05         262144
19 groq/compound                                 Groq          2025-09-04         131072
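Assuming llm() accepts a model argument for picking one of these ids (an assumption; check the docs for how the default model is configured), the earlier pipeline could target a specific model:

# A sketch: the model argument name is an assumption
llm_subjects <- papers |>
  search_text("\\d+", section = "method") |>
  search_text(return = "section") |>
  llm(query, model = "llama-3.3-70b-versatile")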

Modules

module_list()
 * all_p_values: List all p-values in the text, returning the matched text (e.g., 'p = 0.04') and document location in a table.
 * all_urls: List all the URLs in the main text.
 * aspredicted: Get data from AsPredicted preregistrations in a structured way
 * effect_size: Detect t-tests and F-tests with missing effect sizes
 * exact_p: List any p-values reported with insufficient precision (e.g., p < .05 or p = n.s.)
 * marginal: List all sentences that describe an effect as 'marginally significant'.
 * miscitation: Check for frequently miscited papers. This module is just a proof of concept -- the miscite database is not yet populated with real examples.
 * osf_check: List all OSF links and whether they are open, closed, or do not exist.
 * power: Find power analyses and return their components.
 * ref_consistency: Check if all references are cited and all citations are referenced
 * retractionwatch: Flag any cited papers in the RetractionWatch database
 * statcheck: Check consistency of p-values and test statistics
 
Use `module_help("module_name")` for help with a specific module
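Any module from this list can be run directly on a paper or set of papers; for example (a sketch using the exact_p module and the psychsci example papers, returning a results table like those shown below):

# Look up a module's documentation, then run it on the example papers
module_help("exact_p")
imprecise_p <- module_run(paper = psychsci, module = "exact_p")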

Modules: Effect Sizes

mod <- module_run(
  paper = psychsci,
  module = "effect_size"
)
id               ttests_with_es ttests_without_es Ftests_with_es Ftests_without_es
0956797613520608              0                 0              5                 0
0956797614522816              0                 5             20                 0
0956797614527830              0                 0              0                 0
0956797614557697              0                 1              5                 0
0956797614560771              2                 0              0                 0
0956797614566469              0                 0              0                 0
0956797615569001              1                 1              0                 0
0956797615569889              0                 0             12                 0
0956797615583071              6                 4              2                 2
0956797615588467              4                 3              0                 1
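Because the module returns an ordinary table, it can be summarised with standard tools (a sketch assuming the results table is accessible as mod$table; that accessor is an assumption):

library(dplyr)

# Proportion of tests reported without an effect size across papers
# (mod$table as the results table is an assumption)
mod$table |>
  summarise(
    t_without_es = sum(ttests_without_es) /
      sum(ttests_with_es + ttests_without_es),
    F_without_es = sum(Ftests_without_es) /
      sum(Ftests_with_es + Ftests_without_es)
  )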

Modules: Power

mod <- module_run(
  paper = psychsci,
  module = "power"
)
   test            sample alpha power   es es_metric
2                      12  0.05  0.95 1.15 unstandardised
3                      16  0.05  0.80 0.76
4  unpaired t-test     24  0.05  0.80 0.60 unstandardised
5                      NA    NA    NA 1.10 Cohen’s d
7                      15    NA  0.90 0.25 Cohen’s d
8  one-way ANOVA       24  0.05  0.80   NA
10                     10  0.05  0.95 1.36
11                     15    NA  0.80 0.40 Cohen’s d
12 two-way ANOVA       NA  0.05  0.80   NA
13                     52  0.05  0.80 0.40 unstandardised

Select relevant text

sample_llm <- paper |>
  search_text(
    pattern = "power",
    section = "method", 
    return = "paragraph"
  ) |> 
  search_text(
    pattern = "[0-9], 
    return = "paragraph"
  ) |> 
  distinct(id, .keep_all = TRUE)

LLM Instructions

An a priori power analysis is used to estimate the required sample size to achieve a desired level of statistical power given an effect size, statistical test and alpha level.

If the paragraph DOES describe an a priori power analysis, extract ONLY the following information and return it as JSON, use this exact schema:

{
  "apriori": true,
  "test": "one-way ANOVA",
  "sample": 64,
  "alpha": 0.05,
  "power": 0.8,
  "es": 0.4,
  "es_metric": "Cohen\'s f"
}

LLMs Need Specific Rules

  • Return “apriori”: false if this is NOT an a priori power analysis, true if it is.
  • Do NOT classify paragraphs as a priori if they only report achieved power for an existing sample size.
  • If information is missing or unclear, leave it empty.
  • Use only the exact labels listed for “test” and “es_metric”.
  • Ignore whether the test is one-sided or two-sided.
  • If ANOVA is used, specify one-way or two-way.

Rules

  • For “test”: Use ONLY these exact strings (case-sensitive). Choose the closest match or null if unclear/unsupported. Ignore one-sided vs. two-sided.
- "paired t-test"
- "unpaired t-test"
- "one-sample t-test"
- "one-way ANOVA"
- "two-way ANOVA"
- "MANOVA"
- "regression"
- "chi-square"
- "correlation"
- "other"
- null (if no test mentioned or unclear)

Rules

  • For “es_metric”: Use ONLY these exact strings (case-sensitive) or “unstandardised” for raw/non-standardized effects (e.g., means, proportions). Use null if missing/unclear: “Cohen's d”, “Hedges' g”, “Cohen's f”, “partial eta squared”, “eta squared”, “unstandardised”
  • Do NOT guess values.

Return only valid JSON format, starting with { and ending with }.
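Following the same pattern as the subjects example earlier, the selected paragraphs are then sent to the LLM with these instructions and the JSON answers expanded into columns (a sketch assuming the full instruction and rules text above is stored in a string called power_query):

# power_query holds the instruction and rules text above (assumed name)
power_llm <- sample_llm |>
  llm(power_query) |>
  json_expand()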

Promoting Adoption

Center for Open Science

Caveats

  • Validation
  • Sustainability
  • AI Reproducibility
  • Inappropriate Use

Thank You!

papercheck - download the package or submit issues

VeriSci - join a community to create or test modules

@debruine - see what else I’m up to