In this blog post we explain how Papercheck can automatically check the contents of data repositories linked in a scientific manuscript, using papercheck's functions for exploring OSF repositories to build a custom module.
There is increasing awareness of the importance of open science practices, and widespread support among scientists for practices such as data and code sharing (Ferguson et al. 2023). Because data and code sharing are relatively new practices, and many scientists lack training in open science, badly documented data repositories are common. Best practices exist, such as the TIER protocol, but not all researchers are aware of them.
At a minimum, a data repository should contain a README file with instructions for how to reproduce the results. If data is shared, it should be stored in a 'data' folder, or at least have the word 'data' in the filename. Code or scripts should similarly be shared in a folder with that name, or at least with the word in the filename. Finally, if data is shared, there should be a codebook or data dictionary that explains which variables the dataset contains, so that others can re-use the data. Although it is easy to forget to organize a data repository well, it is also easy to check automatically. Here we demonstrate how Papercheck can check whether a README is present, whether data and/or code are shared, and whether there is a codebook.
Ideally, peer reviewers or editors would check the contents of a data repository. In practice, time constraints mean that no one actually checks what is in it. Automation can perform some of the checks that peers might otherwise perform manually. We illustrate four such checks: 1) is any shared data clearly labeled as such, 2) is any shared code clearly labeled as such, 3) is there a README file that explains to potential users which files are shared, where they can be found in the repository, and how they can be used to reproduce any reported results, and 4) is there a codebook or data dictionary?
Checking an OSF repository with Papercheck
We will illustrate the process of checking a data repository by focusing on projects on the Open Science Framework (OSF). For this illustration we use an open access paper published in Psychological Science that has already been converted to a papercheck object using GROBID. There are 250 open access papers in the papercheck object psychsci; we will choose one for this example.
# paper to use in this example
paper <- psychsci[[250]]
Set up OSF functions
You can only make 100 OSF API requests per hour, unless you authorise your requests, in which case you can make 10,000 requests per day. The OSF functions in papercheck often make several requests per URL to get all of the information, so it is worthwhile setting your personal access token (PAT). You can authorise your requests by creating an OSF token at https://osf.io/settings/tokens and adding the following line to your .Renviron file (which you can open with usethis::edit_r_environ()):
OSF_PAT="replace-with-your-token-string"
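After restarting R, you can confirm that the token is visible to your session; this is a quick sanity check using base R (an empty string means the token was not found).
# check that the OSF token is available in this R session
Sys.getenv("OSF_PAT")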
The OSF API server is down fairly often, so it is good to check it before you run a batch of OSF functions; we provide the function osf_api_check() for this. When the server is down, it can take several seconds to return an error, so a script that checks many URLs can run for a long time before you realise it isn't working.
osf_api_check()
#> [1] "ok"
Find OSF Links
We start by searching for OSF URLs using the search_text() function. OSF links can be tricky to find in PDFs: PDF conversion can insert spaces in odd places, and view-only links that contain a ? are often interpreted as being split across sentences. The osf_links() function is our best attempt at catching and fixing them all.
links <- osf_links(paper)
text | section |
---|---|
osf.io/hv29w | method |
osf.io/2es6n | method |
osf.io/jpm5a | method |
osf.io/aux7s | method |
osf.io/nw3mc | method |
osf.io/ks639 | method |
osf.io/y75nu | method |
OSF.IO/4TYM7 | funding |
OSF.IO/X4T9A | funding |
Retrieve Link Info
If a link is valid, it is processed, and the OSF Application Programming Interface (API) is used to determine whether the link points to a file, project, or registration. This is achieved with the osf_retrieve() function. The function can take a vector of OSF IDs or URLs, or a table that contains them; if the IDs aren't in the first column, you will need to specify the name of the column that holds them. It returns your table with added information. (You can quiet the output messages with verbose(FALSE).)
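For example, the function also accepts IDs passed directly as a character vector (a minimal sketch using one of the IDs found above):
# look up a single OSF ID without building a table first
single_info <- osf_retrieve("hv29w")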
If you set the argument recursive = TRUE, osf_retrieve() will also retrieve all child components, files, and folders. If there are duplicate IDs, it will only get the contents of each item once. If you set the argument find_project = TRUE, it will also look up the parent project of each link (but this requires more API calls).
info <- osf_retrieve(links, recursive = TRUE, find_project = TRUE)
#> Starting OSF retrieval for 9 files...
#> * Retrieving info from hv29w...
#> * Retrieving info from 2es6n...
#> * Retrieving info from jpm5a...
#> * Retrieving info from aux7s...
#> * Retrieving info from nw3mc...
#> * Retrieving info from ks639...
#> * Retrieving info from y75nu...
#> * Retrieving info from 4tym7...
#> * Retrieving info from x4t9a...
#> ...Main retrieval complete
#> Starting retrieval of children...
#> * Retrieving children for x4t9a...
#> * Retrieving files for x4t9a...
#> * Retrieving files for 6621454e716cb7048fa45a2a...
#> * Retrieving files for 6293d1cab59d5f1df8720db5...
#> * Retrieving files for 6293d1bfbbdcde278f4269ed...
#> * Retrieving files for 6293d1c5bbdcde278f4269f7...
#> * Retrieving files for 6339c10031d65306e12de5a2...
#> * Retrieving files for 6293d2e3b59d5f1df0720c6b...
#> * Retrieving files for 64f0ac666d1e8905f21516b2...
#> * Retrieving files for 64f0ab59f3dcd105d7ddd40b...
#> ...OSF retrieval complete!
osf_id | name | osf_type | project |
---|---|---|---|
hv29w | Kim,Doeller_Prereg_OSF.pdf | files | x4t9a |
2es6n | suppleVideo1_learnSph.mp4 | files | x4t9a |
jpm5a | suppleVideo2_learnPlane.mp4 | files | x4t9a |
aux7s | suppleVideo3_objlocSph.mp4 | files | x4t9a |
nw3mc | suppleVideo4_objlocPlane.mp4 | files | x4t9a |
ks639 | suppleVideo5_triangleSph.mp4 | files | x4t9a |
y75nu | suppleVideo6_trianglePlane.mp4 | files | x4t9a |
4tym7 | Cognitive maps for a spherical surface | registrations | x4t9a |
x4t9a | Cognitive maps for a spherical surface | nodes | x4t9a |
There are multiple OSF links in this paper, but they are all part of the same overarching OSF project, with the project ID x4t9a.
Summarize Contents
The OSF allows you to assign a category to each component, and we can also infer file types from file extensions.
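For intuition, the sketch below shows the kind of extension-based classification involved; the mapping is purely illustrative and is not papercheck's actual implementation.
# illustrative extension-to-filetype mapping (not papercheck's internal rules)
guess_filetype <- function(filename) {
  ext <- tolower(tools::file_ext(filename))
  dplyr::case_when(
    ext %in% c("csv", "tsv", "sav", "xlsx") ~ "data",
    ext %in% c("r", "rmd", "m", "py", "mat") ~ "code",
    ext %in% c("txt", "md", "pdf", "docx") ~ "text",
    ext %in% c("mp4", "avi", "mov") ~ "video",
    ext %in% c("zip", "tar", "gz") ~ "archive",
    .default = NA_character_
  )
}
guess_filetype("rawdata_sph_triangle.csv") # "data"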
osf_id | name | filetype |
---|---|---|
hv29w | Kim,Doeller_Prereg_OSF.pdf | text |
2es6n | suppleVideo1_learnSph.mp4 | video |
jpm5a | suppleVideo2_learnPlane.mp4 | video |
aux7s | suppleVideo3_objlocSph.mp4 | video |
nw3mc | suppleVideo4_objlocPlane.mp4 | video |
ks639 | suppleVideo5_triangleSph.mp4 | video |
y75nu | suppleVideo6_trianglePlane.mp4 | video |
662145f3716cb7048fa45a58 | virtualizerStudy1-main.zip | archive |
662146fa8df04804d3177e59 | ReadMe.txt | text |
64f0aaf4d9f2c905a0d04821 | main_analyseTriangleComple_20230423.m | code |
64f0aaf7f3dcd105dbddd396 | main_simulate_objlocTraj.m | code |
64f0aaf9989de605c3dd152a | supple_learningTrajectory.m | code |
6293d2d5b59d5f1df7720e41 | poweranalysis_sph.R | code |
6614017dc053943058b4d41c | supple_sphWithVariousRadius_clean.m | code |
6293d2d5b59d5f1df0720c66 | main_analyseObjLocTest.m | code |
6293d26c86324127ca5b5862 | suppleVideo2_learnPlane.mp4 | video |
6293d29eb7e8c726edc2dc38 | suppleVideo3_objlocSph.mp4 | video |
6293d2a1bbdcde278b42696a | suppleVideo6_trianglePlane.mp4 | video |
6293d271ddbe49279ba215f6 | suppleVideo1_learnSph.mp4 | video |
6293d275bbdcde278f426b25 | suppleVideo4_objlocPlane.mp4 | video |
6293d282ddbe49279aa21548 | suppleVideo5_triangleSph.mp4 | video |
6293d1e8b59d5f1df7720d4c | suppleMovie_legend.txt | text |
6293d310b7e8c726ecc2db7d | sumDemograph.csv | data |
6293d30fb7e8c726ecc2db79 | rawdata_plane_triangle.csv | data |
6293d30ebbdcde278f426c49 | rawdata_sph_objlocTest.csv | data |
6293d310b59d5f1df3720cf4 | rawdata_sph_triangle.csv | data |
661402d9e65c603b737d9c10 | cleanData_combine.mat | code |
661402b4943bee32eadfebdd | pilotData_triangle_combine_clean.csv | data |
6293d30d86324127d25b5c27 | rawdata_plane_objlocIdentity.csv | data |
6293d30d86324127ce5b5a91 | rawdata_sph_objlocIdentity.csv | data |
6293d30db7e8c726edc2dcb7 | rawdata_plane_objlocTest.csv | data |
6339c121ec7f3f0704f5fbf0 | Kim,Doeller_Prereg_OSF.pdf | text |
64f0ab2ff3dcd105c9ddd4fc | findShortcut.m | code |
6293d2f5bbdcde278f426c19 | sph2cartFn.m | code |
6293d2f6ddbe49279ba2164b | drawGeodesic.m | code |
6293d2f786324127ca5b5879 | sph2cartMKunity.m | code |
6293d2f786324127d25b5bf3 | translateOnSphere.m | code |
6293d2f6ddbe49279da216eb | northVecFn.m | code |
6293d2f9b7e8c726edc2dc9b | ttestplotMK2.m | code |
6293d2f5b7e8c726f0c2de20 | cart2sphFn.m | code |
6293d2f486324127ca5b5875 | rotAroundU.m | code |
64f0acb06c0f5a0647d057e8 | psub11_objLearn_Sph_traj.tsv | data |
64f0acbfd9f2c905a0d048ba | psub16_objLearn_Plane_traj.tsv | data |
64f0aca2f3dcd105daddd527 | psub06_objLearn_Sph_traj.tsv | data |
64f0acdb989de605c2dd1598 | psub25_objLearn_Plane_traj.tsv | data |
64f0ac9bf3dcd105d7ddd449 | psub04_objLearn_Plane_traj.tsv | data |
64f0acc2d9f2c905a5d04a36 | psub17_objLearn_Sph_traj.tsv | data |
64f0acab6c0f5a064fd058e4 | psub10_objLearn_Plane_traj.tsv | data |
64f0aca46c0f5a0650d059a5 | psub08_objLearn_Plane_traj.tsv | data |
64f0acba6d1e8905f21516d4 | psub14_objLearn_Sph_traj.tsv | data |
64f0aca9f3dcd105dbddd497 | psub09_objLearn_Sph_traj.tsv | data |
64f0ace16c0f5a0650d059c8 | psub27_objLearn_Plane_traj.tsv | data |
64f0acf2d9f2c905a0d048d2 | psub32_objLearn_Sph_traj.tsv | data |
64f0acef6d1e8905ee1515e2 | psub32_objLearn_Plane_traj.tsv | data |
64f0accd6c0f5a064fd058f0 | psub21_objLearn_Sph_traj.tsv | data |
64f0acb36c0f5a0650d059ab | psub12_objLearn_Sph_traj.tsv | data |
64f0ad06989de605badd1471 | psub41_objLearn_Plane_traj.tsv | data |
64f0acd6f3dcd105d3ddd39b | psub23_objLearn_Plane_traj.tsv | data |
64f0ad0ef3dcd105dbddd4d7 | psub44_objLearn_Sph_traj.tsv | data |
64f0acc3989de605c3dd1622 | psub17_objLearn_Plane_traj.tsv | data |
64f0aca9f3dcd105dbddd497 | psub09_objLearn_Sph_traj.tsv | data |
64f0aca1f3dcd105daddd525 | psub06_objLearn_Plane_traj.tsv | data |
64f0acfcf3dcd105d7ddd456 | psub37_objLearn_Plane_traj.tsv | data |
64f0aceb6d1e8905f315176a | psub30_objLearn_Sph_traj.tsv | data |
64f0acd66c0f5a064fd058f4 | psub23_objLearn_Sph_traj.tsv | data |
64f0ac98989de605c2dd1576 | psub03_objLearn_Plane_traj.tsv | data |
64f0ad0bf3dcd105d7ddd45a | psub42_objLearn_Sph_traj.tsv | data |
64f0ad046c0f5a0650d059d6 | psub40_objLearn_Sph_traj.tsv | data |
64f0ac9f6c0f5a064fd058de | psub05_objLearn_Sph_traj.tsv | data |
64f0ad03989de605bedd1514 | psub40_objLearn_Plane_traj.tsv | data |
64f0aca9f3dcd105dbddd497 | psub09_objLearn_Sph_traj.tsv | data |
64f0acd2989de605c2dd158e | psub22_objLearn_Sph_traj.tsv | data |
64f0ad126c0f5a064cd058bb | psub47_objLearn_Plane_traj.tsv | data |
64f0acc9f3dcd105dbddd4a6 | psub20_objLearn_Plane_traj.tsv | data |
64f0ad116d1e8905ee1515ee | psub46_objLearn_Sph_traj.tsv | data |
64f0ad14d9f2c905a5d04a70 | psub47_objLearn_Sph_traj.tsv | data |
64f0ad016d1e8905f3151778 | psub38_objLearn_Sph_traj.tsv | data |
64f0acac6c0f5a0650d059a9 | psub10_objLearn_Sph_traj.tsv | data |
64f0acb8f3dcd105dbddd49e | psub14_objLearn_Plane_traj.tsv | data |
64f0acc6d9f2c905a4d049b3 | psub19_objLearn_Sph_traj.tsv | data |
64f0acf8f3dcd105dbddd4cb | psub34_objLearn_Sph_traj.tsv | data |
64f0acfe989de605c3dd164d | psub37_objLearn_Sph_traj.tsv | data |
64f0acf36d1e8905f315176e | psub33_objLearn_Plane_traj.tsv | data |
64f0ace3f3dcd105dbddd4b8 | psub27_objLearn_Sph_traj.tsv | data |
64f0acbc6d1e8905ee1515d8 | psub15_objLearn_Plane_traj.tsv | data |
64f0acc9989de605bfdd160a | psub20_objLearn_Sph_traj.tsv | data |
64f0aca8d9f2c905a0d048b8 | psub09_objLearn_Plane_traj.tsv | data |
64f0ad00d9f2c905a4d049d6 | psub38_objLearn_Plane_traj.tsv | data |
64f0acb1f3dcd105d6ddd37c | psub12_objLearn_Plane_traj.tsv | data |
64f0ac98d9f2c905a4d04981 | psub03_objLearn_Sph_traj.tsv | data |
64f0acfd6d1e8905f3151776 | psub35_objLearn_Sph_traj.tsv | data |
64f0ad0fd9f2c905a5d04a6d | psub46_objLearn_Plane_traj.tsv | data |
64f0ace8989de605c3dd163f | psub30_objLearn_Plane_traj.tsv | data |
64f0acd36c0f5a0650d059c3 | psub22_objLearn_Plane_traj.tsv | data |
64f0acb7f3dcd105d7ddd44e | psub13_objLearn_Sph_traj.tsv | data |
64f0acc0989de605bfdd1603 | psub16_objLearn_Sph_traj.tsv | data |
64f0ac9bd9f2c905a5d04a2c | psub04_objLearn_Sph_traj.tsv | data |
64f0ad0ad9f2c9059dd0482b | psub42_objLearn_Plane_traj.tsv | data |
64f0acdc6d1e8905f2151737 | psub25_objLearn_Sph_traj.tsv | data |
64f0ac9fd9f2c905a4d04983 | psub05_objLearn_Plane_traj.tsv | data |
64f0ad0dd9f2c905a4d049d8 | psub44_objLearn_Plane_traj.tsv | data |
64f0acded9f2c905a4d049c8 | psub26_objLearn_Plane_traj.tsv | data |
64f0acf6d9f2c905a5d04a58 | psub34_objLearn_Plane_traj.tsv | data |
64f0acb4989de605c2dd1588 | psub13_objLearn_Plane_traj.tsv | data |
64f0ad08989de605c3dd1650 | psub41_objLearn_Sph_traj.tsv | data |
64f0aceb989de605c3dd1642 | psub31_objLearn_Plane_traj.tsv | data |
64f0acd9d9f2c905a4d049c3 | psub24_objLearn_Sph_traj.tsv | data |
64f0acbd6c0f5a064cd05891 | psub15_objLearn_Sph_traj.tsv | data |
64f0acaed9f2c905a4d049aa | psub11_objLearn_Plane_traj.tsv | data |
64f0ace5989de605c3dd163d | psub28_objLearn_Plane_traj.tsv | data |
64f0ace7d9f2c905a4d049cc | psub28_objLearn_Sph_traj.tsv | data |
64f0accd6d1e8905f3151765 | psub21_objLearn_Plane_traj.tsv | data |
64f0aca6f3dcd105daddd52a | psub08_objLearn_Sph_traj.tsv | data |
64f0acc66c0f5a064bd05865 | psub19_objLearn_Plane_traj.tsv | data |
64f0acf9f3dcd105d3ddd3a0 | psub35_objLearn_Plane_traj.tsv | data |
64f0acf56c0f5a0648d0581c | psub33_objLearn_Sph_traj.tsv | data |
64f0acd9d9f2c905a1d048ba | psub24_objLearn_Plane_traj.tsv | data |
64f0ab74989de605c2dd14eb | LICENSE | NA |
64f0ab746d1e8905ef1515e8 | mapface2edge.m | code |
64f0ab7c989de605c2dd14f0 | sortrowstol.m | code |
64f0ab7c6c0f5a0650d05905 | spheretri.m | code |
64f0ab7ff3dcd105dbddd3db | SphereTriTestCase.m | code |
64f0ab7f6d1e8905ee15157d | spheretribydepth.m | code |
64f0ab72989de605c3dd155d | istriequal.m | code |
64f0ab78d9f2c905a4d048f8 | README.md | text |
64f0ab78f3dcd105d6ddd35b | shrinkfacetri.m | code |
64f0ab6e989de605bedd146d | combvec.m | code |
64f0ab71f3dcd105d7ddd40c | isface.m | code |
64f0ab6fd9f2c905a5d04929 | icosahedron.m | code |
We can then use this information to determine, for each file, whether its name contains text that makes it easy to identify what is being shared. A simple regular expression search for 'README', 'codebook', 'script', and 'data' (allowing for the various ways these words can be written) is used to automatically detect what is shared.
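The sketch below illustrates the kind of name-based pattern matching involved; the patterns are ours, for illustration only, and papercheck's internal rules may differ.
# illustrative name-based categorisation (not papercheck's internal rules)
categorize_name <- function(name) {
  dplyr::case_when(
    grepl("read[ _-]?me", name, ignore.case = TRUE) ~ "readme",
    grepl("code[ _-]?book|data[ _-]?dictionary", name, ignore.case = TRUE) ~ "codebook",
    grepl("code|script|\\.(R|Rmd|m|py)$", name, ignore.case = TRUE) ~ "code",
    grepl("data|\\.(csv|tsv|sav)$", name, ignore.case = TRUE) ~ "data",
    .default = NA_character_
  )
}
categorize_name("ReadMe.txt") # "readme"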
osf_files_summary <- summarize_contents(info)
name | filetype | file_category |
---|---|---|
Kim,Doeller_Prereg_OSF.pdf | text | NA |
suppleVideo1_learnSph.mp4 | video | NA |
suppleVideo2_learnPlane.mp4 | video | NA |
suppleVideo3_objlocSph.mp4 | video | NA |
suppleVideo4_objlocPlane.mp4 | video | NA |
suppleVideo5_triangleSph.mp4 | video | NA |
suppleVideo6_trianglePlane.mp4 | video | NA |
virtualizerStudy1-main.zip | archive | NA |
ReadMe.txt | text | readme |
main_analyseTriangleComple_20230423.m | code | code |
main_simulate_objlocTraj.m | code | code |
supple_learningTrajectory.m | code | code |
poweranalysis_sph.R | code | code |
supple_sphWithVariousRadius_clean.m | code | code |
main_analyseObjLocTest.m | code | code |
suppleVideo2_learnPlane.mp4 | video | NA |
suppleVideo3_objlocSph.mp4 | video | NA |
suppleVideo6_trianglePlane.mp4 | video | NA |
suppleVideo1_learnSph.mp4 | video | NA |
suppleVideo4_objlocPlane.mp4 | video | NA |
suppleVideo5_triangleSph.mp4 | video | NA |
suppleMovie_legend.txt | text | NA |
sumDemograph.csv | data | data |
rawdata_plane_triangle.csv | data | data |
rawdata_sph_objlocTest.csv | data | data |
rawdata_sph_triangle.csv | data | data |
cleanData_combine.mat | code | code |
pilotData_triangle_combine_clean.csv | data | data |
rawdata_plane_objlocIdentity.csv | data | data |
rawdata_sph_objlocIdentity.csv | data | data |
rawdata_plane_objlocTest.csv | data | data |
Kim,Doeller_Prereg_OSF.pdf | text | NA |
findShortcut.m | code | code |
sph2cartFn.m | code | code |
drawGeodesic.m | code | code |
sph2cartMKunity.m | code | code |
translateOnSphere.m | code | code |
northVecFn.m | code | code |
ttestplotMK2.m | code | code |
cart2sphFn.m | code | code |
rotAroundU.m | code | code |
psub11_objLearn_Sph_traj.tsv | data | data |
psub16_objLearn_Plane_traj.tsv | data | data |
psub06_objLearn_Sph_traj.tsv | data | data |
psub25_objLearn_Plane_traj.tsv | data | data |
psub04_objLearn_Plane_traj.tsv | data | data |
psub17_objLearn_Sph_traj.tsv | data | data |
psub10_objLearn_Plane_traj.tsv | data | data |
psub08_objLearn_Plane_traj.tsv | data | data |
psub14_objLearn_Sph_traj.tsv | data | data |
psub09_objLearn_Sph_traj.tsv | data | data |
psub27_objLearn_Plane_traj.tsv | data | data |
psub32_objLearn_Sph_traj.tsv | data | data |
psub32_objLearn_Plane_traj.tsv | data | data |
psub21_objLearn_Sph_traj.tsv | data | data |
psub12_objLearn_Sph_traj.tsv | data | data |
psub41_objLearn_Plane_traj.tsv | data | data |
psub23_objLearn_Plane_traj.tsv | data | data |
psub44_objLearn_Sph_traj.tsv | data | data |
psub17_objLearn_Plane_traj.tsv | data | data |
psub09_objLearn_Sph_traj.tsv | data | data |
psub06_objLearn_Plane_traj.tsv | data | data |
psub37_objLearn_Plane_traj.tsv | data | data |
psub30_objLearn_Sph_traj.tsv | data | data |
psub23_objLearn_Sph_traj.tsv | data | data |
psub03_objLearn_Plane_traj.tsv | data | data |
psub42_objLearn_Sph_traj.tsv | data | data |
psub40_objLearn_Sph_traj.tsv | data | data |
psub05_objLearn_Sph_traj.tsv | data | data |
psub40_objLearn_Plane_traj.tsv | data | data |
psub09_objLearn_Sph_traj.tsv | data | data |
psub22_objLearn_Sph_traj.tsv | data | data |
psub47_objLearn_Plane_traj.tsv | data | data |
psub20_objLearn_Plane_traj.tsv | data | data |
psub46_objLearn_Sph_traj.tsv | data | data |
psub47_objLearn_Sph_traj.tsv | data | data |
psub38_objLearn_Sph_traj.tsv | data | data |
psub10_objLearn_Sph_traj.tsv | data | data |
psub14_objLearn_Plane_traj.tsv | data | data |
psub19_objLearn_Sph_traj.tsv | data | data |
psub34_objLearn_Sph_traj.tsv | data | data |
psub37_objLearn_Sph_traj.tsv | data | data |
psub33_objLearn_Plane_traj.tsv | data | data |
psub27_objLearn_Sph_traj.tsv | data | data |
psub15_objLearn_Plane_traj.tsv | data | data |
psub20_objLearn_Sph_traj.tsv | data | data |
psub09_objLearn_Plane_traj.tsv | data | data |
psub38_objLearn_Plane_traj.tsv | data | data |
psub12_objLearn_Plane_traj.tsv | data | data |
psub03_objLearn_Sph_traj.tsv | data | data |
psub35_objLearn_Sph_traj.tsv | data | data |
psub46_objLearn_Plane_traj.tsv | data | data |
psub30_objLearn_Plane_traj.tsv | data | data |
psub22_objLearn_Plane_traj.tsv | data | data |
psub13_objLearn_Sph_traj.tsv | data | data |
psub16_objLearn_Sph_traj.tsv | data | data |
psub04_objLearn_Sph_traj.tsv | data | data |
psub42_objLearn_Plane_traj.tsv | data | data |
psub25_objLearn_Sph_traj.tsv | data | data |
psub05_objLearn_Plane_traj.tsv | data | data |
psub44_objLearn_Plane_traj.tsv | data | data |
psub26_objLearn_Plane_traj.tsv | data | data |
psub34_objLearn_Plane_traj.tsv | data | data |
psub13_objLearn_Plane_traj.tsv | data | data |
psub41_objLearn_Sph_traj.tsv | data | data |
psub31_objLearn_Plane_traj.tsv | data | data |
psub24_objLearn_Sph_traj.tsv | data | data |
psub15_objLearn_Sph_traj.tsv | data | data |
psub11_objLearn_Plane_traj.tsv | data | data |
psub28_objLearn_Plane_traj.tsv | data | data |
psub28_objLearn_Sph_traj.tsv | data | data |
psub21_objLearn_Plane_traj.tsv | data | data |
psub08_objLearn_Sph_traj.tsv | data | data |
psub19_objLearn_Plane_traj.tsv | data | data |
psub35_objLearn_Plane_traj.tsv | data | data |
psub33_objLearn_Sph_traj.tsv | data | data |
psub24_objLearn_Plane_traj.tsv | data | data |
LICENSE | NA | NA |
mapface2edge.m | code | code |
sortrowstol.m | code | code |
spheretri.m | code | code |
SphereTriTestCase.m | code | code |
spheretribydepth.m | code | code |
istriequal.m | code | code |
README.md | text | readme |
shrinkfacetri.m | code | code |
combvec.m | code | code |
isface.m | code | code |
icosahedron.m | code | code |
Report Text
Finally, we print a report that tells the user - for example, a researcher preparing their manuscript for submission - whether there are suggestions to improve their data repository. We provide feedback on whether each of the four categories could be automatically detected, and if not, explain what would have allowed the automated tool to recognize the files of interest. The output gives a detailed overview of the information it could not find, alongside a suggestion for how to learn more about best practices in this domain. If researchers use this Papercheck module before submission, they can improve their data repository wherever information is missing. Papercheck may miss data and code that are shared but not clearly named; by flagging this, the report can prompt users to name folders and files more clearly.
osf_report <- function(summary) {
  # keep only rows that are files, then count each category of interest
  files <- dplyr::filter(summary, osf_type == "files")
  data <- dplyr::filter(files, file_category == "data") |> nrow()
  code <- dplyr::filter(files, file_category == "code") |> nrow()
  codebook <- dplyr::filter(files, file_category == "codebook") |> nrow()
  readme <- dplyr::filter(files, file_category == "readme") |> nrow()

  # overall traffic light: red = nothing detected, yellow = something missing,
  # green = data, code, and README all present
  traffic_light <- dplyr::case_when(
    data == 0 & code == 0 & readme == 0 ~ "red",
    data == 0 | code == 0 | readme == 0 ~ "yellow",
    data > 0 & code > 0 & readme > 0 ~ "green"
  )

  # per-category feedback messages
  data_report <- dplyr::case_when(
    data == 0 ~ "\u26A0\uFE0F There was no data detected. Are you sure you cannot share any of the underlying data? If you did share the data, consider naming the file(s) or file folder with 'data'.",
    data > 0 ~ "\u2705 Data file(s) were detected. Great job making your research more transparent and reproducible!"
  )

  codebook_report <- dplyr::case_when(
    codebook == 0 ~ "\u26A0\uFE0F️ No codebooks or data dictionaries were found. Consider adding one to make it easier for others to know which variables you have collected, and how to re-use them. The codebook package in R can automate a substantial part of the generation of a codebook: https://rubenarslan.github.io/codebook/",
    codebook > 0 ~ "\u2705 Codebook(s) were detected. Well done!"
  )

  code_report <- dplyr::case_when(
    code == 0 ~ "\u26A0\uFE0F️ No code files were found. Are you sure there is no code related to this manuscript? If you shared code, consider naming the file or file folder with 'code' or 'script'.",
    code > 0 ~ "\u2705 Code file(s) were detected. Great job making it easier to reproduce your results!"
  )

  readme_report <- dplyr::case_when(
    readme == 0 ~ "\u26A0\uFE0F No README files were identified. A read me is best practice to facilitate re-use. If you have a README, please name it explicitly (e.g., README.txt or _readme.pdf).",
    readme > 0 ~ "\u2705 README detected. Great job making it easier to understand how to re-use files in your repository!"
  )

  # combine the individual messages into one report
  report_message <- paste(
    readme_report,
    data_report,
    codebook_report,
    code_report,
    "Learn more about reproducible data practices: https://www.projecttier.org/tier-protocol/",
    sep = "\n\n"
  )

  return(list(
    traffic_light = traffic_light,
    report = report_message
  ))
}
report <- osf_report(osf_files_summary)
# format and print the report
module_report(report) |> cat()
✅ README detected. Great job making it easier to understand how to re-use files in your repository!
✅ Data file(s) were detected. Great job making your research more transparent and reproducible!
⚠️️ No codebooks or data dictionaries were found. Consider adding one to make it easier for others to know which variables you have collected, and how to re-use them. The codebook package in R can automate a substantial part of the generation of a codebook: https://rubenarslan.github.io/codebook/
✅ Code file(s) were detected. Great job making it easier to reproduce your results!
Learn more about reproducible data practices: https://www.projecttier.org/tier-protocol/
Checking the Contents of Files
So far we have used Papercheck to automatically check whether certain types of files exist. It is also possible to automatically download files, examine their contents, and provide feedback to users. This can be useful for examining datasets (e.g., do they contain IP addresses or other personal information?) or code files. We will illustrate the latter by automatically checking the contents of R scripts stored on the OSF, in repositories linked to in a scientific manuscript.
We can check R files for good coding practices that improve reproducibility. We have created a check that examines 1) whether all libraries are loaded in one block, instead of throughout the R script, 2) whether relative paths are used that will also work when someone runs the code on a different computer (e.g., data <- read.csv(file='../data/data_study_1.csv')) rather than absolute paths (e.g., data <- read.csv(file='C:/data/data_study_1.csv')), and 3) whether information is provided about the software used (i.e., the R version), the versions of the packages that were used, and the properties of the computer the analyses were performed on. In R, this information can be obtained with:
sessionInfo()
#> R version 4.4.3 (2025-02-28)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sequoia 15.5
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Europe/London
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1
#> [4] dplyr_1.1.4 purrr_1.0.4 readr_2.1.5
#> [7] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
#> [10] tidyverse_2.0.0 papercheck_0.0.0.9049
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.9 generics_0.1.3 stringi_1.8.4 httpcode_0.3.0
#> [5] hms_1.1.3 digest_0.6.37 magrittr_2.0.3 evaluate_1.0.3
#> [9] grid_4.4.3 timechange_0.3.0 fastmap_1.2.0 jsonlite_1.9.1
#> [13] crul_1.5.0 urltools_1.7.3 httr_1.4.7 scales_1.3.0
#> [17] textshaping_0.4.1 jquerylib_0.1.4 cli_3.6.4 rlang_1.1.5
#> [21] triebeard_0.4.1 munsell_0.5.1 withr_3.0.2 cachem_1.1.0
#> [25] yaml_2.3.10 tools_4.4.3 tzdb_0.5.0 memoise_2.0.1
#> [29] colorspace_2.1-1 curl_6.0.1 vctrs_0.6.5 R6_2.6.1
#> [33] lifecycle_1.0.4 fs_1.6.5 htmlwidgets_1.6.4 ragg_1.3.3
#> [37] osfr_0.2.9 pkgconfig_2.0.3 desc_1.4.3 pkgdown_2.1.1
#> [41] pillar_1.10.1 bslib_0.9.0 gtable_0.3.6 Rcpp_1.0.14
#> [45] glue_1.8.0 systemfonts_1.1.0 xfun_0.51 tidyselect_1.2.1
#> [49] rstudioapi_0.17.1 knitr_1.50 htmltools_0.5.8.1 rmarkdown_2.29
#> [53] compiler_4.4.3
As most scientists have never been explicitly taught how to code, it is common to see scripts that do not adhere to best coding practices. We are no exception ourselves (e.g., you will not find a sessioninfo.txt file in our repositories). Code might still be reproducible if you are willing to spend time figuring out which package and R versions were used and replacing absolute paths, but reproducibility is much easier when best practices are followed. The whole point of automated checks is to have algorithms that capture expertise and make recommendations that improve how we currently work.
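As an aside, saving this information alongside your analysis scripts takes a single line of base R (the file name sessioninfo.txt is just a convention):
# write the current session information to a text file in the working directory
writeLines(capture.output(sessionInfo()), "sessioninfo.txt")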
check_r_files <- function(summary) {
  # find R and R Markdown files among the OSF files
  r_files <- summary |>
    dplyr::filter(osf_type == "files",
                  grepl("\\.R(md)?$", name, ignore.case = TRUE)) |>
    dplyr::mutate(abs = NA_character_,
                  pkg = NA_character_,
                  session = NA_character_)

  report <- lapply(r_files$osf_id, \(id) {
    report <- dplyr::filter(r_files, osf_id == !!id)

    # Try downloading the R file
    file_url <- paste0("https://osf.io/download/", id)
    r_code <- tryCatch(
      readLines(url(file_url), warn = FALSE),
      error = function(e) return(NULL)
    )
    if (is.null(r_code)) return(NULL) # skip files that could not be downloaded

    # absolute paths (e.g., "C:/...", "/home/...", "~/...")
    abs_path <- grep("[\"\']([A-Z]:|\\/|~)", r_code)
    report$abs <- dplyr::case_when(
      length(abs_path) == 0 ~ "\u2705 No absolute paths were detected",
      length(abs_path) > 0 ~ paste("\u274C Absolute paths found at lines: ",
                                   paste(abs_path, collapse = ", "))
    )

    # package loading: all library()/require() calls should be close together
    pkg <- grep("\\b(library|require)\\(", r_code)
    report$pkg <- dplyr::case_when(
      length(pkg) == 0 ~ "\u26A0\uFE0F️ No packages are specified in this script.",
      length(pkg) == 1 ~ "\u2705 Packages are loaded in a single block.",
      all(diff(pkg) < 5) ~ "\u2705 Packages are loaded in a single block.",
      .default = paste(
        "\u274C Packages are loaded in multiple places: lines ",
        paste(pkg, collapse = ", ")
      )
    )

    # session info: sessionInfo() or sessioninfo::session_info()
    session <- grep("\\bsession_?[Ii]nfo\\(", r_code)
    report$session <- dplyr::case_when(
      length(session) == 0 ~ "\u274C️ No session info was found in this script.",
      length(session) > 0 ~ paste(
        "\u2705 Session info was found on line",
        paste(session, collapse = ", "))
    )

    return(report)
  }) |>
    do.call(dplyr::bind_rows, args = _)

  return(report)
}
r_file_results <- check_r_files(osf_files_summary)
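The table below shows these results in long format; one way to get there is a reshape along these lines (a sketch, assuming the abs, pkg, and session columns created above):
# reshape the per-file feedback columns into a long name/report/feedback table
r_file_results |>
  tidyr::pivot_longer(cols = c(abs, pkg, session),
                      names_to = "report", values_to = "feedback") |>
  dplyr::select(name, report, feedback)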
name | report | feedback |
---|---|---|
poweranalysis_sph.R | abs | ✅ No absolute paths were detected |
poweranalysis_sph.R | pkg | ✅ Packages are loaded in a single block. |
poweranalysis_sph.R | session | ❌️ No session info was found in this script. |
Put it All Together
Let’s put everything together in one block of code, and perform all automated checks for another open access paper in Psychological Science.
# Add this and the custom functions to a file called osf_file_check.R
osf_file_check <- function(paper) {
links <- osf_links(paper)
info <- osf_retrieve(links, recursive = TRUE)
osf_files_summary <- summarize_contents(info)
report <- osf_report(osf_files_summary)
r_file_results <- check_r_files(osf_files_summary)
list(
traffic_light = report$traffic_light,
table = r_file_results,
report = report$report,
summary = osf_files_summary
)
}
module_results <- module_run(psychsci$`0956797620955209`, "osf_file_check.R")
#> Starting OSF retrieval for 1 files...
#> * Retrieving info from k2dbf...
#> ...Main retrieval complete
#> Starting retrieval of children...
#> * Retrieving children for k2dbf...
#> * Retrieving files for k2dbf...
#> * Retrieving files for 5e344fb4f6631d013e5a48c9...
#> * Retrieving files for 5b88067b7b17570016f95389...
#> ...OSF retrieval complete!
module_report(module_results, header = 4) |> cat()
OSF File Check
⚠️ No README files were identified. A read me is best practice to facilitate re-use. If you have a README, please name it explicitly (e.g., README.txt or _readme.pdf).
✅ Data file(s) were detected. Great job making your research more transparent and reproducible!
⚠️️ No codebooks or data dictionaries were found. Consider adding one to make it easier for others to know which variables you have collected, and how to re-use them. The codebook package in R can automate a substantial part of the generation of a codebook: https://rubenarslan.github.io/codebook/
✅ Code file(s) were detected. Great job making it easier to reproduce your results!
Learn more about reproducible data practices: https://www.projecttier.org/tier-protocol/
text | section | div | p | s | osf_id | name | description | osf_type | public | category | registration | preprint |
---|---|---|---|---|---|---|---|---|---|---|---|---|
osf.io/k2dbf | method | 8 | 1 | 1 | k2dbf | Preregistered replication of “Sick body, vigilant mind: The biological immune system activates the behavioral immune system” | | nodes | TRUE | project | FALSE | TRUE |
osf.io/k2dbf | funding | 14 | 1 | 1 | k2dbf | Preregistered replication of “Sick body, vigilant mind: The biological immune system activates the behavioral immune system” | | nodes | TRUE | project | FALSE | TRUE |
osf.io/k2dbf | funding | 14 | 2 | 1 | k2dbf | Preregistered replication of “Sick body, vigilant mind: The biological immune system activates the behavioral immune system” | | nodes | TRUE | project | FALSE | TRUE |
Showing 3 of 3 rows
Future Developments
We have demonstrated a rather basic workflow that automatically checks files stored on the Open Science Framework, and all of the checks demonstrated here can be made more accurate or complete. At the same time, even these simple automatic checks might already facilitate re-use by prompting researchers to include information (e.g., a README) and to improve how files are named. There are many obvious ways to expand these automated checks. First, the example can be extended to other commonly used data repositories, such as GitHub and Dataverse. Second, the checks can be expanded beyond the properties that are automatically checked now; if you are an expert on code reproducibility or data re-use and would like to add checks, do reach out to us. Third, we can check for other types of files. For example, we are collaborating with Attila Simko, who is interested in identifying the files required to reproduce deep learning models in the medical imaging literature. We believe there will be many such field-dependent checks that can be automated, as the ability to automatically examine and/or retrieve files that are linked to in a paper should be useful for a wide range of use cases.
These examples were created using papercheck version 0.0.0.9045.