Checking OSF Data Repositories

In this blog post we will explain how Papercheck can automatically check the content of data repositories that are linked to in a scientific manuscript, using some papercheck functions for exploring OSF repositories to create a custom module.

There is increasing awareness of the importance of open science practices, and widespread support among scientists for practices such as data and code sharing (Ferguson et al. 2023). As data and code sharing is a relatively new practice, and many scientists lack training in open science, it is common to see badly documented data repositories. Best practices exist, such as the TIER protocol, but not all researchers are aware of them.

At a minimum a data repository should contain a README file with instructions for how to reproduce the results. If data is shared, it should be stored in a ‘data’ folder, or at least have the word ‘data’ in the filename. Code or scripts should similarly be shared in a folder with that name, or at least with the word in the filename. Finally, if data is shared, there should be a codebook or data dictionary that explains which variables are in the dataset in order to allow others to re-use the data. Although it is easy to forget to organize a data repository well, it is also easy to automatically check. Here we demonstrate how Papercheck can check if a README is present, whether data and/or code are shared, and if there is a codebook.

Ideally, peer reviewers or editors would check the contents of a data repository. In practice, time constraints mean that no one actually checks what is in a data repository. Automation can perform some of the checks that peers might otherwise perform manually. We provide an illustration of some checks that could be performed. Specifically: 1) is any data that is shared clearly labeled as such, 2) is code that is shared clearly labeled as such, 3) is there a README file that explains to potential users which files are shared, where they can be found in the repository, and how they can be used to reproduce any reported results, and 4) is there a codebook or data dictionary?

Checking an OSF repository with Papercheck

We will illustrate the process of checking a data repository by focusing on projects on the Open Science Framework. For this illustration we use an open access paper published in Psychological Science that has already been converted to a papercheck object using GROBID. There are 250 open access papers in the Papercheck object psychsci; we will choose one for this example.

# paper to use in this example
paper <- psychsci[[250]]

Set up OSF functions

Unauthenticated requests to the OSF API are limited to 100 per hour; if you authorise your requests, you can make 10K requests per day. The OSF functions in papercheck often make several requests per URL to get all of the info, so it’s worthwhile setting a personal access token (PAT). You can authorise your requests by creating an OSF token at https://osf.io/settings/tokens and including the following line in your .Renviron file (which you can open using usethis::edit_r_environ()):

OSF_PAT="replace-with-your-token-string"
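
As a quick sanity check, the token can be read back with base R's Sys.getenv(). The following is a minimal sketch; the helper name has_osf_pat() is hypothetical and not part of papercheck.

```r
# Hypothetical helper (not part of papercheck): report whether an OSF
# personal access token is visible to the current R session.
has_osf_pat <- function() {
  # Sys.getenv() returns "" when the variable is unset
  nzchar(Sys.getenv("OSF_PAT"))
}

if (!has_osf_pat()) {
  message("OSF_PAT is not set; OSF requests will be unauthenticated.")
}
```

Note that .Renviron is read when R starts, so a newly added token only becomes visible after restarting the R session.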

The OSF API server is down a lot, so it’s often good to check it before you run a bunch of OSF functions. We provide the function osf_api_check() for this. When the server is down, it can take several seconds to return an error, so scripts that check many URLs can run for a long time before you realise they aren’t working.

osf_api_check()
#> [1] "ok"
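
One way to use this check is to guard a longer batch run with it. The sketch below is an illustration only: run_if_ok() is a hypothetical wrapper, and a stand-in check function is used so the example runs without network access. With papercheck loaded, osf_api_check could be passed as the check argument, assuming it returns "ok" when the server is up, as in the output above.

```r
# Hypothetical wrapper (not part of papercheck): only run `work()` when the
# status check returns "ok". `check` is any zero-argument function.
run_if_ok <- function(check, work) {
  # Treat an error in the check itself as a failed check
  status <- tryCatch(check(), error = function(e) "error")
  if (!identical(status, "ok")) {
    warning("API check returned '", status, "'; skipping the batch.")
    return(invisible(NULL))
  }
  work()
}

# Stand-in functions so the sketch is self-contained
result <- run_if_ok(
  check = function() "ok",
  work  = function() "retrieval would run here"
)
```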

Find OSF Links

We start by searching for OSF URLs using the osf_links() function. OSF links can be tricky to find in PDFs, since spaces can be inserted in odd places, and view-only links that contain a ? are often interpreted as being split across sentences. This function is our best attempt at catching and fixing them all.

links <- osf_links(paper)

text	section
osf.io/hv29w	method
osf.io/2es6n	method
osf.io/jpm5a	method
osf.io/aux7s	method
osf.io/nw3mc	method
osf.io/ks639	method
osf.io/y75nu	method
OSF.IO/4TYM7	funding
OSF.IO/X4T9A	funding
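
To give a flavour of the clean-up involved, here is a minimal sketch of how an extracted link might be normalised. This is not papercheck's implementation; normalise_osf_link() is a hypothetical helper, and it assumes the input is already a candidate link rather than free text.

```r
# Hypothetical clean-up of an OSF link candidate extracted from PDF text.
# Illustration only; osf_links() may work differently.
normalise_osf_link <- function(x) {
  # remove whitespace that PDF extraction inserted inside the link
  x <- gsub("\\s+", "", x)
  # drop the URL scheme, if present
  x <- sub("^https?://", "", x, ignore.case = TRUE)
  # lower-case everything before any "?" and leave the query string untouched
  sub("^([^?]*)", "\\L\\1", x, perl = TRUE)
}

normalise_osf_link(c("OSF.IO/4TYM7", "https://osf. io/ hv29w"))
```

Lower-casing only the part before the ? is a deliberately cautious choice, so that any view-only token in the query string is passed through unchanged.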

Retrieve Link Info

If the link is valid, it is processed, and the OSF Application Programming Interface (API) is used to determine whether the link points to a file, project, or registration. This is achieved through the osf_retrieve() function.

This function can take a vector of OSF IDs or URLs, or a table that contains them. If the IDs aren’t in the first column, you will need to specify the name of the column. The function will return your table with added information. (You can quiet the output messages with verbose(FALSE).)

The function osf_retrieve() will also retrieve all child components, files and folders if you set the argument recursive = TRUE. If there are duplicate IDs, it will only get the contents for each item once. If you set the argument find_project = TRUE, it will also look up the parent project of any links (but this requires more API calls).

info <- osf_retrieve(links, recursive = TRUE, find_project = TRUE)
#> Starting OSF retrieval for 9 files...
#> * Retrieving info from hv29w...
#> * Retrieving info from 2es6n...
#> * Retrieving info from jpm5a...
#> * Retrieving info from aux7s...
#> * Retrieving info from nw3mc...
#> * Retrieving info from ks639...
#> * Retrieving info from y75nu...
#> * Retrieving info from 4tym7...
#> * Retrieving info from x4t9a...
#> ...Main retrieval complete
#> Starting retrieval of children...
#> * Retrieving children for x4t9a...
#> * Retrieving files for x4t9a...
#> * Retrieving files for 6621454e716cb7048fa45a2a...
#> * Retrieving files for 6293d1cab59d5f1df8720db5...
#> * Retrieving files for 6293d1bfbbdcde278f4269ed...
#> * Retrieving files for 6293d1c5bbdcde278f4269f7...
#> * Retrieving files for 6339c10031d65306e12de5a2...
#> * Retrieving files for 6293d2e3b59d5f1df0720c6b...
#> * Retrieving files for 64f0ac666d1e8905f21516b2...
#> * Retrieving files for 64f0ab59f3dcd105d7ddd40b...
#> ...OSF retrieval complete!

osf_id	name	osf_type	project
hv29w	Kim,Doeller_Prereg_OSF.pdf	files	x4t9a
2es6n	suppleVideo1_learnSph.mp4	files	x4t9a
jpm5a	suppleVideo2_learnPlane.mp4	files	x4t9a
aux7s	suppleVideo3_objlocSph.mp4	files	x4t9a
nw3mc	suppleVideo4_objlocPlane.mp4	files	x4t9a
ks639	suppleVideo5_triangleSph.mp4	files	x4t9a
y75nu	suppleVideo6_trianglePlane.mp4	files	x4t9a
4tym7	Cognitive maps for a spherical surface	registrations	x4t9a
x4t9a	Cognitive maps for a spherical surface	nodes	x4t9a

There are multiple OSF links in this paper, but they are all part of the same overarching OSF project, with the project ID x4t9a.
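
A check like this can also be done programmatically. The sketch below rebuilds three rows of the table by hand (the values are copied from the output above) so that it runs without calling the OSF API; in practice you would inspect the table returned by osf_retrieve() directly.

```r
# Values copied from three of the nine rows in the retrieval table
info_example <- data.frame(
  osf_id   = c("hv29w", "4tym7", "x4t9a"),
  osf_type = c("files", "registrations", "nodes"),
  project  = c("x4t9a", "x4t9a", "x4t9a")
)

# A single unique value means every link belongs to the same project
projects <- unique(info_example$project)
length(projects) == 1
```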

Summarize Contents

The OSF allows you to assign a category to each component, and we can also determine file types using extensions.

osf_id	name	filetype
hv29w	Kim,Doeller_Prereg_OSF.pdf	text
2es6n	suppleVideo1_learnSph.mp4	video
jpm5a	suppleVideo2_learnPlane.mp4	video
aux7s	suppleVideo3_objlocSph.mp4	video
nw3mc	suppleVideo4_objlocPlane.mp4	video
ks639	suppleVideo5_triangleSph.mp4	video
y75nu	suppleVideo6_trianglePlane.mp4	video
662145f3716cb7048fa45a58	virtualizerStudy1-main.zip	archive
662146fa8df04804d3177e59	ReadMe.txt	text
64f0aaf4d9f2c905a0d04821	main_analyseTriangleComple_20230423.m	code
64f0aaf7f3dcd105dbddd396	main_simulate_objlocTraj.m	code
64f0aaf9989de605c3dd152a	supple_learningTrajectory.m	code
6293d2d5b59d5f1df7720e41	poweranalysis_sph.R	code
6614017dc053943058b4d41c	supple_sphWithVariousRadius_clean.m	code
6293d2d5b59d5f1df0720c66	main_analyseObjLocTest.m	code
6293d26c86324127ca5b5862	suppleVideo2_learnPlane.mp4	video
6293d29eb7e8c726edc2dc38	suppleVideo3_objlocSph.mp4	video
6293d2a1bbdcde278b42696a	suppleVideo6_trianglePlane.mp4	video
6293d271ddbe49279ba215f6	suppleVideo1_learnSph.mp4	video
6293d275bbdcde278f426b25	suppleVideo4_objlocPlane.mp4	video
6293d282ddbe49279aa21548	suppleVideo5_triangleSph.mp4	video
6293d1e8b59d5f1df7720d4c	suppleMovie_legend.txt	text
6293d310b7e8c726ecc2db7d	sumDemograph.csv	data
6293d30fb7e8c726ecc2db79	rawdata_plane_triangle.csv	data
6293d30ebbdcde278f426c49	rawdata_sph_objlocTest.csv	data
6293d310b59d5f1df3720cf4	rawdata_sph_triangle.csv	data
661402d9e65c603b737d9c10	cleanData_combine.mat	code
661402b4943bee32eadfebdd	pilotData_triangle_combine_clean.csv	data
6293d30d86324127d25b5c27	rawdata_plane_objlocIdentity.csv	data
6293d30d86324127ce5b5a91	rawdata_sph_objlocIdentity.csv	data
6293d30db7e8c726edc2dcb7	rawdata_plane_objlocTest.csv	data
6339c121ec7f3f0704f5fbf0	Kim,Doeller_Prereg_OSF.pdf	text
64f0ab2ff3dcd105c9ddd4fc	findShortcut.m	code
6293d2f5bbdcde278f426c19	sph2cartFn.m	code
6293d2f6ddbe49279ba2164b	drawGeodesic.m	code
6293d2f786324127ca5b5879	sph2cartMKunity.m	code
6293d2f786324127d25b5bf3	translateOnSphere.m	code
6293d2f6ddbe49279da216eb	northVecFn.m	code
6293d2f9b7e8c726edc2dc9b	ttestplotMK2.m	code
6293d2f5b7e8c726f0c2de20	cart2sphFn.m	code
6293d2f486324127ca5b5875	rotAroundU.m	code
64f0acb06c0f5a0647d057e8	psub11_objLearn_Sph_traj.tsv	data
64f0acbfd9f2c905a0d048ba	psub16_objLearn_Plane_traj.tsv	data
64f0aca2f3dcd105daddd527	psub06_objLearn_Sph_traj.tsv	data
64f0acdb989de605c2dd1598	psub25_objLearn_Plane_traj.tsv	data
64f0ac9bf3dcd105d7ddd449	psub04_objLearn_Plane_traj.tsv	data
64f0acc2d9f2c905a5d04a36	psub17_objLearn_Sph_traj.tsv	data
64f0acab6c0f5a064fd058e4	psub10_objLearn_Plane_traj.tsv	data
64f0aca46c0f5a0650d059a5	psub08_objLearn_Plane_traj.tsv	data
64f0acba6d1e8905f21516d4	psub14_objLearn_Sph_traj.tsv	data
64f0aca9f3dcd105dbddd497	psub09_objLearn_Sph_traj.tsv	data
64f0ace16c0f5a0650d059c8	psub27_objLearn_Plane_traj.tsv	data
64f0acf2d9f2c905a0d048d2	psub32_objLearn_Sph_traj.tsv	data
64f0acef6d1e8905ee1515e2	psub32_objLearn_Plane_traj.tsv	data
64f0accd6c0f5a064fd058f0	psub21_objLearn_Sph_traj.tsv	data
64f0acb36c0f5a0650d059ab	psub12_objLearn_Sph_traj.tsv	data
64f0ad06989de605badd1471	psub41_objLearn_Plane_traj.tsv	data
64f0acd6f3dcd105d3ddd39b	psub23_objLearn_Plane_traj.tsv	data
64f0ad0ef3dcd105dbddd4d7	psub44_objLearn_Sph_traj.tsv	data
64f0acc3989de605c3dd1622	psub17_objLearn_Plane_traj.tsv	data
64f0aca9f3dcd105dbddd497	psub09_objLearn_Sph_traj.tsv	data
64f0aca1f3dcd105daddd525	psub06_objLearn_Plane_traj.tsv	data
64f0acfcf3dcd105d7ddd456	psub37_objLearn_Plane_traj.tsv	data
64f0aceb6d1e8905f315176a	psub30_objLearn_Sph_traj.tsv	data
64f0acd66c0f5a064fd058f4	psub23_objLearn_Sph_traj.tsv	data
64f0ac98989de605c2dd1576	psub03_objLearn_Plane_traj.tsv	data
64f0ad0bf3dcd105d7ddd45a	psub42_objLearn_Sph_traj.tsv	data
64f0ad046c0f5a0650d059d6	psub40_objLearn_Sph_traj.tsv	data
64f0ac9f6c0f5a064fd058de	psub05_objLearn_Sph_traj.tsv	data
64f0ad03989de605bedd1514	psub40_objLearn_Plane_traj.tsv	data
64f0aca9f3dcd105dbddd497	psub09_objLearn_Sph_traj.tsv	data
64f0acd2989de605c2dd158e	psub22_objLearn_Sph_traj.tsv	data
64f0ad126c0f5a064cd058bb	psub47_objLearn_Plane_traj.tsv	data
64f0acc9f3dcd105dbddd4a6	psub20_objLearn_Plane_traj.tsv	data
64f0ad116d1e8905ee1515ee	psub46_objLearn_Sph_traj.tsv	data
64f0ad14d9f2c905a5d04a70	psub47_objLearn_Sph_traj.tsv	data
64f0ad016d1e8905f3151778	psub38_objLearn_Sph_traj.tsv	data
64f0acac6c0f5a0650d059a9	psub10_objLearn_Sph_traj.tsv	data
64f0acb8f3dcd105dbddd49e	psub14_objLearn_Plane_traj.tsv	data
64f0acc6d9f2c905a4d049b3	psub19_objLearn_Sph_traj.tsv	data
64f0acf8f3dcd105dbddd4cb	psub34_objLearn_Sph_traj.tsv	data
64f0acfe989de605c3dd164d	psub37_objLearn_Sph_traj.tsv	data
64f0acf36d1e8905f315176e	psub33_objLearn_Plane_traj.tsv	data
64f0ace3f3dcd105dbddd4b8	psub27_objLearn_Sph_traj.tsv	data
64f0acbc6d1e8905ee1515d8	psub15_objLearn_Plane_traj.tsv	data
64f0acc9989de605bfdd160a	psub20_objLearn_Sph_traj.tsv	data
64f0aca8d9f2c905a0d048b8	psub09_objLearn_Plane_traj.tsv	data
64f0ad00d9f2c905a4d049d6	psub38_objLearn_Plane_traj.tsv	data
64f0acb1f3dcd105d6ddd37c	psub12_objLearn_Plane_traj.tsv	data
64f0ac98d9f2c905a4d04981	psub03_objLearn_Sph_traj.tsv	data
64f0acfd6d1e8905f3151776	psub35_objLearn_Sph_traj.tsv	data
64f0ad0fd9f2c905a5d04a6d	psub46_objLearn_Plane_traj.tsv	data
64f0ace8989de605c3dd163f	psub30_objLearn_Plane_traj.tsv	data
64f0acd36c0f5a0650d059c3	psub22_objLearn_Plane_traj.tsv	data
64f0acb7f3dcd105d7ddd44e	psub13_objLearn_Sph_traj.tsv	data
64f0acc0989de605bfdd1603	psub16_objLearn_Sph_traj.tsv	data
64f0ac9bd9f2c905a5d04a2c	psub04_objLearn_Sph_traj.tsv	data
64f0ad0ad9f2c9059dd0482b	psub42_objLearn_Plane_traj.tsv	data
64f0acdc6d1e8905f2151737	psub25_objLearn_Sph_traj.tsv	data
64f0ac9fd9f2c905a4d04983	psub05_objLearn_Plane_traj.tsv	data
64f0ad0dd9f2c905a4d049d8	psub44_objLearn_Plane_traj.tsv	data
64f0acded9f2c905a4d049c8	psub26_objLearn_Plane_traj.tsv	data
64f0acf6d9f2c905a5d04a58	psub34_objLearn_Plane_traj.tsv	data
64f0acb4989de605c2dd1588	psub13_objLearn_Plane_traj.tsv	data
64f0ad08989de605c3dd1650	psub41_objLearn_Sph_traj.tsv	data
64f0aceb989de605c3dd1642	psub31_objLearn_Plane_traj.tsv	data
64f0acd9d9f2c905a4d049c3	psub24_objLearn_Sph_traj.tsv	data
64f0acbd6c0f5a064cd05891	psub15_objLearn_Sph_traj.tsv	data
64f0acaed9f2c905a4d049aa	psub11_objLearn_Plane_traj.tsv	data
64f0ace5989de605c3dd163d	psub28_objLearn_Plane_traj.tsv	data
64f0ace7d9f2c905a4d049cc	psub28_objLearn_Sph_traj.tsv	data
64f0accd6d1e8905f3151765	psub21_objLearn_Plane_traj.tsv	data
64f0aca6f3dcd105daddd52a	psub08_objLearn_Sph_traj.tsv	data
64f0acc66c0f5a064bd05865	psub19_objLearn_Plane_traj.tsv	data
64f0acf9f3dcd105d3ddd3a0	psub35_objLearn_Plane_traj.tsv	data
64f0acf56c0f5a0648d0581c	psub33_objLearn_Sph_traj.tsv	data
64f0acd9d9f2c905a1d048ba	psub24_objLearn_Plane_traj.tsv	data
64f0ab74989de605c2dd14eb	LICENSE	NA
64f0ab746d1e8905ef1515e8	mapface2edge.m	code
64f0ab7c989de605c2dd14f0	sortrowstol.m	code
64f0ab7c6c0f5a0650d05905	spheretri.m	code
64f0ab7ff3dcd105dbddd3db	SphereTriTestCase.m	code
64f0ab7f6d1e8905ee15157d	spheretribydepth.m	code
64f0ab72989de605c3dd155d	istriequal.m	code
64f0ab78d9f2c905a4d048f8	README.md	text
64f0ab78f3dcd105d6ddd35b	shrinkfacetri.m	code
64f0ab6e989de605bedd146d	combvec.m	code
64f0ab71f3dcd105d7ddd40c	isface.m	code
64f0ab6fd9f2c905a5d04929	icosahedron.m	code
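
The idea of deriving a file type from the extension can be sketched with a simple lookup. The mapping below is an assumption made for illustration (file_type_from_name() is a hypothetical helper) and covers only a few extensions; papercheck's own mapping is more extensive and may differ.

```r
# Hypothetical extension-to-type lookup; not papercheck's actual mapping
file_type_from_name <- function(name) {
  ext <- tolower(tools::file_ext(name))
  lookup <- c(
    csv = "data", tsv = "data",
    r = "code", m = "code",
    txt = "text", pdf = "text", md = "text",
    mp4 = "video", zip = "archive"
  )
  # unknown or missing extensions yield NA
  unname(lookup[ext])
}

file_type_from_name(c("ReadMe.txt", "poweranalysis_sph.R", "LICENSE"))
```

Files without an extension, such as LICENSE, get NA in this sketch, which mirrors the NA shown for that file in the table above.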

We can then use this information to determine, for each file, whether the information about the file contains text that makes it easy to tell what is being shared. A simple regular expression text search for ‘README’, ‘codebook’, ‘script’, and ‘data’ (in a number of possible ways that these words can be written) is used to automatically detect what is shared.

osf_files_summary <- summarize_contents(info)

name	filetype	file_category
Kim,Doeller_Prereg_OSF.pdf	text	NA
suppleVideo1_learnSph.mp4	video	NA
suppleVideo2_learnPlane.mp4	video	NA
suppleVideo3_objlocSph.mp4	video	NA
suppleVideo4_objlocPlane.mp4	video	NA
suppleVideo5_triangleSph.mp4	video	NA
suppleVideo6_trianglePlane.mp4	video	NA
virtualizerStudy1-main.zip	archive	NA
ReadMe.txt	text	readme
main_analyseTriangleComple_20230423.m	code	code
main_simulate_objlocTraj.m	code	code
supple_learningTrajectory.m	code	code
poweranalysis_sph.R	code	code
supple_sphWithVariousRadius_clean.m	code	code
main_analyseObjLocTest.m	code	code
suppleVideo2_learnPlane.mp4	video	NA
suppleVideo3_objlocSph.mp4	video	NA
suppleVideo6_trianglePlane.mp4	video	NA
suppleVideo1_learnSph.mp4	video	NA
suppleVideo4_objlocPlane.mp4	video	NA
suppleVideo5_triangleSph.mp4	video	NA
suppleMovie_legend.txt	text	NA
sumDemograph.csv	data	data
rawdata_plane_triangle.csv	data	data
rawdata_sph_objlocTest.csv	data	data
rawdata_sph_triangle.csv	data	data
cleanData_combine.mat	code	code
pilotData_triangle_combine_clean.csv	data	data
rawdata_plane_objlocIdentity.csv	data	data
rawdata_sph_objlocIdentity.csv	data	data
rawdata_plane_objlocTest.csv	data	data
Kim,Doeller_Prereg_OSF.pdf	text	NA
findShortcut.m	code	code
sph2cartFn.m	code	code
drawGeodesic.m	code	code
sph2cartMKunity.m	code	code
translateOnSphere.m	code	code
northVecFn.m	code	code
ttestplotMK2.m	code	code
cart2sphFn.m	code	code
rotAroundU.m	code	code
psub11_objLearn_Sph_traj.tsv	data	data
psub16_objLearn_Plane_traj.tsv	data	data
psub06_objLearn_Sph_traj.tsv	data	data
psub25_objLearn_Plane_traj.tsv	data	data
psub04_objLearn_Plane_traj.tsv	data	data
psub17_objLearn_Sph_traj.tsv	data	data
psub10_objLearn_Plane_traj.tsv	data	data
psub08_objLearn_Plane_traj.tsv	data	data
psub14_objLearn_Sph_traj.tsv	data	data
psub09_objLearn_Sph_traj.tsv	data	data
psub27_objLearn_Plane_traj.tsv	data	data
psub32_objLearn_Sph_traj.tsv	data	data
psub32_objLearn_Plane_traj.tsv	data	data
psub21_objLearn_Sph_traj.tsv	data	data
psub12_objLearn_Sph_traj.tsv	data	data
psub41_objLearn_Plane_traj.tsv	data	data
psub23_objLearn_Plane_traj.tsv	data	data
psub44_objLearn_Sph_traj.tsv	data	data
psub17_objLearn_Plane_traj.tsv	data	data
psub09_objLearn_Sph_traj.tsv	data	data
psub06_objLearn_Plane_traj.tsv	data	data
psub37_objLearn_Plane_traj.tsv	data	data
psub30_objLearn_Sph_traj.tsv	data	data
psub23_objLearn_Sph_traj.tsv	data	data
psub03_objLearn_Plane_traj.tsv	data	data
psub42_objLearn_Sph_traj.tsv	data	data
psub40_objLearn_Sph_traj.tsv	data	data
psub05_objLearn_Sph_traj.tsv	data	data
psub40_objLearn_Plane_traj.tsv	data	data
psub09_objLearn_Sph_traj.tsv	data	data
psub22_objLearn_Sph_traj.tsv	data	data
psub47_objLearn_Plane_traj.tsv	data	data
psub20_objLearn_Plane_traj.tsv	data	data
psub46_objLearn_Sph_traj.tsv	data	data
psub47_objLearn_Sph_traj.tsv	data	data
psub38_objLearn_Sph_traj.tsv	data	data
psub10_objLearn_Sph_traj.tsv	data	data
psub14_objLearn_Plane_traj.tsv	data	data
psub19_objLearn_Sph_traj.tsv	data	data
psub34_objLearn_Sph_traj.tsv	data	data
psub37_objLearn_Sph_traj.tsv	data	data
psub33_objLearn_Plane_traj.tsv	data	data
psub27_objLearn_Sph_traj.tsv	data	data
psub15_objLearn_Plane_traj.tsv	data	data
psub20_objLearn_Sph_traj.tsv	data	data
psub09_objLearn_Plane_traj.tsv	data	data
psub38_objLearn_Plane_traj.tsv	data	data
psub12_objLearn_Plane_traj.tsv	data	data
psub03_objLearn_Sph_traj.tsv	data	data
psub35_objLearn_Sph_traj.tsv	data	data
psub46_objLearn_Plane_traj.tsv	data	data
psub30_objLearn_Plane_traj.tsv	data	data
psub22_objLearn_Plane_traj.tsv	data	data
psub13_objLearn_Sph_traj.tsv	data	data
psub16_objLearn_Sph_traj.tsv	data	data
psub04_objLearn_Sph_traj.tsv	data	data
psub42_objLearn_Plane_traj.tsv	data	data
psub25_objLearn_Sph_traj.tsv	data	data
psub05_objLearn_Plane_traj.tsv	data	data
psub44_objLearn_Plane_traj.tsv	data	data
psub26_objLearn_Plane_traj.tsv	data	data
psub34_objLearn_Plane_traj.tsv	data	data
psub13_objLearn_Plane_traj.tsv	data	data
psub41_objLearn_Sph_traj.tsv	data	data
psub31_objLearn_Plane_traj.tsv	data	data
psub24_objLearn_Sph_traj.tsv	data	data
psub15_objLearn_Sph_traj.tsv	data	data
psub11_objLearn_Plane_traj.tsv	data	data
psub28_objLearn_Plane_traj.tsv	data	data
psub28_objLearn_Sph_traj.tsv	data	data
psub21_objLearn_Plane_traj.tsv	data	data
psub08_objLearn_Sph_traj.tsv	data	data
psub19_objLearn_Plane_traj.tsv	data	data
psub35_objLearn_Plane_traj.tsv	data	data
psub33_objLearn_Sph_traj.tsv	data	data
psub24_objLearn_Plane_traj.tsv	data	data
LICENSE	NA	NA
mapface2edge.m	code	code
sortrowstol.m	code	code
spheretri.m	code	code
SphereTriTestCase.m	code	code
spheretribydepth.m	code	code
istriequal.m	code	code
README.md	text	readme
shrinkfacetri.m	code	code
combvec.m	code	code
isface.m	code	code
icosahedron.m	code	code
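
To make the regular expression search concrete, here is a minimal sketch that categorises files from their names alone. The patterns and the helper categorise_file() are assumptions for illustration; summarize_contents() uses its own rules and, as the table shows, also takes the file type into account (for example, .m files are categorised as code).

```r
# Hypothetical name-based categoriser; patterns are illustrative only
categorise_file <- function(name) {
  n <- tolower(name)
  # order matters: test "codebook" before "code", and
  # "data dictionary" before "data"
  ifelse(grepl("read[ _-]?me", n), "readme",
  ifelse(grepl("code[ _-]?book|data[ _-]?dictionary", n), "codebook",
  ifelse(grepl("script|code", n), "code",
  ifelse(grepl("data", n), "data", NA_character_))))
}

categorise_file(c("ReadMe.txt", "codebook.csv", "analysis_script.R",
                  "rawdata_sph_triangle.csv", "LICENSE"))
```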

Report Text

Finally, we print a report that tells the user (for example, a researcher preparing their manuscript for submission) whether there are suggestions to improve their data repository. We provide feedback about whether each of the four categories could be automatically detected and, if not, explain what would have allowed the automated tool to recognize the files of interest. The output gives a detailed overview of the information it could not find, alongside a suggestion for how to learn more about best practices in this domain. If researchers use this Papercheck module before submission, they can improve the quality of their data repository when information is missing. Papercheck might miss data and code that is shared but not clearly named; by flagging this, it can help users realize that the data repository can be improved by naming folders and files more clearly.

osf_report <- function(summary) {
  files <- dplyr::filter(summary, osf_type == "files")
  data <- dplyr::filter(files, file_category == "data") |> nrow()
  code <- dplyr::filter(files, file_category == "code") |> nrow()
  codebook <- dplyr::filter(files, file_category == "codebook") |> nrow()
  readme <- dplyr::filter(files, file_category == "readme") |> nrow()
  
  traffic_light <- dplyr::case_when(
    data == 0 & code == 0 & readme == 0 ~ "red",
    data == 0 | code == 0 | readme == 0 ~ "yellow",
    data > 0 & code > 0 & readme > 0 ~ "green"
  )
  
  data_report <- dplyr::case_when(
    data == 0 ~ "\u26A0\uFE0F There was no data detected. Are you sure you cannot share any of the underlying data? If you did share the data, consider naming the file(s) or file folder with 'data'.",
    data > 0 ~ "\u2705 Data file(s) were detected. Great job making your research more transparent and reproducible!"
  )
  
  codebook_report <- dplyr::case_when(
    codebook == 0 ~ "\u26A0\uFE0F No codebooks or data dictionaries were found. Consider adding one to make it easier for others to know which variables you have collected, and how to re-use them. The codebook package in R can automate a substantial part of the generation of a codebook: https://rubenarslan.github.io/codebook/",
    codebook > 0 ~ "\u2705 Codebook(s) were detected. Well done!"
  )
  
  code_report <- dplyr::case_when(
    code == 0 ~ "\u26A0\uFE0F No code files were found. Are you sure there is no code related to this manuscript? If you shared code, consider naming the file or file folder with 'code' or 'script'.",
    code > 0 ~ "\u2705 Code file(s) were detected. Great job making it easier to reproduce your results!"
  )
  
  readme_report <- dplyr::case_when(
    readme == 0 ~ "\u26A0\uFE0F No README files were identified. A read me is best practice to facilitate re-use. If you have a README, please name it explicitly (e.g., README.txt or _readme.pdf).",
    readme > 0 ~ "\u2705 README detected. Great job making it easier to understand how to re-use files in your repository!"
  )
  
  report_message <- paste(
    readme_report,
    data_report, 
    codebook_report,
    code_report,
    "Learn more about reproducible data practices: https://www.projecttier.org/tier-protocol/",
    sep = "\n\n"
  )

  return(list(
    traffic_light = traffic_light,
    report = report_message
  ))
}

report <- osf_report(osf_files_summary) 

# print the report
module_report(report) |> cat()

✅ README detected. Great job making it easier to understand how to re-use files in your repository!

✅ Data file(s) were detected. Great job making your research more transparent and reproducible!

⚠️️ No codebooks or data dictionaries were found. Consider adding one to make it easier for others to know which variables you have collected, and how to re-use them. The codebook package in R can automate a substantial part of the generation of a codebook: https://rubenarslan.github.io/codebook/

✅ Code file(s) were detected. Great job making it easier to reproduce your results!

Learn more about reproducible data practices: https://www.projecttier.org/tier-protocol/

Checking the Contents of Files

So far we have used Papercheck to automatically check whether certain types of files exist. But it is also possible to automatically download files, examine their contents, and provide feedback to users. This can be useful to examine datasets (e.g., do files contain IP addresses or other personal information), or code files. We will illustrate the latter by automatically checking the content of R scripts stored on the OSF, in repositories that are linked to in a scientific manuscript.

We can check R files for good coding practices that improve reproducibility. We have created a check that examines 1) whether all libraries are loaded in one block, instead of throughout the R script, 2) whether relative paths are used that will also work when someone runs the code on a different computer (e.g., data <- read.csv(file='../data/data_study_1.csv') ) instead of fixed paths (e.g., data <- read.csv(file='C:/data/data_study_1.csv') ), and 3) whether information is provided about the software used (i.e., the R version), the version of packages that were used, and properties of the computer that the analyses were performed on. In R, this can be achieved by:

sessionInfo()
#> R version 4.4.3 (2025-02-28)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sequoia 15.5
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Europe/London
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] lubridate_1.9.4       forcats_1.0.0         stringr_1.5.1        
#>  [4] dplyr_1.1.4           purrr_1.0.4           readr_2.1.5          
#>  [7] tidyr_1.3.1           tibble_3.2.1          ggplot2_3.5.1        
#> [10] tidyverse_2.0.0       papercheck_0.0.0.9049
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.9        generics_0.1.3    stringi_1.8.4     httpcode_0.3.0   
#>  [5] hms_1.1.3         digest_0.6.37     magrittr_2.0.3    evaluate_1.0.3   
#>  [9] grid_4.4.3        timechange_0.3.0  fastmap_1.2.0     jsonlite_1.9.1   
#> [13] crul_1.5.0        urltools_1.7.3    httr_1.4.7        scales_1.3.0     
#> [17] textshaping_0.4.1 jquerylib_0.1.4   cli_3.6.4         rlang_1.1.5      
#> [21] triebeard_0.4.1   munsell_0.5.1     withr_3.0.2       cachem_1.1.0     
#> [25] yaml_2.3.10       tools_4.4.3       tzdb_0.5.0        memoise_2.0.1    
#> [29] colorspace_2.1-1  curl_6.0.1        vctrs_0.6.5       R6_2.6.1         
#> [33] lifecycle_1.0.4   fs_1.6.5          htmlwidgets_1.6.4 ragg_1.3.3       
#> [37] osfr_0.2.9        pkgconfig_2.0.3   desc_1.4.3        pkgdown_2.1.1    
#> [41] pillar_1.10.1     bslib_0.9.0       gtable_0.3.6      Rcpp_1.0.14      
#> [45] glue_1.8.0        systemfonts_1.1.0 xfun_0.51         tidyselect_1.2.1 
#> [49] rstudioapi_0.17.1 knitr_1.50        htmltools_0.5.8.1 rmarkdown_2.29   
#> [53] compiler_4.4.3
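
To keep this information with the analysis, the output can be written to a text file at the end of a script. This is a minimal sketch using base R only; the file name sessioninfo.txt is a common convention rather than a requirement.

```r
# Save the session details alongside the analysis outputs.
# capture.output() turns the printed summary into a character vector.
session_file <- "sessioninfo.txt"
writeLines(capture.output(sessionInfo()), session_file)
```

Because the file name is a relative path, the file is created in the current working directory on any computer that runs the script.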

As most scientists have not been explicitly taught how to code, it is common to see scripts that do not adhere to best coding practices. We are no exception ourselves (e.g., you will not find a sessioninfo.txt file in our repositories). Although code might be reproducible even if it takes time to figure out which version of an R package and which R version were used, and to change fixed paths, reproducibility is facilitated if best practices are used. The whole point of automated checks is to have algorithms that capture expertise and make recommendations that improve how we currently work.

check_r_files <- function(summary) {
  r_files <- summary |>
    dplyr::filter(osf_type == "files",
                  grepl("\\.R(md)?$", name, ignore.case = TRUE)) |>
    dplyr::mutate(abs_report = NA, 
                  pkg_report = NA,
                  session_report = NA)
  
  report <- lapply(r_files$osf_id, \(id) {
    report <- dplyr::filter(r_files, osf_id == !!id)
    # Try downloading the R file
    file_url <- paste0("https://osf.io/download/", id)
    r_code <- tryCatch(
      readLines(url(file_url), warn = FALSE),
      error = function(e) return(NULL)
    )
    
    # skip files that could not be downloaded; bind_rows() ignores NULL
    if (is.null(r_code)) return(NULL)
    
    # absolute paths
    abs_path <- grep("[\"\']([A-Z]:|\\/|~)", r_code)
    report$abs <- dplyr::case_when(
      length(abs_path) == 0 ~ "\u2705 No absolute paths were detected",
      length(abs_path) > 0 ~ paste("\u274C Absolute paths found at lines: ",
                                   paste(abs_path, collapse = ", "))
    )
    
    # package loading
    pkg <- grep("\\b(library|require)\\(", r_code)
    report$pkg <- dplyr::case_when(
      length(pkg) == 0 ~ "\u26A0\uFE0F No packages are specified in this script.",
      length(pkg) == 1 ~ "\u2705 Packages are loaded in a single block.",
      all(diff(pkg) < 5) ~ "\u2705 Packages are loaded in a single block.",
      .default = paste(
        "\u274C Packages are loaded in multiple places: lines " ,
        paste(pkg, collapse = ", ")
      )
    )
    
    # session info 
    session <- grep("\\bsession_?[Ii]nfo\\(", r_code)
    report$session <- dplyr::case_when(
      length(session) == 0 ~ "\u274C️ No session info was found in this script.",
      length(session) > 0 ~ paste(
        "\u2705 Session info was found on line", 
        paste(session, collapse = ", "))
    )
    
    return(report)
  }) |>
    do.call(dplyr::bind_rows, args = _)
  
  return(report)
}

r_file_results <- check_r_files(osf_files_summary)

name	report	feedback
poweranalysis_sph.R	abs	✅ No absolute paths were detected
poweranalysis_sph.R	pkg	✅ Packages are loaded in a single block.
poweranalysis_sph.R	session	❌️ No session info was found in this script.
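
To see how these heuristics behave, they can be applied to a small script held in memory. The sketch below uses a made-up toy script and base R only. The package and session info patterns are the ones used in check_r_files(); the absolute path pattern is a slightly simplified version of the one in the function.

```r
# A made-up script; each element is one line of code
toy_script <- c(
  'library(dplyr)',
  'library(ggplot2)',
  'dat <- read.csv("C:/data/data_study_1.csv")',
  'summary(dat)',
  '',
  '',
  '',
  'library(lme4)'
)

# lines with a quoted path starting with a drive letter, "/" or "~"
abs_lines <- grep("[\"']([A-Z]:|/|~)", toy_script)

# lines that load a package
pkg_lines <- grep("\\b(library|require)\\(", toy_script)

# "single block" here means consecutive loading lines are < 5 lines apart
single_block <- all(diff(pkg_lines) < 5)

# lines that call sessionInfo() or session_info()
session_lines <- grep("\\bsession_?[Ii]nfo\\(", toy_script)
```

Here the fixed path on line 3 is flagged, the library(lme4) call on line 8 breaks the single-block rule because it is six lines after the previous one, and no session info call is found. A relative path such as "../data/data_study_1.csv" would not be flagged, because the character after the quote is a dot.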

Put it All Together

Let’s put everything together in one block of code, and perform all automated checks for another open access paper in Psychological Science.

# Add this and the custom functions to a file called osf_file_check.R

osf_file_check <- function(paper) {
  links <- osf_links(paper)
  info <- osf_retrieve(links, recursive = TRUE)
  osf_files_summary <- summarize_contents(info)
  report <- osf_report(osf_files_summary)
  r_file_results <- check_r_files(osf_files_summary)  
  
  list(
    traffic_light = report$traffic_light,
    table = r_file_results,
    report = report$report,
    summary = osf_files_summary
  )
}

module_results <- module_run(psychsci$`0956797620955209`, "osf_file_check.R")
#> Starting OSF retrieval for 1 files...
#> * Retrieving info from k2dbf...
#> ...Main retrieval complete
#> Starting retrieval of children...
#> * Retrieving children for k2dbf...
#> * Retrieving files for k2dbf...
#> * Retrieving files for 5e344fb4f6631d013e5a48c9...
#> * Retrieving files for 5b88067b7b17570016f95389...
#> ...OSF retrieval complete!

module_report(module_results, header = 4) |> cat()

OSF File Check

⚠️ No README files were identified. A read me is best practice to facilitate re-use. If you have a README, please name it explicitly (e.g., README.txt or _readme.pdf).

✅ Data file(s) were detected. Great job making your research more transparent and reproducible!

✅ Code file(s) were detected. Great job making it easier to reproduce your results!

Learn more about reproducible data practices: https://www.projecttier.org/tier-protocol/

text	section	div	p	s	osf_id	name	osf_type	public	category	registration	preprint
osf.io/k2dbf	method	8	1	1	k2dbf	Preregistered replication of “Sick body, vigilant mind: The biological immune system activates the behavioral immune system”	nodes	TRUE	project	FALSE	TRUE
osf.io/k2dbf	funding	14	1	1	k2dbf	Preregistered replication of “Sick body, vigilant mind: The biological immune system activates the behavioral immune system”	nodes	TRUE	project	FALSE	TRUE
osf.io/k2dbf	funding	14	2	1	k2dbf	Preregistered replication of “Sick body, vigilant mind: The biological immune system activates the behavioral immune system”	nodes	TRUE	project	FALSE	TRUE


Future Developments

We have demonstrated a rather basic workflow that can automatically check files stored on the Open Science Framework, and all the checks demonstrated here can be made more accurate or complete. At the same time, even the current simple automatic checks might already facilitate re-use by prompting researchers to include information (e.g., a README) and to improve how files are named. There are many obvious ways to expand these automated checks. First, the example can be expanded to other commonly used data repositories, such as GitHub, Dataverse, etc. Second, the checks can be expanded beyond the properties that are automatically checked now. If you are an expert on code reproducibility or data re-use and would like to add checks, do reach out to us. Third, we can check for other types of files. For example, we are collaborating with Attila Simko, who is interested in identifying the files required to reproduce deep learning models in the medical imaging literature. We believe there will be many such field-dependent checks that can be automated, as the ability to automatically examine and/or retrieve files that are linked to in a paper should be useful for a large range of use-cases.

These examples were created using papercheck version 0.0.0.9045.

References

Ferguson, Joel, Rebecca Littman, Garret Christensen, Elizabeth Levy Paluck, Nicholas Swanson, Zenan Wang, Edward Miguel, David Birke, and John-Henry Pezzuto. 2023. “Survey of Open Science Practices and Attitudes in the Social Sciences.” Nature Communications 14 (1): 5401. https://doi.org/10.1038/s41467-023-41111-1.