Case studies¶
Gold standard¶
This dataset contains the 50 most recent articles from arxiv.org in the cs.LG and stat.ML categories, posted between 2022-10-24 and 2022-10-25 (a search that returned 570 results). We select articles that belong either to cs.LG alone or to both cs.LG and stat.ML.

"Repository evaluation" is performed on the articles that provide links to their code repositories, and "Paper evaluation" is performed on all 50 articles by parsing the .tex files from their corresponding arXiv links. reproscreener is evaluated on this gold_standard dataset and the results are shown below.
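For reference, a candidate pool like this can be pulled from the public arXiv Atom API. The following is a minimal sketch, not the exact query used to build the gold standard: the search_query filter and sort options are assumptions based on the description above.

# Hedged sketch: fetch the 50 most recent cs.LG + stat.ML submissions from the
# public arXiv API. The exact search_query used for the gold standard is assumed.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

params = urllib.parse.urlencode({
    "search_query": "cat:cs.LG AND cat:stat.ML",  # assumed category filter
    "sortBy": "submittedDate",
    "sortOrder": "descending",
    "start": 0,
    "max_results": 50,
})
with urllib.request.urlopen(f"http://export.arxiv.org/api/query?{params}") as resp:
    feed = ET.fromstring(resp.read())

ns = {"atom": "http://www.w3.org/2005/Atom"}
for entry in feed.findall("atom:entry", ns):
    arxiv_id = entry.find("atom:id", ns).text.rsplit("/", 1)[-1]
    title = " ".join(entry.find("atom:title", ns).text.split())
    print(arxiv_id, title)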
In [1]:
import pandas as pd
import numpy as np
from IPython.display import display
from pathlib import Path
import sys
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, cohen_kappa_score
from reproscreener.gold_standard import summary_table, tex_map_dict, repo_map_dict, prepare_pivot, compare_with_manual, split_parsed_readme
sys.path.append(str(Path.cwd().parent / "src/reproscreener"))  # fixed typo in path ("reproscrener")
from reproscreener.plots.repo_eval_heatmaps import prepare_repo_heatmap_df, plot_repo_heatmap, plot_repo_clustermap
from reproscreener.plots.tex_eval_heatmaps import prepare_tex_heatmap_df, plot_tex_heatmap
from reproscreener.repo_eval import get_all_repo_eval_dict
from reproscreener.tex_eval import get_all_tex_eval_dict
from reproscreener.gdrive_downloader import gdrive_get_manual_eval
from reproscreener.utils import reverse_mapping
In [2]:
path_repo = Path("../case-studies/arxiv-corpus/gold_standard/repo")
path_tex = Path("../case-studies/arxiv-corpus/gold_standard/source")
path_manual = Path("../case-studies/arxiv-corpus/manual_eval.csv")
manual_eval = gdrive_get_manual_eval(overwrite=False, manual_path=path_manual)
gold_standard_ids = manual_eval["paper"].unique()
Manual eval file already exists, use the overwrite flag to download
Repo evaluation¶
In [3]:
repo_evaluation_dict = get_all_repo_eval_dict(path_repo)
repo_heatmap_df = prepare_repo_heatmap_df(repo_evaluation_dict, gold_standard_ids)
plot_repo_heatmap(repo_heatmap_df, filename="heatmap_repo_eval.png", path_plots=None, sort_x=True, sort_y=True)
In [4]:
plot_repo_clustermap(repo_heatmap_df, filename="clustermap_repo_eval.png", path_plots=None)
In [5]:
repo_heatmap_df.head(10).drop(columns=["Display_Label"])
Out[5]:
| | Paper_ID | Matched_File | Category |
|---|---|---|---|
| 0 | 1606.04671 | Code provided but no matches | Others |
| 1 | 1903.09668 | readme_dependencies | Parsed Readme |
| 2 | 1904.10554 | Code provided but no matches | Others |
| 3 | 1908.05659 | requirements.txt | Dependencies |
| 4 | 1908.05659 | readme_install | Parsed Readme |
| 5 | 1909.00931 | Code provided but no matches | Others |
| 6 | 1911.03867 | environment.yml | Dependencies |
| 7 | 1911.03867 | requirements.txt | Dependencies |
| 8 | 1911.03867 | readme_requirements | Parsed Readme |
| 9 | 2002.05905 | Code provided but no matches | Others |
In [6]:
number_of_papers = len(repo_heatmap_df["Paper_ID"].unique())
print(f"Total number of papers in the gold standard: {len(gold_standard_ids)}")
Total number of papers in the gold standard: 50
In [7]:
summary_table(repo_heatmap_df, "Matched_File", number_of_papers)
Out[7]:
| Matched_File | Reproscreener_Article_Count | Reproscreener_Percentage |
|---|---|---|
| No code provided | 28 | 56.00% |
| Code provided but no matches | 9 | 18.00% |
| requirements.txt | 6 | 12.00% |
| readme_install | 4 | 8.00% |
| readme_requirements | 3 | 6.00% |
| readme_setup | 3 | 6.00% |
| readme_dependencies | 2 | 4.00% |
| environment.yml | 1 | 2.00% |
| conda_reqs.txt | 1 | 2.00% |
| pip_reqs.txt | 1 | 2.00% |
| run_experiments.py | 1 | 2.00% |
| main.py | 1 | 2.00% |
The variables are grouped by the following categories, defined in reverse_mapping:

- Dependencies: Files related to the dependencies of the repository.
- Wrapper Scripts: Files that combine various stages of the workflow.
- Parsed Readme: Headers present in the README file of the repository that provide instructions about the code/data.
- Others: Contains No code provided or Code provided but no matches. The latter is used when code is provided but no files from any of the other categories were found in the repository.
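As an illustration of how this grouping works, a mapping like reverse_mapping can be applied with pandas. This is a sketch only; the actual grouping happens inside reproscreener's helpers:

# Sketch only: derive the Category column by mapping each matched file
# through the reverse_mapping dict (displayed below).
repo_heatmap_df["Category"] = repo_heatmap_df["Matched_File"].map(reverse_mapping)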
In [8]:
reverse_mapping_df = pd.DataFrame.from_dict(reverse_mapping, orient='index', columns=['Category'])
reverse_mapping_df.index.name = 'Matched_File'
reverse_mapping_df
Out[8]:
| Matched_File | Category |
|---|---|
| requirements.txt | Dependencies |
| setup.py | Dependencies |
| environment.yml | Dependencies |
| pyproject.toml | Dependencies |
| pip_reqs.txt | Dependencies |
| conda_reqs.txt | Dependencies |
| run.py | Wrapper Scripts |
| run.sh | Wrapper Scripts |
| main.py | Wrapper Scripts |
| main.sh | Wrapper Scripts |
| run_all.py | Wrapper Scripts |
| run_all.sh | Wrapper Scripts |
| run_experiments.py | Wrapper Scripts |
| run_experiments.sh | Wrapper Scripts |
| readme_requirements | Parsed Readme |
| readme_dependencies | Parsed Readme |
| readme_setup | Parsed Readme |
| readme_install | Parsed Readme |
| No code provided | Others |
| Code provided but no matches | Others |
In [9]:
summary_table(repo_heatmap_df, "Category", number_of_papers)
Out[9]:
| Category | Reproscreener_Article_Count | Reproscreener_Percentage |
|---|---|---|
| Others | 37 | 74.00% |
| Parsed Readme | 12 | 24.00% |
| Dependencies | 9 | 18.00% |
| Wrapper Scripts | 2 | 4.00% |
In [10]:
no_code_provided_counts = len(repo_heatmap_df[repo_heatmap_df["Matched_File"] == "No code provided"])
code_provided_counts = number_of_papers - no_code_provided_counts
code_provided_percentage = (code_provided_counts / number_of_papers) * 100
print(f"{code_provided_counts}/{number_of_papers} ({code_provided_percentage:.2f}%) of the papers have provided some code")
22/50 (44.00%) of the papers have provided some code
Tex evaluation¶
In [11]:
tex_evaluation_dict = get_all_tex_eval_dict(path_tex)
tex_heatmap_df = prepare_tex_heatmap_df(tex_evaluation_dict, gold_standard_ids)
In [12]:
plot_tex_heatmap(tex_heatmap_df, filename="heatmap_tex_eval.png", path_plots=None, sort_x=True, sort_y=True)
In [13]:
tex_heatmap_df.head(10)
Out[13]:
| | Paper_ID | Found_Variable |
|---|---|---|
| 0 | 1606.04671 | Research questions |
| 1 | 1606.04671 | Research method |
| 2 | 1606.04671 | Experimental setup |
| 3 | 1606.04671 | Research problem |
| 4 | 1606.04671 | Prediction |
| 5 | 1606.04671 | Training data |
| 6 | 1606.04671 | Hypothesis |
| 7 | 1606.04671 | Objective/Goal |
| 8 | 1903.09668 | Research questions |
| 9 | 1903.09668 | Research method |
In [14]:
summary_table(tex_heatmap_df, "Found_Variable", number_of_papers)
Out[14]:
| Found_Variable | Reproscreener_Article_Count | Reproscreener_Percentage |
|---|---|---|
| Research questions | 44 | 88.00% |
| Research problem | 44 | 88.00% |
| Research method | 43 | 86.00% |
| Objective/Goal | 39 | 78.00% |
| Prediction | 34 | 68.00% |
| Method source code | 23 | 46.00% |
| Hypothesis | 21 | 42.00% |
| Training data | 18 | 36.00% |
| Experimental setup | 15 | 30.00% |
| Test data | 7 | 14.00% |
| Pseudocode | 6 | 12.00% |
| Validation data | 2 | 4.00% |
| No variables found | 1 | 2.00% |
Comparison with manual evaluation¶
Repo evaluation comparison¶
In [15]:
manual_eval = split_parsed_readme(manual_eval, 'parsed_readme')
manual_eval.rename(columns=repo_map_dict, inplace=True)
manual_eval.rename(columns={"paper": "Paper_ID"}, inplace=True)
manual_eval.head()
manual_eval.columns
Out[15]:
Index(['Paper_ID', 'Unnamed: 1', 'paper_url', 'notes', 'empirical_dataset', 'code_avail_article', 'code_avail_article_desc', 'code_avail_url', 'pwc_link_avail', 'pwc_link_match', 'pwc_link_desc', 'result_replication_code_avail', 'code_language', 'package', 'wrapper_scripts', 'wrapper_scripts_desc', 'hardware_specifications', 'software_dependencies', 'software_dependencies_desc', 'will_it_reproduce', 'will_it_reproduce_desc', 'parsed_readme', 'problem', 'problem_desc', 'objective', 'objective_desc', 'research_method', 'research_method_desc', 'research_questions', 'research_questions_desc', 'pseudocode', 'pseudocode_desc', 'dataset', 'dataset_desc', 'hypothesis', 'hypothesis_desc', 'prediction', 'experiment_setup', 'experiment_setup_desc', 'nan', 'readme_dependencies', 'readme_install', 'readme_requirements', 'readme_setup'], dtype='object')
In [16]:
repo_heatmap_pivot = prepare_pivot(repo_heatmap_df, 'Paper_ID', repo_map_dict, var_column='Category', match_column='Matched_File')
auto_eval_df = repo_heatmap_pivot.copy()
auto_eval_df.columns = [f"{col}_reproscreener" if col != "Paper_ID" else col for col in auto_eval_df.columns]
manual_eval_df = manual_eval.copy()
manual_eval_df.columns = [f"{col}_manual" if col != "Paper_ID" else col for col in manual_eval_df.columns]
compare_with_manual(auto_eval_df, manual_eval_df, repo_map_dict)
Out[16]:
| | Variable | False_Positives | False_Negatives | Total_Mistakes | Reproscreener_Found | Manual_Found |
|---|---|---|---|---|---|---|
| 0 | Dependencies | 5 | 1 | 6 | 7.0 | 14.0 |
| 0 | Wrapper Scripts | 2 | 3 | 5 | 2.0 | 18.0 |
| 0 | Parsed Readme - Requirements | 3 | 1 | 4 | 3.0 | 2.0 |
| 0 | Parsed Readme - Dependencies | 2 | 1 | 3 | 2.0 | 9.0 |
| 0 | Parsed Readme - Setup | 3 | 0 | 3 | 3.0 | 2.0 |
| 0 | Parsed Readme - Install | 4 | 0 | 4 | 4.0 | 3.0 |
- Where n = 50 for Reproscreener_Article_Count and Manual_Article_Count
- False positives - Reproscreener found something that wasn't manually found
- False negatives - Reproscreener didn't find something that was manually found
- Total mistakes - False positives + False negatives (see the sketch below)
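As a rough illustration of how these counts relate (not the compare_with_manual implementation itself), with aligned boolean flags for a single variable:

# Hypothetical sketch: false positives / negatives from aligned boolean flags.
import pandas as pd

auto = pd.Series([True, True, False, False, True])     # reproscreener found it
manual = pd.Series([True, False, False, True, False])  # manually found

false_positives = (auto & ~manual).sum()  # auto found, manual did not -> 2
false_negatives = (~auto & manual).sum()  # manual found, auto missed  -> 1
total_mistakes = false_positives + false_negatives  # -> 3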
In [17]:
compare_with_manual(auto_eval_df, manual_eval_df, repo_map_dict, output_format="percent")
Out[17]:
| | Variable | False_Positives | False_Negatives | Total_Mistakes | Reproscreener_Found | Manual_Found |
|---|---|---|---|---|---|---|
| 0 | Dependencies | 10.0 | 2.0 | 12.0 | 14.0 | 28.0 |
| 0 | Wrapper Scripts | 4.0 | 6.0 | 10.0 | 4.0 | 36.0 |
| 0 | Parsed Readme - Requirements | 6.0 | 2.0 | 8.0 | 6.0 | 4.0 |
| 0 | Parsed Readme - Dependencies | 4.0 | 2.0 | 6.0 | 4.0 | 18.0 |
| 0 | Parsed Readme - Setup | 6.0 | 0.0 | 6.0 | 6.0 | 4.0 |
| 0 | Parsed Readme - Install | 8.0 | 0.0 | 8.0 | 8.0 | 6.0 |
Tex evaluation comparison¶
In [18]:
tex_heatmap_pivot = prepare_pivot(tex_heatmap_df, 'Paper_ID', tex_map_dict, var_column='Found_Variable')
auto_eval_df = tex_heatmap_pivot.copy()
auto_eval_df.columns = [f"{col}_reproscreener" if col != "Paper_ID" else col for col in auto_eval_df.columns]
manual_eval_df = manual_eval.copy()
manual_eval_df.columns = [f"{col}_manual" if col != "Paper_ID" else col for col in manual_eval_df.columns]
compare_with_manual(auto_eval_df, manual_eval_df, tex_map_dict)
Out[18]:
| | Variable | False_Positives | False_Negatives | Total_Mistakes | Reproscreener_Found | Manual_Found |
|---|---|---|---|---|---|---|
| 0 | Research questions | 41 | 0 | 41 | 44.0 | 3.0 |
| 0 | Research problem | 30 | 1 | 31 | 44.0 | 15.0 |
| 0 | Research method | 34 | 1 | 35 | 43.0 | 10.0 |
| 0 | Objective/Goal | 35 | 0 | 35 | 39.0 | 4.0 |
| 0 | Prediction | 34 | 0 | 34 | 34.0 | 0.0 |
| 0 | Method source code | 5 | 4 | 9 | 23.0 | 22.0 |
| 0 | Hypothesis | 16 | 3 | 19 | 21.0 | 8.0 |
| 0 | Training data | 6 | 19 | 25 | 18.0 | 31.0 |
| 0 | Experimental setup | 0 | 22 | 22 | 15.0 | 37.0 |
- Where n = 50 for Reproscreener_Article_Count and Manual_Article_Count
- False positives - Reproscreener found something that wasn't manually found
- False negatives - Reproscreener didn't find something that was manually found
- Total mistakes - False positives + False negatives
In [19]:
compare_with_manual(auto_eval_df, manual_eval_df, tex_map_dict, output_format="percent")
Out[19]:
| | Variable | False_Positives | False_Negatives | Total_Mistakes | Reproscreener_Found | Manual_Found |
|---|---|---|---|---|---|---|
| 0 | Research questions | 82.0 | 0.0 | 82.0 | 88.0 | 6.0 |
| 0 | Research problem | 60.0 | 2.0 | 62.0 | 88.0 | 30.0 |
| 0 | Research method | 68.0 | 2.0 | 70.0 | 86.0 | 20.0 |
| 0 | Objective/Goal | 70.0 | 0.0 | 70.0 | 78.0 | 8.0 |
| 0 | Prediction | 68.0 | 0.0 | 68.0 | 68.0 | 0.0 |
| 0 | Method source code | 10.0 | 8.0 | 18.0 | 46.0 | 44.0 |
| 0 | Hypothesis | 32.0 | 6.0 | 38.0 | 42.0 | 16.0 |
| 0 | Training data | 12.0 | 38.0 | 50.0 | 36.0 | 62.0 |
| 0 | Experimental setup | 0.0 | 44.0 | 44.0 | 30.0 | 74.0 |