Case studies¶
Gold standard¶
This dataset contains the 50 most recent articles from arxiv.org in the cs.LG and stat.ML categories, posted between 2022-10-24 and 2022-10-25 (a search that returned 570 results). We select articles that belong either to cs.LG alone or to both cs.LG and stat.ML.

"Repository evaluation" is performed on the articles that provide links to their code repositories, and "Paper evaluation" is performed on all 50 articles by parsing the .tex files from their corresponding arXiv links. reproscreener is evaluated on this gold_standard dataset and the results are shown below.
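For reference, a candidate pool like this can be pulled from the public arXiv Atom API. The following is a minimal sketch, not the exact query used to build the gold standard: the search_query filter and sort options are assumptions based on the description above.

# Hedged sketch: fetch the 50 most recent cs.LG + stat.ML submissions from the
# public arXiv API. The exact search_query used for the gold standard is assumed.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

params = urllib.parse.urlencode({
    "search_query": "cat:cs.LG AND cat:stat.ML",  # assumed category filter
    "sortBy": "submittedDate",
    "sortOrder": "descending",
    "start": 0,
    "max_results": 50,
})
with urllib.request.urlopen(f"http://export.arxiv.org/api/query?{params}") as resp:
    feed = ET.fromstring(resp.read())

ns = {"atom": "http://www.w3.org/2005/Atom"}
for entry in feed.findall("atom:entry", ns):
    arxiv_id = entry.find("atom:id", ns).text.rsplit("/", 1)[-1]
    title = " ".join(entry.find("atom:title", ns).text.split())
    print(arxiv_id, title)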
In [1]:
import pandas as pd
import numpy as np
from IPython.display import display
from pathlib import Path
import sys
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, cohen_kappa_score
from reproscreener.gold_standard import summary_table, tex_map_dict, repo_map_dict, prepare_pivot, compare_with_manual, split_parsed_readme
sys.path.append(str(Path.cwd().parent / "src/reproscreener"))  # fixed typo in path ("reproscrener")
from reproscreener.plots.repo_eval_heatmaps import prepare_repo_heatmap_df, plot_repo_heatmap, plot_repo_clustermap
from reproscreener.plots.tex_eval_heatmaps import prepare_tex_heatmap_df, plot_tex_heatmap
from reproscreener.repo_eval import get_all_repo_eval_dict
from reproscreener.tex_eval import get_all_tex_eval_dict
from reproscreener.gdrive_downloader import gdrive_get_manual_eval
from reproscreener.utils import reverse_mapping
In [2]:
path_repo = Path("../case-studies/arxiv-corpus/gold_standard/repo")
path_tex = Path("../case-studies/arxiv-corpus/gold_standard/source")
path_manual = Path("../case-studies/arxiv-corpus/manual_eval.csv")
manual_eval = gdrive_get_manual_eval(overwrite=False, manual_path=path_manual)
gold_standard_ids = manual_eval["paper"].unique()
Manual eval file already exists, use the overwrite flag to download
Repo evaluation¶
In [3]:
repo_evaluation_dict = get_all_repo_eval_dict(path_repo)
repo_heatmap_df = prepare_repo_heatmap_df(repo_evaluation_dict, gold_standard_ids)
plot_repo_heatmap(repo_heatmap_df, filename="heatmap_repo_eval.png", path_plots=None, sort_x=True, sort_y=True)
In [4]:
plot_repo_clustermap(repo_heatmap_df, filename="clustermap_repo_eval.png", path_plots=None)
In [5]:
repo_heatmap_df.head(10).drop(columns=["Display_Label"])
Out[5]:
| | Paper_ID | Matched_File | Category |
|---|---|---|---|
| 0 | 1606.04671 | Code provided but no matches | Others |
| 1 | 1903.09668 | readme_dependencies | Parsed Readme |
| 2 | 1904.10554 | Code provided but no matches | Others |
| 3 | 1908.05659 | requirements.txt | Dependencies |
| 4 | 1908.05659 | readme_install | Parsed Readme |
| 5 | 1909.00931 | Code provided but no matches | Others |
| 6 | 1911.03867 | environment.yml | Dependencies |
| 7 | 1911.03867 | requirements.txt | Dependencies |
| 8 | 1911.03867 | readme_requirements | Parsed Readme |
| 9 | 2002.05905 | Code provided but no matches | Others |
In [6]:
number_of_papers = len(repo_heatmap_df["Paper_ID"].unique())
print(f"Total number of papers in the gold standard: {len(gold_standard_ids)}")
Total number of papers in the gold standard: 50
In [7]:
summary_table(repo_heatmap_df, "Matched_File", number_of_papers)
Out[7]:
| Matched_File | Reproscreener_Article_Count | Reproscreener_Percentage |
|---|---|---|
| No code provided | 28 | 56.00% |
| Code provided but no matches | 9 | 18.00% |
| requirements.txt | 6 | 12.00% |
| readme_install | 4 | 8.00% |
| readme_requirements | 3 | 6.00% |
| readme_setup | 3 | 6.00% |
| readme_dependencies | 2 | 4.00% |
| environment.yml | 1 | 2.00% |
| conda_reqs.txt | 1 | 2.00% |
| pip_reqs.txt | 1 | 2.00% |
| run_experiments.py | 1 | 2.00% |
| main.py | 1 | 2.00% |
The variables are grouped by the following categories, defined in reverse_mapping:

- Dependencies: Files related to the dependencies of the repository.
- Wrapper Scripts: Files that combine various stages of the workflow.
- Parsed Readme: Headers present in the README file of the repository that provide instructions about the code/data.
- Others: Contains No code provided or Code provided but no matches. The latter is used when code is provided but no files from any of the other categories were found in the repository.
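As an illustration of how this grouping works, a mapping like reverse_mapping can be applied with pandas. This is a sketch only; the actual grouping happens inside reproscreener's helpers:

# Sketch only: derive the Category column by mapping each matched file
# through the reverse_mapping dict (displayed below).
repo_heatmap_df["Category"] = repo_heatmap_df["Matched_File"].map(reverse_mapping)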
In [8]:
reverse_mapping_df = pd.DataFrame.from_dict(reverse_mapping, orient='index', columns=['Category'])
reverse_mapping_df.index.name = 'Matched_File'
reverse_mapping_df
Out[8]:
| Matched_File | Category |
|---|---|
| requirements.txt | Dependencies |
| setup.py | Dependencies |
| environment.yml | Dependencies |
| pyproject.toml | Dependencies |
| pip_reqs.txt | Dependencies |
| conda_reqs.txt | Dependencies |
| run.py | Wrapper Scripts |
| run.sh | Wrapper Scripts |
| main.py | Wrapper Scripts |
| main.sh | Wrapper Scripts |
| run_all.py | Wrapper Scripts |
| run_all.sh | Wrapper Scripts |
| run_experiments.py | Wrapper Scripts |
| run_experiments.sh | Wrapper Scripts |
| readme_requirements | Parsed Readme |
| readme_dependencies | Parsed Readme |
| readme_setup | Parsed Readme |
| readme_install | Parsed Readme |
| No code provided | Others |
| Code provided but no matches | Others |
In [9]:
summary_table(repo_heatmap_df, "Category", number_of_papers)
Out[9]:
| Category | Reproscreener_Article_Count | Reproscreener_Percentage |
|---|---|---|
| Others | 37 | 74.00% |
| Parsed Readme | 12 | 24.00% |
| Dependencies | 9 | 18.00% |
| Wrapper Scripts | 2 | 4.00% |
In [10]:
no_code_provided_counts = len(repo_heatmap_df[repo_heatmap_df["Matched_File"] == "No code provided"])
code_provided_counts = number_of_papers - no_code_provided_counts
code_provided_percentage = (code_provided_counts / number_of_papers) * 100
print(f"{code_provided_counts}/{number_of_papers} ({code_provided_percentage:.2f}%) of the papers have provided some code")
22/50 (44.00%) of the papers have provided some code
Tex evaluation¶
In [11]:
tex_evaluation_dict = get_all_tex_eval_dict(path_tex)
tex_heatmap_df = prepare_tex_heatmap_df(tex_evaluation_dict, gold_standard_ids)
In [12]:
plot_tex_heatmap(tex_heatmap_df, filename="heatmap_tex_eval.png", path_plots=None, sort_x=True, sort_y=True)
In [13]:
tex_heatmap_df.head(10)
Out[13]:
| | Paper_ID | Found_Variable |
|---|---|---|
| 0 | 1606.04671 | Research questions |
| 1 | 1606.04671 | Research method |
| 2 | 1606.04671 | Experimental setup |
| 3 | 1606.04671 | Research problem |
| 4 | 1606.04671 | Prediction |
| 5 | 1606.04671 | Training data |
| 6 | 1606.04671 | Hypothesis |
| 7 | 1606.04671 | Objective/Goal |
| 8 | 1903.09668 | Research questions |
| 9 | 1903.09668 | Research method |
In [14]:
summary_table(tex_heatmap_df, "Found_Variable", number_of_papers)
Out[14]:
| Found_Variable | Reproscreener_Article_Count | Reproscreener_Percentage |
|---|---|---|
| Research questions | 44 | 88.00% |
| Research problem | 44 | 88.00% |
| Research method | 43 | 86.00% |
| Objective/Goal | 39 | 78.00% |
| Prediction | 34 | 68.00% |
| Method source code | 23 | 46.00% |
| Hypothesis | 21 | 42.00% |
| Training data | 18 | 36.00% |
| Experimental setup | 15 | 30.00% |
| Test data | 7 | 14.00% |
| Pseudocode | 6 | 12.00% |
| Validation data | 2 | 4.00% |
| No variables found | 1 | 2.00% |
Comparison with manual evaluation¶
Repo evaluation comparison¶
In [15]:
manual_eval = split_parsed_readme(manual_eval, 'parsed_readme')
manual_eval.rename(columns=repo_map_dict, inplace=True)
manual_eval.rename(columns={"paper": "Paper_ID"}, inplace=True)
manual_eval.head()
manual_eval.columns
Out[15]:
Index(['Paper_ID', 'Unnamed: 1', 'paper_url', 'notes', 'empirical_dataset', 'code_avail_article', 'code_avail_article_desc', 'code_avail_url', 'pwc_link_avail', 'pwc_link_match', 'pwc_link_desc', 'result_replication_code_avail', 'code_language', 'package', 'wrapper_scripts', 'wrapper_scripts_desc', 'hardware_specifications', 'software_dependencies', 'software_dependencies_desc', 'will_it_reproduce', 'will_it_reproduce_desc', 'parsed_readme', 'problem', 'problem_desc', 'objective', 'objective_desc', 'research_method', 'research_method_desc', 'research_questions', 'research_questions_desc', 'pseudocode', 'pseudocode_desc', 'dataset', 'dataset_desc', 'hypothesis', 'hypothesis_desc', 'prediction', 'experiment_setup', 'experiment_setup_desc', 'nan', 'readme_dependencies', 'readme_install', 'readme_requirements', 'readme_setup'], dtype='object')
In [16]:
repo_heatmap_pivot = prepare_pivot(repo_heatmap_df, 'Paper_ID', repo_map_dict, var_column='Category', match_column='Matched_File')
auto_eval_df = repo_heatmap_pivot.copy()
auto_eval_df.columns = [f"{col}_reproscreener" if col != "Paper_ID" else col for col in auto_eval_df.columns]
manual_eval_df = manual_eval.copy()
manual_eval_df.columns = [f"{col}_manual" if col != "Paper_ID" else col for col in manual_eval_df.columns]
compare_with_manual(auto_eval_df, manual_eval_df, repo_map_dict)
Out[16]:
| | Variable | False_Positives | False_Negatives | Total_Mistakes | Reproscreener_Found | Manual_Found |
|---|---|---|---|---|---|---|
| 0 | Dependencies | 5 | 1 | 6 | 7.0 | 14.0 |
| 0 | Wrapper Scripts | 2 | 3 | 5 | 2.0 | 18.0 |
| 0 | Parsed Readme - Requirements | 3 | 1 | 4 | 3.0 | 2.0 |
| 0 | Parsed Readme - Dependencies | 2 | 1 | 3 | 2.0 | 9.0 |
| 0 | Parsed Readme - Setup | 3 | 0 | 3 | 3.0 | 2.0 |
| 0 | Parsed Readme - Install | 4 | 0 | 4 | 4.0 | 3.0 |
- Where n = 50 for Reproscreener_Article_Count and Manual_Article_Count
- False positives - Reproscreener found something that wasn't manually found
- False negatives - Reproscreener didn't find something that was manually found
- Total mistakes - False positives + False negatives (see the sketch below)
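As a rough illustration of how these counts relate (not the compare_with_manual implementation itself), with aligned boolean flags for a single variable:

# Hypothetical sketch: false positives / negatives from aligned boolean flags.
import pandas as pd

auto = pd.Series([True, True, False, False, True])     # reproscreener found it
manual = pd.Series([True, False, False, True, False])  # manually found

false_positives = (auto & ~manual).sum()  # auto found, manual did not -> 2
false_negatives = (~auto & manual).sum()  # manual found, auto missed  -> 1
total_mistakes = false_positives + false_negatives  # -> 3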
In [17]:
compare_with_manual(auto_eval_df, manual_eval_df, repo_map_dict, output_format="percent")
Out[17]:
| | Variable | False_Positives | False_Negatives | Total_Mistakes | Reproscreener_Found | Manual_Found |
|---|---|---|---|---|---|---|
| 0 | Dependencies | 10.0 | 2.0 | 12.0 | 14.0 | 28.0 |
| 0 | Wrapper Scripts | 4.0 | 6.0 | 10.0 | 4.0 | 36.0 |
| 0 | Parsed Readme - Requirements | 6.0 | 2.0 | 8.0 | 6.0 | 4.0 |
| 0 | Parsed Readme - Dependencies | 4.0 | 2.0 | 6.0 | 4.0 | 18.0 |
| 0 | Parsed Readme - Setup | 6.0 | 0.0 | 6.0 | 6.0 | 4.0 |
| 0 | Parsed Readme - Install | 8.0 | 0.0 | 8.0 | 8.0 | 6.0 |
Tex evaluation comparison¶
In [18]:
tex_heatmap_pivot = prepare_pivot(tex_heatmap_df, 'Paper_ID', tex_map_dict, var_column='Found_Variable')
auto_eval_df = tex_heatmap_pivot.copy()
auto_eval_df.columns = [f"{col}_reproscreener" if col != "Paper_ID" else col for col in auto_eval_df.columns]
manual_eval_df = manual_eval.copy()
manual_eval_df.columns = [f"{col}_manual" if col != "Paper_ID" else col for col in manual_eval_df.columns]
compare_with_manual(auto_eval_df, manual_eval_df, tex_map_dict)
Out[18]:
| | Variable | False_Positives | False_Negatives | Total_Mistakes | Reproscreener_Found | Manual_Found |
|---|---|---|---|---|---|---|
| 0 | Research questions | 41 | 0 | 41 | 44.0 | 3.0 |
| 0 | Research problem | 30 | 1 | 31 | 44.0 | 15.0 |
| 0 | Research method | 34 | 1 | 35 | 43.0 | 10.0 |
| 0 | Objective/Goal | 35 | 0 | 35 | 39.0 | 4.0 |
| 0 | Prediction | 34 | 0 | 34 | 34.0 | 0.0 |
| 0 | Method source code | 5 | 4 | 9 | 23.0 | 22.0 |
| 0 | Hypothesis | 16 | 3 | 19 | 21.0 | 8.0 |
| 0 | Training data | 6 | 19 | 25 | 18.0 | 31.0 |
| 0 | Experimental setup | 0 | 22 | 22 | 15.0 | 37.0 |
- Where n = 50 for Reproscreener_Article_Count and Manual_Article_Count
- False positives - Reproscreener found something that wasn't manually found
- False negatives - Reproscreener didn't find something that was manually found
- Total mistakes - False positives + False negatives
In [19]:
compare_with_manual(auto_eval_df, manual_eval_df, tex_map_dict, output_format="percent")
Out[19]:
| | Variable | False_Positives | False_Negatives | Total_Mistakes | Reproscreener_Found | Manual_Found |
|---|---|---|---|---|---|---|
| 0 | Research questions | 82.0 | 0.0 | 82.0 | 88.0 | 6.0 |
| 0 | Research problem | 60.0 | 2.0 | 62.0 | 88.0 | 30.0 |
| 0 | Research method | 68.0 | 2.0 | 70.0 | 86.0 | 20.0 |
| 0 | Objective/Goal | 70.0 | 0.0 | 70.0 | 78.0 | 8.0 |
| 0 | Prediction | 68.0 | 0.0 | 68.0 | 68.0 | 0.0 |
| 0 | Method source code | 10.0 | 8.0 | 18.0 | 46.0 | 44.0 |
| 0 | Hypothesis | 32.0 | 6.0 | 38.0 | 42.0 | 16.0 |
| 0 | Training data | 12.0 | 38.0 | 50.0 | 36.0 | 62.0 |
| 0 | Experimental setup | 0.0 | 44.0 | 44.0 | 30.0 | 74.0 |