Architecture

`tex_eval` module

The tex_eval module evaluates the .tex files extracted from the arXiv source tarball of a paper.

`combine_tex_in_folder(folder_path)`

Combine all .tex files in a given directory into a single file.

Parameters:

Name	Type	Description	Default
`folder_path`	`Path`	Path to the directory containing .tex files.	required

Returns:

Name	Type	Description
`Path`	`Path`	Path to the combined .tex file.
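As a rough illustration of the combining step, the sketch below concatenates every .tex file under a folder into one output file. The function name `combine_tex_sketch`, the output filename `combined.tex`, the recursive search, and the sorted ordering are all assumptions made for this example; the actual `combine_tex_in_folder` may behave differently.

```python
from pathlib import Path


def combine_tex_sketch(folder_path: Path, out_name: str = "combined.tex") -> Path:
    """Concatenate every .tex file under folder_path into a single file.

    Hypothetical stand-in for combine_tex_in_folder; out_name is an assumption.
    """
    combined_path = folder_path / out_name
    # Collect inputs before opening the output, sorted for a deterministic
    # order, and skip an output file left over from a previous run.
    tex_files = sorted(p for p in folder_path.rglob("*.tex") if p != combined_path)
    with combined_path.open("w", encoding="utf-8") as out:
        for tex_file in tex_files:
            # errors="ignore" tolerates the mixed encodings found in older sources.
            out.write(tex_file.read_text(encoding="utf-8", errors="ignore"))
            out.write("\n")
    return combined_path
```

Excluding the output file from the inputs keeps repeated runs from appending the previous result to itself.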

`evaluate_paper(tex_folder_path, paper_id)`

Evaluate a paper by extracting variables and URLs from its tex files.

Parameters:

Name	Type	Description	Default
`tex_folder_path`	`Path`	Path to the directory containing the paper's .tex files.	required
`paper_id`	`str`	ID of the paper.	required

Returns:

Type	Description
`DataFrame`	A DataFrame with the evaluation results.

`evaluate_papers(path_corpus, evaluation_dict)`

Evaluate a list of papers by extracting variables and URLs from their tex files.

Parameters:

Name	Type	Description	Default
`path_corpus`	`Path`	Path to the directory containing all papers.	required
`evaluation_dict`	`dict`	Dictionary to store the evaluation results.	required

Returns:

Name	Type	Description
`dict`	`dict`	Dictionary with the evaluation results.

`extract_tex_urls(combined_path)`

Extract URLs from the combined tex file.

Parameters:

Name	Type	Description	Default
`combined_path`	`Path`	Path to the combined .tex file.	required

Returns:

Type	Description
`Set[str]`	Set of found URLs.
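One plausible way to pull URLs out of a combined .tex file is a regular expression over the raw text. The sketch below is an assumption, not the module's confirmed implementation: the pattern, the trailing-punctuation cleanup, and the name `extract_urls_sketch` are all hypothetical.

```python
import re
from pathlib import Path
from typing import Set

# Assumed pattern: an http(s) URL runs until whitespace or a LaTeX/markup
# delimiter such as a brace or backslash.
URL_PATTERN = re.compile(r"https?://[^\s{}\\<>\"]+")


def extract_urls_sketch(combined_path: Path) -> Set[str]:
    """Return the set of URLs found in the file (hypothetical helper)."""
    text = combined_path.read_text(encoding="utf-8", errors="ignore")
    # Strip punctuation that commonly trails a URL in running prose.
    return {match.rstrip(".,;)") for match in URL_PATTERN.findall(text)}
```

Because the pattern stops at a backslash, URLs containing LaTeX-escaped characters such as `\_` would be truncated by this sketch.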

`find_data_repository_links(url_list, allowed_domains=['github', 'gitlab', 'zenodo'])`

Find URLs belonging to the allowed domains (default: github, gitlab, zenodo).

Parameters:

Name	Type	Description	Default
`url_list`	`Set[str]`	Set of URLs to process.	required
`allowed_domains`	`List[str]`	List of allowed domain names. Defaults to ["github", "gitlab", "zenodo"].	`['github', 'gitlab', 'zenodo']`

Returns:

Type	Description
`List[str]`	List of found URLs belonging to allowed domains.
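The domain filter can be sketched as a substring check on each URL's host name. This is a minimal illustration under assumptions: the name `filter_repo_links_sketch`, matching against the host only, and the sorted return order are choices made here, not documented behavior.

```python
from typing import Iterable, List, Optional
from urllib.parse import urlparse


def filter_repo_links_sketch(
    url_list: Iterable[str],
    allowed_domains: Optional[List[str]] = None,
) -> List[str]:
    """Keep URLs whose host contains one of the allowed domain names."""
    if allowed_domains is None:
        allowed_domains = ["github", "gitlab", "zenodo"]
    found = []
    # Sorting gives a stable result even though the input is a set.
    for url in sorted(url_list):
        host = urlparse(url).netloc.lower()
        if any(domain in host for domain in allowed_domains):
            found.append(url)
    return found
```

Checking the host rather than the whole URL avoids false positives such as `https://example.org/github`.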

`find_tex_variables(combined_path)`

Find variables in the combined tex file. Uses the KeywordProcessor from flashtext package to extract variables.

Parameters:

Name	Type	Description	Default
`combined_path`	`Path`	Path to the combined .tex file.	required

Returns:

Type	Description
`Set[str]`	Set of found variables.
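To show the idea of keyword-based variable detection without requiring extra packages, the sketch below swaps flashtext's KeywordProcessor for the standard-library `re` module. The keyword mapping is purely hypothetical and does not reproduce the module's real keyword list; the function also takes text directly rather than a file path.

```python
import re
from typing import Dict, List, Set

# Hypothetical mapping from a variable name to the phrases that signal it.
EXAMPLE_KEYWORDS: Dict[str, List[str]] = {
    "code": ["source code", "code available"],
    "dataset": ["dataset", "data set"],
}


def find_variables_sketch(text: str, keywords: Dict[str, List[str]]) -> Set[str]:
    """Return the variable names whose phrases occur in text."""
    found = set()
    for variable, phrases in keywords.items():
        for phrase in phrases:
            # \b keeps "dataset" from matching inside a longer word.
            pattern = rf"\b{re.escape(phrase)}\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.add(variable)
                break  # one matching phrase is enough for this variable
    return found
```

For example, `find_variables_sketch("Our Source Code is online.", EXAMPLE_KEYWORDS)` matches the phrase "source code" case-insensitively and reports the `code` variable.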

`get_all_tex_eval_dict(path_corpus)`

Evaluates all papers in the given corpus and returns a dictionary of evaluation data.

Parameters:

Name	Type	Description	Default
`path_corpus`	`Path`	A Path object representing the path to the corpus of papers to evaluate.	required

Returns:

Type	Description
`dict`	A dictionary where the keys are paper IDs and the values are DataFrames containing the tex_eval results.

`paper_evaluation_results(paper_id, found_vars, found_links, title='No title found')`

Create a rich Panel with the results of the paper evaluation.

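Parameters:

Name	Type	Description	Default
`paper_id`	`str`	ID of the paper.	required
`found_vars`	`Set[str]`	Set of found variables.	required
`found_links`	`List[str]`	List of found URLs.	required
`title`	`str`	Title of the paper. Defaults to "No title found".	`'No title found'`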

`repo_eval` module

`check_dependencies(dir_path)`

Check if the necessary dependency files exist in the directory.

Parameters:

Name	Type	Description	Default
`dir_path`	`Path`	Path to the directory to check.	required

Returns:

Type	Description
`Tuple[List[str], List[str]]`	Two lists: the dependency files that were found and those that were not found.

`check_files(dir_path, files)`

Check if the given files exist in the directory.

Parameters:

Name	Type	Description	Default
`dir_path`	`Path`	Path to the directory to check.	required
`files`	`List[str]`	List of filenames to look for.	required

Returns:

Type	Description
`Tuple[List[str], List[str]]`	Two lists: the files that were found and those that were not found.
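The found/not-found split described above can be sketched in a few lines. The name `check_files_sketch` is hypothetical, and checking only the top level of the directory is an assumption.

```python
from pathlib import Path
from typing import List, Tuple


def check_files_sketch(dir_path: Path, files: List[str]) -> Tuple[List[str], List[str]]:
    """Split filenames into those present in dir_path and those missing."""
    found: List[str] = []
    not_found: List[str] = []
    for name in files:
        # Only the top level of dir_path is inspected in this sketch.
        if (dir_path / name).is_file():
            found.append(name)
        else:
            not_found.append(name)
    return found, not_found
```

Helpers such as `check_dependencies` and `check_wrapper_scripts` could plausibly be built on the same pattern by passing a fixed list of filenames, though the source does not confirm this.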

`check_parsed_readme(dir_path)`

Check if the necessary sections exist in the README file.

Parameters:

Name	Type	Description	Default
`dir_path`	`Path`	Path to the directory to check.	required

Returns:

Type	Description
`Tuple[List[str], List[str]]`	Two lists: the sections that were found and those that were not found.
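A README section check might look like the sketch below, which scans Markdown headings for required section names. The filename `README.md`, the heading detection, the explicit `required` argument, and the example section names are assumptions for illustration; the real function takes only `dir_path`, and the sections it looks for are not listed in this reference.

```python
from pathlib import Path
from typing import List, Tuple


def check_readme_sections_sketch(
    dir_path: Path, required: List[str]
) -> Tuple[List[str], List[str]]:
    """Report which required section names appear in README.md headings."""
    readme = dir_path / "README.md"
    if not readme.is_file():
        return [], list(required)
    text = readme.read_text(encoding="utf-8", errors="ignore")
    # Treat every line starting with '#' as a heading. This also picks up
    # '#' comments inside code fences, which is acceptable for a sketch.
    headings = [
        line.lstrip("#").strip().lower()
        for line in text.splitlines()
        if line.startswith("#")
    ]
    found = [s for s in required if any(s.lower() in h for h in headings)]
    not_found = [s for s in required if s not in found]
    return found, not_found
```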

`check_wrapper_scripts(dir_path)`

Check if the necessary wrapper script files exist in the directory.

Parameters:

Name	Type	Description	Default
`dir_path`	`Path`	Path to the directory to check.	required

Returns:

Type	Description
`Tuple[List[str], List[str]]`	Two lists: the wrapper script files that were found and those that were not found.

`clone_repo(arxiv_id, repo_url, path_corpus, overwrite=False)`

Clone a repository from the given URL to the given path using the arxiv_id as the directory name. If the repository already exists, it won't be overwritten unless specified.

Parameters:

Name	Type	Description	Default
`arxiv_id`	`str`	The arxiv id of the paper.	required
`repo_url`	`str`	URL of the repository to clone.	required
`path_corpus`	`Path`	Path to clone the repository to.	required
`overwrite`	`bool`	Whether to overwrite the existing repository. Defaults to False.	`False`

Returns:

Name	Type	Description
`Path`	`Path`	Path to the cloned repository. Returns False if cloning fails.
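The clone-unless-present behavior can be sketched with the `git` command-line tool. Whether reproscreener shells out to `git` or uses a Git library is not stated here, so the subprocess call, the shallow clone, and returning the existing path when nothing is overwritten are all assumptions.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Union


def clone_repo_sketch(
    arxiv_id: str, repo_url: str, path_corpus: Path, overwrite: bool = False
) -> Union[Path, bool]:
    """Clone repo_url into path_corpus/arxiv_id; return False on failure."""
    target = path_corpus / arxiv_id
    if target.exists():
        if not overwrite:
            # Assumption: an existing clone is reused as-is.
            return target
        shutil.rmtree(target)
    try:
        subprocess.run(
            ["git", "clone", "--depth", "1", repo_url, str(target)],
            check=True,
            capture_output=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        # git reported an error, or the git executable is not installed.
        return False
    return target
```

Returning either a `Path` or `False` mirrors the return description above; callers need to check for `False` before using the result as a path.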

`clone_repos(arxiv_ids, repo_urls, path_corpus, overwrite=False)`

Clone a list of repositories from the given URLs to the given path using the arxiv_ids as the directory names. If a repository already exists, it won't be overwritten unless specified.

Parameters:

Name	Type	Description	Default
`arxiv_ids`	`List[str]`	List of arxiv ids corresponding to the repositories.	required
`repo_urls`	`List[str]`	List of URLs of the repositories to clone.	required
`path_corpus`	`Path`	Path to clone the repositories to.	required
`overwrite`	`bool`	Whether to overwrite the existing repositories. Defaults to False.	`False`

Returns:

Type	Description
`List[Path]`	List of paths to the cloned repositories. Returns False if cloning fails.

`evaluate_repo(path_corpus)`

Evaluate a repository by checking for the existence of certain files and of certain sections in the README.

Parameters:

Name	Type	Description	Default
`path_corpus`	`Path`	Path to the repository.	required

Returns:

Type	Description
`DataFrame`	A DataFrame with the evaluation results.

`evaluate_repos(path_corpus, evaluation_dict)`

Evaluate a list of repositories by checking for the existence of certain files and of certain sections in each README.

Parameters:

Name	Type	Description	Default
`path_corpus`	`Path`	Path to the directory containing all repositories.	required
`evaluation_dict`	`dict`	Dictionary to store the evaluation results.	required

Returns:

Name	Type	Description
`dict`	`dict`	Dictionary with the evaluation results.

`get_all_repo_eval_dict(path_corpus)`

Evaluates all repositories in the given corpus and returns a dictionary of evaluation data.

Parameters:

Name	Type	Description	Default
`path_corpus`	`Path`	A Path object representing the path to the corpus of repositories to evaluate.	required

Returns:

Name	Type	Description
`dict`	`dict`	Dictionary where the keys are repository names and the values are DataFrames containing the repo_eval results.

`repo_eval_table(df_table, title='')`

Prepare a DataFrame for display as a rich table.

Parameters:

Name	Type	Description	Default
`df_table`	`DataFrame`	DataFrame to display.	required
`title`	`str`	Title of the table. Defaults to "".	`''`

Returns:

Name	Type	Description
`Table`	`Table`	A rich Table object ready to be printed.

`scrape_arxiv` module

The scrape_arxiv module is used to obtain the gold standard dataset from arXiv. The dataset includes the PDF, source tarball, and abstract for each paper.

`gold_standard` module

The gold_standard module is used to evaluate and compare the performance of reproscreener on the gold standard dataset. It uses the data from the scrape_arxiv module.
