Architecture

tex_eval module

The tex_eval module evaluates the .tex files extracted from a paper's arXiv source tarball.

combine_tex_in_folder(folder_path)

Combine all .tex files in a given directory into a single file.

Parameters:

Name Type Description Default
folder_path Path

Path to the directory containing .tex files.

required

Returns:

Name Type Description
Path Path

Path to the combined .tex file.
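
As a rough illustration of the combining step, the sketch below concatenates every .tex file under a folder into one output file. It is not the module's implementation: the name `combine_tex_sketch`, the output file name `combined.tex`, the recursive search, and the sorted ordering are all assumptions made for the example.

```python
from pathlib import Path


def combine_tex_sketch(folder_path: Path, output_name: str = "combined.tex") -> Path:
    """Concatenate every .tex file under folder_path into one file (illustrative only)."""
    # Hypothetical output location; the real module may name or place it differently.
    combined_path = folder_path / output_name
    # Sort for a deterministic order and skip the output file itself on reruns.
    tex_files = sorted(p for p in folder_path.rglob("*.tex") if p != combined_path)
    # Read all inputs before writing so a rerun never reads a half-written output.
    parts = [p.read_text(encoding="utf-8", errors="ignore") for p in tex_files]
    combined_path.write_text("\n".join(parts), encoding="utf-8")
    return combined_path
```

Excluding the output file from the inputs keeps the function idempotent, so calling it twice on the same folder yields the same combined file.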

evaluate_paper(tex_folder_path, paper_id)

Evaluate a paper by extracting variables and URLs from its tex files.

Parameters:

Name Type Description Default
tex_folder_path Path

Path to the directory containing the paper's .tex files.

required
paper_id str

ID of the paper.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame with the evaluation results.

evaluate_papers(path_corpus, evaluation_dict)

Evaluate a list of papers by extracting variables and URLs from their tex files.

Parameters:

Name Type Description Default
path_corpus Path

Path to the directory containing all papers.

required
evaluation_dict dict

Dictionary to store the evaluation results.

required

Returns:

Name Type Description
dict dict

Dictionary with the evaluation results.

extract_tex_urls(combined_path)

Extract URLs from the combined tex file.

Parameters:

Name Type Description Default
combined_path Path

Path to the combined .tex file.

required

Returns:

Type Description
Set[str]

Set[str]: Set of found URLs.
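
One plausible way to pull URLs out of LaTeX source is a regular expression. The sketch below is an assumption about the general approach, not the module's actual code: `extract_urls_sketch` and `URL_PATTERN` are hypothetical names, the pattern is deliberately simplified, and the function takes the file contents as a string rather than a path so that it stays self-contained.

```python
import re
from typing import Set

# Simplified pattern (an assumption): stop at whitespace and at characters
# that commonly close a LaTeX argument, such as braces and backslashes.
URL_PATTERN = re.compile(r"https?://[^\s{}\\<>\"']+")


def extract_urls_sketch(tex_source: str) -> Set[str]:
    """Return the distinct URLs found in a string of LaTeX source (illustrative only)."""
    # Strip trailing punctuation that usually belongs to the sentence, not the URL.
    return {match.rstrip(".,;:)") for match in URL_PATTERN.findall(tex_source)}
```

Returning a set removes duplicates, which matches the documented Set[str] return type.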

Find URLs belonging to allowed domains (by default github, gitlab, and zenodo).

Parameters:

Name Type Description Default
url_list Set[str]

Set of URLs to process.

required
allowed_domains List[str]

List of allowed domain names. Defaults to ["github", "gitlab", "zenodo"].

['github', 'gitlab', 'zenodo']

Returns:

Type Description
List[str]

List[str]: List of found URLs belonging to allowed domains.
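
The domain filter described above can be sketched as follows. The function name `filter_allowed_urls_sketch` is hypothetical, and matching each allowed name as a substring of the URL's hostname is an assumption; the real function may match differently, for example against the whole URL.

```python
from typing import Iterable, List, Optional
from urllib.parse import urlparse


def filter_allowed_urls_sketch(
    url_list: Iterable[str], allowed_domains: Optional[List[str]] = None
) -> List[str]:
    """Keep only URLs whose hostname mentions an allowed domain (illustrative only)."""
    if allowed_domains is None:
        # Default list taken from the documented signature.
        allowed_domains = ["github", "gitlab", "zenodo"]
    found = []
    # Sort so the result is deterministic even though the input is a set.
    for url in sorted(url_list):
        host = (urlparse(url).hostname or "").lower()
        # Assumption: a substring match on the hostname is enough.
        if any(domain in host for domain in allowed_domains):
            found.append(url)
    return found
```

Checking the hostname rather than the full URL avoids false positives such as a path segment that happens to contain "github".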

find_tex_variables(combined_path)

Find variables in the combined tex file. Uses the KeywordProcessor from flashtext package to extract variables.

Parameters:

Name Type Description Default
combined_path Path

Path to the combined .tex file.

required

Returns:

Type Description
Set[str]

Set[str]: Set of found variables.

get_all_tex_eval_dict(path_corpus)

Evaluates all papers in the given corpus and returns a dictionary of evaluation data.

Parameters:

Name Type Description Default
path_corpus Path

A Path object representing the path to the corpus of papers to evaluate.

required

Returns:

Type Description
dict

A dictionary where the keys are paper IDs and the values are DataFrames containing the tex_eval results.

paper_evaluation_results(paper_id, found_vars, found_links, title='No title found')

Create a rich Panel with the results of the paper evaluation.

Parameters:

Name Type Description Default
paper_id str

ID of the paper.

required
found_vars Set[str]

Set of found variables.

required
found_links List[str]

List of found URLs.

required
title str

Title of the paper.

'No title found'

repo_eval module

check_dependencies(dir_path)

Check if the necessary dependency files exist in the directory.

Parameters:

Name Type Description Default
dir_path Path

Path to the directory to check.

required

Returns:

Type Description
Tuple[List[str], List[str]]

Tuple[List[str], List[str]]: Two lists containing the found dependency files and not found dependency files.

check_files(dir_path, files)

Check if the given files exist in the directory.

Parameters:

Name Type Description Default
dir_path Path

Path to the directory to check.

required
files List[str]

List of filenames to look for.

required

Returns:

Type Description
Tuple[List[str], List[str]]

Tuple[List[str], List[str]]: Two lists containing the found files and not found files.
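
The found/not-found split that the check_* functions return can be illustrated with a few lines of pathlib. This is a minimal sketch under assumptions, not the module's code: `check_files_sketch` is a hypothetical name, and only the top level of the directory is searched.

```python
from pathlib import Path
from typing import List, Tuple


def check_files_sketch(dir_path: Path, files: List[str]) -> Tuple[List[str], List[str]]:
    """Split the requested file names into found and not found (illustrative only)."""
    found: List[str] = []
    not_found: List[str] = []
    for name in files:
        # Assumption: only the top level of dir_path is checked, not subfolders.
        if (dir_path / name).is_file():
            found.append(name)
        else:
            not_found.append(name)
    return found, not_found
```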

check_parsed_readme(dir_path)

Check if the necessary sections exist in the README file.

Parameters:

Name Type Description Default
dir_path Path

Path to the directory to check.

required

Returns:

Type Description
Tuple[List[str], List[str]]

Tuple[List[str], List[str]]: Two lists containing the found sections and not found sections.
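
A README section check could, for instance, collect Markdown headings and compare them with a list of expected section names. The sketch below assumes a Markdown README with `#`-style headings and takes the expected names as an argument; the source does not say which sections the real function looks for, so `check_readme_sections_sketch` and the section names in the usage are purely hypothetical.

```python
import re
from typing import List, Tuple

# Matches ATX-style Markdown headings such as "## Installation".
HEADING_PATTERN = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*#*[ \t]*$", re.MULTILINE)


def check_readme_sections_sketch(
    readme_text: str, required: List[str]
) -> Tuple[List[str], List[str]]:
    """Split the required section names into found and not found (illustrative only)."""
    headings = [heading.lower() for heading in HEADING_PATTERN.findall(readme_text)]
    found: List[str] = []
    not_found: List[str] = []
    for section in required:
        # Assumption: a case-insensitive substring match against any heading.
        if any(section.lower() in heading for heading in headings):
            found.append(section)
        else:
            not_found.append(section)
    return found, not_found
```

For example, calling it on a README that has "Installation" and "Usage" headings with `["installation", "usage", "citation"]` places the first two names in the found list and "citation" in the not-found list.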

check_wrapper_scripts(dir_path)

Check if the necessary wrapper script files exist in the directory.

Parameters:

Name Type Description Default
dir_path Path

Path to the directory to check.

required

Returns:

Type Description
Tuple[List[str], List[str]]

Tuple[List[str], List[str]]: Two lists containing the found wrapper script files and not found wrapper script files.

clone_repo(arxiv_id, repo_url, path_corpus, overwrite=False)

Clone a repository from the given URL to the given path using the arxiv_id as the directory name. If the repository already exists, it won't be overwritten unless specified.

Parameters:

Name Type Description Default
arxiv_id str

The arxiv id of the paper.

required
repo_url str

URL of the repository to clone.

required
path_corpus Path

Path to clone the repository to.

required
overwrite bool

Whether to overwrite the existing repository. Defaults to False.

False

Returns:

Name Type Description
Path Path

Path to the cloned repository. Returns False if cloning fails.
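
The documented behaviour (directory named after the arXiv ID, no overwrite by default, False on failure) can be sketched with the git command line. This is an assumption about the mechanics, not the module's implementation: `clone_repo_sketch` is a hypothetical name, the real function may use a Git library instead of a subprocess, and returning the existing path when the clone is skipped is a guess.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Union


def clone_repo_sketch(
    arxiv_id: str, repo_url: str, path_corpus: Path, overwrite: bool = False
) -> Union[Path, bool]:
    """Clone repo_url into path_corpus/arxiv_id (illustrative only)."""
    target = path_corpus / arxiv_id
    if target.exists():
        if not overwrite:
            # Assumption: an existing clone is reused and its path returned.
            return target
        shutil.rmtree(target)
    path_corpus.mkdir(parents=True, exist_ok=True)
    try:
        # Requires the git executable to be available on PATH.
        subprocess.run(
            ["git", "clone", repo_url, str(target)],
            check=True,
            capture_output=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Mirror the documented contract: False signals a failed clone.
        return False
    return target
```

Because the return type mixes Path and bool, callers need an explicit `is False` check before using the result as a path.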

clone_repos(arxiv_ids, repo_urls, path_corpus, overwrite=False)

Clone a list of repositories from the given URLs to the given path using the arxiv_ids as the directory names. If a repository already exists, it won't be overwritten unless specified.

Parameters:

Name Type Description Default
arxiv_ids List[str]

List of arxiv ids corresponding to the repositories.

required
repo_urls List[str]

List of URLs of the repositories to clone.

required
path_corpus Path

Path to clone the repositories to.

required
overwrite bool

Whether to overwrite the existing repositories. Defaults to False.

False

Returns:

Type Description
List[Path]

List[Path]: List of paths to the cloned repositories. Returns False if cloning fails.

evaluate_repo(path_corpus)

Evaluate a repository by checking the existence of certain files and sections in README.

Parameters:

Name Type Description Default
path_corpus Path

Path to the repository.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame with the evaluation results.

evaluate_repos(path_corpus, evaluation_dict)

Evaluate a list of repositories by checking the existence of certain files and sections in README.

Parameters:

Name Type Description Default
path_corpus Path

Path to the directory containing all repositories.

required
evaluation_dict dict

Dictionary to store the evaluation results.

required

Returns:

Name Type Description
dict dict

Dictionary with the evaluation results.

get_all_repo_eval_dict(path_corpus)

Evaluates all repositories in the given corpus and returns a dictionary of evaluation data.

Parameters:

Name Type Description Default
path_corpus Path

A Path object representing the path to the corpus of repositories to evaluate.

required

Returns:

Name Type Description
dict dict

A dictionary where the keys are repository names and the values are DataFrames containing the repo_eval results.

repo_eval_table(df_table, title='')

Prepare a DataFrame for display as a rich table.

Parameters:

Name Type Description Default
df_table DataFrame

DataFrame to display.

required
title str

Title of the table. Defaults to "".

''

Returns:

Name Type Description
Table Table

A rich Table object ready to be printed.

scrape_arxiv module

The scrape_arxiv module obtains the gold standard dataset from arXiv. The dataset includes the PDF, source tarball, and abstract for each paper.

gold_standard module

The gold_standard module is used to evaluate and compare the performance of reproscreener on the gold standard dataset. It uses the data from the scrape_arxiv module.