Architecture
tex_eval module
The tex_eval module evaluates the .tex files extracted from a paper's arXiv source tarball.
combine_tex_in_folder(folder_path)
Combine all .tex files in a given directory into a single file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
folder_path | Path | Path to the directory containing .tex files. | required |
Returns:
Name | Type | Description |
---|---|---|
Path | Path | Path to the combined .tex file. |
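
The combining step can be pictured with a minimal sketch. This is not the module's implementation: the name combine_tex_sketch, the output file name combined.tex, the non-recursive search, and the alphabetical ordering are all assumptions made for illustration.

```python
from pathlib import Path


def combine_tex_sketch(folder_path: Path, output_name: str = "combined.tex") -> Path:
    """Concatenate the .tex files in folder_path into one file (illustrative only)."""
    combined_path = folder_path / output_name
    # Assumption: only top-level files are combined, in alphabetical order.
    # The output file is skipped so a second run does not include itself.
    tex_files = sorted(p for p in folder_path.glob("*.tex") if p != combined_path)
    parts = [p.read_text(encoding="utf-8", errors="ignore") for p in tex_files]
    combined_path.write_text("\n".join(parts), encoding="utf-8")
    return combined_path
```

Skipping the output file keeps the sketch idempotent, so running it twice on the same folder produces the same combined file.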
evaluate_paper(tex_folder_path, paper_id)
Evaluate a paper by extracting variables and URLs from its tex files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tex_folder_path | Path | Path to the directory containing the paper's .tex files. | required |
paper_id | str | ID of the paper. | required |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: A DataFrame with the evaluation results. |
evaluate_papers(path_corpus, evaluation_dict)
Evaluate a list of papers by extracting variables and URLs from their tex files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path_corpus | Path | Path to the directory containing all papers. | required |
evaluation_dict | dict | Dictionary to store the evaluation results. | required |
Returns:
Name | Type | Description |
---|---|---|
dict | dict | Dictionary with the evaluation results. |
extract_tex_urls(combined_path)
Extract URLs from the combined tex file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
combined_path | Path | Path to the combined .tex file. | required |
Returns:
Type | Description |
---|---|
Set[str] | Set[str]: Set of found URLs. |
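
As a rough illustration of URL extraction from LaTeX text, the sketch below uses a regular expression. The pattern, the trailing-punctuation cleanup, and the name extract_urls_sketch are assumptions; the module's actual matching rules may differ.

```python
import re
from pathlib import Path
from typing import Set

# Assumed pattern: an http(s) URL runs until whitespace, a brace, or a backslash,
# so \url{...} and \href{...}{...} arguments are cut at the closing brace.
URL_PATTERN = re.compile(r"https?://[^\s{}\\]+")


def extract_urls_sketch(combined_path: Path) -> Set[str]:
    """Return the set of URLs found in a combined .tex file (illustrative only)."""
    text = combined_path.read_text(encoding="utf-8", errors="ignore")
    # Trailing punctuation usually belongs to the sentence, not the URL.
    return {match.rstrip(".,;:)") for match in URL_PATTERN.findall(text)}
```

Returning a set removes duplicates, which matches the Set[str] return type documented above.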
find_data_repository_links(url_list, allowed_domains=['github', 'gitlab', 'zenodo'])
Find URLs belonging to allowed domains (default: github, gitlab, zenodo).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url_list | Set[str] | Set of URLs to process. | required |
allowed_domains | List[str] | List of allowed domain names. Defaults to ["github", "gitlab", "zenodo"]. | ['github', 'gitlab', 'zenodo'] |
Returns:
Type | Description |
---|---|
List[str] | List[str]: List of found URLs belonging to allowed domains. |
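
The domain filter can be sketched as follows. Matching the domain name as a substring of the URL's host, and sorting the result, are assumptions chosen for this example; filter_repository_links_sketch is a hypothetical name and not part of the module.

```python
from typing import Iterable, List, Sequence
from urllib.parse import urlparse


def filter_repository_links_sketch(
    url_list: Iterable[str],
    allowed_domains: Sequence[str] = ("github", "gitlab", "zenodo"),
) -> List[str]:
    """Keep URLs whose host mentions an allowed domain (illustrative only)."""
    matches = []
    for url in sorted(url_list):
        # Assumption: compare against the host only, so a path such as
        # example.com/github does not count as a repository link.
        host = urlparse(url).netloc.lower()
        if any(domain in host for domain in allowed_domains):
            matches.append(url)
    return matches
```

Checking the host rather than the whole URL avoids false positives from pages that merely mention a hosting service in their path.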
find_tex_variables(combined_path)
Find variables in the combined tex file.
Uses the KeywordProcessor from the flashtext package to extract variables.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
combined_path | Path | Path to the combined .tex file. | required |
Returns:
Type | Description |
---|---|
Set[str] | Set[str]: Set of found variables. |
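
To show the idea of keyword-based variable detection without depending on flashtext, the sketch below swaps in plain word-boundary regular expressions from the standard library. It is not how the module works internally; the function name and the keyword list passed in are hypothetical.

```python
import re
from pathlib import Path
from typing import Iterable, Set


def find_keywords_sketch(combined_path: Path, keywords: Iterable[str]) -> Set[str]:
    """Return the keywords that occur in the file, ignoring case (illustrative only)."""
    text = combined_path.read_text(encoding="utf-8", errors="ignore").lower()
    found = set()
    for keyword in keywords:
        # Word boundaries keep "seed" from matching inside "seeded".
        pattern = rf"\b{re.escape(keyword.lower())}\b"
        if re.search(pattern, text):
            found.add(keyword)
    return found
```

A dedicated keyword matcher such as flashtext scans the text once for all keywords, whereas this sketch scans once per keyword; for a short keyword list the difference is small.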
get_all_tex_eval_dict(path_corpus)
Evaluates all papers in the given corpus and returns a dictionary of evaluation data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path_corpus | Path | A Path object representing the path to the corpus of papers to evaluate. | required |
Returns:
Type | Description |
---|---|
dict | A dictionary where the keys are paper IDs and the values are DataFrames containing the tex_eval results. |
paper_evaluation_results(paper_id, found_vars, found_links, title='No title found')
Create a rich Panel with the results of the paper evaluation.
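Parameters:
Name | Type | Description | Default |
---|---|---|---|
paper_id | str | ID of the paper. | required |
found_vars | Set[str] | Set of found variables. | required |
found_links | List[str] | List of found URLs. | required |
title | str | Title of the paper. | 'No title found' |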
repo_eval module
check_dependencies(dir_path)
Check if the necessary dependency files exist in the directory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dir_path | Path | Path to the directory to check. | required |
Returns:
Type | Description |
---|---|
Tuple[List[str], List[str]] | Tuple[List[str], List[str]]: Two lists containing the found dependency files and not found dependency files. |
check_files(dir_path, files)
Check if the given files exist in the directory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dir_path | Path | Path to the directory to check. | required |
files | List[str] | List of filenames to look for. | required |
Returns:
Type | Description |
---|---|
Tuple[List[str], List[str]] | Tuple[List[str], List[str]]: Two lists containing the found files and not found files. |
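
The found / not found split returned by the file checks can be sketched in a few lines. The name check_files_sketch and the choice to look only at the top level of the directory are assumptions for illustration.

```python
from pathlib import Path
from typing import List, Sequence, Tuple


def check_files_sketch(dir_path: Path, files: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Split the requested filenames into found and not found (illustrative only)."""
    found: List[str] = []
    not_found: List[str] = []
    for name in files:
        # Assumption: a file counts as present only at the top level of dir_path.
        if (dir_path / name).is_file():
            found.append(name)
        else:
            not_found.append(name)
    return found, not_found
```

The dependency and wrapper-script checks documented in this module return the same tuple shape, so they could plausibly be expressed as calls to a helper like this with different filename lists.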
check_parsed_readme(dir_path)
Check if the necessary sections exist in the README file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dir_path | Path | Path to the directory to check. | required |
Returns:
Type | Description |
---|---|
Tuple[List[str], List[str]] | Tuple[List[str], List[str]]: Two lists containing the found sections and not found sections. |
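
A README section check might look like the sketch below. The README.md file name, the restriction to Markdown "#" headings, the case-insensitive substring match, and the required_sections argument are all assumptions; the documented function takes only dir_path and uses its own list of sections.

```python
import re
from pathlib import Path
from typing import List, Sequence, Tuple


def check_readme_sections_sketch(
    dir_path: Path, required_sections: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Report which section names appear in README headings (illustrative only)."""
    readme = dir_path / "README.md"
    if not readme.is_file():
        # Without a README, every section is reported as not found.
        return [], list(required_sections)
    text = readme.read_text(encoding="utf-8", errors="ignore")
    # Collect Markdown headings: lines starting with one to six '#' characters.
    headings = [
        heading.lower()
        for heading in re.findall(r"^#{1,6}[ \t]+(.+?)[ \t]*$", text, flags=re.MULTILINE)
    ]
    found: List[str] = []
    not_found: List[str] = []
    for section in required_sections:
        target = section.lower()
        if any(target in heading for heading in headings):
            found.append(section)
        else:
            not_found.append(section)
    return found, not_found
```

Matching only headings, rather than the whole README body, avoids counting a section as present just because its name is mentioned in passing.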
check_wrapper_scripts(dir_path)
Check if the necessary wrapper script files exist in the directory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dir_path | Path | Path to the directory to check. | required |
Returns:
Type | Description |
---|---|
Tuple[List[str], List[str]] | Tuple[List[str], List[str]]: Two lists containing the found wrapper script files and not found wrapper script files. |
clone_repo(arxiv_id, repo_url, path_corpus, overwrite=False)
Clone a repository from the given URL to the given path using the arxiv_id as the directory name. If the repository already exists, it won't be overwritten unless specified.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
arxiv_id | str | The arxiv id of the paper. | required |
repo_url | str | URL of the repository to clone. | required |
path_corpus | Path | Path to clone the repository to. | required |
overwrite | bool | Whether to overwrite the existing repository. Defaults to False. | False |
Returns:
Name | Type | Description |
---|---|---|
Path | Path | Path to the cloned repository. Returns False if cloning fails. |
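
The clone-or-reuse behaviour described above can be sketched with the git command line. Calling git through subprocess, deleting the old directory on overwrite, and the name clone_repo_sketch are assumptions; the module may use a different mechanism.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Union


def clone_repo_sketch(
    arxiv_id: str, repo_url: str, path_corpus: Path, overwrite: bool = False
) -> Union[Path, bool]:
    """Clone repo_url into path_corpus/arxiv_id, reusing an existing clone (illustrative only)."""
    target = path_corpus / arxiv_id
    if target.exists():
        if not overwrite:
            # Reuse the existing directory instead of cloning again.
            return target
        shutil.rmtree(target)
    path_corpus.mkdir(parents=True, exist_ok=True)
    try:
        subprocess.run(
            ["git", "clone", repo_url, str(target)],
            check=True,
            capture_output=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Mirrors the documented behaviour of returning False when cloning fails
        # (FileNotFoundError covers a missing git executable).
        return False
    return target
```

Because the function can return either a Path or False, callers should test the result explicitly (for example with `if result is False`) before treating it as a path.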
clone_repos(arxiv_ids, repo_urls, path_corpus, overwrite=False)
Clone a list of repositories from the given URLs to the given path using the arxiv_ids as the directory names. If a repository already exists, it won't be overwritten unless specified.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
arxiv_ids | List[str] | List of arxiv ids corresponding to the repositories. | required |
repo_urls | List[str] | List of URLs of the repositories to clone. | required |
path_corpus | Path | Path to clone the repositories to. | required |
overwrite | bool | Whether to overwrite the existing repositories. Defaults to False. | False |
Returns:
Type | Description |
---|---|
List[Path] | List[Path]: List of paths to the cloned repositories. Returns False if cloning fails. |
evaluate_repo(path_corpus)
Evaluate a repository by checking for the existence of certain files and of certain sections in its README.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path_corpus | Path | Path to the repository. | required |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: A DataFrame with the evaluation results. |
evaluate_repos(path_corpus, evaluation_dict)
Evaluate a list of repositories by checking for the existence of certain files and of certain sections in each README.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path_corpus | Path | Path to the directory containing all repositories. | required |
evaluation_dict | dict | Dictionary to store the evaluation results. | required |
Returns:
Name | Type | Description |
---|---|---|
dict | dict | Dictionary with the evaluation results. |
get_all_repo_eval_dict(path_corpus)
Evaluates all repositories in the given corpus and returns a dictionary of evaluation data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path_corpus | Path | A Path object representing the path to the corpus of repositories to evaluate. | required |
Returns:
Name | Type | Description |
---|---|---|
dict | dict | A dictionary where the keys are repository names and the values are DataFrames containing the repo_eval results. |
repo_eval_table(df_table, title='')
Prepare a DataFrame for display as a rich table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df_table | DataFrame | DataFrame to display. | required |
title | str | Title of the table. Defaults to "". | '' |
Returns:
Name | Type | Description |
---|---|---|
Table | Table | A rich Table object ready to be printed. |
scrape_arxiv module
The scrape_arxiv module obtains the gold standard dataset from arXiv. The dataset includes the PDF, source tarball, and abstract for each paper.
gold_standard module
The gold_standard module is used to evaluate and compare the performance of reproscreener on the gold standard dataset. It uses the data from the scrape_arxiv module.