bioat package
Subpackages
- bioat.lib package
- Submodules
- bioat.lib.libalignment module
- bioat.lib.libcircos module
- bioat.lib.libcolor module
- bioat.lib.libcrispr module
- bioat.lib.libdataclasses module
- bioat.lib.libdetect_seq module
- bioat.lib.libfastx module
- bioat.lib.libjgi module
- bioat.lib.libpandas module
- bioat.lib.libpatentseq module
- bioat.lib.libpath module
- bioat.lib.libpdb module
- bioat.lib.libphylo module
- bioat.lib.libplot module
- bioat.lib.libsnakemake module
- bioat.lib.libspider module
- Module contents
Submodules
bioat.about module
about.py.
This module provides information about the BioAT package, including version, author details, and links to related resources.
- Author:
Herman Huanan Zhao (hermanzhaozzzz AT gmail.com)
Example
- To display the information about the package, run the command:
bioat about
- ivar __ABOUT__:
A formatted string containing information about the BioAT package, including version, repository page, documentation page, issue tracking page, and author details.
- vartype __ABOUT__:
str
bioat.bamtools module
module of bamtools.
bioat bam <command>: tools for working with SAM or BAM files.
- example 1:
- bioat bam remove_clip
- <in shell>:
$ bioat bam remove_clip --help
- <in python console>:
>>> from bioat.cli import Cli
>>> bioat = Cli()
>>> help(bioat.bam.remove_clip)
- class bioat.bamtools.BamTools[source]
Bases: object
Bam toolbox.
- mpileup2table(mpileup: str | ~pathlib.Path, output: str | ~pathlib.Path | ~_io.TextIOWrapper = <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, threads: int = 1, mutation_number_threshold: int = 0, temp_dir: str | ~pathlib.Path = '__bioat_temp_dir', remove_temp: bool = True, log_level: str = 'WARNING') None[source]
Converts an mpileup file to a structured info file.
- Parameters:
mpileup (str) – Path to the samtools mpileup format file.
output (str | TextIOWrapper) – Path to the output file where parsed data will be stored. Defaults to standard output.
threads (int) – Number of threads to utilize for processing. Defaults to 1.
mutation_number_threshold (int) – Threshold for mutation information; set to 0 to include all sites.
temp_dir (str, optional) – Directory for temporary files. Defaults to a directory in ‘__bioat_temp_dir’.
remove_temp (bool, optional) – Indicator for whether to remove temporary files after processing. Defaults to True.
log_level (str, optional) – Level of logging. Options are ‘CRITICAL’, ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’, ‘NOTSET’. Defaults to ‘WARNING’.
- Returns:
This function does not return a value. It outputs a file based on the provided parameters.
- Return type:
None
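To make the mpileup input more concrete, here is a minimal sketch of how the read-bases column of one samtools mpileup line can be tallied. It is not bioat's implementation: the helper names `count_pileup_bases` and `mutation_count` are hypothetical, and treating `mutation_number_threshold` as a minimum count of non-reference observations is an assumption.

```python
from collections import Counter


def count_pileup_bases(ref_base: str, pileup: str) -> Counter:
    """Tally bases, insertions and deletions in one mpileup read-bases string."""
    counts = Counter()
    i, n = 0, len(pileup)
    while i < n:
        ch = pileup[i]
        if ch == "^":
            i += 2  # read-start marker plus its mapping-quality character
        elif ch in "+-":
            j = i + 1
            while j < n and pileup[j].isdigit():
                j += 1
            indel_length = int(pileup[i + 1:j])
            counts["Ins" if ch == "+" else "Del"] += 1
            i = j + indel_length  # skip the inserted/deleted bases
        elif ch in ".,":
            counts[ref_base.upper()] += 1  # match on forward/reverse strand
            i += 1
        elif ch.upper() in "ACGTN":
            counts[ch.upper()] += 1  # mismatch
            i += 1
        else:
            i += 1  # '$', '*', '<', '>' carry no base call here
    return counts


def mutation_count(counts: Counter, ref_base: str) -> int:
    """Number of observations that differ from the reference base (assumed definition)."""
    return sum(v for k, v in counts.items() if k != ref_base.upper())


# One mpileup line: chrom, pos, ref, depth, read bases, qualities
line = "chr1\t100\tC\t6\t..,T^I.$+2AG,\tIIIIII"
chrom, pos, ref, depth, bases, quals = line.split("\t")
site = count_pileup_bases(ref, bases)
print(chrom, pos, dict(site), mutation_count(site, ref))
```

Under this reading, a site would be reported when `mutation_count(...)` is at least `mutation_number_threshold`, so a threshold of 0 keeps every site.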
- remove_clip(input: str | ~pathlib.Path | ~_io.TextIOWrapper = <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>, output: str | ~pathlib.Path | ~_io.TextIOWrapper = <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, threads: int = 2, output_fmt: str = 'SAM', remove_as_paired: bool = False, max_clip: int = 0, log_level: str = 'WARNING')[source]
Remove soft/hard clipped reads from a BAM/SAM file.
This method removes soft/hard clipped reads from a BAM/SAM file. It can accept input from stdin and produce output to stdout.
- Parameters:
input (str | TextIOWrapper) – BAM file sorted by query name with soft/hard clipped reads. Pipe stdin is supported, e.g.: [samtools view -h foo_sort_name.bam | bioat bam remove_clip <flags>].
output (str | TextIOWrapper) – BAM file sorted by query name without soft/hard clipped reads. Pipe stdout is supported, e.g.: [bioat bam remove_clip <flags> | wc -l] or [bioat bam remove_clip <flags> | samtools view ….].
threads (int, optional) – Number of threads used by pysam and samtools core. Defaults to 2.
output_fmt (str, optional) – Format of the output file, can be “BAM” or “SAM”. Defaults to “SAM”.
remove_as_paired (bool, optional) – Flag to determine whether to remove single clipped reads. If True, removes both the clipped read and its paired read. The input BAM/SAM must be sorted by name and have header [SO:queryname]. If False, only removes the single clipped read.
max_clip (int, optional) – The maximum number of clips allowed per read. Defaults to 0.
log_level (str, optional) – Logging level for the process. Can be one of ‘CRITICAL’, ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’, ‘NOTSET’. Defaults to “WARNING”.
- Returns:
None
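The clip filter described above can be illustrated with plain CIGAR parsing. This is a sketch only, not the pysam-based code in bioat: `count_clips`, `keep_read` and `filter_sam_lines` are hypothetical helpers, and interpreting `max_clip` as the number of S/H operations (rather than clipped bases) is an assumption.

```python
import re

# One CIGAR operation: a length followed by an operation code
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def count_clips(cigar: str) -> int:
    """Count soft (S) and hard (H) clip operations in a CIGAR string."""
    return sum(1 for _, op in CIGAR_OP.findall(cigar) if op in "SH")


def keep_read(cigar: str, max_clip: int = 0) -> bool:
    """Keep a read only if it has at most max_clip clip operations."""
    return count_clips(cigar) <= max_clip


def filter_sam_lines(lines, max_clip: int = 0):
    """Yield header lines unchanged and alignment lines that pass the clip filter."""
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if keep_read(fields[5], max_clip):  # column 6 of a SAM record is the CIGAR
            yield line


sam = [
    "@HD\tVN:1.6\tSO:queryname",
    "r1\t0\tchr1\t10\t60\t100M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t20\t60\t5S95M\t*\t0\t0\t*\t*",
]
for kept in filter_sam_lines(sam):
    print(kept.split("\t")[0])
```

This sketch treats each read independently, which corresponds to `remove_as_paired=False`; removing mates together would additionally require grouping records by query name.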
bioat.bedtools module
BioAt BedTools # todo.
bioat.cli module
bioat.cli.
This module provides the command line interface (CLI) for the BioAT (Bioinformatic Analysis Tools) toolkit. The BioAT toolkit is designed for bioinformatic analyses, allowing users to handle various biological data formats including BED, BAM, FASTA, FASTQ, and VCF.
- Usage Examples:
- Example 1:
In shell:
$ bioat list
In Python console:
>>> from bioat.cli import Cli
>>> bioat = Cli()
>>> bioat.list()
>>> print(bioat.list())
- Example 2:
In shell:
$ bioat about
In Python console:
>>> from bioat.cli import Cli
>>> bioat = Cli()
>>> bioat.about()
>>> print(bioat.about())
- Description:
The CLI interface allows users to access the functionalities of the BioAT package through terminal commands or import it directly in Python scripts. The module provides various tools for tasks such as mining CRISPRs, downloading metagenomic data, and retrieving search results from Google Scholar.
- Commands:
about: Displays information about the BioAT toolkit.
list: Returns available groups and commands in a structured format.
version: Returns the version information of the BioAT toolkit.
- Copyright:
For researchers: Freely applicable for academic research; citation is appreciated but not mandatory. For commercial use: Not permitted without prior permission from the author.
- class bioat.cli.Cli[source]
Bases: object
Cli interface of BioAT.
- Brief:
“bioat” is short for “Bioinformatic Analysis Tools.” It is a command-line toolkit and a Python package that can be used through this CLI interface in a terminal or via the import method in Python code.
“bioat” has many subcommands to handle different bio-formats: BED, BAM, FASTA, FASTQ, VCF, etc.
“bioat” can be used for mining CRISPRs, downloading metagenomes, and even reporting Google Scholar search results!
For more information, run $ bioat about.
- Copyright:
- For researchers: freely applied to academic research. Please cite my work:
bibtex
software copyright
- For commercial use:
NOT PERMITTED unless permission is obtained from the author.
—
- COMMAND:
Such as bioat version. This represents a direct command.
- GROUPS:
Such as bioat crispr. This represents a group/bundle name for commands related to a specific topic.
—
- Usage:
A demo to understand bioat:
- bioat about
- <in shell>:
$ bioat about
- <in python console>:
>>> from bioat.cli import Cli
>>> bioat = Cli()
>>> bioat.about()
>>> print(bioat.about())
- classmethod about()[source]
Returns information about the bioat tool.
This class method provides a description or metadata regarding the bioat application, which may include version information, author details, or usage instructions.
- Returns:
Information about the bioat tool.
- Return type:
str
- property bam
- property bed
- property crispr
- property fastx
- property fold
- property hic
- list()[source]
Returns a formatted string of GROUPS and COMMANDS.
This method retrieves and formats the attributes and sub-attributes of the instance, excluding private members (those starting with “_”).
- Returns:
A tree-formatted string representing the available subcommands and their attributes.
- Return type:
str
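As an illustration of the idea behind list(), the sketch below walks the public attributes of an object and renders them as a small tree. It is not the bioat code: `ToyCli`, `ToyBam` and `format_tree` are hypothetical stand-ins, and the exact layout bioat prints may differ.

```python
class ToyBam:
    """Hypothetical command group with two commands."""

    def remove_clip(self):
        pass

    def mpileup2table(self):
        pass


class ToyCli:
    """Hypothetical CLI object: one direct command and one group."""

    bam = ToyBam()

    def about(self):
        pass


def public_names(obj):
    """Names of attributes that do not start with an underscore, sorted."""
    return sorted(name for name in dir(obj) if not name.startswith("_"))


def format_tree(obj) -> str:
    """Render commands and groups (with their commands) as a two-level tree."""
    lines = []
    for name in public_names(obj):
        lines.append(f"├── {name}")
        attr = getattr(obj, name)
        if not callable(attr):  # a group: list the commands it bundles
            for sub in public_names(attr):
                lines.append(f"│   ├── {sub}")
    return "\n".join(lines)


print(format_tree(ToyCli()))
```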
- property meta
- property search
- property table
- property target_seq
bioat.crisprtools module
crisprtools.py.
This module provides a toolbox for mining CRISPR-related sequences in metagenomic data. It includes functionality for identifying Cas candidates associated with CRISPR loci and for annotating Cas13 candidates from protein sequences. The main class, CrisprTools, provides methods for calling external executables such as Prodigal and Pilercr for protein prediction and CRISPR identification.
- Classes:
- CrisprTools: A class containing methods for identifying Cas candidates and Cas13 candidates from genomic data.
- bioat.crisprtools.cas_finder()[source]
A method for de novo annotation of Cas candidates from CRISPR loci, utilizing input fasta files and producing various output files.
- bioat.crisprtools.cas13_finder()[source]
A method for the annotation of Cas13 candidates from protein fasta files.
- Usage:
To use this module, create an instance of CrisprTools and call the methods cas_finder or cas13_finder with the appropriate parameters to perform CRISPR analysis on your dataset.
- class bioat.crisprtools.CrisprTools[source]
Bases: object
CRISPR mining toolbox.
This class provides methods for performing CRISPR analysis on datasets, including finding Cas proteins and CRISPR sequences.
- Variables:
None
- cas13_finder(input_faa, output_faa=None, lmin=200, lmax=1500, log_level='INFO')[source]
De novo annotation for Cas13 candidates from proteins.faa.
- Parameters:
input_faa (str) – The input file containing Cas candidates in .faa format.
output_faa (str, optional) – The output file for Cas13 candidates in .faa format. Defaults to None.
lmin (int, optional) – Minimum length for a Cas candidate. Defaults to 200.
lmax (int, optional) – Maximum length for a Cas candidate. Defaults to 1500.
log_level (str, optional) – The logging level. Options are ‘CRITICAL’, ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’, ‘NOTSET’. Defaults to ‘INFO’.
- cas_finder(input_fa, output_faa=None, output_contig_fa=None, output_crispr_info_tab=None, lmin=3000, lmax=None, extend=10000, temp_dir=None, prodigal=None, prodigal_mode='meta', pilercr=None, rm_temp=True, log_level='INFO')[source]
De novo annotation for Cas candidates from neighbors of CRISPR loci.
- Parameters:
input_fa (str) – Path to the input metagenome fasta file containing many contigs.
output_faa (str, optional) – Path to save the de novo annotated Cas candidates.
output_contig_fa (str, optional) – Path to save the whole contigs of de novo annotated Cas candidates.
output_crispr_info_tab (str, optional) – Path to save the de novo annotated CRISPR info table (CSV format).
lmin (int, optional) – Minimum length for a contig. Defaults to 3000.
lmax (int, optional) – Maximum length for a contig. Defaults to None.
extend (int, optional) – Distance over which proteins are considered from the start/end of the CRISPR loci. Defaults to 10000.
temp_dir (str, optional) – Directory to store temporary files. Defaults to None.
prodigal (str, optional) – Path to the Prodigal executable. Defaults to None.
prodigal_mode (str, optional) – Mode for Prodigal annotation. Can be “meta” or “single”. Defaults to “meta”.
pilercr (str, optional) – Path to the Pilercr executable. Defaults to None.
rm_temp (bool, optional) – If False, temporary files will be kept. Defaults to True.
log_level (str, optional) – Logging level; set to “DEBUG” to see logs from prodigal and pilercr. Defaults to “INFO”.
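The roles of lmin, lmax and extend can be sketched without Prodigal or Pilercr. The example below is an illustration under stated assumptions, not the library's code: `Protein`, `contig_passes`, `crispr_window` and `neighbors` are hypothetical names, coordinates are 0-based, and a protein counts as a neighbor only when it lies entirely inside the extended window.

```python
from typing import NamedTuple, Optional


class Protein(NamedTuple):
    name: str
    start: int
    end: int


def contig_passes(length: int, lmin: int = 3000, lmax: Optional[int] = None) -> bool:
    """Apply the contig length bounds; lmax=None means no upper bound."""
    return length >= lmin and (lmax is None or length <= lmax)


def crispr_window(crispr_start: int, crispr_end: int, contig_length: int, extend: int = 10000):
    """Extend the CRISPR locus on both sides, clipped to the contig."""
    return max(0, crispr_start - extend), min(contig_length, crispr_end + extend)


def neighbors(proteins, crispr_start, crispr_end, contig_length, extend=10000):
    """Proteins fully contained in the extended window (assumed criterion)."""
    lo, hi = crispr_window(crispr_start, crispr_end, contig_length, extend)
    return [p for p in proteins if p.start >= lo and p.end <= hi]


proteins = [
    Protein("p1", 500, 2000),
    Protein("p2", 14000, 17000),
    Protein("p3", 40000, 43000),
]
if contig_passes(50000):
    print([p.name for p in neighbors(proteins, 20000, 21000, 50000)])
```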
bioat.devtools module
bioat.exceptions module
This module contains custom exception classes for the bioat package.
- exception bioat.exceptions.BioatError(*args, **kwargs)[source]
Bases: Exception
Base class for all custom exceptions in the bioat package.
- exception bioat.exceptions.BioatFileFormatError(*args, **kwargs)[source]
Bases: BioatError
Exception raised for errors in file format.
- exception bioat.exceptions.BioatFileNotCompleteError(*args, **kwargs)[source]
Bases: BioatError
Exception raised when a file is not complete.
- exception bioat.exceptions.BioatFileNotFoundError(*args, **kwargs)[source]
Bases: BioatError
Exception raised when a required file is not found.
- exception bioat.exceptions.BioatInvalidInputError(*args, **kwargs)[source]
Bases: BioatError
Exception raised for invalid input errors.
- exception bioat.exceptions.BioatInvalidOptionError(*args, **kwargs)[source]
Bases: BioatError
Exception raised for invalid option errors.
- exception bioat.exceptions.BioatInvalidParameterError(*args, **kwargs)[source]
Bases: BioatError
Exception raised for invalid parameter errors.
- exception bioat.exceptions.BioatMissingDependencyError(*args, **kwargs)[source]
Bases: BioatError
Exception raised when a required dependency is missing.
- exception bioat.exceptions.BioatRuntimeError(*args, **kwargs)[source]
Bases: BioatError
Exception raised for runtime errors.
- exception bioat.exceptions.BioatValueError(*args, **kwargs)[source]
Bases: BioatError
Exception raised for errors related to values.
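Because every exception above derives from BioatError, callers can catch the base class to handle any bioat-specific failure in one place. The sketch below assumes nothing beyond the documented hierarchy; when bioat is not installed it falls back to local stand-in classes with the same names, and `check_threads` and `safe_call` are hypothetical helpers.

```python
try:
    from bioat.exceptions import BioatError, BioatInvalidParameterError
except ImportError:
    # Stand-ins mirroring the documented hierarchy, so the sketch runs without bioat.
    class BioatError(Exception):
        """Base class for all custom exceptions."""

    class BioatInvalidParameterError(BioatError):
        """Raised for invalid parameter errors."""


def check_threads(threads: int) -> int:
    """Hypothetical validator that raises a bioat-style exception."""
    if threads < 1:
        raise BioatInvalidParameterError(f"threads must be >= 1, got {threads}")
    return threads


def safe_call(func, *args):
    """Run func and report any bioat-specific failure as (result, message)."""
    try:
        return func(*args), None
    except BioatError as exc:  # catches every subclass listed above
        return None, f"{type(exc).__name__}: {exc}"


print(safe_call(check_threads, 4))
print(safe_call(check_threads, 0))
```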
bioat.fastxtools module
- class bioat.fastxtools.FastxTools[source]
Bases: object
FASTA & FASTQ toolbox.
- filter_read_contains_n(file: str, output='<stdout>')[source]
Filter reads that contain the ‘N’ base in FASTA or FASTQ formats.
This function processes a given FASTA or FASTQ file and filters out reads that contain the ‘N’ base. The result is directed to an output file, or to stdout if no output file is specified.
- Parameters:
file (str) – The name of the FASTA or FASTQ file to be processed.
output (str) – The name of the output file. Defaults to stdout if not specified, and the format matches the input.
- Returns:
None
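A minimal sketch of the filtering idea for FASTA input follows. It is an illustration rather than bioat's implementation: `read_fasta` and `drop_reads_with_n` are hypothetical helpers, FASTQ handling is omitted, and matching ‘N’ case-insensitively is an assumption.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_reads_with_n(records):
    """Keep only records whose sequence contains no N base."""
    for header, seq in records:
        if "N" not in seq.upper():
            yield header, seq


fasta = [">read1", "ACGT", "ACGT", ">read2", "ACNT", ">read3", "ttga"]
for header, seq in drop_reads_with_n(read_fasta(fasta)):
    print(f">{header}\n{seq}")
```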
- fmt_this(file: str, new_file: str | None = None, force=False, log_level='WARNING')[source]
Formats a FASTA file to improve readability.
- Parameters:
file (str) – The input filename for the FASTA file.
new_file (str | None) – The output filename. If None, the file will be replaced. Default is None.
force (bool) – If True, forces the formatting even if the output file exists. Default is False.
log_level (str) – The logging level for messages. Default is “WARNING”.
This function calls ‘format_this_fastx’ to perform the actual formatting on the specified FASTA file.
- mgi_parse_md5(file: str, log_level='WARNING')[source]
Converts a mgi-like MD5 file into a standard MD5 file.
- Parameters:
file (str) – The name of the mgi-like MD5 file to read.
log_level (str, optional) – The logging level to use. It can be INFO, DEBUG, WARNING, or ERROR. The default is WARNING.
- plot_length_distribution(file: str, table: str | None = None, image: str | None = None, plt_show: bool = False, log_level='WARNING')[source]
Plots the length distribution of a FASTA file.
- Parameters:
file (str) – The input filename for the FASTA file.
table (str | None, optional) – The output filename for the length distribution table. Default is <file>.lengths.
image (str | None, optional) – The output filename for the length distribution figure. Default is <file>.lengths.pdf.
plt_show (bool, optional) – If True, shows the plot. Default is False.
log_level (str) – The logging level for messages. Default is “WARNING”.
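The length table behind such a plot can be sketched with the standard library alone. This is a hedged illustration: `length_table` is a hypothetical helper, the column layout of the real <file>.lengths table is not documented here, and the plotting step is left out.

```python
from collections import Counter


def length_table(sequences):
    """Return sorted (length, count) rows for a mapping of name -> sequence."""
    counts = Counter(len(seq) for seq in sequences.values())
    return sorted(counts.items())


sequences = {
    "contig1": "ACGTACGT",
    "contig2": "ACGT",
    "contig3": "TTGACCAA",
}
for length, count in length_table(sequences):
    print(f"{length}\t{count}")
```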
bioat.foldtools module
TODO.
- class bioat.foldtools.FoldTools[source]
Bases: object
Folding toolbox.
- get_cut2ref_aln_info(ref: str | Structure, cut: str | Structure, cal_rmsd=True, cal_tmscore=False, label1='ref', label2='cut', usalign_bin: str = 'usalign', log_level='WARNING')[source]
Align a cut PDB to a reference PDB using the Cα atoms.
Aligns a truncated protein structure (cut) to its full-length reference structure (ref) using Cα atoms and Biopython’s Superimposer.
This function:
- Extracts all Cα atoms from ref and cut
- Removes atoms from ref at the indices listed in gap_indices
- Aligns the remaining atoms from cut to the corresponding positions in ref
- Modifies the cut structure in-place to match the aligned orientation
- Returns both structures and the RMSD value of the alignment
It assumes:
- One-to-one correspondence between residues after gap removal
- Structures are predicted by AlphaFold2 / ESMFold (no missing atoms)
- Parameters:
ref (str or Bio.PDB.Structure.Structure) – Reference structure path or loaded Structure.
cut (str or Bio.PDB.Structure.Structure) – Truncated structure path or loaded Structure.
cal_rmsd (bool, optional) – Whether to calculate RMSD. Default is True.
cal_tmscore (bool, optional) – Whether to calculate TM-score using USalign. Default is False.
label1 (str, optional) – Name for the reference structure. Default is “ref”.
label2 (str, optional) – Name for the cut structure. Default is “cut”.
usalign_bin (str, optional) – Path to the USalign binary for TM-score calculation. Default is “usalign”.
log_level (str, optional) – Logging level. Default is “WARNING”.
- Returns:
- {
“{label1}”: aln label1 structure, # if cal_rmsd is True, unaltered label1 structure
“{label2}”: fixed label2 structure, # if cal_rmsd is True, fix label2 coords in-place
“RMSD”: 0.123, # if cal_rmsd is True, the RMSD value between label1 and label2
f”{label1}_seq”: ref_seq, # if cal_rmsd is True, the sequence of label1 structure
f”{label2}_seq”: cut_seq, # if cal_rmsd is True, the sequence of label2 structure
“alignment_dict”: alignment_dict, # if cal_rmsd is True, the alignment dict of label1 and label2
“gap_indices”: gap_indices, # if cal_rmsd is True, the indices of gaps in label1 structure
“TM-score:mean”: 0.623, # if cal_tmscore is True, the mean TM-score value
“TM-score:TM1”: 0.456, # if cal_tmscore is True, use label1 as ref <L_N> in calculation
“TM-score:TM2”: 0.789, # if cal_tmscore is True, use label2 as ref <L_N> in calculation
…
}
- Return type:
dict
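The gap-removal and RMSD steps can be shown on bare coordinates. The sketch below is not the Biopython Superimposer workflow used by the function: it skips the superposition entirely and only computes the RMSD of coordinates assumed to be already aligned, and `drop_gaps` and `rmsd` are hypothetical helpers.

```python
import math


def drop_gaps(ref_coords, gap_indices):
    """Remove reference Cα positions that have no counterpart in the cut structure."""
    gaps = set(gap_indices)
    return [xyz for i, xyz in enumerate(ref_coords) if i not in gaps]


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


# Four reference Cα atoms; residue 1 was deleted in the cut structure
ref_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cut_ca = [(0.0, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]

print(rmsd(drop_gaps(ref_ca, [1]), cut_ca))
```

In the real function an optimal rotation and translation are fitted first, so the reported RMSD is the minimum over rigid-body placements rather than the raw value computed here.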
- pdb2fasta(input_pdb: str | Path, output_fasta: str | Path | None = None, log_level='WARNING')[source]
Converts a PDB file to a FASTA file.
- Details:
Proteins: The protein sequence for each chain will be extracted as “Chain X Protein”.
DNA and RNA: Bases for DNA (A, T, G, C) will be saved as “Chain X DNA”, and bases for RNA (A, U, G, C) will be saved as “Chain X RNA”.
Other molecules: Any unrecognized molecules (e.g., ions, modified molecules) will be labeled as [residue] and stored as “Chain X Other molecules”.
Multi-chain complexes: The program supports multi-chain structures in complexes, and the content of each chain will be recorded separately.
- Parameters:
input_pdb (str or Path) – Path to the input PDB/CIF file or Biopython Structure.
output_fasta (str or Path, optional) – Output file path. If None, the output file will be named as the basename of the input file with a “.fa” extension. Defaults to None.
func_return (bool, optional) – Whether to return a list of SeqRecord objects; useful when called as a function rather than from the command line. Defaults to False.
log_level (str, optional) – Logging level. Defaults to “WARNING”.
- Returns:
List of SeqRecord if func_return is True, otherwise None.
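The protein part of the conversion amounts to mapping three-letter residue names to one-letter codes and bracketing anything unrecognized. The following is a simplified sketch, not the Biopython-based implementation: `residues_to_sequence` is a hypothetical helper that takes residue names directly instead of parsing a PDB file, and it ignores DNA and RNA chains.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def residues_to_sequence(resnames):
    """Split one chain into a protein sequence and bracketed other molecules."""
    protein, others = [], []
    for name in resnames:
        code = THREE_TO_ONE.get(name.upper())
        if code is not None:
            protein.append(code)
        else:
            others.append(f"[{name}]")  # e.g. ions or modified residues
    return "".join(protein), "".join(others)


sequence, other_molecules = residues_to_sequence(["MET", "ALA", "LYS", "ZN"])
print(f">Chain A Protein\n{sequence}")
print(f">Chain A Other molecules\n{other_molecules}")
```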
- show_ref_cut(ref_seq: str | Path | Seq, ref_pdb: str | Path | Structure, cut_seq: list[str | Path | Seq] | str | Path | Seq | None = None, cut_pdb: list[str | Path | Structure] | str | Path | Structure | None = None, cut_labels: list[str] | str | None = None, ref_color: str = 'red', ref_map_colors: tuple[str, str] | None = None, ref_map_values: dict | None = None, cut_color='lightgray', gap_color='purple', ref_style='cartoon', cut_style='cartoon', gap_style='cartoon', ref_map_value_random: bool = False, output_fig: str | Path | None = None, col: int = 4, scale: float = 1.0, annotate: bool = True, text_interval: int = 5, log_level='WARNING')[source]
Visualizes the alignment of sequences and highlights changes in PDB structures using py3Dmol.
- Parameters:
ref_seq (str or Path or Seq) – Amino acid sequence content for the ref protein.
ref_pdb (str or Path or Bio.PDB.Structure.Structure) – Path to the PDB file of the reference structure.
cut_seq (str, Path or Seq or None, optional) – Amino acid sequence content for the cut protein.
cut_pdb (str, Path or Bio.PDB.Structure.Structure or None, optional) – Path to the PDB file of the cut structure.
cut_labels (list[str] or str or None, optional) – Label for the cut proteins. If None, the label will be set to “cut”.
ref_color (str, optional) – Color for reference residues.
ref_map_colors (tuple[str, str] or None, optional) – ref_map_colors will be used as color bar from ref_map_colors[0] to ref_map_colors[1]. If None, do not apply color mapping. Defaults to None.
ref_map_values (dict or None, optional) – A dictionary of values for the ref color map; it will be normalized to the range [0, 1]. If None, all residues will be colored with the same color. e.g. ref_map_values = {‘V_0’: 0.4177215189873418, ‘S_1’: 0.8185654008438819, ‘K_2’: 0.9915611814345991, ‘G_3’: 0.42616033755274263, …}
cut_color (str, optional) – Color for cut residues.
gap_color (str, optional) – Color for gaps or removed residues.
ref_style (str, optional) – “stick”, “sphere”, “cartoon”, or “line”
cut_style (str, optional) – “stick”, “sphere”, “cartoon”, or “line”
gap_style (str, optional) – “stick”, “sphere”, “cartoon”, or “line”
ref_map_value_random (bool, optional) – If True, ref_map_values will be randomly generated. Defaults to False.
output_fig (str or None, optional) – Output figure file path. If None, the figure will not be saved in html format. Defaults to None.
col (int, optional) – Number of columns for the visualization. Defaults to 4.
scale (float, optional) – Scale factor for the visualization. Defaults to 1.0.
annotate (bool, optional) – Whether to annotate the visualization with labels. Defaults to True.
text_interval (int, optional) – The interval between text annotations. Defaults to 5.
log_level (str, optional) – Log level. Defaults to “WARNING”.
bioat.hictools module
bioat.logger module
- class bioat.logger.LoggerManager(log_level: str = 'INFO', mod_name: str = 'bioat.logger', cls_name: str | None = None, func_name: str | None = None)[source]
Bases: object
- DEFAULT_LEVEL = 'INFO'
- LOG_FORMAT = '%(asctime)s.%(msecs)03d - [%(name)s] - %(filename)s[line:%(lineno)4d] - %(levelname)+8s: %(message)s'
- __init__(log_level: str = 'INFO', mod_name: str = 'bioat.logger', cls_name: str | None = None, func_name: str | None = None)[source]
Initialize the LoggerManager with default log level and module name.
- Parameters:
log_level (str, optional) – Default logger level. Defaults to “INFO”.
mod_name (str, optional) – Module name. Defaults to “bioat.logger”.
cls_name (str, optional) – The name of the class for which the logger is created. Defaults to None.
func_name (str, optional) – The name of the function for which the logger is created. Defaults to None.
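To see what the documented LOG_FORMAT produces, the sketch below wires it into the standard logging module. It does not use LoggerManager itself: `make_logger` is a hypothetical helper, and the date format passed to the formatter is an assumption, since the documentation above does not state one.

```python
import io
import logging

# Copied from the LOG_FORMAT attribute documented above
LOG_FORMAT = (
    "%(asctime)s.%(msecs)03d - [%(name)s] - %(filename)s[line:%(lineno)4d]"
    " - %(levelname)+8s: %(message)s"
)


def make_logger(name: str, level: str = "INFO", stream=None) -> logging.Logger:
    """Create a logger that writes records in the documented format."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    handler = logging.StreamHandler(stream)
    # The date format is an assumption made for this sketch
    handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt="%Y-%m-%d %H:%M:%S"))
    logger.handlers = [handler]
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = make_logger("bioat.demo", level="WARNING", stream=buffer)
log.info("this record is below the level and is dropped")
log.warning("temporary directory already exists")
print(buffer.getvalue(), end="")
```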
bioat.metatools module
- class bioat.metatools.MetaTools[source]
Bases: object
Metagenome toolbox.
- JGI_query(query_info: str | None = None, xml: str | None = None, log_fails: str | None = None, nretry: int = 4, timeout: int = 60, regex: str | None = None, all_get: bool = False, overwrite_conf: bool = False, filter_files: bool = False, proxy_pool: str | None = None, just_query_xml: bool = False, syntax_help: bool = False, usage: bool = False, log_level: str = 'INFO')[source]
JGI_query: Tool for downloading files from the JGI-IMG database.
This function lists and retrieves files from JGI using the curl API and returns a list of all files available for download for a given query organism.
The source code is adapted from https://github.com/glarue/jgi-query.
- Parameters:
query_info (str | None) – Organism name formatted per JGI’s abbreviation. Example: ‘Nematostella vectensis’ is abbreviated by JGI as ‘Nemve1’. The correct abbreviation can be found by searching for the organism on JGI; the name used in the URL of the ‘Info’ page for that organism is the correct abbreviation. The full URL may also be used for this argument.
xml (str | None) – Specify a local XML file for the query instead of retrieving a new copy from JGI.
log_fails (str | None) – Log file containing URLs to retry downloading from in case of failure.
nretry (int) – Number of times to retry downloading files with errors. Use 0 to skip such files.
timeout (int) – Timeout (in seconds) for downloading. Set to -1 to disable.
regex (str | None) – Regex pattern to use for auto-selecting and downloading files without interaction.
all_get (bool) – If True, auto-select and download all files for the query without interaction.
overwrite_conf (bool) – If True, initiate configuration dialog to overwrite existing user/password configuration.
filter_files (bool) – Under development. Filter organism results by config categories instead of reporting all files listed by JGI for the query.
proxy_pool (str | None) – URL for the proxy pool, e.g., http://abc.com:port. See https://github.com/hermanzhaozzzz/proxy_pool.
just_query_xml (bool) – Set True if you just want to save the XML file.
syntax_help (bool) – If True, provide syntax help in doc mode.
usage (bool) – If True, print verbose usage information and exit.
log_level (str) – Set the logging level. Options include ‘CRITICAL’, ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’, ‘NOTSET’.
bioat.searchtools module
Some cmdline tools for searching, powered by playwright.
- class bioat.searchtools.SearchTools[source]
Bases: object
Search toolbox.
- google_scholar(keyword: str, output: str | Path | None = None, sort_by: str = 'CitePerYear', n_results: int = 100, plot: bool = False, start_year: int | None = None, end_year: int = 2026, log_level: str = 'WARNING', **kwargs)[source]
Search Google Scholar.
This method creates a DataFrame of publication data from Google Scholar. Each result includes title, citations, year, authors, venue, publisher, and link. It is useful for finding relevant papers by citation metrics.
Optionally, it can generate a plot of citation counts versus rank and save the table in various formats.
- Parameters:
keyword (str) – Keyword to search for. For exact matches, wrap in double and single quotes, e.g., “‘exact keyword’”.
output (str, optional) – Output file path. Supported formats: .csv, .tsv, .xls, .xlsx. Default is None, which means no output and only print the table in the console.
sort_by (str) – Column to sort the result by, such as “Citations” or “CitePerYear”. Default is “CitePerYear”.
n_results (int) – Number of search results to retrieve. Default is 100.
plot (bool) – Whether to plot citation count vs. rank. Default is False.
start_year (int, optional) – Optional start year for publication filtering.
end_year (int) – End year for publication filtering. Default is current year.
log_level (str) – Logging level. One of: ‘CRITICAL’, ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’, ‘NOTSET’.
- Returns:
None. Prints the table and optionally saves or plots it.
- query_patent(seq: str, query_name: str | None = None, username: str | None = None, password: str | None = None, via_proxy: str | None = None, output: str | None = None, nobrowser: bool = True, retry: int = 3, local_browser: str | None = None, rm_fail_cookie: bool = False, log_level: str = 'INFO')[source]
Return a table listing patent BLAST hits from lens.org.
- Parameters:
seq – Protein sequence, e.g. MCRISQQKK.
query_name – Query name (queryName) used in the output table.
username – ORCID username (usually an email address).
password – ORCID password.
via_proxy – Proxy URL, e.g. http://127.0.0.1:8234 or socks5://127.0.0.1:8235.
output – Output table (.csv or .csv.gz).
nobrowser – Whether or not to open a browser for debugging.
retry – Maximum number of retries.
local_browser – Path to a local Firefox browser executable.
rm_fail_cookie – Remove local cookies if the query fails. Defaults to False.
log_level – One of ‘CRITICAL’, ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’, ‘NOTSET’.
bioat.systemtools module
bioat.tabletools module
author: Herman Huanan Zhao email: hermanzhaozzzz@gmail.com homepage: https://github.com/hermanzhaozzzz
- example 1:
- bioat list
- <in shell>:
$ bioat list
- <in python console>:
>>> from bioat.cli import Cli
>>> bioat = Cli()
>>> bioat.list()
>>> print(bioat.list())
- class bioat.tabletools.TableTools[source]
Bases: object
To integrate tables.
- merge(inputs, tags, output, input_fmt='tsv', output_fmt='tsv', input_header=False, output_header=False, log_level='WARNING')[source]
A simple tool to merge identically formatted tables from different samples.
- Parameters:
inputs – Table files.
tags – Tags for each table.
output – Merged file.
input_fmt – tsv | csv.
output_fmt – tsv | csv.
input_header – True | False; whether the input tables have a header.
output_header – True | False; whether the output table has a header.
log_level – Logging level.
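One plausible reading of merging tagged tables is to concatenate the rows and record each row's origin. The sketch below assumes exactly that and is not taken from the bioat source: `merge_tables` is a hypothetical helper, it works on in-memory text instead of files, and appending the tag as the last column is an assumption.

```python
import csv
import io


def merge_tables(tables, tags, delimiter="\t"):
    """Concatenate rows from several tables, appending each table's tag as a column."""
    if len(tables) != len(tags):
        raise ValueError("each table needs exactly one tag")
    merged = []
    for text, tag in zip(tables, tags):
        for row in csv.reader(io.StringIO(text), delimiter=delimiter):
            if row:  # skip blank lines
                merged.append(row + [tag])
    return merged


sample_a = "chr1\t100\t0.12\nchr1\t101\t0.08\n"
sample_b = "chr1\t100\t0.31\n"
for row in merge_tables([sample_a, sample_b], ["ctrl", "treated"]):
    print("\t".join(row))
```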
- split(input: str, n: int, output_prefix=None, input_fmt='tsv', output_fmt='tsv', input_header=False, output_header=False, compress=False, log_level='WARNING')[source]
A simple tool to split a table into parts.
- Parameters:
input – Table to split.
n – Number of parts to split the table into.
output_prefix – Name prefix for the split parts; same as input if not defined.
input_fmt – tsv | csv.
output_fmt – tsv | csv.
input_header – True | False; whether the input table has a header.
output_header – True | False; whether the output tables have a header.
compress – True | False; whether to gzip the output tables.
log_level – Logging level.
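Splitting into n parts can be sketched as cutting the row list into contiguous, near-equal chunks. This is an illustration only: `split_rows` is a hypothetical helper, and how bioat distributes leftover rows when the row count is not divisible by n is an assumption here (earlier parts receive one extra row).

```python
def split_rows(rows, n: int):
    """Split rows into n contiguous parts whose sizes differ by at most one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    size, extra = divmod(len(rows), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        parts.append(rows[start:end])
        start = end
    return parts


rows = [f"row{i}" for i in range(7)]
for index, part in enumerate(split_rows(rows, 3), start=1):
    print(index, part)
```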
bioat.target_seq module
author: Herman Huanan Zhao email: hermanzhaozzzz@gmail.com homepage: https://github.com/hermanzhaozzzz
- example 1:
- bioat list
- <in shell>:
$ bioat list
- <in python console>:
>>> from bioat.cli import Cli
>>> bioat = Cli()
>>> bioat.list()
>>> print(bioat.list())
- class bioat.target_seq.TargetSeq[source]
Bases: object
Target Deep Sequencing toolbox.
- region_heatmap(input_table: str, output_fig: str, target_seq: str | None = None, reference_seq: str | None = None, input_table_header: bool = True, output_fig_fmt: str = 'pdf', output_fig_dpi: int = 100, show_indel: bool = True, show_index: bool = True, box_border: bool = False, box_space: float = 0.03, min_color: tuple = (250, 239, 230), max_color: tuple = (154, 104, 57), min_ratio: float = 0.001, max_ratio: float = 0.99, region_extend_length: int = 5, local_alignment_scoring_matrix: tuple = (5, -4, -24, -8), local_alignment_min_score: int = 15, PAM_priority_weight: float = 1.0, get_built_in_target_seq: bool = False, log_level: str = 'INFO')[source]
Plot region mutation information using a table generated by bioat bam mpileup2table.
This function generates a visualization of mutation information for a specific genomic region based on a table created by the bioat bam mpileup2table command.
- Parameters:
input_table (str) – Path to the table generated by bioat bam mpileup2table. This table should contain base mutation information for a short genome region (no more than 1k nt).
output_fig (str) – Path to the output figure file.
target_seq (str, optional) – Target sequence to align against the reference sequence in mpileup.table. Defaults to None. Examples:
‘GAGTCCGAGCAGAAGAAGAA^GGG^’ for SpCas9-BE (PAM: ^GGG^).
‘^TTTA^GCCCCAATAATCCCCACATGTCA’ for cpf1-BE (PAM: ^TTTA^).
‘TGCTAGTAACCACGTTCTCCTGATCAAATATCACTCTCCTACTTACAGGA’ for no PAM.
reference_seq (str, optional) – Custom reference sequence to overwrite the one in mpileup.table. Can be a FASTA file or a DNA sequence. Defaults to None.
input_table_header (bool, optional) – Whether the input_table contains a header. Defaults to True.
output_fig_fmt (str, optional) – Format of the output figure (“pdf” or “png”). Defaults to “pdf”.
output_fig_dpi (int, optional) – DPI for the output figure. Defaults to 100.
show_indel (bool, optional) – Whether to show indel information in the output figure. Defaults to True.
show_index (bool, optional) – Whether to display index information in the output figure. Defaults to True.
box_border (bool, optional) – Whether to display box borders in the output figure. Defaults to False.
box_space (float, optional) – Space size between two boxes. Defaults to 0.03.
min_color (tuple, optional) – Minimum color for the heatmap in RGB format. Defaults to (250, 239, 230).
max_color (tuple, optional) – Maximum color for the heatmap in RGB format. Defaults to (154, 104, 57).
min_ratio (float, optional) – Mutation ratios below min_ratio will be shown as white. Defaults to 0.001.
max_ratio (float, optional) – Mutation ratios above max_ratio will be capped. Defaults to 0.99.
region_extend_length (int, optional) – Number of base pairs to extend on either side of the region. Defaults to 5.
local_alignment_scoring_matrix (tuple, optional) – Alignment scoring parameters as a tuple: (<align_match_score>, <align_mismatch_score>, <align_gap_open_score>, <align_gap_extension_score>). Defaults to (5, -4, -24, -8).
local_alignment_min_score (int, optional) – Minimum alignment score to consider as a valid alignment. Defaults to 15.
PAM_priority_weight (float, optional) – Weight multiplier for PAM alignment scores. Defaults to 1.0.
get_built_in_target_seq (bool, optional) – Set to True to return built-in target sequence information. Defaults to False.
log_level (str, optional) – Logging level. One of ‘CRITICAL’, ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’, ‘NOTSET’. Defaults to “INFO”.
- Returns:
None
Examples
>>> # Generate mutation table
>>> samtools mpileup test_sorted.bam --reference HK4-AOut-1.ref.upper.fa | gzip > test_sorted.mpileup.gz
>>> bioat bam mpileup_to_table test_sorted.mpileup.gz > test_sorted.mpileup.info.tsv
>>> # Generate region heatmap
>>> bioat target_seq region_heatmap --input_table test_sorted.mpileup.info.tsv --output_fig test_sorted.mpileup.info.pdf
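The target_seq notation marks the PAM with a pair of ^ characters at one end of the sequence. The sketch below shows one way such a string could be split into its parts. The helper name split_target_seq and the "5prime"/"3prime" labels are hypothetical and are not part of BioAT; how BioAT parses the string internally is not documented here.

```python
from typing import Optional, Tuple


def split_target_seq(target_seq: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Split a target_seq string into (protospacer, pam, pam_side).

    Hypothetical helper: the PAM is assumed to be wrapped in exactly two
    '^' characters and to sit at one end of the sequence.
    """
    if "^" not in target_seq:
        # No PAM marker: the whole string is the sequence to align.
        return target_seq, None, None
    if target_seq.count("^") != 2:
        raise ValueError("expected exactly two '^' markers around the PAM")
    left, pam, right = target_seq.split("^")
    if left and right:
        raise ValueError("the PAM must be at the start or the end")
    if left:
        # Sequence first, PAM last (e.g. the SpCas9-BE example).
        return left, pam, "3prime"
    # PAM first, sequence last (e.g. the cpf1-BE example).
    return right, pam, "5prime"


print(split_target_seq("GAGTCCGAGCAGAAGAAGAA^GGG^"))
print(split_target_seq("^TTTA^GCCCCAATAATCCCCACATGTCA"))
```
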
- region_heatmap_compare(input_tables: str, labels: str | None = None, target_seq: str | None = None, reference_seq: str | None = None, output_fig_heatmap: str | None = None, output_fig_count_ratio: str | None = None, output_table_heatmap: str | None = None, output_table_count_ratio: str | None = None, output_fig_fmt: str = 'pdf', input_table_header: bool = True, to_base: tuple = ('A', 'G', 'C', 'T', 'Ins', 'Del'), heatmap_mut_direction: tuple = ('CT', 'GA'), count_ratio='all', region_extend_length: int = 5, output_fig_dpi: int = 100, show_indel: bool = True, show_index: bool = True, block_ref: bool = True, box_border: bool = False, box_space: float = 0.03, min_color: tuple = (250, 239, 230), max_color: tuple = (154, 104, 57), min_ratio: float = 0.001, max_ratio: float = 0.99, local_alignment_scoring_matrix: tuple = (5, -4, -24, -8), local_alignment_min_score: int = 15, PAM_priority_weight: float = 1.0, get_built_in_target_seq: bool = False, log_level: str = 'INFO')[source]
Plot region mutation information for multiple conditions.
This function generates a comparison of mutation information across multiple conditions using tables generated by bioat bam mpileup_to_table.
- Parameters:
input_tables (str) – Paths to input tables generated by bioat bam mpileup_to_table, separated by commas. Each table should contain mutation information for a short genomic region (≤1k nt).
labels (str) – Labels for the panels, separated by commas.
target_seq (str, optional) –
Sequence to align against the reference sequence in mpileup.table. Examples:
‘GAGTCCGAGCAGAAGAAGAA^GGG^’ for SpCas9-BE (PAM: ^GGG^).
‘^TTTA^GCCCCAATAATCCCCACATGTCA’ for cpf1-BE (PAM: ^TTTA^).
‘TGCTAGTAACCACGTTCTCCTGATCAAATATCACTCTCCTACTTACAGGA’ for no PAM.
Defaults to None.
reference_seq (str, optional) – Custom reference sequence to overwrite the one in mpileup.table. Can be a FASTA file or a DNA sequence. Defaults to None.
output_fig_heatmap (str, optional) – Path to the heatmap output figure. Defaults to None.
output_fig_count_ratio (str, optional) – Path to the count/ratio output figure. Defaults to None.
output_table_heatmap (str, optional) – Path to the heatmap output table. Defaults to None.
output_table_count_ratio (str, optional) – Path to the count/ratio output table. Defaults to None.
output_fig_fmt (str, optional) – Format of the output figures, either “pdf” or “png”. Defaults to “pdf”.
input_table_header (bool, optional) – Whether the input tables have headers. Defaults to True.
to_base (str, optional) – Reference bases to convert to, separated by commas. Defaults to “A,G,C,T,Ins,Del”.
heatmap_mut_direction (str, optional) – Mutation directions to plot, specified as [from Base][to Base], separated by commas. Defaults to “CT,GA”.
count_ratio (str, optional) – Type of plot to generate: “count”, “ratio”, or “all”. Defaults to “all”.
region_extend_length (int, optional) – Number of base pairs to extend on both sides of the region. Defaults to 5.
output_fig_dpi (int, optional) – DPI for the output figures. Defaults to 100.
show_indel (bool, optional) – Whether to show indel information in the output figures. Defaults to True.
show_index (bool, optional) – Whether to display index information in the output figures. Defaults to True.
block_ref (bool, optional) – Whether to hide colors for reference sites in the output figures. Defaults to True.
box_border (bool, optional) – Whether to display box borders in the output figures. Defaults to False.
box_space (float, optional) – Space size between two boxes in the heatmap. Defaults to 0.03.
min_color (tuple, optional) – Minimum color for the heatmap in RGB format. Defaults to (250, 239, 230).
max_color (tuple, optional) – Maximum color for the heatmap in RGB format. Defaults to (154, 104, 57).
min_ratio (float, optional) – Mutation ratios below this value will appear white. Defaults to 0.001.
max_ratio (float, optional) – Mutation ratios above this value will be capped. Defaults to 0.99.
local_alignment_scoring_matrix (tuple, optional) – Alignment scoring parameters as a tuple: (<align_match_score>, <align_mismatch_score>, <align_gap_open_score>, <align_gap_extension_score>). Defaults to (5, -4, -24, -8).
local_alignment_min_score (int, optional) – Minimum alignment score to consider as a valid alignment. Defaults to 15.
PAM_priority_weight (float, optional) – Weight multiplier for PAM alignment scores. Defaults to 1.0.
get_built_in_target_seq (bool, optional) – Set to True to return built-in target sequence information. Defaults to False.
log_level (str, optional) – Logging level. One of ‘CRITICAL’, ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’, ‘NOTSET’. Defaults to “INFO”.
- Returns:
None
Examples
>>> # Generate mutation tables
>>> samtools mpileup test_sorted.bam --reference HK4-AOut-1.ref.upper.fa | gzip > test_sorted.mpileup.gz
>>> bioat bam mpileup_to_table test_sorted.mpileup.gz > test_sorted.mpileup.info1.tsv
>>> bioat bam mpileup_to_table test_sorted.mpileup.gz > test_sorted.mpileup.info2.tsv
>>> bioat bam mpileup_to_table test_sorted.mpileup.gz > test_sorted.mpileup.info3.tsv
>>> # Generate region heatmap comparison
>>> bioat target_seq region_heatmap_compare --input_tables test_sorted.mpileup.info1.tsv,test_sorted.mpileup.info2.tsv,test_sorted.mpileup.info3.tsv --labels condition1,condition2,condition3 --target_seq HEK4
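Because input_tables and labels are passed as comma-separated strings, it can help to assemble the command from Python lists. The following is a minimal sketch; build_compare_command is a hypothetical helper, it uses only the flags shown in the example above, and it assumes one label per table, which the documentation does not state explicitly.

```python
import shlex


def build_compare_command(tables, labels, target_seq=None):
    """Return an argument list for `bioat target_seq region_heatmap_compare`.

    Hypothetical helper; assumes there is exactly one label per input table.
    """
    if len(tables) != len(labels):
        raise ValueError("need exactly one label per input table")
    cmd = [
        "bioat", "target_seq", "region_heatmap_compare",
        "--input_tables", ",".join(tables),
        "--labels", ",".join(labels),
    ]
    if target_seq is not None:
        cmd += ["--target_seq", target_seq]
    return cmd


cmd = build_compare_command(
    ["test_sorted.mpileup.info1.tsv", "test_sorted.mpileup.info2.tsv"],
    ["condition1", "condition2"],
    target_seq="HEK4",
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run once the printed command looks right.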
Module contents
BioAT package. BioAT can be imported as a Python package or run as a command-line tool. It is a bundle of bioinformatics tools for Python.
- class bioat.BamTools[source]
Bases: object
Bam toolbox.
- mpileup2table(mpileup: str | ~pathlib.Path, output: str | ~pathlib.Path | ~_io.TextIOWrapper = <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, threads: int = 1, mutation_number_threshold: int = 0, temp_dir: str | ~pathlib.Path = '__bioat_temp_dir', remove_temp: bool = True, log_level: str = 'WARNING') None[source]
Converts an mpileup file to a structured info file.
- Parameters:
mpileup (str) – Path to the samtools mpileup format file.
output (str | TextIOWrapper) – Path to the output file where parsed data will be stored. Defaults to standard output.
threads (int) – Number of threads to utilize for processing. Defaults to 1.
mutation_number_threshold (int) – Threshold for mutation information; set to 0 to include all sites.
temp_dir (str, optional) – Directory for temporary files. Defaults to ‘__bioat_temp_dir’.
remove_temp (bool, optional) – Indicator for whether to remove temporary files after processing. Defaults to True.
log_level (str, optional) – Level of logging. Options are ‘CRITICAL’, ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’, ‘NOTSET’. Defaults to ‘WARNING’.
- Returns:
This function does not return a value. It outputs a file based on the provided parameters.
- Return type:
None
- remove_clip(input: str | ~pathlib.Path | ~_io.TextIOWrapper = <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>, output: str | ~pathlib.Path | ~_io.TextIOWrapper = <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, threads: int = 2, output_fmt: str = 'SAM', remove_as_paired: bool = False, max_clip: int = 0, log_level: str = 'WARNING')[source]
Remove soft/hard clipped reads from a BAM/SAM file.
This method removes soft/hard clipped reads from a BAM/SAM file. It can accept input from stdin and produce output to stdout.
- Parameters:
input (str | TextIOWrapper) – BAM file sorted by query name with soft/hard clipped reads. Pipe stdin is supported, e.g.: [samtools view -h foo_sort_name.bam | bioat bam remove_clip <flags>].
output (str | TextIOWrapper) – BAM file sorted by query name without soft/hard clipped reads. Pipe stdout is supported, e.g.: [bioat bam remove_clip <flags> | wc -l] or [bioat bam remove_clip <flags> | samtools view ….].
threads (int, optional) – Number of threads used by pysam and samtools core. Defaults to 2.
output_fmt (str, optional) – Format of the output file, can be “BAM” or “SAM”. Defaults to “SAM”.
remove_as_paired (bool, optional) – Flag to determine whether to remove single clipped reads. If True, removes both the clipped read and its paired read. The input BAM/SAM must be sorted by name and have header [SO:queryname]. If False, only removes the single clipped read.
max_clip (int, optional) – The maximum number of clips allowed per read. Defaults to 0.
log_level (str, optional) – Logging level for the process. Can be one of ‘CRITICAL’, ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’, ‘NOTSET’. Defaults to “WARNING”.
- Returns:
None
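To illustrate what the max_clip threshold refers to, the sketch below counts soft (S) and hard (H) clip operations in a CIGAR string using only the standard library. The helper names are hypothetical, and counting clip operations (rather than clipped bases) is an assumption; the documentation above does not say which of the two remove_clip uses.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def count_clips(cigar: str) -> int:
    """Count soft (S) and hard (H) clip operations in a CIGAR string."""
    return sum(1 for _length, op in CIGAR_OP.findall(cigar) if op in "SH")


def keep_read(cigar: str, max_clip: int = 0) -> bool:
    """Keep a read only if it has at most max_clip clip operations.

    Assumption: the threshold applies to the number of clip operations.
    """
    return count_clips(cigar) <= max_clip


for cigar in ("100M", "5S95M", "5S90M5H"):
    print(cigar, count_clips(cigar), keep_read(cigar))
```
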
- class bioat.CrisprTools[source]
Bases: object
CRISPR mining toolbox.
This class provides methods for performing CRISPR analysis on datasets, including finding Cas proteins and CRISPR sequences.
- Variables:
None
- cas13_finder(input_faa, output_faa=None, lmin=200, lmax=1500, log_level='INFO')[source]
De novo annotation for Cas13 candidates from proteins.faa.
- Parameters:
input_faa (str) – The input file containing Cas candidates in .faa format.
output_faa (str, optional) – The output file for Cas13 candidates in .faa format. Defaults to None.
lmin (int, optional) – Minimum length for a Cas candidate. Defaults to 200.
lmax (int, optional) – Maximum length for a Cas candidate. Defaults to 1500.
log_level (str, optional) – The logging level. Options are ‘CRITICAL’, ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’, ‘NOTSET’. Defaults to ‘INFO’.
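The lmin and lmax parameters define a length window for candidate proteins. The sketch below shows only that length filter on FASTA records, using plain Python. read_fasta and filter_by_length are hypothetical helpers, the window is assumed to be inclusive at both ends, and any further Cas13-specific criteria applied by cas13_finder are not represented.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, lmin=200, lmax=1500):
    """Keep records whose sequence length lies in [lmin, lmax] (assumed inclusive)."""
    return [(h, s) for h, s in records if lmin <= len(s) <= lmax]


# Toy input: three proteins of length 150, 300 and 2000.
faa = [">short", "M" * 150, ">candidate", "M" * 300, ">long", "M" * 2000]
kept = filter_by_length(read_fasta(faa))
print([header for header, _ in kept])
```
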
- cas_finder(input_fa, output_faa=None, output_contig_fa=None, output_crispr_info_tab=None, lmin=3000, lmax=None, extend=10000, temp_dir=None, prodigal=None, prodigal_mode='meta', pilercr=None, rm_temp=True, log_level='INFO')[source]
De novo annotation for Cas candidates from neighbors of CRISPR loci.
- Parameters:
input_fa (str) – Path to the input metagenome fasta file containing many contigs.
output_faa (str, optional) – Path to save the de novo annotated Cas candidates.
output_contig_fa (str, optional) – Path to save the whole contigs of de novo annotated Cas candidates.
output_crispr_info_tab (str, optional) – Path to save the de novo annotated CRISPR info table (CSV format).
lmin (int, optional) – Minimum length for a contig. Defaults to 3000.
lmax (int, optional) – Maximum length for a contig. Defaults to None.
extend (int, optional) – Distance over which proteins are considered from the start/end of the CRISPR loci. Defaults to 10000.
temp_dir (str, optional) – Directory to store temporary files. Defaults to None.
prodigal (str, optional) – Path to the Prodigal executable. Defaults to None.
prodigal_mode (str, optional) – Mode for Prodigal annotation. Can be “meta” or “single”. Defaults to “meta”.
pilercr (str, optional) – Path to the Pilercr executable. Defaults to None.
rm_temp (bool, optional) – If False, temporary files will be kept. Defaults to True.
log_level (str, optional) – Logging level; set to “DEBUG” to see logs from prodigal and pilercr. Defaults to “INFO”.
- class bioat.FastxTools[source]
Bases: object
FASTA & FASTQ toolbox.
- filter_read_contains_n(file: str, output='<stdout>')[source]
Filter reads that contain the ‘N’ base in FASTA or FASTQ formats.
This function processes a given FASTA or FASTQ file and filters out reads that contain the ‘N’ base. The result is directed to an output file, or to stdout if no output file is specified.
- Parameters:
file (str) – The name of the FASTA or FASTQ file to be processed.
output (str) – The name of the output file. Defaults to stdout if not specified, and the format matches the input.
- Returns:
None
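The filtering step can be pictured with a small FASTQ-only sketch. drop_reads_with_n is a hypothetical helper, not the BioAT implementation; it assumes strict four-line FASTQ records and checks for an uppercase ‘N’ in the sequence line only.

```python
def drop_reads_with_n(fastq_lines):
    """Return the FASTQ lines of reads whose sequence contains no 'N'.

    Assumes four lines per record: header, sequence, '+', qualities.
    """
    lines = [line.rstrip("\n") for line in fastq_lines]
    kept = []
    for i in range(0, len(lines) - 3, 4):
        record = lines[i:i + 4]
        if "N" not in record[1]:
            kept.extend(record)
    return kept


reads = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "ACGNACGT", "+", "IIIIIIII",
]
print("\n".join(drop_reads_with_n(reads)))
```
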
- fmt_this(file: str, new_file: str | None = None, force=False, log_level='WARNING')[source]
Formats a FASTA file to improve readability.
- Parameters:
file (str) – The input filename for the FASTA file.
new_file (str | None) – The output filename. If None, the file will be replaced. Default is None.
force (bool) – If True, forces the formatting even if the output file exists. Default is False.
log_level (str) – The logging level for messages. Default is “WARNING”.
This function calls ‘format_this_fastx’ to perform the actual formatting on the specified FASTA file.
- mgi_parse_md5(file: str, log_level='WARNING')[source]
Converts a mgi-like MD5 file into a standard MD5 file.
- Parameters:
file (str) – The name of the mgi-like MD5 file to read.
log_level (str, optional) – The logging level to use. It can be INFO, DEBUG, WARNING, or ERROR. The default is WARNING.
- plot_length_distribution(file: str, table: str | None = None, image: str | None = None, plt_show: bool = False, log_level='WARNING')[source]
Plots the length distribution of a FASTA file.
- Parameters:
file (str) – The input filename for the FASTA file.
table (str | None, optional) – The output filename for the length distribution table. Default is <file>.lengths.
image (str | None, optional) – The output filename for the length distribution figure. Default is <file>.lengths.pdf.
plt_show (bool, optional) – If True, shows the plot. Default is False.
log_level (str) – The logging level for messages. Default is “WARNING”.
- class bioat.FoldTools[source]
Bases: object
Folding toolbox.
- get_cut2ref_aln_info(ref: str | Structure, cut: str | Structure, cal_rmsd=True, cal_tmscore=False, label1='ref', label2='cut', usalign_bin: str = 'usalign', log_level='WARNING')[source]
Align a cut PDB structure to a reference PDB structure using the CA atoms.
Aligns a truncated protein structure (cut) to its full-length reference structure (ref) using CA atoms and Biopython’s Superimposer.
This function:
- Extracts all CA atoms from ref and cut
- Removes atoms from ref at the indices listed in gap_indices
- Aligns the remaining atoms from cut to the corresponding positions in ref
- Modifies the cut structure in-place to match the aligned orientation
- Returns both structures and the RMSD value of the alignment
It assumes:
- One-to-one correspondence between residues after gap removal
- Structures are predicted by AlphaFold2 / ESMFold (no missing atoms)
- Parameters:
ref (str or Bio.PDB.Structure.Structure) – Reference structure path or loaded Structure.
cut (str or Bio.PDB.Structure.Structure) – Truncated structure path or loaded Structure.
cal_rmsd (bool, optional) – Whether to calculate RMSD. Default is True.
cal_tmscore (bool, optional) – Whether to calculate TM-score using USalign. Default is False.
label1 (str, optional) – Name for the reference structure. Default is “ref”.
label2 (str, optional) – Name for the cut structure. Default is “cut”.
usalign_bin (str, optional) – Path to the USalign binary for TM-score calculation. Default is “usalign”.
log_level (str, optional) – Logging level. Default is “WARNING”.
- Returns:
- {
“{label1}”: aln label1 structure, # if cal_rmsd is True, unaltered label1 structure
“{label2}”: fixed label2 structure, # if cal_rmsd is True, label2 coords are fixed in-place
“RMSD”: 0.123, # if cal_rmsd is True, the RMSD value between label1 and label2
“{label1}_seq”: ref_seq, # if cal_rmsd is True, the sequence of label1 structure
“{label2}_seq”: cut_seq, # if cal_rmsd is True, the sequence of label2 structure
“alignment_dict”: alignment_dict, # if cal_rmsd is True, the alignment dict of label1 and label2
“gap_indices”: gap_indices, # if cal_rmsd is True, the indices of gaps in label1 structure
“TM-score:mean”: 0.623, # if cal_tmscore is True, the mean TM-score value
“TM-score:TM1”: 0.456, # if cal_tmscore is True, use label1 as ref <L_N> in calculation
“TM-score:TM2”: 0.789, # if cal_tmscore is True, use label2 as ref <L_N> in calculation
…
}
- Return type:
dict
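The gap-removal and RMSD steps described above can be sketched in plain Python. This is a simplified illustration under stated assumptions: drop_gap_positions and rmsd are hypothetical helpers, coordinates are plain (x, y, z) tuples rather than Biopython atoms, and the optimal superposition that Superimposer performs before measuring RMSD is deliberately left out, so the value is only meaningful for coordinates that are already superimposed.

```python
import math


def drop_gap_positions(ref_coords, gap_indices):
    """Remove reference CA coordinates at the given gap indices."""
    gaps = set(gap_indices)
    return [xyz for i, xyz in enumerate(ref_coords) if i not in gaps]


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired coordinate lists.

    No rotation or translation is applied here (unlike Superimposer).
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(total / len(coords_a))


# Toy data: residue 1 of the reference is missing from the cut structure.
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
cut = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 1.0)]
ref_kept = drop_gap_positions(ref, gap_indices=[1])
print(round(rmsd(ref_kept, cut), 4))
```
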
- pdb2fasta(input_pdb: str | Path, output_fasta: str | Path | None = None, log_level='WARNING')[source]
Converts a PDB file to a FASTA file.
- Details:
Proteins: The protein sequence for each chain will be extracted as “Chain X Protein”.
DNA and RNA: Bases for DNA (A, T, G, C) will be saved as “Chain X DNA”, and bases for RNA (A, U, G, C) will be saved as “Chain X RNA”.
Other molecules: Any unrecognized molecules (e.g., ions, modified molecules) will be labeled as [residue] and stored as “Chain X Other molecules”.
Multi-chain complexes: The program supports multi-chain structures in complexes, and the content of each chain will be recorded separately.
- Parameters:
input_pdb (str or Path) – Path to the input PDB/CIF file or Biopython Structure.
output_fasta (str or Path, optional) – Output file path. If None, the output file will be named as the basename of the input file with a “.fa” extension. Defaults to None.
func_return (bool, optional) – Whether to return a list of SeqRecord objects; useful when called as a function rather than from the command line. Defaults to False.
log_level (str, optional) – Logging level. Defaults to “WARNING”.
- Returns:
List of SeqRecord if func_return is True, otherwise None.
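The per-chain classification described under Details can be sketched as a lookup on PDB residue names. The helpers below are hypothetical and do not reproduce the BioAT implementation; they assume standard PDB naming (three-letter amino acids, DA/DT/DG/DC for DNA, A/U/G/C for RNA) and take a plain list of residue names instead of a parsed structure.

```python
AA3_TO_1 = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}
DNA_TO_1 = {"DA": "A", "DT": "T", "DG": "G", "DC": "C"}
RNA_NAMES = {"A", "U", "G", "C"}


def chain_sequences(resnames):
    """Group the residues of one chain by molecule type."""
    seqs = {"Protein": "", "DNA": "", "RNA": "", "Other molecules": ""}
    for name in resnames:
        name = name.strip().upper()
        if name in AA3_TO_1:
            seqs["Protein"] += AA3_TO_1[name]
        elif name in DNA_TO_1:
            seqs["DNA"] += DNA_TO_1[name]
        elif name in RNA_NAMES:
            seqs["RNA"] += name
        else:
            # Unrecognized residues are kept as [residue] labels.
            seqs["Other molecules"] += f"[{name}]"
    return {kind: seq for kind, seq in seqs.items() if seq}


def to_fasta(chain_id, resnames):
    """Format one chain as FASTA text with 'Chain X <type>' headers."""
    return "".join(
        f">Chain {chain_id} {kind}\n{seq}\n"
        for kind, seq in chain_sequences(resnames).items()
    )


print(to_fasta("A", ["MET", "ALA", "GLY", "ZN"]), end="")
```
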
- show_ref_cut(ref_seq: str | Path | Seq, ref_pdb: str | Path | Structure, cut_seq: list[str | Path | Seq] | str | Path | Seq | None = None, cut_pdb: list[str | Path | Structure] | str | Path | Structure | None = None, cut_labels: list[str] | str | None = None, ref_color: str = 'red', ref_map_colors: tuple[str, str] | None = None, ref_map_values: dict | None = None, cut_color='lightgray', gap_color='purple', ref_style='cartoon', cut_style='cartoon', gap_style='cartoon', ref_map_value_random: bool = False, output_fig: str | Path | None = None, col: int = 4, scale: float = 1.0, annotate: bool = True, text_interval: int = 5, log_level='WARNING')[source]
Visualizes the alignment of sequences and highlights changes in PDB structures using py3Dmol.
- Parameters:
ref_seq (str or Path or Seq) – Amino acid sequence content for the ref protein.
ref_pdb (str or Path or Bio.PDB.Structure.Structure) – Path to the PDB file of the reference structure.
cut_seq (str, Path or Seq or None, optional) – Amino acid sequence content for the cut protein.
cut_pdb (str, Path or Bio.PDB.Structure.Structure or None, optional) – Path to the PDB file of the cut structure.
cut_labels (list[str] or str or None, optional) – Label for the cut proteins. If None, the label will be set to “cut”.
ref_color (str, optional) – Color for reference residues.
ref_map_colors (tuple[str, str] or None, optional) – ref_map_colors will be used as color bar from ref_map_colors[0] to ref_map_colors[1]. If None, do not apply color mapping. Defaults to None.
ref_map_values (dict or None, optional) – A dictionary of values for the ref color map, it will be normalized to the range of [0 - 1]. If None, all residues will be colored with the same color. e.g. ref_value_dict = {‘V_0’: 0.4177215189873418, ‘S_1’: 0.8185654008438819, ‘K_2’: 0.9915611814345991, ‘G_3’: 0.42616033755274263, …}
cut_color (str, optional) – Color for cut residues.
gap_color (str, optional) – Color for gaps or removed residues.
ref_style (str, optional) – “stick”, “sphere”, “cartoon”, or “line”
cut_style (str, optional) – “stick”, “sphere”, “cartoon”, or “line”
gap_style (str, optional) – “stick”, “sphere”, “cartoon”, or “line”
ref_map_value_random (bool, optional) – If True, ref_value_dict will be randomly generated. Defaults to False.
output_fig (str or None, optional) – Output figure file path. If None, the figure will not be saved in html format. Defaults to None.
col (int, optional) – Number of columns for the visualization. Defaults to 4.
scale (float, optional) – Scale factor for the visualization. Defaults to 1.0.
annotate (bool, optional) – Whether to annotate the visualization with labels. Defaults to True.
text_interval (int, optional) – The interval between text annotations. Defaults to 5.
log_level (str, optional) – Log level. Defaults to “WARNING”.
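The ref_map_values dictionary is described as being normalized to the range [0, 1]. A minimal sketch of one common way to do this, min-max scaling, is shown below. normalize_map_values is a hypothetical helper, and the choice of min-max scaling is an assumption; the documentation does not say which normalization show_ref_cut applies.

```python
def normalize_map_values(values):
    """Min-max scale a {residue_key: value} dict into the range [0, 1]."""
    if not values:
        return {}
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All values are identical; map everything to 0.0 to avoid dividing by zero.
        return {key: 0.0 for key in values}
    return {key: (value - lo) / (hi - lo) for key, value in values.items()}


# Keys follow the '<residue>_<index>' pattern from the parameter description.
raw = {"V_0": 2.0, "S_1": 4.0, "K_2": 6.0}
print(normalize_map_values(raw))
```
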
- class bioat.MetaTools[source]
Bases: object
Metagenome toolbox.
- JGI_query(query_info: str | None = None, xml: str | None = None, log_fails: str | None = None, nretry: int = 4, timeout: int = 60, regex: str | None = None, all_get: bool = False, overwrite_conf: bool = False, filter_files: bool = False, proxy_pool: str | None = None, just_query_xml: bool = False, syntax_help: bool = False, usage: bool = False, log_level: str = 'INFO')[source]
JGI_query: Tool for downloading files from the JGI-IMG database.
This function lists and retrieves files from JGI using the curl API and returns a list of all files available for download for a given query organism.
The source code is adapted from https://github.com/glarue/jgi-query.
- Parameters:
query_info (str | None) – Organism name formatted per JGI’s abbreviation. Example: ‘Nematostella vectensis’ is abbreviated by JGI as ‘Nemve1’. The correct abbreviation can be found by searching for the organism on JGI; the name used in the URL of the ‘Info’ page for that organism is the correct abbreviation. The full URL may also be used for this argument.
xml (str | None) – Specify a local XML file for the query instead of retrieving a new copy from JGI.
log_fails (str | None) – Log file containing URLs to retry downloading from in case of failure.
nretry (int) – Number of times to retry downloading files with errors. Use 0 to skip such files.
timeout (int) – Timeout (in seconds) for downloading. Set to -1 to disable.
regex (str | None) – Regex pattern to use for auto-selecting and downloading files without interaction.
all_get (bool) – If True, auto-select and download all files for the query without interaction.
overwrite_conf (bool) – If True, initiate configuration dialog to overwrite existing user/password configuration.
filter_files (bool) – Under development. Filter organism results by config categories instead of reporting all files listed by JGI for the query.
proxy_pool (str | None) – URL for the proxy pool, e.g., http://abc.com:port. See https://github.com/hermanzhaozzzz/proxy_pool.
just_query_xml (bool) – Set True if you just want to save the XML file.
syntax_help (bool) – If True, provide syntax help in doc mode.
usage (bool) – If True, print verbose usage information and exit.
log_level (str) – Set the logging level. Options include ‘CRITICAL’, ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’, ‘NOTSET’.
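A pattern for the regex parameter can be tried out on a list of file names before starting a download. The sketch below uses only the standard re module; select_files and the file names are hypothetical, and whether JGI_query matches with search or match semantics is an assumption.

```python
import re


def select_files(filenames, pattern):
    """Return the file names in which the regex pattern is found."""
    rx = re.compile(pattern)
    return [name for name in filenames if rx.search(name)]


# Hypothetical file names, for illustration only.
names = ["Nemve1.fasta.gz", "Nemve1.gff.gz", "README.txt"]
print(select_files(names, r"\.gff\.gz$"))
```
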
- class bioat.SearchTools[source]
Bases: object
Search toolbox.
- google_scholar(keyword: str, output: str | Path | None = None, sort_by: str = 'CitePerYear', n_results: int = 100, plot: bool = False, start_year: int | None = None, end_year: int = 2026, log_level: str = 'WARNING', **kwargs)[source]
Search Google Scholar.
This method creates a DataFrame of publication data from Google Scholar. Each result includes title, citations, year, authors, venue, publisher, and link. It is useful for finding relevant papers by citation metrics.
Optionally, it can generate a plot of citation counts versus rank and save the table in various formats.
- Parameters:
keyword (str) – Keyword to search for. For exact matches, wrap in double and single quotes, e.g., “‘exact keyword’”.
output (str, optional) – Output file path. Supported formats: .csv, .tsv, .xls, .xlsx. Default is None, which means no output and only print the table in the console.
sort_by (str) – Column to sort the result by, such as “Citations” or “CitePerYear”. Default is “CitePerYear”.
n_results (int) – Number of search results to retrieve. Default is 100.
plot (bool) – Whether to plot citation count vs. rank. Default is False.
start_year (int, optional) – Optional start year for publication filtering.
end_year (int) – End year for publication filtering. Default is current year.
log_level (str) – Logging level. One of: ‘CRITICAL’, ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’, ‘NOTSET’.
- Returns:
None. Prints the table and optionally saves or plots it.
- query_patent(seq: str, query_name: str | None = None, username: str | None = None, password: str | None = None, via_proxy: str | None = None, output: str | None = None, nobrowser: bool = True, retry: int = 3, local_browser: str | None = None, rm_fail_cookie: bool = False, log_level: str = 'INFO')[source]
Return a table listing patent BLAST hits from lens.org.
- Parameters:
seq – protein sequence, e.g. MCRISQQKK
query_name – queryName in output table
username – ORCID username (usually an email address)
password – ORCID password
via_proxy – proxy URL, e.g. http://127.0.0.1:8234 or socks5://127.0.0.1:8235
output – output table (.csv or .csv.gz)
nobrowser – whether or not to open a browser for debugging
retry – maximum number of retries
local_browser – path to a local Firefox executable
rm_fail_cookie – remove local cookies if the query fails. Defaults to False.
log_level – one of ‘CRITICAL’, ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’, ‘NOTSET’
- class bioat.TableTools[source]
Bases: object
To integrate tables.
- merge(inputs, tags, output, input_fmt='tsv', output_fmt='tsv', input_header=False, output_header=False, log_level='WARNING')[source]
A simple tool to merge identically formatted tables from different samples.
- Parameters:
inputs – table files
tags – tags for each table
output – merged file
input_fmt – tsv | csv
output_fmt – tsv | csv
input_header – True | False, whether the input tables have a header
output_header – True | False, whether the output table has a header
log_level – log status
- split(input: str, n: int, output_prefix=None, input_fmt='tsv', output_fmt='tsv', input_header=False, output_header=False, compress=False, log_level='WARNING')[source]
A simple tool to split a table into parts.
- Parameters:
input – table to split
n – number of parts to split the table into
output_prefix – name prefix for the split parts; same as input if not defined
input_fmt – tsv | csv
output_fmt – tsv | csv
input_header – True | False, whether the input table has a header
output_header – True | False, whether the output tables have a header
compress – True | False, whether to gzip the output tables
log_level – log status
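Splitting a table into n parts can be pictured as dividing its rows into near-equal consecutive chunks. The sketch below is an illustration only: split_rows is a hypothetical helper, and giving the earlier parts one extra row when the rows do not divide evenly is an assumption about behaviour the documentation does not specify.

```python
def split_rows(rows, n):
    """Split a list of rows into n consecutive, near-equal parts."""
    if n < 1:
        raise ValueError("n must be at least 1")
    size, extra = divmod(len(rows), n)
    parts, start = [], 0
    for i in range(n):
        # The first `extra` parts receive one additional row.
        end = start + size + (1 if i < extra else 0)
        parts.append(rows[start:end])
        start = end
    return parts


rows = [f"row{i}" for i in range(7)]
for part in split_rows(rows, 3):
    print(part)
```
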
- class bioat.TargetSeq[source]
Bases: object
Target Deep Sequencing toolbox.
- region_heatmap(input_table: str, output_fig: str, target_seq: str | None = None, reference_seq: str | None = None, input_table_header: bool = True, output_fig_fmt: str = 'pdf', output_fig_dpi: int = 100, show_indel: bool = True, show_index: bool = True, box_border: bool = False, box_space: float = 0.03, min_color: tuple = (250, 239, 230), max_color: tuple = (154, 104, 57), min_ratio: float = 0.001, max_ratio: float = 0.99, region_extend_length: int = 5, local_alignment_scoring_matrix: tuple = (5, -4, -24, -8), local_alignment_min_score: int = 15, PAM_priority_weight: float = 1.0, get_built_in_target_seq: bool = False, log_level: str = 'INFO')[source]
Plot region mutation information using a table generated by bioat bam mpileup_to_table.
This function generates a visualization of mutation information for a specific genomic region based on a table created by the bioat bam mpileup_to_table command.
- Parameters:
input_table (str) – Path to the table generated by bioat bam mpileup_to_table. This table should contain base mutation information for a short genome region (no more than 1k nt).
output_fig (str) – Path to the output figure file.
target_seq (str, optional) – Target sequence to align against the reference sequence in mpileup.table.
Examples
‘GAGTCCGAGCAGAAGAAGAA^GGG^’ for SpCas9-BE (PAM: ^GGG^).
‘^TTTA^GCCCCAATAATCCCCACATGTCA’ for cpf1-BE (PAM: ^TTTA^).
‘TGCTAGTAACCACGTTCTCCTGATCAAATATCACTCTCCTACTTACAGGA’ for no PAM.
Defaults to None.
reference_seq (str, optional) – Custom reference sequence to overwrite the one in mpileup.table. Can be a FASTA file or a DNA sequence. Defaults to None.
input_table_header (bool, optional) – Whether the input_table contains a header. Defaults to True.
output_fig_fmt (str, optional) – Format of the output figure (“pdf” or “png”). Defaults to “pdf”.
output_fig_dpi (int, optional) – DPI for the output figure. Defaults to 100.
show_indel (bool, optional) – Whether to show indel information in the output figure. Defaults to True.
show_index (bool, optional) – Whether to display index information in the output figure. Defaults to True.
box_border (bool, optional) – Whether to display box borders in the output figure. Defaults to False.
box_space (float, optional) – Space size between two boxes. Defaults to 0.03.
min_color (tuple, optional) – Minimum color for the heatmap in RGB format. Defaults to (250, 239, 230).
max_color (tuple, optional) – Maximum color for the heatmap in RGB format. Defaults to (154, 104, 57).
min_ratio (float, optional) – Mutation ratio below min_ratio will be shown as white. Defaults to 0.001.
max_ratio (float, optional) – Mutation ratio above max_ratio will be capped. Defaults to 0.99.
region_extend_length (int, optional) – Number of base pairs to extend on either side of the region. Defaults to 5.
local_alignment_scoring_matrix (tuple, optional) – Alignment scoring parameters as a tuple: (<align_match_score>, <align_mismatch_score>, <align_gap_open_score>, <align_gap_extension_score>). Defaults to (5, -4, -24, -8).
local_alignment_min_score (int, optional) – Minimum alignment score to consider as a valid alignment. Defaults to 15.
PAM_priority_weight (float, optional) – Weight multiplier for PAM alignment scores. Defaults to 1.0.
get_built_in_target_seq (bool, optional) – Set to True to return built-in target sequence information. Defaults to False.
log_level (str, optional) – Logging level. One of ‘CRITICAL’, ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’, ‘NOTSET’. Defaults to “INFO”.
- Returns:
None
Examples
>>> # Generate mutation table
>>> samtools mpileup test_sorted.bam --reference HK4-AOut-1.ref.upper.fa | gzip > test_sorted.mpileup.gz
>>> bioat bam mpileup_to_table test_sorted.mpileup.gz > test_sorted.mpileup.info.tsv
>>> # Generate region heatmap
>>> bioat target_seq region_heatmap --input_table test_sorted.mpileup.info.tsv --output_fig test_sorted.mpileup.info.pdf
- region_heatmap_compare(input_tables: str, labels: str | None = None, target_seq: str | None = None, reference_seq: str | None = None, output_fig_heatmap: str | None = None, output_fig_count_ratio: str | None = None, output_table_heatmap: str | None = None, output_table_count_ratio: str | None = None, output_fig_fmt: str = 'pdf', input_table_header: bool = True, to_base: tuple = ('A', 'G', 'C', 'T', 'Ins', 'Del'), heatmap_mut_direction: tuple = ('CT', 'GA'), count_ratio='all', region_extend_length: int = 5, output_fig_dpi: int = 100, show_indel: bool = True, show_index: bool = True, block_ref: bool = True, box_border: bool = False, box_space: float = 0.03, min_color: tuple = (250, 239, 230), max_color: tuple = (154, 104, 57), min_ratio: float = 0.001, max_ratio: float = 0.99, local_alignment_scoring_matrix: tuple = (5, -4, -24, -8), local_alignment_min_score: int = 15, PAM_priority_weight: float = 1.0, get_built_in_target_seq: bool = False, log_level: str = 'INFO')[source]
Plot region mutation information for multiple conditions.
This function generates a comparison of mutation information across multiple conditions using tables generated by bioat bam mpileup_to_table.
- Parameters:
input_tables (str) – Paths to input tables generated by bioat bam mpileup_to_table, separated by commas. Each table should contain mutation information for a short genomic region (≤1k nt).
labels (str) – Labels for the panels, separated by commas.
target_seq (str, optional) –
Sequence to align against the reference sequence in mpileup.table. Examples:
‘GAGTCCGAGCAGAAGAAGAA^GGG^’ for SpCas9-BE (PAM: ^GGG^).
‘^TTTA^GCCCCAATAATCCCCACATGTCA’ for cpf1-BE (PAM: ^TTTA^).
‘TGCTAGTAACCACGTTCTCCTGATCAAATATCACTCTCCTACTTACAGGA’ for no PAM.
Defaults to None.
reference_seq (str, optional) – Custom reference sequence to overwrite the one in mpileup.table. Can be a FASTA file or a DNA sequence. Defaults to None.
output_fig_heatmap (str, optional) – Path to the heatmap output figure. Defaults to None.
output_fig_count_ratio (str, optional) – Path to the count/ratio output figure. Defaults to None.
output_table_heatmap (str, optional) – Path to the heatmap output table. Defaults to None.
output_table_count_ratio (str, optional) – Path to the count/ratio output table. Defaults to None.
output_fig_fmt (str, optional) – Format of the output figures, either “pdf” or “png”. Defaults to “pdf”.
input_table_header (bool, optional) – Whether the input tables have headers. Defaults to True.
to_base (tuple, optional) – Bases to convert to. Defaults to (‘A’, ‘G’, ‘C’, ‘T’, ‘Ins’, ‘Del’).
heatmap_mut_direction (tuple, optional) – Mutation directions to plot, each specified as [from Base][to Base]. Defaults to (‘CT’, ‘GA’).
count_ratio (str, optional) – Type of plot to generate: “count”, “ratio”, or “all”. Defaults to “all”.
region_extend_length (int, optional) – Number of base pairs to extend on both sides of the region. Defaults to 5.
output_fig_dpi (int, optional) – DPI for the output figures. Defaults to 100.
show_indel (bool, optional) – Whether to show indel information in the output figures. Defaults to True.
show_index (bool, optional) – Whether to display index information in the output figures. Defaults to True.
block_ref (bool, optional) – Whether to hide colors for reference sites in the output figures. Defaults to True.
box_border (bool, optional) – Whether to display box borders in the output figures. Defaults to False.
box_space (float, optional) – Space size between two boxes in the heatmap. Defaults to 0.03.
min_color (tuple, optional) – Minimum color for the heatmap in RGB format. Defaults to (250, 239, 230).
max_color (tuple, optional) – Maximum color for the heatmap in RGB format. Defaults to (154, 104, 57).
min_ratio (float, optional) – Mutation ratios below this value will appear white. Defaults to 0.001.
max_ratio (float, optional) – Mutation ratios above this value will be capped. Defaults to 0.99.
local_alignment_scoring_matrix (tuple, optional) – Alignment scoring parameters as a tuple: (<align_match_score>, <align_mismatch_score>, <align_gap_open_score>, <align_gap_extension_score>). Defaults to (5, -4, -24, -8).
local_alignment_min_score (int, optional) – Minimum alignment score to consider as a valid alignment. Defaults to 15.
PAM_priority_weight (float, optional) – Weight multiplier for PAM alignment scores. Defaults to 1.0.
get_built_in_target_seq (bool, optional) – Set to True to return built-in target sequence information. Defaults to False.
log_level (str, optional) – Logging level. One of ‘CRITICAL’, ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’, ‘NOTSET’. Defaults to “INFO”.
- Returns:
None
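Note
The interplay of min_ratio, max_ratio, min_color and max_color can be pictured with the short sketch below. It is not taken from the BioAT source: the helper name ratio_to_rgb and the linear interpolation between the two colors are assumptions made for illustration. Only the default values and the "below min_ratio is white, above max_ratio is capped" behavior come from the parameter descriptions above.

```python
def ratio_to_rgb(
    ratio,
    min_ratio=0.001,
    max_ratio=0.99,
    min_color=(250, 239, 230),
    max_color=(154, 104, 57),
):
    """Map a mutation ratio to an RGB tuple (hypothetical helper, not BioAT API)."""
    # Ratios below min_ratio are drawn white, as the parameter docs describe.
    if ratio < min_ratio:
        return (255, 255, 255)
    # Ratios above max_ratio are capped at max_ratio.
    capped = min(ratio, max_ratio)
    # Assumption: colors are interpolated linearly between min_color and max_color.
    frac = (capped - min_ratio) / (max_ratio - min_ratio)
    return tuple(round(lo + frac * (hi - lo)) for lo, hi in zip(min_color, max_color))


if __name__ == "__main__":
    for r in (0.0, 0.001, 0.5, 1.0):
        print(r, ratio_to_rgb(r))
```

Under these assumptions, narrowing the min_ratio to max_ratio window spreads the available color range over fewer ratio values, which makes small differences between conditions easier to see.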
Examples
>>> # Generate mutation tables
>>> samtools mpileup test_sorted.bam --reference HK4-AOut-1.ref.upper.fa | gzip > test_sorted.mpileup.gz
>>> bioat bam mpileup_to_table test_sorted.mpileup.gz > test_sorted.mpileup.info1.tsv
>>> bioat bam mpileup_to_table test_sorted.mpileup.gz > test_sorted.mpileup.info2.tsv
>>> bioat bam mpileup_to_table test_sorted.mpileup.gz > test_sorted.mpileup.info3.tsv
>>> # Generate region heatmap comparison
>>> bioat target_seq region_heatmap_compare --input_tables test_sorted.mpileup.info1.tsv,test_sorted.mpileup.info2.tsv,test_sorted.mpileup.info3.tsv --labels condition1,condition2,condition3 --target_seq HEK4
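When target_seq is given as an explicit sequence rather than a name such as HEK4, the PAM is marked with a pair of ^ characters, as in the parameter description above. The sketch below shows one way such a string could be split into the plain sequence and the PAM. The function split_target_seq and its return format are hypothetical and are not part of BioAT; the accepted alphabet (A, C, G, T, N) is also an assumption.

```python
import re

# One optional PAM wrapped in carets, with sequence on either side.
_PAM_PATTERN = re.compile(r"([ACGTN]*)\^([ACGTN]+)\^([ACGTN]*)")


def split_target_seq(target_seq):
    """Split a caret-marked target sequence (hypothetical helper, not BioAT API)."""
    seq = target_seq.upper()
    match = _PAM_PATTERN.fullmatch(seq)
    if match is None:
        # Carets that do not form exactly one well-formed PAM block are rejected.
        if "^" in seq:
            raise ValueError(f"malformed PAM markers in {target_seq!r}")
        return {"sequence": seq, "pam": None, "pam_start": None}
    left, pam, right = match.groups()
    # pam_start is the 0-based offset of the PAM within the marker-free sequence.
    return {"sequence": left + pam + right, "pam": pam, "pam_start": len(left)}


if __name__ == "__main__":
    print(split_target_seq("GAGTCCGAGCAGAAGAAGAA^GGG^"))
    print(split_target_seq("^TTTA^GCCCCAATAATCCCCACATGTCA"))
```

A PAM at offset 0 corresponds to the 5' layout of the Cpf1 example, while a PAM at the end of the sequence corresponds to the 3' layout of the SpCas9 example.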