cns.pipelines module
This module provides functions for processing Copy Number Segment (CNS) data. It includes functions to fill gaps, infer missing values, aggregate data, calculate coverage, and more.
Functions
main_align: Fills gaps in the CNS data and adds missing chromosomes.
main_infer: Infers missing values in the CNS data.
main_impute: Fills gaps, adds missing chromosomes, and infers missing values in the CNS data.
main_breakage: Identifies breakpoints in CNS data.
main_coverage: Calculates coverage statistics for CNS data.
main_ploidy: Calculates ploidy statistics for CNS data.
main_segment: Segments CNS data based on specified parameters.
main_aggregate: Aggregates CNS data over specified genomic segments.
main_seg_agg: Segments CNS data and aggregates the results.
- cns.pipelines.main_aggregate(cns_df, segs, how='mean', cn_columns=None, print_info=False)
Aggregates CNS data over specified genomic segments.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS data.
segs (pandas.DataFrame) – DataFrame containing segments over which to aggregate CNS data.
how (str, optional) – Aggregation method. Options are “mean”, “min”, “max”, or “none”. Default is “mean”.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
print_info (bool, optional) – If True, prints informational messages during processing. Default is False.
- Returns:
DataFrame with aggregated copy number values.
- Return type:
pandas.DataFrame
Notes
If how is not “none” and there are NaNs in cns_df, a warning is issued because NaNs are not considered in aggregation.
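To make the NaN behavior concrete, here is a minimal plain-Python sketch of aggregation over segments. It does not call cns.pipelines. The helper name aggregate_over_segments, the tuple layout, and the overlap-length weighting of the mean are all assumptions made for illustration, and the “none” option is left out because its behavior is not described here.

```python
import math

def aggregate_over_segments(cns, segs, how="mean"):
    """Aggregate copy number over target segments (hypothetical helper).

    cns  -- list of (chrom, start, end, cn) tuples; cn may be NaN
    segs -- list of (chrom, start, end) target segments
    Assumption: "mean" is weighted by overlap length and NaNs are skipped.
    """
    result = []
    for chrom, s_start, s_end in segs:
        values, weights = [], []
        for c_chrom, c_start, c_end, cn in cns:
            overlap = min(s_end, c_end) - max(s_start, c_start)
            if c_chrom != chrom or overlap <= 0 or math.isnan(cn):
                continue  # other chromosome, no overlap, or missing value
            values.append(cn)
            weights.append(overlap)
        if not values:
            agg = math.nan  # nothing usable overlaps this segment
        elif how == "mean":
            agg = sum(v * w for v, w in zip(values, weights)) / sum(weights)
        elif how == "min":
            agg = min(values)
        elif how == "max":
            agg = max(values)
        else:
            raise ValueError(f"unsupported how: {how!r}")
        result.append((chrom, s_start, s_end, agg))
    return result

cns = [("chr1", 0, 100, 2.0), ("chr1", 100, 200, 4.0), ("chr1", 200, 300, math.nan)]
segs = [("chr1", 0, 200), ("chr1", 150, 300)]
print(aggregate_over_segments(cns, segs, how="mean"))
print(aggregate_over_segments(cns, segs, how="max"))
```

In the second target segment the NaN-valued input is dropped, so the result reflects only the 50 bases that carry a value. This silent narrowing is the kind of effect the warning is meant to flag.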
Examples
>>> aggregated_cns = main_aggregate(cns_df, segs, how="max")
- cns.pipelines.main_align(cns_df, samples_df=None, cn_columns=None, segs=None, assembly=<cns.utils.assemblies.Assembly object>, print_info=False)
Aligns all samples with the reference assembly. Adds missing segments and chromosomes, and cuts off the ends if needed.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS (Copy Number Segment) data.
samples_df (pandas.DataFrame, optional) – DataFrame containing sample information. If None, samples are created from cns_df.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
assembly (Assembly object, optional) – Genome assembly to use. Default is hg19.
print_info (bool, optional) – If True, prints informational messages during processing. Default is False.
- Returns:
DataFrame with filled gaps and added missing chromosomes.
- Return type:
pandas.DataFrame
Notes
This function performs the following steps:
1. Adds tails to the CNS data to cover chromosome ends.
2. Fills gaps between segments.
3. Optionally adds missing chromosomes.
4. Removes outlier segments.
5. Merges neighboring segments with the same copy number.
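The tail-adding and gap-filling steps can be illustrated with a small plain-Python sketch for a single chromosome. This is not the library's implementation. The helper fill_gaps, the tuple layout, and the use of NaN as the placeholder value for filled regions are assumptions.

```python
import math

def fill_gaps(segments, chrom_length):
    """Return segments covering [0, chrom_length) (hypothetical helper).

    segments -- list of (start, end, cn) for ONE chromosome, non-overlapping
    Assumption: gaps and missing tails are filled with NaN placeholders.
    """
    filled = []
    cursor = 0
    for start, end, cn in sorted(segments):
        # Cut off any part that extends beyond the chromosome.
        start, end = max(start, 0), min(end, chrom_length)
        if start >= end:
            continue
        if start > cursor:
            filled.append((cursor, start, math.nan))  # leading tail or gap
        filled.append((start, end, cn))
        cursor = end
    if cursor < chrom_length:
        filled.append((cursor, chrom_length, math.nan))  # trailing tail
    return filled

print(fill_gaps([(10, 50, 2.0), (80, 120, 3.0)], chrom_length=100))
# A chromosome with no segments becomes one placeholder segment.
print(fill_gaps([], chrom_length=100))
```

Under these assumptions a chromosome that is entirely missing from the input ends up as a single NaN-valued segment spanning its full length, and an inference step can then replace that value.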
- cns.pipelines.main_breakage(cns_df, samples_df=None, cn_columns=None, segs=None, assembly=<cns.utils.assemblies.Assembly object>, print_info=False)
Identifies breakpoints in CNS data.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS data.
threshold (float, optional) – Threshold for detecting breakpoints. Default is 0.5.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
segs (segments dictionary, optional) – Dictionary of segments used for selective masking. Default is None.
assembly (Assembly object, optional) – Genome assembly to use. Default is hg19.
print_info (bool, optional) – If True, prints informational messages during processing. Default is False.
- Returns:
DataFrame containing breakpoint information.
- Return type:
pandas.DataFrame
Notes
This function detects breakpoints in the CNS data based on changes in copy number values.
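As an illustration of detecting breakpoints from changes in copy number, the sketch below compares adjacent segments in plain Python. The helper find_breakpoints and its output layout are hypothetical. Its threshold argument mirrors the documented threshold parameter, but treating it as a minimum absolute change in copy number is an assumption.

```python
def find_breakpoints(segments, threshold=0.5):
    """Report positions where copy number changes (hypothetical helper).

    segments -- list of (chrom, start, end, cn), sorted by chrom and start
    Assumption: a breakpoint is a boundary between two touching segments on
    the same chromosome whose copy numbers differ by more than threshold.
    """
    breakpoints = []
    for prev, curr in zip(segments, segments[1:]):
        same_chrom = prev[0] == curr[0]
        touching = prev[2] == curr[1]
        change = curr[3] - prev[3]
        if same_chrom and touching and abs(change) > threshold:
            breakpoints.append((curr[0], curr[1], change))
    return breakpoints

segments = [
    ("chr1", 0, 100, 2.0),
    ("chr1", 100, 200, 2.25),  # change of 0.25, below the threshold
    ("chr1", 200, 300, 4.0),   # change of 1.75, reported
    ("chr2", 0, 50, 1.0),      # new chromosome, not a breakpoint
]
print(find_breakpoints(segments))
```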
- cns.pipelines.main_coverage(cns_df, samples_df=None, cn_columns=None, segs=None, assembly=<cns.utils.assemblies.Assembly object>, print_info=False)
Calculates coverage statistics for CNS data.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS data.
samples_df (pandas.DataFrame, optional) – DataFrame containing sample information. If None, samples are created from cns_df.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
segs (segments dictionary, optional) – Dictionary of segments used for selective masking. Default is None.
assembly (Assembly object, optional) – Genome assembly to use. Default is hg19.
print_info (bool, optional) – If True, prints informational messages during processing. Default is False.
- Returns:
DataFrame containing coverage statistics for each sample.
- Return type:
pandas.DataFrame
Notes
The function calculates coverage metrics such as the fraction of the genome covered by CNS data.
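One such metric, the covered fraction of the genome per sample, can be sketched in plain Python as follows. The helper coverage_by_sample and the tuple layout are hypothetical, and counting NaN-valued segments as uncovered is an assumption rather than documented behavior.

```python
import math

def coverage_by_sample(cns, genome_length):
    """Fraction of the genome with a copy number value (hypothetical helper).

    cns -- list of (sample, start, end, cn) tuples, non-overlapping per sample
    Assumption: segments whose cn is NaN count as uncovered.
    """
    covered = {}
    for sample, start, end, cn in cns:
        covered.setdefault(sample, 0)
        if not math.isnan(cn):
            covered[sample] += end - start
    return {sample: bases / genome_length for sample, bases in covered.items()}

cns = [
    ("s1", 0, 100, 2.0),
    ("s1", 100, 150, math.nan),
    ("s1", 150, 200, 3.0),
    ("s2", 0, 400, 2.0),
]
print(coverage_by_sample(cns, genome_length=400))
```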
- cns.pipelines.main_impute(cns_df, samples_df=None, cn_columns=None, segs=None, method='extend', assembly=<cns.utils.assemblies.Assembly object>, print_info=False)
Fills gaps in the CNS data, adds missing chromosomes, and infers missing values.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS (Copy Number Segment) data.
samples_df (pandas.DataFrame, optional) – DataFrame containing sample information. If None, samples are created from cns_df.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
assembly (Assembly object, optional) – Genome assembly to use. Default is hg19.
method (str, optional) – Inference method to use. Options are “extend”, “diploid”, or “zero”. Default is “extend”.
print_info (bool, optional) – If True, prints informational messages during processing. Default is False.
- Returns:
DataFrame with filled gaps, added missing chromosomes, and inferred values.
- Return type:
pandas.DataFrame
- cns.pipelines.main_infer(cns_df, samples_df=None, cn_columns=None, segs=None, method='extend', print_info=False)
Infers values to replace the NaNs in the CNS data.
NOTE: Only replaces NaNs! Usually requires main_align to be run first.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS data.
samples_df (pandas.DataFrame, optional) – DataFrame containing sample information. Required if method is “diploid”.
method (str, optional) – Inference method to use. Options are “extend”, “diploid”, or “zero”. Default is “extend”.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
print_info (bool, optional) – If True, prints informational messages during processing. Default is False.
- Returns:
DataFrame with imputed copy number values.
- Return type:
pandas.DataFrame
Notes
This function performs the following steps:
1. Infers missing values based on the specified method.
2. Fills any remaining NaNs with zeros.
3. Merges neighboring segments with the same copy number.
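The three steps can be sketched in plain Python for one chromosome. The helpers infer_missing and merge_equal_neighbors are hypothetical. Reading “extend” as copying the nearest known neighbor value is an assumption, as is using a constant 2.0 for “diploid”. The real function needs samples_df for “diploid”, which this sketch ignores.

```python
import math

def infer_missing(values, method="extend", diploid_value=2.0):
    """Replace NaNs in a list of per-segment copy numbers (hypothetical helper)."""
    if method == "zero":
        return [0.0 if math.isnan(v) else v for v in values]
    if method == "diploid":
        return [diploid_value if math.isnan(v) else v for v in values]
    if method != "extend":
        raise ValueError(f"unsupported method: {method!r}")
    out = list(values)
    last = math.nan
    for i, v in enumerate(out):  # carry the last known value to the right
        if math.isnan(v):
            out[i] = last
        else:
            last = v
    nxt = math.nan
    for i in range(len(out) - 1, -1, -1):  # fill leading NaNs from the right
        if math.isnan(out[i]):
            out[i] = nxt
        else:
            nxt = out[i]
    # Step 2: anything still missing (no known value at all) becomes zero.
    return [0.0 if math.isnan(v) else v for v in out]

def merge_equal_neighbors(segments):
    """Step 3: merge touching (start, end, cn) segments with equal cn."""
    merged = []
    for start, end, cn in segments:
        if merged and merged[-1][1] == start and merged[-1][2] == cn:
            merged[-1] = (merged[-1][0], end, cn)
        else:
            merged.append((start, end, cn))
    return merged

bounds = [(0, 10), (10, 20), (20, 30), (30, 40)]
values = infer_missing([math.nan, 2.0, math.nan, 3.0], method="extend")
segments = [(s, e, v) for (s, e), v in zip(bounds, values)]
print(merge_equal_neighbors(segments))
```

Because only NaNs are replaced, regions that are absent from the input altogether stay absent, which is why main_align is usually run first.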
- cns.pipelines.main_ploidy(cns_df, samples_df=None, cn_columns=None, segs=None, assembly=<cns.utils.assemblies.Assembly object>, print_info=False)
Calculates ploidy statistics for CNS data.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS data.
samples_df (pandas.DataFrame, optional) – DataFrame containing sample information. If None, samples are created from cns_df.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
segs (segments dictionary, optional) – Dictionary of segments used for selective masking. Default is None.
assembly (Assembly object, optional) – Genome assembly to use. Default is hg19.
print_info (bool, optional) – If True, prints informational messages during processing. Default is False.
- Returns:
DataFrame containing ploidy statistics for each sample.
- Return type:
pandas.DataFrame
Notes
This function calculates ploidy metrics such as the fraction of the genome that is aneuploid.
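A minimal plain-Python sketch of two such metrics for one sample follows. The helper ploidy_stats, its output keys, the length-weighted mean, and the definition of aneuploid as "copy number different from 2" are all assumptions. The library may define these differently, for example on sex chromosomes.

```python
def ploidy_stats(segments, normal_cn=2.0):
    """Length-weighted ploidy summary for one sample (hypothetical helper).

    segments -- list of (start, end, cn) with no NaN values
    Assumption: a segment is aneuploid when its cn differs from normal_cn.
    """
    total = sum(end - start for start, end, _ in segments)
    weighted = sum((end - start) * cn for start, end, cn in segments)
    aneuploid = sum(end - start for start, end, cn in segments if cn != normal_cn)
    return {
        "ploidy": weighted / total,
        "aneuploid_fraction": aneuploid / total,
    }

segments = [(0, 100, 2.0), (100, 200, 3.0), (200, 400, 2.0)]
print(ploidy_stats(segments))
```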
- cns.pipelines.main_seg_agg(cns_df, select_segs=None, remove_segs=None, how='mean', split_size=-1, merge_dist=-1, filter_size=-1, cn_columns=None, assembly=<cns.utils.assemblies.Assembly object>, print_info=False)
Segments CNS data and aggregates the results.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS (Copy Number Segment) data.
select_segs (segment dictionary, optional) – Segments to select for computation. By default covers the whole assembly.
remove_segs (segments dictionary, optional) – Segments to remove from the selection.
how (str, optional) – Aggregation method to use. Default is “mean”.
split_size (int, optional) – Size in base pairs to split segments. Default is -1 (no splitting).
merge_dist (int, optional) – Distance in base pairs to merge nearby segments. Default is -1 (no merging).
filter_size (int, optional) – Minimum size in base pairs to filter segments. Default is -1 (no filtering).
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
assembly (Assembly object, optional) – Genome assembly to use. Default is hg19.
print_info (bool, optional) – If True, prints informational messages during processing. Default is False.
- Returns:
DataFrame with aggregated CNS data.
- Return type:
pandas.DataFrame
- cns.pipelines.main_segment(select_segs=None, remove_segs=None, split_size=-1, merge_dist=-1, keep_ends=True, filter_size=-1, align_to_assembly=False, assembly=<cns.utils.assemblies.Assembly object>, print_info=False)
Creates a segmentation based on specific segments.
- Parameters:
select_segs (segment dictionary, optional) – Segments to select for computation. By default covers the whole assembly.
remove_segs (segment dictionary, optional) – Segments to remove from the selection. By default nothing is removed.
split_size (int, optional) – Size in base pairs to split segments. Default is -1 (no splitting).
merge_dist (int, optional) – Distance in base pairs to merge nearby segments. Default is -1 (no merging). If 0, merges all touching segments.
keep_ends (bool, optional) – If True, clustering (merge_dist > 0) will not cluster the start and end breakpoints of each chromosome.
filter_size (int, optional) – Minimum size in base pairs to filter segments. Default is -1 (no filtering).
align_to_assembly (bool, optional) – If True, aligns segments to the assembly. Default is False.
assembly (Assembly object, optional) – Genome assembly to use. Default is hg19.
print_info (bool, optional) – If True, prints informational messages during processing. Default is False.
- Returns:
Dictionary of segments after processing.
- Return type:
dictionary of segments
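To show how split_size, filter_size, and merging of touching segments (merge_dist of 0) might interact, here is a plain-Python sketch on one chromosome. All three helpers are hypothetical, the order in which they are applied is an assumption, and clustering with merge_dist > 0 is not covered because its exact rule is not described here.

```python
def merge_touching(segments):
    """Merge (start, end) segments that touch or overlap."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def split_segments(segments, split_size=-1):
    """Cut each segment into pieces of at most split_size bases."""
    if split_size <= 0:  # -1 means no splitting
        return list(segments)
    pieces = []
    for start, end in segments:
        pos = start
        while pos < end:
            pieces.append((pos, min(pos + split_size, end)))
            pos += split_size
    return pieces

def filter_segments(segments, filter_size=-1):
    """Drop segments shorter than filter_size bases."""
    if filter_size <= 0:  # -1 means no filtering
        return list(segments)
    return [(s, e) for s, e in segments if e - s >= filter_size]

selected = [(0, 250), (250, 300), (400, 430)]
merged = merge_touching(selected)
pieces = split_segments(merged, split_size=100)
print(filter_segments(pieces, filter_size=50))
```

With these inputs the two touching segments are merged into one, the result is cut into 100-base pieces, and the 30-base segment is dropped by the size filter.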