cns.process package

cns.process.aggregation module

cns.process.aggregation.add_cum_mid(cns_df, assembly=<cns.utils.assemblies.Assembly object>, inplace=True)

Adds a ‘cum_mid’ column to the CNS data representing the cumulative middle position across chromosomes based on the specified genome assembly.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing CNS data with ‘chrom’, ‘start’, and ‘end’ columns.
assembly (object, optional) – Genome assembly to use for chromosome lengths. Default is hg19.
inplace (bool, optional) – If True, modifies the original DataFrame. Default is True.

Returns:

DataFrame with the added ‘cum_mid’ column.

Return type:

pandas.DataFrame
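
The idea behind a cumulative middle position can be sketched in plain pandas. This is an illustration only, not the library's implementation: the chromosome lengths below are made-up values (a real assembly supplies its own), and the use of a floating-point midpoint is an assumption.

```python
import pandas as pd

# Hypothetical chromosome lengths, in genome order (not real hg19 values).
chr_lens = {"chr1": 1000, "chr2": 800, "chr3": 500}

# Offset of a chromosome = summed length of all chromosomes before it.
offsets = {}
running = 0
for chrom, length in chr_lens.items():
    offsets[chrom] = running
    running += length

cns_df = pd.DataFrame({
    "chrom": ["chr1", "chr2", "chr3"],
    "start": [0, 100, 200],
    "end": [1000, 300, 400],
})

# Midpoint within the chromosome, then shifted onto a genome-wide axis.
mid = (cns_df["start"] + cns_df["end"]) / 2
cns_df["cum_mid"] = cns_df["chrom"].map(offsets) + mid
```

A genome-wide coordinate like this is mainly useful for plotting all chromosomes on one x-axis.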

cns.process.aggregation.add_mid(cns_df, inplace=True)

Adds a ‘mid’ column to the CNS data representing the middle point of each segment.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing CNS data with ‘start’ and ‘end’ columns.
inplace (bool, optional) – If True, modifies the original DataFrame. Default is True.

Returns:

DataFrame with the added ‘mid’ column.

Return type:

pandas.DataFrame

cns.process.aggregation.add_total_cn(cns_df, cn_columns=None, remove_cn_columns=False, inplace=True)

Adds a total copy number (CN) column to the CNS data.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing CNS data.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
remove_cn_columns (bool, optional) – If True, removes the individual CN columns after adding the total CN column. Default is False.
inplace (bool, optional) – If True, modifies the original DataFrame. Default is True.

Returns:

DataFrame with the added total CN column.

Return type:

pandas.DataFrame
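
As a rough sketch of what a total CN column amounts to, the snippet below sums the per-allele columns row by row. The column names "major_cn", "minor_cn", and "total_cn" are assumptions chosen for the example, and the copy stands in for the inplace=False behaviour.

```python
import pandas as pd

cns_df = pd.DataFrame({
    "sample_id": ["s1", "s1"],
    "chrom": ["chr1", "chr1"],
    "start": [0, 100],
    "end": [100, 250],
    "major_cn": [2, 1],  # hypothetical column name
    "minor_cn": [1, 0],  # hypothetical column name
})
cn_columns = ["major_cn", "minor_cn"]

# Work on a copy so the input DataFrame is left untouched.
out = cns_df.copy()
out["total_cn"] = out[cn_columns].sum(axis=1)
```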

cns.process.aggregation.aggregate_by_break_type(cns_df, break_type, assembly=<cns.utils.assemblies.Assembly object>, how='mean', cn_columns=None, print_info=True)

Aggregates CNS data by break type.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing CNS data.
break_type (str) – Type of break to use for aggregation.
assembly (object, optional) – Genome assembly to use. Default is hg19.
how (str, optional) – Aggregation method to use. Default is “mean”.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with aggregated CNS data.

Return type:

pandas.DataFrame

cns.process.aggregation.aggregate_by_breaks(cns_df, breaks, how='mean', cn_columns=None, print_info=True)

Aggregates CNS data by specified breaks.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing CNS data.
breaks (list of tuples) – List of breakpoints to use for aggregation.
how (str, optional) – Aggregation method to use. Default is “mean”.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with aggregated CNS data.

Return type:

pandas.DataFrame

cns.process.aggregation.aggregate_by_segments(cns_df, segs, how='mean', cn_columns=None, print_info=True)

Aggregates CNS data by specified segments.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing CNS data.
segs (pandas.DataFrame) – DataFrame containing segments to use for aggregation.
how (str, optional) – Aggregation method to use. Default is “mean”.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with aggregated CNS data.

Return type:

pandas.DataFrame

cns.process.aggregation.group_samples(cns_df, cn_columns=None, how='mean', group_name='grouped')

Groups CNS data by samples and aggregates the results.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing CNS data.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
how (str, optional) – Aggregation method to use. Options are “mean”, “max”, “min”. Default is “mean”.
group_name (str, optional) – Name for the grouped samples. Default is “grouped”.

Returns:

DataFrame with grouped and aggregated CNS data.

Return type:

pandas.DataFrame

cns.process.aggregation.mean_value_per_seg(cns_df, segs, score_col)

Calculate weighted scores for segments based on overlap with copy number segments.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing copy number segments.
segs (dict) – Dictionary mapping chromosome names to lists of segments (start, end, name).
score_col (str) – Name of the column containing the scores to aggregate.

Returns:

DataFrame with scores for each segment.

Return type:

pandas.DataFrame
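
One common way to weight scores by overlap is a length-weighted mean, sketched below for a single target segment. The helper names and the half-open interval convention are assumptions made for this example; the library may weight or handle empty overlaps differently.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def weighted_mean_score(cn_rows, seg_start, seg_end):
    """Overlap-weighted mean score for one target segment.

    cn_rows is a list of (start, end, score) tuples on the same chromosome.
    Returns None when nothing overlaps the target segment.
    """
    total_len = 0
    total = 0.0
    for start, end, score in cn_rows:
        ov = overlap(start, end, seg_start, seg_end)
        total_len += ov
        total += ov * score
    return total / total_len if total_len else None


rows = [(0, 100, 2.0), (100, 300, 4.0)]
# Target segment 50-200 overlaps the first row by 50 bp and the second by 100 bp.
result = weighted_mean_score(rows, 50, 200)
```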

cns.process.aggregation.stack_groups(cns_dfs, labels=None)

Stacks multiple CNS DataFrames into a single DataFrame.

Parameters:

cns_dfs (list of pandas.DataFrame) – List of CNS DataFrames to stack.
labels (list of str, optional) – List of labels for the stacked DataFrames. If specified, the length of labels must be equal to the number of DataFrames.

Returns:

Stacked DataFrame.

Return type:

pandas.DataFrame
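
Stacking with labels can be approximated with pandas.concat, as in the sketch below. The helper name and the "group" label column are hypothetical; how the library actually stores the labels is not stated here.

```python
import pandas as pd


def stack_with_labels(dfs, labels=None, label_col="group"):
    """Concatenate DataFrames, optionally tagging each one with a label."""
    if labels is not None and len(labels) != len(dfs):
        raise ValueError("labels must have one entry per DataFrame")
    parts = []
    for i, df in enumerate(dfs):
        part = df.copy()
        if labels is not None:
            part[label_col] = labels[i]
        parts.append(part)
    return pd.concat(parts, ignore_index=True)


a = pd.DataFrame({"chrom": ["chr1"], "cn": [2]})
b = pd.DataFrame({"chrom": ["chr1", "chr2"], "cn": [3, 1]})
stacked = stack_with_labels([a, b], labels=["tumor", "normal"])
```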

cns.process.breakpoints module

cns.process.breakpoints.make_breaks(break_type, strategy='scale', assembly=<cns.utils.assemblies.Assembly object>)

Creates breakpoints based on the specified break type.

Parameters:

break_type (str) – Type of break to use for creating breakpoints. Options are “arms”, “bands”, “whole”, or a step size (e.g., “1MB”).
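strategy (str, optional) – Strategy used to split regions into bins when break_type is a step size (see split_into_bins). Default is “scale”.
assembly (object, optional) – Genome assembly to use. Default is hg19.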

Returns:

Dictionary with chromosome names as keys and list of breaks as values.

Return type:

dict

Raises:

ValueError – If the break type is not recognized.

cns.process.breakpoints.split_into_bins(reg_len, step_size, strategy='after')

Splits a region into bins of specified size using a given strategy.

Parameters:

reg_len (int) – Length of the region to split.
step_size (int) – Size of each bin.
strategy (str, optional) – Strategy to use for splitting. Options are “scale”, “pad”, “after”. Default is “after”.

Returns:

List of bin boundaries.

Return type:

list of int
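
The exact behaviour of the “scale”, “pad”, and “after” strategies is not described above. The sketch below only illustrates two generic ways of turning a region length into bin boundaries; the function names are hypothetical and neither is claimed to match a specific library strategy.

```python
def bins_fixed(reg_len, step_size):
    """Boundaries every step_size; the final bin may be shorter."""
    bounds = list(range(0, reg_len, step_size))
    bounds.append(reg_len)
    return bounds


def bins_scaled(reg_len, step_size):
    """Equal-width bins whose size is as close to step_size as possible."""
    n_bins = max(1, round(reg_len / step_size))
    return [round(i * reg_len / n_bins) for i in range(n_bins + 1)]


fixed = bins_fixed(250, 100)    # last bin covers only 50 positions
scaled = bins_scaled(260, 100)  # three bins of roughly 87 positions each
```

The trade-off is between bins of exactly the requested size with a leftover remainder, and bins of uniform size that only approximate the request.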

cns.process.clustering module

cns.process.clustering.cluster_segments(input_segs, clust_dist, keep_ends=True, print_info=False)

Clusters segments based on the specified clustering distance.

Parameters:

input_segs (dict) – Dictionary of input segments with chromosome names as keys and list of segments as values.
clust_dist (int) – Clustering distance to use for merging segments.
keep_ends (bool, optional) – If True, keeps the ends of the segments. Default is True.
print_info (bool, optional) – If True, prints informational messages during processing. Default is False.

Returns:

Dictionary of clustered segments with chromosome names as keys and list of segments as values.

Return type:

dict
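
Distance-based clustering of intervals can be pictured as a greedy merge over sorted segments, sketched below for one chromosome. This is an assumption about the general technique, not the library's algorithm, and it does not model the keep_ends option.

```python
def cluster_intervals(segs, clust_dist):
    """Merge (start, end) intervals whose gap is at most clust_dist."""
    clustered = []
    for start, end in sorted(segs):
        if clustered and start - clustered[-1][1] <= clust_dist:
            # Close enough to the previous cluster: extend it.
            prev_start, prev_end = clustered[-1]
            clustered[-1] = (prev_start, max(prev_end, end))
        else:
            clustered.append((start, end))
    return clustered


clusters = cluster_intervals([(0, 10), (100, 110), (15, 20)], clust_dist=5)
```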

cns.process.imputation module

cns.process.imputation.add_missing(cns_df, samples_df=None, assembly=<cns.utils.assemblies.Assembly object>, print_info=True)

Adds missing chromosomes to the CNS data.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing CNS data.
samples_df (pandas.DataFrame, optional) – DataFrame containing sample information. Default is None.
assembly (Assembly object, optional) – Assembly object containing chromosome length information. Default is hg19.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with missing chromosomes added.

Return type:

pandas.DataFrame

cns.process.imputation.add_tails(cns_df, assembly=<cns.utils.assemblies.Assembly object>, print_info=True)

Adds tails to the CNS data.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing CNS data.
assembly (Assembly object, optional) – Assembly object containing chromosome length information. Default is hg19.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with tails added.

Return type:

pandas.DataFrame

cns.process.imputation.cns_infer(cns_df, samples_df, method='extend', cn_columns=None, print_info=True)

Infers NaN values in the CNS data.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing CNS data.
samples_df (pandas.DataFrame) – DataFrame containing sample information.
method (str, optional) – Inference method to use. Options are “extend”, “diploid”, or “zero”. Default is “extend”. Note: if a “name” column is present in cns_df, it is removed when the “extend” method is used.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with imputed copy number values.

Return type:

pandas.DataFrame

cns.process.imputation.fill_gaps(cns_df, print_info=True)

Fills gaps in the CNS data.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing CNS data.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with gaps filled.

Return type:

pandas.DataFrame

cns.process.imputation.fill_nans_with_zeros(cns_df, cn_columns=None, print_info=True)

Fills NaN values in the CNS data with zeros.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing CNS data.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with NaN values filled with zeros.

Return type:

pandas.DataFrame

cns.process.imputation.merge_cns_df(cns_df, cn_columns=None, print_info=True)

Merges consecutive CNS segments with the same copy number values.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing CNS data.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with merged CNS segments.

Return type:

pandas.DataFrame
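
Merging consecutive segments with equal copy number can be sketched with a shift-and-cumsum run detector. The column names "sample_id" and "cn" are assumptions for the example, and the rule that segments must touch (end equals next start) to be merged is also an assumption.

```python
import pandas as pd

cns_df = pd.DataFrame({
    "sample_id": ["s1"] * 4,
    "chrom": ["chr1", "chr1", "chr1", "chr2"],
    "start": [0, 100, 200, 0],
    "end": [100, 200, 300, 50],
    "cn": [2, 2, 3, 3],
})

prev = cns_df.shift(1)
# A row starts a new run unless it continues the previous row exactly:
# same sample, same chromosome, touching coordinates, same copy number.
new_run = (
    (cns_df["sample_id"] != prev["sample_id"])
    | (cns_df["chrom"] != prev["chrom"])
    | (cns_df["start"] != prev["end"])
    | (cns_df["cn"] != prev["cn"])
)
run_id = new_run.cumsum()

merged = (
    cns_df.groupby(run_id)
    .agg(
        sample_id=("sample_id", "first"),
        chrom=("chrom", "first"),
        start=("start", "min"),
        end=("end", "max"),
        cn=("cn", "first"),
    )
    .reset_index(drop=True)
)
```

Note that NaN copy numbers never compare equal in this sketch, so rows with missing values are not merged with each other.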

cns.process.imputation.remove_outliers(cns_df, assembly=<cns.utils.assemblies.Assembly object>, print_info=True)

Removes outliers from the CNS data.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing CNS data.
assembly (Assembly object, optional) – Assembly object containing chromosome length information. Default is hg19.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with outliers removed.

Return type:

pandas.DataFrame

cns.process.normalize module

cns.process.normalize.get_norm_sizes(segs, assembly=<cns.utils.assemblies.Assembly object>)

Calculates normalization sizes for autosomes, sex chromosomes, and all chromosomes.

Parameters:

segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
assembly (object, optional) – Genome assembly to use. Default is hg19.

Returns:

Dictionary containing normalization sizes for autosomes, sex chromosomes, and all chromosomes.

Return type:

dict

cns.process.normalize.normalize_feature(samples, feature, norm_sizes)

Normalizes a feature in the samples DataFrame based on the provided normalization sizes.

Parameters:

samples (pandas.DataFrame) – DataFrame containing sample information.
feature (str) – Name of the feature to normalize.
norm_sizes (dict) – Dictionary containing normalization sizes for autosomes, sex chromosomes, and all chromosomes.

Returns:

DataFrame with the normalized feature.

Return type:

pandas.DataFrame

cns.process.segments module

cns.process.segments.align_segs_to_assembly(segs, sorted=False, assembly=<cns.utils.assemblies.Assembly object>)

Aligns segments to the reference assembly by adding missing segments and adjusting boundaries.

Parameters:

segs (dict) – A dictionary with chromosome names as keys and lists of segments as values.
sorted (bool, optional) – If True, sorts the segments for each chromosome. Default is False.
assembly (object, optional) – The genome assembly to use. Default is hg19.

Returns:

A dictionary with chromosome names as keys and lists of segments as values.

Return type:

dict

cns.process.segments.cns_df_to_segments(segs_df, process=None)

Converts a DataFrame of segments to a dictionary of segments by chromosome.

Parameters:

segs_df (pandas.DataFrame) – DataFrame containing segment information with columns “chrom”, “start”, “end”, and optionally “name”.
process (str, optional) – Processing step to apply to the segments. Options are “merge”, “unify”, or None. Default is None.

Returns:

A dictionary with chromosome names as keys and lists of segments as values.

Return type:

dict
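
The dictionary-of-segments layout used throughout this module can be illustrated by converting a small DataFrame by hand. The sketch below assumes each segment is a (start, end, name) tuple, as described for make_segments, and does not apply any “merge” or “unify” processing.

```python
import pandas as pd

segs_df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "start": [0, 100, 0],
    "end": [100, 200, 50],
    "name": ["a", "b", "c"],
})

# Group rows by chromosome, keeping one (start, end, name) tuple per row.
segs = {}
for row in segs_df.itertuples(index=False):
    segs.setdefault(row.chrom, []).append((row.start, row.end, row.name))
```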

cns.process.segments.count_segments(segs)

Counts the total number of segments in the input dictionary.

Parameters:

segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.

Returns:

Total number of segments.

Return type:

int

cns.process.segments.do_segments_overlap(segs, is_sorted=False)

Checks if any segments overlap in the input dictionary.

Parameters:

segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
is_sorted (bool, optional) – If True, assumes the segments are already sorted. Default is False.

Returns:

True if any segments overlap, False otherwise.

Return type:

bool
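
An overlap check of this kind usually sorts each chromosome's segments and compares neighbours, as in the sketch below. It assumes (start, end, name) tuples and half-open intervals, so segments that merely touch are not counted as overlapping; the library's convention may differ.

```python
def any_overlap(segs):
    """Return True if any two segments on the same chromosome overlap."""
    for chrom_segs in segs.values():
        ordered = sorted(chrom_segs, key=lambda s: (s[0], s[1]))
        # After sorting by start, checking adjacent pairs is sufficient.
        for prev, cur in zip(ordered, ordered[1:]):
            if cur[0] < prev[1]:
                return True
    return False


overlapping = {"chr1": [(0, 10, "a"), (5, 20, "b")]}
touching = {"chr1": [(10, 20, "b"), (0, 10, "a")], "chr2": [(0, 5, "c")]}
```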

cns.process.segments.filter_cons_size(chr_segs, min_size)

Filters segments based on a minimum size, keeping only consecutive segments.

Parameters:

chr_segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
min_size (int) – Minimum size of segments to keep.

Returns:

Dictionary of filtered segments.

Return type:

dict

cns.process.segments.filter_min_size(chr_segs, min_size, merge_first=False)

Filters segments based on a minimum size.

Parameters:

chr_segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
min_size (int) – Minimum size of segments to keep.
merge_first (bool, optional) – If True, merges segments before filtering. Default is False.

Returns:

Dictionary of filtered segments.

Return type:

dict

cns.process.segments.find_overlaps(segs, is_sorted=False)

Finds overlapping segments in the input dictionary.

Parameters:

segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
is_sorted (bool, optional) – If True, assumes the segments are already sorted. Default is False.

Returns:

Dictionary of overlapping segments with chromosome names as keys and list of overlapping segments as values.

Return type:

dict

cns.process.segments.get_consecutive_segs(segs)

Groups consecutive segments for each chromosome.

Parameters:

segs (list of tuples) – List of segments for a chromosome.

Returns:

List of lists of consecutive segments.

Return type:

list of lists
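
Grouping consecutive segments can be sketched as a single pass over a sorted list. The sketch assumes "consecutive" means that one segment ends exactly where the next begins, which is an interpretation rather than something stated above.

```python
def group_consecutive(segs):
    """Group (start, end) segments into runs where each end meets the next start."""
    groups = []
    for seg in segs:
        if groups and groups[-1][-1][1] == seg[0]:
            groups[-1].append(seg)
        else:
            groups.append([seg])
    return groups


groups = group_consecutive([(0, 10), (10, 20), (30, 40)])
```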

cns.process.segments.make_segments(segment_type, assembly=<cns.utils.assemblies.Assembly object>)

Selects and returns specific genomic regions of the selected type.

Parameters:

segment_type (str) – The selection criteria for the regions. Options include:
- “”: Returns an empty dictionary.
- “arms”: Returns chromosome arms.
- “bands”: Returns cytogenetic bands.
- “whole”: Returns whole genome segments.
- “gaps”: Returns gap regions.
- “centromeres”: Returns centromeric regions.
- “chrX”: Returns the whole chromosome X (replace X with the chromosome number or name).
assembly (object, optional) – The genome assembly to use. Default is hg19.

Returns:

A dictionary with chromosome names as keys and lists of tuples representing the regions as values. Each tuple contains (start, end, name).

Return type:

dict

Raises:

ValueError – If an invalid chromosome is specified in the segment_type parameter.

cns.process.segments.merge_segments(segs, is_sorted=False)

Merges overlapping segments in the input dictionary.

Parameters:

segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
is_sorted (bool, optional) – If True, assumes the segments are already sorted. Default is False.

Returns:

Dictionary of merged segments with chromosome names as keys and list of merged segments as values.

Return type:

dict

cns.process.segments.segment_difference(segs_a, segs_b, sorted=False)

Computes the difference between two sets of segments.

Parameters:

segs_a (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
segs_b (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.

Returns:

Dictionary of segments representing the difference between the input segments.

Return type:

dict
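
For one chromosome, an interval difference can be computed with a sweep over the sorted subtrahend, as sketched below. The direction (parts of the first list not covered by the second) and the half-open interval convention are assumptions; the helper name is hypothetical.

```python
def subtract_intervals(a, b):
    """Return the parts of intervals in a not covered by any interval in b.

    Both inputs are lists of (start, end) half-open intervals.
    """
    result = []
    b_sorted = sorted(b)
    for start, end in sorted(a):
        cursor = start  # left edge of the part of a not yet accounted for
        for b_start, b_end in b_sorted:
            if b_end <= cursor:
                continue  # entirely to the left of the remaining part
            if b_start >= end:
                break  # entirely to the right; later ones are too
            if b_start > cursor:
                result.append((cursor, b_start))
            cursor = max(cursor, b_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result


diff = subtract_intervals([(0, 100)], [(50, 60), (10, 20)])
```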

cns.process.segments.segment_union(segs_a, segs_b, merge=True)

Computes the union of two sets of segments.

Parameters:

segs_a (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
segs_b (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
merge (bool, optional) – If True, merges overlapping segments in the result. Default is True.

Returns:

Dictionary of segments representing the union of the input segments.

Return type:

dict

cns.process.segments.split_segment(seg_start, seg_end, seg_name, step_size, strategy='scale')

Splits a segment into smaller segments of specified size.

Parameters:

seg_start (int) – Start position of the segment.
seg_end (int) – End position of the segment.
seg_name (str) – Name of the segment.
step_size (int) – Size of each smaller segment.
strategy (str, optional) – Strategy to use for splitting. Options are “scale”, “pad”, “after”. Default is “scale”.

Returns:

List of smaller segments.

Return type:

list of tuples

cns.process.segments.split_segments(segments, step_size, strategy='scale')

Splits segments into smaller segments of specified size.

Parameters:

segments (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
step_size (int) – Size of each smaller segment.
strategy (str, optional) – Strategy to use for splitting. Options are “scale”, “pad”, “after”. Default is “scale”.

Returns:

Dictionary of smaller segments.

Return type:

dict