cns.process package

cns.process.aggregation module

cns.process.aggregation.add_cum_mid(cns_df, assembly=<cns.utils.assemblies.Assembly object>, inplace=True)

Adds a ‘cum_mid’ column to the CNS data representing the cumulative middle position across chromosomes based on the specified genome assembly.

Parameters:
  • cns_df (pandas.DataFrame) – DataFrame containing CNS data with ‘chrom’, ‘start’, and ‘end’ columns.

  • assembly (object, optional) – Genome assembly to use for chromosome lengths. Default is hg19.

  • inplace (bool, optional) – If True, modifies the original DataFrame. Default is True.

Returns:

DataFrame with the added ‘cum_mid’ column.

Return type:

pandas.DataFrame
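
Example:

A minimal sketch of the idea behind a cumulative midpoint, written in plain Python rather than through the library. The chromosome lengths are made-up placeholders (not hg19 values), the helper name cum_mid is hypothetical, and the midpoint is assumed to be the arithmetic mean of start and end.

```python
from itertools import accumulate

# Hypothetical chromosome lengths in genome order (NOT real hg19 values).
chrom_lens = {"chr1": 1000, "chr2": 800, "chr3": 500}

# Offset of a chromosome = total length of all chromosomes before it.
names = list(chrom_lens)
offsets = dict(zip(names, accumulate([0] + [chrom_lens[n] for n in names[:-1]])))


def cum_mid(chrom, start, end):
    """Midpoint of a segment on a genome-wide axis (assumed definition)."""
    return offsets[chrom] + (start + end) / 2


positions = [cum_mid("chr1", 0, 100), cum_mid("chr2", 100, 300)]
```

Placing every segment on one shared axis in this way is what allows segments from different chromosomes to be ordered or plotted together.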

cns.process.aggregation.add_mid(cns_df, inplace=True)

Adds a ‘mid’ column to the CNS data representing the middle point of each segment.

Parameters:
  • cns_df (pandas.DataFrame) – DataFrame containing CNS data with ‘start’ and ‘end’ columns.

  • inplace (bool, optional) – If True, modifies the original DataFrame. Default is True.

Returns:

DataFrame with the added ‘mid’ column.

Return type:

pandas.DataFrame

cns.process.aggregation.add_total_cn(cns_df, cn_columns=None, remove_cn_columns=False, inplace=True)

Adds a total copy number (CN) column to the CNS data.

Parameters:
  • cns_df (pandas.DataFrame) – DataFrame containing CNS data.

  • cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.

  • remove_cn_columns (bool, optional) – If True, removes the individual CN columns after adding the total CN column. Default is False.

  • inplace (bool, optional) – If True, modifies the original DataFrame. Default is True.

Returns:

DataFrame with the added total CN column.

Return type:

pandas.DataFrame

cns.process.aggregation.aggregate_by_break_type(cns_df, break_type, assembly=<cns.utils.assemblies.Assembly object>, how='mean', cn_columns=None, print_info=True)

Aggregates CNS data by break type.

Parameters:
  • cns_df (pandas.DataFrame) – DataFrame containing CNS data.

  • break_type (str) – Type of break to use for aggregation.

  • assembly (object, optional) – Genome assembly to use. Default is hg19.

  • how (str, optional) – Aggregation method to use. Default is “mean”.

  • cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.

  • print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with aggregated CNS data.

Return type:

pandas.DataFrame

cns.process.aggregation.aggregate_by_breaks(cns_df, breaks, how='mean', cn_columns=None, print_info=True)

Aggregates CNS data by specified breaks.

Parameters:
  • cns_df (pandas.DataFrame) – DataFrame containing CNS data.

  • breaks (list of tuples) – List of breakpoints to use for aggregation.

  • how (str, optional) – Aggregation method to use. Default is “mean”.

  • cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.

  • print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with aggregated CNS data.

Return type:

pandas.DataFrame

cns.process.aggregation.aggregate_by_segments(cns_df, segs, how='mean', cn_columns=None, print_info=True)

Aggregates CNS data by specified segments.

Parameters:
  • cns_df (pandas.DataFrame) – DataFrame containing CNS data.

  • segs (pandas.DataFrame) – DataFrame containing segments to use for aggregation.

  • how (str, optional) – Aggregation method to use. Default is “mean”.

  • cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.

  • print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with aggregated CNS data.

Return type:

pandas.DataFrame

cns.process.aggregation.group_samples(cns_df, cn_columns=None, how='mean', group_name='grouped')

Groups CNS data by samples and aggregates the results.

Parameters:
  • cns_df (pandas.DataFrame) – DataFrame containing CNS data.

  • cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.

  • how (str, optional) – Aggregation method to use. Options are “mean”, “max”, “min”. Default is “mean”.

  • group_name (str, optional) – Name for the grouped samples. Default is “grouped”.

Returns:

DataFrame with grouped and aggregated CNS data.

Return type:

pandas.DataFrame

cns.process.aggregation.mean_value_per_seg(cns_df, segs, score_col)

Calculate weighted scores for segments based on overlap with copy number segments.

Parameters:
  • cns_df (pandas.DataFrame) – DataFrame containing copy number segments.

  • segs (dict) – Dictionary mapping chromosome names to lists of segments (start, end, name).

  • score_col (str) – Name of the column containing the scores to aggregate.

Returns:

DataFrame with scores for each segment.

Return type:

pandas.DataFrame
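
Example:

A minimal sketch of an overlap-weighted mean for a single region, in plain Python. It is not the library's implementation: the helper name weighted_mean_score is hypothetical, intervals are assumed to be half-open, and the weights are assumed to be overlap lengths normalized by the total overlapped length.

```python
def weighted_mean_score(cn_segments, region_start, region_end):
    """Overlap-weighted mean of scores within [region_start, region_end).

    cn_segments: list of (start, end, score) tuples on one chromosome.
    Returns None when no segment overlaps the region.
    """
    total_weight = 0
    weighted_sum = 0.0
    for start, end, score in cn_segments:
        # Length of the intersection; non-positive means no overlap.
        overlap = min(end, region_end) - max(start, region_start)
        if overlap > 0:
            total_weight += overlap
            weighted_sum += overlap * score
    return weighted_sum / total_weight if total_weight else None


cn_segments = [(0, 100, 2.0), (100, 300, 4.0)]
score = weighted_mean_score(cn_segments, 50, 200)
```

Here the region overlaps the first segment by 50 positions and the second by 100, so the second segment's score counts twice as much.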

cns.process.aggregation.stack_groups(cns_dfs, labels=None)

Stacks multiple CNS DataFrames into a single DataFrame.

Parameters:
  • cns_dfs (list of pandas.DataFrame) – List of CNS DataFrames to stack.

  • labels (list of str, optional) – List of labels for the stacked DataFrames. If specified, the length of labels must be equal to the number of DataFrames.

Returns:

Stacked DataFrame.

Return type:

pandas.DataFrame

cns.process.breakpoints module

cns.process.breakpoints.make_breaks(break_type, strategy='scale', assembly=<cns.utils.assemblies.Assembly object>)

Creates breakpoints based on the specified break type.

Parameters:
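  • break_type (str) – Type of break to use for creating breakpoints. Options are “arms”, “bands”, “whole”, or a step size (e.g., “1MB”).

  • strategy (str, optional) – Binning strategy; see split_into_bins for the options. Default is “scale”.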

  • assembly (object, optional) – Genome assembly to use. Default is hg19.

Returns:

Dictionary with chromosome names as keys and list of breaks as values.

Return type:

dict

Raises:

ValueError – If the break type is not recognized.

cns.process.breakpoints.split_into_bins(reg_len, step_size, strategy='after')

Splits a region into bins of specified size using a given strategy.

Parameters:
  • reg_len (int) – Length of the region to split.

  • step_size (int) – Size of each bin.

  • strategy (str, optional) – Strategy to use for splitting. Options are “scale”, “pad”, “after”. Default is “after”.

Returns:

List of bin boundaries.

Return type:

list of int
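
Example:

A minimal sketch of fixed-size binning in plain Python. The meaning of the “scale”, “pad”, and “after” strategies is not defined in this reference, so the sketch does not model them; it makes the single assumption that any remainder goes into a final, shorter bin. The helper name bin_boundaries is hypothetical.

```python
def bin_boundaries(reg_len, step_size):
    """Boundaries 0, step, 2*step, ..., reg_len; the last bin may be shorter."""
    if step_size <= 0:
        raise ValueError("step_size must be positive")
    bounds = list(range(0, reg_len, step_size))
    bounds.append(reg_len)  # close the final (possibly shorter) bin
    return bounds


bounds = bin_boundaries(10, 4)
```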

cns.process.clustering module

cns.process.clustering.cluster_segments(input_segs, clust_dist, keep_ends=True, print_info=False)

Clusters segments based on the specified clustering distance.

Parameters:
  • input_segs (dict) – Dictionary of input segments with chromosome names as keys and list of segments as values.

  • clust_dist (int) – Clustering distance to use for merging segments.

  • keep_ends (bool, optional) – If True, keeps the ends of the segments. Default is True.

  • print_info (bool, optional) – If True, prints informational messages during processing. Default is False.

Returns:

Dictionary of clustered segments with chromosome names as keys and list of segments as values.

Return type:

dict
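
Example:

A minimal sketch of distance-based clustering for one chromosome, in plain Python. This is one plausible reading of the description, not the library's algorithm: segments are assumed to be (start, end) pairs, two segments are joined when the gap between them is at most clust_dist, and keep_ends is not modelled. The helper name cluster_by_distance is hypothetical.

```python
def cluster_by_distance(segs, clust_dist):
    """Merge (start, end) segments whose gap to the running cluster is <= clust_dist."""
    clusters = []
    for start, end in sorted(segs):
        if clusters and start - clusters[-1][1] <= clust_dist:
            # Close enough: extend the current cluster.
            clusters[-1] = (clusters[-1][0], max(clusters[-1][1], end))
        else:
            clusters.append((start, end))
    return clusters


clustered = cluster_by_distance([(0, 10), (12, 20), (50, 60)], clust_dist=5)
```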

cns.process.imputation module

cns.process.imputation.add_missing(cns_df, samples_df=None, assembly=<cns.utils.assemblies.Assembly object>, print_info=True)

Adds missing chromosomes to the CNS data.

Parameters:
  • cns_df (pandas.DataFrame) – DataFrame containing CNS data.

  • samples_df (pandas.DataFrame, optional) – DataFrame containing sample information. Default is None.

  • assembly (object, optional) – Genome assembly to use for chromosome lengths. Default is hg19.

  • print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with missing chromosomes added.

Return type:

pandas.DataFrame

cns.process.imputation.add_tails(cns_df, assembly=<cns.utils.assemblies.Assembly object>, print_info=True)

Adds tails to the CNS data.

Parameters:
  • cns_df (pandas.DataFrame) – DataFrame containing CNS data.

  • assembly (Assembly object, optional) – Assembly object containing chromosome length information. Default is hg19.

  • print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with tails added.

Return type:

pandas.DataFrame

cns.process.imputation.cns_infer(cns_df, samples_df, method='extend', cn_columns=None, print_info=True)

Infers NaN values in the CNS data.

Parameters:
  • cns_df (pandas.DataFrame) – DataFrame containing CNS data.

  • samples_df (pandas.DataFrame) – DataFrame containing sample information.

  • method (str, optional) – Inference method to use. Options are “extend”, “diploid”, or “zero”. Default is “extend”. Note: if a “name” column is present in cns_df, it is removed when the “extend” method is used.

  • cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.

  • print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with imputed copy number values.

Return type:

pandas.DataFrame

cns.process.imputation.fill_gaps(cns_df, print_info=True)

Fills gaps in the CNS data.

Parameters:
  • cns_df (pandas.DataFrame) – DataFrame containing CNS data.

  • print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with gaps filled.

Return type:

pandas.DataFrame

cns.process.imputation.fill_nans_with_zeros(cns_df, cn_columns=None, print_info=True)

Fills NaN values in the CNS data with zeros.

Parameters:
  • cns_df (pandas.DataFrame) – DataFrame containing CNS data.

  • cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.

  • print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with NaN values filled with zeros.

Return type:

pandas.DataFrame

cns.process.imputation.merge_cns_df(cns_df, cn_columns=None, print_info=True)

Merges consecutive CNS segments with the same copy number values.

Parameters:
  • cns_df (pandas.DataFrame) – DataFrame containing CNS data.

  • cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.

  • print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with merged CNS segments.

Return type:

pandas.DataFrame
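
Example:

A minimal sketch of merging consecutive segments with equal copy number, in plain Python. It works on (start, end, cn) tuples for a single chromosome of a single sample instead of a DataFrame, and it assumes that "consecutive" means one segment ends exactly where the next starts. The helper name merge_equal_neighbours is hypothetical.

```python
def merge_equal_neighbours(rows):
    """rows: sorted list of (start, end, cn) tuples on one chromosome."""
    merged = []
    for start, end, cn in rows:
        if merged and merged[-1][1] == start and merged[-1][2] == cn:
            # Same copy number and directly adjacent: extend the last segment.
            merged[-1] = (merged[-1][0], end, cn)
        else:
            merged.append((start, end, cn))
    return merged


merged = merge_equal_neighbours(
    [(0, 100, 2), (100, 200, 2), (200, 300, 3), (300, 400, 2)]
)
```

Note that the last segment is not merged with the first two even though the copy number matches, because a segment with a different value lies between them.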

cns.process.imputation.remove_outliers(cns_df, assembly=<cns.utils.assemblies.Assembly object>, print_info=True)

Removes outliers from the CNS data.

Parameters:
  • cns_df (pandas.DataFrame) – DataFrame containing CNS data.

  • assembly (Assembly object, optional) – Assembly object containing chromosome length information. Default is hg19.

  • print_info (bool, optional) – If True, prints informational messages during processing. Default is True.

Returns:

DataFrame with outliers removed.

Return type:

pandas.DataFrame

cns.process.normalize module

cns.process.normalize.get_norm_sizes(segs, assembly=<cns.utils.assemblies.Assembly object>)

Calculates normalization sizes for autosomes, sex chromosomes, and all chromosomes.

Parameters:
  • segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.

  • assembly (object, optional) – Genome assembly to use. Default is hg19.

Returns:

Dictionary containing normalization sizes for autosomes, sex chromosomes, and all chromosomes.

Return type:

dict

cns.process.normalize.normalize_feature(samples, feature, norm_sizes)

Normalizes a feature in the samples DataFrame based on the provided normalization sizes.

Parameters:
  • samples (pandas.DataFrame) – DataFrame containing sample information.

  • feature (str) – Name of the feature to normalize.

  • norm_sizes (dict) – Dictionary containing normalization sizes for autosomes, sex chromosomes, and all chromosomes.

Returns:

DataFrame with the normalized feature.

Return type:

pandas.DataFrame

cns.process.segments module

cns.process.segments.align_segs_to_assembly(segs, sorted=False, assembly=<cns.utils.assemblies.Assembly object>)

Aligns segments to the reference assembly by adding missing segments and adjusting boundaries.

Parameters:
  • segs (dict) – A dictionary with chromosome names as keys and lists of segments as values.

  • sorted (bool, optional) – If True, sorts the segments for each chromosome. Default is False.

  • assembly (object, optional) – The genome assembly to use. Default is hg19.

Returns:

A dictionary with chromosome names as keys and lists of segments as values.

Return type:

dict

cns.process.segments.cns_df_to_segments(segs_df, process=None)

Converts a DataFrame of segments to a dictionary of segments by chromosome.

Parameters:
  • segs_df (pandas.DataFrame) – A DataFrame containing segment information with columns “chrom”, “start”, “end”, and optionally “name”.

  • process (str, optional) – The processing step to apply to the segments. Options are “merge”, “unify”, or None. Default is None.

Returns:

A dictionary with chromosome names as keys and lists of segments as values.

Return type:

dict

cns.process.segments.count_segments(segs)

Counts the total number of segments in the input dictionary.

Parameters:

segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.

Returns:

Total number of segments.

Return type:

int
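
Example:

A small illustration of the segment dictionary used throughout this module, with the count computed by hand in plain Python. The (start, end, name) tuple layout follows the description given for make_segments; the coordinates and names are invented.

```python
# Chromosome name -> list of (start, end, name) tuples (invented values).
segs = {
    "chr1": [(0, 100, "a"), (100, 250, "b")],
    "chr2": [(0, 500, "c")],
}

# Total number of segments across all chromosomes.
total = sum(len(chrom_segs) for chrom_segs in segs.values())
```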

cns.process.segments.do_segments_overlap(segs, is_sorted=False)

Checks if any segments overlap in the input dictionary.

Parameters:
  • segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.

  • is_sorted (bool, optional) – If True, assumes the segments are already sorted. Default is False.

Returns:

True if any segments overlap, False otherwise.

Return type:

bool
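
Example:

A minimal sketch of an overlap check in plain Python, not the library's code. It assumes half-open intervals, so segments that only touch (one ends where the next starts) do not count as overlapping. The helper name any_overlap is hypothetical.

```python
def any_overlap(segs, is_sorted=False):
    """True if two segments on the same chromosome share at least one position."""
    for chrom_segs in segs.values():
        ordered = chrom_segs if is_sorted else sorted(chrom_segs)
        # Once sorted by start, any overlap shows up between neighbours.
        for prev, nxt in zip(ordered, ordered[1:]):
            if nxt[0] < prev[1]:
                return True
    return False


overlapping = any_overlap({"chr1": [(50, 150, "b"), (0, 100, "a")]})
touching = any_overlap({"chr1": [(0, 100, "a"), (100, 200, "b")]})
```

Sorting first keeps the check to a single pass per chromosome, which is why a flag such as is_sorted is useful when the caller already knows the order.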

cns.process.segments.filter_cons_size(chr_segs, min_size)

Filters segments based on a minimum size, keeping only consecutive segments.

Parameters:
  • chr_segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.

  • min_size (int) – Minimum size of segments to keep.

Returns:

Dictionary of filtered segments.

Return type:

dict

cns.process.segments.filter_min_size(chr_segs, min_size, merge_first=False)

Filters segments based on a minimum size.

Parameters:
  • chr_segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.

  • min_size (int) – Minimum size of segments to keep.

  • merge_first (bool, optional) – Whether to merge segments before filtering. Default is False.

Returns:

Dictionary of filtered segments.

Return type:

dict
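
Example:

A minimal sketch of size filtering in plain Python. It assumes that segment size is end minus start and that segments of exactly min_size are kept; merge_first is not modelled. The helper name keep_min_size is hypothetical.

```python
def keep_min_size(chr_segs, min_size):
    """Keep only segments whose length (end - start) is at least min_size."""
    return {
        chrom: [seg for seg in segs if seg[1] - seg[0] >= min_size]
        for chrom, segs in chr_segs.items()
    }


filtered = keep_min_size(
    {"chr1": [(0, 5, "a"), (5, 100, "b")], "chr2": [(0, 3, "c")]},
    min_size=10,
)
```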

cns.process.segments.find_overlaps(segs, is_sorted=False)

Finds overlapping segments in the input dictionary.

Parameters:
  • segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.

  • is_sorted (bool, optional) – If True, assumes the segments are already sorted. Default is False.

Returns:

Dictionary of overlapping segments with chromosome names as keys and list of overlapping segments as values.

Return type:

dict

cns.process.segments.get_consecutive_segs(segs)

Groups consecutive segments for each chromosome.

Parameters:

segs (list of tuples) – List of segments for a chromosome.

Returns:

List of lists of consecutive segments.

Return type:

list of lists
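
Example:

A minimal sketch of grouping consecutive segments in plain Python. It assumes the input is sorted and that two segments are consecutive when the first ends exactly where the second starts. The helper name group_consecutive is hypothetical.

```python
def group_consecutive(segs):
    """Split a sorted list of (start, end, name) tuples into runs without gaps."""
    groups = []
    for seg in segs:
        if groups and groups[-1][-1][1] == seg[0]:
            groups[-1].append(seg)  # continues the current run
        else:
            groups.append([seg])  # a gap (or the first segment) starts a new run
    return groups


groups = group_consecutive([(0, 10, "a"), (10, 20, "b"), (30, 40, "c")])
```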

cns.process.segments.make_segments(segment_type, assembly=<cns.utils.assemblies.Assembly object>)

Selects and returns specific genomic regions of the selected type.

Parameters:
  • segment_type (str) – The type of regions to return. Options include:

      - “”: Returns an empty dictionary.
      - “arms”: Returns chromosome arms.
      - “bands”: Returns cytogenetic bands.
      - “whole”: Returns whole genome segments.
      - “gaps”: Returns gap regions.
      - “centromeres”: Returns centromeric regions.
      - “chrX”: Returns the whole chromosome X (replace X with the chromosome number or name).

  • assembly (object, optional) – The genome assembly to use. Default is hg19.

Returns:

A dictionary with chromosome names as keys and lists of tuples representing the regions as values. Each tuple contains (start, end, name).

Return type:

dict

Raises:

ValueError – If an invalid chromosome is specified in the segment_type parameter.

cns.process.segments.merge_segments(segs, is_sorted=False)

Merges overlapping segments in the input dictionary.

Parameters:
  • segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.

  • is_sorted (bool, optional) – If True, assumes the segments are already sorted. Default is False.

Returns:

Dictionary of merged segments with chromosome names as keys and list of merged segments as values.

Return type:

dict
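
Example:

A minimal sketch of interval merging in plain Python, not the library's implementation. For brevity it uses (start, end) pairs without names, and it assumes that segments which merely touch are merged as well. The helper name merge_overlapping is hypothetical.

```python
def merge_overlapping(segs, is_sorted=False):
    """Merge overlapping or touching (start, end) segments per chromosome."""
    merged = {}
    for chrom, chrom_segs in segs.items():
        ordered = chrom_segs if is_sorted else sorted(chrom_segs)
        out = []
        for start, end in ordered:
            if out and start <= out[-1][1]:
                # Overlaps (or touches) the previous segment: extend it.
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


result = merge_overlapping({"chr1": [(5, 15), (0, 10), (20, 30)]})
```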

cns.process.segments.segment_difference(segs_a, segs_b, sorted=False)

Computes the difference between two sets of segments.

Parameters:
  • segs_a (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.

  • segs_b (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.

  • sorted (bool, optional) – Whether the input segments are already sorted. Default is False.

Returns:

Dictionary of segments representing the difference between the input segments.

Return type:

dict
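
Example:

A minimal sketch of an interval difference in plain Python. It assumes the result is the part of the first set that is not covered by the second set, uses half-open (start, end) pairs without names, and is not the library's implementation. The helper names subtract and difference are hypothetical.

```python
def subtract(a_segs, b_segs):
    """Parts of the (start, end) intervals in a_segs not covered by sorted b_segs."""
    result = []
    for a_start, a_end in a_segs:
        cursor = a_start  # start of the part of this interval not yet handled
        for b_start, b_end in b_segs:
            if b_end <= cursor or b_start >= a_end:
                continue  # does not touch the remaining part
            if b_start > cursor:
                result.append((cursor, b_start))
            cursor = max(cursor, b_end)
        if cursor < a_end:
            result.append((cursor, a_end))
    return result


def difference(segs_a, segs_b):
    """Apply subtract chromosome by chromosome."""
    return {
        chrom: subtract(sorted(a), sorted(segs_b.get(chrom, [])))
        for chrom, a in segs_a.items()
    }


diff = difference(
    {"chr1": [(0, 100)], "chr2": [(0, 50)]},
    {"chr1": [(20, 30), (50, 120)]},
)
```

Chromosomes that appear only in the first set pass through unchanged, as chr2 does here.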

cns.process.segments.segment_union(segs_a, segs_b, merge=True)

Computes the union of two sets of segments.

Parameters:
  • segs_a (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.

  • segs_b (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.

  • merge (bool, optional) – Whether to merge overlapping segments in the result. Default is True.

Returns:

Dictionary of segments representing the union of the input segments.

Return type:

dict

cns.process.segments.split_segment(seg_start, seg_end, seg_name, step_size, strategy='scale')

Splits a segment into smaller segments of specified size.

Parameters:
  • seg_start (int) – Start position of the segment.

  • seg_end (int) – End position of the segment.

  • seg_name (str) – Name of the segment.

  • step_size (int) – Size of each smaller segment.

  • strategy (str, optional) – Strategy to use for splitting. Options are “scale”, “pad”, “after”. Default is “scale”.

Returns:

List of smaller segments.

Return type:

list of tuples

cns.process.segments.split_segments(segments, step_size, strategy='scale')

Splits segments into smaller segments of specified size.

Parameters:
  • segments (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.

  • step_size (int) – Size of each smaller segment.

  • strategy (str, optional) – Strategy to use for splitting. Options are “scale”, “pad”, “after”. Default is “scale”.

Returns:

Dictionary of smaller segments.

Return type:

dict