cns.process package
cns.process.aggregation module
- cns.process.aggregation.add_cum_mid(cns_df, assembly=<cns.utils.assemblies.Assembly object>, inplace=True)
Adds a ‘cum_mid’ column to the CNS data representing the cumulative middle position across chromosomes based on the specified genome assembly.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS data with ‘chrom’, ‘start’, and ‘end’ columns.
assembly (object, optional) – Genome assembly to use for chromosome lengths. Default is hg19.
inplace (bool, optional) – If True, modifies the original DataFrame. Default is True.
- Returns:
DataFrame with the added ‘cum_mid’ column.
- Return type:
pandas.DataFrame
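The idea of a cumulative middle position can be illustrated with a small pandas sketch. It does not call the library: the chromosome lengths below are made-up placeholders (not hg19 values), and the assumption that the midpoint is the plain average of ‘start’ and ‘end’ shifted by the summed lengths of all preceding chromosomes is an interpretation of the description above.

```python
import pandas as pd

# Hypothetical chromosome lengths in genome order (NOT real hg19 values).
chr_lens = {"chr1": 1000, "chr2": 800, "chr3": 500}

# Offset of a chromosome = total length of all chromosomes before it.
offsets, running = {}, 0
for chrom, length in chr_lens.items():
    offsets[chrom] = running
    running += length

cns_df = pd.DataFrame({
    "chrom": ["chr1", "chr2", "chr3"],
    "start": [0, 100, 200],
    "end": [400, 300, 500],
})

# Midpoint within the chromosome, then shifted onto a genome-wide axis.
mid = (cns_df["start"] + cns_df["end"]) / 2
cns_df["cum_mid"] = mid + cns_df["chrom"].map(offsets)
```

A genome-wide coordinate like this is typically what a plot needs on its x-axis when all chromosomes are drawn side by side.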
- cns.process.aggregation.add_mid(cns_df, inplace=True)
Adds a ‘mid’ column to the CNS data representing the middle point of each segment.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS data with ‘start’ and ‘end’ columns.
inplace (bool, optional) – If True, modifies the original DataFrame. Default is True.
- Returns:
DataFrame with the added ‘mid’ column.
- Return type:
pandas.DataFrame
- cns.process.aggregation.add_total_cn(cns_df, cn_columns=None, remove_cn_columns=False, inplace=True)
Adds a total copy number (CN) column to the CNS data.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS data.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
remove_cn_columns (bool, optional) – If True, removes the individual CN columns after adding the total CN column. Default is False.
inplace (bool, optional) – If True, modifies the original DataFrame. Default is True.
- Returns:
DataFrame with the added total CN column.
- Return type:
pandas.DataFrame
- cns.process.aggregation.aggregate_by_break_type(cns_df, break_type, assembly=<cns.utils.assemblies.Assembly object>, how='mean', cn_columns=None, print_info=True)
Aggregates CNS data by break type.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS data.
break_type (str) – Type of break to use for aggregation.
assembly (object, optional) – Genome assembly to use. Default is hg19.
how (str, optional) – Aggregation method to use. Default is “mean”.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.
- Returns:
DataFrame with aggregated CNS data.
- Return type:
pandas.DataFrame
- cns.process.aggregation.aggregate_by_breaks(cns_df, breaks, how='mean', cn_columns=None, print_info=True)
Aggregates CNS data by specified breaks.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS data.
breaks (list of tuples) – List of breakpoints to use for aggregation.
how (str, optional) – Aggregation method to use. Default is “mean”.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.
- Returns:
DataFrame with aggregated CNS data.
- Return type:
pandas.DataFrame
- cns.process.aggregation.aggregate_by_segments(cns_df, segs, how='mean', cn_columns=None, print_info=True)
Aggregates CNS data by specified segments.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS data.
segs (pandas.DataFrame) – DataFrame containing segments to use for aggregation.
how (str, optional) – Aggregation method to use. Default is “mean”.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.
- Returns:
DataFrame with aggregated CNS data.
- Return type:
pandas.DataFrame
- cns.process.aggregation.group_samples(cns_df, cn_columns=None, how='mean', group_name='grouped')
Groups CNS data by samples and aggregates the results.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS data.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
how (str, optional) – Aggregation method to use. Options are “mean”, “max”, “min”. Default is “mean”.
group_name (str, optional) – Name for the grouped samples. Default is “grouped”.
- Returns:
DataFrame with grouped and aggregated CNS data.
- Return type:
pandas.DataFrame
- cns.process.aggregation.mean_value_per_seg(cns_df, segs, score_col)
Calculate weighted scores for segments based on overlap with copy number segments.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing copy number segments.
segs (dict) – Dictionary mapping chromosome names to lists of segments (start, end, name).
score_col (str) – Name of the column containing the scores to aggregate.
- Returns:
DataFrame with scores for each segment.
- Return type:
pandas.DataFrame
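The overlap weighting described for mean_value_per_seg can be sketched in plain Python. This is an illustration only: the function name weighted_mean_score is hypothetical, and both the use of overlap length as the weight and the half-open coordinate convention are assumptions, not confirmed library behavior.

```python
def weighted_mean_score(cn_segs, region):
    """Length-weighted mean of scores over one region.

    cn_segs: list of (start, end, score) tuples on one chromosome.
    region: (start, end, name) tuple, shaped like the values in `segs`.
    Coordinates are treated as half-open intervals (an assumption).
    """
    reg_start, reg_end, _name = region
    total_weight = 0
    weighted_sum = 0.0
    for start, end, score in cn_segs:
        # Length of the overlap between the CN segment and the region.
        overlap = min(end, reg_end) - max(start, reg_start)
        if overlap > 0:
            total_weight += overlap
            weighted_sum += overlap * score
    # With no overlap at all there is nothing to average.
    return weighted_sum / total_weight if total_weight else float("nan")


cn_segs = [(0, 100, 2.0), (100, 300, 4.0)]
score = weighted_mean_score(cn_segs, (50, 200, "bin_1"))
```

Here the region overlaps the first segment by 50 bases and the second by 100, so the second score counts twice as much.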
- cns.process.aggregation.stack_groups(cns_dfs, labels=None)
Stacks multiple CNS DataFrames into a single DataFrame.
- Parameters:
cns_dfs (list of pandas.DataFrame) – List of CNS DataFrames to stack.
labels (list of str, optional) – List of labels for the stacked DataFrames. If specified, the length of labels must be equal to the number of DataFrames.
- Returns:
Stacked DataFrame.
- Return type:
pandas.DataFrame
cns.process.breakpoints module
- cns.process.breakpoints.make_breaks(break_type, strategy='scale', assembly=<cns.utils.assemblies.Assembly object>)
Creates breakpoints based on the specified break type.
- Parameters:
break_type (str) – Type of break to use for creating breakpoints. Options are “arms”, “bands”, “whole”, or a step size (e.g., “1MB”).
strategy (str, optional) – Strategy to use for splitting into bins. Default is “scale”.
assembly (object, optional) – Genome assembly to use. Default is hg19.
- Returns:
Dictionary with chromosome names as keys and list of breaks as values.
- Return type:
dict
- Raises:
ValueError – If the break type is not recognized.
- cns.process.breakpoints.split_into_bins(reg_len, step_size, strategy='after')
Splits a region into bins of specified size using a given strategy.
- Parameters:
reg_len (int) – Length of the region to split.
step_size (int) – Size of each bin.
strategy (str, optional) – Strategy to use for splitting. Options are “scale”, “pad”, “after”. Default is “after”.
- Returns:
List of bin boundaries.
- Return type:
list of int
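As a rough illustration of splitting a region into bins, the sketch below cuts a region into fixed-size bins and lets the last bin be shorter when the length is not a multiple of the step. The function name fixed_bins is hypothetical, and the source does not say whether this behavior corresponds to “scale”, “pad”, or “after”, so it should not be read as a definition of any of them.

```python
def fixed_bins(reg_len, step_size):
    """Return bin boundaries 0, step_size, 2 * step_size, ..., reg_len."""
    if step_size <= 0:
        raise ValueError("step_size must be positive")
    bounds = list(range(0, reg_len, step_size))
    bounds.append(reg_len)  # close the last (possibly shorter) bin
    return bounds


bounds = fixed_bins(10, 4)
```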
cns.process.clustering module
- cns.process.clustering.cluster_segments(input_segs, clust_dist, keep_ends=True, print_info=False)
Clusters segments based on the specified clustering distance.
- Parameters:
input_segs (dict) – Dictionary of input segments with chromosome names as keys and list of segments as values.
clust_dist (int) – Clustering distance to use for merging segments.
keep_ends (bool, optional) – If True, keeps the ends of the segments. Default is True.
print_info (bool, optional) – If True, prints informational messages during processing. Default is False.
- Returns:
Dictionary of clustered segments with chromosome names as keys and list of segments as values.
- Return type:
dict
cns.process.imputation module
- cns.process.imputation.add_missing(cns_df, samples_df=None, assembly=<cns.utils.assemblies.Assembly object>, print_info=True)
Adds missing chromosomes to the CNS data.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS data.
samples_df (pandas.DataFrame, optional) – DataFrame containing sample information. Default is None.
assembly (object, optional) – Genome assembly to use for chromosome lengths. Default is hg19.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.
- Returns:
DataFrame with missing chromosomes added.
- Return type:
pandas.DataFrame
- cns.process.imputation.add_tails(cns_df, assembly=<cns.utils.assemblies.Assembly object>, print_info=True)
Adds tails to the CNS data.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS data.
assembly (Assembly object, optional) – Assembly object containing chromosome length information. Default is hg19.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.
- Returns:
DataFrame with tails added.
- Return type:
pandas.DataFrame
- cns.process.imputation.cns_infer(cns_df, samples_df, method='extend', cn_columns=None, print_info=True)
Infers NaN values in the CNS data.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS data.
samples_df (pandas.DataFrame) – DataFrame containing sample information.
method (str, optional) – Inference method to use. Options are “extend”, “diploid”, or “zero”. Default is “extend”. Note: if a “name” column is present in cns_df, it is removed when the “extend” method is used.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.
- Returns:
DataFrame with imputed copy number values.
- Return type:
pandas.DataFrame
- cns.process.imputation.fill_gaps(cns_df, print_info=True)
Fills gaps in the CNS data.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS data.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.
- Returns:
DataFrame with gaps filled.
- Return type:
pandas.DataFrame
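What gap filling can look like is sketched below in plain Python. The function name fill_gaps_sketch is hypothetical, and filling the uncovered stretches with NaN copy numbers is an assumption; the source does not state which value the library assigns to the inserted rows.

```python
import math


def fill_gaps_sketch(rows):
    """Insert placeholder rows for uncovered stretches.

    rows: (start, end, cn) tuples for one sample and chromosome,
    sorted by start.
    """
    filled = []
    prev_end = None
    for start, end, cn in rows:
        if prev_end is not None and start > prev_end:
            # Uncovered stretch between two segments: add a NaN row.
            filled.append((prev_end, start, math.nan))
        filled.append((start, end, cn))
        prev_end = end
    return filled


rows = [(0, 100, 2), (150, 300, 3), (300, 400, 1)]
filled = fill_gaps_sketch(rows)
```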
- cns.process.imputation.fill_nans_with_zeros(cns_df, cn_columns=None, print_info=True)
Fills NaN values in the CNS data with zeros.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS data.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.
- Returns:
DataFrame with NaN values filled with zeros.
- Return type:
pandas.DataFrame
- cns.process.imputation.merge_cns_df(cns_df, cn_columns=None, print_info=True)
Merges consecutive CNS segments with the same copy number values.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS data.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from cns_df.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.
- Returns:
DataFrame with merged CNS segments.
- Return type:
pandas.DataFrame
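Merging consecutive segments with equal copy number can be sketched as a single pass over sorted rows. The function name merge_equal_cn and the row layout (including a sample identifier) are hypothetical; the sketch also assumes that two segments are only merged when the first ends exactly where the second starts.

```python
def merge_equal_cn(rows):
    """rows: (sample, chrom, start, end, cn) tuples, sorted by position."""
    merged = []
    for sample, chrom, start, end, cn in rows:
        if merged:
            p_sample, p_chrom, p_start, p_end, p_cn = merged[-1]
            # Same sample and chromosome, and the segments touch.
            adjacent = (p_sample, p_chrom, p_end) == (sample, chrom, start)
            if adjacent and p_cn == cn:
                # Extend the previous segment instead of adding a new one.
                merged[-1] = (sample, chrom, p_start, end, cn)
                continue
        merged.append((sample, chrom, start, end, cn))
    return merged


rows = [
    ("s1", "chr1", 0, 100, 2),
    ("s1", "chr1", 100, 250, 2),
    ("s1", "chr1", 250, 400, 3),
    ("s1", "chr2", 0, 50, 3),
]
merged = merge_equal_cn(rows)
```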
- cns.process.imputation.remove_outliers(cns_df, assembly=<cns.utils.assemblies.Assembly object>, print_info=True)
Removes outliers from the CNS data.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing CNS data.
assembly (Assembly object, optional) – Assembly object containing chromosome length information. Default is hg19.
print_info (bool, optional) – If True, prints informational messages during processing. Default is True.
- Returns:
DataFrame with outliers removed.
- Return type:
pandas.DataFrame
cns.process.normalize module
- cns.process.normalize.get_norm_sizes(segs, assembly=<cns.utils.assemblies.Assembly object>)
Calculates normalization sizes for autosomes, sex chromosomes, and all chromosomes.
- Parameters:
segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
assembly (object, optional) – Genome assembly to use. Default is hg19.
- Returns:
Dictionary containing normalization sizes for autosomes, sex chromosomes, and all chromosomes.
- Return type:
dict
- cns.process.normalize.normalize_feature(samples, feature, norm_sizes)
Normalizes a feature in the samples DataFrame based on the provided normalization sizes.
- Parameters:
samples (pandas.DataFrame) – DataFrame containing sample information.
feature (str) – Name of the feature to normalize.
norm_sizes (dict) – Dictionary containing normalization sizes for autosomes, sex chromosomes, and all chromosomes.
- Returns:
DataFrame with the normalized feature.
- Return type:
pandas.DataFrame
cns.process.segments module
- cns.process.segments.align_segs_to_assembly(segs, sorted=False, assembly=<cns.utils.assemblies.Assembly object>)
Aligns segments to the reference assembly by adding missing segments and adjusting boundaries.
- Parameters:
segs (dict) – A dictionary with chromosome names as keys and lists of segments as values.
sorted (bool, optional) – If True, sorts the segments for each chromosome. Default is False.
assembly (object, optional) – The genome assembly to use. Default is hg19.
- Returns:
A dictionary with chromosome names as keys and lists of segments as values.
- Return type:
dict
- cns.process.segments.cns_df_to_segments(segs_df, process=None)
Converts a DataFrame of segments to a dictionary of segments by chromosome.
- Parameters:
segs_df (pandas.DataFrame) – A DataFrame containing segment information with columns “chrom”, “start”, “end”, and optionally “name”.
process (str, optional) – The processing step to apply to the segments. Options are “merge”, “unify”, or None. Default is None.
- Returns:
A dictionary with chromosome names as keys and lists of segments as values.
- Return type:
dict
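The segment dictionary used throughout this module maps chromosome names to lists of (start, end, name) tuples. The sketch below builds that shape from row dictionaries without using the library; the function name rows_to_segments is hypothetical, and using None for a missing “name” is an assumption.

```python
from collections import defaultdict


def rows_to_segments(rows):
    """Group rows by chromosome into lists of (start, end, name) tuples."""
    segs = defaultdict(list)
    for row in rows:
        # A missing name becomes None (an assumption of this sketch).
        segs[row["chrom"]].append((row["start"], row["end"], row.get("name")))
    return dict(segs)


rows = [
    {"chrom": "chr1", "start": 0, "end": 100, "name": "a"},
    {"chrom": "chr1", "start": 100, "end": 200, "name": "b"},
    {"chrom": "chr2", "start": 50, "end": 80},
]
segs = rows_to_segments(rows)
```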
- cns.process.segments.count_segments(segs)
Counts the total number of segments in the input dictionary.
- Parameters:
segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
- Returns:
Total number of segments.
- Return type:
int
- cns.process.segments.do_segments_overlap(segs, is_sorted=False)
Checks if any segments overlap in the input dictionary.
- Parameters:
segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
is_sorted (bool, optional) – If True, assumes the segments are already sorted. Default is False.
- Returns:
True if any segments overlap, False otherwise.
- Return type:
bool
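An overlap check of this kind can be sketched by sorting each chromosome's segments by start and comparing neighbors, which is sufficient because any overlapping pair implies an overlapping adjacent pair in sorted order. The function name any_overlap is hypothetical, and treating touching segments (end equal to the next start) as non-overlapping is an assumption.

```python
def any_overlap(segs, is_sorted=False):
    """Return True if any two segments on the same chromosome overlap."""
    for chrom_segs in segs.values():
        ordered = chrom_segs if is_sorted else sorted(chrom_segs)
        for prev, cur in zip(ordered, ordered[1:]):
            # The next segment starts before the previous one ends.
            if cur[0] < prev[1]:
                return True
    return False


overlapping = any_overlap({"chr1": [(5, 20, "b"), (0, 10, "a")]})
touching = any_overlap({"chr1": [(0, 10, "a"), (10, 20, "b")]})
```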
- cns.process.segments.filter_cons_size(chr_segs, min_size)
Filters segments based on a minimum size, keeping only consecutive segments.
- Parameters:
chr_segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
min_size (int) – Minimum size of segments to keep.
- Returns:
Dictionary of filtered segments.
- Return type:
dict
- cns.process.segments.filter_min_size(chr_segs, min_size, merge_first=False)
Filters segments based on a minimum size.
- Parameters:
chr_segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
min_size (int) – Minimum size of segments to keep.
merge_first (bool, optional) – If True, merges segments before filtering. Default is False.
- Returns:
Dictionary of filtered segments.
- Return type:
dict
- cns.process.segments.find_overlaps(segs, is_sorted=False)
Finds overlapping segments in the input dictionary.
- Parameters:
segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
is_sorted (bool, optional) – If True, assumes the segments are already sorted. Default is False.
- Returns:
Dictionary of overlapping segments with chromosome names as keys and list of overlapping segments as values.
- Return type:
dict
- cns.process.segments.get_consecutive_segs(segs)
Groups consecutive segments for each chromosome.
- Parameters:
segs (list of tuples) – List of segments for a chromosome.
- Returns:
List of lists of consecutive segments.
- Return type:
list of lists
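Grouping consecutive segments can be sketched as follows. The function name group_consecutive is hypothetical, and the sketch assumes that “consecutive” means one segment ends exactly where the next one starts, with the input already sorted by start.

```python
def group_consecutive(segs):
    """Group sorted (start, end, name) tuples into runs that touch."""
    groups = []
    for seg in segs:
        # Extend the current run if this segment starts where the last ended.
        if groups and groups[-1][-1][1] == seg[0]:
            groups[-1].append(seg)
        else:
            groups.append([seg])
    return groups


segs = [(0, 10, "a"), (10, 25, "b"), (40, 50, "c")]
groups = group_consecutive(segs)
```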
- cns.process.segments.make_segments(segment_type, assembly=<cns.utils.assemblies.Assembly object>)
Selects and returns specific genomic regions of the selected type.
- Parameters:
segment_type (str) – The type of regions to select. Options include: “” (returns an empty dictionary), “arms” (chromosome arms), “bands” (cytogenetic bands), “whole” (whole genome segments), “gaps” (gap regions), “centromeres” (centromeric regions), and “chrX” (the whole chromosome X; replace X with the chromosome number or name).
assembly (object, optional) – The genome assembly to use. Default is hg19.
- Returns:
A dictionary with chromosome names as keys and lists of tuples representing the regions as values. Each tuple contains (start, end, name).
- Return type:
dict
- Raises:
ValueError – If an invalid chromosome is specified in the segment_type parameter.
- cns.process.segments.merge_segments(segs, is_sorted=False)
Merges overlapping segments in the input dictionary.
- Parameters:
segs (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
is_sorted (bool, optional) – If True, assumes the segments are already sorted. Default is False.
- Returns:
Dictionary of merged segments with chromosome names as keys and list of merged segments as values.
- Return type:
dict
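Merging overlapping segments is a standard interval sweep, sketched below in plain Python. The function name merge_overlapping is hypothetical; keeping the name of the first segment in each merged run and merging segments that merely touch are both assumptions of this sketch.

```python
def merge_overlapping(segs, is_sorted=False):
    """Merge overlapping (start, end, name) tuples per chromosome."""
    merged = {}
    for chrom, chrom_segs in segs.items():
        ordered = chrom_segs if is_sorted else sorted(chrom_segs)
        out = []
        for start, end, name in ordered:
            if out and start <= out[-1][1]:
                # Overlaps (or touches) the previous segment: extend it.
                p_start, p_end, p_name = out[-1]
                out[-1] = (p_start, max(p_end, end), p_name)
            else:
                out.append((start, end, name))
        merged[chrom] = out
    return merged


segs = {"chr1": [(5, 20, "b"), (0, 10, "a"), (30, 40, "c")]}
merged = merge_overlapping(segs)
```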
- cns.process.segments.segment_difference(segs_a, segs_b, sorted=False)
Computes the difference between two sets of segments.
- Parameters:
segs_a (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
segs_b (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
sorted (bool, optional) – If True, the input segments are assumed to be already sorted. Default is False.
- Returns:
Dictionary of segments representing the difference between the input segments.
- Return type:
dict
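A segment difference can be sketched as removing from each segment of the first set every stretch covered by the second set. The function name subtract is hypothetical, and the sketch assumes half-open coordinates and that the remaining pieces keep the name of the segment they came from.

```python
def subtract(segs_a, segs_b):
    """Parts of segs_a not covered by segs_b, per chromosome."""
    result = {}
    for chrom, a_list in segs_a.items():
        b_list = sorted(segs_b.get(chrom, []))
        out = []
        for start, end, name in sorted(a_list):
            cursor = start  # left edge of the part not yet processed
            for b_start, b_end, _ in b_list:
                if b_end <= cursor or b_start >= end:
                    continue  # no overlap with the remaining part
                if b_start > cursor:
                    out.append((cursor, b_start, name))
                cursor = max(cursor, b_end)
                if cursor >= end:
                    break
            if cursor < end:
                out.append((cursor, end, name))
        result[chrom] = out
    return result


segs_a = {"chr1": [(0, 100, "a")]}
segs_b = {"chr1": [(20, 30, "x"), (50, 120, "y")]}
diff = subtract(segs_a, segs_b)
```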
- cns.process.segments.segment_union(segs_a, segs_b, merge=True)
Computes the union of two sets of segments.
- Parameters:
segs_a (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
segs_b (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
merge (bool, optional) – If True, merges overlapping segments in the result. Default is True.
- Returns:
Dictionary of segments representing the union of the input segments.
- Return type:
dict
- cns.process.segments.split_segment(seg_start, seg_end, seg_name, step_size, strategy='scale')
Splits a segment into smaller segments of specified size.
- Parameters:
seg_start (int) – Start position of the segment.
seg_end (int) – End position of the segment.
seg_name (str) – Name of the segment.
step_size (int) – Size of each smaller segment.
strategy (str, optional) – Strategy to use for splitting. Options are “scale”, “pad”, “after”. Default is “scale”.
- Returns:
List of smaller segments.
- Return type:
list of tuples
- cns.process.segments.split_segments(segments, step_size, strategy='scale')
Splits segments into smaller segments of specified size.
- Parameters:
segments (dict) – Dictionary of segments with chromosome names as keys and list of segments as values.
step_size (int) – Size of each smaller segment.
strategy (str, optional) – Strategy to use for splitting. Options are “scale”, “pad”, “after”. Default is “scale”.
- Returns:
Dictionary of smaller segments.
- Return type:
dict