cns.utils package

cns.utils.anomaly module

cns.utils.anomaly.calc_angles(cns_df, cn_col, group_by='sample')

Calculate angles between segments across the entire dataset, handling discontinuities appropriately.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing copy number segments
cn_col (str) – Column name containing copy number values

Returns:

Series of angles indexed by the original DataFrame indices

Return type:

pandas.Series

cns.utils.anomaly.calc_angles_cons(cns_df, cn_col)

Calculate normalized angle differences (second derivative) of copy number values across genomic segments. This function computes a score for each segment in the input DataFrame, reflecting the change in copy number relative to the segment’s position and length. The score is based on the normalized first and second differences of the copy number values, adjusted by segment midpoints and mean segment length.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing at least ‘start’, ‘end’, and the specified copy number column.
cn_col (str) – Name of the column in cns_df containing copy number values.

Returns:: Array of normalized angle difference scores for each segment.
Return type:: np.ndarray
Raises:: ValueError – If the segments are not strictly ordered by their midpoints.

cns.utils.anomaly.calculate_signed_angle(s1, s2)

Calculates the signed angle between two slopes. Concave angles (rising curve) are positive, convex angles (falling curve) are negative.

Parameters:

s1 (float) – Slope of the first line.
s2 (float) – Slope of the second line.

Returns:

angle_degrees – Signed angle in degrees.

Return type:

float
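
A minimal sketch of one way to obtain a signed angle from two slopes, assuming the angle is the difference of the two inclinations. The name signed_angle_degrees and the exact formula are assumptions for illustration and are not taken from the library.

```python
import math


def signed_angle_degrees(s1: float, s2: float) -> float:
    """Angle in degrees turned when moving from slope s1 to slope s2.

    Assumption: positive when the slope increases, negative when it decreases.
    """
    # atan maps each slope to its inclination in (-90, 90) degrees.
    return math.degrees(math.atan(s2) - math.atan(s1))


flat_then_rising = signed_angle_degrees(0.0, 1.0)  # slope increases
rising_then_flat = signed_angle_degrees(1.0, 0.0)  # slope decreases
```

Under this convention a flat segment followed by a rising one gives a positive angle, and the reverse order gives the same magnitude with a negative sign.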

cns.utils.anomaly.count_below_lim(vals, min_val=0, max_val=1, steps=1000)

Counts how many samples are below each cutoff value, linearly spaced.

Parameters:

vals (array-like) – Array of values to count.
min_val (float, optional) – Minimum value for the cutoff range. Default is 0.
max_val (float, optional) – Maximum value for the cutoff range. Default is 1.
steps (int, optional) – Number of steps between min_val and max_val. Default is 1000.

Returns:

cutoffs (numpy.ndarray) – Array of cutoff values.
counts (numpy.ndarray) – Array of counts of samples below each cutoff value.
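
The counting described above can be sketched with the standard library. This version assumes that both endpoints are included in the cutoff grid and that "below" means strictly less than; the library may choose differently. The name count_below_cutoffs is hypothetical.

```python
from bisect import bisect_left


def count_below_cutoffs(vals, min_val=0.0, max_val=1.0, steps=1000):
    """Return (cutoffs, counts) for linearly spaced cutoffs."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    ordered = sorted(vals)
    step = (max_val - min_val) / (steps - 1)
    cutoffs = [min_val + i * step for i in range(steps)]
    # bisect_left gives the number of values strictly below each cutoff.
    counts = [bisect_left(ordered, c) for c in cutoffs]
    return cutoffs, counts


cutoffs, counts = count_below_cutoffs([0.1, 0.5, 0.5, 0.9], steps=5)
```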

cns.utils.anomaly.count_cum_val(vals, min_val=0, max_val=1)

Counts the number of samples below each present value.

Parameters:

vals (array-like) – Array of values to count.
min_val (float, optional) – Minimum value to consider. Default is 0.
max_val (float, optional) – Maximum value to consider. Default is 1.

Returns:

unique_vals (numpy.ndarray) – Array of unique values within the specified range.
cumulative_count (numpy.ndarray) – Cumulative count of samples below each unique value.
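
A small sketch of a cumulative count over the values that are actually present. It assumes the range bounds are inclusive and that each count includes the value itself; both are assumptions, and cumulative_counts is a hypothetical name.

```python
from collections import Counter
from itertools import accumulate


def cumulative_counts(vals, min_val=0.0, max_val=1.0):
    """Return (unique_vals, cumulative_count) for values inside the range."""
    in_range = [v for v in vals if min_val <= v <= max_val]
    freq = Counter(in_range)
    unique_vals = sorted(freq)
    # Running total of how many samples are at or below each unique value.
    cumulative = list(accumulate(freq[v] for v in unique_vals))
    return unique_vals, cumulative


unique_vals, cumulative = cumulative_counts([0.2, 0.2, 0.5, 1.5, 0.9])
```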

cns.utils.anomaly.find_bends(vals, min_val=None, max_val=None)

Finds the knee and elbow in a cumulative distribution of values.

Parameters:

vals (array-like) – Array of values to analyze.
min_val (float, optional) – Minimum value to consider. Default is None.
max_val (float, optional) – Maximum value to consider. Default is None.

Returns:

X (numpy.ndarray) – Array of unique values within the specified range.
Y (numpy.ndarray) – Cumulative count of samples below each unique value.
knee_index (int) – Index of the knee.
knee_value (float) – Value of the knee.
elbow_index (int) – Index of the elbow.
elbow_value (float) – Value of the elbow.

cns.utils.anomaly.find_knee(x, y, knee=True)

Finds a knee or elbow in the curve using convex/concave curves.

Parameters:

x (array-like) – Array of indices.
y (array-like) – Array of values at the indices.
knee (bool, optional) – If True, finds the knee. If False, finds the elbow. Default is True.

Returns:

int – Index of the knee or elbow.
float – Value of the knee or elbow.
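
One common way to locate a bend is to take the point farthest from the straight line joining the curve's endpoints. The sketch below uses that idea as a stand-in; it is not taken from the library, whose method may differ. The name find_bend and the choice to return the x value at the bend are assumptions.

```python
def find_bend(x, y, knee=True):
    """Return (index, x value) of the point farthest from the chord.

    Assumes an increasing curve: a knee lies above the chord,
    an elbow lies below it.
    """
    x0, x1 = x[0], x[-1]
    y0, y1 = y[0], y[-1]
    if x1 == x0 or y1 == y0:
        raise ValueError("curve must span a non-zero range in x and y")
    # Rescale both axes to [0, 1] so the chord becomes the diagonal y = x.
    xn = [(v - x0) / (x1 - x0) for v in x]
    yn = [(v - y0) / (y1 - y0) for v in y]
    sign = 1.0 if knee else -1.0
    diffs = [sign * (b - a) for a, b in zip(xn, yn)]
    idx = max(range(len(diffs)), key=diffs.__getitem__)
    return idx, x[idx]


knee_idx, knee_val = find_bend([0, 1, 2, 3, 4], [0, 6, 8, 9, 10], knee=True)
elbow_idx, elbow_val = find_bend([0, 1, 2, 3, 4], [0, 1, 2, 4, 10], knee=False)
```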

cns.utils.anomaly.z_score_filter(vals, min_val=-3, max_val=3)

Removes values with z-scores below min_val or above max_val.

Parameters:

vals (array-like) – Array of values to filter.
min_val (float, optional) – Minimum z-score to keep. Default is -3.
max_val (float, optional) – Maximum z-score to keep. Default is 3.

Returns:

Filtered array of values.

Return type:

numpy.ndarray
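
A minimal sketch of z-score filtering with the standard library. It assumes the population standard deviation and inclusive bounds, neither of which is confirmed by the description above; filter_by_z_score is a hypothetical name.

```python
from statistics import fmean, pstdev


def filter_by_z_score(vals, min_val=-3.0, max_val=3.0):
    """Keep values whose z-score lies within [min_val, max_val]."""
    vals = list(vals)
    mean = fmean(vals)
    std = pstdev(vals)
    if std == 0:
        # All values are identical, so there is nothing to filter.
        return vals
    return [v for v in vals if min_val <= (v - mean) / std <= max_val]


data = [0.0] * 19 + [100.0]
kept = filter_by_z_score(data)
```

Here the single large value has a z-score above 3 and is dropped, while the remaining values are kept.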

cns.utils.assemblies module

class cns.utils.assemblies.Assembly(name, chr_lens, x_name='chrX', y_name='chrY', cytobands=None, gaps=None)

Bases: object

A class to represent a genomic assembly.

name

The name of the assembly.

Type:: str

chr_lens

The lengths of the chromosomes.

Type:: dict

chr_x

The name of the X chromosome.

Type:: str

chr_y

The name of the Y chromosome.

Type:: str

cytobands

The cytobands of the chromosomes.

Type:: list

gaps

The gaps in the chromosomes.

Type:: list

cns.utils.assemblies.get_assembly(assembly_id)

Retrieve an Assembly instance by its ID.

Parameters:: assembly_id (str) – The ID of the assembly to retrieve. Valid values are “hg19” and “hg38”.
Returns:: The Assembly instance corresponding to the given ID.
Return type:: Assembly
Raises:: ValueError – If the assembly_id is not “hg19” or “hg38”.
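
The lookup-with-validation behaviour described above can be sketched as a small registry. AssemblySketch and get_assembly_sketch are hypothetical stand-ins, not the library's classes, and only chr1 is listed to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssemblySketch:
    """Hypothetical stand-in for an assembly record."""
    name: str
    chr_lens: dict
    chr_x: str = "chrX"
    chr_y: str = "chrY"


# Truncated on purpose: a real assembly lists every chromosome.
_REGISTRY = {
    "hg19": AssemblySketch("hg19", {"chr1": 249_250_621}),
    "hg38": AssemblySketch("hg38", {"chr1": 248_956_422}),
}


def get_assembly_sketch(assembly_id):
    """Return the registered assembly or raise ValueError for unknown IDs."""
    try:
        return _REGISTRY[assembly_id]
    except KeyError:
        valid = ", ".join(sorted(_REGISTRY))
        raise ValueError(
            f"unknown assembly {assembly_id!r}; expected one of: {valid}"
        ) from None


hg38_sketch = get_assembly_sketch("hg38")
```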

cns.utils.assemblies.hg19 = <cns.utils.assemblies.Assembly object>: An instance of the Assembly class representing the hg19 genomic assembly.

cns.utils.assemblies.hg38 = <cns.utils.assemblies.Assembly object>: An instance of the Assembly class representing the hg38 genomic assembly.

cns.utils.canonization module

cns.utils.canonization.canonize_chroms(cns_df, assembly=<cns.utils.assemblies.Assembly object>, print_info=False)

Standardizes chromosome names and validates against assembly.

Parameters:

cns_df (pd.DataFrame) – Copy number DataFrame
assembly – Reference assembly defining valid chromosomes
print_info (bool) – Whether to print info messages

Returns:

DataFrame with canonized chromosome names

Return type:

pd.DataFrame

Raises:

ValueError – If no valid chromosomes found
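
A rough sketch of what standardizing and validating chromosome names could look like, assuming a "chr"-prefixed naming style and a set of valid names supplied by the caller. The rules and the function names are assumptions; the library's actual rules are not shown in this reference.

```python
def canonize_chrom_name(raw, valid):
    """Return the "chr"-prefixed name, or None if it is not in valid."""
    name = str(raw).strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    # Upper-casing turns "x" into "X" and leaves digits unchanged.
    name = "chr" + name.upper()
    return name if name in valid else None


def canonize_chrom_names(raw_names, valid):
    """Canonize a sequence of names, dropping those that are not valid."""
    kept = [c for c in (canonize_chrom_name(r, valid) for r in raw_names)
            if c is not None]
    if not kept:
        raise ValueError("no valid chromosomes found")
    return kept


valid_names = {"chr1", "chr2", "chrX", "chrY"}
canonized = canonize_chrom_names(["1", " CHR2 ", "chrx", "chrUn"], valid_names)
```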

cns.utils.canonization.canonize_cns_df(cns_df, input_cn_columns=None, order_columns=False, assembly=<cns.utils.assemblies.Assembly object>, print_info=False)

Canonizes copy number DataFrame by standardizing all columns.

Applies standard canonization steps in order:

1. Sample ID column
2. Chromosome names and validation
3. Position columns (start/end)
4. Copy number columns
5. Optional name column

Parameters:

cns_df (pd.DataFrame) – Copy number DataFrame to canonize
assembly – Reference assembly for chromosome validation
input_cn_columns (str|list, optional) – Copy number column(s) to validate
print_info (bool) – Whether to print info messages

Returns:

pd.DataFrame – Canonized copy number DataFrame
list – Names of canonized copy number columns

Raises:

ValueError – If required columns missing or invalid

cns.utils.canonization.canonize_name(cns_df, print_info=False)

Standardizes optional name column if present.

Parameters:

cns_df (pd.DataFrame) – Copy number DataFrame
print_info (bool) – Whether to print info messages

Returns:

DataFrame with canonized name column

Return type:

pd.DataFrame

cns.utils.canonization.canonize_positions(cns_df, print_info=False)

Ensures standard start/end position columns.

Parameters:

cns_df (pd.DataFrame) – Copy number DataFrame
print_info (bool) – Whether to print info messages

Returns:

DataFrame with canonized position columns

Return type:

pd.DataFrame

cns.utils.canonization.canonize_sample_id(df, print_info=False)

Ensures DataFrame has standardized sample_id column.

Parameters:

df (pd.DataFrame) – DataFrame to canonize
print_info (bool) – Whether to print info messages

Returns:

DataFrame with canonized sample_id column

Return type:

pd.DataFrame

cns.utils.canonization.get_cn_cols(cns_df, cn_cols=None)

Gets or validates copy number columns from DataFrame.

Parameters:

cns_df (pd.DataFrame) – DataFrame containing copy number data
cn_cols (str|list, optional) – Column name(s) to validate. If None, discovers CN columns

Returns:

Valid copy number column names

Return type:

list

Raises:

ValueError – If specified columns not found or invalid number of columns

cns.utils.canonization.rename_cn_cols(cns_df, cn_columns=None, print_info=False)

Renames copy number columns in the DataFrame.

Parameters:

cns_df (pd.DataFrame) – The DataFrame containing copy number data.
cn_columns (list) – List of column names to be renamed.
print_info (bool) – Flag to indicate whether to print information.

Returns:

pd.DataFrame – The DataFrame with renamed columns.
list – The updated list of column names.

cns.utils.conversions module

cns.utils.conversions.bins_to_features(cns_df, cn_columns=None, drop_sex=True)

cns.utils.conversions.breaks_to_segments(breakpoints)

cns.utils.conversions.calc_cum_mid(cns_df, assembly=<cns.utils.assemblies.Assembly object>)

cns.utils.conversions.calc_lengths(cns_df)

cns.utils.conversions.calc_mid(cns_df)

cns.utils.conversions.calc_nan_cols(cns_df, cn_columns=None)

cns.utils.conversions.chrom_to_sortable(chrom, aut_count=22)

cns.utils.conversions.cytobands_to_df(cytobands)

cns.utils.conversions.gaps_to_df(gaps)

cns.utils.conversions.genome_to_segments(assembly=<cns.utils.assemblies.Assembly object>)

cns.utils.conversions.segments_to_breaks(segments)

cns.utils.conversions.segments_to_cns_df(segments, sample_id='segment')

cns.utils.conversions.sortable_to_chrom(sortable, aut_count=22)

cns.utils.conversions.tuples_to_segments(tuples)

cns.utils.conversions.values_count(values_dict)

cns.utils.files module

cns.utils.files.fill_sex_if_missing(cns_df, samples_df)

Fills the sex column in the samples DataFrame if missing, based on the presence of chrY in the CNS data.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing the CNS data.
samples_df (pandas.DataFrame) – DataFrame containing the samples data.

Returns:

Updated samples DataFrame with the sex column filled if missing.

Return type:

pandas.DataFrame

cns.utils.files.load_cns(path, cn_columns=None, sep=None, sort=False, change_coords=True, order_columns=False, assembly=<cns.utils.assemblies.Assembly object>, print_info=False)

Loads a CNS file into a pandas DataFrame. Loading includes a canonization step in which the position column names are standardized to “sample_id”, “chrom”, “start”, “end”. CN columns are renamed to one of [“major_cn”, “minor_cn”], [“hap1_cn”, “hap2_cn”], [“total_cn”]. If these are not found, other typical column names are searched. If that fails, the first 4 columns are used as positions and the following 1-2 as CNs. Coordinates are 1-based on input by default.

Parameters:

path (str) – Path to the CNS file.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from the file.
sep (str, optional) – Separator for the file. If None, the separator is inferred from the file extension.
sort (bool, optional) – If True, sorts the DataFrame by sample_id, chrom, and start.
change_coords (bool, optional) – If True, changes the coordinates to 0-based.
order_columns (bool, optional) – If True and there are two CN columns, the values are ordered as major/minor instead of keeping their existing order, and the columns are renamed to major_cn/minor_cn.
assembly (object, optional) – Genome assembly to use. Default is hg19.
print_info (bool, optional) – If True, prints informational messages during processing.

Returns:

DataFrame containing the CNS data.

Return type:

pandas.DataFrame
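
The coordinate change can be illustrated with a short sketch. It assumes the usual convention that 1-based intervals are closed and 0-based intervals are half-open (as in BED files), so only the start shifts. Whether the library applies exactly this arithmetic is an assumption, and both function names are hypothetical.

```python
def one_based_to_zero_based(start, end):
    """1-based closed [start, end] -> 0-based half-open [start - 1, end)."""
    return start - 1, end


def zero_based_to_one_based(start, end):
    """0-based half-open [start, end) -> 1-based closed [start + 1, end]."""
    return start + 1, end


zero_start, zero_end = one_based_to_zero_based(1, 100)
round_trip = zero_based_to_one_based(zero_start, zero_end)

# The segment length is the same in both systems.
length_zero_based = zero_end - zero_start
length_one_based = round_trip[1] - round_trip[0] + 1
```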

cns.utils.files.load_samples(path, sep=None, print_info=False)

Loads a samples file into a pandas DataFrame. Loading includes a canonization step in which the index column is set to “sample_id”. The column is found by exact or similar name; if it is not found, the first column is used. If a sex column is not found, it is added with the value “NA”.

Parameters:

path (str) – Path to the samples file.
sep (str, optional) – Separator for the file. If None, the separator is inferred from the file extension.

Returns:

DataFrame containing the samples data.

Return type:

pandas.DataFrame

cns.utils.files.load_segments(path)

Loads segments (chrom, start, end, name) from a file into a list of tuples.

Parameters:: path (str) – Path to the segments file.
Returns:: List of segments.
Return type:: list of tuples

cns.utils.files.obtain_segments(segs_source, in_cols=None, assembly=<cns.utils.assemblies.Assembly object>, print_info=False)

cns.utils.files.samples_df_from_cns_df(cns_df, fill_sex=True)

Creates a samples DataFrame (sample_id, sex) from a CNS DataFrame.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing the CNS data.
fill_sex (bool, optional) – If True, fills the sex column in the samples DataFrame based on the presence of chrY in the CNS data.

Returns:

DataFrame containing the samples data.

Return type:

pandas.DataFrame

cns.utils.files.save_cns(cns_df, path, sep=None, sort=False, change_coords=True, mode='w')

Saves a CNS DataFrame to a file. Coordinates are 1-based on output by default.

Parameters:

cns_df (pandas.DataFrame) – DataFrame containing the CNS data.
path (str) – Path to save the file.
sort (bool, optional) – If True, sorts the DataFrame by sample_id, chrom, and start.
change_coords (bool, optional) – If True, changes the coordinates to 1-based before saving.
mode (str, optional) – Mode to open the file. Default is “w” (write). For append, header is not printed.

Return type:

None

cns.utils.files.save_samples(samples_df, path, mode='w')

Saves a samples DataFrame to a file.

Parameters:

samples_df (pandas.DataFrame) – DataFrame containing the samples data.
path (str) – Path to save the file.
mode (str, optional) – Mode to open the file. Default is “w” (write). For append, header is not printed.

Return type:

None

cns.utils.files.save_segments(segs, path)

Saves segments (chrom, start, end, name) to a 0-indexed BED file.

Parameters:

segs (list of tuples) – List of segments to save.
path (str) – Path to save the file.

Return type:

None

cns.utils.logging module

cns.utils.logging.configure_worker_logging(queue, verbose)

Configure logging in a worker process.

Parameters:

queue (multiprocessing.Queue) – Queue for sending log records to main process.
verbose (bool) – If True, sets logging level to INFO.

cns.utils.logging.get_logger(): Get the CNS logger instance.

cns.utils.logging.handle_exception(func)

Decorator to handle exceptions by logging and exiting.

Wraps a function to catch any exceptions, log them as errors, and terminate the program with exit code 1.

Parameters:: func (callable) – The function to wrap.
Returns:: The wrapped function.
Return type:: callable
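
A decorator with the behaviour described above can be sketched as follows. The names exit_on_error and parse_ratio and the logger name are hypothetical; the library's own decorator may format its messages differently.

```python
import functools
import logging
import sys

sketch_logger = logging.getLogger("cns_sketch.errors")


def exit_on_error(func):
    """Log any exception raised by func and exit with status 1."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            sketch_logger.error("%s failed: %s", func.__name__, exc)
            sys.exit(1)
    return wrapper


@exit_on_error
def parse_ratio(text):
    """Hypothetical function used only to demonstrate the decorator."""
    return float(text)
```

Calling parse_ratio("0.5") returns normally, while parse_ratio("abc") logs the error and raises SystemExit with code 1.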

cns.utils.logging.log_error(text)

Log an error message.

Parameters:: text (str) – The error message to log.

cns.utils.logging.log_info(text, suppress=False)

Log an info message.

The message will only be displayed if verbose mode is enabled via set_verbose(True).

Parameters:

text (str) – The message to log.
suppress (bool, optional) – If True, suppresses the log message. Default is False.

cns.utils.logging.log_warn(text, suppress=False)

Log a warning message.

Parameters:

text (str) – The warning message to log.
suppress (bool, optional) – If True, suppresses the log message. Default is False.

cns.utils.logging.set_verbose(verbose)

Set the logging level based on verbosity flag.

Parameters:: verbose (bool) – If True, sets logging level to INFO. Otherwise, sets to WARNING.

cns.utils.logging.setup_mp_logging()

Setup logging for multiprocessing.

Returns a queue that worker processes can use to send log records to the main process.

Returns:: Queue for sending log records from workers to main process.
Return type:: multiprocessing.Queue
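
Queue-based logging of this kind is commonly built on logging.handlers.QueueHandler and QueueListener from the standard library. The sketch below shows that pattern in a single process, using queue.Queue as a stand-in for the multiprocessing.Queue that would be shared across workers. All function and class names here are hypothetical, and the library's own setup may differ.

```python
import logging
import logging.handlers
import queue


class ListHandler(logging.Handler):
    """Collects formatted messages in a list (for demonstration only)."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def start_listener(log_queue, *handlers):
    """Main-process side: forward queued records to the given handlers."""
    listener = logging.handlers.QueueListener(log_queue, *handlers)
    listener.start()
    return listener


def configure_worker(log_queue, verbose):
    """Worker side: route all records of this logger into the queue."""
    worker_logger = logging.getLogger("cns_sketch.worker")
    worker_logger.handlers.clear()
    worker_logger.addHandler(logging.handlers.QueueHandler(log_queue))
    worker_logger.setLevel(logging.INFO if verbose else logging.WARNING)
    worker_logger.propagate = False
    return worker_logger


log_queue = queue.Queue()  # stand-in for multiprocessing.Queue
collector = ListHandler()
listener = start_listener(log_queue, collector)
worker = configure_worker(log_queue, verbose=True)
worker.info("segment batch done")
listener.stop()  # drains the queue, then stops the listener thread
```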

cns.utils.logging.stop_mp_logging(): Stop the multiprocessing logging listener.

cns.utils.logging.suppress_errors(suppress=True)

Suppress error messages from being displayed.

Parameters:: suppress (bool) – If True, sets logging level to CRITICAL (suppressing ERROR and below). If False, restores to WARNING level.
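
The level switches described for set_verbose and suppress_errors map directly onto standard logging levels. A minimal sketch on a separate, hypothetical logger:

```python
import logging

level_logger = logging.getLogger("cns_sketch.levels")


def set_verbose_sketch(verbose):
    """INFO when verbose, WARNING otherwise."""
    level_logger.setLevel(logging.INFO if verbose else logging.WARNING)


def suppress_errors_sketch(suppress=True):
    """CRITICAL hides ERROR and below; False restores WARNING."""
    level_logger.setLevel(logging.CRITICAL if suppress else logging.WARNING)


set_verbose_sketch(True)
info_visible_when_verbose = level_logger.isEnabledFor(logging.INFO)

suppress_errors_sketch(True)
error_visible_when_suppressed = level_logger.isEnabledFor(logging.ERROR)

suppress_errors_sketch(False)
error_visible_after_restore = level_logger.isEnabledFor(logging.ERROR)
```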

cns.utils.selection module

cns.utils.selection.cn_not_nan(cns_df, cn_columns, any_col)

Filters copy number DataFrame for non-NaN values.

Parameters:

cns_df (pd.DataFrame) – Copy number DataFrame
cn_columns (list) – Column names to check for NaN values
any_col (bool) – If True, keep rows with any non-NaN values, if False, keep rows with all non-NaN values

Returns:

Filtered DataFrame with non-NaN values

Return type:

pd.DataFrame
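
The any/all distinction controlled by any_col can be shown on plain dictionaries. This sketch uses a list of row dictionaries instead of a DataFrame to stay dependency-free; rows_with_values is a hypothetical name.

```python
import math


def rows_with_values(rows, cn_columns, any_col):
    """Keep rows where any (or all) of the CN columns are not NaN."""
    combine = any if any_col else all
    return [
        row for row in rows
        if combine(not math.isnan(row[col]) for col in cn_columns)
    ]


nan = float("nan")
rows = [
    {"major_cn": 2.0, "minor_cn": 1.0},
    {"major_cn": nan, "minor_cn": 1.0},
    {"major_cn": nan, "minor_cn": nan},
]
kept_any = rows_with_values(rows, ["major_cn", "minor_cn"], any_col=True)
kept_all = rows_with_values(rows, ["major_cn", "minor_cn"], any_col=False)
```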

cns.utils.selection.cns_head(cns_df, n=5)

cns.utils.selection.cns_random(cns_df, n=5, seed=0)

cns.utils.selection.cns_tail(cns_df, n=5)

cns.utils.selection.dataframe_array_split(samples_df, n_splits)

Splits a DataFrame into n_splits parts as equally as possible.

Parameters:

samples_df (pd.DataFrame) – The pandas DataFrame to split.
n_splits (int) – The number of parts to split the DataFrame into.

Returns:

A list of pandas DataFrame objects.
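
Splitting "as equally as possible" usually means that part sizes differ by at most one. The sketch below does this for a plain list, giving the extra items to the first parts, which is the same rule numpy.array_split follows. Whether the library delegates to that function is an assumption, and split_evenly is a hypothetical name.

```python
def split_evenly(items, n_splits):
    """Split items into n_splits consecutive parts whose sizes differ by at most 1."""
    if n_splits < 1:
        raise ValueError("n_splits must be at least 1")
    base, extra = divmod(len(items), n_splits)
    parts, start = [], 0
    for i in range(n_splits):
        # The first `extra` parts receive one additional item.
        size = base + (1 if i < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts


parts = split_evenly(list(range(7)), 3)
```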

cns.utils.selection.drop_Y(cns_df, assembly=<cns.utils.assemblies.Assembly object>, inplace=False)

Remove Y chromosome data from DataFrame.

Parameters:

cns_df (pd.DataFrame) – Copy number DataFrame
assembly – Reference assembly defining Y chromosome
inplace – If True, modify DataFrame in place, otherwise make a copy

Returns:

DataFrame without Y chromosome data

Return type:

pd.DataFrame

cns.utils.selection.get_chr_sets(cns_df, assembly=<cns.utils.assemblies.Assembly object>)

Groups chromosomes into autosomal and sex chromosome sets.

Parameters:

cns_df (pd.DataFrame) – Copy number DataFrame with ‘chrom’ column
assembly – Reference assembly defining chromosome types

Returns:

Dictionary with keys:
‘aut’: list of autosomal chromosomes
‘sex’: list of sex chromosomes (if present)
‘all’: combined list (if sex chromosomes present)

Return type:

dict

Raises:

ValueError – If no autosomal chromosomes found
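
The grouping described above can be sketched on a plain list of chromosome names. The function name and the default sex chromosome names are assumptions chosen to mirror the Assembly defaults; the ordering inside each list is also an assumption.

```python
def chr_sets_sketch(chroms, chr_x="chrX", chr_y="chrY"):
    """Group the chromosomes present into 'aut', 'sex' and 'all' lists."""
    present = list(dict.fromkeys(chroms))  # unique, first-seen order
    sex = [c for c in present if c in (chr_x, chr_y)]
    aut = [c for c in present if c not in (chr_x, chr_y)]
    if not aut:
        raise ValueError("no autosomal chromosomes found")
    sets = {"aut": aut}
    if sex:
        # 'sex' and 'all' are only added when sex chromosomes are present.
        sets["sex"] = sex
        sets["all"] = aut + sex
    return sets


with_sex = chr_sets_sketch(["chr1", "chr1", "chr2", "chrX"])
aut_only = chr_sets_sketch(["chr1", "chr2"])
```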

cns.utils.selection.only_aut(cns_df, assembly=<cns.utils.assemblies.Assembly object>, inplace=False)

Filter DataFrame for autosomes.

Parameters:

cns_df (pd.DataFrame) – Copy number DataFrame
assembly – Reference assembly defining sex chromosomes
inplace – If True, modify DataFrame in place, otherwise make a copy

Returns:

DataFrame with only autosomes.

Return type:

pd.DataFrame

cns.utils.selection.only_sex(cns_df, assembly=<cns.utils.assemblies.Assembly object>, inplace=False)

Filter DataFrame for sex chromosomes.

Parameters:

cns_df (pd.DataFrame) – Copy number DataFrame
assembly – Reference assembly defining sex chromosomes
inplace – If True, modify DataFrame in place, otherwise make a copy

Returns:

DataFrame with only sex chromosome entries

Return type:

pd.DataFrame

cns.utils.selection.sample_head(samples_df, n=5)

cns.utils.selection.sample_random(samples_df, n=5, seed=0)

cns.utils.selection.sample_tail(samples_df, n=5)

cns.utils.selection.select_CNS_samples(cns_df, samples_df)

Select specific samples from copy number DataFrame.

Parameters:

cns_df (pd.DataFrame) – Copy number DataFrame
samples_df (pd.DataFrame) – DataFrame containing sample IDs

Returns:

DataFrame containing only specified samples

Return type:

pd.DataFrame

cns.utils.selection.select_cns_by_type(cns_df, samples_df, type_val, type_col='type')

Select copy number segments by their type.

Parameters:

cns_df (pd.DataFrame) – Copy number DataFrame
samples_df (pd.DataFrame) – DataFrame containing sample IDs
type_val (str) – Type value to select
type_col (str) – Column name containing type information (default: “type”)

Returns:

DataFrame filtered by specified type(s)

Return type:

pd.DataFrame

Raises:

KeyError – If type_col not found in DataFrame