cns.utils package
cns.utils.anomaly module
- cns.utils.anomaly.calc_angles(cns_df, cn_col, group_by='sample')
Calculate angles between segments across the entire dataset, handling discontinuities appropriately.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing copy number segments
cn_col (str) – Column name containing copy number values
- Returns:
Series of angles indexed by the original DataFrame indices
- Return type:
pandas.Series
- cns.utils.anomaly.calc_angles_cons(cns_df, cn_col)
Calculate normalized angle differences (second derivative) of copy number values across genomic segments. This function computes a score for each segment in the input DataFrame, reflecting the change in copy number relative to the segment’s position and length. The score is based on the normalized first and second differences of the copy number values, adjusted by segment midpoints and mean segment length.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing at least ‘start’, ‘end’, and the specified copy number column.
cn_col (str) – Name of the column in cns_df containing copy number values.
- Returns:
Array of normalized angle difference scores for each segment.
- Return type:
np.ndarray
- Raises:
ValueError – If the segments are not strictly ordered by their midpoints.
- cns.utils.anomaly.calculate_signed_angle(s1, s2)
Calculates the signed angle between two slopes. Concave angles (rising curve) are positive, convex angles (falling curve) are negative.
- Parameters:
s1 (float) – Slope of the first line.
s2 (float) – Slope of the second line.
- Returns:
angle_degrees – Signed angle in degrees.
- Return type:
float
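One way to read the signed angle between two slopes is as the difference of their inclination angles. The sketch below is an illustration under that assumption, not the library's code; the function name is hypothetical, and the sign convention (positive when the slope increases from s1 to s2) is inferred from the description above.

```python
import math


def signed_angle_between_slopes(s1: float, s2: float) -> float:
    """Signed angle in degrees from a line with slope s1 to a line with slope s2."""
    # atan maps each slope to its inclination angle in (-90, 90) degrees;
    # the difference is positive when the slope increases (assumed convention).
    return math.degrees(math.atan(s2) - math.atan(s1))


print(signed_angle_between_slopes(0.0, 1.0))  # slope increases: positive angle
print(signed_angle_between_slopes(1.0, 0.0))  # slope decreases: negative angle
```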
- cns.utils.anomaly.count_below_lim(vals, min_val=0, max_val=1, steps=1000)
Counts how many samples are below each cutoff value, linearly spaced.
- Parameters:
vals (array-like) – Array of values to count.
min_val (float, optional) – Minimum value for the cutoff range. Default is 0.
max_val (float, optional) – Maximum value for the cutoff range. Default is 1.
steps (int, optional) – Number of steps between min_val and max_val. Default is 1000.
- Returns:
cutoffs (numpy.ndarray) – Array of cutoff values.
counts (numpy.ndarray) – Array of counts of samples below each cutoff value.
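The counting over linearly spaced cutoffs can be sketched with the standard library alone. This is a minimal illustration, assuming the cutoffs include both endpoints and that "below" means strictly less than the cutoff; the function name is hypothetical and lists stand in for the NumPy arrays the real function returns.

```python
from bisect import bisect_left


def count_below_cutoffs(vals, min_val=0.0, max_val=1.0, steps=1000):
    """Return (cutoffs, counts) for `steps` evenly spaced cutoffs (steps >= 2)."""
    # Evenly spaced cutoffs including both endpoints (assumed).
    step = (max_val - min_val) / (steps - 1)
    cutoffs = [min_val + i * step for i in range(steps)]
    ordered = sorted(vals)
    # bisect_left returns how many sorted values are strictly below the cutoff.
    counts = [bisect_left(ordered, c) for c in cutoffs]
    return cutoffs, counts


cutoffs, counts = count_below_cutoffs([0.1, 0.4, 0.4, 0.9], steps=5)
print(cutoffs)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(counts)   # [0, 1, 3, 3, 4]
```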
- cns.utils.anomaly.count_cum_val(vals, min_val=0, max_val=1)
Counts the number of samples below each present value.
- Parameters:
vals (array-like) – Array of values to count.
min_val (float, optional) – Minimum value to consider. Default is 0.
max_val (float, optional) – Maximum value to consider. Default is 1.
- Returns:
unique_vals (numpy.ndarray) – Array of unique values within the specified range.
cumulative_count (numpy.ndarray) – Cumulative count of samples below each unique value.
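A cumulative count over the values actually present can be sketched as follows. The function name is hypothetical, and two details are assumptions: the range is treated as inclusive on both ends, and each count includes the value itself (an empirical-CDF style count).

```python
from collections import Counter
from itertools import accumulate


def cumulative_counts(vals, min_val=0.0, max_val=1.0):
    """Return the sorted unique values in [min_val, max_val] and their running counts."""
    # Keep only values inside the (assumed inclusive) range.
    freq = Counter(v for v in vals if min_val <= v <= max_val)
    unique_vals = sorted(freq)
    # Running total of occurrences up to and including each unique value.
    cumulative = list(accumulate(freq[v] for v in unique_vals))
    return unique_vals, cumulative


print(cumulative_counts([0.2, 0.2, 0.5, 1.5, 0.9]))  # ([0.2, 0.5, 0.9], [2, 3, 4])
```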
- cns.utils.anomaly.find_bends(vals, min_val=None, max_val=None)
Finds the knee and elbow in a cumulative distribution of values.
- Parameters:
vals (array-like) – Array of values to analyze.
min_val (float, optional) – Minimum value to consider. Default is None.
max_val (float, optional) – Maximum value to consider. Default is None.
- Returns:
X (numpy.ndarray) – Array of unique values within the specified range.
Y (numpy.ndarray) – Cumulative count of samples below each unique value.
knee_index (int) – Index of the knee.
knee_value (float) – Value of the knee.
elbow_index (int) – Index of the elbow.
elbow_value (float) – Value of the elbow.
- cns.utils.anomaly.find_knee(x, y, knee=True)
Finds a knee or elbow in the curve using convex/concave curves.
- Parameters:
x (array-like) – Array of indices.
y (array-like) – Array of values at the indices.
knee (bool, optional) – If True, finds the knee. If False, finds the elbow. Default is True.
- Returns:
int – Index of the knee or elbow.
float – Value of the knee or elbow.
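The documentation does not say which knee-detection method is used. As a stand-in, the sketch below uses a common heuristic: pick the point farthest from the straight line joining the curve's endpoints. It is a substitute technique for illustration only, the function name is hypothetical, it does not distinguish knee from elbow, and it does not normalize the axes.

```python
def knee_by_chord_distance(x, y):
    """Return (index, y value) of the point farthest from the endpoint-to-endpoint chord."""
    x0, y0 = x[0], y[0]
    dx, dy = x[-1] - x0, y[-1] - y0
    # Unnormalized perpendicular distance to the chord; dividing by the
    # chord length would not change which point is the maximum.
    dists = [abs(dy * (xi - x0) - dx * (yi - y0)) for xi, yi in zip(x, y)]
    idx = max(range(len(dists)), key=dists.__getitem__)
    return idx, y[idx]


print(knee_by_chord_distance([0, 1, 2, 3, 4], [0, 6, 8, 9, 10]))  # (1, 6)
```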
- cns.utils.anomaly.z_score_filter(vals, min_val=-3, max_val=3)
Removes values with z-scores below min_val or above max_val.
- Parameters:
vals (array-like) – Array of values to filter.
min_val (float, optional) – Minimum z-score to keep. Default is -3.
max_val (float, optional) – Maximum z-score to keep. Default is 3.
- Returns:
Filtered array of values.
- Return type:
numpy.ndarray
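The z-score filter can be sketched as below. This is a minimal illustration, assuming the population standard deviation (NumPy's default) and inclusive bounds; the function name is hypothetical and a list stands in for the returned array.

```python
from statistics import fmean, pstdev


def filter_by_z_score(vals, min_val=-3.0, max_val=3.0):
    """Keep values whose z-score lies within [min_val, max_val]."""
    mean = fmean(vals)
    std = pstdev(vals)  # population standard deviation (assumed)
    if std == 0:
        # All values are identical, so every z-score is effectively 0.
        return list(vals)
    return [v for v in vals if min_val <= (v - mean) / std <= max_val]


data = [10.0] * 20 + [100.0]
print(len(filter_by_z_score(data)))  # 20, the outlier at 100.0 is dropped
```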
cns.utils.assemblies module
- class cns.utils.assemblies.Assembly(name, chr_lens, x_name='chrX', y_name='chrY', cytobands=None, gaps=None)
Bases:
object
A class to represent a genomic assembly.
- name
The name of the assembly.
- Type:
str
- chr_lens
The lengths of the chromosomes.
- Type:
dict
- chr_x
The name of the X chromosome.
- Type:
str
- chr_y
The name of the Y chromosome.
- Type:
str
- cytobands
The cytobands of the chromosomes.
- Type:
list
- gaps
The gaps in the chromosomes.
- Type:
list
- cns.utils.assemblies.get_assembly(assembly_id)
Retrieve an Assembly instance by its ID.
- Parameters:
assembly_id (str) – The ID of the assembly to retrieve. Valid values are “hg19” and “hg38”.
- Returns:
The Assembly instance corresponding to the given ID.
- Return type:
Assembly
- Raises:
ValueError – If the assembly_id is not “hg19” or “hg38”.
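The lookup-or-raise behaviour described for get_assembly can be sketched with a small registry. Everything here is a stand-in: AssemblyStub and get_assembly_stub are hypothetical names, and only chr1 lengths are listed, whereas the real Assembly objects carry full chromosome tables, cytobands, and gaps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssemblyStub:
    """Hypothetical, reduced stand-in for cns.utils.assemblies.Assembly."""
    name: str
    chr_lens: dict
    chr_x: str = "chrX"
    chr_y: str = "chrY"


# Truncated registry for illustration; only chr1 is included.
_ASSEMBLIES = {
    "hg19": AssemblyStub("hg19", {"chr1": 249_250_621}),
    "hg38": AssemblyStub("hg38", {"chr1": 248_956_422}),
}


def get_assembly_stub(assembly_id: str) -> AssemblyStub:
    """Return the registered assembly, or raise ValueError for unknown IDs."""
    try:
        return _ASSEMBLIES[assembly_id]
    except KeyError:
        raise ValueError(f"Unknown assembly {assembly_id!r}; expected 'hg19' or 'hg38'") from None


print(get_assembly_stub("hg38").name)  # hg38
```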
- cns.utils.assemblies.hg19 = <cns.utils.assemblies.Assembly object>
An instance of the Assembly class representing the hg19 genomic assembly.
- cns.utils.assemblies.hg38 = <cns.utils.assemblies.Assembly object>
An instance of the Assembly class representing the hg38 genomic assembly.
cns.utils.canonization module
- cns.utils.canonization.canonize_chroms(cns_df, assembly=<cns.utils.assemblies.Assembly object>, print_info=False)
Standardizes chromosome names and validates against assembly.
- Parameters:
cns_df (pd.DataFrame) – Copy number DataFrame
assembly – Reference assembly defining valid chromosomes
print_info (bool) – Whether to print info messages
- Returns:
DataFrame with canonized chromosome names
- Return type:
pd.DataFrame
- Raises:
ValueError – If no valid chromosomes found
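The exact renaming rules are not documented here. The sketch below shows one plausible form of chromosome-name canonization, assuming a "chr" prefix is enforced, the remainder is upper-cased, and names missing from the assembly are dropped. Both function names are hypothetical, and plain lists stand in for the DataFrame column.

```python
def canonize_chrom_name(raw) -> str:
    """Map variants such as '1', 'Chr1', or 'chrx' to 'chr1' / 'chrX' (assumed rules)."""
    name = str(raw).strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    return "chr" + name.upper()


def canonize_chrom_names(chroms, valid_chroms):
    """Canonize names and keep only those present in valid_chroms."""
    canon = [canonize_chrom_name(c) for c in chroms]
    kept = [c for c in canon if c in valid_chroms]
    if not kept:
        raise ValueError("No valid chromosomes found")
    return kept


valid = {"chr1", "chr2", "chrX"}
print(canonize_chrom_names(["1", "chrx", "Chr2", "scaffold_7"], valid))
# ['chr1', 'chrX', 'chr2']
```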
- cns.utils.canonization.canonize_cns_df(cns_df, input_cn_columns=None, order_columns=False, assembly=<cns.utils.assemblies.Assembly object>, print_info=False)
Canonizes copy number DataFrame by standardizing all columns.
Applies standard canonization steps in order:
1. Sample ID column
2. Chromosome names and validation
3. Position columns (start/end)
4. Copy number columns
5. Optional name column
- Parameters:
cns_df (pd.DataFrame) – Copy number DataFrame to canonize
assembly – Reference assembly for chromosome validation
input_cn_columns (str|list, optional) – Copy number column(s) to validate
print_info (bool) – Whether to print info messages
- Returns:
pd.DataFrame – Canonized copy number DataFrame
list – Names of canonized copy number columns
- Raises:
ValueError – If required columns missing or invalid
- cns.utils.canonization.canonize_name(cns_df, print_info=False)
Standardizes optional name column if present.
- Parameters:
cns_df (pd.DataFrame) – Copy number DataFrame
print_info (bool) – Whether to print info messages
- Returns:
DataFrame with canonized name column
- Return type:
pd.DataFrame
- cns.utils.canonization.canonize_positions(cns_df, print_info=False)
Ensures standard start/end position columns.
- Parameters:
cns_df (pd.DataFrame) – Copy number DataFrame
print_info (bool) – Whether to print info messages
- Returns:
DataFrame with canonized position columns
- Return type:
pd.DataFrame
- cns.utils.canonization.canonize_sample_id(df, print_info=False)
Ensures DataFrame has standardized sample_id column.
- Parameters:
df (pd.DataFrame) – DataFrame to canonize
print_info (bool) – Whether to print info messages
- Returns:
DataFrame with canonized sample_id column
- Return type:
pd.DataFrame
- cns.utils.canonization.get_cn_cols(cns_df, cn_cols=None)
Gets or validates copy number columns from DataFrame.
- Parameters:
cns_df (pd.DataFrame) – DataFrame containing copy number data
cn_cols (str|list, optional) – Column name(s) to validate. If None, discovers CN columns
- Returns:
Valid copy number column names
- Return type:
list
- Raises:
ValueError – If specified columns not found or invalid number of columns
- cns.utils.canonization.rename_cn_cols(cns_df, cn_columns=None, print_info=False)
Renames copy number columns in the DataFrame.
- Parameters:
cns_df (pd.DataFrame) – The DataFrame containing copy number data.
cn_columns (list) – List of column names to be renamed.
print_info (bool) – Flag to indicate whether to print information.
- Returns:
pd.DataFrame – The DataFrame with renamed columns.
list – The updated list of column names.
cns.utils.conversions module
- cns.utils.conversions.bins_to_features(cns_df, cn_columns=None, drop_sex=True)
- cns.utils.conversions.breaks_to_segments(breakpoints)
- cns.utils.conversions.calc_cum_mid(cns_df, assembly=<cns.utils.assemblies.Assembly object>)
- cns.utils.conversions.calc_lengths(cns_df)
- cns.utils.conversions.calc_mid(cns_df)
- cns.utils.conversions.calc_nan_cols(cns_df, cn_columns=None)
- cns.utils.conversions.chrom_to_sortable(chrom, aut_count=22)
- cns.utils.conversions.cytobands_to_df(cytobands)
- cns.utils.conversions.gaps_to_df(gaps)
- cns.utils.conversions.genome_to_segments(assembly=<cns.utils.assemblies.Assembly object>)
- cns.utils.conversions.segments_to_breaks(segments)
- cns.utils.conversions.segments_to_cns_df(segments, sample_id='segment')
- cns.utils.conversions.sortable_to_chrom(sortable, aut_count=22)
- cns.utils.conversions.tuples_to_segments(tuples)
- cns.utils.conversions.values_count(values_dict)
cns.utils.files module
- cns.utils.files.fill_sex_if_missing(cns_df, samples_df)
Fills the sex column in the samples DataFrame if missing, based on the presence of chrY in the CNS data.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing the CNS data.
samples_df (pandas.DataFrame) – DataFrame containing the samples data.
- Returns:
Updated samples DataFrame with the sex column filled if missing.
- Return type:
pandas.DataFrame
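The chrY-based inference can be sketched without pandas. This is an illustration only: the function name is hypothetical, the labels "xy" and "xx" are assumed placeholders (the values the library writes are not documented here), and "NA" is taken as the missing marker because load_samples adds it when no sex column exists.

```python
def infer_missing_sex(segments, current_sex, chr_y="chrY"):
    """Fill "NA" sex entries from the presence of chr_y segments.

    segments: iterable of (sample_id, chrom) pairs.
    current_sex: dict mapping sample_id to a sex label or "NA".
    """
    has_y = {sample_id for sample_id, chrom in segments if chrom == chr_y}
    filled = {}
    for sample_id, sex in current_sex.items():
        if sex == "NA":
            # Assumed labels: "xy" if any chrY segment exists, else "xx".
            sex = "xy" if sample_id in has_y else "xx"
        filled[sample_id] = sex
    return filled


segs = [("s1", "chr1"), ("s1", "chrY"), ("s2", "chr1"), ("s3", "chrY")]
print(infer_missing_sex(segs, {"s1": "NA", "s2": "NA", "s3": "xx"}))
# {'s1': 'xy', 's2': 'xx', 's3': 'xx'}
```

Existing labels are left untouched, which matches the "if missing" wording of the description.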
- cns.utils.files.load_cns(path, cn_columns=None, sep=None, sort=False, change_coords=True, order_columns=False, assembly=<cns.utils.assemblies.Assembly object>, print_info=False)
Loads a CNS file into a pandas DataFrame. Loading includes a canonization process, in which the position column names are standardized to “sample_id”, “chrom”, “start”, “end”. CN columns are renamed to one of [“major_cn”, “minor_cn”], [“hap1_cn”, “hap2_cn”], [“total_cn”]. If these are not found, other typical column names are searched. If that fails, the first 4 columns are used as position and the following 1-2 as CNs. Coordinates are 1-based on input by default.
- Parameters:
path (str) – Path to the CNS file.
cn_columns (list of str, optional) – List of column names for copy number data. If None, columns are inferred from the file.
sep (str, optional) – Separator for the file. If None, the separator is inferred from the file extension.
sort (bool, optional) – If True, sorts the DataFrame by sample_id, chrom, and start.
change_coords (bool, optional) – If True, changes the coordinates to 0-based.
order_columns (bool, optional) – If True and there are two CN columns, the values are ordered as major/minor instead of their existing order, and the columns are renamed to major_cn/minor_cn.
assembly (object, optional) – Genome assembly to use. Default is hg19.
print_info (bool, optional) – If True, prints informational messages during processing.
- Returns:
DataFrame containing the CNS data.
- Return type:
pandas.DataFrame
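The effect of change_coords can be illustrated with a pair of helpers. The exact interval convention is an assumption: the sketch converts between 1-based closed intervals and 0-based half-open intervals (the BED convention), which only shifts the start coordinate. The helper names are hypothetical.

```python
def to_zero_based(start: int, end: int) -> tuple:
    """1-based closed [start, end] -> 0-based half-open [start - 1, end)."""
    return start - 1, end


def to_one_based(start: int, end: int) -> tuple:
    """0-based half-open [start, end) -> 1-based closed [start + 1, end]."""
    return start + 1, end


start0, end0 = to_zero_based(1, 100)
print(start0, end0)                # 0 100
print(end0 - start0)               # 100, the segment length is preserved
print(to_one_based(start0, end0))  # (1, 100)
```

Under this convention, loading with change_coords=True and saving with change_coords=True round-trips the coordinates.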
- cns.utils.files.load_samples(path, sep=None, print_info=False)
Loads a samples file into a pandas DataFrame. Loading includes a canonization process, in which the index column is set to “sample_id”. The column is found by exact or similar name; if it is not found, the first column is used. If a sex column is not found, it is added with value “NA”.
- Parameters:
path (str) – Path to the samples file.
sep (str, optional) – Separator for the file. If None, the separator is inferred from the file extension.
- Returns:
DataFrame containing the samples data.
- Return type:
pandas.DataFrame
- cns.utils.files.load_segments(path)
Loads segments (chrom, start, end, name) from a file into a list of tuples.
- Parameters:
path (str) – Path to the segments file.
- Returns:
List of segments.
- Return type:
list of tuples
- cns.utils.files.obtain_segments(segs_source, in_cols=None, assembly=<cns.utils.assemblies.Assembly object>, print_info=False)
- cns.utils.files.samples_df_from_cns_df(cns_df, fill_sex=True)
Creates a samples DataFrame (sample_id, sex) from a CNS DataFrame.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing the CNS data.
fill_sex (bool, optional) – If True, fills the sex column in the samples DataFrame based on the presence of chrY in the CNS data.
- Returns:
DataFrame containing the samples data.
- Return type:
pandas.DataFrame
- cns.utils.files.save_cns(cns_df, path, sep=None, sort=False, change_coords=True, mode='w')
Saves a CNS DataFrame to a file. Coordinates are 1-based on output by default.
- Parameters:
cns_df (pandas.DataFrame) – DataFrame containing the CNS data.
path (str) – Path to save the file.
sort (bool, optional) – If True, sorts the DataFrame by sample_id, chrom, and start.
change_coords (bool, optional) – If True, changes the coordinates to 1-based before saving.
mode (str, optional) – Mode to open the file. Default is “w” (write). For append, header is not printed.
- Return type:
None
- cns.utils.files.save_samples(samples_df, path, mode='w')
Saves a samples DataFrame to a file.
- Parameters:
samples_df (pandas.DataFrame) – DataFrame containing the samples data.
path (str) – Path to save the file.
mode (str, optional) – Mode to open the file. Default is “w” (write). For append, header is not printed.
- Return type:
None
- cns.utils.files.save_segments(segs, path)
Saves segments (chrom, start, end, name) to a 0-indexed BED file.
- Parameters:
segs (list of tuples) – List of segments to save.
path (str) – Path to save the file.
- Return type:
None
cns.utils.logging module
- cns.utils.logging.configure_worker_logging(queue, verbose)
Configure logging in a worker process.
- Parameters:
queue (multiprocessing.Queue) – Queue for sending log records to main process.
verbose (bool) – If True, sets logging level to INFO.
- cns.utils.logging.get_logger()
Get the CNS logger instance.
- cns.utils.logging.handle_exception(func)
Decorator to handle exceptions by logging and exiting.
Wraps a function to catch any exceptions, log them as errors, and terminate the program with exit code 1.
- Parameters:
func (callable) – The function to wrap.
- Returns:
The wrapped function.
- Return type:
callable
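A decorator with the described behaviour (log the exception, then exit with code 1) can be sketched as follows. The decorator name and logger name are hypothetical, and the message format is an assumption.

```python
import functools
import logging
import sys

logger = logging.getLogger("cns_sketch")  # hypothetical logger name


def exit_on_exception(func):
    """Wrap func so that any exception is logged as an error and the process exits with code 1."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            logger.error("%s failed: %s", func.__name__, exc)
            sys.exit(1)
    return wrapper


@exit_on_exception
def safe_divide(a, b):
    return a / b


print(safe_divide(6, 3))  # 2.0
```

Catching Exception rather than BaseException lets KeyboardInterrupt and SystemExit pass through unchanged.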
- cns.utils.logging.log_error(text)
Log an error message.
- Parameters:
text (str) – The error message to log.
- cns.utils.logging.log_info(text, suppress=False)
Log an info message.
The message will only be displayed if verbose mode is enabled via set_verbose(True).
- Parameters:
text (str) – The message to log.
suppress (bool, optional) – If True, suppresses the log message. Default is False.
- cns.utils.logging.log_warn(text, suppress=False)
Log a warning message.
- Parameters:
text (str) – The warning message to log.
suppress (bool, optional) – If True, suppresses the log message. Default is False.
- cns.utils.logging.set_verbose(verbose)
Set the logging level based on verbosity flag.
- Parameters:
verbose (bool) – If True, sets logging level to INFO. Otherwise, sets to WARNING.
- cns.utils.logging.setup_mp_logging()
Setup logging for multiprocessing.
Returns a queue that worker processes can use to send log records to the main process.
- Returns:
Queue for sending log records from workers to main process.
- Return type:
multiprocessing.Queue
- cns.utils.logging.stop_mp_logging()
Stop the multiprocessing logging listener.
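The queue-based pattern behind setup_mp_logging, configure_worker_logging, and stop_mp_logging can be sketched with the standard library's QueueHandler and QueueListener. This illustrates the general pattern, not the package's implementation. The helper and logger names are hypothetical, and the demo uses an in-process queue.Queue for brevity where real worker processes would share a multiprocessing.Queue.

```python
import logging
import logging.handlers
import queue


def start_listener(log_queue, *handlers):
    """Main process: forward records arriving on log_queue to the given handlers."""
    listener = logging.handlers.QueueListener(log_queue, *handlers)
    listener.start()
    return listener


def configure_worker(log_queue, verbose, logger_name="cns_sketch"):
    """Worker: route the logger's records into log_queue instead of local handlers."""
    logger = logging.getLogger(logger_name)
    logger.handlers = [logging.handlers.QueueHandler(log_queue)]
    logger.propagate = False
    logger.setLevel(logging.INFO if verbose else logging.WARNING)
    return logger


if __name__ == "__main__":
    log_queue = queue.Queue()
    listener = start_listener(log_queue, logging.StreamHandler())
    worker_log = configure_worker(log_queue, verbose=True)
    worker_log.info("message sent through the queue")
    listener.stop()  # flushes pending records and joins the listener thread
```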
- cns.utils.logging.suppress_errors(suppress=True)
Suppress error messages from being displayed.
- Parameters:
suppress (bool) – If True, sets logging level to CRITICAL (suppressing ERROR and below). If False, restores to WARNING level.
cns.utils.selection module
- cns.utils.selection.cn_not_nan(cns_df, cn_columns, any_col)
Filters copy number DataFrame for non-NaN values.
- Parameters:
cns_df (pd.DataFrame) – Copy number DataFrame
cn_columns (list) – Column names to check for NaN values
any_col (bool) – If True, keep rows with any non-NaN values, if False, keep rows with all non-NaN values
- Returns:
Filtered DataFrame with non-NaN values
- Return type:
pd.DataFrame
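The any/all distinction controlled by any_col can be shown on plain rows. This is a sketch with a hypothetical function name, using a list of dicts in place of a DataFrame.

```python
import math


def rows_not_nan(rows, cn_columns, any_col):
    """Keep rows with any (any_col=True) or all (any_col=False) non-NaN CN values."""
    check = any if any_col else all
    return [r for r in rows if check(not math.isnan(r[c]) for c in cn_columns)]


nan = float("nan")
rows = [
    {"id": "a", "major_cn": 2.0, "minor_cn": 1.0},
    {"id": "b", "major_cn": 2.0, "minor_cn": nan},
    {"id": "c", "major_cn": nan, "minor_cn": nan},
]
cols = ["major_cn", "minor_cn"]
print([r["id"] for r in rows_not_nan(rows, cols, any_col=True)])   # ['a', 'b']
print([r["id"] for r in rows_not_nan(rows, cols, any_col=False)])  # ['a']
```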
- cns.utils.selection.cns_head(cns_df, n=5)
- cns.utils.selection.cns_random(cns_df, n=5, seed=0)
- cns.utils.selection.cns_tail(cns_df, n=5)
- cns.utils.selection.dataframe_array_split(samples_df, n_splits)
Splits a DataFrame into n_splits parts as equally as possible.
- Parameters:
samples_df – The pandas DataFrame to split.
n_splits – The number of parts to split the DataFrame into.
- Returns:
A list of pandas DataFrame objects.
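Splitting "as equally as possible" usually means the part sizes differ by at most one. The sketch below follows the sizing rule of numpy.array_split, where the first len % n_splits parts receive one extra element. Whether the library uses that exact rule is an assumption, the function name is hypothetical, and a list stands in for the DataFrame rows.

```python
def split_evenly(items, n_splits):
    """Split items into n_splits consecutive parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_splits)
    parts, start = [], 0
    for i in range(n_splits):
        # The first `extra` parts take one additional element.
        size = base + (1 if i < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts


print(split_evenly(list(range(10)), 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```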
- cns.utils.selection.drop_Y(cns_df, assembly=<cns.utils.assemblies.Assembly object>, inplace=False)
Remove Y chromosome data from DataFrame.
- Parameters:
cns_df (pd.DataFrame) – Copy number DataFrame
assembly – Reference assembly defining Y chromosome
inplace – If True, modify DataFrame in place, otherwise make a copy
- Returns:
DataFrame without Y chromosome data
- Return type:
pd.DataFrame
- cns.utils.selection.get_chr_sets(cns_df, assembly=<cns.utils.assemblies.Assembly object>)
Groups chromosomes into autosomal and sex chromosome sets.
- Parameters:
cns_df (pd.DataFrame) – Copy number DataFrame with ‘chrom’ column
assembly – Reference assembly defining chromosome types
- Returns:
Dictionary with keys:
‘aut’: list of autosomal chromosomes
‘sex’: list of sex chromosomes (if present)
‘all’: combined list (if sex chromosomes present)
- Return type:
dict
- Raises:
ValueError – If no autosomal chromosomes found
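The grouping into ‘aut’, ‘sex’, and ‘all’ can be sketched as follows. The function name is hypothetical, a plain list stands in for the ‘chrom’ column, and preserving the order of first appearance is an assumption.

```python
def chr_sets(chroms, chr_x="chrX", chr_y="chrY"):
    """Group the chromosomes present in chroms into autosomal and sex sets."""
    sex_names = {chr_x, chr_y}
    present = list(dict.fromkeys(chroms))  # unique names, first-seen order
    aut = [c for c in present if c not in sex_names]
    if not aut:
        raise ValueError("No autosomal chromosomes found")
    sets = {"aut": aut}
    sex = [c for c in present if c in sex_names]
    if sex:
        # 'sex' and 'all' are only added when sex chromosomes are present.
        sets["sex"] = sex
        sets["all"] = aut + sex
    return sets


print(chr_sets(["chr1", "chr1", "chr2", "chrX"]))
# {'aut': ['chr1', 'chr2'], 'sex': ['chrX'], 'all': ['chr1', 'chr2', 'chrX']}
```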
- cns.utils.selection.only_aut(cns_df, assembly=<cns.utils.assemblies.Assembly object>, inplace=False)
Filter DataFrame for autosomes.
- Parameters:
cns_df (pd.DataFrame) – Copy number DataFrame
assembly – Reference assembly defining sex chromosomes
inplace – If True, modify DataFrame in place, otherwise make a copy
- Returns:
DataFrame with only autosomes.
- Return type:
pd.DataFrame
- cns.utils.selection.only_sex(cns_df, assembly=<cns.utils.assemblies.Assembly object>, inplace=False)
Filter DataFrame for sex chromosomes.
- Parameters:
cns_df (pd.DataFrame) – Copy number DataFrame
assembly – Reference assembly defining sex chromosomes
inplace – If True, modify DataFrame in place, otherwise make a copy
- Returns:
DataFrame with only sex chromosome entries
- Return type:
pd.DataFrame
- cns.utils.selection.sample_head(samples_df, n=5)
- cns.utils.selection.sample_random(samples_df, n=5, seed=0)
- cns.utils.selection.sample_tail(samples_df, n=5)
- cns.utils.selection.select_CNS_samples(cns_df, samples_df)
Select specific samples from copy number DataFrame.
- Parameters:
cns_df (pd.DataFrame) – Copy number DataFrame
samples_df (pd.DataFrame) – DataFrame containing sample IDs
- Returns:
DataFrame containing only specified samples
- Return type:
pd.DataFrame
- cns.utils.selection.select_cns_by_type(cns_df, samples_df, type_val, type_col='type')
Select copy number segments by their type.
- Parameters:
cns_df (pd.DataFrame) – Copy number DataFrame
samples_df (pd.DataFrame) – DataFrame containing sample IDs
type_val (str) – Type value to select
type_col (str) – Column name containing type information (default: “type”)
- Returns:
DataFrame filtered by specified type(s)
- Return type:
pd.DataFrame
- Raises:
KeyError – If type_col not found in DataFrame