cns.data_utils module

Utility functions for loading and managing copy number segment (CNS) data.

This module provides functions to load, filter, and manage CNS data from different sources (PCAWG, TRACERx, TCGA). It includes utilities for handling samples, loading binned data, and managing file paths.

cns.data_utils.get_root_path()

Get the root path of the CNSistent package.

Returns:

Absolute path to the package root directory.

Return type:

str

cns.data_utils.load_COSMIC()

cns.data_utils.load_ENSEMBL()

cns.data_utils.load_cns_file(filename, print_info=False)

Load CNS data from the output directory.

Parameters:

filename (str) – Name of the file to load from the output directory.
print_info (bool, optional) – If True, print information about the loading process.

Returns:

DataFrame containing CNS data.

Return type:

pd.DataFrame

cns.data_utils.load_samples_file(filename, use_filter=False, print_info=False)

Load sample data from the output directory.

Parameters:

filename (str) – Name of the sample file to load.
use_filter (bool, optional) – Whether to filter samples based on coverage and aneuploidy.
print_info (bool, optional) – Whether to print progress information.

Returns:

DataFrame containing sample data.

Return type:

pd.DataFrame

cns.data_utils.load_segs_out(filename, print_info=False)

Load segment data from the output directory.

Parameters:

filename (str) – Name of the segment file to load.
print_info (bool, optional) – Whether to print progress information.

Returns:

DataFrame containing segment data.

Return type:

pd.DataFrame

cns.data_utils.main_load(segment_type, dataset='all', use_filter=True, concat=True, print_info=False)

Loads and filters samples from the specified dataset along with the imputed CNS data.

Parameters:

segment_type (str or None) – Segmentation of the CNS data to load, such as a bin size. If None, no CNS data is loaded. Options include: [“1MB”, “2MB”, “3MB”, “5MB”, “10MB”, “250KB”, “500KB”, “whole”, “arms”, “bands”, “COSMIC”, “ENSEMBL”].
dataset (str, optional) – Dataset to load. Options include: “PCAWG”, “TCGA”, “TRACERx”, or “all”. Default is “all”.
use_filter (bool, optional) – If True, filters samples based on coverage and aneuploidy. Default is True.
concat (bool, optional) – If True, concatenates samples and CNS data from multiple datasets, otherwise a named dictionary is returned. Default is True.
print_info (bool, optional) – If True, prints informational messages during processing. Default is False.

Returns:

A tuple containing two pandas DataFrames:

samples_df – DataFrame containing sample information and statistics.
cns_df – DataFrame containing the CNS data or binned data.

If concat is False, the DataFrames are returned in a dictionary keyed by dataset name instead.

Return type:

tuple

Examples

>>> samples_df, cns_df = main_load("imp")
>>> samples_df, cns_df = main_load("3MB", "PCAWG")
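
When concat is False, the exact return shape is not spelled out above. The sketch below assumes it is a pair of dictionaries keyed by dataset name, one for samples and one for CNS data. per_dataset is a hypothetical helper, not part of cns.data_utils, and the toy column names are invented for illustration. It lets downstream code iterate the same way over either return shape.

```python
import pandas as pd


def per_dataset(samples, cns):
    """Yield (name, samples_df, cns_df) for either return shape (hypothetical helper)."""
    if isinstance(samples, dict):
        # Assumed concat=False shape: one DataFrame per dataset name in each dict.
        for name in samples:
            yield name, samples[name], cns[name]
    else:
        # concat=True shape: a single pair of DataFrames covering all datasets.
        yield "all", samples, cns


# Toy stand-ins for what main_load might return with concat=False (assumption).
toy_samples = {
    "PCAWG": pd.DataFrame({"type": ["BRCA"]}),
    "TCGA": pd.DataFrame({"type": ["LUAD", "LUAD"]}),
}
toy_cns = {
    "PCAWG": pd.DataFrame({"cn": [2]}),
    "TCGA": pd.DataFrame({"cn": [2, 3]}),
}

# Collect the number of sample rows and CNS rows per dataset.
sizes = {name: (len(s), len(c)) for name, s, c in per_dataset(toy_samples, toy_cns)}
print(sizes)
```

With the real package, the call would be along the lines of per_dataset(*main_load("3MB", concat=False)), provided the assumption about the return shape holds.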

cns.data_utils.samples_above_threshold(samples_df, threshold=50)

Return the ‘type’ values from samples_df for every sample whose type occurs at least threshold times and is not ‘Other’.

Parameters:

samples_df (pd.DataFrame) – DataFrame containing a ‘type’ column.
threshold (int, optional) – Minimum number of occurrences for a type to be included. Default is 50.

Returns:

Series of ‘type’ values from the original DataFrame that meet the criteria.

Return type:

pd.Series
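
The behaviour described above can be approximated in plain pandas. This is a minimal sketch based only on the description, not the package's actual implementation; samples_above_threshold_sketch and the toy data are hypothetical.

```python
import pandas as pd


def samples_above_threshold_sketch(samples_df: pd.DataFrame, threshold: int = 50) -> pd.Series:
    """Hypothetical re-implementation of the behaviour described above."""
    # Count how often each value of the 'type' column occurs.
    type_counts = samples_df["type"].value_counts()
    # Keep types that are frequent enough and are not the catch-all 'Other'.
    frequent = type_counts[type_counts >= threshold].index.difference(["Other"])
    # Return the 'type' column restricted to the kept types, preserving the original index.
    return samples_df.loc[samples_df["type"].isin(frequent), "type"]


# Toy data: 3 BRCA, 2 LUAD, and 4 'Other' samples.
toy = pd.DataFrame(
    {"type": ["BRCA"] * 3 + ["LUAD"] * 2 + ["Other"] * 4},
    index=[f"s{i}" for i in range(9)],
)
kept = samples_above_threshold_sketch(toy, threshold=3)
print(kept.tolist())  # ['BRCA', 'BRCA', 'BRCA']
```

In this toy run, LUAD is dropped for being too rare and ‘Other’ is dropped despite being the most frequent value.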