Repository Data Quickstart

This section is relevant only if you have cloned the Bitbucket repository and have access to the bundled data.

Otherwise, refer to the Tool usage (CLI) or Library usage (API) sections for how to use the CNSistent Python package.

Processing the bundled data

The cnsistent repository contains data from the PCAWG, TCGA, and TRACERx studies, which can be preprocessed and segmented with the bundled scripts using the following steps:

  1. Start a bash-compatible shell.

  2. Create the conda environment: conda env create -f ./cnsistent.yml

  3. Activate the environment: conda activate cns

  4. Install the package in editable mode from the current directory: pip install -e .

  5. Process data: bash ./scripts/data_process.sh

  6. Optional: Aggregate processed data: bash ./scripts/data_aggregate.sh

Accessing the bundled data

The data produced by the steps above can be accessed through the cns.data_utils module. For example, to load the imputed data:

from cns.data_utils import main_load
samples_df, cns_df = main_load("imp")

Both samples_df and cns_df are Pandas dataframes. The former contains information about each sample as well as its statistics (e.g. ane_hom_all for homozygous aneuploidy across all chromosomes). The latter contains the copy number segments for each sample, with the columns sample_id, chrom, start, end, major_cn, minor_cn, and name, where name identifies each segment. The same function also loads aggregated data, for example the CNs for the COSMIC genes:

samples_df, cns_df = main_load("COSMIC")
cns_df.head()

This would produce:

  sample_id chrom start    end     major_cn  minor_cn name
0 SP101724  chr1  2160133  2241558 2         2        SKI
1 SP101724  chr1  2487077  2496821 2         2        TNFRSF14
2 SP101724  chr1  2985731  3355185 2         2        PRDM16
3 SP101724  chr1  6241328  6269449 2         2        RPL22
4 SP101724  chr1  6845383  7829766 2         2        CAMTA1
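Because cns_df is a regular Pandas dataframe, the usual Pandas operations apply to it. As a minimal sketch, the snippet below rebuilds the five rows shown above by hand, so that it runs without the bundled data, and then derives a per-segment length and total copy number. The column names total_cn and length are illustrative and not part of the package output. The sketch also assumes that major_cn and minor_cn sum to the total copy number and that end - start is a meaningful segment length; neither is stated in this document.

```python
import pandas as pd

# Hand-built copy of the rows shown above; with the bundled data available,
# cns_df would come from main_load("COSMIC") instead.
cns_df = pd.DataFrame(
    {
        "sample_id": ["SP101724"] * 5,
        "chrom": ["chr1"] * 5,
        "start": [2160133, 2487077, 2985731, 6241328, 6845383],
        "end": [2241558, 2496821, 3355185, 6269449, 7829766],
        "major_cn": [2] * 5,
        "minor_cn": [2] * 5,
        "name": ["SKI", "TNFRSF14", "PRDM16", "RPL22", "CAMTA1"],
    }
)

# Illustrative derived columns (hypothetical names, not produced by CNSistent).
cns_df["total_cn"] = cns_df["major_cn"] + cns_df["minor_cn"]
cns_df["length"] = cns_df["end"] - cns_df["start"]

# Select the segment for a single gene by its name.
ski = cns_df[cns_df["name"] == "SKI"]
print(ski[["sample_id", "chrom", "total_cn", "length"]])
```

The same filtering works on the full dataframe returned by main_load, where each gene name appears once per sample.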

Alternatively, you can call:

  • main_load() to load only the samples,

  • main_load("preprocess") to load the raw data,

  • main_load("agg_type") to load the aggregated bins (only if the aggregation step has been run), where "agg_type" is one of: ["1MB", "2MB", "3MB", "5MB", "10MB", "250KB", "500KB", "whole", "arms", "bands", "COSMIC", "ENSEMBL"].
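Since the aggregation name is passed as a plain string, a typo such as "1Mb" is easy to make. The sketch below checks a name against the list above before it is handed to main_load. The names AGG_TYPES and check_agg_type are hypothetical helpers written for this example, not part of cns.data_utils, and how main_load itself reacts to an unknown name is not covered here.

```python
# Aggregation names as listed above.
AGG_TYPES = [
    "1MB", "2MB", "3MB", "5MB", "10MB", "250KB", "500KB",
    "whole", "arms", "bands", "COSMIC", "ENSEMBL",
]


def check_agg_type(name):
    """Return name unchanged if it is a listed aggregation, else raise ValueError."""
    if name not in AGG_TYPES:
        raise ValueError(f"Unknown aggregation {name!r}; expected one of {AGG_TYPES}")
    return name


# Intended use, once the bundled data has been processed and aggregated:
#   from cns.data_utils import main_load
#   samples_df, cns_df = main_load(check_agg_type("1MB"))
print(check_agg_type("arms"))
```

The comparison is case-sensitive, so "cosmic" is rejected while "COSMIC" is accepted.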

Notes

  • By default, 16 threads are used. If that causes problems (e.g. crashes), reduce the number of threads in the data_process.sh and data_aggregate.sh scripts.

  • You can also install the package with pip install . instead; however, the utility functions for loading data in cns.data_utils will then not be accessible.

  • Conda is optional. You can also install the required packages manually with pip, based on the list in ./cnsistent.yml.