Repository Data Quickstart
This section is relevant only if you have cloned the Bitbucket repository and have access to the bundled data.
Otherwise, refer to the Tool usage (CLI) or Library usage (API) sections for how to use the CNSistent Python package.
Processing the bundled data
The cnsistent repository contains data from the PCAWG, TCGA, and TRACERx studies, which can be preprocessed and segmented with the bundled scripts as follows:
1. Start a bash-compatible shell.
2. Create the conda environment:
    conda env create -f ./cnsistent.yml
3. Activate the environment:
    conda activate cns
4. Install the package from its location:
    pip install -e .
5. Process the data:
    bash ./scripts/data_process.sh
6. Optional: aggregate the processed data:
    bash ./scripts/data_aggregate.sh
Accessing the bundled data
The data produced by the steps above can be accessed through the cns.data_utils module. For example, to load the imputed data:
from cns.data_utils import main_load
samples_df, cns_df = main_load("imp")
Both samples_df and cns_df are pandas DataFrames.
The former contains information about each sample as well as its statistics (e.g. ane_hom_all for homozygous aneuploidy across all chromosomes).
The latter contains the copy number segments for each sample in the form sample_id, chrom, start, end, major_cn, minor_cn, name, where name identifies each segment.
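As a minimal sketch of working with this column layout, the snippet below derives a few per-segment quantities with pandas. It builds a small toy frame instead of calling main_load, so the sample IDs, coordinates, and copy numbers are illustrative only. The derived column names (total_cn, length, loh) are hypothetical, and the length calculation assumes half-open [start, end) coordinates, which the text above does not confirm.

```python
import pandas as pd

# Toy segments in the documented column layout; all values are illustrative.
cns_df = pd.DataFrame(
    {
        "sample_id": ["S1", "S1", "S2"],
        "chrom": ["chr1", "chr1", "chr1"],
        "start": [0, 1_000_000, 0],
        "end": [1_000_000, 3_000_000, 3_000_000],
        "major_cn": [2, 1, 3],
        "minor_cn": [1, 0, 1],
        "name": ["seg_0", "seg_1", "seg_0"],
    }
)

# Total copy number per segment (hypothetical derived column).
cns_df["total_cn"] = cns_df["major_cn"] + cns_df["minor_cn"]

# Segment length, ASSUMING half-open [start, end) coordinates.
cns_df["length"] = cns_df["end"] - cns_df["start"]

# Flag segments where the minor allele is lost but the major allele remains.
cns_df["loh"] = (cns_df["minor_cn"] == 0) & (cns_df["major_cn"] > 0)

# Length-weighted mean total copy number per sample.
weighted_sum = (cns_df["total_cn"] * cns_df["length"]).groupby(cns_df["sample_id"]).sum()
total_length = cns_df.groupby("sample_id")["length"].sum()
mean_cn = weighted_sum / total_length
print(mean_cn)
```

With real data, the toy frame would be replaced by the cns_df returned from main_load.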
For example, to load CNs for the COSMIC genes, you can use the same function:
samples_df, cns_df = main_load("COSMIC")
cns_df.head()
would produce
  sample_id chrom    start      end  major_cn  minor_cn      name
0  SP101724  chr1  2160133  2241558         2         2       SKI
1  SP101724  chr1  2487077  2496821         2         2  TNFRSF14
2  SP101724  chr1  2985731  3355185         2         2    PRDM16
3  SP101724  chr1  6241328  6269449         2         2     RPL22
4  SP101724  chr1  6845383  7829766         2         2    CAMTA1
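Gene-level output like this is often easier to compare across samples as a samples-by-genes matrix. The following is a minimal sketch of that reshaping with pandas, not part of the package. It uses a toy frame with two hypothetical samples (S1, S2); the gene coordinates are copied from the output above, but the copy numbers are made up for illustration.

```python
import pandas as pd

# Toy gene-level segments in the same layout as the COSMIC output above.
# Sample IDs and copy numbers are illustrative, not real data.
gene_df = pd.DataFrame(
    {
        "sample_id": ["S1", "S1", "S2", "S2"],
        "chrom": ["chr1"] * 4,
        "start": [2160133, 2487077, 2160133, 2487077],
        "end": [2241558, 2496821, 2241558, 2496821],
        "major_cn": [2, 2, 3, 1],
        "minor_cn": [2, 2, 1, 0],
        "name": ["SKI", "TNFRSF14", "SKI", "TNFRSF14"],
    }
)

# Total copy number per gene (hypothetical derived column).
gene_df["total_cn"] = gene_df["major_cn"] + gene_df["minor_cn"]

# One row per sample, one column per gene.
matrix = gene_df.pivot(index="sample_id", columns="name", values="total_cn")
print(matrix)
```

Note that pivot raises an error if a sample has more than one row for the same gene name; pivot_table with an explicit aggregation function would be the alternative in that case.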
Alternatively, you can call:
- main_load to load only the samples,
- main_load("preprocess") to load the raw data,
- main_load("agg_type") to load the aggregated bins, if the aggregation has been done, where agg_type is one of: ["1MB", "2MB", "3MB", "5MB", "10MB", "250KB", "500KB", "whole", "arms", "bands", "COSMIC", "ENSEMBL"].
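Because "agg_type" is a placeholder for one of the listed names, a small guard can catch typos before any files are read. The wrapper below is a hypothetical helper, not part of cns.data_utils. It takes the loader as an argument so that it runs without the package installed; in the repository you would pass main_load.

```python
from typing import Any, Callable

# Aggregation names as listed in this quickstart.
AGG_TYPES = (
    "1MB", "2MB", "3MB", "5MB", "10MB", "250KB", "500KB",
    "whole", "arms", "bands", "COSMIC", "ENSEMBL",
)


def load_aggregated(agg_type: str, loader: Callable[[str], Any]) -> Any:
    """Hypothetical helper: validate the aggregation name, then call the loader."""
    if agg_type not in AGG_TYPES:
        raise ValueError(
            f"Unknown aggregation {agg_type!r}; expected one of {', '.join(AGG_TYPES)}"
        )
    return loader(agg_type)


# Intended use inside the repository (assumes the aggregation step has been run):
#   from cns.data_utils import main_load
#   samples_df, cns_df = load_aggregated("1MB", main_load)
```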
Notes
- By default, 16 threads are used. If that causes problems (crashes), reduce the number of threads in the data_process.sh and data_aggregate.sh scripts.
- You can also install the package with pip install ., but the utility functions for loading data in cns.data_utils.py will then not be accessible.
- Conda is optional. You can also install the required packages manually with pip, based on the list in [cnsistent.yml](./cnsistent.yml).