CNSistent

Introduction

CNSistent is a Python tool for processing and analyzing copy number data from a variety of sources. It aims to be easy to use while providing a comprehensive set of analyses and visualizations. CNSistent can be used as a Python package, or downloaded together with the respective data (PCAWG, TRACERx, TCGA, genomic locations).

Example of the API

The files used below were adapted from the TRACERx Zenodo archive.

1. Load CNS Data and Display Heatmap

Load CNS data from a CSV file and visualize the first 5 rows using a heatmap.

import cns
import cns.data_utils as cdu
samples_df, raw_df = cdu.main_load("raw", "TRACERx")
cns.fig_heatmap(cns.cns_head(raw_df, 5), max_cn=6)

[Figure: Raw Data Heatmap]

2. Impute Missing Segments

Fill in missing segments in the data, impute using the extension method, and display a heatmap for the first 5 rows.

imp_df = cns.main_impute(raw_df, print_info=True)
cns.fig_heatmap(cns.cns_head(imp_df, 5), max_cn=6)

[Figure: Imputed Data Heatmap]
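
To give an intuition for what extension-based imputation does, the sketch below fills gaps on a single chromosome by stretching neighbouring segments until they meet at the midpoint of each gap, and by extending the first and last segments to the chromosome ends. It is a standalone illustration under these assumptions, not the code behind `cns.main_impute`; the helper `extend_segments` and the `(start, end, cn)` tuple layout are hypothetical.

```python
def extend_segments(segments, chrom_length):
    """Fill gaps between non-overlapping (start, end, cn) segments.

    Hypothetical helper for illustration only, not part of the cns package.
    Each gap is split at its midpoint between the two neighbouring segments;
    the first and last segments are stretched to the chromosome ends.
    """
    if not segments:
        return []
    segments = sorted(segments)
    filled = []
    prev_end = 0  # position up to which the chromosome is already covered
    for i, (start, end, cn) in enumerate(segments):
        new_start = prev_end
        if i + 1 < len(segments):
            next_start = segments[i + 1][0]
            # Extend halfway into the gap towards the next segment
            new_end = end + (next_start - end) // 2
        else:
            # Last segment: extend to the end of the chromosome
            new_end = chrom_length
        filled.append((new_start, new_end, cn))
        prev_end = new_end
    return filled


# Toy segments with a gap before 100, between 200 and 300, and after 900
example = [(100, 200, 1), (300, 500, 2), (500, 900, 3)]
print(extend_segments(example, chrom_length=1000))
# [(0, 250, 1), (250, 500, 2), (500, 1000, 3)]
```

After extension every position on the chromosome belongs to exactly one segment, which is what makes the later binning step well defined.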

3. Create 3 Mb Segments and Convert to a Feature Array

Aggregate the imputed CNS data into 3 Mb segments and convert it into a feature array.

segs = cns.main_segment(split_size=3_000_000)
seg_df = cns.main_aggregate(imp_df, segs, print_info=True)
features, rows, columns = cns.bins_to_features(seg_df)
print("Samples: {0}, Alleles: {1}, Bins: {2}.".format(*features.shape))

Printed output:

Samples: 403, Alleles: 2, Bins: 960.
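
The aggregation step can be pictured as follows: the chromosome is cut into fixed-size bins and each bin receives the length-weighted mean copy number of the segments overlapping it. The sketch below uses only the standard library and a single chromosome; `make_bins` and `aggregate` are hypothetical names, and the length-weighted mean is an assumption about the aggregation function, not a description of `cns.main_aggregate`.

```python
def make_bins(chrom_length, bin_size):
    """Return consecutive (start, end) bins covering [0, chrom_length)."""
    return [(s, min(s + bin_size, chrom_length))
            for s in range(0, chrom_length, bin_size)]


def aggregate(segments, bins):
    """Length-weighted mean copy number of (start, end, cn) segments per bin.

    Bins without any overlapping segment get None.
    """
    values = []
    for b_start, b_end in bins:
        covered = 0      # number of bases in the bin covered by segments
        weighted = 0.0   # sum of overlap length times copy number
        for s_start, s_end, cn in segments:
            overlap = min(b_end, s_end) - max(b_start, s_start)
            if overlap > 0:
                covered += overlap
                weighted += overlap * cn
        values.append(weighted / covered if covered else None)
    return values


# Toy chromosome of 9 Mb with one copy number change at 4 Mb
segments = [(0, 4_000_000, 2), (4_000_000, 9_000_000, 4)]
bins = make_bins(chrom_length=9_000_000, bin_size=3_000_000)
print(bins)
print(aggregate(segments, bins))
```

The middle bin straddles the change at 4 Mb, so its value lies between 2 and 4, weighted by how much of the bin each segment covers.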

4. Group Segments by Cancer Type

Group the CNS data by cancer type, calculate the total CN, and visualize mean linear profiles.

type_groups = {c: cns.select_cns_by_type(seg_df, samples_df, c, "type") for c in ["LUAD", "LUSC"]}
groups_df = cns.stack_groups([cns.group_samples(v, group_name=k) for k, v in type_groups.items()])
cns.fig_lines(cns.add_total_cn(groups_df), cn_columns="total_cn")

[Figure: Grouped Data Heatmap]
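
Conceptually, the grouping step sums the allele-specific copy numbers into a total CN per bin and then averages the binned profiles of all samples that share a cancer type. Here is a minimal standard-library sketch of that idea on toy data; `mean_profiles`, the sample names, and the `(major, minor)` layout are hypothetical and only illustrate the computation, not the internals of `cns.group_samples`.

```python
from collections import defaultdict
from statistics import mean


def mean_profiles(bins_by_sample, type_by_sample):
    """Average per-bin total CN over all samples sharing a cancer type."""
    grouped = defaultdict(list)
    for sample, profile in bins_by_sample.items():
        grouped[type_by_sample[sample]].append(profile)
    # zip(*profiles) walks through the grouped profiles bin by bin
    return {t: [mean(col) for col in zip(*profiles)]
            for t, profiles in grouped.items()}


# Toy data: (major, minor) copy number for three bins per sample
alleles = {
    "s1": [(1, 1), (2, 1), (2, 0)],
    "s2": [(1, 1), (2, 2), (1, 1)],
    "s3": [(3, 1), (1, 1), (1, 0)],
}
types = {"s1": "LUAD", "s2": "LUAD", "s3": "LUSC"}

# Total CN is the sum of both alleles in each bin
total_cn = {s: [a + b for a, b in prof] for s, prof in alleles.items()}
print(mean_profiles(total_cn, types))
# {'LUAD': [2, 3.5, 2], 'LUSC': [4, 2, 1]}
```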

The example code is also in example_API.py.

Example in terminal

CNSistent reads SCNA profiles from .tsv files. Consider an example file data.tsv:

sample_id    chrom   start   end     total_cn
sample1      chr1    100     200     1
...

Note

Column naming is fully described in the Input format section.
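
To check that a file follows this layout before passing it to the tool, it can be parsed with the standard library. The sketch below reads tab-separated text with the columns shown above and rejects empty or inverted segments; `read_segments` is a hypothetical helper, and the second data row is made-up toy data.

```python
import csv
import io

# Stand-in for the contents of data.tsv (tab-separated);
# the second data row is invented for illustration
tsv_text = (
    "sample_id\tchrom\tstart\tend\ttotal_cn\n"
    "sample1\tchr1\t100\t200\t1\n"
    "sample1\tchr1\t200\t450\t3\n"
)


def read_segments(handle):
    """Parse segments into (sample_id, chrom, start, end, total_cn) tuples."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        start, end = int(row["start"]), int(row["end"])
        if end <= start:
            raise ValueError(f"empty or inverted segment: {row}")
        rows.append((row["sample_id"], row["chrom"], start, end,
                     int(row["total_cn"])))
    return rows


segments = read_segments(io.StringIO(tsv_text))
print(segments)
```

For a real file, replace the `io.StringIO` object with `open("data.tsv", newline="")`.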

To preprocess the segments:

cns impute data.tsv --out imputed.tsv

To create statistics:

cns coverage data.tsv --out samples.tsv
cns ploidy imputed.tsv --samples samples.tsv --out samples.tsv
cns signatures imputed.tsv --samples samples.tsv --out samples.tsv
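
As a rough guide to what such per-sample statistics mean, the sketch below computes two common quantities for a single sample: coverage as the fraction of the genome covered by segments, and ploidy as the length-weighted mean copy number. These definitions are assumptions for illustration and may differ from what the `cns coverage` and `cns ploidy` commands report; the function names are hypothetical.

```python
def coverage(segments, genome_length):
    """Fraction of the genome covered by (start, end, cn) segments."""
    return sum(end - start for start, end, _ in segments) / genome_length


def ploidy(segments):
    """Length-weighted mean copy number over the covered regions."""
    total_len = sum(end - start for start, end, _ in segments)
    return sum((end - start) * cn for start, end, cn in segments) / total_len


# Toy sample on a 1000 bp genome with a 100 bp gap
segments = [(0, 400, 2), (500, 1000, 4)]
print(coverage(segments, genome_length=1000))  # 0.9
print(ploidy(segments))
```

Coverage is computed on the raw segments because imputation fills all gaps, while ploidy is better defined once the gaps are filled.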

To calculate the mean ploidy per chromosome arm:

segment arms --out arms.bed
cns aggregate imputed.tsv --segments arms.bed --out a_bins.tsv

To conduct breakpoint clustering with 1 Mb distance:

segment imputed.tsv --merge 1000000 --out clust.bed
cns aggregate imputed.tsv  --segments clust.bed --out c_bins.tsv
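
One simple way to think about breakpoint clustering is single-linkage merging: breakpoints are sorted, consecutive breakpoints closer than the merge distance fall into the same cluster, and each cluster is replaced by one representative position. The sketch below implements that idea for a single chromosome. It is an assumption about the general technique, not the algorithm used by the `segment` command, and both function names are hypothetical.

```python
def cluster_breakpoints(breakpoints, max_dist):
    """Group breakpoints whose gap to the previous one is <= max_dist and
    return one representative position (the rounded mean) per group."""
    clusters = []
    for pos in sorted(breakpoints):
        if clusters and pos - clusters[-1][-1] <= max_dist:
            clusters[-1].append(pos)   # close enough: join current cluster
        else:
            clusters.append([pos])     # too far: start a new cluster
    return [round(sum(c) / len(c)) for c in clusters]


def to_segments(positions, chrom_length):
    """Turn breakpoint positions into consecutive (start, end) segments."""
    edges = [0] + [p for p in positions if 0 < p < chrom_length] + [chrom_length]
    return list(zip(edges[:-1], edges[1:]))


# Toy breakpoints pooled from several samples on a 10 Mb chromosome
bps = [1_200_000, 1_900_000, 2_500_000, 7_000_000, 7_400_000]
merged = cluster_breakpoints(bps, max_dist=1_000_000)
print(merged)
# [1866667, 7200000]
print(to_segments(merged, chrom_length=10_000_000))
# [(0, 1866667), (1866667, 7200000), (7200000, 10000000)]
```

With single linkage a chain of nearby breakpoints can produce a cluster wider than the merge distance, which is worth keeping in mind when choosing the threshold.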

To conduct segmentation using 5 Mb bins:

segment whole --step 5000000 --out clust.bed
cns aggregate data.tsv  --segments clust.bed --out c_bins.tsv

An extension of this example is in example_CLI.py.

LICENSE