Introduction
CNSistent is a Python tool for processing and analyzing copy number data from a variety of sources. It aims to be easy to use while providing a comprehensive set of analyses and visualizations. CNSistent can be used as a Python package, or downloaded together with the respective data (PCAWG, TRACERx, TCGA, genomic locations):
Installation links
Example of the API
Files used below were adapted from the TRACERx Zenodo archive.
1. Load CNS Data and Display Heatmap
Load CNS data from a CSV file and visualize the first 5 rows using a heatmap.
import cns
import cns.data_utils as cdu
samples_df, raw_df = cdu.main_load("raw", "TRACERx")
cns.fig_heatmap(cns.cns_head(raw_df, 5), max_cn=6)
2. Impute Missing Segments
Fill in missing segments in the data by imputing them with the extension method, then display a heatmap for the first 5 rows.
imp_df = cns.main_impute(raw_df, print_info=True)
cns.fig_heatmap(cns.cns_head(imp_df, 5), max_cn=6)
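To make the idea of gap filling concrete, here is a minimal sketch in plain Python. It is not CNSistent's implementation. It assumes that "extension" means closing each gap by stretching both neighboring segments to the midpoint of the gap, and stretching the first and last segment to the chromosome ends. The function name fill_gaps_by_extension and the tuple layout are hypothetical.

```python
def fill_gaps_by_extension(segments, chrom_length):
    """Close gaps on one chromosome by extending neighboring segments.

    segments: sorted, non-overlapping (start, end, cn) tuples using
    half-open coordinates [start, end). Assumed layout, not the CNSistent one.
    """
    if not segments:
        return []
    filled = []
    for i, (start, end, cn) in enumerate(segments):
        if i == 0:
            # The first segment is stretched back to the chromosome start.
            new_start = 0
        else:
            # Otherwise the gap to the previous segment is split at its midpoint.
            prev_end = segments[i - 1][1]
            new_start = (prev_end + start) // 2
        if i == len(segments) - 1:
            # The last segment is stretched forward to the chromosome end.
            new_end = chrom_length
        else:
            next_start = segments[i + 1][0]
            new_end = (end + next_start) // 2
        filled.append((new_start, new_end, cn))
    return filled


raw = [(100, 200, 1), (300, 500, 2), (500, 800, 3)]
imputed = fill_gaps_by_extension(raw, chrom_length=1000)
print(imputed)  # [(0, 250, 1), (250, 500, 2), (500, 1000, 3)]
```

After this step every position of the toy chromosome is covered by exactly one segment, which is the property the later aggregation relies on.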
3. Create 3 Mb Segments and Convert to a Feature Array
Aggregate the imputed CNS data into 3 Mb segments and convert it into a feature array.
segs = cns.main_segment(split_size=3_000_000)
seg_df = cns.main_aggregate(imp_df, segs, print_info=True)
features, rows, columns = cns.bins_to_features(seg_df)
print("Samples: {0}, Alleles: {1}, Bins: {2}.".format(*features.shape))
Printed output:
Samples: 403, Alleles: 2, Bins: 960.
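The aggregation step can be pictured with a small sketch in plain Python. It is not the library's code. It assumes that the value of a bin is the length-weighted mean copy number of the segments overlapping it. The function name aggregate_to_bins and the toy coordinates are hypothetical.

```python
def aggregate_to_bins(segments, bin_size, chrom_length):
    """Length-weighted mean copy number per fixed-size bin.

    segments: (start, end, cn) tuples with half-open coordinates [start, end).
    Bins that no segment overlaps are reported as NaN.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    weighted = [0.0] * n_bins
    covered = [0] * n_bins
    for start, end, cn in segments:
        first = start // bin_size
        last = (end - 1) // bin_size
        for b in range(first, last + 1):
            bin_start = b * bin_size
            bin_end = min(bin_start + bin_size, chrom_length)
            # Number of bases this segment contributes to bin b.
            overlap = min(end, bin_end) - max(start, bin_start)
            weighted[b] += cn * overlap
            covered[b] += overlap
    return [w / c if c else float("nan") for w, c in zip(weighted, covered)]


segments = [(0, 4, 2), (4, 9, 1), (9, 12, 3)]
profile = aggregate_to_bins(segments, bin_size=3, chrom_length=12)
print(profile)
```

Here the second bin spans one base with copy number 2 and two bases with copy number 1, so its value is 4/3. Stacking one such profile per sample and per allele gives a three-dimensional array with sample, allele, and bin axes.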
4. Group Segments by Cancer Type
Group the CNS data by cancer type, calculate the total CN, and visualize mean linear profiles.
type_groups = {c: cns.select_cns_by_type(seg_df, samples_df, c, "type") for c in ["LUAD", "LUSC"]}
groups_df = cns.stack_groups([cns.group_samples(v, group_name=k) for k, v in type_groups.items()])
cns.fig_lines(cns.add_total_cn(groups_df), cn_columns="total_cn")
The example code is also in example_API.py.
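As an illustration of what step 4 computes, the following sketch groups binned profiles by cancer type, sums the two alleles into a total copy number, and averages per bin. It is a plain-Python approximation under stated assumptions, not the CNSistent implementation. The function name mean_total_cn_by_group, the dictionary layout, and the toy values are hypothetical.

```python
from collections import defaultdict


def mean_total_cn_by_group(profiles, sample_types):
    """Mean total copy number per bin for every group.

    profiles: {sample_id: (allele_a_bins, allele_b_bins)}
    sample_types: {sample_id: group label}
    """
    grouped = defaultdict(list)
    for sample_id, (allele_a, allele_b) in profiles.items():
        # Total CN is the per-bin sum of the two allele-specific values.
        total = [a + b for a, b in zip(allele_a, allele_b)]
        grouped[sample_types[sample_id]].append(total)
    return {
        group: [sum(values) / len(values) for values in zip(*totals)]
        for group, totals in grouped.items()
    }


profiles = {
    "s1": ([1, 2, 2], [1, 1, 0]),
    "s2": ([2, 2, 3], [1, 0, 1]),
    "s3": ([1, 1, 1], [1, 1, 1]),
}
sample_types = {"s1": "LUAD", "s2": "LUAD", "s3": "LUSC"}
means = mean_total_cn_by_group(profiles, sample_types)
print(means)  # {'LUAD': [2.5, 2.5, 3.0], 'LUSC': [2.0, 2.0, 2.0]}
```

Each resulting list is one mean linear profile, that is, one line in a plot with one curve per cancer type.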
Example in terminal
CNSistent reads SCNA profiles as .tsv files. Consider an example file data.tsv:
sample_id chrom start end total_cn
sample1 chr1 100 200 1
...
Note
Column naming is fully described in the Input format section.
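Before running the commands, it can help to check that a file has the expected shape. The sketch below parses the single example row shown above using only the Python standard library. The validation rules (which columns are required, end greater than start) are assumptions made for this illustration, and read_segments is a hypothetical helper, not part of CNSistent.

```python
import csv
import io

# Columns taken from the example header above; treating all of them as
# required is an assumption of this sketch.
REQUIRED = ["sample_id", "chrom", "start", "end", "total_cn"]


def read_segments(handle):
    """Parse tab-separated segments and convert numeric columns to int."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for record in reader:
        start, end = int(record["start"]), int(record["end"])
        if end <= start:
            raise ValueError(f"empty or inverted segment: {record}")
        rows.append(
            {**record, "start": start, "end": end, "total_cn": int(record["total_cn"])}
        )
    return rows


example = "sample_id\tchrom\tstart\tend\ttotal_cn\nsample1\tchr1\t100\t200\t1\n"
rows = read_segments(io.StringIO(example))
print(rows[0])
```

For a file on disk, the same function can be called with an open file handle instead of the in-memory string.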
To preprocess the segments:
cns impute data.tsv --out imputed.tsv
To create statistics:
cns coverage data.tsv --out samples.tsv
cns ploidy imputed.tsv --samples samples.tsv --out samples.tsv
cns signatures imputed.tsv --samples samples.tsv --out samples.tsv
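To give a feel for the kind of per-sample statistics involved, here is a small sketch in plain Python. It assumes that coverage is the fraction of the genome covered by segments and that ploidy is the length-weighted mean copy number. Both definitions are assumptions for illustration and may differ from what the cns commands compute. The function names and toy values are hypothetical.

```python
def coverage_fraction(segments, genome_length):
    """Fraction of the genome covered by non-overlapping segments."""
    covered = sum(end - start for start, end, _ in segments)
    return covered / genome_length


def mean_ploidy(segments):
    """Length-weighted mean copy number over the covered bases."""
    total_length = sum(end - start for start, end, _ in segments)
    weighted = sum((end - start) * cn for start, end, cn in segments)
    return weighted / total_length


# (start, end, cn) tuples on a toy genome of 100 bases with one gap.
segments = [(0, 40, 2), (40, 60, 4), (80, 100, 1)]
print(coverage_fraction(segments, genome_length=100))  # 0.8
print(mean_ploidy(segments))  # 2.25
```

Weighting by length matters: a plain average of the three copy numbers would give about 2.33, overstating the contribution of the short segments.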
To calculate the mean ploidy per chromosome arm:
cns segment arms --out arms.bed
cns aggregate imputed.tsv --segments arms.bed --out a_bins.tsv
To conduct breakpoint clustering with 1 Mb distance:
cns segment imputed.tsv --merge 1000000 --out clust.bed
cns aggregate imputed.tsv --segments clust.bed --out c_bins.tsv
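The idea behind breakpoint clustering can be sketched as follows. This is a simple greedy scheme chosen for illustration, not necessarily the algorithm CNSistent uses. A breakpoint joins the current cluster when it lies within the merge distance of the previous breakpoint, and each cluster is then represented by its rounded mean position. The function name cluster_breakpoints and the toy positions are hypothetical.

```python
def cluster_breakpoints(breakpoints, merge_distance):
    """Merge nearby breakpoints on one chromosome into consensus positions."""
    clusters = []
    for pos in sorted(breakpoints):
        if clusters and pos - clusters[-1][-1] <= merge_distance:
            # Close enough to the previous breakpoint: same cluster.
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    # One consensus breakpoint per cluster.
    return [round(sum(c) / len(c)) for c in clusters]


breakpoints = [1_200_000, 1_900_000, 2_500_000, 9_000_000, 9_400_000, 20_000_000]
merged = cluster_breakpoints(breakpoints, merge_distance=1_000_000)
print(merged)  # [1866667, 9200000, 20000000]
```

Because each breakpoint is compared only with its predecessor, a chain of close breakpoints can form a cluster wider than the merge distance. A stricter variant would compare against the first member of the cluster instead.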
To conduct segmentation using 5 Mb bins:
cns segment whole --step 5000000 --out clust.bed
cns aggregate data.tsv --segments clust.bed --out c_bins.tsv
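For intuition, fixed-size binning amounts to tiling every chromosome with consecutive windows and truncating the last one at the chromosome end. The sketch below writes such windows as BED-style lines. The chromosome names and lengths are made up, fixed_bins is a hypothetical helper, and the actual segment files produced by the tool may be laid out differently.

```python
def fixed_bins(chrom_lengths, step):
    """Tile each chromosome with half-open windows of at most `step` bases."""
    bins = []
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, step):
            # The final window is truncated at the chromosome end.
            bins.append((chrom, start, min(start + step, length)))
    return bins


# Hypothetical chromosome lengths, for illustration only.
chrom_lengths = {"chrA": 12_000_000, "chrB": 7_000_000}
bins = fixed_bins(chrom_lengths, step=5_000_000)
bed_lines = ["\t".join(map(str, b)) for b in bins]
print("\n".join(bed_lines))
```

With these toy lengths the result has five windows, of which the last one on each chromosome is shorter than 5 Mb.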
An extended version of the CLI example is in example_CLI.py.