Manual

Data Types

There are five main data formats used within CNSistent. The SCNA data are stored in cns_df. Sample metadata are stored in samples_df. BAM files are converted into segments and mapped per chromosome. Internally, breakpoints are used for collecting per-chromosome breakpoint positions. For reference, the Assembly class provides information about the genome.

cns_df

A pandas DataFrame with the following columns: sample_id, chrom, start, end, CN, .... The CN columns are the copy number values for each segment. The chrom column is expected to be in the format chr1, chr2, …, chrX, chrY, chrM. The start and end columns are 0-based coordinates.

  sample_id  chrom     start       end  CN1  CN2
0        s1  chr19         0  13000000    1    1
1        s1  chr19  13000000  59128983    3    1
2        s2  chr19         0  26500000    2    0
3        s2  chr19  26500000  59128983    0    0

samples_df

A pandas DataFrame indexed by sample_id, with a sex column. The sex column is expected to be xy or xx.

            sex
sample_id
s1          xx
s2          xy
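
Both example tables above can be rebuilt with plain pandas. The following is a minimal sketch that assumes only the column layout described in this section and the same half-open interval convention as segments; it does not call CNSistent itself.

```python
import pandas as pd

# Rebuild the example cns_df shown above (0-based coordinates).
cns_df = pd.DataFrame(
    {
        "sample_id": ["s1", "s1", "s2", "s2"],
        "chrom": ["chr19", "chr19", "chr19", "chr19"],
        "start": [0, 13_000_000, 0, 26_500_000],
        "end": [13_000_000, 59_128_983, 26_500_000, 59_128_983],
        "CN1": [1, 3, 2, 0],
        "CN2": [1, 1, 0, 0],
    }
)

# Matching sample metadata, indexed by sample_id.
samples_df = pd.DataFrame(
    {"sex": ["xx", "xy"]},
    index=pd.Index(["s1", "s2"], name="sample_id"),
)

# Assuming exclusive ends, a segment's length is simply end - start.
lengths = cns_df["end"] - cns_df["start"]
covered_per_sample = lengths.groupby(cns_df["sample_id"]).sum()
print(covered_per_sample)
```

In this example both samples cover the full length of chr19 used in the table.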

segments

A dictionary keyed by chromosome, where each value is a list of segments. Each segment is a triple of 0-based start and end coordinates and a string id.

{'chr19': [(0, 13000000, 'chr19_0'), (13000000, 26500000, 'chr19_1'), (26500000, 59128983, 'chr19_2')]}

breakpoints

A dictionary of lists with keys being chromosomes and each breakpoint being a 0-indexed position in the chromosome.

{'chr19': [0, 13000000, 26500000, 59128983]}
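
The relationship between the two structures can be illustrated with a short round trip. This is only a sketch: the function names below are hypothetical and are not the conversion functions shipped with CNSistent, and the reverse direction assumes the segments are contiguous.

```python
def segments_to_breakpoints(segments):
    """Collect the sorted, unique start/end positions per chromosome."""
    return {
        chrom: sorted({pos for start, end, _name in segs for pos in (start, end)})
        for chrom, segs in segments.items()
    }


def breakpoints_to_segments(breakpoints):
    """Turn consecutive breakpoint pairs into (start, end, name) triples."""
    return {
        chrom: [
            (start, end, f"{chrom}_{i}")
            for i, (start, end) in enumerate(zip(points, points[1:]))
        ]
        for chrom, points in breakpoints.items()
    }


segments = {
    "chr19": [
        (0, 13000000, "chr19_0"),
        (13000000, 26500000, "chr19_1"),
        (26500000, 59128983, "chr19_2"),
    ]
}
breakpoints = segments_to_breakpoints(segments)
print(breakpoints)  # {'chr19': [0, 13000000, 26500000, 59128983]}
round_trip = breakpoints_to_segments(breakpoints)
```

Note that turning breakpoints back into segments fills any gap between non-adjacent segments, so the round trip is lossless only for contiguous input.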

Assembly

Assembly is a class that provides information about the genome assembly. CNSistent currently supports the hg19 and hg38 assemblies. If you want to use a different assembly, you need to create a new Assembly object.

Note that sex chromosomes are always expected to be named chrX and chrY.

Assembly(name, chr_lens, chr_x, chr_y, gaps, cytobands)
  • chr_lens is a dictionary with chromosome names as keys and lengths as values.

  • chr_x, chr_y are the string ids for sex chromosomes. “chrX” and “chrY” are used by default.

  • gaps, cytobands are segment dictionaries for gaps and cytobands, respectively.

These can be None unless you use regions_select("bands") or regions_select("gaps").
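
To make the shape of the arguments concrete, the sketch below uses a hypothetical stand-in class, AssemblySketch, that mirrors the documented constructor arguments. It is not the CNSistent class: the None defaults for gaps and cytobands are an assumption, and the chromosome lengths are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Segment dictionary as described above: {chrom: [(start, end, name), ...]}
Segments = Dict[str, List[Tuple[int, int, str]]]


@dataclass
class AssemblySketch:
    """Hypothetical stand-in mirroring the documented constructor arguments."""

    name: str
    chr_lens: Dict[str, int]
    chr_x: str = "chrX"
    chr_y: str = "chrY"
    gaps: Optional[Segments] = None
    cytobands: Optional[Segments] = None


# Toy two-chromosome assembly; all values are illustrative only.
toy = AssemblySketch(
    name="toy",
    chr_lens={"chr1": 1000, "chrX": 500},
    gaps={"chr1": [(400, 450, "gap_0")]},
)
genome_length = sum(toy.chr_lens.values())
print(toy.name, genome_length)
```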

Pipelines

Pipelines are used internally to map the command line commands and arguments to functions, and thus correspond to Tool usage (CLI). Each command has a corresponding pipeline. In addition, two combined pipelines are provided as shorthands.

  • main_impute: Will align the missing regions and infer the missing values.

  • main_seg_agg: Will create and apply segmentation.

See the tutorial for examples of how to use the pipelines.

Segmentation

CNSistent operates over segments. Segments are dictionaries of tuples {chr: [(start, end, name), ...], ...}, where the start is inclusive, and the end is exclusive.

Note that you can pass longer tuples, but the result will discard the 4th and further elements.

The following functions can be used to manipulate segments:

  • split_segments: Will split into equidistant chunks based on specified size (useful for binning).

  • merge_segments: Will merge overlapping segments. Merging is also possible when end == start for two consecutive segments on the same chromosome. If the segments are not sorted, set sort=True to sort them first.

  • segment_union: Will merge segments from two lists of segments.

  • get_consecutive_segs: Having a list of segments, creates lists of consecutive segments.

  • segment_difference: Will remove regions from a list of segments found in another list of segments.

  • regions_select: A versatile function for creation of segments, see regions_select.

  • filter_min_size: Will remove segments strictly smaller than the specified size.
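
The merging and size-filtering rules above can be sketched in a few lines for a single chromosome. The functions below are illustrative re-implementations with hypothetical names, not the library code; in particular, keeping the name of the first segment after a merge is an arbitrary choice made for this sketch.

```python
def merge_segments_sketch(segs, sort=False):
    """Merge segments on one chromosome that overlap or touch (end == start)."""
    if sort:
        segs = sorted(segs, key=lambda s: (s[0], s[1]))
    merged = []
    for start, end, name in segs:
        if merged and start <= merged[-1][1]:
            prev_start, prev_end, prev_name = merged[-1]
            # Extend the previous segment; its name is kept (sketch assumption).
            merged[-1] = (prev_start, max(prev_end, end), prev_name)
        else:
            merged.append((start, end, name))
    return merged


def filter_min_size_sketch(segs, min_size):
    """Drop segments strictly smaller than min_size (length = end - start)."""
    return [s for s in segs if s[1] - s[0] >= min_size]


segs = [(0, 100, "a"), (100, 250, "b"), (300, 320, "c")]
merged = merge_segments_sketch(segs)
print(merged)                              # [(0, 250, 'a'), (300, 320, 'c')]
filtered = filter_min_size_sketch(merged, 50)
print(filtered)                            # [(0, 250, 'a')]
```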

Imputation

Functions for adding missing segments and values in the CNS data. The process is to first add missing regions with NaN values and then impute the missing values.

There are separate functions to fill the telomeres, fill the gaps, and add missing chromosomes.

If guessing values in imputation is not desired, the fill_nans_with_zeros function can be used to simply fill with 0 instead.
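
The two-step process (add missing regions as NaN, then fill them) can be sketched with pandas for a single sample and chromosome. The helper add_missing_regions is hypothetical and the zero fill uses pandas fillna rather than the library function, so treat this as an illustration of the idea only.

```python
import numpy as np
import pandas as pd


def add_missing_regions(df, chrom_len):
    """Insert NaN rows for uncovered stretches of one sample/chromosome."""
    df = df.sort_values("start")
    rows, pos = [], 0
    for seg in df.itertuples(index=False):
        if seg.start > pos:  # uncovered stretch before this segment
            rows.append({"start": pos, "end": seg.start, "CN": np.nan})
        rows.append({"start": seg.start, "end": seg.end, "CN": seg.CN})
        pos = max(pos, seg.end)
    if pos < chrom_len:  # uncovered stretch up to the chromosome end
        rows.append({"start": pos, "end": chrom_len, "CN": np.nan})
    return pd.DataFrame(rows)


# Toy data: the chromosome is 1000 bases long, with three uncovered stretches.
observed = pd.DataFrame({"start": [100, 500], "end": [300, 900], "CN": [2, 3]})
with_gaps = add_missing_regions(observed, chrom_len=1000)

# Simplest possible fill: replace missing CN values with 0 instead of guessing.
filled = with_gaps.fillna({"CN": 0})
print(filled)
```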

Clustering

Clustering merges neighbouring breakpoints. The breakpoints are merged using a greedy algorithm on a predefined region (usually a whole chromosome). Starting from the leftmost breakpoint, all breakpoints within the merge distance m are accumulated and a new breakpoint is created as their average. This is then repeated from the leftmost not yet merged breakpoint, until the end of the region is reached.

Clustering can either preserve endpoints (and only merge internal breakpoints), or also merge the endpoints. Example on list [0, 5, 7, 10, 16, 20] with merge distance 5:

[Figures: original clustering data; clustering with merged endpoints; clustering with preserved endpoints]

Note that when using aggregation with clustering, the endpoints are always preserved.
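
The greedy procedure described in this section can be sketched as follows. This is not the library implementation: the function name is hypothetical, and two details are assumptions, namely that the merge distance is measured from the leftmost breakpoint of each cluster and that the boundary is inclusive. Results may therefore differ from CNSistent in edge cases.

```python
def cluster_breakpoints(breakpoints, merge_dist, keep_ends=False):
    """Greedy left-to-right clustering of breakpoints on one region."""
    points = sorted(breakpoints)
    if keep_ends and len(points) >= 2:
        # Endpoints are set aside and only internal breakpoints are merged.
        first, inner, last = [points[0]], points[1:-1], [points[-1]]
    else:
        first, inner, last = [], points, []

    clustered, i = [], 0
    while i < len(inner):
        # Accumulate everything within merge_dist of the cluster's leftmost point.
        j = i
        while j < len(inner) and inner[j] - inner[i] <= merge_dist:
            j += 1
        group = inner[i:j]
        clustered.append(sum(group) / len(group))  # new breakpoint = average
        i = j
    return first + clustered + last


points = [0, 5, 7, 10, 16, 20]
merged = cluster_breakpoints(points, 5)
kept = cluster_breakpoints(points, 5, keep_ends=True)
print(merged)  # [2.5, 8.5, 18.0]
print(kept)    # endpoints 0 and 20 kept; internal points become about 7.33 and 16.0
```

Under these assumptions, merging the endpoints groups the list into {0, 5}, {7, 10} and {16, 20}, while preserving them groups the internal points into {5, 7, 10} and {16}.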

Aggregation

Aggregation will produce segments of a certain size, aggregating the copy number values of the segment chunks into a single segment.

There are the following aggregate functions: mean, min, max, and none. The none function will just split existing bins, without additional aggregation. This is useful if you want to introduce additional breakpoints into the data.

Aggregation can be done either using explicit segments, explicit breakpoints, or a breakpoint type (e.g. arms, 1000000).
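
As an illustration of what aggregating into a single segment means, the sketch below computes mean, min and max for one target bin. The function name and the (start, end, cn) input layout are hypothetical, and weighting the mean by overlap length is an assumption of this sketch rather than a documented property of the library.

```python
def aggregate_bin(segments_cn, bin_start, bin_end, how="mean"):
    """Aggregate CN values of (start, end, cn) segments overlapping one bin."""
    overlaps = []
    for start, end, cn in segments_cn:
        width = min(end, bin_end) - max(start, bin_start)
        if width > 0:  # the segment overlaps the half-open bin
            overlaps.append((width, cn))
    if not overlaps:
        return None
    if how == "mean":
        # Length-weighted mean over the covered part of the bin (assumption).
        return sum(w * cn for w, cn in overlaps) / sum(w for w, _ in overlaps)
    if how == "min":
        return min(cn for _, cn in overlaps)
    if how == "max":
        return max(cn for _, cn in overlaps)
    raise ValueError(f"unknown aggregate function: {how}")


sample = [(0, 600, 2), (600, 1000, 4)]
mean_cn = aggregate_bin(sample, 0, 1000, "mean")
print(mean_cn)                                 # 2.8
print(aggregate_bin(sample, 0, 1000, "min"))   # 2
print(aggregate_bin(sample, 0, 1000, "max"))   # 4
```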

Analyze

The analyze module calculates statistics for the CNS data.

  • coverage: Calculates the proportion of genome with assigned (not NaN) CN values.

  • ploidy: Calculates the proportion of genome with aneuploid CN values (different from 2, or different from 1 for male sex chromosomes).

  • breakage: Calculates signature-related statistics; currently it only calculates breakpoints per sample/chromosome.
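
The coverage and aneuploidy proportions can be sketched with pandas on toy data. This is an illustration, not the analyze module itself: it handles a single CN column on an autosome only, and dividing the aneuploid length by the whole genome length (rather than by the covered length) is an assumption.

```python
import numpy as np
import pandas as pd

cns_df = pd.DataFrame(
    {
        "sample_id": ["s1", "s1", "s1"],
        "chrom": ["chr1", "chr1", "chr1"],
        "start": [0, 400, 700],
        "end": [400, 700, 1000],
        "CN": [2.0, np.nan, 3.0],
    }
)
genome_length = 1000  # toy value; a real run would take this from the assembly

stats = cns_df.assign(length=cns_df["end"] - cns_df["start"])
has_cn = stats["CN"].notna()

# Length counted only where a CN value is assigned (not NaN).
stats["covered"] = stats["length"].where(has_cn, 0)
# Length counted only where the assigned CN differs from 2 (autosomes only).
stats["aneuploid"] = stats["length"].where(has_cn & (stats["CN"] != 2), 0)

per_sample = stats.groupby("sample_id")[["covered", "aneuploid"]].sum() / genome_length
print(per_sample)
```

For sample s1 this gives a coverage of 0.7 and an aneuploid proportion of 0.3.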

Plotting

Display functions are in three categories:

  • fig: A whole figure with labels.

  • plot: Takes an axis and plots on it.

  • labels: Takes an axis and adds background / ticks.

For the figures, the first parameter is always the cns_df, or a list thereof in the case of joint plots. There is one feature (line, dots, …) per sample_id and one plot per CN column.

The following optional parameters are available:

  • cn_columns: a string describing the column to plot or a list thereof. If none is specified, all columns matching the CN column pattern are used.

  • chrom: a string describing the chromosome to plot. If none is specified, all chromosomes are used.

  • size: Size of the feature of the plot - line/boundary width or dot size.

Examples can be found in the plotting notebook.

Utils

Utils contain the specification for the hg19, hg38 assemblies, including the gaps and cytobands. In addition, functions for files and data are provided:

Files

  • load_cns/save_cns: Load/Save CNS data from a TSV file, with optional header and sample_id. By default, this moves from 1-based to 0-based coordinates.

  • load_regions/save_regions: Load/Save regions from a TSV file, reading only the chrom, start, end columns. By default, this moves from 1-based to 0-based coordinates.

  • load_samples/save_samples: Load/Save samples from a TSV file. The first column is used as the “sample_id” index and should match the CNS sample names.

  • get_cn_columns: Get the CN columns from a CNS DataFrame (columns whose names start or end with CN or cn).
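
The coordinate shift and the CN column naming rule can be illustrated with plain pandas. This sketch does not use load_cns or get_cn_columns; the column names major_cn and minor_cn are made up, and the exact conversion (shifting only the start, so that 1-based inclusive intervals become 0-based with exclusive ends) is an assumption.

```python
import io

import pandas as pd

# A tiny in-memory TSV with 1-based coordinates (illustrative content only).
tsv = (
    "sample_id\tchrom\tstart\tend\tmajor_cn\tminor_cn\n"
    "s1\tchr19\t1\t13000000\t1\t1\n"
)
df = pd.read_csv(io.StringIO(tsv), sep="\t")

# Assumed convention: 1-based inclusive starts become 0-based, ends stay as-is.
df["start"] -= 1

# Columns whose names start or end with "CN" or "cn".
cn_columns = [
    c for c in df.columns
    if c.startswith(("CN", "cn")) or c.endswith(("CN", "cn"))
]
print(cn_columns)  # ['major_cn', 'minor_cn']
```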

Selection

Functions to select a set of samples from a cns_df (head, tail, random), and to filter chromosomes (autosomes, sex chromosomes…) and samples/CNS by type.

Conversions

Converts between CNS df, breakpoints, and segments.

Data Utils

  • Functions to load the datasets (PCAWG, TCGA, TRACERx) and gene sets (Ensembl, COSMIC).

  • Default filtering to remove samples from the datasets (low coverage, diploid, blacklisted, …)

  • Loading binned data / processed samples.