[1]:

# auto reload modules
%load_ext autoreload
%autoreload 2

Examples: Data Processing

This is a trivial example of how to create data for CNSistent, fill missing segments, impute missing values, create consistent segmentation, aggregate the values and plot the results.

We will go through the following steps:

Create data
Fill missing segments
Infer missing values
Create consistent segmentation
Aggregate the values

[2]:

%load_ext autoreload
%autoreload 2

import io
import pandas as pd

# Import the cns module
import cns

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Create data

Firs we create two SCNA profiles for samples s1, s2 with copy number columns CN1, CN2.

For the tutorial we will limit ourselves to chr19.

[3]:

cns_source = """sample_id, chrom, start, end, CN1, CN2
s1, chr19, 1000000, 3000000, 1,
s1, chr19, 3000000, 12000000, 1, 1
s1, chr19, 12000000, 14000000, , 1
s1, chr19, 14000000, 21000000, 3, 1
s1, chr19, 21000000, 25000000, 3,
s1, chr19, 28000000, 58500000, 3,
s2, chr19, 1000000, 24000000, 2,
s2, chr19, 29000000, 58000000, 0,
"""
raw_cns_df = pd.read_csv(io.StringIO(cns_source.strip()), sep=',\s*', engine='python')
raw_cns_df.head()

[3]:

	sample_id	chrom	start	end	CN1	CN2
0	s1	chr19	1000000	3000000	1.0	NaN
1	s1	chr19	3000000	12000000	1.0	1.0
2	s1	chr19	12000000	14000000	NaN	1.0
3	s1	chr19	14000000	21000000	3.0	1.0
4	s1	chr19	21000000	25000000	3.0	NaN

We plot heatmaps of the above values, note that the positions are on the linear genome.

[4]:

cns.fig_heatmap(raw_cns_df)

[4]:

(<Figure size 300x300 with 2 Axes>,
 array([<Axes: xlabel='CN1'>, <Axes: xlabel='CN2'>], dtype=object))

For real datasets we would need to limit ourselves to a subset of a data for visualization purposes, e.g.:

[5]:

cns.fig_heatmap(cns.cns_head(raw_cns_df, 1))

[5]:

(<Figure size 300x300 with 2 Axes>,
 array([<Axes: ylabel='CN1'>,
        <Axes: xlabel='position on the linear genome', ylabel='CN2'>],
       dtype=object))

Coverage and filling

Next we will analyze missing segments and fill them with the average value of the surrounding segments.

Since we are only focusing on chr19, we create a mask for chr19 only.

[6]:

chr19_seg = cns.make_segments("chr19")
print(chr19_seg)

{'chr19': [(0, 59128983, 'chr19')]}

First we calculate the proportion of missing segments:

[7]:

cover_df = cns.main_coverage(raw_cns_df, segs=chr19_seg)
cover_df.head()

[7]:

	sex	chrom_count	chrom_missing	cover_any_aut	cover_any_sex	cover_any_all	cover_both_aut	cover_both_sex	cover_both_all
sample_id
s1	xx	1	[]	0.921714	0	0.921714	0.270595	0	0.270595
s2	xx	1	[]	0.879433	0	0.879433	0.000000	0	0.000000

Then we fill the missing segments with NaN values:

[8]:

chr19_seg["chr19"] = cns.genome_to_segments()["chr19"]
align_cns_df = cns.main_align(raw_cns_df, segs=chr19_seg)
align_cns_df.head()

[8]:

	sample_id	chrom	start	end	CN1	CN2	name
0	s1	chr19	0	1000000	NaN	NaN	chr19
1	s1	chr19	1000000	3000000	1.0	NaN	chr19
2	s1	chr19	3000000	12000000	1.0	1.0	chr19
3	s1	chr19	12000000	14000000	NaN	1.0	chr19
4	s1	chr19	14000000	21000000	3.0	1.0	chr19

When plotting we can see that the data are now extended to the boundaries of chromosome 19

[9]:

cns.fig_heatmap(align_cns_df, cn_columns=['CN1', 'CN2']);

Imputation and aneuploidy

Next we fill values into the missing segments.

Leading and trailing NaN values are filled with the first and last non-NaN value, respectively.

The remaining NaN values are filled with the mean of the non-NaN values.

[10]:

imp_cns_df = cns.main_infer(align_cns_df)
imp_cns_df.head()

[10]:

	sample_id	chrom	start	end	CN1	CN2
0	s1	chr19	0	13000000	1.0	1.0
1	s1	chr19	13000000	59128983	3.0	1.0
2	s2	chr19	0	26500000	2.0	0.0
3	s2	chr19	26500000	59128983	0.0	0.0

We can see that the number of segments decreased considerably, to only 2 actual breakpoints, which we can see on the imputed plot.

[11]:

cns.fig_heatmap(imp_cns_df, cn_columns=['CN1', 'CN2']);

We expect that the imputed data are more representative of the actual genome, therefore we preferably analyze the imputed data further.

[26]:

ploidy_df = cns.main_ploidy(imp_cns_df)
ploidy_df.head()

[26]:

	sex	loh_both_aut	loh_any_aut	ane_any_aut	ane_both_aut	imb_CN1_aut	imb_CN2_aut	ploidy_CN1	ploidy_CN2
sample_id
s1	xx	0.000000	0.000000	0.016011	0.000000	0.016011	0.0	2.560283	1.0
s2	xx	0.011325	0.020524	0.020524	0.020524	0.009198	0.0	0.896346	0.0

Similarly, we can calculate information about breakpoints. Note that the “total_cn” column is added to the data for the calculation:

[28]:

breakage_df = cns.main_breakage(imp_cns_df)
breakage_df.head()

[28]:

	sex	breaks_CN1_aut	step_CN1_aut	breaks_CN2_aut	step_CN2_aut	breaks_total_cn_aut	step_total_cn_aut
sample_id
s1	xx	1	2.0	0	0	1	2.0
s2	xx	1	2.0	0	0	1	2.0

Segment and aggregate

First we create a consistent segmentation of the data. Since there are two breakpoints, we will have three segments.

[31]:

mcs_segs = cns.cns_df_to_segments(imp_cns_df, process="unify")
mcs_segs

[31]:

{'chr19': [(0, 13000000, 'chr19_0'),
  (13000000, 26500000, 'chr19_1'),
  (26500000, 59128983, 'chr19_2')]}

Using the segmentation, we can aggregate the values of the segments.

[32]:

mcs_agg_df = cns.main_aggregate(imp_cns_df, mcs_segs)
mcs_agg_df

[32]:

	sample_id	chrom	start	end	CN1	CN2	name
0	s1	chr19	0	13000000	1.0	1.0	chr19_0
1	s1	chr19	13000000	26500000	3.0	1.0	chr19_1
2	s1	chr19	26500000	59128983	3.0	1.0	chr19_2
3	s2	chr19	0	13000000	2.0	0.0	chr19_0
4	s2	chr19	13000000	26500000	2.0	0.0	chr19_1
5	s2	chr19	26500000	59128983	0.0	0.0	chr19_2

We then use fig_bars as it shows the exact segmentation points.

[33]:

cns.fig_bars(mcs_agg_df);

Splitting and filtering

Here we further explore the segmentation by removing regions of low mappability (gaps) and splitting the segments into smaller ones.

First we load the gap regions.

[34]:

gap_segs = cns.make_segments("gaps")
gap_segs["chr19"]

[34]:

[(0, 10000, 'telomere'),
 (10000, 60000, 'contig'),
 (7346004, 7396004, 'contig'),
 (8687198, 8737198, 'contig'),
 (20523415, 20573415, 'clone'),
 (24631782, 24681782, 'heterochromatin'),
 (24681782, 27681782, 'centromere'),
 (27681782, 27731782, 'heterochromatin'),
 (59118983, 59128983, 'telomere')]

We then conduct the segmentation with the following:

10 kb filtering will ignore gaps below this size
applying the gaps will remove the regions that overlap with the gaps
the remainder is split into segments of 3 mb

[35]:

complex_segs = cns.main_segment(mcs_segs, remove_segs=gap_segs, split_size=3*10**6, filter_size=10**5)
complex_segs["chr19"][:5]

[35]:

[(0, 3250000, 'chr19_0_0'),
 (3250000, 6500000, 'chr19_0_1'),
 (6500000, 9750000, 'chr19_0_2'),
 (9750000, 13000000, 'chr19_0_3'),
 (13000000, 15907946, 'chr19_1_0')]

We again aggregated the CNs into the segments.

[36]:

complex_agg_df = cns.main_aggregate(imp_cns_df, complex_segs)
complex_agg_df.head()

[36]:

	sample_id	chrom	start	end	CN1	CN2	name
0	s1	chr19	0	3250000	1.0	1.0	chr19_0_0
1	s1	chr19	3250000	6500000	1.0	1.0	chr19_0_1
2	s1	chr19	6500000	9750000	1.0	1.0	chr19_0_2
3	s1	chr19	9750000	13000000	1.0	1.0	chr19_0_3
4	s1	chr19	13000000	15907946	3.0	1.0	chr19_1_0

And display the segmentation. This time we first add a sum column to the data and only plot that one (total copy number).

[37]:

with_total_cn_df = cns.add_total_cn(complex_agg_df)
cns.fig_bars(with_total_cn_df, cn_columns='total_cn');

For more fine grained segmentations we can also plot CN lines. Note the interpolation between neighbours.

[38]:

cns.fig_lines(with_total_cn_df, cn_columns="total_cn");

Conversions

Below we show conversions between data formats.

This is in particular useful for data processing pipelines, where we might want to have our data as feature sets. These can be then fed into machine learning models.

[39]:

features, samples, positions = cns.bins_to_features(complex_agg_df)
print(features.shape)
print(samples)
positions.head()

(2, 3, 18)
['s1', 's2']

[39]:

	chrom	start	end
name
chr19_0_0	chr19	0	3250000
chr19_0_1	chr19	3250000	6500000
chr19_0_2	chr19	6500000	9750000
chr19_0_3	chr19	9750000	13000000
chr19_1_0	chr19	13000000	15907946

Conversions between segments and CNS_df:

[41]:

cns.cns_df_to_segments(complex_agg_df)["chr19"]

[41]:

[(0, 3250000, 'chr19_0_0'),
 (3250000, 6500000, 'chr19_0_1'),
 (6500000, 9750000, 'chr19_0_2'),
 (9750000, 13000000, 'chr19_0_3'),
 (13000000, 15907946, 'chr19_1_0'),
 (15907946, 18815891, 'chr19_1_1'),
 (18815891, 21723837, 'chr19_1_2'),
 (21723837, 24631782, 'chr19_1_3'),
 (27731782, 30871502, 'chr19_2_0'),
 (30871502, 34011222, 'chr19_2_1'),
 (34011222, 37150942, 'chr19_2_2'),
 (37150942, 40290662, 'chr19_2_3'),
 (40290662, 43430383, 'chr19_2_4'),
 (43430383, 46570103, 'chr19_2_5'),
 (46570103, 49709823, 'chr19_2_6'),
 (49709823, 52849543, 'chr19_2_7'),
 (52849543, 55989263, 'chr19_2_8'),
 (55989263, 59128983, 'chr19_2_9'),
 (0, 3250000, 'chr19_0_0'),
 (3250000, 6500000, 'chr19_0_1'),
 (6500000, 9750000, 'chr19_0_2'),
 (9750000, 13000000, 'chr19_0_3'),
 (13000000, 15907946, 'chr19_1_0'),
 (15907946, 18815891, 'chr19_1_1'),
 (18815891, 21723837, 'chr19_1_2'),
 (21723837, 24631782, 'chr19_1_3'),
 (27731782, 30871502, 'chr19_2_0'),
 (30871502, 34011222, 'chr19_2_1'),
 (34011222, 37150942, 'chr19_2_2'),
 (37150942, 40290662, 'chr19_2_3'),
 (40290662, 43430383, 'chr19_2_4'),
 (43430383, 46570103, 'chr19_2_5'),
 (46570103, 49709823, 'chr19_2_6'),
 (49709823, 52849543, 'chr19_2_7'),
 (52849543, 55989263, 'chr19_2_8'),
 (55989263, 59128983, 'chr19_2_9')]

[42]:

cns.segments_to_cns_df(gap_segs).head()

[42]:

	sample_id	chrom	start	end	name	cn
0	segment	chr1	0	10000	telomere	2
1	segment	chr1	177417	227417	clone	2
2	segment	chr1	267719	317719	contig	2
3	segment	chr1	471368	521368	contig	2
4	segment	chr1	2634220	2684220	clone	2

Conversion between segments and breakpoints:

[43]:

cns.segments_to_breaks(gap_segs)["chr19"][:5]

[43]:

[0, 10000, 60000, 7346004, 7396004]

[44]:

cns.breaks_to_segments({"chr19" : [0, 10000, 30000]})

[44]:

{'chr19': [(0, 10000, 'chr19_0'), (10000, 30000, 'chr19_1')]}