3.3.1 - RNA hydrolysis

Parameters

| Parameter Name | Variable | Default Value | Parameter Range | Description |
|---|---|---|---|---|
| FRAG_UR_D0 | d0 | 1 | >0 | minimum length of fragments produced by hydrolysis |
| FRAG_UR_DELTA | δ | NaN¹ | | geometry of the fragmentation process (1=linear, 2=surface-diameter, 3=volume-diameter, etc.); if not explicitly specified (NaN), the geometry of breakage depends logarithmically on the molecule length |
| FRAG_UR_ETA | η | NaN¹ | | intensity of fragmentation, determining the number of breaks per unit length; if not explicitly specified (NaN), η is determined by the corresponding δ value and an expected fragment length of 200 nt (or the mean filtered fragment size, if size selection is used) |

¹ NaN stands for "Not a Number" and marks the uninitialized state of a parameter

Algorithm

Frequencies of fragment sizes d produced by a uniform random fragmentation process have been shown to follow Weibull distributions if the fragmentation thermodynamics depend on the molecule size:

$$f(d) = \frac{\delta}{\eta}\left(\frac{d}{\eta}\right)^{\delta-1}\exp\left(-\left(\frac{d}{\eta}\right)^{\delta}\right) \qquad (2)$$

The scale parameter η (FRAG_UR_ETA) represents the intensity of fragmentation (i.e., breaks per unit length). As a determinant of the mean expected fragment size, it is assumed to be constant across molecules of different lengths for fragmentation protocols in which the number of produced fragments depends on the molecule length. The shape parameter δ (FRAG_UR_DELTA) reflects the geometric relation in which random fragmentation breaks a molecule (e.g., δ = 1 corresponds to uniform fragmentation along the linear chain of nucleotides, δ = 2 to uniform splitting of the surface, δ = 3 of the volume, etc.).
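As an illustration of the density in equation (2), the following minimal Python sketch evaluates the Weibull density and draws random fragment sizes. The function names `weibull_pdf` and `sample_fragment_sizes` are hypothetical and not part of the simulator; the sketch only assumes that the shape parameter (FRAG_UR_DELTA) and the scale parameter (FRAG_UR_ETA) are given as positive numbers.

```python
import math
import random


def weibull_pdf(d, delta, eta):
    """Weibull density of equation (2) for a fragment size d > 0.

    delta is the shape parameter, eta the scale parameter (both > 0).
    """
    if d <= 0:
        return 0.0
    z = d / eta
    return (delta / eta) * z ** (delta - 1) * math.exp(-(z ** delta))


def sample_fragment_sizes(count, delta, eta, seed=0):
    """Draw `count` fragment sizes from a Weibull distribution.

    random.weibullvariate(alpha, beta) takes the scale first (alpha)
    and the shape second (beta).
    """
    rng = random.Random(seed)
    return [rng.weibullvariate(eta, delta) for _ in range(count)]
```

With a shape of 1 the density reduces to an exponential distribution whose mean equals the scale, which matches the "linear" geometry described above.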

Employing empirical data from spike-in sequences, we evaluated the fit obtained by weighted subsampling from Weibull distributions with varying shape parameters. Weights for the subsampling (Fig. 2B, solid line) were derived by separating the characteristics of the combined Weibull distributions before filtering (dashed line in Fig. 2B and 2C) from the observed insert size distribution (Fig. 2B, dashed-dotted line). The quality of fit was measured as the p-value of a Kolmogorov-Smirnov test that compares, for each spike-in sequence under investigation, the insert size distribution produced in silico (Fig. 2A, dashed lines) with its experimental counterpart (Fig. 2A, solid lines) under the null hypothesis that both samples were drawn from the same distribution. In this way, we found empirically that the observed differences can be reproduced qualitatively under a constant decay rate (η = 200 nt) when the shape parameter δ depends logarithmically on the molecule length (Supplementary Fig. 4).
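The comparison described above can be sketched with a two-sample Kolmogorov-Smirnov test. The code below is a minimal standard-library illustration, not the procedure used for the figures: the function name `ks_two_sample` is hypothetical, and the p-value uses the classical asymptotic series, which is only an approximation for small samples.

```python
import bisect
import math


def ks_two_sample(a, b):
    """Return (D, p) for the two-sample Kolmogorov-Smirnov test.

    D is the largest absolute difference between the two empirical
    distribution functions; p is the asymptotic p-value under the
    null hypothesis that both samples share one distribution.
    """
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    d_stat = 0.0
    # The supremum of |F_a - F_b| is attained at one of the data points.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / n
        cdf_b = bisect.bisect_right(b, x) / m
        d_stat = max(d_stat, abs(cdf_a - cdf_b))
    lam = math.sqrt(n * m / (n + m)) * d_stat
    if lam < 0.2:
        # The series converges poorly here; the p-value is 1 to within rounding.
        return d_stat, 1.0
    p = 2.0 * sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
        for j in range(1, 101)
    )
    return d_stat, min(1.0, max(0.0, p))
```

A high p-value means the simulated and observed insert size samples cannot be distinguished by this test; a low one indicates a poor fit.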

In our uniform random fragmentation model, we adopt a 3-step algorithm to fragment a given molecule. First, the geometry δ and the number n of fragments obtained from the molecule are determined. We found empirically that δ depends logarithmically on len, the length of the molecule being fragmented: δ = log(len). The number of fragments produced from a specific RNA molecule is n = len/E(d_max), where E(d_max) is the expected value of the most abundant fragment size, computed from η and the gamma function Γ of δ:

$$E(d_{max}) = \eta\,\Gamma\left(\frac{1}{\delta} + 1\right) \qquad (3)$$
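The first step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the simulator's implementation: the function names are hypothetical, the logarithm is taken to be the natural logarithm (the text does not name a base), the scale defaults to the 200 nt mentioned above, and the fragment count is rounded to the nearest integer with a minimum of 1 (the text gives no rounding rule). The right-hand side of equation (3) has the same form as the mean of a Weibull distribution.

```python
import math


def expected_fragment_size(delta, eta=200.0):
    """Equation (3): scale times Gamma(1/shape + 1)."""
    return eta * math.gamma(1.0 / delta + 1.0)


def fragmentation_plan(length, eta=200.0):
    """Return (delta, expected_size, n) for a molecule of `length` nt.

    Assumptions: natural logarithm for the shape, and n rounded to the
    nearest integer with a minimum of one fragment.
    """
    if length <= 1:
        raise ValueError("length must exceed 1 nt so that log(length) > 0")
    delta = math.log(length)
    expected = expected_fragment_size(delta, eta)
    n = max(1, round(length / expected))
    return delta, expected, n


if __name__ == "__main__":
    delta, expected, n = fragmentation_plan(1000)
    print(f"delta={delta:.3f} expected={expected:.1f} n={n}")
```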
