---
title: "Digital PCR Cluster Predictor: a Universal Tool for the Automated
Analysis of Multiplex Digital PCR Data"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Digital PCR Cluster Predictor: a Universal Tool for the Automated Analysis of Multiplex Digital PCR Data}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
## Introduction
Digital polymerase chain reaction (dPCR) is a PCR-based technology that enables
the sensitive quantification of nucleic acids. In a dPCR experiment, nucleic
acids are randomly distributed and end-point amplified in several thousands of partitions that act as micro PCR reactors. The partitioning process and the end-point detection of targets are the foundation of dPCR high sensitivity: each partition receives either zero or few nucleic acid copies, increasing the amplification efficiency; the end-point reaction ensures the amplification of targets to a detectable level.
The signal emitted by hydrolysis probes or intercalating binding dyes is used to detect the partitions containing the targets sequence.
The maximum number of fluorescence signals read in a single sample represents the major limitation of dPCR technology: the majority of dPCR system on the market is able to detect up to two fluorescence signals, limiting the experiment plexity. Several strategies were developed to overcome that limitation (Whale et al., 2016), however data analysis of multiplex assays and clustering of data generated from low input specimens are still an issue: manual annotation is time-consuming, user-dependent and has poor reproducibility.
Digital PCR Cluster Predictor (dPCP) was developed to automate the analysis of multiplex digital PCR data with up to four targets. dPCP supports the analysis multiple digital PCR systems, is independent of multiplexing geometry, and is not influenced by the amount of input nucleic acid.
## dPCP input data and workflow
dPCP requires two types of input files:
- A file containing the raw data for each sample and reference (“.eds” for QuantStudio 3D Digital PCR System, “.csv” for QX100/QX200 Droplet Digital PCR System and other digital PCR systems).
The raw data of QX100/QX200 system can be exported from QuantSoft software selecting *Export Amplitude and Cluster data* in the *Options* pane, it is fundamental to keep the default file name and prepare the plate setup always for two targets (Channel1 and Channel2), even for singleplex assays.
For other digital PCR systems, the raw data have to be prepared in the same format of QX100/QX200 system:
* upload for each sample a “.csv” file with two columns (column 1: fluorescence channel to be shown on y axis of plots; column 2: fluorescence channel to be shown on x axis of plots).
* the file name has be structured as follow: *WellID*_Amplitude.csv.
If it is needed to analyse the files of different experiments, ensure that the same well ID is not used multiple times as the software cannot discriminate file names with the same well ID.
- A *sample table*. The sample table is a csv file containing the information about the samples and the experiment settings.
```{r, results = "hide", eval = FALSE}
#Some examples of input files are included in the package
system.file("extdata", package = "dPCP")
```
The sample table has nine columns:
1. *Sample name*. The name to be assigned to a sample. The samples with the same name are considerate replicates. The field is not mandatory; the samples without a name are not replicates.
2. *Chip ID/Well ID*. This field is mandatory. For QuantStudio 3D Digital PCR System the chip ID is required, while for QX100/QX200 Droplet Digital PCR System and other systems the well ID has to be provided (e.g. A01).
Chip and well ID are used to identify the raw data files, therefore the default file name of QuantStudio 3D Digital PCR System and QX100/QX200 Droplet Digital PCR System must not be changed.
3. *No of targets*. This field is mandatory and must be an integer number
between 0 and 4.
4. *FAM target*. The name of the target detected with a FAM probe.
5. *Target 3*. The name of the third target. If the assay has an orthogonal
geometry, this field should be filled in with the name of the second target
detected with a FAM probe;
6. *Target 4*. The name of the fourth target. If the assay has an orthogonal
geometry, this field should be filled in with the name of the second target
detected with a VIC/HEX probe;
7. *VIC/HEX target*. The name of the target detected with a VIC/HEX probe.
8. *Reference*. This field is optional, a reference must be provided only when sensitive analysis are needed (few positive partition for one or more targets) and/or rain is expexted in sample data.
For QuantStudio 3D Digital PCR System the chip ID of the reference is required, while for QX100/QX200 Droplet Digital PCR System and other systems the complete file name (with file extension) of the reference raw data file has to be provided (e.g. example_ref_Amplitude.csv).
If the field is empty, the corresponding sample is used as reference to identify the empty partitions and single-target clusters with DBSCAN. In this scenario, the sample must have all the requirements to be used as reference (see below). The samples from other runs can be used as reference but in case of experiments with QX100/QX200 Droplet Digital PCR System or other systems, the name of the reference file has to be changed to remove the well ID, which could interfere with the samples well IDs.
Although it is not always required for the analysis, we strongly recommend the use of reference to have faster and more accurate cluster analyses.
9. *Dilution*. This field is mandatory and must be a numeric value
representing the dilution ratio of the sample (e.g. 1:5 dilution has to be
indicated as 0.2).
The sample table has to be filled out by the user with the required information. The table format is fundamental for the analysis and must not be changed.
A file named *Template_sampleTable.csv* is saved as an example file of this package and can be used as template:
```{r, results = "hide", eval = FALSE}
#Show the content of sample table template
read.csv(system.file("extdata", "Template_sampleTable.csv", package = "dPCP"),
stringsAsFactors = FALSE, na.strings = c("NA", ""))
#Copy the template to working directory
file.copy(system.file("extdata", "Template_sampleTable.csv", package = "dPCP"), getwd())
```
The first step carried out by dPCP is the collection of data and information from the input files.
When a reference is used, it is fundamental to have high-quality data as dPCP starts the identification of clusters from the reference. Once a good reference has been identified, it can be used for the analysis of all samples amplified with the same experimental conditions (e.g. same assay, primers and probes concentration, cycling protocol).
The ideal reference has:
- The same experimental settings and conditions of the corresponding sample
- Sufficient input amount to promote the formation at least of high-density empty partitions and single-target clusters;
- Negligible presence of rain;
- Non-cross-reactive probes (see Quality control).
dPCP identifies the empty partitions and single-target clusters in the reference using the non-parametric algorithm called density-based spatial clustering of applications with noise (Ester et al., 1996; Hahsler et al., 2019) (DBSCAN). Maximum distance (ε) between cluster elements and the number of minimum elements (minPts) to assemble a cluster are the input parameters to be chosen by the users. The function *dbscan_combination()* (see Quality control) helps the user to identify the most suitable ε and minPts values.
After the identification of empty partitions and single-target clusters, their centroid position is identified by computing the arithmetic mean of the coordinates of their data elements. The distance between a cluster centroid and the centroid of empty partitions can be represented by a Euclidean vector. As the coordinates of the centroids of multi-target clusters are predicted to be the sum of the coordinates of the centroids of single-target clusters, the position of the centroid of multi-target clusters can be calculated by computing the vector sum of vectors representing the distance of the centroid the single-target clusters to the centroids of empty partitions.
The clustering analysis of sample data is carried out by the unsupervised competitive learning version of the c-means algorithm (Bezdek, 1981; Lai Chung and Lee, 1994; Pal et al., 1996). The principle of fuzzy c-means algorithm is to minimize the variance within the cluster. The intra-cluster variance is defined as the sum of the squared distance of all cluster elements from the cluster centroid. The fluorescence values of sample elements and the coordinates of all centroids are used as input parameter for the analysis. The output of the c-means analysis is a matrix showing the probability of membership of the data elements to each cluster. Each data element is assigned to the cluster whose probability is the highest.
If the highest probability is lower than 0.5 a data element is classified as rain and its membership is recalculated with Mahalanobis distance (Mahalanobis, 1936). Mahalanobis distance computes the distance between a point and a distribution, it is based on measuring at multidimensional level how many standard deviations away is a point from the mean of a distribution.
The rain-tagged elements are assigned to the cluster with the lowest Mahalanobis distance.
The cluster results can be corrected manually by the user with the shiny-based function *manual_correction()*.
Finally, the copies per partition of each target are calculated according to a Poisson model. (Hindson et al., 2011). Precision is calculated as previously described (Majumdar et al., 2015). Replicates can be combined and the copies per partition are re-calculated.
A complete analysis can be executed by the function *dPCP()*.
```{r, results = "hide", eval = FALSE}
library(dPCP)
#Find path of sample table and location of reference and input files
sampleTable <- system.file("extdata", "Template_sampleTable.csv",
package = "dPCP")
fileLoc <- system.file("extdata",package = "dPCP")
#Lunch dPCP analysis
results <- dPCP(sampleTable, system = "bio-rad", file.location = fileLoc,
, eps = 200, minPts = 50, save.template = FALSE, rain = TRUE)
```
Alternatively, a step by step analysis can be carried out following the
abovementioned pipeline:
```{r, results = "hide", eval = FALSE}
library(dPCP)
#Find path of sample table and location of reference and input files
sampleTable <- system.file("extdata", "Template_sampleTable.csv",
package = "dPCP")
fileLoc <- system.file("extdata",package = "dPCP")
#Read sample table file
sample.table <- read_sampleTable(sampleTable, system = "bio-rad",
file.location = fileLoc)
#Read reference files
ref <- read_reference(sample.table, system = "bio-rad",
file.location = fileLoc)
#Read samples files
samp <- read_sample(sample.table, system = "bio-rad", file.location = fileLoc)
#Reference DBSCAN clustering
dbref <- reference_dbscan(ref, sample.table, save.template = FALSE)
#Predict position of clusters centroid from reference DBSCAN results
cent <- centers_data(samp, sample.table,dbref)
#Fuzzy c-means clustering
cmclus <- cmeans_clus(cent)
#Rain classification.
rainclus <- rain_reclus(cmclus)
#Quantification
quantcm <- target_quant(cmclus, sample.table)
quant <- target_quant(rainclus, sample.table)
#Replicates pooling
rep.quant <- replicates_quant(quant, sample.table)
```
dPCP is available also as web app accessible through a web browser at dpcp.lns.lu
## Quality control
Quality controls were developed for the fundamental steps of dPCP analysis. Along with the clustering algorithm, the function *dbscan_combination()* and the *plot()* S3 method were implemented to have a graphical view of the results and to check the quality of the following processes:
1. Choice of DBSCAN input parameters. The identification of empty partitions and single-target clusters in the reference is the first step of dPCP analysis and relies on the DBSCAN algorithm.
The performance of DBSCAN depends on the input parameters *ε* and *minPts* that are the only two values the user has to adapt for dPCP analysis.
In order to simplify the choice of input values, we developed the function *dbscan_combination()* that carries out a DBSCAN simulation for the combinations of *ε* and *minPts* values chosen by the user. The function generates a pdf file for each reference, showing a scatterplot for each combination.
The ideal combination of input values is chosen according to the following criteria:
* identification of the majority of data elements of empty partitions cluster and all single-target clusters;
* absence of multiple subclusters;
* the identified area has to be centred in the cluster centroid.
An example of *dbscan_combination* output is showed in Figure 1.