Abstract
Plasma DNA fragmentomics is an emerging area of research covering plasma DNA sizes, end points, and nucleosome footprints. In the present study, we found a significant increase in the diversity of plasma DNA end motifs in patients with hepatocellular carcinoma (HCC). Compared with patients without HCC, patients with HCC showed a preferential pattern of 4-mer end motifs. In particular, the abundance of plasma DNA motif CCCA was much lower in patients with HCC than in subjects without HCC. The aberrant end motifs were also observed in patients with other cancer types, including colorectal cancer, lung cancer, nasopharyngeal carcinoma, and head and neck squamous cell carcinoma. We further observed that the profile of plasma DNA end motifs originating from the same organ, such as the liver, placenta, and hematopoietic cells, generally clustered together. The profile of end motifs may therefore serve as a class of biomarkers for liquid biopsy in oncology, noninvasive prenatal testing, and transplantation monitoring.
Plasma DNA molecules originating from the liver, HCC and other cancers, placenta, and hematopoietic cells each harbor a set of characteristic plasma DNA end motifs. Such markers carry tissue-of-origin information and represent a new class of biomarkers in the nascent field of fragmentomics.
This article is highlighted in the In This Issue feature, p. 627
Introduction
There is much recent research interest in the molecular characteristics of cell-free DNA (cf DNA) in plasma. One such characteristic is the fragmentation patterns of cfDNA, including information regarding fragment sizes (1), nucleosome relationships (2, 3), and end points (4, 5). This area of research can be broadly named “fragmentomics” (6). cfDNA molecules are known to circulate as short fragments (1, 7) originating from different cell types, including various normal organ systems (8–11) and malignancies from different anatomic sites (12). Interestingly, it has also been found that circulating fetal DNA (1, 2) and cancer DNA molecules (13) are shorter than the “background” cfDNA molecules, which are generally of hematopoietic origin (9). cfDNA molecules also bear signatures of their association with nucleosomes, such as the most abundantly represented plasma DNA size being 166-bp and fragmentation end points showing relationships to nucleosome organization (2, 3). Such knowledge of cfDNA fragmentation patterns has also been used for enhancing the performance of noninvasive prenatal testing (14) and cancer detection (13, 15).
The nucleosome footprints of cfDNA molecules have been found to contain information concerning their tissues of origin (3). Such tissue-of-origin information has also been inferred using plasma DNA methylation profiles (12). More recently, it has been demonstrated that a subset of human genomic locations are preferentially cleaved when plasma DNA molecules are generated, called plasma DNA preferred ends (4, 5). Such plasma DNA preferred ends contain information on the tissue of origin of cfDNA (4, 5).
In this work, we hypothesized whether human plasma DNA ends might have a preponderance of certain nucleotide contexts, i.e., preferred fragment end motifs. The plasma DNA end motifs represent a distinct type of plasma DNA fragmentation signatures, which are different from plasma DNA preferred ends previously studied (4, 5). The plasma DNA preferred ends refer to the specific ending sites in the genome, whereas end motifs are defined as a few nucleotides at plasma DNA ends regardless of the site of origin within the genome. We predict that such end motifs will also show hallmarks of the nonrandom fragmentation process of plasma DNA. We have recently demonstrated that Dnase1l3 gene deletion in a mouse model causes a dramatic reduction in the most common 4-mer end motifs of plasma DNA (e.g., the motif CCCA; ref. 16). These results suggested that aberrations in DNA endonucleases such as DNASE1L3 might play a role in altering plasma DNA end motifs. The genetic or epigenetic changes in cancer genomes might cause aberrations in the expression of DNA endonucleases, potentially resulting in the changes in plasma DNA end motifs. In this proof-of-concept work, we explore the landscape of plasma DNA end motifs as well as associated aberrations in human samples involving hepatocellular carcinoma (HCC) and other cancers. We also attempted to gain tissue-of-origin insights into plasma DNA end motifs using liver transplantation and pregnancy as models.
Results
Plasma DNA End-Motif Determination
The workflow for determining the plasma DNA end motifs is schematically illustrated in Fig. 1. In this study, we analyzed each plasma DNA fragment using massively parallel sequencing. The plasma DNA end motifs were identified using the first 4-nucleotide (i.e., 4-mer) sequence on each 5′ fragment end of plasma DNA after alignment to the reference genome (see Supplementary Methods). The frequency of each plasma DNA end motif was calculated for downstream analysis in an attempt to see if certain end motifs might be over- or underrepresented in fragments from certain organs, or in selected physiologic or pathologic conditions.
Schematic illustration of the determination of plasma DNA end motifs. The plasma DNA fragments carry 3′ protruding single-strand ends or 5′ protruding single-strand ends, or blunt ends (not shown). During end repair, the 3′ protruding single-strand ends are removed, and the 3′ receded ends are elongated using the opposite 5′ protruding single strand as DNA template. Thus, the original 3′ ends will be modified, but the original 5′ ends will be preserved. Paired-end sequencing reads that are mapped to a human reference genome are used to determine the first 4-nucleotide sequence (i.e., a 4-mer motif) on each 5′ fragment end (Watson and Crick strands) of plasma DNA in relation to the reference genome, resulting in a total of 256 4-mer motifs. As shown in the example, each 5′ plasma DNA end motif of both sides of each double-stranded plasma DNA molecules including CCCA and CCAG was used for profiling the 4-mer end motif collectively.
Schematic illustration of the determination of plasma DNA end motifs. The plasma DNA fragments carry 3′ protruding single-strand ends or 5′ protruding single-strand ends, or blunt ends (not shown). During end repair, the 3′ protruding single-strand ends are removed, and the 3′ receded ends are elongated using the opposite 5′ protruding single strand as DNA template. Thus, the original 3′ ends will be modified, but the original 5′ ends will be preserved. Paired-end sequencing reads that are mapped to a human reference genome are used to determine the first 4-nucleotide sequence (i.e., a 4-mer motif) on each 5′ fragment end (Watson and Crick strands) of plasma DNA in relation to the reference genome, resulting in a total of 256 4-mer motifs. As shown in the example, each 5′ plasma DNA end motif of both sides of each double-stranded plasma DNA molecules including CCCA and CCAG was used for profiling the 4-mer end motif collectively.
Alteration of Plasma DNA Motif CCCA in Patients with HCC
The sequence homology of amino acid sequences between the proteins encoded by DNASE1L3 (human) and Dnase1l3 (mouse) was 82%. In both humans and mice, deficiency of the corresponding gene would lead to the development of lupus-like syndromes with the presence of anti–double-stranded DNA autoantibodies (16). We conjectured that the DNASE1L3 in mice would mimic the activity of DNASE1L3 in humans. Thus, we used the DESeq software (17) to analyze the gene-expression profiles for HCC tumors (N = 368) and adjacent normal liver tissues (N = 51) on the basis of RNA-sequencing data generated by The Cancer Genome Atlas (TCGA) Research Network. The mRNA expression of DNAES1L3 was dramatically downregulated in tumor tissues compared with adjacent nontumoral liver tissues (10.3-fold reduction in the median expression level; P value < 0.0001; Fig. 2A). This scenario would partially mimic the pathologic situation of being DNASE1L3-deficient in humans. Intriguingly, we observed that the plasma motif CCCA, which was the most frequent motif in plasma DNA of healthy human controls, was significantly reduced in HCC subjects (a relative decrease of 17.9% in the median motif frequency; Bonferroni-adjusted P value < 0.0001; Fig. 2B). These results resembled the reduction of the CCCA plasma DNA end motif in mice carrying the Dnase1l3 deletion (16).
A, DNASE1L3 expression levels between nontumoral liver tissues and HCC tumoral tissues. B, Box plot of CCCA end-motif frequency in plasma DNA across healthy control subjects (Control; n = 38) and patients infected with chronic HBV (n = 17) and HCC (HCC; n = 34). Triangles and circles represent HBV carriers with and without cirrhosis.
A, DNASE1L3 expression levels between nontumoral liver tissues and HCC tumoral tissues. B, Box plot of CCCA end-motif frequency in plasma DNA across healthy control subjects (Control; n = 38) and patients infected with chronic HBV (n = 17) and HCC (HCC; n = 34). Triangles and circles represent HBV carriers with and without cirrhosis.
Landscape of Plasma DNA End Motifs in Patients with HCC
To study whether the landscape of the 256 4-mer plasma DNA end motifs would be altered in patients with cancer, we sequenced 38 healthy control subjects (Control), 17 patients infected with chronic hepatitis B virus (HBV), and 34 patients with HCC with a median of 38 million paired-end reads (range, 18–65 million).
Hierarchical clustering analysis was used to study whether HCC subjects would share identifiable characteristics of plasma DNA end motifs compared with non-HCC subjects, including healthy controls and HBV subjects. We calculated the frequency of each 4-mer motif. The hierarchical clustering analysis of frequencies of the 256 motifs showed that HCC subjects tended to cluster together, whereas non-HCC subjects tended to form distinct clusters (Fig. 3A). Box-plot analysis of plasma DNA end motifs between HCC and non-HCC groups showed that there were a number of motifs exhibiting the significant differences between these groups (Supplementary Fig. S1). Six representative motifs showing significant differences between HCC and non-HCC subjects are shown in Fig. 3B, including three motifs (CCCA, CCAG, and CCTG) with a significant decrease in HCC subjects and another three motifs (TAAA, AAAA, and TTTT) with a significant increase in HCC subjects. For example, in comparison with non-HCC subjects, the end motif CCCA was significantly lower (a decrease of 17.9%; Bonferroni-adjusted P value < 0.0001, Wilcoxon rank-sum test) in HCC subjects. On the other hand, the end motif AAAA was significantly higher (a relative increase of 16.4%; adjusted P value < 0.0001, Wilcoxon rank-sum test; Fig. 3B; Supplementary Fig. S1; Supplementary Table S1).
MDS analysis for HCC and non-HCC subjects. A, Heat map analysis of motif frequencies between non-HCC and HCC samples. The data are row-normalized. B, Box plot analysis of six representative motifs showing differential frequencies between non-HCC and HCC subjects. The gray dashed line indicates the frequency of 1/256. C, Box plot of MDS of plasma DNA end motifs among healthy control subjects (Control; n = 38) and patients infected with chronic HBV (n = 17) and HCC (n = 34). D, ROC curve analysis between non-HCC and HCC groups. E, MDS values between patients with and without detectable CNAs in plasma. F, End-motif frequencies of those fragments carrying wild-type (nontumoral) and mutant (tumoral) alleles for representative decreased and increased end motifs identified in the HCC group. G, Effects of tumor DNA fraction and the number of DNA molecules on the discriminative power on subjects with and without cancer by computer simulation. The AUC is stated and colored as indicated in the key. H, Downsampling analysis of actual sequencing results. We randomly sampled 500, 1,000, 5,000, 10,000, 50,000, 100,000, 500,000, 1,000,000, 2,000,000, 5,000,000, 10,000,000, 20,000,000, and 30,000,000 plasma DNA fragments to carry out MDS analysis. Downsampling analysis was repeated 10 times for each randomly sampled set of fragments.
MDS analysis for HCC and non-HCC subjects. A, Heat map analysis of motif frequencies between non-HCC and HCC samples. The data are row-normalized. B, Box plot analysis of six representative motifs showing differential frequencies between non-HCC and HCC subjects. The gray dashed line indicates the frequency of 1/256. C, Box plot of MDS of plasma DNA end motifs among healthy control subjects (Control; n = 38) and patients infected with chronic HBV (n = 17) and HCC (n = 34). D, ROC curve analysis between non-HCC and HCC groups. E, MDS values between patients with and without detectable CNAs in plasma. F, End-motif frequencies of those fragments carrying wild-type (nontumoral) and mutant (tumoral) alleles for representative decreased and increased end motifs identified in the HCC group. G, Effects of tumor DNA fraction and the number of DNA molecules on the discriminative power on subjects with and without cancer by computer simulation. The AUC is stated and colored as indicated in the key. H, Downsampling analysis of actual sequencing results. We randomly sampled 500, 1,000, 5,000, 10,000, 50,000, 100,000, 500,000, 1,000,000, 2,000,000, 5,000,000, 10,000,000, 20,000,000, and 30,000,000 plasma DNA fragments to carry out MDS analysis. Downsampling analysis was repeated 10 times for each randomly sampled set of fragments.
We adopted the normalized Shannon entropy to arrive at a motif diversity score (MDS) by comparing frequencies across 256 motifs (see Methods). A higher MDS value indicated that there was a higher variety of plasma DNA molecules with different end motifs in plasma. Conversely, a lower MDS value indicated that there were fewer varieties of plasma DNA end motifs.
As shown in Fig. 3C, HCC subjects were associated with higher MDS values compared with non-HCC subjects. MDS values of HCC subjects (median, 0.945; range, 0.930–0.954) were found to be significantly higher than healthy control subjects (median, 0.941; range, 0.933–0.946) and patients with HBV infection (median, 0.938; range, 0.931–0.946; P value < 0.0001, Kruskal–Wallis test). The MDS values varied across different plasma DNA fragment sizes and were generally higher in HCC subjects than in non-HCC subjects (Supplementary Fig. S2).
Such an increase of plasma DNA end diversity could be generally observed among various cancer types when we performed MDS analysis using sequencing results of plasma DNA downloaded from a published study (Supplementary Fig. S3A; ref. 18), which may reflect the fact that different tumor cells from different anatomic sites would shed their DNA into the blood circulation (19). Of note, even though the data shown in Fig. 3C and Supplementary Fig. S3A were generated using different methods and sequencing instruments (18), it was encouraging to see a general increase of MDS in patients with cancer in both datasets.
To further test the generalizability of MDS changes across different cancer types, we further sequenced an independent cohort with 40 plasma DNA samples of other cancer types, including patients with colorectal cancer (n = 10), lung cancer (n = 10), nasopharyngeal carcinoma (n = 10), and head and neck squamous cell carcinoma (n = 10), with a median of 42 million paired-end reads (range, 19–65 million). As shown in Supplementary Fig. S3B, the MDS values in the group of patients with cancer (median, 0.943; range, 0.939–0.949) were significantly higher than in the control group without cancer (median, 0.941; range, 0.933–0.946; P value < 0.0001, Wilcoxon sum-rank test). Interestingly, we also observed that the expression levels of DNASE1L3 for those cancer types available in the TCGA Research Network were generally downregulated, including in breast cancer, colon cancer, lung cancer, gastric cancer, and head and neck squamous cell carcinoma (Supplementary Fig. S3C).
We employed ROC curve analysis to study the potential diagnostic ability for cancer detection with the use of plasma DNA end motifs (i.e., MDS statistics). The area under the ROC curve (AUC) between HCC and non-HCC identification was 0.86 (Fig. 3D). However, compared with the MDS value calculated using all fragments, we did not observe a significant difference in AUC between MDS values calculated by fragments less than 150 bp, and those for fragments greater than 200 bp (Supplementary Fig. S4).
Combining the data from all cancer and noncancer subjects, we had a total of 129 samples, including healthy controls (n = 38), HBV carriers (n = 17), and patients with HCC (n = 34), colorectal cancer (n = 10), lung cancer (n = 10), nasopharyngeal carcinoma (n = 10), and head and neck squamous cell carcinoma (n = 10). Interestingly, the MDS-based method (AUC = 0.85) appeared to have the best performance (Supplementary Fig. S5) compared with other fragmentomic metrics including fragment size (AUC = 0.74, P value = 0.0040; DeLong test; ref. 14) and orientation-aware plasma cell-free fragmentation signals (AUC = 0.68; P value = 0.0013;ref. 20).
We divided the patients into two groups, namely those with and without detectable copy-number aberrations (CNA) in plasma DNA. The MDS of patients with HCC with CNAs was higher than those without CNAs (P = 0.0024;Fig. 3E).
There is much recent interest in attempting to use methylation signals of plasma DNA to detect a variety of cancers. Massively parallel bisulfite sequencing was commonly used in analyzing methylation profiles of plasma DNA in these studies (21). Thus, we were interested to explore whether plasma DNA end motifs would be preserved in bisulfite sequencing results. We performed bisulfite sequencing on the same sample set presented earlier, namely the 8 healthy control subjects, 17 patients infected with chronic HBV, and 34 patients with HCC to a median of 56 million paired-end reads (range, 50–60 million). We observed a very high correlation between the nonbisulfite and bisulfite sequencing data in terms of the frequency of end-motif CCCA (r = 0.98; P value < 0.0001) and MDS (r = 0.99; P value < 0.0001; Supplementary Fig. S6A and S6B). The patterns of hierarchical clustering analysis, plasma DNA end-motif frequencies, and MDS between non-HCC and HCC subjects were found to be reproducible (Supplementary Fig. S6C–S6E). These results suggested that the plasma DNA end motifs were preserved in the bisulfite sequencing protocol used. The ability to preserve plasma DNA end motifs in bisulfite sequencing could be attributed to the fact that methylated sequencing adaptors were first ligated to plasma DNA molecules prior to bisulfite treatment and those plasma DNA molecules degraded by bisulfite were not able to be amplified and be part of the final sequencing library.
Interestingly, the MDS values deduced from 1- to 5-mer motifs also bore the power of distinguishing patients with and without cancer (Supplementary Fig. S7A).
Plasma DNA End Motifs of Tumor-Derived DNA
One plasma DNA sample from HCC was sequenced with more than 200× haploid human genome coverage in our previous study (5). We identified 266,986 plasma DNA fragments carrying mutant alleles (tumoral DNA) and 2,349,406 plasma DNA fragments carrying wild-type alleles (mainly nontumoral DNA), respectively. We observed that the CCCA motif was less abundant in tumor-derived fragments than the background DNA of predominantly hematopoietic origin (Fig. 3F), and MDS of molecules carrying the mutant alleles (0.949) was greater than those carrying wild-type alleles (0.945). These results provided independent evidence that the tumor-derived DNA fragments carried a different distribution of end motifs from the nontumoral DNA.
Classification Performance Using Plasma DNA End Motifs
To further explore whether a classifier could be built for detecting patients with cancer using plasma DNA end motifs, we used the 256 plasma DNA end motifs to build a classifier to differentiate subjects with (n = 55) and without (n = 74) cancer using support vector machine (SVM) and logistic regression which took into account the magnitude and direction of each end motif. To minimize the issue of overfitting, we adopted a leave-one-out procedure (see Supplementary Methods) to evaluate its performance by using ROC curve analysis.
As a result, we observed a small, but not statistically significant, increase in AUC of using the classifiers with 256 end motifs (AUC = 0.89 for both SVM and logistic regression) compared with the MDS-based analysis (AUC = 0.85; Supplementary Fig. S7B).
We also explored the effects of tumor DNA fraction and the number of plasma DNA molecules analyzed (i.e., sequencing reads) on the performance of MDS-based cancer detection using computer simulation (see Supplementary Methods). As shown in Fig. 3G, the performance of cancer detection progressively improved as the tumor DNA fraction in plasma DNA and the number of DNA molecules increased. For example, at 30 million plasma DNA molecules, the AUC was only 0.66 for those patients with a tumor DNA fraction of 1%, whereas the AUC increased up to 0.95 and 1.0 for those patients with a tumor DNA fraction of 4% and 10%, respectively. On the other hand, the performance of differentiating patients with HCC from control subjects rapidly reached a plateau when the number of sequencing reads reached 50,000, assuming a tumor DNA fraction of 10% (Fig. 3G). For a tumor DNA fraction of 5%, the plateau of the performance was reached at 500,000 sequencing reads (Fig. 3G). Such a rapid attainment of performance plateau for the AUC using MDS was also observed in the downsampling analysis of the dataset generated from actual plasma samples including HCC subjects and subjects without cancers when the number of sequencing reads reached 500,000 (Fig. 3H).
Plasma DNA End Motifs in Liver Transplantation
To allow us to focus on plasma DNA end motifs derived from a solid organ, we investigated end motifs using nonbisulfite sequencing of plasma DNA from 12 liver transplantation recipients previously reported (22). The genotypic differences between the donor and recipient genomes could be used to distinguish the donor and recipient DNA molecules in the plasma of patients with liver transplantation (i.e., recipient's plasma; ref. 11). We made use of informative SNP sites for which the recipient was homozygous (AA) and the donor was heterozygous (AB). As illustrated in Supplementary Fig. S8, the donor-specific molecules carrying the donor-specific alleles (B) were identified. In addition, the molecules carrying the shared allele (A) were also identified, which would predominantly represent the recipient-derived DNA molecules, because the donor DNA molecules were a minor population in the recipient plasma DNA pool. Such recipient background DNA molecules were mainly of hematopoietic origin (22).
We studied the profile of 4-mer end motifs among the DNA molecules in the recipient's plasma. We calculated the proportion of each 4-mer motif and compared the frequencies across the 256 4-mer motifs using MDS and clustering analysis (Supplementary Fig. S8) for the donor-specific and shared sequences. We observed that MDS values were significantly lower (P value = 0.0009, Wilcoxon signed-rank test) in liver-derived DNA molecules than those for shared sequences (Fig. 4A). The hierarchical clustering analysis showed that the patterns of the 256 4-mer end motifs for liver-specific and shared DNA molecules were clustered into two groups (Fig. 4B). These results provided further evidence that plasma DNA end motifs carried information about the tissue of origin of cfDNA.
MDS distributions in plasma DNA of a patient with liver transplantation and pregnant subjects. A, Dot plot of MDS between shared and donor-specific DNA sequences. B, Heat map of shared and donor-specific sequences using 256 end motifs. C, Heat map of shared and fetal-specific sequences using 256 end motifs. D, Box-plot analysis of six representative motifs showing differential frequencies between fetal and shared sequences. The gray dashed line indicated the frequency of 1/256. E, Dot plot of shared and fetal-specific sequences. F, Correlation between MDS and fetal DNA fraction.
MDS distributions in plasma DNA of a patient with liver transplantation and pregnant subjects. A, Dot plot of MDS between shared and donor-specific DNA sequences. B, Heat map of shared and donor-specific sequences using 256 end motifs. C, Heat map of shared and fetal-specific sequences using 256 end motifs. D, Box-plot analysis of six representative motifs showing differential frequencies between fetal and shared sequences. The gray dashed line indicated the frequency of 1/256. E, Dot plot of shared and fetal-specific sequences. F, Correlation between MDS and fetal DNA fraction.
Plasma DNA End Motifs in Pregnant Subjects
Pregnancy is another attractive model for studying the biology of tissue-specific cfDNA molecules (23). Using a similar analytic strategy as for the liver transplantation study described above, we could differentiate the fetal-specific sequences (i.e., fetal-specific DNA) from the shared sequences (mainly recipient's hematopoietically derived DNA) using informative SNPs where the mother was homozygous (AA) and the fetus was heterozygous (AB). The fetal-specific molecules carrying the fetal-specific alleles (B) were determined. On the other hand, the molecules carrying the shared alleles (A) were determined, which would predominantly represent the maternally derived DNA molecules, because the fetal DNA molecules were a minor population in the maternal plasma DNA pool. Such maternal background DNA molecules were mainly hematopoietic in origin (12).
To assess whether there were any differences in end-motif profiles between fetal and maternal DNA molecules, we reanalyzed a dataset from a previous publication reporting bisulfite sequencing results from 10 pregnant women from each of the first (12–14 weeks), second (20–24 weeks), and third (38–40 weeks) trimesters (24). These sequencing data were interpreted using independently generated genotype results based on microarray analysis (HumanOmni2.5 genotyping array Illumina) of the matched maternal buffy coat and fetal samples (24). The shared sequences (mainly maternally derived) and fetally derived sequences were readily differentiated utilizing the informative SNPs. The median fetal DNA fraction among those samples was 17.1% (range, 7.0%–46.8%).
Hierarchical clustering analysis based on the 4-mer end motifs showed that the patterns of end motifs originating from fetal DNA molecules across different samples formed a cluster which was distinct from that of the maternally derived DNA molecules (Fig. 4C). Box-plot analyses between fetal and maternal DNA end motifs showed that there were many motifs exhibiting significant differences in terms of frequencies (Supplementary Fig. S9). Six representative motifs showing significant differences in frequencies between fetal and maternal DNA molecules are shown in Fig. 4D, including three motifs (CCCA, CCAA, and CAAA) with significant enrichment in fetal DNA molecules and another three motifs (ACTT, ACCT, and CTGG) with significant decreases in fetal DNA molecules. For example, in comparison with maternal DNA, the end motif CCCA was significantly higher (a median increase of 3.78%; adjusted P value = 0.0055, Wilcoxon signed-rank test) in fetal DNA molecules, whereas the end motif CTGG was significantly lower (a median decrease of 10.74%; adjusted P value = 0.002, Wilcoxon signed-rank test; Fig. 4D; Supplementary Fig. S9; Supplementary Table S2).
Figure 4E shows that MDS values of fetally derived molecules (median, 0.943; range, 0.940–0.949) are generally lower (P value < 0.0001, Wilcoxon signed-rank test) than those of maternally derived ones (median, 0.947; range, 0.944–0.951). Interestingly, the MDS values derived from all sequenced DNA fragments in plasma of each pregnant woman were negatively correlated with fetal DNA fraction (Spearman's ρ: 0.46; P value = 0.012; Fig. 4F), further suggesting that MDS of plasma DNA may reflect the tissue of origin of those molecules.
Discussion
We have demonstrated that plasma DNA end-motif profiling represents an approach for differentiating the patients with and without cancers. Due to the small sample size of this proof-of-concept work, these results would need to be validated in large-scale studies. Several lines of evidence from previous studies have suggested that plasma DNA fragmentation is a nonrandom process. For example, plasma DNA fragments display a characteristic size distribution with a 166-bp major peak and smaller peaks occurring at 10-bp intervals (2); there are a great number of genomic locations found to be preferentially cleaved (4, 5) when plasma DNA are generated, and the fragmentation of plasma DNA allowed nucleosome footprints to be deduced (2, 3). In this study, we show that certain plasma DNA end motifs are more prevalent than others, in association with the tissue of origin. On the other hand, the cancer-associated motif aberrations are present in patients with cancer.
One key observation of the plasma DNA end motif profile in patients with HCC was that there was an increase in motif diversity in such patients. Considering the fact that DNASE1L3 plays an important role in the fragmentation of plasma DNA as demonstrated by a Dnase1l3 deletion mouse model (16), one possible reason would be due to the significant downregulation of DNASE1L3 in human liver tumor tissues. Deleting Dnase1l3 would result in a dramatic reduction of the six highest-ranked end motifs (e.g., CCCA) that were originally highly enriched in wild-type mice (16). We observed a significant downregulation of DNASE1L3 in HCC tumors and other cancer types (breast cancer, colon cancer, lung cancer, gastric cancer, and head and neck squamous cell carcinoma) that were analyzed in this work and available in the TCGA database (Supplementary Fig. S3C). We observed that the plasma DNA end motif CCCA was reduced in the HCC group compared with the non-HCC group. Taken together, these results suggested that the enzymatic activities of DNASE1L3 may partly contribute to the alterations in motif diversity of plasma DNA.
Intriguingly, the MDS values across a broad spectrum of plasma DNA sizes for patients with HCC are larger than those for non-HCC subjects. We did not observe a differential enhancement in MDS-based diagnostic performance by focusing on selected shorter plasma DNA molecules. In the above-mentioned mouse model, the plasma DNA profiles of pregnant mice in which both copies of Dnase1l3 had been deleted (i.e., Dnase1l3−/−; ref. 16) were studied. If such pregnant Dnase1l3−/− mice carried fetuses of the Dnase1l3+/− (i.e., with one copy of the functioning Dnase1l3 gene) genotype, the fetuses were able to partially correct the plasma DNA profile of the pregnant mothers (16), implying that the fetuses were able to release the DNASE1L3 enzyme into the mother's circulation to alter the mother's plasma DNA end motifs. Thus, we speculate that the observation of the elevation of MDS values across different size ranges in patients with HCC might be associated with an alteration in the gene expression involving genes coding for nucleases, for example, a reduction in the expression of DNASE1L3. Such alteration in the gene expression of those nucleases could result in a perturbation of the relative levels of different nucleases in plasma, leading to a global change in plasma DNA end motifs (i.e., affecting both tumoral and nontumoral plasma DNA of end motifs). We referred to this global change as the “systemic” effect of nuclease perturbation.
Furthermore, in the previous study of pregnant mouse models, an enhanced DNA cutting by DNASE1L3 enzyme expressed by the fetuses was observed in the subset of fetal DNA molecules, indicating a “local” effect (i.e., within fetal tissues) of DNASE1L3. Such analogous “local” effect appeared to also exist in the subset of plasma DNA molecules carrying tumoral mutations. Therefore, we believe that both “global” and “local” effects of nucleases may affect the fragmentomic profile of plasma DNA. The mechanistic basis of this observation requires future investigation, for example, using a mouse model with knockdown or overexpression of different nucleases in different tissues.
We speculated that a number of other nucleases, such as caspase-activated DNase, DNASE II, TREX1 (three prime repair exonuclease), DNASE1, and ENDOG (endonuclease G), involving in the DNA fragmentations during apoptosis might be involved in the generation of plasma DNA ends. In our previous study, we found that the deletion of Dnase1l3 would result in aberrations of plasma DNA size profile, whereas there was no observable effect on plasma DNA size profile for mice with the deletion of Dnase1 (16). Hence, we propose that DNASE1L3 may play a role in the generation of plasma DNA end motifs. Our group is now exploring the effect of the deletion of genes coding for a number of the aforementioned nucleases on plasma DNA end motifs (25). Such ongoing work might shed new mechanistic insights into the generation of cfDNA end motifs. The fragmentation of DNA occurring with cell apoptosis would likely be affected by the nucleosomal profile of different genomic regions, which might vary from tissue to tissue (3, 20), potentially contributing toward the variation of end-motif profiles from different tissues.
These data presented in this study suggest that the profile of plasma DNA end motifs might reflect certain pathophysiologic states. Such a hypothesis was further evidenced by the fact that the profile of plasma DNA end motifs originating from the same tissue type, such as the liver, placenta, and hematopoietic cells, generally clustered together. For example, the end motifs would allow fetal (i.e., from the placenta) and maternal plasma DNA molecules (mainly of hematopoietic origin) to be clustered into different groups in the plasma DNA of pregnant women. In addition, liver-specific and recipient DNA molecules showed unique patterns of end motifs. Taken together, the profile of plasma DNA end motifs represents a new class of biomarkers for liquid biopsy for oncology, noninvasive prenatal testing, and transplantation monitoring.
As demonstrated in the downsampling analyses presented in Fig. 3H, the diagnostic potential of plasma DNA end-motif analysis requires only a relatively small number of molecules to be realized. Hence, for tumor DNA fractions of 5% and 10%, respectively, the plateau of performance would be reached at 500,000 and 50,000 molecules. We consider this observation to reflect a potential strength in the motif-based approach, in that in the future one could adapt this approach to lower-throughput, but cheaper, analytic methods, for example, digital PCR.
In summary, we have developed a generic approach to delineate the profile of plasma DNA end motifs and have revealed its association with pathophysiologic conditions such as pregnancy, transplantation, and cancer. As a member in the emerging field of plasma DNA fragmentomics which also encompasses plasma DNA size profiling (2), preferred ends (4, 5), and nucleosome relationships (2, 3), we believe that plasma DNA end-motif profiling would have many future research and diagnostic applications.
Methods
Sample Collection and Processing
Patients with chronic hepatitis B but without HCC, and patients with HCC were recruited from the Department of Surgery and the Department of Medicine and Therapeutics of the Prince of Wales Hospital, Hong Kong. Healthy subjects were recruited as controls. Patients with nasopharyngeal carcinoma and head and neck squamous cell carcinoma were recruited from the Department of Otorhinolaryngology, Head and Neck Surgery of the Prince of Wales Hospital, Hong Kong. All subjects involved in this study gave written informed consent, and the study was approved by The Joint Chinese University of Hong Kong–Hospital Authority New Territories East Cluster Clinical Research Ethics Committee, under the Declaration of Helsinki.
We collected plasma from the blood samples through centrifugation at 1,600 × g for 10 minutes and then 16,000 × g for 10 minutes at 4°C. A QIAamp Circulating Nucleic Acid Kit (Qiagen) was used to extract DNA from 4 mL plasma.
Sequencing Library Preparation
Plasma DNA was used for preparing sequencing libraries. Libraries were prepared using TruSeq Nano DNA Library Prep Kit (Illumina). For bisulfite sequencing, libraries were treated with two rounds of bisulfite conversion by the EpiTect Plus DNA Bisulfite Kit (Qiagen) according to the manufacturer's instructions. Bisulfite-converted products were amplified with KAPA HiFi HotStart Uracil + ReadyMix (Roche), which were sequenced on a HiSeq 4000 system (Illumina) in a 75-bp × 2 paired-end format. No size selection was performed prior to sequencing. The quality of DNA libraries was assessed on Agilent 4200 TapeStation. The libraries were run on D1000 ScreenTape (Agilent) and checked for size and quantity. A prominent peak observable at around 320 bp indicated a successful library construction. Six Bioanalyzer profiles from each group are shown in Supplementary Fig. S10.
Sequencing Alignment
After base calling, the sequencing reads were preprocessed by removing the adaptor sequences and low-quality bases (i.e., quality score of < 20). The trimmed reads in a FASTQ format were analyzed as described previously (14, 21) for the nonbisulfite and bisulfite sequencing data, respectively. Only paired-end reads with both ends aligned to the same chromosome with the correct orientation, spanning an insert size of ≤600 bp, were used for downstream analysis.
MDS Calculation
To analyze the distribution of frequencies of motifs (e.g., for a total of 256 motifs), a concept of MDS was used. We adopted the normalized Shannon entropy as a mathematical approach for calculating the MDS. MDS was defined using the following equation:
where Pi is the frequency of a particular motif. A higher MDS value indicates a higher diversity (i.e., a higher degree of randomness). The theoretical scale is ranged from 0 to 1.
If the 256 4-mer motifs were equally present in terms of their frequencies, MDS would achieve the maximal value (i.e., 1). In contrast, if the 256 motifs had a skewed distribution in their frequencies, the MDS would decrease. For example, if one particular motif accounted for 99% and the other motifs constituted the remaining 1%, the MDS would decrease to close to 0. Therefore, the decreasing MDS of motif frequencies would imply the increasing skewness in the frequency distribution across end motifs. On the contrary, the increasing MDS of motif frequencies would suggest that the frequencies across motifs would shift toward equal probabilities for those motifs.
Data Deposition
Sequence data for the subjects studied in this work who had consented to data archiving have been deposited at the European Genome–Phenome Archive (EGA), www.ebi.ac.uk/ega/, hosted by the European Bioinformatics Institute (accession no. EGAS00001003409).
Disclosure of Potential Conflicts of Interest
P. Jiang is director at KingMed Future and a consultant for Grail; reports receiving a commercial research grant from Grail; has ownership interest (including patents) in Grail; and receives patent royalties from Grail, Illumina, Sequenom, DRA, Take2 Health, and Xcelom. R.W.Y. Chan has ownership interest in a patent application (US Provisional Patent application 62/782,316). V.W.S. Wong is a consultant for 3V-BIO, AbbVie, Pfizer, Terns, Allergan, Boehringer Ingelheim, Echosens, Gilead Sciences, Intercept, Novartis, Novo Nordisk, and Perspectum Diagnostics; reports receiving a commercial research grant from Gilead Sciences; and reports receiving honoraria from the speakers' bureaus of AbbVie, Bristol-Myers Squibb, Echosens, and Gilead Sciences. L.C. Poon is a consultant for Roche Diagnostics and reports receiving commercial research support from Roche Diagnostics, PerkinElmer Inc., and Thermo Fisher Scientific. W.K.J. Lam is a consultant at Grail; reports receiving a commercial research grant from a Grail Contract Research Agreement; has ownership interest (including patents) in Grail; and receives patent royalties from Grail and Take2 Health. H.L.Y. Chan is a scientific advisor forGrail. K.C.A. Chan is a consultant at Grail; reports receiving a commercial research grant from Grail/Cirina; and has ownership interest in patents on molecular diagnostics, Grail equities, DRA equities, and Take2 equities. R.W.K. Chiu is a consultant for Grail; reports receiving a commercial research grant from Grail; and has ownership interest (including patents) in Take2 Health and Grail. Y.M.D. Lo is a consultant at Grail; is an advisor at Decheng Capital; reports receiving a commercial research grant from Grail; has ownership interest (including patents) in Grail, DRA Limited, Take2 Holdings Limited, Take2 Technologies Limited, and Xcelom Limited; and has received other remuneration from Illumina, Sequenom, Xcelom, DRA Limited, and Grail. No potential conflicts of interest were disclosed by the other authors.
Authors' Contributions
Conception and design: K. Sun, K.C.A. Chan, R.W.K. Chiu, Y.M.D. Lo
Development of methodology: P. Jiang, K. Sun, W. Peng, S.H. Cheng, K.C.A. Chan, R.W.K. Chiu, Y.M.D. Lo
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): S.H. Cheng, P.C. Yeung, M.M.S. Heung, J. Wong, V.W.S. Wong, L.C. Poon, T.Y. Leung, W.K.J. Lam, J.Y.K. Chan, H.L.Y. Chan, Y.M.D. Lo
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): P. Jiang, K. Sun, W. Peng, S.H. Cheng, M. Ni, P.C. Yeung, T. Xie, Z. Zhou, K.C.A. Chan, R.W.K. Chiu, Y.M.D. Lo
Writing, review, and/or revision of the manuscript: P. Jiang, K. Sun, S.H. Cheng, V.W.S. Wong, L.C. Poon, T.Y. Leung, H.L.Y. Chan, R.W.K. Chiu, Y.M.D. Lo
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): K. Sun, P.C. Yeung, M.M.S. Heung, H. Shang, R.W.Y. Chan, V.W.S. Wong, J.Y.K. Chan
Study supervision: R.W.K. Chiu, Y.M.D. Lo
Others (approval of the final manuscript): T.Y. Leung
Acknowledgments
This work was supported by the Research Grants Council of the Hong Kong SAR Government under the Theme-based research scheme (T12-403/15-N and T12-401/16-W), a collaborative research agreement from Grail and the Vice Chancellor's One-Off Discretionary Fund of The Chinese University of Hong Kong (VCF2014021). Y.M.D. Lo is supported by an endowed chair from the Li Ka Shing Foundation.