Background. The current best predictor of lifetime risk for many forms of cancer is family history. Family history is transmitted through germline DNA, so analysis of germline DNA should be able to predict lifetime risk at least as well as family history. Genetic risk scores, used to predict lifetime risk from DNA, usually consist of linear combinations of SNPs. These scores ignore structural variation and epistatic effects (non-linear combinations). The objective of this study is to test whether germline DNA structural information, together with machine learning algorithms that can use non-linear combinations, can be used to predict whether a person will develop different types of cancer.

Methods. We pursued this objective using data compiled by The Cancer Genome Atlas (TCGA) project and the UK Biobank. The TCGA data consisted of information on chromosomal-scale length variation in DNA extracted from the peripheral blood of 8,821 patients, each of whom had one of 32 different types of cancer. The UK Biobank data consisted of about 1,500 women diagnosed with breast cancer (cases) and about 5,000 women with no record of any type of cancer (controls). We characterized structural variation by a quantitative measure, chromosomal-scale length variation, which is computed from array data. A person's DNA (chromosomes 1-22) can be characterized by a series of 22 numbers, each representing the log base 2 ratio of the chromosome's length to the average chromosome length. Using these 22 numbers for each person, we set up a machine learning classification problem to differentiate people diagnosed with a particular form of cancer from those who have not been diagnosed with that cancer. We used the h2o libraries in R to test how well different machine learning algorithms can classify these datasets.
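
The 22-number representation described above can be illustrated with a minimal Python sketch. It assumes that a length estimate per chromosome is already available for each person and that "average chromosome length" means the cohort mean for that same chromosome; the abstract does not specify either detail, and the function name `length_variation_features` is hypothetical.

```python
import math

def length_variation_features(lengths_by_person):
    """Convert per-chromosome length estimates into log2 ratios.

    lengths_by_person maps a person ID to a list of length estimates,
    one per chromosome (22 in the study; any count works here).
    ASSUMPTION: the reference is the cohort mean for each chromosome.
    """
    people = list(lengths_by_person)
    n_chrom = len(lengths_by_person[people[0]])
    # Mean length of each chromosome across everyone in the cohort.
    means = [
        sum(lengths_by_person[p][c] for p in people) / len(people)
        for c in range(n_chrom)
    ]
    # One feature per chromosome: log2(person's length / cohort mean).
    return {
        p: [math.log2(lengths_by_person[p][c] / means[c]) for c in range(n_chrom)]
        for p in people
    }

# Toy example with two people and two chromosomes (illustrative values only).
features = length_variation_features({"a": [1.0, 2.0], "b": [3.0, 2.0]})
```

A value of 0 means the chromosome matches the cohort average, and negative or positive values mean it is shorter or longer than average. Each person's resulting list would serve as the feature vector for a classifier.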

Results. In two independent datasets, TCGA and the UK Biobank, we could classify whether or not a patient had a certain cancer based solely on chromosomal-scale length variation. We found that all 32 types of cancer tested in the TCGA dataset could be predicted better than chance using structural variation data. Specifically, in the TCGA dataset we measured the area under the receiver operating characteristic curve (AUC) for ovarian cancer in women (0.89), glioblastoma multiforme (0.86), breast cancer in women (0.75), and colon adenocarcinoma (0.79), with a 95% confidence interval width of less than 0.01 in each case. This method could predict 10% of glioblastoma multiforme cases with less than 1 in 10,000 false positives in the TCGA dataset. In the UK Biobank dataset, we could predict breast cancer in women with an AUC of 0.83.
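
The two kinds of figures reported above, an AUC and the fraction of cases detected at a fixed false positive rate, can be computed from classifier scores as in the following Python sketch. It is not the study's code (which used h2o in R); the function names and the toy scores are hypothetical, and the AUC is computed by direct pairwise comparison, which is equivalent to the area under the curve but practical only for small samples.

```python
def auc(scores, labels):
    """Probability that a random case scores higher than a random control.

    labels use 1 for cases and 0 for controls; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def sensitivity_at_fpr(scores, labels, max_fpr):
    """Largest fraction of cases detected while the false positive
    rate among controls stays at or below max_fpr."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    # Lower the decision threshold step by step until too many
    # controls would be flagged.
    for t in sorted(set(scores), reverse=True):
        fpr = sum(n >= t for n in neg) / len(neg)
        if fpr > max_fpr:
            break
        best = sum(p >= t for p in pos) / len(pos)
    return best

# Toy data (illustrative only): four cases and four controls.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(auc(scores, labels))                      # 0.8125
print(sensitivity_at_fpr(scores, labels, 0.0))  # 0.5
```

In this framing, the glioblastoma result corresponds to a sensitivity of about 0.10 when `max_fpr` is set to 0.0001.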

Conclusion. Genetic risk scores based on structural variation can effectively predict whether a person will eventually develop many different types of cancer.

Citation Format: Chris Toh, Charmeine Ko, James P. Brody. Genetic risk scores from structural variation predict life-time individual cancer risk [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27-28 and Jun 22-24. Philadelphia (PA): AACR; Cancer Res 2020;80(16 Suppl):Abstract nr 2317.