Inherited factors are thought to be responsible for a substantial fraction of breast cancers, but only a small fraction of breast cancers are caused by mutations in known genes such as BRCA1. Most current genetic risk scores for breast cancer focus on the additive effects of single nucleotide polymorphisms (SNPs) found in germline DNA. This approach ignores epistatic interactions between different genes. We sought to use modern machine learning algorithms to identify complex patterns in germline DNA that are correlated with the incidence of breast cancer, a strategy that should capture epistatic interactions. We looked for these patterns in measurements of chromosomal-scale length variation of germline DNA. These measurements represent the sum of many insertions, deletions, and copy number variations across the chromosome. We evaluated several machine learning models, including a generalized linear model, a distributed random forest, a gradient boosting machine, and a deep learning model. We evaluated these models on two large independent datasets, The Cancer Genome Atlas (TCGA) and the UK Biobank. We found that chromosomal-scale length variation of germline DNA, in combination with machine learning models, can be used to predict whether a woman will develop breast cancer. In the TCGA dataset, we analyzed 968 women who had been diagnosed with breast cancer (cases) and 3715 women who had not been diagnosed with breast cancer (controls). We found that a gradient boosting algorithm could differentiate between these two classes of women with an AUC of 0.73. In the UK Biobank dataset, we identified 1534 women who had been diagnosed with breast cancer (cases) and compared them with a set of 4391 women who had no report of any type of cancer (controls). In the UK Biobank dataset, we could distinguish between the two classes of women with an AUC of 0.83.
In conclusion, we found that artificial intelligence algorithms could be used to develop effective genetic risk scores for breast cancer.
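
For readers unfamiliar with the AUC values reported above, the following is a minimal sketch of how an area under the ROC curve can be computed from case/control labels and model scores. It is not the authors' pipeline: the function name `roc_auc` and the toy labels and scores are hypothetical, chosen only to illustrate the metric. The AUC equals the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Return the AUC for binary labels (1 = case, 0 = control).

    Computed directly as the fraction of (case, control) pairs in which
    the case has the higher score; tied pairs contribute 0.5.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical toy data (not from the study): 3 cases, 4 controls.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)
```

An AUC of 0.5 corresponds to scores that carry no information about case status, and 1.0 to perfect separation. The pairwise loop is quadratic in the number of samples, so it suits illustration rather than datasets of the size described above, where a rank-based implementation would be preferable.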

Citation Format: Charmeine Ko, Christopher Toh, James P. Brody. Genetic risk scores for breast cancer based on machine learning analysis of chromosomal-scale length variation [abstract]. In: Proceedings of the AACR Virtual Special Conference on Artificial Intelligence, Diagnosis, and Imaging; 2021 Jan 13-14. Philadelphia (PA): AACR; Clin Cancer Res 2021;27(5_Suppl):Abstract nr PR-09.