Mutation detection via genomic sequencing has become routine in cancer diagnosis and precision treatment. Although existing bioinformatics approaches differ in their detection algorithms and strategies, they all rely on combinations of statistical features computed over a set of overlapping reads to distinguish real mutations from false positives. The Achilles heel of such recognition mechanisms is that their preset thresholds are often rigid and ambiguous, whereas feature combinations with adaptive thresholds can only be enabled by machine learning frameworks. Unfortunately, learning models must overcome the challenges caused by tumor heterogeneity. For mutations from different subclones, most sequencing features are associated with the tumor purity and clonal proportions. This introduces complicated interactions among the features and breaks the independent and identically distributed (i.i.d.) assumption of classic learning models. In addition, both the tumor purity and the clonal proportions vary, so no training set can enumerate all possible values. These challenges hurt the specificity of existing approaches when they are applied to cancer sequencing data. Here, we propose an approach for scenarios with varying tumor purity and clonal proportions. The proposed approach incorporates a comprehensive set of features drawn from the existing strategies. It requires at least two training sets with different proportions; within any given set, the tumor purity and clonal proportions are fixed. The framework first trains models on one set. The trained models capture the associations between the features and true mutations, and are defined as a source domain. Next, when the other set is input for training, the framework not only trains another source domain but also learns the transformations of the features between the source domains.
Now, when another fixed combination of tumor purity and clonal proportions is considered, the framework is able to generate, at least approximately, the models for the new combination from the source domains and the transformations. To enhance performance, we propose integrating several source domains to control the systematic errors introduced during the transfer process. The Boyer-Moore majority-vote algorithm is used to achieve the integration. We carried out a series of experiments on both simulated and real datasets and compared the proposed method with state-of-the-art approaches, including MuTect2, Sentieon, VarScan2, Freebayes and SiNVICT. The results demonstrate that the proposed method adapts well to differently diluted sequencing signals and can significantly reduce false positives. It is implemented as TransVAF. The software package is available for academic use only.
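
The integration step can be illustrated with a minimal Python sketch of the Boyer-Moore majority-vote algorithm. The abstract does not describe how TransVAF applies the vote, so this sketch assumes that each transferred model emits one call per candidate site and that a call is accepted only when a strict majority of models agree. The names `boyer_moore_majority` and `integrate_calls` are hypothetical and are not taken from the TransVAF package.

```python
from typing import Hashable, List, Optional, Sequence


def boyer_moore_majority(votes: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the label held by a strict majority of votes, or None if there is none."""
    candidate = None
    count = 0
    # Pass 1: cancel out pairs of differing votes. If a strict majority
    # exists, it is the candidate left standing at the end of the pass.
    for vote in votes:
        if count == 0:
            candidate = vote
            count = 1
        elif vote == candidate:
            count += 1
        else:
            count -= 1
    # Pass 2: the surviving candidate is only a guess, so confirm that it
    # really occurs in more than half of the votes.
    if 2 * sum(1 for vote in votes if vote == candidate) > len(votes):
        return candidate
    return None


def integrate_calls(calls_per_model: Sequence[Sequence[Hashable]]) -> List[Optional[Hashable]]:
    """Combine per-site calls from several models (assumed layout: one list per model)."""
    # zip(*...) regroups the calls so that each tuple holds all votes for one site.
    return [boyer_moore_majority(site_votes) for site_votes in zip(*calls_per_model)]


if __name__ == "__main__":
    # Three hypothetical transferred models, each calling the same three sites.
    calls = [
        [True, False, True],
        [True, False, False],
        [False, False, True],
    ]
    print(integrate_calls(calls))
```

The first pass runs in linear time with constant extra memory; the second pass is needed because the first pass always yields a candidate, even when no label holds a strict majority. With an even number of models a tie returns `None`, so a caller would have to decide how to treat such sites.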

Citation Format: Tian Zheng, Jiayin Wang, Xiao Xiao, Xiaoyan Zhu, Xuanping Zhang, Xin Lai, Yanfang Guan, Xin Yi. TransVAF: A transfer learning approach for recognize genomic mutations with various tumor purity and clonal proportions [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2021; 2021 Apr 10-15 and May 17-21. Philadelphia (PA): AACR; Cancer Res 2021;81(13_Suppl):Abstract nr 255.