Background: Accurate identification of somatic variants in a tumor sample often relies on a paired normal tissue sample from the same patient, which allows private germline mutations to be separated from somatic variant calls. However, a paired normal sample is not always available, making accurate somatic variant analysis more challenging. Composite proxy normals and other filtering approaches can be used in lieu of a paired normal sample, but the resulting somatic call set may suffer from incomplete germline filtering and reduced sensitivity compared to paired tumor-normal analysis. To address these limitations, we developed a novel, machine-learning-based tumor-only somatic small variant classifier, which leverages gradient-boosted decision trees to substantially increase somatic variant specificity in tumor-only analysis without reducing overall sensitivity.

Methods: We produced a ground truth set of somatic SNVs and indels from 350 whole-exome-sequenced tumor-normal pairs using a validated cancer bioinformatics pipeline. We then generated a feature set from each tumor sample by aggregating pileup attributes (allele frequency and read depth), tumor cellularity estimates, germline variant calls from HaplotypeCaller, somatic variant calls from MuTect and Mutect2 using a proxy normal, copy-number alterations, annotations from databases such as gnomAD and COSMIC, and problematic-region annotations including homopolymers. Using these features and the ground truth set, we trained a gradient-boosted decision tree model to predict the somatic likelihood of each variant. Model hyperparameters were optimized using a random search during stratified cross-validation, and model performance was evaluated on a hold-out test set.
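The training procedure described above might be sketched roughly as follows. This is an illustration only, not the pipeline used in this work: it assumes scikit-learn's GradientBoostingClassifier as a stand-in for the unspecified gradient-boosting implementation, uses a synthetic feature matrix in place of the real variant features, and the hyperparameter ranges, fold count, and scoring metric are all hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import (
    RandomizedSearchCV,
    StratifiedKFold,
    train_test_split,
)

# Synthetic stand-in for the per-variant feature matrix (assumption):
# label 1 = somatic, label 0 = germline or artifact.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)

# Hold out a stratified test set for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical search space; the abstract does not list the real one.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
    "subsample": [0.6, 0.8, 1.0],
}

# Random search over hyperparameters, scored within stratified folds.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="f1",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X_train, y_train)

# Somatic probability for each held-out variant (column 1 = class 1).
somatic_prob = search.predict_proba(X_test)[:, 1]
```

Stratified splitting keeps the somatic-to-germline ratio similar across folds, which matters when true somatic variants are a small minority of candidate calls.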

Results: Using a classification threshold that optimized F1 score on the validation set, we observed a significant increase in precision on the test set, with sensitivity comparable to that of somatic calling using a conventional proxy-normal filtering approach. Because our model outputs a somatic probability, the classification threshold can be tuned to favor sensitivity or specificity of the call set, depending on the desired use case. To improve interpretability of our model, we employed Shapley additive explanations (SHAP) to obtain feature importance values. Our analysis revealed that features such as population frequency and base quality scores were among the most important.
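As an illustration of the threshold selection described above, the following minimal sketch picks the probability cutoff that maximizes F1 on a validation set. The function names and the toy probabilities and labels are hypothetical and are not taken from this work.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 score when variants with probability >= threshold are called somatic."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(probs, labels):
    """Return the observed probability that gives the highest F1 as a cutoff."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))


# Toy validation set (illustrative values only): 1 = somatic, 0 = not somatic.
val_probs = [0.05, 0.20, 0.35, 0.60, 0.70, 0.90]
val_labels = [0, 0, 1, 0, 1, 1]

threshold = best_f1_threshold(val_probs, val_labels)
print(threshold)
```

Raising the cutoff above the F1-optimal value trades sensitivity for precision, and lowering it does the reverse, which is the tuning behavior the abstract refers to.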

Conclusions: Our machine learning approach can greatly enhance germline filtering in somatic variant calling when a paired normal sample is not available, without decreasing sensitivity for true somatic variants. Depending on the use case, classification thresholds can be tuned to improve sensitivity over conventional variant callers in exchange for more modest improvements in precision. Finally, model interpretation revealed a subset of highly discriminative features, which may prove useful for variant interpretation, future feature set expansion, or model tuning.

Citation Format: Nicholas Phillips, Patrick Jongeneel, John West, Richard Chen, Jason Harris. Improved tumor-only somatic variant calling using a gradient boosted machine learning algorithm [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27-28 and Jun 22-24. Philadelphia (PA): AACR; Cancer Res 2020;80(16 Suppl):Abstract nr 852.