Probabilistic Mixture Models Improve Calibration of Panel-derived Tumor Mutational Burden in the Context of both Tumor-normal and Tumor-only Sequencing

Background: Tumor mutational burden (TMB) has been investigated as a biomarker for immune checkpoint blockade (ICB) therapy. Increasingly, TMB is being estimated with gene panel–based assays (as opposed to full exome sequencing) and different gene panels cover overlapping but distinct genomic coordinates, making comparisons across panels difficult. Previous studies have suggested that standardization and calibration to exome-derived TMB be done for each panel to ensure comparability. With TMB cutoffs being developed from panel-based assays, there is a need to understand how to properly estimate exomic TMB values from different panel-based assays. Design: Our approach to calibration of panel-derived TMB to exomic TMB proposes the use of probabilistic mixture models that allow for nonlinear relationships along with heteroscedastic error. We examined various inputs including nonsynonymous, synonymous, and hotspot counts along with genetic ancestry. Using The Cancer Genome Atlas cohort, we generated a tumor-only version of the panel-restricted data by reintroducing private germline variants. Results: We were able to model more accurately the distribution of both tumor-normal and tumor-only data using the proposed probabilistic mixture models as compared with linear regression. Applying a model trained on tumor-normal data to tumor-only input results in biased TMB predictions. Including synonymous mutations resulted in better regression metrics across both data types, but ultimately a model able to dynamically weight the various input mutation types exhibited optimal performance. Including genetic ancestry improved model performance only in the context of tumor-only data, wherein private germline variants are observed. Significance: A probabilistic mixture model better models the nonlinearity and heteroscedasticity of the data as compared with linear regression. Tumor-only panel data are needed to properly calibrate tumor-only panels to exomic TMB. Leveraging the uncertainty of point estimates from these models better informs cohort stratification in terms of TMB.


Introduction
Immune checkpoint blockade (ICB) has received FDA approval for several cancer types along with any cancer with mismatch-repair deficiency (1). However, not all patients respond and some may experience adverse events (2), necessitating the search for biomarkers that can predict response to ICB (3). ICB tends to be more effective in cancer types with higher tumor mutational burden (TMB; ref. 4), making it reasonable to explore TMB as a biomarker (5)(6)(7)(8)(9). These they are already part of standard of care to identify potential drug targets.
When panel-restriction experiments have been performed with exome data, the panel-derived TMB (panel TMB) values generally correlate well with the exome-derived TMB (exomic TMB; refs. 14,15), justifying the use of panel data to estimate TMB. Importantly, the use of a panel-based assay for characterizing likelihood of ICB response has been approved by the FDA (16,17).
Gene panels were not specifically designed to accurately measure TMB; rather, they were repurposed to estimate this measure, and the approval of panel TMB as a biomarker in the context of ICB response has come with some controversy (18). Panels are biased to include genes commonly mutated in cancer, have higher depth of sequencing than exome data, and often do not have a paired normal sample from the patient to remove germline variants. Each of these factors biases panels toward reporting a higher value of TMB than would be seen with tumor-normal exome sequencing, using conventional approaches.
Furthermore, it is unclear how to properly translate TMB cutoffs determined from a given gene panel assay to another. These issues are widely recognized and an international consortium has been created with the goal of harmonizing TMB values (19).
In an ideal scenario, each assay would be calibrated using a reference set of samples that contains paired exome and panel sequencing. However, it is difficult and expensive to generate this data at the scale needed for the variety of assays utilized today. As such, most calibration efforts have been performed by computationally restricting conventional tumor-normal exome sequencing to a given gene panel's genomic footprint. Using this approach with the tumor-normal exome sequencing from The Cancer Genome Atlas (TCGA), researchers have modeled the uncertainty of TMB estimation that is inherent to subsampling and gene biases (20,21). But, these efforts have not directly considered differences between germline-subtracted tumor-normal and tumor-only based approaches along with the contributions of differences in read depth between exome and gene panel-based sequencing. Some more recent studies have begun to examine data processing variability by deriving panel TMB estimates from TCGA MC3 variant calls using each laboratory's specific pipelines (22).
When TMB is calculated without a matched normal sample, laboratories rely on various population databases to filter out germline variants. Laboratories may also use algorithms that try to distinguish somatic and germline using the variant allele frequencies (23). Regardless of the approach, there will usually be some error in the ability to accurately identify all germline variants, resulting in so-called "private" germline variants contributing to mutational burden estimates in the context of tumor-only sequencing. This artifact cannot be directly accounted for when using germline-subtracted tumor-normal somatic mutation data such as the commonly used TCGA MC3 data. We suggest the application of models calibrated on germline-subtracted tumor-normal data to tumor-only assays needs to be carefully considered.
Herein we address two of the key sources of bias in gene panel data: (i) regional subsampling of the genome in areas more commonly mutated in cancer, as has been characterized previously, (ii) artifactual contribution of "private" germline variants to TMB in tumor-only sequencing. Importantly, in contrast to other approaches, we leverage a probabilistic regression framework that can model nonlinearity and heteroscedastic, asymmetric error distributions (24). At very low TMB, panel data initially underestimate exomic TMB, then overestimate exomic TMB because of the enrichment of genes commonly mutated in cancer, and then at higher values is a closer approximation to exomic TMB. This results in a nonlinear relationship between panel TMB and exomic TMB and complex/non-normal error distributions. Furthermore, we aim to highlight the use of model uncertainty in guiding the interpretation of panel-derived TMB estimates.

Modeling Approach
Our models were constructed in TensorFlow as fully connected neural networks with the general structure of an input predictor vector X followed by three hidden layers of size 128, 64, 32 each having softplus activation. The final output layer represents the parameters of the mixture model (means, variances, and mixture weights for each log-normal component), which is then fed into a TensorFlow Probability layer. Each input is represented by a vector ranging from just the panel TMB value to a spectrum of counts across variant types in addition to genetic ancestry. Because the median of the modeled distribution was used as the point estimate, we use mean absolute error (MAE) and Spearman rank order-based correlation to measure model performance. Note, models that are monotonically related will report the same rank order-based correlation metrics. The reported model performance metrics are based on the test folds from k-fold cross-validation. Additional details of the modeling procedure can be found in the publicly available code repository (https://github.com/OmnesRes/DeepTMB, archived at Zenodo: https:// doi.org/10.5281/zenodo.7574237).

TMB Calculation
TCGA exomic data were generated by different laboratories across multiple years with different technologies and protocols, resulting in variation in the quality of each sample, including depth of sequencing and exome coverage.
Given that TMB is defined as mutations per megabase (Mb), differences in per sample genomic coverage could be potentially problematic. Instead of assuming a common size of each sample's captured exome, we utilized the available To generate synthetic tumor-only panel data, we reintroduced TCGA Pan-CanAtlas germline calls into the somatic MC3 calls and applied a spectrum of filters to the combined data with the commonly used gnomAD population databases. These counts are available in Supplementary Data S1.

Data Availability Statement
Publicly available data generated by others were used by the authors. TCGA somatic calls (26) were obtained from https://gdc.cancer.gov/aboutdata/publications/mc3-2017. TCGA germline calls (27) were obtained from

Model Fit
Ordinary linear regression models attempt to model y via a linear combination of X predictors via optimizing the mean squared error (MSE). These linear models make the assumption that the distribution of the residuals is normally distributed and homoscedastic. Figure 1A and B show that the assumptions of homoscedasticity and normally distributed error are violated when regressing TMB. Moreover, in Fig. 1A and B, we can appreciate that the "residuals are correlated" in that the linear model predictions tend to slightly overestimate exomic TMB at lower input values and then significantly underestimate at higher input values, violating another assumption in linear regression. These issues are worse in the tumor-only data as compared with the germline-subtracted tumor-normal data ( Fig. 1B and A, respectively).
Minimization of the MSE can be reposed as minimizing the negative log likelihood of fitting a normal distribution to model the observed y at a given input X, but with fixed variance across all X (as presented in Fig. 1A). In our approach, we aim to minimize the negative log likelihood of fitting a mixture model of log-normal distributions to estimate the distribution of observed y at a given input X. As can be seen in the second column of Fig. 1 (under the "TMB Distributions" header), the observed exomic TMB (y) shows a clear right skew at lower input values and then shifts to a more symmetric, unimodal distribution (albeit not necessarily normal distribution) at higher input values. This highlights the need to leverage more flexible models of regression that don't depend on the assumptions of ordinary linear regression. Figure 1A shows the result of fitting a normal distribution (with fixed variance) as a linear function of X to tumor-normal data restricted to the FoundationOne CDx coordinates, which is effectively equivalent to a linear regression that minimizes MSE. The input-output plot shows the data in log-transformed space, allowing for better appreciation of the features of low to intermediate TMB values (exomic TMB < 10). These data would otherwise be compressed and difficult to fully represent and visualize when plotted with rarer high TMB values present in TCGA data. Linear models of exomic TMB based on panel TMB input poorly estimate the central tendency of the label across the range of inputs and incorrectly estimate the error distribution (as described above).
To better account for the complexities in these data, we model the distribution of the observed exomic TMB at any given input value using a mixture of log-normal distributions. This modeling approach explicitly allows for heteroscedasticity, as can be appreciated in Fig. 1C and D. This approach better removes any dependency of the residuals on the input values and far better models the distribution of the observed exomic TMB at different input values. The right skew of lower values is well modeled while still allowing for an approximately normally distributed error at higher values. In particular, the effect of private variants on the distribution can be appreciated and characterized in Fig. 1D from tumor-only data as compared with germline-subtracted tumornormal data in Fig. 1C.
In considering thresholds based on panel TMB to partition patients into different strata, the common strategy has been to do so based on point estimates from panel-based TMB measures. Such an approach cannot fully account for the uncertainty behind these point estimates of exomic TMB. This will result in particular difficulty in accurately distinguishing values near the chosen threshold. To combat this, one could leverage model uncertainty estimates; but these need to model the distribution of the data correctly. By better modeling the distribution of exomic TMB at varying input panel TMB, the proposed modeling approach is better suited to leverage model uncertainty than ordinal linear regression.
When considering the exomic TCGA MC3 data and the FoundationOne CDx coordinates, a threshold of 10 for panel TMB results in the following detection rates for true exomic TMB  Table S1).
The use of panel TMB or point estimates from models does not leverage the uncertainty/variability of the data to make a better stratification. Because we have modeled the distribution of y given X, we can stratify the data based on the 95% probability of model output being above or below a given threshold (as opposed to the median or mean of the distribution being greater than a threshold). This is represented in Fig. 1 in which the nonshaded areas show where model output can be said to be greater than a target of 10 with 95% confidence (to the right of the shaded areas) or less than a target of 10 with 95% confidence (to the left of the shaded areas). This results in a tripartite stratification in which the two strata with 95% confidence (roughly 80% of the data) result in NPV/PPV ranging from 95% to 99% (Supplementary Table S1). Of particular note, the linear model is not able to properly model the distribution of the data and as such its model uncertainty estimates cannot be used to accurately gauge the confidence of the model, and this was particularly evident in tumor-only data. The overall theme of these findings is not specific to the chosen target value of 10.  so-called "hotspot" mutations may be excluded. Using our mixture modeling approach (as presented in Fig. 1), we investigated how different possible inputs (synonymous/nonsynonymous/hotspots) across three different datasets affect model performance in estimating exomic TMB. While not currently employed by any protocols to our knowledge, given that different ancestries have varying degrees of intrinsic germline diversity and representation in the databases used to filter out germline variants from tumor-only data, we also investigated the use of genetic ancestry as an additional input into the model. Table 1 reports metrics of model performance (MAE and Spearman rho) from 5-fold cross-validation (for a figure representation, see Supplementary Fig. S1).

Effect of Panel Inputs
As has been well-described previously, panel TMB tends to overestimate In contrast, including synonymous mutations universally improved MAE metrics. Including ancestry into the model improved the metrics of tumor-only data but had little effect on tumor-normal data, which is somewhat expected in the context of bona fide somatic mutations of tumor-normal data. Finally, we provided the model with a vector input of counts of all the variant types (nonsynonymous/all mutations/hotspots) along with ancestry allowing the model to contextualize the relative importance of these inputs. This formulation uniformly led to the best regression metrics in cross-validated assessments of performance ( Table 1).

Effect of Tumor-only Sequencing
Tumor-only sequencing lacks a matched germline control, resulting in a variable number of rare/private germline variants that cannot be identified as germline using even the largest population databases or cutting-edge bioinformatics techniques. The inclusion of these private germline variants results in bias (increased number of variants) and noise (variability in the number of increased variants per sample) being introduced in the estimation of the true somatic exomic TMB. Of particular concern given much of the recent litera-ture, if a model is trained on tumor-normal data (such as TCGA MC3) and then applied to tumor-only panel data (which would include private germline variants) then biases of tumor-only data will not have been accounted for, thereby affecting model performance ( Fig. 2A vs. Fig 2B and C). To better account for this, we constructed a synthetic tumor-only dataset from TCGA data by reintroducing germline variants into TCGA somatic mutation profiles that could not be filtered out by common strategies across a spectrum of stringencies using the gnomAD database, mimicking common filtering processes used with real-world tumor-only data.
We used gnomAD 2.1 non-cancer exome and gnomAD 2.1 genome as population databases, and considered TCGA itself to be a population database of sorts. Using a self-cohort as a population database has the additional benefit of removing potential high frequency artifacts in the data, as is commonly done in the context of individual laboratories. We applied both permissive and stringent criteria for filtering out germline variants using the maximum population specific allele frequency (popmax_AF greater than 1% vs. 0.1%) in the gnomAD 2.1 non-cancer exome and gnomAD 2.1 genome population databases in conjunction with the overall allele frequency (AF greater than 0.1% vs. 0.01%). These same popmax_AF thresholds were also applied to TCGA "self-cohort" with known hotspots whitelisted (30).  When using the tumor-only dataset we generated to estimate the true somatic, exomic TMB from panel TMB (Fig. 1B and D), the introduction of private germline variants results in a bias (artificially inflating panel TMB estimates) and adding noise (increasing the variability of the exomic TMB distribution for a given tumor-only panel TMB value). The proposed mixture modeling approach is able to better fit the central tendency of the data (as can be appreciated when comparing the residuals in Fig. 1D vs. Fig. 1B); in addition, it is able to better characterize the distribution of exomic TMB, which displays a wider and more asymmetric distribution as compared with tumor-normal matched germline-subtracted data (Fig. 1D vs. Fig. C, respectively).
The number of private germline variants per sample at a given filtering strategy in these data is also dependent on genetic ancestry due to the genetic diversity of different ancestral populations in conjunction with the degree to which they are represented in the databases that are queried ( Supplementary Fig. S2). As such, we observed slightly different regression models for the different genetic ancestries present in these data ( Supplementary Fig. S2) and that inclusion of genetic ancestry as an input into the model improved model performance more consistently in the context of tumor-only data (which contain private germline variants; Supplementary Table S2, N.B.: even in the context of tumor-normal data different groups may have different TMB distributions, Supplementary  Fig. S3).

Proposed Clinical Utility
When deciding whether to prescribe a treatment, clinicians must weigh the likelihood of a patient responding versus the potential risk of the treatment. We suggest that another consideration should be taken into account: the measurement error in the prognostic indicator. To demonstrate what this might look like, we leveraged the associated clinical data for TCGA samples (35). In this dataset, exomic TMB has been shown to be a positive prognostic indicator for bladder urothelial carcinoma (BLCA; ref. 36), and using these samples we identified the optimal exomic TMB cutoff as 6.7. We used our stringently filtered tumor-only data to train models with the BLCA samples excluded, and as can be seen in Fig. 3A although our mixture model was not trained on any of the BLCA samples, its central tendency is a good estimate for the BLCA data and the data are contained within the confidence intervals. In contrast, the linear model's central tendency is clearly biased and its confidence intervals are not applicable (Fig. 3B), mirroring what we saw in Fig. 1. The inability of the linear model to properly represent the data results in a higher degree of uncertainty (gray area) for being able to confidently (95%) predict a sample as being greater or less than the exomic TMB-derived threshold. In contrast, our proposed mixture model both better estimates the central tendency of this regression and importantly also models the variable degree of uncertainty across the data. This affords our approach a smaller region of data in which it cannot confidently distinguish a sample from the exomic-derived threshold. Ultimately, our model is better able to stratify the observed outcome data based on where each model is 95% confident that the exomic value is greater than the exomic-derived threshold ( Fig. 3C vs. Fig. 3D).

Discussion
Given the nonlinear relationship between panel-derived TMB measures and exomic TMB along with the complex error distributions of exomic TMB at different panel-derived TMB ranges, a more flexible modeling approach than linear regression is required to correctly model the data, which we show in Fig. 1. In this study, we developed a probabilistic model that outputs a probability distribution of the model output (y) for any given input X. From these distributions users can easily get the mode, mean, median, or any desired quantile of the probability distribution. Although use of the appropriate modeling strategy is important, it is even more important that every effort be made to ensure the training data resemble the data the model will be applied to. In this study, we underscored how significant this can be in the case of tumor-only data and characterized this by reintroducing private germline variants to TCGA MC3 samples to model tumor-only data.
Previous studies have investigated the performance of different filtering approaches for tumor-only sequencing data, often with the goal of accurately filtering out germline variants (37)(38)(39)(40)(41)(42)(43)(44). Our goal was not to find the optimal filtering algorithm, but to simply use a reasonable spectrum of filtering approaches and characterize the effect that introducing private germline variants would have in the context of calibrating panel TMB measures to exomic TMB.
Using the tumor-only data we constructed from TCGA data, the findings would suggest a more stringent filtering strategy would favor more accurate exomic TMB calibration even if that resulted in filtering out some true somatic variants. This is compatible with the notion of increasing the signal (somatic mutations) Fit of a linear model with unseen BLCA samples plotted. In both A and B to the right of the gray area, the model is at least 95% confident that the data are greater than the cut-off value of 6.7. C, Kaplan-Meier survival curves for BLCA samples which the mixture model could say with at least 95% confidence were above the optimal cutoff versus not. D, Kaplan-Meier survival curves for BLCA samples which the linear model could say with at least 95% confidence were above the optimal cutoff versus not.
to noise (germline mutations) ratio given that the target is somatic exomic TMB and there is a large excess of germline variants in relation to somatic ones. One should note however that these somatic variants filtered out under stringent criteria for the purpose of TMB calibration could still be analyzed for other purposes and interpretations, as appropriate.
When using tumor-normal data for calibration, there is an inherent assumption that no private germline variants will be present in samples the model will be applied to. However, the majority of academic and commercial labs that perform panel-based somatic mutation profiling use tumor-only approaches, which will result in the inclusion of varying degrees of private germline variants. Population databases alone will unlikely be able to perfectly identify all germline variants from tumor-only data; as such, additional strategies would need to be applied to remove these private germline variants in a manner that does not remove true somatic variants. The tumor-only data we generated from TCGA using the genomic coordinates of every panel in AACR GENIE constitutes a labeled training dataset at the level of the variants (somatic vs. germline) to support the development of such algorithms, which will be the focus of a separate study; the aggregate variant counts used in the current study are available (Supplementary Data S1). Interestingly, we show that the amount of private germline variants is dependent on genetic ancestry which has implications on how to estimate exomic TMB from panel TMB.
In addition to often being tumor only, panel data are often sequenced deeper than exomic data. The deeper sequencing has the effect that more mutations will be called with panel data (45). Additional factors that can affect the number of called mutations include sample purity and clonality (46,47). While these additional factors would ideally be taken into account when calibrating TMB to a unified standard, the focus of this article was to model the distributions of tumor-normal and tumor-only data, the shape of which will not change if additional mutations are added to both the input predictors and the target.
It is important to not only accurately estimate TMB from panel data, but to also be able to model the uncertainty of such estimates and then use them to inform the use of the model estimates, particularly to consider flagging estimates with high uncertainty. We acknowledge that current thresholds on panel TMB are derived from association to outcome/treatment response and not necessarily correlation with exomic TMB. However, it should be noted that panel TMB around 10 tends to show the highest degree of variability in terms of what the true exomic TMB is. As such, panel TMB thresholds may need to be considered in light of the variability behind them if conceptually they are being considered an estimate of the true exomic TMB, which can be observed exactly if needed. The importance of characterizing uncertainty/variability of estimates underscores the need to use models more expressive than linear models that can account for the nonlinearity and heteroscedasticity present in these data.
AACRJournals.org Cancer Res Commun; 3(3) March 2023 507 Our proposed mixture modeling approach serves this purpose well and can allow for a dynamic set of inputs spanning various variant tabulations (synonymous/nonsynonymous/hotspot counts) and sample level properties such as genetic ancestry information as well.