In this issue of Blood Cancer Discovery, Wintersinger and colleagues present a new algorithm for quickly and accurately inferring clonal phylogenies from heterogeneous tumors sampled at many timepoints and/or many sites. When coupled with serial sequencing of tumors, this advance promises to increase our understanding of the clonal dynamics that shape tumor evolution and response to therapy.

See related article by Wintersinger et al., p. 208 (9).

The complexity of cancer is amplified by the fact that tumors contain a milieu of genetically defined subpopulations. These subclones arise from the accumulation of mutations in individual tumor cells, and describing their evolutionary dynamics is essential to understanding facets of cancer ranging from precancerous cellular expansions (as in clonal hematopoiesis) to therapy resistance and immune evasion. Genomic sequencing of a tumor offers a snapshot of its clonal composition, and compiling serial samples allows for the inference of trajectories of individual subclones. Doing so offers exciting opportunities to understand intratumor heterogeneity and clonal evolution, but translating large, noisy, and sometimes ambiguous sequence data into interpretable clonal phylogenies is a challenge that requires computational assistance.

The first generation of computational methods for querying tumor clonality arose over the last decade and these tools generally offered two types of insights: first, clonal assignments—inferring the clusters of driver and passenger mutations that are markers for distinct genetic subpopulations of cells—and second, clonal phylogenies (or trees)—inferring the order of mutational acquisition and the expansion or contraction of populations over time. Tools such as PyClone, SciClone, PhyloSub, and others have been used thousands of times; when applied to increasingly large and robust cohorts, these algorithms have dramatically reshaped our understanding of tumor evolution (1–3).

Constraints on sample collection and sequencing throughput limited early studies to low-resolution views of tumors, often with only two timepoints: one from presentation, and one from relapse. Newer initiatives are seeking to sample tumors with greater frequency, and groups are collecting large sets of spatially or temporally separated samples from individual patients. These include studies of hematologic malignancies (4), often looking at progression and evolution through therapy, as well as solid tumor research, where both primary tumors and metastases can be sampled and/or subdivided many times to explore spatial heterogeneity (5, 6). Early computational tools, however, were ill-equipped to handle the complexity that arises when assaying dozens of samples from the same patient, each potentially containing many subclones. A second generation of algorithms has emerged that more robustly handles larger sample sets, as well as complications like low tumor cellularity and subclonal copy number alterations (7, 8). However, many do not scale well, requiring unpalatable amounts of time and computational resources. This is due, in part, to the massive number of possible clonal orderings, which grows exponentially as additional samples and subclones are introduced.
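To give a rough sense of that growth, the sketch below counts candidate tree topologies for a fixed number of subclones. It assumes a simplified setting that is not taken from the article: each subclone is a labeled node, node 0 is a fixed root standing for the normal (germline) population, and any rooted tree over those nodes is a candidate. Under that assumption Cayley's formula gives (K + 1)^(K - 1) trees for K subclones. The function names are illustrative only, and the count ignores the additional constraints and data that multiple samples contribute.

```python
from itertools import product


def count_clone_trees(num_subclones):
    """Number of rooted trees over subclones 1..K with a fixed root 0 (Cayley's formula)."""
    if num_subclones < 1:
        raise ValueError("need at least one subclone")
    return (num_subclones + 1) ** (num_subclones - 1)


def _reaches_root(node, parents):
    # Follow parent pointers; in a valid tree every node reaches root 0
    # within K steps. Self-loops and cycles never reach the root.
    for _ in range(len(parents)):
        node = parents[node - 1]
        if node == 0:
            return True
    return False


def brute_force_count(num_subclones):
    """Enumerate every parent assignment and keep those that form a tree (small K only)."""
    nodes = range(1, num_subclones + 1)
    count = 0
    for parents in product(range(num_subclones + 1), repeat=num_subclones):
        if all(_reaches_root(node, parents) for node in nodes):
            count += 1
    return count


if __name__ == "__main__":
    for k in (3, 10, 20, 30):
        print(k, count_clone_trees(k))
```

For small K the brute-force enumeration agrees with the formula, which is a useful sanity check; for the 30 subclones mentioned in connection with Pairtree, exhaustive enumeration is clearly out of reach under this simplified counting.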

The article by Wintersinger and colleagues in this issue of Blood Cancer Discovery (9) describes a new tool, Pairtree, that surmounts this problem with some clever new tricks. The authors demonstrate reconstruction of clonal phylogenies in B-cell acute lymphoblastic leukemia (B-ALL) tumors with as many as 90 samples, and for tumors with as many as 30 subclones. The key insight is the creation of a “pairs tensor” that encapsulates the possible relationships between each subclone and their probabilities, given the input data. This is coupled with an approach that constrains the search space by iteratively generating possible trees, then intelligently modifying them, based on conflicts with the information in the tensor. This allows for efficient identification of trees that capture a tumor's clonal relationships with high probability.
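As a loose illustration of the idea, and not Pairtree's actual implementation, the sketch below derives the pairwise relationship that a candidate tree implies for every pair of subclones and sums how strongly those relationships disagree with a table of pairwise probabilities. The relation labels, the dictionary layout of `pair_probs`, and the conflict score are all assumptions made for this example; how Pairtree computes its tensor from read counts and uses it to modify trees is described in the original article (9).

```python
from itertools import combinations


def ancestors_of(node, parents):
    """Return all ancestors of `node`; parents[i - 1] is the parent of subclone i, root is 0."""
    found = set()
    while node != 0:
        node = parents[node - 1]
        found.add(node)
    return found


def pairwise_relations(parents):
    """Map each subclone pair (a, b), a < b, to the relation the tree implies."""
    relations = {}
    for a, b in combinations(range(1, len(parents) + 1), 2):
        if a in ancestors_of(b, parents):
            relations[(a, b)] = "a_ancestor_of_b"
        elif b in ancestors_of(a, parents):
            relations[(a, b)] = "b_ancestor_of_a"
        else:
            relations[(a, b)] = "different_branches"
    return relations


def expected_conflicts(parents, pair_probs):
    """Sum, over pairs, the probability mass NOT on the relation the tree implies."""
    implied = pairwise_relations(parents)
    return sum(1.0 - pair_probs[pair][rel] for pair, rel in implied.items())


if __name__ == "__main__":
    # Hypothetical probabilities for three subclones (made-up numbers).
    pair_probs = {
        (1, 2): {"a_ancestor_of_b": 0.9, "b_ancestor_of_a": 0.0, "different_branches": 0.1},
        (1, 3): {"a_ancestor_of_b": 0.8, "b_ancestor_of_a": 0.0, "different_branches": 0.2},
        (2, 3): {"a_ancestor_of_b": 0.3, "b_ancestor_of_a": 0.1, "different_branches": 0.6},
    }
    branching = [0, 1, 1]  # 0 -> 1, 1 -> 2, 1 -> 3
    linear = [0, 1, 2]     # 0 -> 1 -> 2 -> 3
    print(expected_conflicts(branching, pair_probs))
    print(expected_conflicts(linear, pair_probs))
```

A search procedure could, in principle, prefer modifications that lower such a conflict score, which conveys why a compact pairwise summary helps narrow the space of trees worth examining.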

The resulting algorithm has several desirable features. It works well on both high- and low-depth samples, which is important because of renewed interest in whole-genome sequencing (WGS). WGS can capture many more marker mutations per sample, but with lower depth, and thus contains higher uncertainty in point estimates of allele frequency. Pairtree incorporates a binomial model of this sampling error. Though it generally operates under the infinite sites assumption (that each site is mutated exactly once), it is capable of identifying sites where violations occur and handling them appropriately. This is important in light of the fact that such events are now believed to occur far more often than previously thought (10). Pairtree also contains routines that can identify small numbers of outlier data points that are inconsistent with any reasonable model and exclude them when doing clustering, a necessary accommodation for noisy mutation data. Users should be aware, though, that the old adage “garbage in, garbage out” still holds true and no algorithm can reliably be expected to succeed when provided with large numbers of erroneous mutation calls. Even with accurate mutations, all such tools are working with incomplete information about a tumor's composition: sequencing often has limited power to detect very rare mutations and there is no knowledge of what occurred between sampled timepoints. The accuracy of clonal inference is always constrained by these limitations.
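The effect of depth on allele frequency estimates can be made concrete with a minimal sketch of a binomial sampling model. This is an illustration under stated assumptions rather than Pairtree's code: variant read counts are treated as binomial draws from the total depth, the prior on the variant allele frequency (VAF) is flat, and copy number and tumor purity corrections are omitted. The function names and the grid approximation are choices made for this example.

```python
import math


def binomial_log_likelihood(variant_reads, total_reads, vaf):
    """Log P(variant_reads | total_reads, vaf) under a binomial sampling model."""
    log_coeff = math.log(math.comb(total_reads, variant_reads))
    return (
        log_coeff
        + variant_reads * math.log(vaf)
        + (total_reads - variant_reads) * math.log1p(-vaf)
    )


def vaf_credible_interval(variant_reads, total_reads, mass=0.95, grid_size=2000):
    """Approximate central credible interval for the VAF on a grid, assuming a flat prior."""
    # Midpoints of equal-width bins, so 0 and 1 are never evaluated.
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]
    log_like = [binomial_log_likelihood(variant_reads, total_reads, p) for p in grid]
    peak = max(log_like)
    weights = [math.exp(value - peak) for value in log_like]  # stable exponentiation
    total = sum(weights)

    tail = (1.0 - mass) / 2.0
    cumulative = 0.0
    lower, upper = grid[0], grid[-1]
    found_lower = False
    for p, weight in zip(grid, weights):
        cumulative += weight / total
        if not found_lower and cumulative >= tail:
            lower, found_lower = p, True
        if cumulative >= 1.0 - tail:
            upper = p
            break
    return lower, upper


if __name__ == "__main__":
    # Same observed VAF of 0.2 at two depths: 6/30 reads versus 60/300 reads.
    print(vaf_credible_interval(6, 30))
    print(vaf_credible_interval(60, 300))
```

Both inputs have an observed VAF of 0.2, but the interval at 30x depth is considerably wider than at 300x, which is the uncertainty that a model of sampling error needs to carry forward into clustering and tree building.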

Importantly, Pairtree has a high “success rate”, which the authors define as producing at least one tree in 24 hours of wall-clock time without crashing. As the authors’ comparisons with some other tools demonstrate, this is not a trivial metric when analyzing extensively sampled tumors. No matter how good the results, long run times and expensive compute represent substantial barriers to adoption, and it would be satisfying to see such numbers reported more often in algorithmic papers.

In many tumors, more than one clonal relationship is possible given the data, but the best algorithms assign posterior probabilities to each phylogeny, allowing them to be scored or ranked according to how well they fit the data. Pairtree does this as well, encapsulating the uncertainty inherent in building clonal trees with imperfect data. Ambiguous situations may occur, for example, when two subclones are present at nearly the same frequency, and there is not enough information to confidently determine which is the parent and which is the child. The weighted consensus graph visualizations that Pairtree produces are an especially nice touch, displaying this uncertainty in a way that is intuitive and interpretable.
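One way to picture how such a summary could be assembled is sketched below. It assumes, purely for illustration, that each candidate tree comes with a log-likelihood, that all trees are equally probable a priori, and that a tree is stored as a parent vector with node 0 as the root. The log-likelihood values are made up, and the function names are not taken from Pairtree.

```python
import math
from collections import defaultdict


def posterior_weights(log_likelihoods):
    """Normalize tree log-likelihoods into posterior weights (uniform prior over trees)."""
    peak = max(log_likelihoods)
    unnormalized = [math.exp(value - peak) for value in log_likelihoods]
    total = sum(unnormalized)
    return [value / total for value in unnormalized]


def consensus_edges(parent_vectors, weights):
    """Sum tree weights per (parent, child) edge; parents[i - 1] is the parent of subclone i."""
    edge_weight = defaultdict(float)
    for parents, weight in zip(parent_vectors, weights):
        for child, parent in enumerate(parents, start=1):
            edge_weight[(parent, child)] += weight
    return dict(edge_weight)


if __name__ == "__main__":
    # Two subclones at similar frequency: either could be the parent of the other.
    trees = [
        [0, 1],  # 0 -> 1 -> 2
        [2, 0],  # 0 -> 2 -> 1
    ]
    log_likelihoods = [-10.0, -10.2]  # hypothetical values
    weights = posterior_weights(log_likelihoods)
    for edge, weight in sorted(consensus_edges(trees, weights).items()):
        print(edge, round(weight, 3))
```

In this toy case neither ordering dominates, so the edges of both trees appear in the consensus with comparable weight, which is the kind of ambiguity a weighted graph makes visible at a glance.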

In summary, Pairtree is a welcome addition to the family of algorithms for performing clonal inference and ordering, and one that promises to open up new avenues of research. In particular, many hematologic malignancies (or premalignancies) can be sampled using peripheral blood samples, meaning that compiling hundreds of longitudinal samples from patients is within reach. The ability to sequence such samples, and then accurately reconstruct subclonal trajectories, could be incredibly informative for understanding the kinetics of specific subclonal driver mutations: it could allow for better identification of subclones with small but persistent fitness advantages, versus those that exhibit explosive growth. The biology of metastasis is another obvious application, where dozens of biopsies may be recovered, allowing for insights into the complex and branching paths by which metastases are seeded and reseeded throughout the body. Though Pairtree provides the field with the ability to routinely assay larger sample sets, the authors do report degraded performance when applied to data sets containing more than 100 samples. While such large sampling schemes are still rare at present, future extensions of this or other tools may be needed to deal with data sets that continue to expand in both numbers of biopsy sites and temporal resolution.

No disclosures were reported.

1. Roth A, Khattra J, Yap D, Wan A, Laks E, Biele J, et al. PyClone: statistical inference of clonal population structure in cancer. Nat Methods 2014;11:396–8.
2. Miller CA, White BS, Dees ND, Griffith M, Welch JS, Griffith OL, et al. SciClone: inferring clonal architecture and tracking the spatial and temporal patterns of tumor evolution. PLoS Comput Biol 2014;10:e1003665.
3. Jiao W, Vembu S, Deshwar AG, Stein L, Morris Q. Inferring clonal evolution of tumors from single nucleotide somatic mutations. BMC Bioinf 2014;15:35.
4. da Silva-Coelho P, Kroeze LI, Yoshida K, Koorenhof-Scheele TN, Knops R, van de Locht LT, et al. Clonal evolution in myelodysplastic syndromes. Nat Commun 2017;8:15099.
5. Dang HX, Krasnick BA, White BS, Grossman JG, Strand MS, Zhang J, et al. The clonal evolution of metastatic colorectal cancer. Sci Adv 2020;6:eaay9691.
6. Yates LR, Gerstung M, Knappskog S, Desmedt C, Gundem G, Van Loo P, et al. Subclonal diversification of primary breast cancer revealed by multiregion sequencing. Nat Med 2015;21:751–9.
7. Myers MA, Satas G, Raphael BJ. CALDER: inferring phylogenetic trees from longitudinal tumor samples. Cell Syst 2019;8:514–22.
8. Jiang Y, Qiu Y, Minn AJ, Zhang NR. Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next-generation sequencing. Proc Natl Acad Sci U S A 2016;113:E5528–37.
9. Wintersinger JA, Dobson SM, Kulman E, Stein LD, Dick JE, Morris Q. Reconstructing complex cancer evolutionary histories from multiple bulk DNA samples using Pairtree. Blood Cancer Discov 2022;3:208–19.
10. Demeulemeester J, Dentro SC, Gerstung M, Van Loo P. Biallelic mutations in cancer genomes reveal local mutational determinants. Nat Genet 2022;54:128–33.