Multiregional Sequencing Analysis Reveals Extensive Genetic Heterogeneity in Gastric Tumors from Latinos

Gastric cancer is a leading cause of cancer mortality and health disparities in Latinos. We evaluated gastric intratumoral heterogeneity using multiregional sequencing of >700 cancer genes in 115 tumor biopsies from 32 patients, 29 of whom were Latinos. Analyses focused on comparisons with The Cancer Genome Atlas (TCGA) and on mutation clonality, druggability, and signatures. We found that only approximately 30% of all mutations were clonal and that only 61% of the known TCGA gastric cancer drivers harbored clonal mutations. Multiple clonal mutations were found in new candidate gastric cancer drivers such as EYS, FAT4, PCDHA1, RAD50, EXO1, RECQL4, and FSIP2. The genomically stable (GS) molecular subtype, which has the worst prognosis, was identified in 48% of our Latino patients, a fraction >2.3-fold higher than in TCGA Asian and White patients. Only a third of all tumors harbored clonal pathogenic mutations in druggable genes, with most (93%) GS tumors lacking actionable clonal mutations. Mutation signature analyses revealed that, in microsatellite-stable (MSS) tumors, DNA repair mutations were common in both tumor initiation and progression, while tobacco, POLE, and inflammation signatures likely initiate carcinogenesis. MSS tumor progression was likely driven by aging- and aflatoxin-associated mutations, as these latter changes were usually nonclonal. In microsatellite-unstable tumors, nonclonal tobacco-associated mutations were common. Our study therefore contributes to advancing gastric cancer molecular diagnostics and suggests that clonal status is important to understanding gastric tumorigenesis. Our findings of a higher frequency of a poor-prognosis molecular subtype in Latinos and a possible new aflatoxin gastric cancer etiology also advance cancer disparities research. Significance: Our study contributes to advancing our knowledge of gastric carcinogenesis, diagnostics, and cancer health disparities.


Supplementary figures and files referred to in the manuscript
The manuscript refers to Supplementary figures 1-9 and Supplementary tables 1-4. These figures and tables can be found below in the sections titled "Supplementary figures" and "Supplementary tables". Data files as spreadsheets are referenced by the main manuscript and in this document by their filename ("Supplementary data file XX") and are available along with the manuscript.

Somatic pan-cancer panel design
The cancer panel gene composition is shown in SuppDataFile1_PanCancerPanelGeneComposition.xlsx. To design the panel, we collected cancer gene lists from the literature, the Color Genomics test suite (46), the Rahman cancer gene list (47), and the COSMIC Cancer Gene Census (40,41), together with genes mutated in one or more MSS samples of the TCGA stomach adenocarcinoma study (13,42,43), genes mutated in three or more of 36 in-house gastric cancer tumor samples, and a list of genes involved in DNA repair (Supplementary table 1). These were merged into a master gene list with standardized HGNC (48)-approved gene names (49) or, if a gene was not found in the HGNC list, with the NCBI Entrez (50) preferred gene symbol (51).
The final panel probes were designed using Agilent SureSelect XT2 Custom Capture technology (68) with a target region of 3.75 Mbp.

Anonymous Patient and Sample IDs
We assigned each patient and biopsy an anonymous ID of the form I_##### and S_#####, respectively. Each biopsy was also assigned a biopsy identifier in the form of N1, T1, T2, T3, etc., where "N" designates normal tissue and "T" designates tumor tissue.

DNA extraction and sequencing
We collected 2 to 5 individual tumor biopsies of gastric cancer tumors from 37 patients along with normal tissue biopsies adjacent to the tumors, for a total of 168 individual tumor biopsies and 37 normal biopsies. All biopsies were fresh-frozen tissue. DNA was captured with the somatic pan-cancer panel described above.
Each panel capture run pooled 16 barcoded biopsy samples into a hybridization group. The goal was to sequence individual tumor biopsies at 1.5x greater depth than normal biopsies.
The samples were sequenced on a total of 7 lanes in 3 batches on Illumina HiSeq 4000 (69), using paired-end 150 base-pair sequencing. The hybridization groups, sequencing dates, and lane numbers were recorded for each sequenced sample aliquot in a tabbed table file named PROJECT_info.txt used within the computation pipeline, so these numbers were available for use by the pipeline (for example, for plotting sequencing depth of each sequenced sample aliquot).
There were 287 targeted genes with less than 100% coverage, and 24 of these had less than 90% coverage (Supplementary table 2).

Project BED file construction
Based on the design of the Agilent SureSelect XT2 Custom Capture, we were provided with the target region BED file 3025671_Covered.bed for the somatic pan-cancer panel. The panel design was lifted over to hg38 genome coordinates using the liftOver program (87) and padded by 200 bp to create a padded target panel BED file. A target CDS BED file was also created by intersecting the CDS BED file with the padded target panel BED file.
Shorthand names are used in the rest of this document when describing methods that made use of BED files.

Read filtering and mapping
All FASTQ files (pre- and post-processed) were QC'd using FastQC (90) and FASTX (91). FASTQ files were trimmed using Scythe (88) version 0.991 and Sickle (89) version 1.33. Scythe was run with options -q sanger -p 0.05 and the illumina_adapters.fa file. Sickle was run with options pe -t sanger -g -q 30 -l 75, to produce reads with a minimum length of 75 bp and with an average PHRED Q score of 30 for the sequence in a sliding window.

Quality control testing for artifacts, cross-sample contamination, and gender detection
Chromosome- and target-region-mapped read counts were assessed using the samtools view command with options -q 30 -f 2 -F 3844 (minimum mapping quality of 30; reads must be properly paired; must not be secondary or supplementary alignments; must pass platform/vendor quality controls; and must not be PCR or optical duplicates) and the bamtobed command of the bedtools program suite (98) version 2.2.27. Bar plots of the counts were produced. Target region counts were obtained by intersecting reads with the target BED file using the bedtools intersect command with the -wo argument. The number of base pairs in the regions of each output BED file was summed, aggregated across biopsies, and plotted.
Hybrid selection and coverage metrics were generated from each BAM file using the CollectHsMetrics command of the Picard Tools program suite with options MINIMUM_BASE_QUALITY=22, MINIMUM_MAPPING_QUALITY=30, PER_TARGET_COVERAGE, CLIP_OVERLAPPING_READS=false, and an interval file produced from the BED_TARGET_RGNS BED file. Coverage metrics were plotted and used as the first of two steps to confirm poor coverage of biopsies first identified with the FastQC summary plots.
Mapping quality metrics were assessed using the CollectMultipleMetrics command of Picard Tools. Additional mapping quality metrics were generated using the bamqc command of Qualimap (99) version 2.2 with the BED_TARGET_RGNS BED file, producing a per-biopsy directory of metrics files.
Basepair-level coverage was generated using samtools view with options -q 30 -F 3840 and bedtools coverage command with options -nonamecheck -sorted -d and the panel target regions. These files were aggregated across all biopsies and plots were used as the second of two steps to confirm poor coverage of biopsies prior to removing them from the analysis.
Sequencing artifact statistics were collected using the CollectSequencingArtifactMetrics command of the GATK program suite version 4.0.6.0 with options --INCLUDE_UNPAIRED=true --MINIMUM_INSERT_SIZE 10 --MINIMUM_MAPPING_QUALITY=30 --MINIMUM_QUALITY_SCORE=22. Results were aggregated across biopsies and plotted. These data show a mean sequencing error rate of about 0.03%, with a peak rate of 0.07% for C>T transitions, which was deemed highly favorable.
Sequence data for each biopsy were tested to verify that the inferred gender matched the recorded gender of the patient: for each biopsy sample, the ratio of the number of reads mapping to chromosome Y to the total number of reads was calculated. If the ratio exceeded 0.00025, the biopsy was deemed male; otherwise, female. All biopsies matched the expected gender.
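A minimal sketch of this rule (illustrative Python, not the pipeline's actual code; the function name and counts are assumptions):

```python
# Illustrative sketch of the gender-verification rule described above:
# a biopsy is deemed male if the fraction of reads mapping to chromosome Y
# exceeds 0.00025.

MALE_Y_RATIO_THRESHOLD = 0.00025  # threshold from the text

def infer_gender(chrY_reads: int, total_reads: int) -> str:
    """Return 'male' or 'female' from the chromosome Y read fraction."""
    ratio = chrY_reads / total_reads
    return "male" if ratio > MALE_Y_RATIO_THRESHOLD else "female"

# Hypothetical read counts:
print(infer_gender(30_000, 60_000_000))  # ratio 0.0005 -> male
print(infer_gender(3_000, 60_000_000))   # ratio 0.00005 -> female
```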
After quality control, the number of passing biopsies was reduced to 115 individual tumor biopsies from 32 patients.

Extraction of unaligned and Epstein-Barr and H. pylori reads
The number of unaligned reads was assessed using the samtools view command with argument -F 2, and the counts were plotted.
Reads mapping to the Epstein-Barr virus or H. pylori bacterial genomes were extracted using samtools view with option -F 3844. The read counts were aggregated and plotted.

Panel of Normals construction
A panel of normals (PON) was used with the GATK Mutect2 (102) caller following the Best Practices procedure outlined by the Broad Institute (103). The gnomAD v2.0.1 hg38 VCF was used for germline SNPs.

SNV variant calling with multiSNV
SNV variants were called using multiSNV (104), which calls germline and somatic SNVs jointly (simultaneously) in all biopsies of a person.
The multiSNV program was modified in several ways:
1. Improved calling of germline SNVs. The original code called a variant when the major alleles in normal and tumor differed, but this missed germline calls when 100% of reads in both normal and tumor carried the alternate allele (which was therefore the major allele in both). The code was modified to call a variant whenever the major allele differed from the reference allele.
2. Counting of reads with mapping quality below the normal quality threshold. Option -x (--minMapQualX) was added, taking a mapping quality number as an argument. It causes counts of reads with mapping quality >= the specified value to be accumulated and written to the output VCF file as XD and XCOUNT FORMAT keys, with XD holding counts of reference and alternate allele reads at or above that quality (like AD) and XCOUNT holding counts of reads with bases A, C, G, T at or above that quality (like BCOUNT).
3. Speedup. The program was modified to keep the reference FASTA file open while processing and to cache reference information for faster access.
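The effect of modification 1 can be illustrated with a toy version of the two calling conditions (the real change was made in multiSNV's source code; the function names here are hypothetical):

```python
# Toy illustration of the germline-calling condition change (modification 1).

def call_variant_original(major_normal: str, major_tumor: str) -> bool:
    # Original: call only when normal and tumor major alleles differ,
    # which misses germline homozygous-alt sites (same major allele in both).
    return major_normal != major_tumor

def call_variant_modified(major_normal: str, ref_allele: str) -> bool:
    # Modified: call whenever the major allele differs from the reference.
    return major_normal != ref_allele

# Homozygous-alt germline site: REF=A, 100% of reads are G in both samples.
print(call_variant_original("G", "G"))   # missed by the original condition
print(call_variant_modified("G", "A"))   # caught by the modified condition
```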
Germline SNVs were extracted from the per-person multiSNV VCF using bcftools view command to extract non-wildtype variants for the normal, requiring minimum depth of 15 and minimum VAF of 0.2. Variants were flagged with INFO flag GNOMAD if they appeared in the <GNOMAD_EXOME_VCF> using the bcftools annotate command.
Tandem Repeats Finder, RepeatMasker, and WindowMasker tracks (105) were downloaded from the UCSC Table Browser for hg38 (106)(107)(108)(109). The INFO flags PON, QUES, and RPTS were added to the per-person germline VCF files using BED files to specify the applicable regions: the GATK PON, questionable regions, and simple repeats BED files, respectively. Germline variants were filtered to produce a VCF of likely germline mutations that were not common population SNPs, not in questionable call areas, and not in areas containing simple repeats.
Sequencing artifacts were removed from these VCF files using GATK FilterByOrientationBias with options --artifact-modes G/T --artifact-modes C/T -OVI -P to produce per-biopsy artifact-removed somatic variant call set VCF files. Nonpassing variants were filtered out using GATK SelectVariants.

Merging per-biopsy VCFs
This project uses multiregional sequencing with multiple tumor biopsies per person, yet the Mutect2 output is per biopsy. The per-biopsy Mutect2 somatic indel VCF files of each person were merged to form per-person VCF files of somatic indels. This was a multi-step procedure, designed to ensure that the final merged file contained only those somatic variants for which there was solid evidence of either mutant or wildtype genotype, from either Mutect2 or the GATK HaplotypeCaller, in each biopsy of the person. The procedure was:
1. Generate a union set of passing germline and somatic variants for each person.
2. Use GATK HaplotypeCaller version 4.0.6.0 to call variants that exist in the union set but were not called in the biopsy.
3. Merge all per-person biopsy VCFs.
4. Use the bcftools isec command with options -n=<NUM_TUMORS> -c all -w 1 to intersect all per-biopsy merged VCF files, yielding variants present in all biopsies.
5. Use the bcftools isec command with options -n=2 -c all -w 1 to produce per-biopsy enhanced VCF files containing calls at all loci where a mutation occurred in ANY of that person's tumors.
6. Use the bcftools isec command with options -n=2 -c all -w 1 to produce normal VCF files containing calls at all loci where a mutation occurred in ANY of that person's tumors.
7. Create the final VCF by merging the per-person normal all-loci VCF file and all of the person's per-biopsy enhanced all-loci VCF files.

Merging per-person VCF variant files
Per-person multiSNV VCF files were merged using the bcftools merge command to produce a single project VCF file containing all multiSNV germline and somatic SNV variants called in any biopsy of any person. In the same way, all per-person Mutect2 merged VCF files were merged to produce a single project VCF file containing all Mutect2 somatic indel variants called in any biopsy of any person.

Preparing VCF file of TCGA gastric adenocarcinoma somatic variants for comparison
Utilizing cBioPortal (42), the Nature 2014 dataset (PubMed ID: 25079317), downloaded in 11/2016 (data_mutations_extended.txt and data_clinical.txt), was used to generate a VCF file in GRCh38 coordinates with a custom R script. The VCF file was then filtered to the padded target regions of the panel.

Forced calling of variants based on evidence in sister biopsies
Manual examination of variant call results revealed cases where multiSNV or Mutect2 appeared to have made an incorrect call, calling a genotype as wildtype in one or more tumors when it was called as a somatic variant in one or more other tumors. This is expected with Mutect2, which does not perform joint-sample calling; with multiSNV it is due to its conservative estimate of joint probabilities for mutations that are clonal in nature. Based on our manual review of the data, we implemented a forced variant calling algorithm in custom R code. It read the multiSNV and Mutect2 merged VCF files (all patients, all biopsies in one VCF file) and tested whether a given variant called in one biopsy should also be called in a sister biopsy. When the algorithm indicated the variant should be called, the call was written to a file of forced variant calls; subsequently, whenever one of the VCF files was read, the forced variant call file was also read and used to modify the VCF data to include the forced calls. For each variant of each sample in the VCF file, a wildtype call is changed to a somatic mutation call if:
1. The sample under consideration is a TUMOR sample.
2. The sample locus under consideration is called wildtype.
3. The paired NORMAL sample is called wildtype at that locus.
4. The locus has only a single record in the VCF file.
5. Read count, ref and alt depth, and purity/copy-number-adjusted VAF data (described later) are available for the sample at that locus.
6. In the sample under consideration, the read count for the alt allele is >= 2, OR it is equal to 1 and the copy-number-adjusted VAF for the allele is >= 0.01.
7. There are one or more sister TUMOR samples for the same person.
8. At least one sister TUMOR is called variant.
9. All sister TUMORS that are called variant have their first alternate allele equal to the alt allele in the sample under consideration.
10. At least one sister TUMOR that is called variant has a first-alternate-allele read count >= 6.
11. That same sister TUMOR has PureCN gene CNV data available for the gene containing the variant, and its purity/copy-number-adjusted VAF for its first alternate allele is >= 0.025.
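The decision rule above can be sketched in condensed form. The code below is an illustrative Python rendering (the actual implementation was custom R code); the BiopsyCall container and its field names are assumptions, and conditions 4-5 (single VCF record, data availability) are assumed to have been checked by the caller.

```python
# Condensed sketch of the forced-call test: a wildtype call in a tumor biopsy
# is flipped to somatic when sister biopsies of the same person give strong
# evidence for the same alternate allele.

from dataclasses import dataclass

@dataclass
class BiopsyCall:
    is_tumor: bool
    genotype: str        # "WT" or "SOM"
    alt_allele: str
    alt_depth: int
    adj_vaf: float       # purity/copy-number-adjusted VAF
    has_cnv_data: bool   # PureCN gene CNV data available

def should_force_call(sample: BiopsyCall, normal: BiopsyCall,
                      sisters: list[BiopsyCall]) -> bool:
    # Conditions 1-3: tumor sample called WT, paired normal called WT.
    if not (sample.is_tumor and sample.genotype == "WT"
            and normal.genotype == "WT"):
        return False
    # Condition 6: minimal read evidence in the sample itself.
    if not (sample.alt_depth >= 2
            or (sample.alt_depth == 1 and sample.adj_vaf >= 0.01)):
        return False
    # Conditions 7-8: at least one sister tumor is called variant.
    variant_sisters = [s for s in sisters if s.genotype == "SOM"]
    if not variant_sisters:
        return False
    # Condition 9: all variant sisters agree on the alternate allele.
    if any(s.alt_allele != sample.alt_allele for s in variant_sisters):
        return False
    # Conditions 10-11: strong evidence in at least one sister.
    return any(s.alt_depth >= 6 and s.has_cnv_data and s.adj_vaf >= 0.025
               for s in variant_sisters)
```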

Filtering variants in VCF variant files
The common R code used for variant analysis included the definition of filters and their application to the variant list to create filtered subsets of variants. Each filter has a name that starts with "grm" for filters that select only germline variants and "som" for those that select only somatic variants. For describing somatic variants in the main part of the manuscript, the following filters were used:
• somPanelGenes: selects the somatic variants described in the manuscript as the full set after filtering out false positives and sex chromosome mutations. Referred to as the "final list" of variants.
• somCodSplNoSyn: selects the somatic variants described in the manuscript as non-silent coding or splicing mutations.
For identifying germline mutations, the following filter was used:
• grmReliable: selects germline variants that are reliably called.
For comparing mutation rates between TCGA and this project, the following filter was used:
• somCDS_CodSpl: selects somatic variants in coding or splicing regions of panel CDS target regions. A similar filter was applied to the TCGA mutation data to select mutations for computing the TCGA mutation rate, so that the comparison fairly compared mutations in the same regions.
Each filter consists of one or more filter actions. Each filter is actually THREE filters: (1) a variant filter that selects a subset of all variants; (2) a variant-person filter that selects, for each variant, one or more persons carrying that variant; (3) a variant-biopsy filter that selects, for each variant, one or more biopsies carrying that variant. If a variant is not selected in (1), it is also not selected for any person in (2) or any biopsy in (3). If a variant is not selected for some person in (2), it is also not selected for any biopsy of that person in (3). Some filter actions operate at the level of the variant (1), some at the level of person (2), and some at the level of biopsy (3), but all three filter types can be affected by each action. In the following filter descriptions, the filter name is shown after the bullet, and a numbered list of filter actions follows. The descriptions are written in terms of which variants pass the filter and are accepted.
Somatic variant filters are as follows: • som: somatic variants in tumor biopsies only. Selects a set of 0 or more tumor biopsies for each variant.
• somReliable: select reliable somatic variants
(1) above som actions applied first
(2) read depth in NORMAL >= 20
(3) read depth in ALL TUMORS OF PERSON >= 30
(4) alternate allele depth in AT LEAST ONE TUMOR >= 5
(5) NORMAL alternate allele depth = 0, or = 1 with VAF <= 0.01
(6) the variant is called in more than one sample of the person, OR it is called in only one sample and the other samples of that person have a total allele count for alleles other than ref and alt of no more than 1 + 10% of the variant sample's alt allele count
(7) if the GQ tag is present (multiSNV only), NORMAL GQ <= 1e-8
(8) if the GQ tag is present (multiSNV only), GQ <= 1e-8 in all tumors of the person
(9) STR tag not present (not a simple tandem repeat)
(10) RPT tag not present (not a RepeatMasker repeat)
(11) WMR tag not present (not a WindowMasker repeat)
(12) Reject loci that appear to be experiencing sequencing artifacts causing higher-than-normal sequencing error rates, by examining NORMAL biopsies for loci that are called WT but have far fewer than the expected number of reference allele reads (given the total read depth and the known sequencing error rate of 0.0008), or that are called HET but have far fewer than the expected number of alternate reads (given the total read depth and an expected probability of an alternate allele read being no lower than 0.4). At any given locus, if two or more NORMAL biopsies show such unusual read counts, reject ALL VARIANTS IN ALL BIOPSIES at that locus. Tests are based on the binomial distribution, and the test failure thresholds are computed dynamically for each tested locus in order to achieve, across all variants, 5 or fewer false positive rejections with probability >= 0.95. The dynamic threshold is computed in two steps. First, a single probability pMin is computed such that, IF a locus is rejected at probability pMin under null-hypothesis conditions, there will be fewer than 5 false positive rejections with probability 0.95. Second, for each locus, a single probability threshold is computed for rejecting a NORMAL at that locus. For a given NORMAL, its test fails (rejection) if the number of reads of the requisite type (reference or alternate) has a probability of occurrence under the null hypothesis of less than this threshold probability. The threshold probability is chosen such that the probability of rejecting the locus under the null hypothesis (when 2 or more NORMALS fail their test) is pMin (and since each locus has a varying number of NORMALS with genotype data, that probability changes depending on the locus).
(13) variant is not present in the gnomAD genome ALL database, or its population frequency in that database is < 0.0001
(14) variant is not present in ExAC database release 1, or its population frequency in that database is < 0.0001
• somExcludeXY: select reliable somatic variants not found on chromosome X or Y
(1) above somReliable actions applied first
(2) variant position is not in chrX or chrY
• somPanelGenes: select reliable somatic variants in LCC panel genes, not found on X or Y
(1) above somExcludeXY actions applied first
(2) variant lies in a region given by the BED_MERGED_TGTS BED file
(3) there is at least one LCC panel gene within the variant's gene.RefGene list created by Annovar
• somCodSplNoSyn: select reliable non-silent coding/splicing somatic variants in LCC panel genes, not on X or Y
(1) above somPanelGenes actions applied first
(2) VarFunctional is one of nonsynonymous, stoploss, stopgain, frameshift, inframeindel, unknownExonic, splicing, TERT_PROMOTER, other
• somCDS_CodSpl: select reliable coding/splicing somatic variants in CDS target regions of LCC panel genes, not on X or Y
(1) above somPanelGenes actions applied first
(2) VarFunctional is one of nonsynonymous, synonymous, stoploss, stopgain, frameshift, inframeindel, unknownExonic, splicing, TERT_PROMOTER, other
(3) variant lies in a region given by the BED_TARGET_RGNS BED file
Germline variant filters are as follows:
• grm: germline variants in normal biopsies only. Selects a set of 0 or more normal biopsies for each variant.
(1) above grm actions applied first
(2) read depth in NORMAL >= 20
(3) alternate allele depth in NORMAL >= 5
(4) if the GQ tag is present (multiSNV only), NORMAL GQ <= 1e-8
(5) STR tag not present (not a simple tandem repeat)
(6) RPT tag not present (not a RepeatMasker repeat)
(7) WMR tag not present (not a WindowMasker repeat)
(8) no NORMAL biopsy with a genotype other than 0/0 and read depth >= 30 may have VAF < 0.16. A variant is filtered out from ALL biopsies if ANY NORMAL BIOPSY fails this test, as this would indicate a likely artifact at the locus. Sequencing errors will cause occasional NORMALS with very low VAF, but those should be called with a genotype of 0/0.
(9) Reject loci that appear to be experiencing sequencing artifacts causing higher-than-normal sequencing error rates (same method as in the somReliable filter described above).
The following TCGA data variant filter was created for extracting somatic variants from TCGA data in the same regions as targeted by the LCC panel, for the purpose of comparing mutation rate between TCGA and this project. This filter matches as closely as possible the somCDS_CodSpl filter above, given the available mutation attributes.
• TCGA.somCDS_CodSpl: select somatic variants for comparison to MSEQ project variants to compare mutation rates
(1) select TUMOR biopsies only (the TCGA data contained only tumor biopsy data)
(2) VarStat = SOM or LOH
(3) read depth in ALL TUMORS OF PERSON >= 30
(4) alternate allele depth in AT LEAST ONE TUMOR >= 5
(5) STR tag not present (not a simple tandem repeat)
(6) RPT tag not present (not a RepeatMasker repeat)
(7) WMR tag not present (not a WindowMasker repeat)
(8) total number of somatic mutations < 400 (otherwise the sample was presumed to be WGS rather than WES; samples were not annotated with sequencing type in the downloaded TCGA file). The number 400 was chosen after examining the mutation counts of all samples and observing a bimodal distribution attributed to sequencing type.
(9) variant is not present in the gnomAD genome ALL database, or its population frequency in that database is < 0.0001
(10) variant is not present in ExAC database release 1, or its population frequency in that database is < 0.0001
(11) variant position is not in chrX or chrY
(12) variant lies in a region given by the BED_MERGED_TGTS BED file
(13) there is at least one LCC panel gene within the variant's gene list created by Annovar
(14) VarFunctional is one of nonsynonymous, synonymous, stoploss, stopgain, frameshift, inframeindel, unknownExonic, splicing, TERT_PROMOTER, other
(15) CDStgtRgns: variant lies in a region given by the BED_TARGET_RGNS BED file
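The per-NORMAL binomial test used by filter action 12 of the somReliable filter can be sketched as follows. This is a stdlib-only Python illustration (not the pipeline's R code) covering only the WT/reference-read case; the fixed p_threshold here stands in for the dynamically computed, pMin-derived per-locus threshold described above.

```python
# Sketch of the per-NORMAL binomial artifact test: a WT-called NORMAL fails
# when its reference-read count is improbably low given the total depth and
# a sequencing error rate of 0.0008.

from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_fails(ref_reads: int, depth: int,
                 error_rate: float = 0.0008,
                 p_threshold: float = 1e-6) -> bool:
    # Under the null hypothesis, non-reference reads at a WT locus arise only
    # from sequencing error, so ref_reads ~ Binomial(depth, 1 - error_rate).
    # Fail when observing this few reference reads is rarer than p_threshold.
    return binom_cdf(ref_reads, depth, 1 - error_rate) < p_threshold

print(normal_fails(100, 100))  # all reads reference: passes
print(normal_fails(80, 100))   # 20 non-reference reads at depth 100: fails
```

In the pipeline, a locus is rejected only when two or more NORMAL biopsies fail this test, with the threshold tuned per locus to bound the expected number of false-positive rejections.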

Assessing variant filters
For each of the variant filters listed above, a bar plot was made with one bar per sample, and each bar being a stack of bar segments, one segment for each filtering action within that filter, and with each segment's height corresponding to the number of variants removed by that action for that sample. These plots were reviewed to assess how each filtering action in each filter was behaving, to look for any unexpected or problematic filtering actions.

Removing biopsies judged to be unreliable outliers
A bar plot was made with one bar per biopsy, each bar consisting of two stacked segments: one showing the number of "reliable" somatic variants in the biopsy (passing filter "somReliable") and the other the number of "unreliable" somatic variants (passing filter "som" but not "somReliable"). Review of this plot identified three biopsies judged to be unreliable outliers, which were subsequently removed from the analysis:
1. I_6579_S_26840 (T1): this biopsy had over 4000 total somatic variants, more than 10 times its sibling biopsies.
2. I_9709_S_26846 (T3): this biopsy had almost twice as many somatic variants (225) as its sibling biopsies, all of which had a fairly uniform number (around 125) of somatic variants.
3. I_19343_S_26861 (T4): this biopsy had more than twice as many somatic variants (260) as its sibling biopsies, all of which had a fairly uniform number (around 100) of somatic variants.
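Although the three outliers were identified by manual review of the plot, the sibling comparison can be formalized as a sketch. The 1.5-fold-over-sibling-median criterion below is an illustrative assumption, not the authors' rule:

```python
# Hypothetical outlier criterion: flag a biopsy whose somatic-variant count
# is at least `fold` times the median count of its sibling biopsies.

from statistics import median

def outlier_biopsies(counts: dict[str, int], fold: float = 1.5) -> list[str]:
    flagged = []
    for biopsy, n in counts.items():
        siblings = [v for b, v in counts.items() if b != biopsy]
        if siblings and n >= fold * median(siblings):
            flagged.append(biopsy)
    return flagged

# Counts modeled on patient I_9709 (values approximate, from the text):
print(outlier_biopsies({"T1": 125, "T2": 120, "T3": 225, "T4": 130}))
```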

SNV/Indel Clonality assignment
For each SNV or indel variant in each person, clonality is assigned from the GTval values as follows:
• CLONAL: NORMAL is not a germline variant; ALL TUMORS are somatic variant (for that variant).
• SUBCLONAL: NORMAL is not a germline variant; more than one but not all TUMORS are somatic variant.
• PRIVATE: NORMAL is not a germline variant; only one TUMOR is somatic variant.
• GERMLINE: NORMAL and all TUMORS are germline variant.
• GONE: the variant, whatever sample it was called in, has been filtered out by the grmReliable or somPanelGenes filter.
• NONE: no variant was called in the NORMAL or any TUMOR.
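These rules can be rendered as a small decision function. The sketch below is hypothetical Python (the project used R), covers the genotype-based categories, and omits GONE, which depends on filter state rather than genotypes:

```python
# Sketch of clonality assignment from per-sample genotype values:
# "SOM" = somatic variant, "GRM" = germline variant, "WT" = wildtype.
# The GONE category is omitted (it reflects filtering, not genotypes).

def assign_clonality(normal_gt: str, tumor_gts: list[str]) -> str:
    som = sum(gt == "SOM" for gt in tumor_gts)
    if normal_gt == "GRM" and all(gt == "GRM" for gt in tumor_gts):
        return "GERMLINE"
    if normal_gt != "GRM":
        if som == len(tumor_gts):
            return "CLONAL"
        if som > 1:
            return "SUBCLONAL"
        if som == 1:
            return "PRIVATE"
    return "NONE"

print(assign_clonality("WT", ["SOM", "SOM", "SOM"]))  # CLONAL
print(assign_clonality("WT", ["SOM", "WT", "SOM"]))   # SUBCLONAL
print(assign_clonality("WT", ["SOM", "WT", "WT"]))    # PRIVATE
```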

Analysis of coverage data
A custom R program was used to compute and plot numerous coverage statistics (here, "minimum depth" means depth of at least 15X):
• mean mapped coverage of each biopsy across all target regions
• a series of global coverage statistics
• for each of the following, the mean target region depth (across target regions and/or biopsies), standard deviation of depth, and percent of target regions and/or biopsies with minimum depth: a) each biopsy; b) each individual; c) each target region specified in the RN_NAMED_TARGET_RGNS BED file; d) each targeted gene; e) each DNA pool hybridization group

Checking biopsy pairing
All pairs of biopsies (regardless of the owning individual) were compared for similarity by computing the cosine similarity of the variant allele frequencies (VAFs) of all germline variants present in one or both biopsies and having a depth in both biopsies between 80% and 250% of the mean depth of that biopsy's variants. A cosine similarity threshold was chosen by seeking a minimum in the empirical distribution of the cosine similarities from all pairs (since most pairs are mismatches but a number of pairs will match, giving a bimodal distribution). This testing uncovered two mismatches, leading to 6 biopsies being removed from the analysis:
• I_26180_S_26858 (T1): this tumor biopsy did not match its normal biopsy.
• I_9298_S_26895 through S_26899 (N1, T1-T4): the normal biopsy matched the I_19343_S_26857 (N1) biopsy and mismatched its own tumor biopsies.
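The similarity computation can be sketched as follows (illustrative Python; the layout of VAF vectors over the union of germline variants is an assumption):

```python
# Sketch of the biopsy-pairing check: cosine similarity of germline VAF
# vectors. Matched biopsies from the same person share germline variants and
# score near 1; mismatched pairs score much lower.

from math import sqrt

def cosine_similarity(vaf_a: list[float], vaf_b: list[float]) -> float:
    dot = sum(a * b for a, b in zip(vaf_a, vaf_b))
    norm = sqrt(sum(a * a for a in vaf_a)) * sqrt(sum(b * b for b in vaf_b))
    return dot / norm if norm else 0.0

# Toy VAF vectors over the union of germline variants found in either biopsy
# (0.0 = variant absent in that biopsy):
same_person = cosine_similarity([0.5, 1.0, 0.48, 0.0],
                                [0.52, 0.97, 0.5, 0.0])
diff_person = cosine_similarity([0.5, 1.0, 0.48, 0.0],
                                [0.0, 0.0, 0.0, 1.0])
print(round(same_person, 3), round(diff_person, 3))
```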

Mutation rate analysis and comparison to TCGA
Somatic mutations were filtered using the somCDS_CodSpl.samp filter on a per-sample basis. TCGA somatic mutations from the preprocessed TCGA GC VCF file were filtered using the TCGA.somCDS_CodSpl.samp filter on a per-sample basis. The mutation rates of each MSEQ sample, the mean rate of the MSEQ MSI and non-MSI samples, and the mean rate of the TCGA MSI and non-MSI samples were computed, along with standard error of the mean for each mean rate, and these were plotted on a bar plot for comparison of mutation rates.

Purity, copy-number, and aneuploidy analysis
Copy-number analysis, including tumor purity and ploidy analysis, was performed using PureCN (117) version 1.16.0, which was designed for use with targeted-panel data. A genome mappability file was created using the GEM library (118) program version 1.778 beta and the GATK bundle genome reference file Homo_sapiens_assembly38.fasta. The PureCN IntervalFile.R program was run to create text and BED interval files, using options --infile <FILTERED_BED_FILE> --fasta Homo_sapiens_assembly38_LCCpanel.fasta --genome hg38 --force --mappability Homo_sapiens_assembly38_LCCpanel.gemV2_100.mappability.bigwig --offtarget --offtargetwidth=200000. The PureCN Coverage.R program was run on each biopsy's final recalibrated BAM files (both NORMAL and TUMOR) with the interval file and options --seed 123 --force --removemapq0, producing four PureCN coverage files per biopsy.
Per-tumor-biopsy VCF files of variants were prepared for PureCN by merging, on a per-tumor-biopsy basis, the two per-person multiSNV and Mutect2 pass-filtered normalized VCF files for each biopsy.

Manual curation of PureCN purity/ploidy solution
A custom R program named ManualCuratePureCNsample.R was run for each tumor biopsy to analyze the results of the PureCN run and determine whether the chosen purity/ploidy solution should be replaced, with the manually curated purity/ploidy established as the new chosen solution. The program was run with options --minLikelihoodFrac 0.7 --minPeakHeatFrac 0.7 --minProductFrac 0.7 --minTgtPloidy 1 --maxTgtPloidy 6 --minTgtPurity 0.1 --maxTgtPurity 1.0 --maxNonHighAdjVAF 1.2 --minNonLowAdjVAF -0.2 --maxVarsBadVAF 2 --maxFracVarsBadVAF 0.2 --CR1_thresh 0.15 --roundCRstart 0.01 --minCRsegs 5 --minCRsegLen 10e6 --roundCRdelta 0.01. It uses two methods to detect poor PureCN solutions.
The first method is based on the observation that PureCN's maximum-likelihood solution sometimes has a heatmap value substantially lower than the peak heatmap value, and in those cases the solution often appears to be wrong. This method rejects as unacceptable any PureCN solution meeting ANY of the following conditions:
1. Its PureCN total likelihood ranks within the bottom --minLikelihoodFrac of total likelihoods among all solutions.
2. Its PureCN heatmap value ranks within the bottom --minPeakHeatFrac of heatmap values among all solutions; if this rejects all solutions, the one solution (after the previous step) with the highest heatmap value is instead retained.
3. The product of its PureCN total likelihood and heatmap value ranks within the bottom --minProductFrac of such products among all solutions; if this rejects all solutions, the one solution (after the previous steps) with the highest product is instead retained.
The second method is based on the values of the variant allele frequencies (VAFs) after adjusting them for purity, ploidy, and copy-number. It is observed that sometimes a large fraction of these adjusted VAF values are greater than 1 or less than 0, suggesting that the PureCN purity/ploidy solution is incorrect and caused the adjusted VAFs to be too high or too low. This method rejects as unacceptable any PureCN solution meeting BOTH of the following conditions:
1. The number of PureCN-filtered variants with purity/ploidy/copy-number-adjusted VAF outside the acceptable range (--minNonLowAdjVAF to --maxNonHighAdjVAF) is > --maxVarsBadVAF.
2. The fraction of such variants is > --maxFracVarsBadVAF.
After these two methods are applied, additional steps eliminate more solutions:
1. If more than one non-rejected solution remains, reject those with ploidy outside the range --minTgtPloidy..--maxTgtPloidy, unless this rejects all solutions.
2. If more than one non-rejected solution remains, reject those with purity outside the range --minTgtPurity..--maxTgtPurity, unless this rejects all solutions.
3. If more than one non-rejected solution remains, choose the one with the highest total log likelihood.
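The two rejection methods and the tie-breaking steps above can be sketched as follows. This is an illustrative Python translation, not the authors' R program; the solution fields likelihood, heat, n_bad_vaf, and frac_bad_vaf are hypothetical names for the quantities described in the text.

```python
# Sketch of the PureCN solution-curation logic described above (illustrative only).
def filter_solutions(sols, min_lik_frac=0.7, min_heat_frac=0.7, min_prod_frac=0.7,
                     ploidy_rng=(1, 6), purity_rng=(0.1, 1.0),
                     max_bad=2, max_frac_bad=0.2):
    def keep_top(cands, key, frac):
        # Reject candidates ranked in the bottom `frac` by `key`;
        # if that empties the list, keep the single best candidate instead.
        cands = sorted(cands, key=key)
        kept = cands[int(len(cands) * frac):]
        return kept if kept else [cands[-1]]

    # Method 1: rank filters on likelihood, heatmap value, and their product.
    kept = keep_top(sols, lambda s: s["likelihood"], min_lik_frac)
    kept = keep_top(kept, lambda s: s["heat"], min_heat_frac)
    kept = keep_top(kept, lambda s: s["likelihood"] * s["heat"], min_prod_frac)

    # Method 2: reject solutions with too many out-of-range adjusted VAFs.
    ok = [s for s in kept
          if not (s["n_bad_vaf"] > max_bad and s["frac_bad_vaf"] > max_frac_bad)]
    kept = ok if ok else kept

    # Tie-breaking: ploidy range, then purity range, then highest likelihood.
    for key, (lo, hi) in (("ploidy", ploidy_rng), ("purity", purity_rng)):
        if len(kept) > 1:
            in_rng = [s for s in kept if lo <= s[key] <= hi]
            kept = in_rng or kept
    return max(kept, key=lambda s: s["likelihood"])
```

Under this sketch, a solution with both high likelihood and a high heatmap value survives the rank filters, while a high-likelihood solution with a low heatmap value is discarded.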
Another function of ManualCuratePureCNsample.R is to read the PureCN segments output file and estimate the biopsy copy ratio that corresponds to the presumed no-CNVs state. Typically there are long segments on many separate chromosomes where the PureCN copy ratio is approximately, but not exactly, 1, usually differing by no more than 0.01 between such segments. A Wilcoxon test comparing the copy ratios of the markers within such segments to the expected no-CNVs copy ratio of 1.0 often produces a very small p-value, because the marker copy ratios differ slightly from 1.0 and exhibit almost no variance. This suggests that PureCN does not normalize the copy ratio data as well as it might: even in a highly disturbed cancer genome there should be a significant genome fraction in dosage balance, sharing a common copy ratio. That copy ratio should be identified as the no-CNVs copy ratio for the purpose of performing a Wilcoxon test for amplification or deletion. ManualCuratePureCNsample.R identifies segments whose unadjusted raw copy ratio is very close to 1 and that are relatively numerous compared to segments with other copy ratios, as follows:
1. Eliminate from consideration all segments whose copy ratios are outside the range 1 ± CR1_thresh. If this eliminates ALL segments, increase CR1_thresh by 10% and try again, repeating until at least minCRsegs segments of at least minCRsegLen total length are obtained.
2. Beginning with roundCR = roundCRstart, round each remaining segment's unadjusted copy ratio to the nearest multiple of roundCR.
3. Find the most common rounded copy ratio. If more than one rounded copy ratio occurs with the same maximum frequency, choose the one closest to 1.
4. If the number of segments with that most common rounded copy ratio is < minCRsegs, or if their total length is < minCRsegLen, increase roundCR by roundCRdelta and repeat steps 2-3 until an acceptable most common rounded copy ratio is found.
The mean (unrounded) copy ratio of the segments whose rounded copy ratio equals the most common value is computed and added to the 'Comment' column of the new curation .csv file, as in, for example, 'LOW PURITY;EXCESSIVE LOH;CR1=1.0233'. Dividing any raw PureCN copy ratio by this number moves it toward what is surmised to be a more accurate value, with the most-common-copy-ratio segments acquiring an adjusted copy ratio equal to, or at least closer to, 1.0. The program writes a new copy of the .csv file containing the PureCN purity and ploidy solution and the computed CR1 value to a new subdirectory named "curated" (on a per-tumor-biopsy basis).
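The CR1 search can be sketched as follows. This is an illustrative Python reading of the steps above, not the R implementation; segments are represented as hypothetical (copy_ratio, length_bp) pairs, and the threshold-relaxation behavior is our interpretation of the description.

```python
# Sketch of the CR1 (no-CNVs copy ratio) estimation described above (illustrative).
def estimate_cr1(segments, cr1_thresh=0.15, round_start=0.01, round_delta=0.01,
                 min_segs=5, min_len=10e6):
    # segments: list of (copy_ratio, length_bp) tuples.
    # Step 1: keep segments with copy ratio within 1 +/- cr1_thresh, relaxing
    # the threshold by 10% until enough segments of enough total length remain.
    while True:
        near1 = [(cr, ln) for cr, ln in segments if abs(cr - 1) <= cr1_thresh]
        if len(near1) >= min_segs and sum(ln for _, ln in near1) >= min_len:
            break
        cr1_thresh *= 1.1
    # Steps 2-4: round copy ratios to multiples of round_cr, take the most
    # common rounded value (ties broken toward 1), and coarsen the rounding
    # until that bin holds enough segments and total length.
    round_cr = round_start
    while True:
        bins = {}
        for cr, ln in near1:
            key = round(cr / round_cr) * round_cr
            bins.setdefault(key, []).append((cr, ln))
        best = max(bins, key=lambda k: (len(bins[k]), -abs(k - 1)))
        segs = bins[best]
        if len(segs) >= min_segs and sum(ln for _, ln in segs) >= min_len:
            # CR1 = mean of the unrounded copy ratios in the winning bin.
            return sum(cr for cr, _ in segs) / len(segs)
        round_cr += round_delta
```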

Rerunning PureCN.R on curated data
For each tumor biopsy for which a different purity/ploidy solution was chosen in the preceding step, the PureCN PureCN.R program was run a second time, with the --rds option specified to request that it recompute output files using the new purity/ploidy solution.

Mutation burden calculation
The PureCN Dx.R program was run on each tumor biopsy's PureCN .rds file, with options --callable <BED_TARGET_RGNS> --exclude hg38_simpleRepeats_merged.bed --force, to estimate the mutation rate. Per-tumor-biopsy comma-separated value output files are created.

CNV segment p-values
A custom R program named CreateCNVsegmentPvals.R was run for each tumor biopsy, with options --fdr 0.01 --CR_thresh 0.1 --CRdif_thresh 0.25 --CRdif_frac 0.25 and input files BED_TARGET_RGNS, the PureCN curation, .rds data, and segment copy-number files. It assigned each segment a p-value by using a Wilcoxon test:
1. For each segment, get the PureCN markers overlapping it from the PureCN .rds file and adjust their copy ratios by dividing them by the presumed no-CNVs copy ratio (the CR1 value obtained from the curation file comment).
2. Further adjust the marker copy ratios for purity and ploidy (see below), using those of the PureCN solution given in the PureCN curation file. Copy ratios that were originally equal to the CR1 value are now exactly 1.
3. Use one-tailed, one-sample Wilcoxon tests of the adjusted marker copy ratios against the null hypothesis of a population mean rank of 1, with alternative hypotheses of less than 1 and greater than 1 (two tests, one per direction). Select the smaller of the resulting p-values as the result p-value for the segment. If the first p-value is selected, the segment is annotated as having a copy ratio < 1 (DELetion); if the second, > 1 (DUPlication).
A multiple-testing-adjusted q-value was associated with each such p-value using the Benjamini and Hochberg (FDR) method. Each segment is annotated with its adjusted copy ratio, mean marker adjusted copy ratio, p-value, q-value, and the names of specified target genes contained within it.
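The per-segment test and the BH adjustment can be sketched minimally as below, using scipy in place of the R implementation and toy marker values; this is illustrative only, not the CreateCNVsegmentPvals.R code.

```python
# Sketch of the two one-sided Wilcoxon tests per segment and the BH adjustment.
import numpy as np
from scipy.stats import wilcoxon

def segment_pvalue(marker_crs_adj):
    # One-sample Wilcoxon signed-rank tests of adjusted marker copy ratios
    # against 1, one per direction; report the smaller p-value and the call.
    x = np.asarray(marker_crs_adj) - 1.0
    p_less = wilcoxon(x, alternative="less").pvalue
    p_greater = wilcoxon(x, alternative="greater").pvalue
    return (p_less, "DEL") if p_less < p_greater else (p_greater, "DUP")

def bh_qvalues(pvals):
    # Benjamini-Hochberg adjustment of a vector of p-values.
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    q = np.empty(n)
    prev = 1.0
    for rank, i in enumerate(order[::-1]):       # largest p first
        prev = min(prev, p[i] * n / (n - rank))  # step-up procedure
        q[i] = prev
    return q
```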

CNV gene p-values
A custom R program named CreateCNVgenePvals.R was run for each tumor biopsy, with option --ttest and, as input, the PureCN curation, .rds data, and gene copy-number files. It assigned each gene a p-value by using a Student's t-test:
1. For each gene, get the PureCN markers overlapping it from the PureCN .rds file and adjust their copy ratios by dividing them by the presumed no-CNVs copy ratio (the CR1 value obtained from the curation file comment).
2. Further adjust the marker copy ratios for purity and ploidy (see below), using those of the PureCN solution given in the PureCN curation file. Copy ratios that were originally equal to the CR1 value are now exactly 1.
3. Use one-sample Student's t-tests of the adjusted marker copy ratios against the null hypothesis of a normal population mean equal to 1, with alternative hypotheses of less than 1 and greater than 1 (two tests, one per direction). Select the smaller of the resulting p-values as the result p-value for the gene. If the first p-value is selected, the gene is annotated as having a copy ratio < 1 (DELetion); if the second, > 1 (DUPlication).
A multiple-testing-adjusted q-value was associated with each such p-value using the Benjamini and Hochberg (FDR) method. Each gene is annotated with its adjusted copy ratio, mean marker adjusted copy ratio, p-value, and q-value.
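The analogous per-gene test can be sketched with scipy's one-sample t-test; this is illustrative only (the actual analysis is done in R by CreateCNVgenePvals.R), and the marker values are toy data.

```python
# Sketch of the two one-sided one-sample t-tests per gene described above.
from scipy.stats import ttest_1samp

def gene_pvalue(marker_crs_adj):
    # Test adjusted marker copy ratios against a mean of 1, one test per
    # direction; keep the smaller p-value and the corresponding call.
    p_less = ttest_1samp(marker_crs_adj, 1.0, alternative="less").pvalue
    p_greater = ttest_1samp(marker_crs_adj, 1.0, alternative="greater").pvalue
    return (p_less, "DEL") if p_less < p_greater else (p_greater, "DUP")
```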

Chromosome copy-number
A custom R program named GetChrCNV.R was run for each tumor biopsy, with options --ploidyTolForGain 0.6 --ploidyTolForLoss 0.6 --fracForAltered 0.66, to analyze the PureCN copy-numbers for CNV segments along the genome, in order to estimate copy-numbers and loss-of-heterozygosity for whole chromosomes and chromosome arms. The program used centromere positions obtained by downloading the human genome version GRCh38 centromeres track using the UCSC Table Browser. For each chromosome and chromosome arm, the program reports:
1. PureCN purity/ploidy estimates and the number of PureCN CNV segments in the chromosome or arm
2. mean adjusted copy ratio and copy-number of the segments
3. the rounded marker adjusted copy-number that covers the most CNV segment distance among chromosome or arm segments
4. fraction of distance covered by the CNV segments having that most frequent rounded adjusted copy-number
5. fraction of chromosome marker distance over which the segments' adjusted copy-number was > ploidy+0.75
6. fraction of chromosome marker distance over which the segments' adjusted copy-number was < ploidy-0.75
7. fraction of chromosome marker distance over which PureCN's estimated segment minor copy number was 0, not flagged by PureCN as unreliable, and with total rounded copy number between 1 and 6 (loss-of-heterozygosity, LOH)
8. flags indicating whether the fractions in items 4-7 exceed the threshold of 0.66 of the chromosome marker distance
9. number of segments available to estimate the copy ratio
10. number of segments with gain if the gain flag is set, or with loss if the loss flag is set
11. mean copy ratio/number of those segments with gain if the gain flag is set, or with loss if the loss flag is set

Ancestral CNV events and TestCNVedgesOfSamplesOfPerson.R
The CNV analysis performed by PureCN is done individually per biopsy, but we wished to identify ancestral CNV events that are common to more than one biopsy of an individual. PureCN does no per-person analysis, so a custom R program, named TestCNVedgesOfSamplesOfPerson.R, was developed for this purpose. The fundamental idea behind the program is that ancestral CNV events should produce CNV segments with the same start and/or end position in the biopsies descending from the ancestral cell.

Adjusting copy ratios, copy-numbers, and variant allele frequencies for purity and ploidy
The raw output from PureCN programs normally includes raw, unadjusted copy-numbers (for variants, segments, and genes). The purity and ploidy of a biopsy strongly affect these numbers.
Adjustment of copy ratios and copy-numbers for purity and ploidy is discussed in Zack et al (120). In that paper, the algebra is incorrect in the q(x)/T factor of the equation for computing R'(x) on page 13. Corrected equations were incorporated at numerous places into the PureCN CNV analysis presented above. In the R language, the adjustment of a raw copy ratio CRraw for estimated purity and ploidy, producing the corrected copy ratio CR, is:
CR = (purity*ploidy*CRraw + 2*(1-purity)*CRraw - 2*(1-purity))/(purity*ploidy)
ind = (CR < 0)
CR[ind] = 0
The inverse operation was also used at times:
CRraw = (purity*ploidy*CR + 2*(1-purity))/(purity*ploidy + 2*(1-purity))
ind = (CRraw < 0)
CRraw[ind] = 0
Variant allele frequencies (VAFs) are also altered by purity and ploidy. If a given locus has a copy-number (in the tumor cells) of CN and the actual VAF is VAF, then each tumor cell carries CN*VAF copies of the variant allele. For N cells total, the number of variant allele copies will be CN*VAF*purity*N, since only purity*N of the cells are tumor cells. The total number of copies of the locus (wild type and variant) in N cells is CN*purity*N + 2*(1-purity)*N. The observed (raw) VAF is therefore:
VAFraw = (CN*VAF*purity) / (CN*purity + 2*(1-purity))
Solving this for VAF gives the formula for correcting a raw VAF for purity, given the raw VAF and the copy-number CN at its locus:
VAF = VAFraw * (2*(1-purity)/(CN*purity) + 1)
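Transcribed to Python for illustration, the adjustment formulas above round-trip correctly (raw to adjusted and back recovers the input), which serves as a quick check of the algebra.

```python
# Python transcription of the purity/ploidy adjustment formulas given above.
def adjust_cr(cr_raw, purity, ploidy):
    # Correct a raw copy ratio for purity and ploidy; clamp negatives to 0.
    cr = (purity * ploidy * cr_raw + 2 * (1 - purity) * cr_raw
          - 2 * (1 - purity)) / (purity * ploidy)
    return max(cr, 0.0)

def unadjust_cr(cr, purity, ploidy):
    # Inverse operation: recover the raw copy ratio from the corrected one.
    cr_raw = (purity * ploidy * cr + 2 * (1 - purity)) / (
        purity * ploidy + 2 * (1 - purity))
    return max(cr_raw, 0.0)

def adjust_vaf(vaf_raw, purity, cn):
    # Correct a raw VAF for purity at a locus with tumor copy-number cn.
    return vaf_raw * (2 * (1 - purity) / (cn * purity) + 1)
```

For example, with purity 0.6 and CN 3, a true VAF of 0.5 is observed as 0.9/2.6 ≈ 0.346, and adjust_vaf recovers 0.5.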

Classifying into molecular subtypes
The TCGA adenocarcinoma paper(13) developed a molecular subtype classification for stomach cancer. The paper did not provide formal methods for testing DNA sequence data from a biopsy in order to classify it, but it did give general guidelines. The four molecular subtypes are prioritized as follows, and their guidelines are: 1. EBV (Epstein-Barr virus): if the biopsy has strong evidence of infection by this virus, it is classified as EBV.
2. MSI (microsatellite instability): otherwise, if the biopsy was determined to have microsatellite instability, it is classified as MSI. 3. CIN (chromosome instability): otherwise, a biopsy with significant aneuploidy was classified as CIN, and chromosomal arms were considered altered if at least 66% of the arm was lost or gained with a log2 copy-number change greater than 0.1. 4. GS (genomically stable): otherwise, the biopsy was classified as genomically stable.
A custom R program named classifyMolecular.R was developed to apply this classification scheme to the biopsies. The specific algorithms used were:
1. EBV: the count of reads mapping to EBV genes in the panel was adjusted for biopsy purity by dividing it by the PureCN purity estimate for each biopsy. The total read count for each biopsy was likewise adjusted. The purity-adjusted EBV read count for each biopsy was then adjusted for library size by multiplying it by the mean purity-adjusted total read count and dividing by the purity-adjusted total read count of that biopsy. The resulting normalized, purity-adjusted EBV read counts were clustered into two clusters using k-means, and the midpoint between the clusters was chosen as the threshold for calling EBV infection and the EBV molecular subtype.
2. MSI: the output file from the MSIsensor program was read, and biopsies whose MSIsensor score exceeded the threshold of 3.5 were called MSI molecular subtype.
3. CIN: the output file from GetChrCNV.R was read, and for each biopsy the number of altered chromosome arms was counted. The counts were clustered into two clusters using k-means, and the midpoint between the clusters was chosen as the threshold for calling the CIN molecular subtype for a biopsy. As with TCGA, an arm was called altered when at least 66% of it was lost or gained. Unlike TCGA, a loss or gain segment was considered to be any segment whose purity/ploidy-adjusted copy-number was > ploidy+0.75 or < ploidy-0.75 (whereas TCGA used 1.07 (= 2^0.1) instead of 0.75). This was because a typical single-chromosome gain or loss changes the copy-number by exactly 1.0 in the ideal situation, but since copy-number values are noisy, such a value could easily fall slightly above 1.0 even when the actual value is indeed 1.0; the TCGA value of 1.07 was considered too close to 1.0.
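The two-cluster k-means threshold used for the EBV and CIN calls can be sketched as a simple one-dimensional two-means split; this is illustrative Python, not the classifyMolecular.R implementation.

```python
# Sketch: cluster 1-D values into two groups (k-means, k=2) and return the
# midpoint between the cluster centers as the calling threshold.
def two_means_threshold(values, iters=100):
    lo, hi = min(values), max(values)      # initial centers at the extremes
    for _ in range(iters):
        mid = (lo + hi) / 2
        a = [v for v in values if v <= mid]
        b = [v for v in values if v > mid]
        if not a or not b:
            break
        lo2, hi2 = sum(a) / len(a), sum(b) / len(b)
        if (lo2, hi2) == (lo, hi):         # converged
            break
        lo, hi = lo2, hi2
    return (lo + hi) / 2                   # midpoint between cluster centers
```

Values above the returned threshold would be called positive (e.g. EBV-infected, or CIN for altered-arm counts).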
In addition to classifying biopsies for molecular subtype, we classified individuals for molecular subtype. The rules were simple: 1. Individual is EBV if any biopsy was classified as EBV.
2. Otherwise, individual is MSI if any biopsy was classified MSI. 3. Otherwise, individual is CIN if any biopsy is CIN. 4. Otherwise, individual is GS.
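These priority rules amount to a few lines of code. A sketch, assuming biopsy calls are the strings "EBV", "MSI", "CIN", and "GS":

```python
# Per-individual molecular subtype from per-biopsy calls, in priority order.
def individual_subtype(biopsy_calls):
    for subtype in ("EBV", "MSI", "CIN"):
        if subtype in biopsy_calls:
            return subtype
    return "GS"  # default when no higher-priority subtype is present
```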
Following this classification process, manual examination of the results revealed some cases of questionable classification. We added arguments to classifyMolecular.R to allow manual override of its automatic classifications. We made the following manual overrides:
• I_20447_S_26976 (T1): forced to type MSI. Its MSIsensor score was 1.39, below the threshold of 3.5; its sister tumor I_20447_S_26977 (T2) had a score of 10.66, and the two tumors had a similar number of somatic variants.
• I_26171_S_26816 (T1): forced to type GS. It was called as type CIN with an altered chromosome arm count of 11, slightly above the threshold of 9.5; its sister tumor I_26171_S_26817 (T2) had a count of 8, and both tumors had a very low number of somatic mutations.

H. pylori infection assessment
Assessment for H. pylori infection was done by classifyMolecular.R in a similar manner as for EBV infection. The count of the number of reads mapping to H. pylori genes (without regard to the strain(s) of H. pylori the reads mapped to) in the panel was adjusted for biopsy purity by dividing the count by the purity estimate from PureCN, for each biopsy. The total read count for each biopsy was likewise adjusted. The purity-adjusted H. pylori read count for each biopsy was adjusted for library size by multiplying it by the mean of the purity-adjusted total read count and dividing by the purity-adjusted total read count in that biopsy. The resulting normalized and purity-adjusted H. pylori read counts were clustered into two clusters using k-means. The midpoint between the clusters was chosen as a threshold for calling H. pylori infection.

Mutation druggability gene list construction
Two lists of genes with drugs approved for targeting specific mutations were used to estimate the potential druggability of the mutations we found. The OncoKB(121) gene list (122) was trimmed by excluding entries with drug-resistance levels-of-evidence and removing entries not annotated with a gene targeted by our panel, leaving 50 of the original 64 genes within OncoKB levels-of-evidence 1 through 4. The second gene list comprises 69 FDA GC-targeted therapy genes (123), of which 44 were in our panel. The combined list contained 65 genes; after two on a sex chromosome and five with poor coverage were removed, 58 druggable genes remained on our list.

Mutation signature analysis
Mutation signatures have been identified de-novo in TCGA cancer data from multiple cancer projects, using non-negative matrix factorization (124,125), including signatures specific to GC (126). The original signatures were single-base-pair substitutions with a single base on each side as context. Subsequently, a new study added new single-base signatures and expanded the signatures to double-base and indel substitutions (127). Both original (V2.0) and newer (V3.2) signatures have been curated in the COSMIC Mutational Signatures database (113). Each signature is defined by a probability vector whose sum is 1 and whose elements are the probabilities that the mutational process active for that signature will create a mutation of a given type. For single base substitutions, there are 96 possible mutation types including a single context base on each side, and so the vector has 96 probabilities in it. The total number of signatures in the V2 set is 30 and in the V3.2 set is 78. De novo signature discovery requires a large number of samples in order to pick up the signals of different signatures accurately. When the number of available samples is small, as it is in this MSEQ project, de novo signature discovery cannot be done.
Signatures operating in single samples can be estimated by multiple linear regression techniques that find the combination of weights over all signatures under consideration that minimizes the error between the predicted and actual mutation spectra. We used one such algorithm, deconstructSigs (128), an R package, to estimate signature weights. The algorithm requires a sufficient number of mutations to perform well, and because our project used a panel with limited genome coverage, the mutation count per sample was too low to expect high-quality results. Instead of extracting per-sample signature weights, we therefore combined mutations separately for MSS samples and for MSI samples, then ran deconstructSigs on each combined mutation set to estimate the weights of signatures caused by mutational processes presumably operating within that set of samples. We used the V3.2 signature set.
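The core idea, fitting non-negative signature weights to a pooled mutation spectrum by least squares, can be sketched as follows. The signatures here are toy random vectors, not COSMIC signatures, and deconstructSigs additionally normalizes, iterates, and discards small weights; this shows only the regression step.

```python
# Sketch: recover signature weights from a pooled spectrum by non-negative
# least squares (toy data; illustrative of the regression idea only).
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
S = rng.random((96, 5))           # 5 toy signatures over 96 mutation types
S /= S.sum(axis=0)                # each signature's probabilities sum to 1
w_true = np.array([0.6, 0.0, 0.3, 0.0, 0.1])
spectrum = S @ w_true             # observed pooled mutation spectrum

w_fit, _ = nnls(S, spectrum)      # non-negative least-squares weights
```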

Phylogenetic analysis
Existing programs for printing and annotating trees were found to be inadequate for our needs, so a custom R program named plotTrees.R was developed to generate, annotate, and print phylogenetic trees. Supplementary figure 2 shows a selected subsample of trees plotted by this program. Each tree corresponds to one individual and has one leaf node for each biopsy of the individual (including the normal biopsy). In addition, four supplementary figures contain phylogenetic trees for every patient (Supplementary figures 3, 5, 6 and 8).
PlotTrees.R uses the R phylogenetic tree packages "ape" (129) and "geiger" (130) to create the trees, and optionally performs bootstrapping on them. It was designed to produce trees from somatic SNV, indel data, and copy-number data of various kinds. It has many options that determine how trees are generated for each biopsy, including whether to use SNV/indel data, CNV data, or both. Three trees can be plotted on one page, a tree made from SNV/indel data, one made from CNV data, and one made from both kinds of data. Different sets of options can be applied to control not only the specific CNV data used to make the trees, but also to control the annotation on the trees. The biggest part of the program involves positioning tree branch annotation to make the tree as uncluttered and readable as possible, and that process is not described here. The plotTrees.R program is not general enough for use outside our lab.
The program computed and plotted one tree at a time, each tree belonging to a single individual. The remaining description applies to creation of one tree, for one individual.
To make a phylogenetic tree, a distance matrix must be created. It is a square matrix with one row and one column for each biopsy, and each value in the matrix represents the dissimilarity of the biopsies of that row and column, with a value of 0 indicating identical biopsies and larger values indicating greater dissimilarity. A distance matrix is created by first creating an event matrix, which has one column per biopsy, one row for each event that will be used to determine the relative ancestry of the biopsies, and values that may simply be 0 if the biopsy did not have the event and 1 if it did, or may be more complex. This is discussed below where the methods for creating each type of tree are described.
After the event matrix is created, it is used to create a distance matrix by applying a distance metric. In plotTrees.R the metric is the standard Euclidean distance, so that the distance between two biopsies is the square root of the sum of squared differences between the event matrix columns for those two biopsies. The distance matrix is turned into a standard R distance matrix object with the function as.dist(). The phylogenetic tree is then computed with the fastme.bal() function in the "ape" package, which performs the "minimum evolution" algorithm (131), and the root() function is used to root the tree at the normal biopsy.
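The event-matrix-to-distance-matrix step can be sketched as below, using Python with invented toy events and biopsy columns; the minimum-evolution tree construction itself relies on the R "ape" package and is not reproduced here.

```python
# Sketch: Euclidean distance matrix from an event matrix (toy data).
import numpy as np
from scipy.spatial.distance import pdist, squareform

# rows = events, columns = biopsies (normal, T1, T2, T3)
events = np.array([
    [0, 1, 1, 0],   # SOM event shared by T1 and T2
    [0, 0, 0, 1],   # SOM event private to T3
    [0, 2, 2, 0],   # LOH event shared by T1 and T2, weighted 2
])
# Distance between biopsies = Euclidean distance between their event columns.
D = squareform(pdist(events.T, metric="euclidean"))
```

In this toy case T1 and T2 have identical event columns, so their distance is 0, while the normal biopsy is closest to T3.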
When plotTrees.R is configured to perform bootstrapping, the boot.phylo() function is called with argument x being the transpose of the event matrix and the argument FUN being a function that recomputes a new tree from a truncated event matrix using the same method as was used for the main tree. The bootstrap result can be annotated on the tree branches as a percent number, which is derived as 100*prop.clades()/numBootstraps, where prop.clades() is the "ape" package function that returns the number that was associated with each tree node by the boot.phylo() function. When bootstrapping is performed, numBootstraps was always 100, i.e. 100 rounds of bootstrapping were performed.

Somatic SNV/indel-based trees
The somReliable filter was used to filter somatic variants to obtain reliably called variants (both SNVs and indels) from which to compute trees. (The more stringent somCodSplNoSyn filter was used to select variants for annotating onto the trees.) The event matrix was created by examining the VarStat variable for the variants passing the somReliable filter, across the tumor biopsies of the individual. Each biopsy with a WT value in VarStat, and the normal biopsy, was assigned an event matrix value of 0. Those with a SOM value were assigned a value of 1. Those with an LOH value were assigned a value of 2, so that LOH events are weighted twice as much as non-LOH events. SOM and LOH are treated as the same event, even though the LOH may have occurred as two separate events (a SOM event followed by another mutation causing LOH); treating them as separate events proved problematic and may not reflect the actual history anyway.

CNV-based trees
Several different types of CNV data could be used as events for creating an event matrix:
1. Change in ploidy from the diploid ploidy state
2. Chromosome and chromosome arm copy-number changes
3. Segment edge copy-number changes
4. Gene copy-number changes and loss-of-heterozygosity
plotTrees.R allows any subset or all of these data types to be used to create the event matrix.
Ploidy events are determined by data in the PureCN QC file, which contains the purity and ploidy estimates of PureCN (curated). Events are entered into the event matrix as one or two rows. Ploidy gain events are treated separately from ploidy loss events. The actual ploidy value was not used beyond comparing it to 2 to determine loss or gain, because the precise value may be unreliable depending on how well PureCN is able to determine the correct purity/ploidy solution. If any tumor biopsy had a ploidy gain, a row was added to the event matrix with all biopsies having 0 except those having the ploidy gain being 1. Likewise, a row was added if any tumor biopsy had a ploidy loss.
Chromosome and chromosome arm copy-number change events are determined by data in the chromosome CNV file containing the output of the GetChrCNV.R program described earlier, specifically by the values isGain, isLoss, isAltered, and isLOH for each chromosome arm and for the chromosome as a whole. When one or more of these flags is set for a chromosome in any tumor biopsy of the individual, events are created in the event matrix. Each chromosome can have one or more of these events:
1. whole chromosome gain
2. whole chromosome loss
3. whole chromosome LOH
4. chromosome p arm gain
5. chromosome p arm loss
6. chromosome p arm altered
7. chromosome p arm LOH
8. chromosome q arm gain
9. chromosome q arm loss
10. chromosome q arm altered
11. chromosome q arm LOH
Each of these is treated as a separate event. For example, if one biopsy has "whole chromosome isGain" and another has "whole chromosome isLoss", these are separate events, even though the chromosome is the same. A single event can occur in any subset of the biopsies or in all of them. For example, biopsies 1 and 2 might have a "whole chromosome #21 gain" while 3 and 4 have a "whole chromosome #21 loss".
The chromosome arm gain events are suppressed if the whole chromosome gain event is present, and likewise for losses. The whole chromosome altered event doesn't exist because it is likely in that case that one arm will have a gain and the other arm will have a loss, or one or both arms will be altered, so the event is picked up in the arms.
The meaning of "isAltered" is that a significant part of the chromosome or arm has either a gain or a loss (gain and loss distance is summed). That is consistent with TCGA definition of chromosome alteration used for assigning chromosome instability to a sample.
The event matrix contains 0 when a biopsy did not have the event, and 1 when it did. Segment edge copy-number change events are determined by data in the CNV edges file containing the output of the TestCNVedgesOfSamplesOfPerson.R program described earlier. Segment edge data is used for these events, rather than creating events for each segment that is identified as having a copy-number change, because segments do not necessarily correspond to single ancestral CNV events. If a second copy-number change occurs in one biopsy in the middle of a segment in which an ancestral copy-number change occurred that was inherited by all the biopsies, the biopsy with two events will have three segments rather than one. It is unlikely that the actual ancestral order of events could be determined from the segment data. Therefore, we took the approach of recognizing that each ancestral segment edge should be present in all biopsies even if each segment were not. By testing the copy-number to the left and right of each segment edge (regardless of which biopsy it occurred in), the edge can be validated as being present, with a p-value attached to the test. The TestCNVedgesOfSamplesOfPerson.R program performs this edge testing. While it is possible that two CNV events with opposite effects on copy-number will occur at the same location within a biopsy, this is regarded as unlikely and not a significant influence on the phylogenetic results. Also, since every CNV event generates two edges, each event could be entered into the event matrix twice (if both edges are tested as significant). The effect of this could be countered with event matrix weights, discussed below.
Edge events are created from tested segment edges with filter=PASS, SVTCON != INCONSISTENT, and are not near the edge of a chromosome arm (which is defined as being located farther than 500,000 bp away from a chromosome end or centromere). The event occurs only in biopsies with SIG=YES, and the edge is ignored if the DIR value differs among the SIG=YES biopsies. The edge is also ignored if the relationship of CR_L to CR_R is inconsistent with DIR in any SIG=YES biopsy. Two types of events are created, copy-number decrease events and copy-number increase events, and edges are ignored if both types of events occur at the same position, because that is unlikely to have occurred independently.
The event matrix contains 0 when a biopsy did not have the event, and 1 when it did.
Gene-CNV-segment copy-number change and LOH events are determined by data in the gene CNV file containing the output of the CreateCNVgenePvals.R program described earlier. A single copy-number change or loss-of-heterozygosity can alter the copy-number of many genes, even hundreds of genes, depending on its length. Therefore, defining an event for each gene that undergoes such a change would completely distort the similarity between different biopsies. To try to avoid this problem, gene-CNV-segment events were implemented in a manner that assigns a single event to multiple genes.
A gene-CNV-segment copy-number gain/loss event is defined as a gene whose "CNVtype", generated by CreateCNVgenePvals.R as a result of t-tests of gene marker copy ratios, is AMP or DEL, where the purity/ploidy-adjusted copy-number changes from the ploidy value in a direction consistent with "CNVtype", and where the PureCN "C.flagged" flag for the gene is not set (which would indicate unreliable copy-number) in any biopsy of the person. Genes not in the target panel are excluded. Since copy-number can go up in one biopsy due to a segment amplification, and down in a different biopsy due to a segment deletion, amplifications and deletions are treated as two separate events, so that one or more biopsies could have an amplification event in a gene while one or more OTHER biopsies have a deletion event in the same gene.
A gene loss-of-heterozygosity event is defined as a gene whose "loh" flag has been set by PureCN, where again the PureCN "C.flagged" flag for the gene is not set in any biopsy of the person. Genes not in the target panel are excluded. Loss-of-heterozygosity can occur with or without copy-number gain or loss, so it is treated as a separate event from copy-number gain/loss, even though in some cases both might be caused by the same underlying CNV event.
The same segment CNV can generate multiple gene copy-number gain/loss events and/or multiple gene loss-of-heterozygosity events (when the segment extends over more than one gene). In order not to count such a segment event as multiple events, adjacent gene events are combined into a single event whenever they occur on the same segments within each biopsy. For example, if genes A and B both have an amplification and both are located in CNV segment S1 in biopsy 1, segment S2 in biopsy 2, segment S3 in biopsy 3, and so on, then they are combined into the same copy-number gain event, assigned to both genes. If the segment number differs in any biopsy, the gene events are not combined. The same is done for deletions and for loss-of-heterozygosity. This is expected to eliminate the large majority of multiple gene events arising from the same segment. The actual copy-numbers of the genes and segments play no part in this process.
It is possible for a gene event to also be called as a chromosome arm event. We do not attempt to distinguish that case and treat each type of event independently, so in those cases the event is counted twice.
The actual copy-number value of a gene CNV event is ignored: if two biopsies both show a copy-number increase in a gene, this is considered the same event even when the copy-numbers differ. There are two reasons for this. First, PureCN copy-numbers are not accurate enough for us to be certain that two different values reflect real differences. Second, if the copy-numbers really do differ, most likely there were at least two copy-number events, one of them shared by the two biopsies; but we have no way to be sure of that, and splitting the event into two might or might not reflect the actual history. To keep it simple, just one event is created. The actual delta copy-numbers of the biopsies can, however, be annotated on the tree if desired.
The event matrix contains 0 when a biopsy did not have the event, and 1 when it did.

Event matrix weighting
The event matrix as described above contains only 0's (no event) and 1's (event), except for SNVs/indels with both WT and LOH genotypes, which have a value of 2. However, there is no reason to think that every type of event should carry the same weight in determining similarity. We addressed this by creating a weight vector that assigns a weight to each type of event (SNV/indel events or any of the four types of CNV events). Each row of the event matrix is multiplied by the appropriate weight before the matrix is used to generate the distance matrix. The weight of SNVs/indels is always set to 1.0 as a base reference. The relative weights of the CNV events were chosen by rough, subjective estimation. The balance of weights between CNV and SNV/indel events was chosen by generating trees with different weight balances and comparing the SNV/indel-only, CNV-only, and combined (SNV/indel/CNV) trees; the final balance was the one whose combined trees appeared to have a topology midway between the SNV/indel and CNV trees:
1. SNVevents = 1.0
2. Ploidy = 0.5
3. ChrCNVs = 1.0
4. CNVedges = 0.75
5. CNVgenes = 0.5
No algorithm has been found that might let us choose an event weight vector in an unbiased manner, and therefore the combined trees are perhaps to be trusted less than the other two tree types.
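The row-weighting step can be sketched as a simple broadcast multiplication. This is an illustrative Python sketch rather than the pipeline's R code; the event-type names match the weight list above, and the matrix layout (events as rows, biopsies as columns) is an assumption.

```python
import numpy as np

# Final weight balance quoted in the text; SNVs/indels are the 1.0 reference.
WEIGHTS = {
    "SNVevents": 1.0,
    "Ploidy":    0.5,
    "ChrCNVs":   1.0,
    "CNVedges":  0.75,
    "CNVgenes":  0.5,
}

def weight_event_matrix(event_matrix, row_types):
    """Multiply each row of the event matrix by its event-type weight.

    event_matrix: events x biopsies array of 0/1 (or 2 for WT+LOH SNVs).
    row_types: event-type name for each row, keys of WEIGHTS.
    """
    w = np.array([WEIGHTS[t] for t in row_types])
    return event_matrix * w[:, None]   # broadcast weight down each row
```

The weighted matrix, not the raw 0/1 matrix, is then used to build the distance matrix from which trees are drawn.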
When plotTrees.R options are set so as to annotate a particular type of mutation event on the tree, the list of such events is analyzed, one event at a time, to determine on which branch or branches the annotation should be placed, starting at the root node corresponding to the normal biopsy, and moving down the tree towards the leaves. On a particular branch, if all biopsy leaf nodes downstream of that branch have the mutation in question, then that event is annotated on that branch. Otherwise, the same analysis continues on one or both downstream branches, depending on whether any leaf nodes of one or both branches have the mutation. Therefore, it is possible for the same event to be annotated on multiple branches. For example, if a tree has three biopsies, T1, T2, and T3, and T1 branches off first, then if T1 and T2 but not T3 had a certain mutation event, then the event would be annotated on the branches leading to T1 and T2 but not on the branch leading to T3. If on the other hand the event were clonal and occurred in all three biopsies, it would be annotated on the branch leading out of the normal biopsy, because that is the branch that is upstream from all three biopsies. When a branch is annotated with an event, the inference is that the event occurred somewhere along that branch in the ancestry of the leaf node biopsies of that branch. The order of occurrence of events annotated on a branch is not inferred and remains unknown.
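The top-down annotation placement described above is a short recursion over the tree. The sketch below is an assumed reimplementation in Python, not plotTrees.R itself; the tree is a plain parent-to-children dict, which is a hypothetical structure for illustration.

```python
def leaves_under(node, children):
    """Set of leaf (biopsy) names at or below `node`."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    leaves = set()
    for c in kids:
        leaves |= leaves_under(c, children)
    return leaves

def annotate(node, children, mutated, out):
    """Annotate the branch leading into `node` when every downstream biopsy
    carries the mutation; otherwise recurse into children that contain at
    least one mutated biopsy. `out` collects annotated branch endpoints."""
    leaves = leaves_under(node, children)
    if not (leaves & mutated):
        return                   # no mutated biopsy below: stop here
    if leaves <= mutated:
        out.append(node)         # all biopsies mutated: annotate this branch
        return
    for child in children.get(node, []):
        annotate(child, children, mutated, out)
```

Running this on the three-biopsy example from the text (T1 branching off first, mutation in T1 and T2 but not T3) annotates the branches leading to T1 and T2; a clonal mutation in all three biopsies is annotated once, on the branch leaving the normal biopsy.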
Each tree shows one patient's biopsies and is built from a distance matrix derived from an event matrix of all SNV and indel mutations identified with our cancer gene panel (CNV changes are not used for this event matrix). Leaves are labeled with the biopsy ID ("N" for the normal biopsy and "T#" for tumor biopsies), and trees are rooted at the normal biopsy. Tree edge length is proportional to the total weight of mutations occurring on the edge; the preponderance of private over clonal and subclonal mutations makes most terminal edges longer than internal ones. Edge annotations show genes that contain a non-silent SNV, indel, or CNV mutation occurring on that edge and that were identified as having an approved drug (OncoKB) targeting mutations in the gene (not necessarily the mutations occurring here; these annotations include all non-silent mutations, not just those identified as pathogenic in the main manuscript). Internal edges correspond to subclonal mutations, present in multiple biopsies.