Large-scale transcriptome analyses reveal new genetic marker candidates of head, neck, and thyroid cancer

A detailed genome mapping analysis of 213,636 expressed sequence tags (EST) derived from nontumor and tumor tissues of the oral cavity, larynx, pharynx, and thyroid was done. Transcripts matching known human genes were identified; potential new splice variants were flagged and subjected to manual curation, pointing to 788 putatively new alternative splicing isoforms, the majority (75%) being insertion events. A subset of 34 new splicing isoforms (5% of 788 events) was selected and 23 (68%) were confirmed by reverse transcription-PCR and DNA sequencing. Putative new genes were revealed, including six transcripts mapped to well-studied chromosomes such as 22, as well as transcripts that mapped to 253 intergenic regions. In addition, 2,251 noncoding intronic RNAs, eventually involved in transcriptional regulation, were found. A set of 250 candidate markers for loss of heterozygosis or gene amplification was selected by identifying transcripts that mapped to genomic regions previously known to be frequently amplified or deleted in head, neck, and thyroid tumors. Three of these markers were evaluated by quantitative reverse transcription-PCR in an independent set of individual samples. Along with detailed clinical data about tumor origin, the information reported here is now publicly available on a dedicated Web site as a resource for further biological investigation. This first in silico reconstruction of the head, neck, and thyroid transcriptomes points to a wealth of new candidate markers that can be used for future studies on the molecular basis of these tumors. Similar analysis is warranted for a number of other tumors for which large EST data sets are available.


Introduction
Head and neck squamous cell carcinomas (HNSCC) are common and aggressive neoplasms, with an estimated annual incidence of 615,900 new cases worldwide (1) and an increasing incidence rate in many areas of the world (2)(3)(4).Thyroid cancer is the most common neoplasia from the endocrine system with an estimated annual incidence of 122,800 worldwide.Most of the incidence variations are probably due to ethnic or environmental factors such as radiation, dietary habits, or other external factors.
Cancer phenotype is defined by the accumulation of mutations and epigenetic changes that may alter protein function and/or cause alterations in transcriptional patterns.These events usually confer competitive advantages to a cell, ultimately leading to the malignant phenotype.Transcriptome wide analysis is an important tool to reveal some of these molecular mechanisms that underlie human malignancies (5,6).
To generate a database of transcriptional profiles derived from tumor and nontumor tissues, a high-throughput cDNA-sequencing project, based on normalized mini-cDNA libraries generated with the open reading frame expressed sequence tags (ORESTES) methodology, was conducted by our group, yielding a set of f1.2 million expressed sequence tags (EST) from diverse human tissues, from which a filtered set was deposited in public databases (6,7).ORESTES sequences are predominantly derived from the central coding region of genes, thus favoring the identification of gene function based on protein similarity.In addition, this methodology partially normalizes the population of tagged genes, enabling the study of rare transcripts (8).The present study reports on a detailed informatics analysis of 134,495 ORESTES from head and neck as well as 79,141 ORESTES from thyroid tissues followed by experimental validation of new transcripts and new splicing isoforms and by an evaluation of putative markers for these malignancies.

Materials and Methods
Patients and Biological Samples.Subjects were selected at random from surgical patients treated at the Head and Neck Surgery and Otorhinolaryngology Department of the Hospital do Ca ˆncer A.C. Camargo in Sa ˜o Paulo, Brazil.All patients were treated and followed up according to prevailing protocols.Eligible patients were asked to sign an informed Note: E.M. Reis  consent previously approved by the hospital's institutional review board.
The criteria for eligibility were determined to insure a good representation of the transcriptional diversity of the tissues studied.On this basis, selected patients had to be at least 18 years old and have a histologically confirmed diagnosis of a non-previously treated (a) oral, pharyngeal, hypopharyngeal, or laryngeal squamous cell carcinoma (31 samples) or (b) thyroid carcinoma, adenoma, or goiter (45 samples).Detailed composition of samples, in terms of tissue origin and pathologic state, is given in Supplementary Methods.Tumor and adjacent nontumor tissues were collected, snap-frozen, and stored; tumor margins that were considered disease free after histologic examination were used here as nontumor samples.Before RNA extraction, all pathologic diagnoses were reconfirmed and samples were macrodissected, reducing nontumor tissues to <20%.RNA Extraction and ORESTES.PolyA + RNA was isolated, treated with DNase and purified as described before, (8) and used for constructing 1,576 ORESTES cDNA minilibraries (8) that were cloned and sequenced (7).The detailed composition of each of these cDNA minilibraries, in terms of tissue origin and pathologic state, is given in Supplementary Methods and in Supplementary Table 1.A total of 1,003 cDNA minilibraries were sequenced, generating 18,620 sequences from nontumor tissues, 190,253 from tumor tissues, and 4,763 from thyroid goiter samples; 134,495 sequences were derived from head and neck and 79,141 from thyroid tissues.mRNA derived from a second set of tumors and nontumor samples as well as from four human cell lines (FaDu, Hep2, HeLa, and Siha) was extracted as described above and used for the experimental validation of new splicing isoforms, new genes, and potential tumor markers predicted by bioinformatics analyses.
cDNA Sequence Analyses.A total of 213,636 ESTs generated from head and neck and thyroid tissues were analyzed along with local copies of the human genome sequence draft (University of California at Santa Cruz, June 2002) and of the RefSeq and mRNA public data sets (www.ncbi.nlm.nih.gov).The complete set of 1,187,342 ESTs produced in the Human Cancer Genome Project (HCGP; ref. 7) from more than 20 tissues was filtered to reduce the presence of low-quality sequences or contaminants, and assembled in 956,456 reads, which included 173,640 sequences derived from head, neck, and thyroid tissues.The data set was clustered using BLAST (http:// www.ncbi.nlm.nih.gov/) and assembled with Phrap (http://www.phrap.org).To further reduce the residual redundancy and to eliminate misassembled contigs or spurious sequences that passed the filtering step, all contigs and singlet sequences were aligned to the draft human genome sequence using BLAT (9).All mapped sequences that exhibited genomic overlap were merged into a unique genomic cluster.Only sequences that aligned with >90% identity to the draft sequence through at least 50% of their length were considered for further analyses.A subset of 119,552 ESTs that were used here and were not analyzed in previous publications (6)(7)(8) were now deposited in dbEST GenBank under accession numbers CV309219 to CV428770.
Identification of Tissue-Specific Transcriptional Markers.Genes that were preferentially expressed in certain head and neck sites (oral cavity, pharynx, and larynx) or thyroid were identified using a Bayesian model, as described in detail in Supplementary Methods.Using this approach, EST count thresholds were established to define 95% confidence intervals for the absence or presence of expression of a certain gene in each anatomic site.
Analysis of New Splicing Isoforms and New Human Transcripts.Four trained investigators, using the same criteria, visually inspected the genomic mapping of all ESTs with putative new splicing isoforms.For all new splicings approved by visual inspection, additional data were collected regarding splicing classification (10), size of the event in terms of nucleotides involved, donor/acceptor sites, and confirmation of the event by other cDNA sequences.Reverse transcription-PCR (RT-PCR) and cDNA sequencing were done to confirm a subset of these new gene isoforms.To reduce individual variations, isoform validations were done using human cell lines as well as pooled samples from different patients and topological locations.
Candidates for new human transcripts were identified by selecting clusters or single ESTs with no similarity to known human genes, as determined by BLASTN queries against a complete set of 162,104 human mRNA full-length sequences deposited in GenBank (cutoff E value = 10 À15 ).In addition, clusters and singlets were identified that mapped to a 2.37-Mb region of chromosome 22 (22q11.2),flanking marker D22S421, previously shown to be deleted in larynx tumors (11).These were selected for manual annotation, which included sequence alignment with in silico gene predictions (GeneScan) and comparison with mouse full-length transcripts and ESTs, using the UCSC Genome Browser (http:// genome.ucsc.edu/).A set of putative new transcripts was validated by RT-PCR, using pooled mRNAs from nontumor or tumor larynx and further evaluated in 45 different mRNA samples derived from larynx (14 nontumor and 13 tumor), tongue (9 nontumor and 3 tumor), and tonsil (4 nontumor and 2 tumor).
Evaluation of Differential Gene Expression.Experimental evaluation of differential expression was done for a subset of ORESTES contigs mapped to genomic regions that, according to the literature data, exhibit recurrent loss of heterozygosity or amplifications in the selected tumors.Three genes were selected for experimental validation of their differential expression by quantitative real-time PCR analysis in a GenAmp 5700 (Applied Biosystems, Foster City, CA).cDNA was produced from individual tumor and paired normal adjacent tissue samples from the larynx and oral cavity.To reduce individual sample variations possibly arising from contamination of normal tissue with adjacent tumor, the average expression level of each gene in all normal samples was used as a reference for each tissue examined.For each patient, fold change of expression was calculated as the ratio between the individual tumor sample and the average level of normal tissue samples.Each cDNA sample was analyzed in duplicate reactions using the SybrGreen PCR Core Reagent (Applied Biosystems), and normalization was done according to the manufacturer's instructions using ACTB or GAPDH as the reference gene.
Web Site.A dedicated Web site was prepared for data analysis, enabling visual inspection of the assemblies and of their genome map coordinates as well as searches using keywords, Gene Ontology classification, or chromosomal location of transcripts detected in head, neck, or thyroid tumor types under study.The site is publicly available at http:// verjo19.iq.usp.br/java/jsp/head_neck/.

Results and Discussion
Assembly and Mapping the Head, Neck, and Thyroid EST Dataset.A set of 213,636 EST sequences derived from head, neck and thyroid tumors was analyzed here, being part of the over 1.2 million ESTs that were generated by the HCGP (7).After filtering for contaminants, the remaining 173,640 ESTs (81%) were assembled together with all HCGP sequences as described under Materials and Methods.This assembly resulted in 32,576 contigs and 24,434 singlets that were aligned to the draft sequence of the human genome.All mapped sequences that exhibited genomic overlap were included in the same genomic cluster, thus reducing the data set to a nonredundant collection of 48,001 sequences (contigs plus singlets) grouped into 20,348 genomic clusters (Table 1).This corresponds to 38% of the total genomic groups formed by all ORESTES (20,348/53,105), and includes a number of clusters containing reads exclusively derived from head and neck (2,712) or thyroid (1,331) tissues.
Patterns of Gene Expression and Tissue Markers.A total of 10,228 distinct RefSeq genes were sampled by at least one EST derived from head and neck or thyroid, as determined by merging all nonoverlapping EST clusters that mapped within the same RefSeq gene sequence.Different amounts of ESTs were generated from each head and neck or thyroid tissue using different sets of primers, which preclude a direct assessment of genes differentially expressed at a distinct topology.Aiming to normalize these experimental factors, we applied a Bayesian statistical model (12) to select only EST counts that were detected within a confidence interval of 95% (see Materials and Methods).When this statistical model was applied, ESTs representing exons of 238 RefSeq genes were selected.The analysis suggested larynx/oral cavity as the most similar pair of tissues in terms of gene expression, followed by oral cavity/ pharynx and larynx/pharynx (Fig. 1A).The same result was obtained when EST clusters that mapped to genomic regions with no RefSeq genes were included in the analysis (not shown).In addition, the trio larynx/oral cavity/pharynx contains more genes in common than trios that include thyroid (Fig. 1B).The 238 RefSeq genes that were identified as being differentially expressed in different topologies may be used as putative markers for identifying the tissue origin of circulating tumor cells, as well as for distinguishing primary tumor from metastatic lesions, a subject that deserves further evaluation.
This analysis revealed genes preferentially expressed in each head and neck site or in thyroid with statistical criteria, being a source of tissue-specific candidate markers for HNSCC or thyroid tumors that may be selected for further validation (list of genes is available as Supplementary Table 2).
New Genes and Noncoding RNAs.Out of the total of 20,348 genomic clusters containing sequences from head and neck or thyroid, a significant fraction (25%, i.e., 5,038 clusters) represents possible new human transcripts because they have no significant similarities to previously known human genes (see Materials and Methods).Among these, a small fraction has evidence of splicing (16%, i.e., 835) and only 507 sequences (10%) have a normalized ESTscan score (13) >1, which is indicative of high protein coding potential.Nevertheless, none of them contain known protein motifs when compared with the PFAM data set.Among these new coding transcript fragments, 254 were found to map to intronic regions of RefSeq genes and may represent new coding exons (see Supplementary Table 3).The remaining 253 map to intergenic regions and may represent fragments of new human genes (see Supplementary Table 4).
The remaining 4,531 clusters that represent new transcripts have a normalized ESTscan score lower than 1; that is, they have a low coding potential.Remarkably, a large fraction of these (50%, i.e., 2,251 clusters) map to intronic regions of RefSeq genes (Supplementary Table 5).Whereas some of these may reflect a certain degree of genomic DNA contamination that might have persisted even after DNase treatment and polyA + mRNA selection, it is likely that most of these ESTs represent bona fide mRNA transcripts.In fact, recent data showed that similar fractions of intronic and exonic transcripts are expressed in 47 samples of tumor and normal prostate, as detected by cDNA microarrays constructed with ORESTES clones (14).
Recent work describing the transcriptional output of the human genome points to the existence of a significant number of noncoding RNA transcripts (15,16) and that 10% to 20% of human transcripts might form sense-antisense pairs (17,18).Furthermore, comparable fractions of transcriptional activity were detected within exons or introns of annotated genes, and nearly half of these intronic transcripts were expressed antisense to their respective well-characterized introns (16,19).Although the role of these antisense intronic noncoding messages in RNA regulation is not yet fully understood (20), it has been recently shown that intronic noncoding transcripts expressed antisense to their respective introns have their expression level significantly correlated to the degree of tumor differentiation in prostate cancer (14).Identification in the present work of a large set of intronic noncoding transcripts expressed in head-neck and thyroid suggests that noncoding intronic transcripts may also play an important role in HNSCC and thyroid tumors.Significantly, the majority of the new transcripts containing sequences from head and neck or thyroid (2,832 of 5,038, i.e., 56%) map to genomic regions previously associated to HNSCC or thyroid tumors (21).We found 12 putative new human transcripts mapping to a 2.37-Mb sequence from human chromosome 22q11-12, a region selected for its involvement with larynx tumors (11).These include three EST clusters containing head, neck, and thyroid sequences.All three had their expression confirmed in larynx by RT-PCR and were further verified by semiquantitative RT-PCR in 45 different mRNA samples obtained from pairs of nontumor/tumor tissues (see Materials and Methods).Positive amplification was obtained in 42 of 45 (LOC91353), 29 of 45 (LOC164615), and 21 of 45 (LOC91355) of the evaluated samples.Although no evidence for differential expression between nontumor and tumor samples was observed for this small set of genes, these results confirm the potential of this approach to identify new human genes.
New Splicing Isoforms.Genomic alignments of all EST clusters matching known human genes were visually inspected to evaluate the presence of putative new splicing isoforms.New events involving at least 10 nucleotides of deletions or insertions in previously known genes were considered as putative new splicing isoforms and were classified according to Wang et al. (10).Presence of conserved donor/acceptor sites as well as eventual confirmation of splicing events by other human ESTs or nonhuman transcripts in the public databases were recorded.A total of 788 new splicing events were identified in 748 different transcripts (see Supplementary Table 6).
Among the new splicing events observed here, 86% and 89% showed the canonical GT/AG donor and acceptor sites, respectively.The majority (75.3%) of new splicings was due to insertion events (Table 2).Overall, the most frequent events were exon extensions at 5V (type II) and insertions at 3V (type III).New splicing events could be confirmed by other human ESTs from GenBank in 28.6% of the cases and by sequences from other organisms in 6%.
Many cases of splicing events affecting protein domains were observed, such as in SMARCB1, PFDN5, AF5Q31, and EWSR1.SMARCB1 is involved in chromatin remodeling and is mutated in a variety of human tumors of diverse tissue origins, including some very aggressive pediatric cancers (22,23).The novel exon of SMARCB1 identified here promotes an in-frame addition of 18 amino acids to the protein, with an unpredicted effect over its function.In PFDN5, the newly identified exon adds 101 amino acid residues (with putative phosphorylation sites and an amidation site) to the protein.Interestingly, isoforms of PFDN5 transcripts were detected in Northern blots by Mori et al. (24) who also showed that this heterohexameric chaperone protein binds to c-Myc, repressing its transcriptional activity.The 101-amino-acid insertion may affect PFDN5 binding capacity and interfere with its ability to repress the transcriptional activity of c-Myc.
Because the ORESTES database was shown to be enriched with rare messages (7), splicing variants described here may represent rare new splicing isoforms related to these tumors.Whereas some false-negative or positive splicing events may have occurred among the 788 new splicing events described, our validation rate (68%) suggests that true events are the majority.Indeed, the set of human ESTs available in public databases has recently been shown to provide potential markers for the classification of cancer (25).The new splicing isoforms reported here contribute to this notion and warrant further research aimed at the identification of tumorspecific transcriptional events.
Finding Tumor Markers Using Transcript Analysis.The complete HCGP data set is composed of a 9.8-fold excess of tumor-derived sequences.Despite this, a set of 8,988 genomic clusters exclusively from nontumor tissues sampled in the project was found.This set probably is enriched in genes down-regulated during tumor development and progression.Within this set, 270 genomic clusters contained sequences derived from normal head, neck, or thyroid tissues (the complete list can be searched at the project Web site).Twenty-three of these contain sequences from normal head and neck tissues and are of particular interest because they mapped to regions known to be frequently deleted in HNSCC (refs.21, 26-28; Supplementary Table 7).Although we found several clusters composed exclusively from sequences derived from normal thyroid, only one of these (R2_CLUSTER_36019) mapped to a region (17p11) known to be deleted in thyroid tumors (21,29,30).Several of the genes included in this subset have antiproliferative, apoptotic, or differentiation-induction activities: CASP10 is involved in the proximal pathway of Fas-mediated apoptosis and is down-regulated in lung cancer, neuroblastoma, and non-Hodgkin lymphomas (31)(32)(33).Two other genes are related to apoptosis: CSEN, a member of the neuronal calcium sensor family (34), and MAPK8, which is important for the induction of apoptosis following stress (35).RGS6 and HHEX are involved in neuronal or thyroid differentiation, respectively (36,37), and APRIN is a putative regulator of cell proliferation (38).Altogether, these observations strongly suggest that genes with a still unknown tumor suppressor activity in HNSCC or thyroid cancer will be found within this selected data set of loss of heterozygosity candidates.
Conversely, clusters exclusively derived from HNSCC or thyroid tumor tissues suggest oncogenes from genomic regions amplified in tumors.A set of 3,776 such clusters were identified (complete lists can be searched at the project Web site), including 144 that mapped to regions previously described as frequently amplified in HNSCC (see Supplementary Table 8) and therefore are the best candidates for further validation.In fact, 12 of these clusters are transcripts from known oncogenes, whereas another subset of about 30% of these clusters are from genes of unknown functions.Genes from this list may include potential antigens that could be used in immunodiagnosis or immunotherapeutic approaches, as well as tumor markers.However, due to the biased amount of sequences from tumor samples in our project, the list of tumor-only EST clusters identified here should be analyzed with caution.
Three transcripts were further investigated for differential expression in HNSCC by quantitative real-time PCR in paired patient samples (eight oral cavity tumors and nine larynx tumors).Selected genes mapped to genomic regions that were frequently amplified (ZRF1 and NDRG1) or frequently lost (RAP140) in HNSCC.ZRF1 mRNA levels were increased in most samples of   3A and B).This is in agreement with a previous study that showed an increase in chromosome 7 and ZRF1 copy numbers in HNSCC (39).RAP140 exhibited a decreased expression in half of the patient samples of larynx tumors (1.6-to 8.1-fold change, 4 of 8 samples, i.e., 50%; Fig. 3C).No information is available in the literature on RAP140 expression or function in tumors.NDRG1 mRNA levels were increased in half of the oral cavity tumor samples (1.7-to 3.5-fold change, 5 of 10 samples, i.e., 50%; Fig. 3D).NDRG1 has been referred to be either under-or overexpressed in a variety of cancers, including lung, brain, melanoma, liver, prostate, breast, and renal cancers and also in colon adenomas and adenocarcinomas (40)(41)(42).The variability observed in the expression of cancer-related genes in individual patient samples probably reflects the biological heterogeneity of HNSCC (43).Nevertheless, the 50% to 75% validation rate achieved in the present work indicates that an informative list of possible candidate markers was generated in this transcriptome analysis, for which further experimental validation is warranted.
In conclusion, two large projects have engaged in extensive EST sequencing of human tumor tissues: the Cancer Genome Anatomy Project led by the National Cancer Institute (44) and the HCGP led by Fundac ¸a ˜o de Amparo a `Pesquisa do Estado de Sa ˜o Paulo/Ludwig Institute for Cancer Research (7), providing a very large diseaseoriented transcriptional database that contains an unprecedented amount of information on expressed human genes.We show here a detailed analysis of a subset of these data, the head, neck, and thyroid sequences, documenting its utility in the identification of new human genes, candidate tumor suppressors/oncogenes, and highlighting the structure of transcript variants.Although the EST data have been widely used by the scientific community, we show that a wealth of additional information remains unexplored and may be used to further the understanding of cancer biology.

Figure 1 .
Figure 1.Transcriptome sharing between HNSCC and thyroid.238 RefSeq genes detected with 95% confidence in the EST data set were scored (see Materials and Methods).Fractions of genes shared by each pair (A ) or trio (B) of topologies are shown.la, larynx; oc, oral cavity; pha, pharynx; thy, thyroid.

Figure 2 .
Figure 2. Experimental validation of new splicing isoforms using RT-PCR.A, type I deletion.B, type I insertion.C, type II insertion.D, type III insertion.Names of the genes are at the top of each panel.Arrows, primers; solid boxes, known exons with their corresponding exon numbers; open boxes, new exons.Horizontal lines, introns.Diagonal lines, alternative splicing events.For all gels, lane L corresponds to the molecular weight marker ladder (100 bp).The sources of template RNA samples used in the PCR reactions are indicated above the gel lanes.They are either replicates of cell line samples from Hep-2 (CCL-23, American Type Culture Collection, Manassas, VA), or normal (N) and tumor (T ) tissue from Larynx , Tonsil , or Tongue , as indicated in each panel.All PCR products were confirmed by sequencing.Predicted band sizes (in base pairs) of PCR products amplified from Known and New mRNA splicing isoforms are indicated next to each diagram, and those predicted from genomic DNA are 2,618 bp for VAV2 , 15,007 bp for PTK2 , 1,560 bp for TRRAP , and 5,197 bp for AF5Q31 .

Figure 3 .
Figure 3. Expression of putative tumor markers of HNSCC.Real-time quantitative PCR was used for the measurement of mRNA levels of candidate genes in paired samples of tumor and adjacent normal tissue from larynx (9 patients) or oral cavity (10 patients).Expression of ZRF1 (A and B ), RAP140 (C ), and NDRG1 (D ) in each individual tumor sample reflects the fold-change increase (black columns ) or decrease (gray columns ) relative to the average expression level of each gene in the normal samples.Bars, SE from two replicate experiments.
and E.P.B. Ojopi contributed equally to this work.Supplementary data for this article are available at Cancer Research Online (http:// cancerres.aacrjournals.org/).

Table 1 .
Genomic mapping of all ORESTES sequences and of head and neck or thyroid assembled sequences

Table 2 .
(10)rnative splicing statistics New splicing isoforms were classified according to Wang et al.(10).Insertions were divided into types I to IV: insertion of a new exon, 5V extension, 3V extension, and junction of two exons, respectively.Deletions were divided into types I to IV: exon skipping, 5V deletion, 3V deletion, and deletion of a central portion of the exon, respectively.