The aggressiveness of a tumor depends on its genomic profile. The overall survival (OS) of cancer patients should therefore also depend on that profile, in particular on the number and nature of mutations and on the level of gene activity. In this work, we attempt to predict overall survival from the genomic profile of the tumor, using both primary DNA data and RNA activity. One objective of the study is to compare which of these two data types better predicts overall survival. DNA and gene expression data were taken from the pan-cancer TCGA database (33 cancer types) and split into two datasets: DNA only and expression only. In the DNA data, we retained only pathogenic and likely pathogenic variants; the 1806 genes carrying these mutations were used as features. In the expression data, we selected only genes that belong to cancer-related pathways in the KEGG database (1821 genes). The prediction target for both datasets was 3-year OS: a patient who survived beyond three years was treated as a positive example, and any other patient as a negative example. The DNA dataset contained 2159 positive and 1687 negative examples; the expression dataset contained 3363 positive and 2212 negative examples. The machine learning algorithms were implemented in Python 3. To determine the significance of the features, we used Lasso linear regression with 5-fold cross-validation. The result was a list of genes ordered by decreasing importance for the outcome. In the DNA dataset, the algorithm selected 64 significant genes, each with a sign (plus or minus) indicating whether it pushes toward a positive or a negative outcome, and a coefficient indicating the relative strength of that influence. For example, age 81-90 and EGFR mutations were at the negative end of the scale, while stage I and HRAS mutations were at the positive end.
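The 3-year OS labelling rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `label_three_year_os`, the use of days with 365-day years, and the treatment of patients censored before three years (returned as `None` and assumed to be excluded) are all assumptions, since the abstract does not say how such cases were handled.

```python
from typing import Optional

# Assumption: OS is measured in days and a year is counted as 365 days.
THREE_YEARS_DAYS = 3 * 365


def label_three_year_os(os_days: float, died: bool) -> Optional[int]:
    """Return 1 if the patient passed the 3-year mark, 0 if they died before it.

    Assumption (not stated in the abstract): a patient whose follow-up ended
    before 3 years without a recorded death has an unknown outcome, so the
    function returns None and the caller is expected to drop that patient.
    """
    if os_days > THREE_YEARS_DAYS:
        return 1   # crossed the three-year line: positive example
    if died:
        return 0   # death within three years: negative example
    return None    # censored early: outcome unknown


# Hypothetical patients: (overall survival in days, death observed?)
patients = [(1500, False), (400, True), (200, False), (1200, True)]
labels = [label_three_year_os(days, died) for days, died in patients]
print(labels)  # [1, 0, None, 1]
```

The last patient died, but only after the three-year mark, so under this rule the example is still positive.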
In the RNA dataset, the algorithm selected 75 such important genes. At the negative end of the scale were age 81-90 and changes in CDK6 expression; at the positive end were stage I and changes in RPS6 expression. Only 11 significant features were shared between the two datasets. To predict the outcome, we used logistic regression with 5-fold cross-validation. Classification quality was assessed with receiver operating characteristic (ROC) curves, which reflect sensitivity and specificity, summarized by the area under the curve (AUC). For the DNA dataset, the mean ROC-AUC across the 5 folds was 0.72 (0.64-0.77); for the RNA dataset it was 0.74 (0.69-0.77). Predicting overall survival is essential for planning treatment strategies and for selecting patients for clinical trials. The classification quality achieved suggests that this approach merits further development. Further tuning of the algorithms may make more accurate prediction possible, and combinations of the different input data types should be tested. The list of important genes may help in identifying molecular targets for drug discovery.
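The two modelling steps described above (Lasso with 5-fold cross-validation to rank features, then logistic regression with 5-fold cross-validation scored by ROC-AUC) can be sketched with scikit-learn as below. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the abstract names only Python 3, so the choice of scikit-learn, the helper names `rank_features_lasso` and `cv_auc_logistic`, the stratified shuffled folds, the synthetic data, and the `gene_i` feature names are all hypothetical. Summarising the folds as mean with minimum and maximum is likewise an assumption about what the reported range represents.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LassoCV, LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def rank_features_lasso(X, y, names):
    """Fit Lasso with 5-fold CV; return (name, coef) for non-zero coefficients.

    Pairs are sorted by decreasing absolute coefficient. A positive sign
    pushes toward the positive class (label 1), a negative sign toward 0.
    """
    model = LassoCV(cv=5, random_state=0).fit(X, y)
    kept = [(n, float(c)) for n, c in zip(names, model.coef_) if c != 0.0]
    return sorted(kept, key=lambda item: abs(item[1]), reverse=True)


def cv_auc_logistic(X, y):
    """Return the ROC-AUC of logistic regression on each of 5 CV folds."""
    # Assumption: stratified, shuffled folds; the abstract says only "5-fold".
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")


# Synthetic stand-in for a patients-by-genes matrix (NOT TCGA data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
names = [f"gene_{i}" for i in range(X.shape[1])]  # hypothetical feature names

ranking = rank_features_lasso(X, y, names)
aucs = cv_auc_logistic(X, y)

print(ranking[:5])
print(f"ROC-AUC {aucs.mean():.2f} ({aucs.min():.2f}-{aucs.max():.2f})")
```

Fitting Lasso on a 0/1 label treats classification as regression, which matches the "Lasso linear regression" wording of the abstract; an L1-penalised logistic regression would be a different technique.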
Citation Format: Dmitrii K. Chebanov, Nadezhda S. Tatevosova, Irina N. Mikhaylova. Machine learning for predicting overall survival using whole exome DNA and gene expression data and analyzing the significance of features [abstract]. In: Proceedings of the AACR Virtual Special Conference on Artificial Intelligence, Diagnosis, and Imaging; 2021 Jan 13-14. Philadelphia (PA): AACR; Clin Cancer Res 2021;27(5_Suppl):Abstract nr PO-045.