Abstract
Gaining insights from medical and molecular omics data with advanced statistical learning methods requires collaboration between experts who know the contents and context of the data and experts with a computational and statistical background. The purpose of the workshop reported here was to derive the key components necessary to apply machine learning methods to non-public, real-world clinical, genomic and transcriptomic data from heterogeneous sources in a cross-institutional collaborative effort. Machine learning experts were invited to the premises of the institution that held a comprehensive collection of patient data. During the preparation phase, regular virtual meetings took place to discuss decisions regarding the research questions to analyze, the transformation of the raw file formats and the technical infrastructure. The interdisciplinary group of experts decided to focus on models that would assign patients to matching therapy classes based on their molecular variation profile. During the workshop phase, the experts spent several days at the same location working solely on the analysis in question. Here we present the lessons learned from conducting such an interdisciplinary workshop. The key components can be divided into a preparatory and a workshop phase. The first phase lays the foundation for the collaboration, including the formalities and the discussion of the data and the applicable methods. It is especially important to involve all domains in decisions regarding the data preprocessing steps. For example, a method might have particular requirements regarding the data dimensions; the data experts need to know these requirements in order to help reduce the degrees of freedom in a meaningful way, e.g. by aggregation or filtering. Sharing anonymized sample files that realistically represent the data format, together with sample scripts for testing the infrastructure, helps all parties set up the analysis pipelines efficiently.
In the second phase, the intense working atmosphere allowed for quick iterations of feature engineering, model training and result evaluation. For the common case that a dataset is not yet publicly sharable, we demonstrate that inviting analysts to the institution holding the data is a feasible option. Additionally, the focused atmosphere created by a time-limited workshop setting benefits the motivation of the collaborators.
Citation Format: Sophia Stahl-Toyota, Katrin Glocker, Analie Pascoe Perez, Alexander Knurr, Alexander Denker, Frank Ückert. Bringing advanced analytics experts to the data: Report of an interdisciplinary machine learning workshop [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27-28 and Jun 22-24. Philadelphia (PA): AACR; Cancer Res 2020;80(16 Suppl):Abstract nr 2087.