Protein Isoforms: Functional Importance and Tissue Specificity

  1. Rodríguez Carrasco, José Manuel
Supervised by:
  1. Michael Tress Director

Defence university: Universidad Autónoma de Madrid

Defence date: 14 December 2020

Committee:
  1. Fátima Sánchez Cabo Chair
  2. Enrique Carrillo de Santa Pau Secretary
  3. Gorka Prieto Agujeta Committee member
  4. Javier Herrero Sánchez Committee member
  5. Ana María Rojas Mendoza Committee member

Type: Thesis

Teseo: 644932

Abstract

The number of protein-coding genes in the human reference gene sets has stabilized at slightly more than 20,000 in recent years, principally as a result of painstaking manual curation efforts. Although the three main gene sets, Ensembl/GENCODE, RefSeq, and UniProtKB, have similar numbers of genes, it is not clear how many of these genes coincide between the three sets. Many researchers were surprised by the relatively low number of human coding genes, and some have sought other explanations for an assumed human complexity, such as alternative splicing. The alternative splicing of messenger ribonucleic acid (RNA) is a fundamental molecular process that regulates eukaryotic gene expression and can generate a wide range of mature RNA transcripts. Many thousands of alternatively spliced transcripts are routinely detected in RNA-seq studies, although reliable large-scale mass spectrometry-based proteomics analyses identify only a small fraction of annotated alternative isoforms. Indeed, proteomics experiments strongly suggest that most genes have a single main protein isoform.

In this thesis, we present three papers on the functional description of coding genes, and of the principal and alternative protein isoforms derived from alternative splicing. In the first publication, we present the updates to the APPRIS Database. APPRIS selects a single protein isoform, the principal isoform, as the reference for each gene based on protein structural and functional features and information from cross-species conservation. Experimental evidence shows that the APPRIS principal isoform almost always coincides with the main cellular protein isoform. In the paper we detail the expansion of gene sets for multiple species, refinements in the core methods that make up the annotation pipeline, and the merging of the individual Ensembl/GENCODE, RefSeq, and UniProtKB reference gene sets. APPRIS now provides a measure of reliability for individual principal isoforms and is updated with each release of the reference sets.

In the second paper, we analyse human protein-coding genes in the three main reference sets: Ensembl/GENCODE, RefSeq, and UniProtKB. We find that one in eight of these genes is classified differently in at least one of the reference sets. Evidence from various sources suggests that many of the 22,210 genes in the union of the three sets are unlikely to code for functional proteins.

In the final publication, we carried out a reanalysis of a large-scale proteomics study of human tissues in order to determine to what extent tissue-specific alternative splicing can be detected at the protein level. We found evidence of significant tissue-specific differences across more than a third of the splice events that we interrogated. Tissue-specific alternative protein forms were particularly abundant in nervous and muscle tissues. By contrasting the proteomics evidence with data from a large-scale transcriptomics analysis, we found that more than 95% of the tissue-specific events in which proteomics and RNA-seq analyses agree on tissue specificity evolved over 400 million years ago. Our results suggest that tissue-specific alternative splicing has played a crucial role in the development of the brain and the heart in vertebrates.