Extended annotation for the integrated database of human transcriptome, H-InvDB
Chisato Yamasaki (1,2), Katsuhiko Murakami (1,2), Yasuyuki Fujii (1,2), Yoshiharu Sato (1,2), Junichi Takeda (1,2), Erimi Harada (1,2), Ryuichi Sakate (1,2), Takayuki Taniya (1,2), Shingo Kikugawa (1,2), Teruyoshi Hishiki (2), Tadashi Imanishi (2), Takashi Gojobori (2,3)
(1) Japan Biological Information Research Center (JBIRC), Japan Biological Informatics Consortium (JBiC), 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan; (2) National Institute of Advanced Industrial Science and Technology (AIST), 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan; (3) Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics (NIG), 1111 Yata, Mishima, Shizuoka 411-8540, Japan
Here we present the latest annotation for the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/), an integrated database of the human transcriptome based on extensive annotation of human full-length cDNA (FLcDNA) clones. The latest release of H-InvDB (release 4.0, in preparation) includes annotation of 175,542 human mRNAs and FLcDNAs extracted from the public DNA databanks. We mapped these transcripts onto the human genome sequence (NCBI build 36.1 assembly) and identified 34,701 human gene clusters, comprising 34,093 (98.2%) protein-coding and 608 (1.8%) non-protein-coding loci; 860 (2.5%) protein-coding loci overlapped with predicted pseudogenes. We performed a range of analyses to describe gene structures, alternative splicing isoforms, functional non-protein-coding RNAs, functional domains of proteins, subcellular localizations, metabolic pathways, predicted protein 3D structures, mapped SNPs and microsatellite repeat motifs, co-localization with orphan diseases, gene expression profiles, evolutionary features and protein-protein interactions. Furthermore, the extended sequence quality annotation identified 269 transcripts as readthrough candidates and 2,731 transcripts as candidate targets of nonsense-mediated decay (NMD). The annotated data in H-InvDB are presented in two main viewers, the Transcript view and the Locus view, and in seven sub-databases with web-based viewers: DiseaseInfo Viewer, H-ANGEL, Clustering Viewer, G-integra, TOPO Viewer, Evola and PPI view. The data are also provided as flat files, XML files and FASTA sequence files. The latest release also provides new annotation datasets for four selected gene families: T-cell receptors (TCR), human leukocyte antigen (HLA), immunoglobulin (Ig) and olfactory receptors (OR). We believe that H-InvDB will be useful for all areas of research related to human genes and transcripts.
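As a small worked check, the locus shares can be recomputed from the counts stated above. This is a minimal sketch, not part of the H-InvDB pipeline: the constant and function names are hypothetical, and the choice of all gene clusters as the denominator is an assumption, since the abstract does not state it.

```python
# Counts as stated in the abstract (release 4.0).
TOTAL_CLUSTERS = 34_701
PROTEIN_CODING = 34_093
NON_PROTEIN_CODING = 608
PSEUDOGENE_OVERLAP = 860  # protein-coding loci overlapping predicted pseudogenes


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


def locus_summary() -> dict:
    """Recompute locus shares; the denominators are assumptions (see comments)."""
    # The two locus classes should account for every gene cluster.
    assert PROTEIN_CODING + NON_PROTEIN_CODING == TOTAL_CLUSTERS
    return {
        # Assumed denominator: all gene clusters.
        "protein_coding_pct": share(PROTEIN_CODING, TOTAL_CLUSTERS),
        "non_protein_coding_pct": share(NON_PROTEIN_CODING, TOTAL_CLUSTERS),
        # Reported under both plausible denominators, since the text is ambiguous.
        "pseudogene_overlap_pct_of_all": share(PSEUDOGENE_OVERLAP, TOTAL_CLUSTERS),
        "pseudogene_overlap_pct_of_coding": share(PSEUDOGENE_OVERLAP, PROTEIN_CODING),
    }


if __name__ == "__main__":
    for name, value in locus_summary().items():
        print(f"{name}: {value}%")
```

Both candidate denominators for the pseudogene overlap round to the same one-decimal figure, so the ambiguity does not affect the reported value.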