Introduction: Because high-throughput sequencing has become inexpensive, massive amounts of biological data are being accumulated. The Sequence Read Archive (SRA), a bioinformatics database that hosts raw sequencing data, is a valuable resource for studying normal human variation and disease. However, these data often lack phenotype information, which severely limits their utility for addressing targeted biological questions. By precisely annotating brain tissue phenotypes and building on uniformly processed gene expression data from heterogeneous studies, we have lowered the barriers that keep neuroscientists from studying transcriptomic changes in neurological diseases of interest.
Methods: We downloaded metadata for SRA studies predicted to contain brain tissue (using the phenopredict R package) and obtained the corresponding journal publications and supplementary materials. Sample metadata tables for all studies were created manually using a common set of biological variables chosen to be most useful to investigators. Tissue attributes reported in the publication text but absent from the original metadata were added, and a detailed reproducibility document was maintained for each study. Metadata were saved and assembled into a working database that downloads expression data directly from recount2.
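The curation step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the field names in COMMON_FIELDS, the harmonize function, the run and study identifiers, and the example values are all hypothetical. It shows per-study columns being mapped onto a common set of variables, with values taken from the publication text filling gaps and the source of each value recorded for reproducibility.

```python
# Hypothetical common schema; the abstract does not list the actual column names.
COMMON_FIELDS = ("age", "sex", "race", "brain_region", "disease")


def harmonize(study_id, raw_rows, column_map, curated=None):
    """Map one study's raw sample rows onto the common fields.

    raw_rows   : list of dicts, one per sequencing run, keyed by the
                 study's own column names (must include "run")
    column_map : {common_field: study_column_name}
    curated    : {run_id: {common_field: value}} for values found only
                 in the publication text or supplementary materials
    Returns a list of (record, provenance) pairs.
    """
    curated = curated or {}
    out = []
    for row in raw_rows:
        run_id = row["run"]
        record = {"study": study_id, "run": run_id}
        provenance = {}
        for name in COMMON_FIELDS:
            source_col = column_map.get(name)
            value = row.get(source_col) if source_col else None
            if value not in (None, ""):
                # Value was present in the deposited metadata.
                record[name] = value
                provenance[name] = "sra_metadata"
            elif name in curated.get(run_id, {}):
                # Value was recovered manually from the publication.
                record[name] = curated[run_id][name]
                provenance[name] = "publication"
            else:
                record[name] = None  # still unknown after curation
        out.append((record, provenance))
    return out


# Made-up example rows for illustration only.
raw = [
    {"run": "RUN0001", "Sex": "F", "tissue": "frontal cortex", "age_years": "54"},
    {"run": "RUN0002", "Sex": "", "tissue": "cerebellum", "age_years": "61"},
]
column_map = {"sex": "Sex", "brain_region": "tissue", "age": "age_years"}
curated = {"RUN0002": {"sex": "M", "disease": "glioblastoma"}}

harmonized = harmonize("STUDY01", raw, column_map, curated)
```

Keeping a provenance entry alongside each record is one simple way to mirror the per-study reproducibility document: every value can be traced back to either the deposited metadata or the publication.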
Results: Drawing on expression data from recount2 (a project that cost $140,000 to uniformly process sequence data), our database hosts expression data for >2,000 human brain tissue samples from >100 published SRA projects. It covers 18 neurological diseases (with a heavy focus on gliomas classified by cell type, grade, and location), 8 unique brain regions, and 5 developmental stages ranging from fetus to adult, as well as demographic data and technical sequencing information.
Conclusions: Too few neuroscientists take advantage of sequencing reads deposited by other laboratories. We contend that overlooking deposited data is akin to ignoring several independently published replication experiments. Our curated database reuses public data, enhances reproducibility among neuroscientists, and enables translational discovery.
Patient Care: By studying the transcriptomic changes in neurological diseases, we can identify potential molecular markers for diagnostic techniques and targeted therapy that will allow meaningful intervention in patients’ lives.
Learning Objectives: Massive amounts of sequencing data are being accumulated and stored in repositories accessible to the public.
Unlabeled or unannotated variables limit the ability of researchers to analyze these data.
Deposited data that go unused are wasted data.
Appropriately annotating sequenced data will enhance reproducibility among researchers.
Our database will be made publicly available to allow neuroscientists to study transcriptomic changes in neurological diseases.
Neuroscience researchers will be able to search and filter by age, sex, race, brain region, disease, clinical stage, and neuropathologic grade.
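The search-and-filter capability named in the last objective could look roughly like the sketch below. It is an assumption-laden illustration only: the abstract does not describe the database interface, so filter_samples, matches, the field names, and the sample records are hypothetical.

```python
def matches(value, wanted):
    """Check one field against one criterion.

    wanted may be a single value, a collection of allowed values,
    or a function that returns True for acceptable values.
    """
    if callable(wanted):
        return value is not None and wanted(value)
    if isinstance(wanted, (set, frozenset, list, tuple)):
        return value in wanted
    return value == wanted


def filter_samples(samples, **criteria):
    """Return the samples that satisfy every criterion."""
    return [
        s for s in samples
        if all(matches(s.get(field), wanted) for field, wanted in criteria.items())
    ]


# Made-up sample records for illustration only.
samples = [
    {"run": "RUN0001", "age": 54, "sex": "F", "brain_region": "frontal cortex",
     "disease": "glioblastoma", "grade": "IV"},
    {"run": "RUN0002", "age": 8, "sex": "M", "brain_region": "cerebellum",
     "disease": "medulloblastoma", "grade": "IV"},
    {"run": "RUN0003", "age": 61, "sex": "F", "brain_region": "temporal lobe",
     "disease": "astrocytoma", "grade": "II"},
]

# Adult female samples, then samples with a high neuropathologic grade.
adult_female = filter_samples(samples, sex="F", age=lambda a: a >= 18)
high_grade = filter_samples(samples, grade={"III", "IV"})
```

Samples with a missing value for a filtered field are excluded rather than matched, which is the conservative choice when annotations are incomplete.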
References: Collado-Torres L, Nellore A, Kammers K, Ellis S, Taub M, Hansen K, Jaffe A, Langmead B, Leek JT. Reproducible RNA-seq analysis using recount2. Nat Biotechnol. 2017;35:319-321.
Denk F. Don't let useful data go to waste. Nature. 2017;543:7.
Ellis S, Collado-Torres L, Leek JT. Improving the value of public RNA-seq expression data by phenotype prediction. bioRxiv. 2017;145656.