1. What is exoRBase?
exoRBase is a repository of circular RNA (circRNA), long non-coding RNA (lncRNA) and messenger RNA (mRNA) derived from RNA-seq data analysis on human blood exosomes. exoRBase features integration and visualization of RNA expression profiles based on normalized RNA-seq data spanning normal individuals and patients with different diseases.
2. How were the collected RNA-seq datasets processed?
All collected datasets were processed through a consistent bioinformatics pipeline. The raw sequencing reads were first trimmed with Trimmomatic to remove adapters and low-quality bases. The trimmed reads were then mapped to the human reference genome (GRCh38) with HISAT2. featureCounts was used to count the reads aligned to protein-coding genes (mRNAs) or long non-coding RNAs (lncRNAs). Read counts for each gene were normalized to TPM (Transcripts Per Million) values. Annotations of protein-coding and long non-coding RNAs were retrieved from GENCODE v25. Additionally, circular RNAs were identified and quantified with ACFS and find_circ, and only circular RNAs detected by both tools were considered in our analysis. To account for differing library sizes, the read counts of each circular RNA were normalized to RPM (Reads Per Million mapped reads) values. The following flowchart shows the data processing and overall design of exoRBase.
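The two normalizations and the two-tool filter described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the exoRBase pipeline code: the function names are hypothetical, TPM is computed from its standard definition (which needs a per-gene length that the text above does not specify), and circular RNAs are assumed to be matched between ACFS and find_circ by a shared identifier such as their back-splice coordinates.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts to TPM.

    counts and lengths_bp are dicts keyed by gene ID; lengths are in base
    pairs (assumption: the source does not say which gene length is used).
    """
    # Reads per kilobase of gene length.
    rpk = {g: counts[g] / (lengths_bp[g] / 1000.0) for g in counts}
    scale = sum(rpk.values()) / 1e6
    if scale == 0:
        return {g: 0.0 for g in counts}
    # Rescale so that the values of one sample sum to one million.
    return {g: value / scale for g, value in rpk.items()}


def rpm(counts, total_mapped_reads):
    """Convert circRNA read counts to reads per million mapped reads."""
    return {c: n * 1e6 / total_mapped_reads for c, n in counts.items()}


def detected_by_both(acfs_ids, find_circ_ids):
    """Keep only circRNA IDs reported by both tools (hypothetical ID scheme)."""
    return set(acfs_ids) & set(find_circ_ids)


if __name__ == "__main__":
    gene_tpm = tpm({"geneA": 10, "geneB": 20}, {"geneA": 1000, "geneB": 2000})
    print(gene_tpm)
    print(rpm({"circ1": 5}, total_mapped_reads=1_000_000))
    print(detected_by_both(["circ1", "circ2"], ["circ2", "circ3"]))
```

Because TPM divides by gene length before rescaling, two genes with the same read density receive the same TPM regardless of their raw counts, whereas RPM corrects for library size only.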
3. How were tissue specificity scores for lncRNA and mRNA calculated?
To assess the tissue specificity of protein-coding and long non-coding RNAs across human tissues, we downloaded the expression atlas from GTEx (Genotype-Tissue Expression project), which profiled gene expression across 30 human tissues. Genes that were lowly expressed (<0.1 RPKM) in all tissues were removed. We defined the tissue specificity score as the difference between the logarithm of the total number of tissues and the Shannon entropy of the expression values for a gene. The score was calculated as follows:
$$\mathrm{Score} = \log N - H = \log N + \sum_{i=1}^{N} p_i \log p_i$$
where p_i stands for the relative frequency, which was computed as $p_i = X_i / \sum_{j=1}^{N} X_j$, X_i is the expression level of the gene in tissue i, and N is the total number of tissues.
For each gene, one specificity score and 30 frequency scores (p_i) were calculated. A gene was defined as tissue-specific when its largest frequency score was more than double the second largest one and its specificity score was at least 1.
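As a rough illustration of the rule above, the following Python sketch computes the relative frequencies, the entropy-based score, and the tissue-specific call for one gene. The function names are hypothetical, and the base-2 logarithm is an assumption, since the base is not stated in the text; a different base would rescale the score and change what a threshold of 1 means.

```python
import math


def is_expressed(expr, min_rpkm=0.1):
    """Return False for genes below min_rpkm in all tissues (filtered out)."""
    return any(x >= min_rpkm for x in expr)


def tissue_specificity(expr):
    """Return (score, frequencies) for one gene.

    expr holds the gene's expression level in each tissue.
    Assumption: logarithms are base 2.
    """
    total = sum(expr)
    if total <= 0:
        raise ValueError("gene has no expression in any tissue")
    p = [x / total for x in expr]
    # Shannon entropy; terms with p_i == 0 contribute nothing.
    entropy = -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)
    score = math.log2(len(expr)) - entropy
    return score, p


def is_tissue_specific(expr, min_score=1.0):
    """Apply the two criteria: dominant top frequency and score threshold."""
    score, p = tissue_specificity(expr)
    top, second = sorted(p, reverse=True)[:2]
    return top > 2 * second and score >= min_score


if __name__ == "__main__":
    one_tissue = [100.0] + [0.0] * 29   # expressed in a single tissue
    uniform = [1.0] * 30                # equally expressed everywhere
    print(tissue_specificity(one_tissue)[0], is_tissue_specific(one_tissue))
    print(tissue_specificity(uniform)[0], is_tissue_specific(uniform))
```

Under this assumption, a gene expressed in only one of the 30 tissues reaches the maximum score of log2(30), while a uniformly expressed gene scores 0.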
4. How were circular RNAs assigned to their linear RNAs?
Circular RNAs are transcribed from intragenic or intergenic loci. A circular RNA was assigned to a gene when it was transcribed from a region within that gene, or when that gene was the one located nearest to the circular RNA.
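The assignment rule can be sketched as a simple interval lookup. This is a minimal sketch under stated assumptions rather than the actual exoRBase implementation: the function name and the tuple layouts are hypothetical, strand is ignored, "nearest" is taken to mean the smallest gap between intervals on the same chromosome, and ties are resolved by input order.

```python
def assign_gene(circ, genes):
    """Assign a circRNA to a gene.

    circ:  (chrom, start, end)
    genes: list of (name, chrom, start, end)
    Returns the gene name, or None if no gene lies on the same chromosome.
    """
    chrom, start, end = circ
    candidates = [g for g in genes if g[1] == chrom]
    if not candidates:
        return None

    # Case 1: the circRNA lies entirely within a gene.
    containing = [g for g in candidates if g[2] <= start and end <= g[3]]
    if containing:
        return containing[0][0]

    # Case 2: otherwise take the gene with the smallest gap to the circRNA
    # (a gap of 0 means the two intervals overlap).
    def gap(gene):
        return max(0, gene[2] - end, start - gene[3])

    return min(candidates, key=gap)[0]


if __name__ == "__main__":
    genes = [("GENE_A", "chr1", 100, 500), ("GENE_B", "chr1", 1000, 2000)]
    print(assign_gene(("chr1", 150, 300), genes))  # within GENE_A
    print(assign_gene(("chr1", 800, 900), genes))  # intergenic, nearer GENE_B
```

For genome-scale annotations, a sorted index or an interval tree would avoid scanning every gene for each circular RNA; the linear scan here is only meant to make the rule explicit.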
5. What do the abbreviations of disease names stand for?
'NP' stands for Normal Person, 'CHD' stands for Coronary Heart Disease, 'CRC' stands for Colorectal Cancer, 'HCC' stands for Hepatocellular Carcinoma, 'PAAD' stands for Pancreatic Adenocarcinoma, 'BC' stands for Breast Cancer, and 'WhB' stands for Whole Blood.
6. Were the expression values further processed in heatmaps?
Yes. The expression values were further processed before the heatmaps were drawn, as follows:
In the formula, the output is the processed expression value used in the heatmap for RNA i, and the input is the raw TPM value for RNA i.