Documentation of TransCirc database

CircRNA:


Figure source:
https://en.wikipedia.org/wiki/Circular_RNA

Circular RNAs (circRNAs) are a class of abundant and conserved RNAs in animals and plants [1-3]. Most circRNAs are produced by a special type of alternative splicing known as back-splicing and are predominantly localized in the cytoplasm [3-5]. However, the general function of circRNAs in vivo is still an open question. Several circRNAs have known functions, including sequestration of miRNAs [6,7] or RNA binding proteins (RBPs) [8] (i.e., acting as competitors of the linear mRNAs), modulation of transcription, and interference with splicing [9,10]. Nevertheless, the function of the remaining circRNAs is uncharted territory and, until recently, had not benefited as much from advances in sequencing technology. Since in vitro synthesized circRNAs can be translated in a cap-independent fashion [11] and many protein-coding genes in higher eukaryotes can produce circRNAs through back-splicing of exons, it is highly possible that they function as mRNAs in vivo to direct protein synthesis. Recent studies indicated that some cytoplasmic circRNAs can be effectively translated into detectable peptides, and many short sequences, including m6A modification sites, have been suggested to function as IRES-like elements to drive circRNA translation [12-15]. Using various lines of direct and indirect evidence that support circRNA translation, we conducted an integrative analysis to predict the potential of all circRNAs to code for functional peptides. The results of this prediction and the supporting evidence are summarized in the TransCirc database.

Motivation of TransCirc database:

As circRNA translation is emerging as a new field, more and more researchers would benefit from a database that collects translatable circRNAs. We noticed that such a database was still lacking, and we hope to fill this gap to make circRNA translation studies more convenient.

Evidence for circRNA translation:

1. Ribosome/polysome profiling

The translation of mRNAs is carried out by ribosomes, which can form polysomes on actively translated mRNAs. Therefore, association with ribosomes/polysomes can serve as a strong predictor of a circRNA's translation potential. We used published ribosome footprinting data [16] and polysome profiling data [17] to score the association of circRNAs with ribosomes.

2. Translation initiation site (TIS)

A global mapping of TIS codons at nearly single-nucleotide resolution has been achieved by GTI-seq [18], revealing an unambiguous set of several thousand TIS codons across the entire human transcriptome. We used the data from TISdb, which is based on GTI-seq, as indirect evidence supporting the translation of circRNAs; these sites are also associated with potential ORFs.

3. IRES sequence

Since circRNAs are covalently closed molecules without free ends, the translation of circRNAs must use an unconventional mechanism known as cap-independent translation initiation. This initiation pathway has to be driven by IRESs (internal ribosomal entry sites), which are typically short RNA fragments with specialized secondary structures. Although most well-studied IRESs were found in viral RNAs, there are several cases where endogenous genes contain IRESs that drive translation in a cap-independent fashion. Recently, we and others conducted systematic screens for IRES elements in the human genome or from random sequences [19], and we used all the available IRES information as evidence to support circRNA translation.

4. m6A sites

N6-methyladenosine (m6A) is the most common modification of RNAs and has been found in many types of non-coding and coding RNAs. We have recently found that circRNAs undergo extensive m6A modification, which can drive circRNA translation by recruiting the reader protein YTHDF3, which interacts with translation initiation factors (most notably eIF4G2). Therefore, we used published m6A modification data from the REPIC database [20] (identified by three different tools) and mapped the sites back to circRNA sequences. The existence of experimentally validated m6A sites in a circRNA can also serve as a predictor of translatable circRNAs [17], and these sites are integrated in this database.
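
The idea of mapping genomic m6A sites back to circRNA sequences can be sketched as below. This is only an illustration, not the pipeline used to build the database: the function name, the 1-based inclusive exon intervals, and the strand handling are all assumptions made for the example.

```python
def genomic_to_circ_offset(pos, exons, strand="+"):
    """Map a genomic position to a 0-based offset in the spliced circRNA.

    Assumptions (hypothetical, for illustration only):
      - exons: list of (start, end) genomic intervals, 1-based and inclusive
      - offset 0 is the first transcribed base after the back-splice junction
    Returns None if the position falls outside every exon (e.g. in an intron).
    """
    ordered = sorted(exons)
    if strand == "-":
        # On the minus strand, transcription runs from high to low coordinates.
        ordered = ordered[::-1]
    offset = 0
    for start, end in ordered:
        if start <= pos <= end:
            within = pos - start if strand == "+" else end - pos
            return offset + within
        offset += end - start + 1
    return None


def map_sites(sites, exons, strand="+"):
    """Keep only the sites that land inside the circRNA exons."""
    mapped = (genomic_to_circ_offset(p, exons, strand) for p in sites)
    return [m for m in mapped if m is not None]
```

For example, with exons `[(100, 109), (200, 204)]` on the plus strand, a site at genomic position 200 maps to offset 10 in the circRNA, while a site at 150 (intronic) is dropped.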

5. ORF length

The length of a potential open reading frame (ORF) is a common predictor for distinguishing coding from non-coding RNAs. A long ORF is usually not found in a non-coding RNA, and thus we used an ORF length > 20 aa as a minimal requirement for a circRNA-encoded peptide. It should be noted that ORF length is a relatively weak predictor, as many small peptides were recently found to be encoded by “non-coding” RNAs in the human transcriptome, whereas circRNAs with a long ORF have a better chance of being coding mRNAs.
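
A minimal sketch of an ORF-length filter on a circular sequence is shown below. It is not the database's actual procedure: the ATG-only start rule, the function name, and the choice to cap an ORF without an in-frame stop codon at one full cycle are assumptions made for illustration. The key point is that indexing wraps around, so an ORF may run across the back-splice junction.

```python
import math

STOP_CODONS = {"TAA", "TAG", "TGA"}


def circular_orfs(seq, min_aa=20):
    """Find ATG-initiated ORFs longer than min_aa on a circular sequence.

    Returns a list of (start, length_in_codons, has_stop) tuples, where
    start is a 0-based index and the length excludes the stop codon.
    Every in-frame ATG is reported, so nested ORFs appear separately.
    """
    seq = seq.upper().replace("U", "T")
    n = len(seq)
    if n < 3:
        return []

    def codon(p):
        # Wrap around the end of the sequence (the back-splice junction).
        return seq[p % n] + seq[(p + 1) % n] + seq[(p + 2) % n]

    # Reading returns to the same position and frame after this many codons.
    max_codons = n // math.gcd(n, 3)
    orfs = []
    for start in range(n):
        if codon(start) != "ATG":
            continue
        length, has_stop = 0, False
        for k in range(max_codons):
            if codon(start + 3 * k) in STOP_CODONS:
                has_stop = True
                break
            length += 1
        if length > min_aa:
            orfs.append((start, length, has_stop))
    return orfs
```

An ORF reported with `has_stop` set to `False` never meets an in-frame stop codon, so its reported length is only a lower bound.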

6. Sequence composition

The amino acid (aa) sequences of all natural proteins occupy only a very small fraction of the possible sequence space, mostly because only a subset of sequences can form stable proteins. Proteins with “unnatural” sequences therefore tend to be degraded rapidly, and similarity to natural proteins can serve as a strong predictor for identifying authentic proteins among random strings of amino acids. We used a machine learning approach to predict how similar a given sequence is to natural proteins, and applied this prediction to score how likely a given ORF encoded by a circRNA is to serve as a template for a functional protein.

7. Proteomic evidence from mass spectrometry

Mass spectrometry (MS) is an important method for accurately identifying and characterizing proteins. Several large-scale mass spectrometry experiments have been conducted to study the human proteome {}; however, only about 50% of MS spectra can be reliably assigned to known peptides encoded by human mRNAs, even when post-translational modifications are considered. This result suggests that there is a large “hidden proteome” encoded by non-canonical mRNAs, some of which could be encoded by circRNAs. We have defined a new set of rules and rigorous filters to search for circRNA-encoded peptides in MS datasets, and we use the search results as strong evidence for circRNA translation. We included all the raw MS spectra supporting circRNA-encoded peptides that cross the back-splice junctions.
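
Whether a peptide crosses the back-splice junction can be checked from coordinates alone, as in the sketch below. The actual rules and filters used for the MS search are not described here, so this is only an illustrative assumption: the function name and its parameters are hypothetical, and the junction is taken to lie between the last and the first nucleotide of the circRNA sequence.

```python
def spans_back_splice_junction(orf_start, pep_index, pep_len, circ_len):
    """Return True if a peptide's coding region covers the junction.

    orf_start: 0-based nucleotide index of the ORF's first codon
    pep_index: 0-based index of the peptide's first residue within the ORF
    pep_len:   peptide length in amino acids
    circ_len:  circRNA length in nucleotides
    All parameter names are hypothetical and used for illustration only.
    """
    # Position of the peptide's first coding nucleotide on the circle.
    nt_start = (orf_start + 3 * pep_index) % circ_len
    # Exclusive end; it exceeds circ_len only if reading wraps the junction.
    nt_end = nt_start + 3 * pep_len
    return nt_end > circ_len
```

With a 90-nt circRNA and an ORF starting at nucleotide 60, a 6-residue peptide beginning at residue 5 covers nucleotides 75 to 92 and so crosses the junction, whereas a 5-residue peptide at the same position ends exactly at the junction and does not.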

Data Content of TransCirc database:

Context information: taxonomy
Host gene information: genome build, position, official name, etc.
CircRNA information: locations, exons and sequences
ORF information: start and end positions, sequences
Protein/peptide product information: sequences, prediction of protein-likeness, and mass spectrometry evidence

Data sources:

circRNA:

Its taxonomy, host gene and location information were collected from circAtlas.

Ribosomal and polysome binding evidence:

Ribosome footprinting data [21]

Polysome profiling data [22]

Translation initiation site:

TISdb based on GTI-seq

IRES sequences:

IRES elements in the human genome or from random sequences [23,24]

m6A sites:

REPIC database [25]

Usage of TransCirc:

1. Simple Search

Search by Ensembl gene: e.g. IQCJ-SCHIP1 | ENSG00000283154.1
Search by TransCirc ID: e.g. TC-hsa-IQCJ-SCHIP1_0001
Search by other circRNA ID: e.g. hsa_circ_03218 | hsa-circRNA5793
Search by genomic position: e.g. chr3:159764000-159766000
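
For users who prepare queries in a script, the four query types above can be told apart with a few pattern checks. The sketch below is a hypothetical client-side helper, not part of the TransCirc website or any official API; the patterns are inferred only from the example queries listed above and may not cover every valid identifier.

```python
import re

# chrom:start-end, e.g. chr3:159764000-159766000
POSITION_RE = re.compile(r"^(chr[0-9A-Za-z_]+):(\d+)-(\d+)$")
# Ensembl gene ID with optional version, e.g. ENSG00000283154.1
ENSEMBL_GENE_RE = re.compile(r"^ENS[A-Z]*G\d+(\.\d+)?$")


def classify_query(query):
    """Return (query_type, value) for a simple-search string."""
    q = query.strip()
    m = POSITION_RE.match(q)
    if m:
        chrom, start, end = m.group(1), int(m.group(2)), int(m.group(3))
        if start > end:
            raise ValueError("start must not exceed end: " + q)
        return ("position", (chrom, start, end))
    if q.startswith("TC-"):
        return ("transcirc_id", q)
    if ENSEMBL_GENE_RE.match(q):
        return ("ensembl_gene", q)
    if re.search(r"circ", q, re.IGNORECASE):
        return ("other_circrna_id", q)
    # Anything else is treated as a gene symbol in this sketch.
    return ("gene_symbol", q)
```

The position check comes first because it is the most specific pattern; anything that matches none of the patterns falls through to the gene symbol case.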

2. Sequence-based search

3. Browse and filter

The browse and filter page lists all potentially translatable circRNAs based on the user's query. Users can use the evidence filtering and sorting tools at the top of the page to find circRNAs of interest.
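
The same kind of evidence filtering and sorting can be reproduced offline on downloaded records. The sketch below assumes each record is a dictionary with one boolean flag per evidence type; the field names and the ranking by number of supporting evidence types are assumptions for illustration and do not reflect the actual column names or ranking used by TransCirc.

```python
# Hypothetical flag names, one per evidence type described above.
EVIDENCE_TYPES = ("ribosome", "tis", "ires", "m6a", "orf", "sequence", "ms")


def filter_and_sort(records, required=()):
    """Keep records that have every required evidence type, then sort
    them by how many evidence types support them (most first)."""

    def n_evidence(rec):
        return sum(1 for e in EVIDENCE_TYPES if rec.get(e))

    kept = [r for r in records if all(r.get(e) for e in required)]
    # sorted() is stable, so ties keep their original order.
    return sorted(kept, key=n_evidence, reverse=True)
```

For instance, `filter_and_sort(records, required=("ms",))` would keep only circRNAs with mass spectrometry support and list those with the most additional evidence first.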

4. Detailed View

View detailed circRNA information, including evidence, exon composition, sequences, ORF locations, predicted peptides and more. We provide visualizations of various aspects of circRNAs and their products. Please let us know if you would like any further information or representations.
Example of detailed information
Example of evidences for details

5. Downloading

We allow batch download of flat tables. Please go here: https://biosino.org/transcirc/download