Registry and database of bioparts for synthetic biology

AlphaFold2-PointSite-Docking

AlphaFold2-PointSite-Docking is a pipeline for 3D structure prediction, accurate identification of protein ligand-binding atoms, and prediction of the ligand-receptor complex structure of catalytic bioparts.

Web

https://sdap.biosino.org/zelixir/#/auth/signin

PVQD

PVQD is a deep learning method for protein structure design and prediction. Here we publish the source code and demos for PVQD. The code is developed with the Torch framework.

Web

https://www.biosino.org/rdbsb/PVQD

Software

https://github.com/liuyf020419/PVQD-torch

Reference

Liu Y, Chen L, Liu H. Diffusion in a quantized vector space generates non-idealized protein structures and predicts conformational distributions. bioRxiv. 2023 Nov. doi: 10.1101/2023.11.18.567666

PathFinder

PathFinder is a tool for identifying pathways based on experimentally verified reactions for catalytic bioparts.
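As a rough illustration of pathway identification over a set of verified reactions, the sketch below runs a breadth-first search on a toy reaction graph. It is not PathFinder's implementation: the function name `find_pathway`, the `(reaction_id, substrate, product)` tuple layout, and the compound labels are all assumptions made for this example.

```python
from collections import deque


def find_pathway(reactions, source, target):
    """Return the shortest list of reaction IDs leading from source to target.

    reactions: iterable of (reaction_id, substrate, product) tuples
    (a hypothetical layout chosen for this sketch).
    Returns None when no pathway exists.
    """
    # Build an adjacency list: substrate -> [(reaction_id, product), ...]
    graph = {}
    for reaction_id, substrate, product in reactions:
        graph.setdefault(substrate, []).append((reaction_id, product))

    queue = deque([(source, [])])
    seen = {source}
    while queue:
        compound, path = queue.popleft()
        if compound == target:
            return path
        for reaction_id, product in graph.get(compound, []):
            if product not in seen:
                seen.add(product)
                queue.append((product, path + [reaction_id]))
    return None


# Toy data: two routes from A to C, one of two steps and one of three.
toy_reactions = [
    ("R1", "A", "B"),
    ("R2", "B", "C"),
    ("R3", "A", "D"),
    ("R4", "D", "E"),
    ("R5", "E", "C"),
]
shortest = find_pathway(toy_reactions, "A", "C")
```

Breadth-first search returns the route with the fewest reaction steps; a real tool would likely also weigh enzyme availability or reaction evidence, which this sketch ignores.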

Web

https://www.biosino.org/rdbsb/pathFinder

BiopartFinder

BiopartFinder is a tool for general catalytic biopart sequence identification and similarity searches.

Web

https://www.biosino.org/rdbsb/biopartFinder

RDBSB EC-Numbers

A class hierarchy (ontology) allows you to retrieve information according to categories of interest. In the class hierarchy on the linked page, each line names a single class of biological objects. The levels of indentation indicate a subclass relationship to the class above. The numbers in parentheses indicate the number of bioparts of that class. Clicking on a class displays a page containing its bioparts (the biological objects that are direct children of that class). A class page also lists the parent classes and child classes, allowing you to navigate up and down in the hierarchy.
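The subclass relationships follow the dotted structure of EC numbers, so the parent classes of an entry can be derived from its number alone. The sketch below shows this for a few well-formed EC numbers. The function names are hypothetical, and rolling counts up to every ancestor is an assumption of this example; the registry may count only direct children.

```python
def ec_ancestors(ec_number):
    """Return the parent classes of an EC number, from top level down."""
    parts = ec_number.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]


def count_by_class(ec_numbers):
    """Count entries per class, rolling each entry up to all its ancestors.

    Assumption of this sketch: an entry counts toward every class above it.
    """
    counts = {}
    for ec in ec_numbers:
        for cls in ec_ancestors(ec) + [ec]:
            counts[cls] = counts.get(cls, 0) + 1
    return counts


parents = ec_ancestors("2.4.1.17")
counts = count_by_class(["2.4.1.17", "2.4.1.1", "1.14.14.1"])
```

Here `parents` is the chain of classes a user would navigate through to reach the entry, and `counts` gives the kind of per-class totals shown in parentheses in a class hierarchy.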

Web

https://www.biosino.org/rdbsb/ec/ec_ontology

GTDB

GTDB is an integrated repository of glycosyltransferases, which collects comprehensive information, including amino acid sequences, coding region sequences, available tertiary structures, protein classification families, catalytic reactions and metabolic pathways involved, from distinct well-known databases or predictions.

Web

https://www.biosino.org/gtdb

Reference

Zhou C, Xu Q, He S, Ye W, Cao R, Wang P, Ling Y, Yan X, Wang Q, Zhang G. GTDB: an integrated resource for glycosyltransferase sequences and annotations. Database (Oxford). 2020 Jan 1;2020:baaa047. doi: 10.1093/database/baaa047. PMID: 32542364; PMCID: PMC7296393.

RefMetaDB

Reference Metabolome Database for Plants (RefMetaPlant) is an integrated database and analysis platform intended as a centralized resource for plant metabolomic research. It aims to standardize and integrate reference metabolome data, providing a comprehensive platform for researchers in plant metabolomics, genetics, and related fields. RefMetaPlant 1.0 currently provides:

(1) 1,086,000+ experimental mass spectra obtained by UPLC coupled with a Quadrupole-Orbitrap high-resolution mass spectrometer (UPLC-Q-Orbitrap-HRMS) from samples of 150+ plant species from Bryophyta, Lycopodiopsida, Pteridophyta, Gymnospermae, and Angiospermae;

(2) The reference metabolome for 153 plant species across the five major phyla of green plants;

(3) A library of 325,100+ mass spectra of standard compounds, comprising 135,464 experimental reference mass spectra from public databases such as MassBank, MoNA, Respect, FiehnLib, and RIKEN PlaSMA, and 189,639 in silico mass spectra;

(4) A set of related query and analysis tools, such as 'LC-MS/MS Query', 'RefMetaBlast', and 'CompoundLibBlast', for plant metabolome search and profiling and for metabolite identification.

RefMetaPlant provides a powerful platform to support plant genome-scale metabolomics analysis and to promote knowledge and data sharing and collaboration in metabolomic research.
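Searching a query spectrum against a reference library usually comes down to a spectral similarity score. The scoring used by the RefMetaPlant tools is not described here, so the sketch below substitutes a common choice, cosine similarity over binned peaks. The function names, the `(m/z, intensity)` peak layout, and the 0.01 bin width are assumptions made for this example.

```python
import math


def bin_spectrum(peaks, bin_width=0.01):
    """Sum peak intensities into m/z bins of the given width."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(spec_a, spec_b, bin_width=0.01):
    """Cosine similarity between two spectra given as (m/z, intensity) lists."""
    a = bin_spectrum(spec_a, bin_width)
    b = bin_spectrum(spec_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up peaks for illustration only.
query = [(100.05, 30.0), (150.10, 100.0), (200.20, 55.0)]
unrelated = [(80.00, 40.0), (120.30, 100.0)]
self_score = cosine_similarity(query, query)
unrelated_score = cosine_similarity(query, unrelated)
```

A spectrum compared with itself scores 1, and spectra with no shared bins score 0; ranking library entries by this score gives a simple form of library search.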

Web

https://www.biosino.org/RefMetaDB

Reference

Shi H, Wu X, Zhu Y, Jiang T, Wang Z, Li X, Liu J, Zhang Y, Chen F, Gao J, Xu X, Zhang G, Xiao N, Feng X, Zhang P, Wu Y, Li A, Chen P, Li X. RefMetaPlant: a reference metabolome database for plants across five major phyla. Nucleic Acids Res. 2024 Jan 5;52(D1):D1614-D1628. doi: 10.1093/nar/gkad980. PMID: 37953341; PMCID: PMC10767953.

tmap

Large-scale, integrative microbiome research calls for advanced data mining techniques in microbiome data analysis.

Topological data analysis (TDA) provides a promising technique for analyzing large-scale complex data. The most popular TDA algorithm, Mapper, is effective in distilling the shape of high-dimensional data, and provides a compressive network representation for pattern discovery and statistical analysis.

tmap is a topological data analysis framework implementing the TDA Mapper algorithm for population-scale microbiome data analysis. We developed tmap to enable easy adoption of TDA in microbiome data analysis pipelines, providing network-based statistical methods for enterotype analysis, driver species identification, and microbiome-wide association analysis of host meta-data.
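To make the Mapper idea concrete, the sketch below implements a minimal one-dimensional version in plain Python: cover the range of a filter (lens) value with overlapping intervals, cluster the points falling in each interval, and connect clusters that share points. It does not use or reproduce the tmap API; `mapper_1d`, `single_linkage`, and all parameters are hypothetical names, and the threshold-based clustering is a simple stand-in for whatever clusterer a real pipeline would use.

```python
import math


def single_linkage(points, members, eps):
    """Group member indices into clusters of points chained within eps."""
    clusters = []
    unvisited = set(members)
    while unvisited:
        seed = unvisited.pop()
        cluster, stack = {seed}, [seed]
        while stack:
            cur = stack.pop()
            near = {j for j in unvisited
                    if math.dist(points[cur], points[j]) <= eps}
            unvisited -= near
            cluster |= near
            stack.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def mapper_1d(points, lens, n_intervals=4, overlap=0.5, eps=1.0):
    """Return (nodes, edges) of a Mapper graph for a 1D lens.

    nodes: list of frozensets of point indices.
    edges: set of (i, j) node-index pairs whose clusters share a point.
    """
    lo, hi = min(lens), max(lens)
    length = (hi - lo) / n_intervals
    pad = length * overlap / 2  # widen each interval so neighbours overlap
    nodes = []
    for i in range(n_intervals):
        start = lo + i * length - pad
        end = lo + (i + 1) * length + pad
        members = [idx for idx, v in enumerate(lens) if start <= v <= end]
        nodes.extend(single_linkage(points, members, eps))
    edges = {(i, j)
             for i in range(len(nodes))
             for j in range(i + 1, len(nodes))
             if nodes[i] & nodes[j]}
    return nodes, edges


# Eight points on a line, using the coordinate itself as the lens.
pts = [(float(x),) for x in range(8)]
nodes, edges = mapper_1d(pts, [p[0] for p in pts], n_intervals=2)
```

With two overlapping intervals the eight points collapse into two nodes joined by one edge, which is the kind of compressed network representation that downstream statistics operate on.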

Software

https://github.com/GPZ-Bioinfo/tmap

https://tmap.readthedocs.io/en/latest

Reference

Liao T, Wei Y, Luo M, Zhao GP, Zhou H. tmap: an integrative framework based on topological data analysis for population-scale microbiome stratification and association studies. Genome Biol. 2019 Dec 23;20(1):293. doi: 10.1186/s13059-019-1871-4. PMID: 31870407; PMCID: PMC6927166.

ABACUS2

The ABACUS (a backbone-based amino acid usage survey) method uses unique statistical energy functions to carry out protein sequence design. Although some of its results have been experimentally verified, its accuracy can still be improved, because several important components of the method have not been specifically optimized for sequence design or in the context of the other parts of the method. The computational efficiency also needs to be improved to support interactive online applications or the consideration of a large number of alternative backbone structures.

We derived a model of solvent accessibility that has larger mutual information with residue types than previous models, optimized a set of rotamers that approximates sidechain atomic positions more accurately, and devised an empirical function for inter-atomic packing, with parameters fitted to native structures and optimized for consistency with the rotamer set. Energy calculations have been accelerated by interpolation between pre-determined representative points in high-dimensional structural feature spaces. Sidechain repacking tests showed that ABACUS2 can accurately reproduce the conformation of native sidechains. In sequence design tests, the native residue type recovery rate reached 37.7%, exceeding the value of 32.7% for ABACUS1. Applying ABACUS2 to design sequences on three native backbones produced proteins shown by experiments to be well folded.
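The native residue type recovery rate quoted above is commonly understood as the fraction of positions at which the designed sequence has the same residue type as the native one. A minimal sketch of that metric, assuming two aligned sequences of equal length (the function name and the example sequences are made up for illustration):

```python
def native_recovery_rate(native_seq, designed_seq):
    """Fraction of positions where the designed residue equals the native one."""
    if len(native_seq) != len(designed_seq):
        raise ValueError("sequences must have the same length")
    if not native_seq:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(native_seq, designed_seq))
    return matches / len(native_seq)


# Two of the eight positions differ (T->S and I->L).
rate = native_recovery_rate("MKTAYIAK", "MKSAYLAK")
```

Averaging this value over a benchmark set of backbones would give a figure comparable in kind to the 37.7% and 32.7% reported above, although the exact benchmark protocol is not described here.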

Web

https://biocomp.ustc.edu.cn/servers/software.php#

Software

https://biocomp.ustc.edu.cn/downloads/index.php?share/file&user=102&sid=Ad3MGUPJ

Reference

Xiong P, Hu X, Huang B, Zhang J, Chen Q, Liu H. Increasing the efficiency and accuracy of the ABACUS protein sequence design method. Bioinformatics. 2020 Jan 1;36(1):136-144. doi: 10.1093/bioinformatics/btz515. PMID: 31240299.

TFBS

TFBS: Binding Site Prediction for TetR Family Repressors

This server can be used to predict DNA binding sites for transcription factors of the TetR family of repressors (TFRs). It uses the two methods described in the reference listed below: a genome sequence-based method that draws on phylogenetic footprinting, and a statistical energy-based method that calculates sequence energies of DNA octamers given the amino acid sequences of TFRs. When the user provides no DNA sequence, TFBSs are predicted as genome sequence fragments near the ORF of a query TFR, and as octamer DNA sequences and sequence logos representing the predicted half binding sites of a query TFR. When the user provides a DNA sequence, that sequence is scanned for the most probable TFBSs.
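Scanning a user-provided DNA sequence for the most probable sites can be pictured as sliding an octamer window along the sequence and ranking windows by energy. The sketch below assumes a per-position, per-base energy table, which is a simplification invented for this example; the server's statistical energy model is derived from the TFR amino acid sequence and is not reproduced here. Input is assumed to be uppercase A/C/G/T.

```python
def scan_octamers(dna, energy_matrix, top_n=3):
    """Rank all windows of len(energy_matrix) bases by summed energy.

    energy_matrix: list of dicts mapping base -> energy, one per position
    (hypothetical layout; lower energy means a more favourable site).
    Returns up to top_n (energy, start, window) tuples, lowest energy first.
    """
    k = len(energy_matrix)
    hits = []
    for start in range(len(dna) - k + 1):
        window = dna[start:start + k]
        energy = sum(energy_matrix[i][base] for i, base in enumerate(window))
        hits.append((energy, start, window))
    hits.sort()
    return hits[:top_n]


# Toy energy table that favours the octamer ACGTACGT at every position.
favoured = "ACGTACGT"
toy_matrix = [{b: (-1.0 if b == f else 0.0) for b in "ACGT"} for f in favoured]
best = scan_octamers("GGGG" + favoured + "CCCC", toy_matrix, top_n=1)[0]
```

The top-ranked window is the embedded favoured octamer at offset 4; with a real energy model the same loop would report the most probable half binding sites in the input.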

Web

https://biocomp.ustc.edu.cn/servers/software.php#

Software

https://biocomp.ustc.edu.cn/servers/downloads/TFBS.tar.gz

Reference

Long P, Zhang L, Huang B, Chen Q, Liu H. Integrating genome sequence and structural data for statistical learning to predict transcription factor binding sites. Nucleic Acids Res. 2020 Dec 16;48(22):12604-12617. doi: 10.1093/nar/gkaa1134. PMID: 33264415; PMCID: PMC7736823.

ABACUS-R

ABACUS-R is a method based on deep learning for designing amino acid sequences that autonomously fold into a given target backbone.

Web

https://codeocean.com/capsule/6949436/tree/v1

Reference

Liu Y, Zhang L, Wang W, Zhu M, Wang C, Li F, Zhang J, Li H, Chen Q, Liu H. Rotamer-free protein sequence design based on deep learning and self-consistency. Nat Comput Sci. 2022 Jul;2(7):451-462. doi: 10.1038/s43588-022-00273-6. Epub 2022 Jul 21. Erratum in: Nat Comput Sci. 2022 Aug;2(8):526. PMID: 38177863.

SCUBA

SCUBA (SideChain Unspecialized Backbone Arrangement) is a statistical energy function of protein conformation. It consists of energy terms derived from known protein structures using a novel adaptive-kernel neighbor counting-neural network (NC-NN) approach. It is continuous with analytical gradients, allowing protein structures to be sampled and/or optimized with complete flexibility by stochastic dynamics (SD) simulations.

By design, SCUBA energy contains both local and through-space packing interactions of mainchain atoms, while sidechains in the model mainly serve as steric placeholders. In SCUBA-driven protein backbone design, SD simulated annealing can be applied to generate optimized backbone structures at high resolution from an initial backbone which can be partially or entirely artificially constructed. During the optimization, generic instead of specific sidechain types can be employed, solving the problem of designing backbones without knowing the amino acid sequence in advance.
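The role of analytical gradients in annealing can be illustrated on a toy problem. The sketch below anneals a single coordinate on a quadratic energy using overdamped Langevin steps, a simplified stand-in for full SD simulation. It has nothing to do with the SCUBA energy terms or the pySCUBA code; the energy function, step size, and linear temperature schedule are all assumptions of this example.

```python
import math
import random


def energy_gradient(x, minimum=2.0):
    """Analytical gradient of the toy energy E(x) = (x - minimum)^2."""
    return 2.0 * (x - minimum)


def anneal(x0, n_steps=5000, dt=0.01, t_start=1.0, seed=0):
    """Overdamped Langevin dynamics with a linearly decreasing temperature."""
    rng = random.Random(seed)
    x = x0
    for step in range(n_steps):
        # Temperature falls linearly from t_start towards zero.
        temperature = t_start * (1.0 - step / n_steps)
        noise = math.sqrt(2.0 * temperature * dt) * rng.gauss(0.0, 1.0)
        # Drift down the gradient plus thermal noise.
        x += -energy_gradient(x) * dt + noise
    return x


final_x = anneal(-5.0)
```

Early on, the thermal noise lets the coordinate explore; as the temperature drops, the gradient term dominates and the coordinate settles near the energy minimum at 2.0.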

Web

https://biocomp.ustc.edu.cn/servers/software.php#

Software

https://github.com/USTCwangsheng/pySCUBA

Reference

Huang B, Xu Y, Hu X, Liu Y, Liao S, Zhang J, Huang C, Hong J, Chen Q, Liu H. A backbone-centred energy function of neural networks for protein design. Nature. 2022 Feb;602(7897):523-528. doi: 10.1038/s41586-021-04383-5. Epub 2022 Feb 9. PMID: 35140398.

SCUBA-D

SCUBA-D is a deep learning method for generating protein backbone structures.

Web

https://github.com/liuyf020419/SCUBA-D

Software

https://zenodo.org/records/10947360

Reference

Liu Y, Chen L, Liu H. De novo protein backbone generation based on diffusion with structured priors and adversarial training. bioRxiv. 2022 Dec. doi: 10.1101/2022.12.17.520847

Pythia

Pythia is a tool for self-supervised prediction of protein mutational effects.

Web

https://pythia.wulab.xyz

EnzyPick

Enzyme screening is an essential preliminary activity in metabolic engineering. Nonetheless, the existing tools largely rely on prior knowledge, and cannot utilize custom candidate enzyme libraries. To address this, we introduced the Substrate–product Pair-based Enzyme Promiscuity Prediction (SPEPP) model, which leverages transfer learning and Transformer architecture to illuminate the intricate interplay between enzymes and substrate–product pairs. SPEPP exhibited good predictive ability, eliminating the need for prior knowledge of reactions and allowing users to define their candidate enzyme libraries. Owing to its adaptability, SPEPP can be seamlessly integrated into various metabolic engineering applications including, but not limited to, substrate/product screening, de novo pathway design, and hazardous material degradation. To better assist metabolic engineers in designing and refining biochemical pathways, particularly those without programming skills, we designed EnzyPick, an easy-to-use web server for enzyme screening, based on SPEPP.

Web

http://www.biosynther.com/enzypick

Software

https://zenodo.org/records/8210150

Reference

Xing H, Cai P, Liu D, Han M, Liu J, Le Y, Zhang D, Hu QN. High-throughput prediction of enzyme promiscuity based on substrate-product pairs. Brief Bioinform. 2024 Jan 22;25(2):bbae089. doi: 10.1093/bib/bbae089. PMID: 38487850; PMCID: PMC10940840.

RxnFinder

Microbial cell factories have many important and promising applications in producing bulk chemicals, natural products, biofuels, and more. The current bottleneck for microbial cell factories is how to design reactions, enzymes, and pathways from the enormous volume of biosynthesis data. Our team has focused on building a data-driven biosynthesis design platform (http://www.rxnfinder.org), which is composed of the following sections:

(1) Biochemical reaction database (RxnFinder): our team manually curated more than 300,000 biochemical reactions, about 10 times the number of KEGG reactions, from more than 550,000 biosynthesis references retrieved from PubMed using more than 40 biosynthesis-related keywords. More than 10 third-party databases are linked, and more than 10 informatics methods are provided;

(2) Enzyme discovery (ECAssignment): a chemical transformation-based method (ECAssigner) is proposed for enzyme discovery using biochemical reaction difference fingerprints and reaction similarity;

(3) Biosynthetic pathway design (BioSynther): the BioSynther tool was developed to design biosynthesis pathways between starting molecules and target molecules, and lets users interactively re-design biosynthetic pathways. One of the most promising applications of biosynthetic methods is to produce high-value chemical products from ready-made chemicals. BioSynther was also developed to explore the biosynthetic potential of precursor chemicals using BKM-react, Rhea, and more than 300,000 manually curated in-house RxnFinder reactions;

(4) Cell-based pathway optimization platform (SynBioEcoli, EcoSynther, LifeSynther): based on the comprehensive biosynthetic data curated above, whole-cell modelling methods are proposed to optimize heterogeneous biosynthetic pathways.

The proposed data-driven one-stop informatics platform can serve as a useful tool in metabolic engineering, biosynthesis, and synthetic biology.
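The idea of comparing reactions through difference fingerprints can be sketched as follows: describe each side of a reaction by counts of structural features, subtract substrate counts from product counts, and compare the resulting vectors. This is only an illustration of the general idea, not the ECAssigner method; the feature labels, function names, and the choice of a Tanimoto score on count vectors are assumptions made for this example.

```python
from collections import Counter


def difference_fingerprint(substrate_features, product_features):
    """Feature counts gained (positive) or lost (negative) in a reaction."""
    diff = Counter(product_features)
    diff.subtract(Counter(substrate_features))
    # Features untouched by the reaction cancel out and are dropped.
    return {feature: count for feature, count in diff.items() if count != 0}


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two sparse count vectors."""
    keys = set(fp_a) | set(fp_b)
    dot = sum(fp_a.get(k, 0) * fp_b.get(k, 0) for k in keys)
    denom = (sum(v * v for v in fp_a.values())
             + sum(v * v for v in fp_b.values()) - dot)
    return dot / denom if denom else 0.0


# Hypothetical feature labels: two alcohol oxidations and one glycosylation.
oxidation_1 = difference_fingerprint(["C-OH"], ["C=O"])
oxidation_2 = difference_fingerprint(["C-OH", "C-C"], ["C=O", "C-C"])
glycosylation = difference_fingerprint(["C-OH"], ["C-O-Glc"])
same_chemistry = tanimoto(oxidation_1, oxidation_2)
different_chemistry = tanimoto(oxidation_1, glycosylation)
```

The two oxidations share the same transformation and score 1, even though they act on different molecules, while the glycosylation shares only the lost hydroxyl and scores lower. An uncharacterized reaction could then borrow the enzyme class of its most similar annotated reaction.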

Web

http://www.rxnfinder.org

Measurement methods

Standardized characterization method for UDP-glycosyltransferases

Standardized characterization method for P450

Acknowledgement

National Program on Key Basic Research Project (973 Program)

The National Key Research and Development Program of China

International Partnership Program of Chinese Academy of Sciences

Biological Resources Programme, Chinese Academy of Sciences