MASH-Ocean

Description

As water pollution and global climate change grow more severe, understanding the diversity, function, and distribution of aquatic microorganisms has become increasingly important. Marine microorganisms are critical to marine ecosystems and have a significant effect on the health of the marine environment. They also directly drive geochemical cycling and thereby influence the Earth's climate and atmosphere. Establishing a domestic, user-friendly, and high-quality hydrosphere microbiome platform can fill a research gap in this field and give researchers worldwide an easy way to share and analyze microbiome data. Our first platform focuses on an oceanic database that provides data, services, and support for related research.

Contact information

Yinzhao Wang E-mail: wyz@sjtu.edu.cn

Liuyang Li E-mail: liuyangli@sjtu.edu.cn

Yaoxun Hu E-mail: 2452177401@qq.com

Citing MASH-Ocean

If you use MASH-Ocean in your work, please consider citing its manuscript:
Mash-Ocean 1.0: Interactive Platform for Investigating Microbial Diversity, Function, and Biogeography with Marine Metagenomic Data. Yinzhao Wang#, Liuyang Li#, Qiang Li#, Yaoxun Hu, Wenjie Li, Zhile Wu, Hungchia Huang, Zhenbo Lv, Wan Liu, Ruifang Cao, Guoping Zhao*, Fengping Wang*, Guoqing Zhang*. — iMeta 3, no. 3 (2024): e201. https://doi.org/10.1002/imt2.201.

Version update

Software information

Software	Version	Description
kraken2	2.1.2	An ultrafast and highly accurate program for assigning taxonomic labels to metagenomic DNA sequences
bracken	2.6.1	A highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample
krona	2.8.1	Visualization tool for relative abundances and confidences of metagenomic classifications
gtdbtk	2.3.2	Taxonomic classifications to bacterial and archaeal genomes based on the Genome Database Taxonomy
eggnog-mapper	2.1.8	Fast functional annotation tool based on precomputed Orthologous Groups and phylogenies
FastSpar	1.0.0	A C++ implementation of the SparCC algorithm for rapid and scalable correlation estimation of compositional data
R	4.1.0	A programming language for statistical computing and graphics

Annotation database information

Database	Version	Description
GTDB	Release 214	The Genome Taxonomy Database is a phylogenetically consistent, genome-based taxonomy that provides rank-normalized classifications for 402,709 bacterial and archaeal genomes from domain to genus.
eggNOG	version 5.0	eggNOG 5.0 is a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses
Kraken 2 and Bracken indexes	k2_pluspf_16gb_20220908	Indexes for Kraken 2 and Bracken built from RefSeq archaea, bacteria, viral, plasmid, human, UniVec_Core, protozoa, and fungi, with the database capped at 16 GB

Python package/script information

Python Package	Version	Description
Python	3.8.5	A general-purpose programming language
gtdb_to_ncbi_majority_vote.py	N/A	A tool to transform GTDB taxonomy to NCBI taxonomy (https://github.com/Ecogenomics/GTDBTk/blob/master/scripts/gtdb_to_ncbi_majority_vote.py)
iDIRECT	N/A	Inference of Direct and Indirect Relationships with Effective Copula-based Transitivity (https://github.com/nxiao6gt/iDIRECT/)

Co-occurrence network information

Network Level	Permutations for FastSpar	Interaction strength cutoff for FastSpar	P value cutoff for FastSpar	Interaction strength cutoff for iDIRECT
Phylum	permutations=1000	|r| > 0.1	P < 0.01	|r| > 0.1
Class	permutations=1000	|r| > 0.1	P < 0.01	|r| > 0.1
Order	permutations=1000	|r| > 0.1	P < 0.01	|r| > 0.1
Family	permutations=1000	|r| > 0.2	P < 0.01	|r| > 0.1
Genus	permutations=1000	|r| > 0.4	P < 0.01	|r| > 0.1
Species	permutations=1000	|r| > 0.4	P < 0.01	|r| > 0.1
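
The cutoffs in the table above can be read as a simple per-level edge filter. The sketch below is illustrative only and is not the platform's actual code: the function names and the tuple layout `(taxon_a, taxon_b, r, p)` are assumptions made for the example, and only the numeric cutoffs come from the table.

```python
# Cutoffs copied from the co-occurrence network table.
# Function names and the edge tuple layout are hypothetical.
FASTSPAR_R_CUTOFF = {
    "Phylum": 0.1,
    "Class": 0.1,
    "Order": 0.1,
    "Family": 0.2,
    "Genus": 0.4,
    "Species": 0.4,
}
FASTSPAR_P_CUTOFF = 0.01
IDIRECT_CUTOFF = 0.1


def filter_fastspar_edges(edges, level):
    """Keep (taxon_a, taxon_b, r, p) edges with |r| above the cutoff
    for the given taxonomic level and p below 0.01."""
    r_cutoff = FASTSPAR_R_CUTOFF[level]
    return [
        edge for edge in edges
        if abs(edge[2]) > r_cutoff and edge[3] < FASTSPAR_P_CUTOFF
    ]


def filter_idirect_edges(edges):
    """Keep (taxon_a, taxon_b, strength) edges with |strength| > 0.1."""
    return [edge for edge in edges if abs(edge[2]) > IDIRECT_CUTOFF]
```

For example, an edge with r = 0.35 and P = 0.001 would pass at the Family level (cutoff 0.2) but be dropped at the Genus level (cutoff 0.4).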

Sample collection information

The metagenomic data in MASH were collected from NCBI in September 2020 using 73 keywords that cover different types of biomes. Records labeled "enriched", "metatranscription", "amplicon", or "DOE Joint Genome Institute (JGI)", as well as very small files (below 200 MB), were excluded to avoid data bias and authorization problems. In total, we obtained 2,147 samples from different environments for the MASH database.
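
The exclusion rules described above can be sketched as a small predicate. This is a minimal illustration, not the script used to build the database: the function name, the case-insensitive substring matching, and the use of megabytes as the size unit in the input are all assumptions. The excluded terms and the 200 MB threshold are taken from the text.

```python
# Terms and size threshold from the sample collection description.
# The matching strategy (case-insensitive substring) is an assumption.
EXCLUDED_TERMS = (
    "enriched",
    "metatranscription",
    "amplicon",
    "DOE Joint Genome Institute (JGI)",
)
MIN_SIZE_MB = 200


def keep_sample(label, size_mb):
    """Return True if a record passes the label and file-size filters."""
    label_lower = label.lower()
    if any(term.lower() in label_lower for term in EXCLUDED_TERMS):
        return False
    # Files below 200 MB are excluded; exactly 200 MB is kept.
    return size_mb >= MIN_SIZE_MB
```

A record such as `keep_sample("seawater metagenome", 150)` would be rejected on size alone, while a 512 MB record with the same label would be kept.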

Video tutorial
