Writer: Olga Dudchenko

The Japanese macaque is the northernmost-living nonhuman primate. It is found on three of the four main Japanese islands and lives in habitats ranging from subtropical to subarctic forests [1]. The Japanese macaque has also featured prominently in culture: the three wise monkeys, which warn people to “see no evil, hear no evil and speak no evil”, are Japanese macaques [2]!


Today, we share a chromosome-length genome assembly for the Japanese macaque. This genome was created in collaboration with Michal Levy-Sakin (formerly UCSF and currently at Dovetail), Pui Kwok (UCSF), Betsy Ferguson (ONPRC) and Jeff Wall (UCSF).


See below how the chromosomes in the new genome assembly compare to those of several closely related primates: the rhesus macaque Macaca mulatta, genome assembly by the Washington University School of Medicine, shared here; crab-eating macaque Macaca fascicularis, genome assembly by the International Macaca fascicularis Genome Sequencing Consortium, shared here; and humans, by the Genome Reference Consortium, latest version available here.

Whole-genome alignments of the new Japanese macaque genome assembly (Macaca_fuscata_HiC) to several previously assembled primate species: rhesus macaque (Mmul_10, ~0.5MY to common ancestor), crab-eating macaque (Macaca_fascicularis_5.0, ~1MY to common ancestor) and human (GRCh38, ~25MY to common ancestor).

It is worth pointing out that homologies between human and Japanese macaque chromosomes have previously been studied by microscopy methods; see (Weinberg et al., 1992). We copy their results below for comparison. It is easy to see that the homologies calculated from genome assemblies agree with the predictions made by microscopy. The assembly comparison, however, offers a much more comprehensive picture of the intrachromosomal rearrangements that have taken place between the two species.

Fig. 2 from Weinberg et al., 1992: Idiogramatic representation of hybridization patterns of DNA derived from human chromosomes to chromosomes of Macaca fuscata. Macaque chromosome numbers are given below each chromosome, numbers on the left indicate subregions painted with the respective human chromosome specific library.

DNA Zoo China upgrades the genome assembly of the endangered forest musk deer (中国DNA动物园完成濒危动物林麝的基因组组装升级)


The forest musk deer (Moschus berezovskii) is a species of particular importance to Chinese ecology, biodiversity conservation, economy, and medicine. Like other musk deer, the forest musk deer has been (and still is) hunted for its musk. Hunting pressure, habitat loss, and disease in captive animals have led to a significant population decline. Today, the forest musk deer is a Class I protected species in China, listed in CITES Appendix II and classified as endangered by the IUCN [1].


林麝(Moschus berezovskii)在中国是一种对生态和物种多样性非常重要,且具有重要药用和经济价值的物种。 同其他麝一样,林麝因为产麝香被猎杀(直到今天也仍然有人非法猎杀林麝)。由于人类的捕杀、生存环境的破坏以及人工养殖中的疾病等问题最终导致林麝总群数量不断下降。目前,林麝是中国国家一级野生保护动物,也被世界自然保护联盟列入了濒危保护名单,被《濒危物种国际贸易公约》列为II类物种[1]。


To facilitate conservation efforts in China we have recently started DNA Zoo China at ShanghaiTech University. If you would like to work together on genome assembly of interesting species in China, please reach out to me, Dr. Lichun Jiang (jianglch@shanghaitech.edu.cn).


为了支持中国的野生物种保护工作,我们最近在上海科技大学开展了中国DNA动物园项目。如果您有兴趣和我们一起对中国的野生物种进行基因测序组装,请联系本人,蒋立春博士(jianglch@shanghaitech.edu.cn)。


As its inaugural effort, DNA Zoo China today upgrades the forest musk deer genome assembly from (Fan et al., 2018). We are very grateful to Dr. Suwen Zhao from ShanghaiTech University and to ShangHai ChongMing DongPing YuanShe XunYang Co., Ltd. for providing us with the sample for Hi-C library preparation!


作为中国DNA动物园的首发项目,我们于今天正式发布最新的林麝基因组。基于2018年发表的林麝基因组草图(Fan et al., 2018),我们通过Hi-C 基因组测序技术成功组装了中国DNA 动物园项目的首个染色体水平的林麝基因组。在此诚挚感谢为我们提供Hi-C文库构建所需样品的上海科技大学赵素文博士以及上海崇明东平原麝驯养有限公司。


The chromosome-length assembly is shared here. See below how the 29 chromosomes of the forest musk deer relate to the 30 chromosomes of cattle, from (Zimin et al., Genome Biology, 2009). Note the fusion of cow chromosomes #26 and #28 in the musk deer.


我们把染色体水平的组装结果分享在此(请点击链接)。如下图可以看到林麝的29条染色体和牛基因组的30条染色体(Zimin et al., Genome Biology, 2009)比对的情况。值得关注的是牛的26号和28号染色体在林麝中发生了融合。

Whole-genome alignment plot between the chromosomes of the new forest musk deer genome assembly (ls35.final.genome_HiC) and those of the domestic cow (Bos_taurus_UMD_3.1.1, from Zimin et al., 2009).

Gene annotations are critical for enabling the analysis of a new genome. Knowing the full complement of genes, and their order along chromosomes, is exceptionally powerful and allows direct comparisons of genomes across the tree of life. To begin enabling such comparisons, we set out to create a set of annotations for all the mammals at the DNA Zoo.


tl;dr: the entire set of >1 million protein-coding genes, spanning 67 mammalian genomes, can be found here (see also Wasabi mirror). Each species’ folder contains a set of transcripts, a set of proteins, and a gff3 file. The same files can also be found in the data release folders associated with each individual assembly.
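As a quick sanity check on a downloaded gff3 file, one can count its gene records. The sketch below assumes the standard GFF3 layout (nine tab-separated columns, feature type in the third column) and assumes that genes are recorded with the type `gene`; the three example records are hypothetical and are not taken from any DNA Zoo file.

```python
import io


def count_features(gff3_lines, feature_type="gene"):
    """Count GFF3 records whose type column (column 3) equals feature_type."""
    count = 0
    for line in gff3_lines:
        line = line.rstrip("\n")
        # An embedded FASTA section, if present, ends the annotation records.
        if line.startswith("##FASTA"):
            break
        # Skip blank lines and comment/directive lines.
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        # A GFF3 record has exactly nine tab-separated columns.
        if len(columns) != 9:
            continue
        if columns[2] == feature_type:
            count += 1
    return count


# Hypothetical records for illustration only (assumed feature types and IDs).
example = io.StringIO(
    "##gff-version 3\n"
    "chr1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1\n"
    "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=gene1-RA;Parent=gene1\n"
    "chr2\tmaker\tgene\t50\t400\t.\t-\t.\tID=gene2\n"
)
n_genes = count_features(example)
print(n_genes)
```

To use it on a real file, pass an open file handle instead of the in-memory example, e.g. `count_features(open(path))`, where `path` points at a species' gff3 file.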


A few important stats. On average, we identified 16,445 protein-coding genes per species. We do know that we’re missing between 8% and 10% of genes, based on gene content analysis in BUSCO. We have a clear analytical path to recover most of these (currently missing) genes, so stay tuned!


In total, we recovered 1,101,834 protein-coding genes. 98.1% of these genes are contained in about 22,156 orthogroups (sets of genes that share a common evolutionary history, specifically orthology or paralogy). What does this mean? Although we may be missing some genes (more on this below), with <2% of genes being “unassigned,” we are not predicting a bunch of junk, i.e., our false positive rate is low. How is this defined, you might ask. First, let’s assert that if a given transcript is found in other (or even many other) species by OrthoFinder2, then it is biological in nature rather than a technical artifact of the annotation process. This leaves us with about 2% of transcripts. What are these? Each one is either novel (evolved in a single species and found in no other in our dataset) or an artifact, i.e., a false positive. We surely do observe evolutionary novelty, but what fraction of this 2% it accounts for is unclear. Let’s be safe and say that all of it is artifactual, which puts an upper bound of about 2% on our false positive rate for calling transcripts.
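The arithmetic behind these figures can be retraced in a few lines. This is only a back-of-the-envelope sketch built from the numbers quoted in the post (total gene count, 67 genomes, 98.1% assigned); the variable names are ours, and the unassigned count is derived from the rounded percentage, so it is approximate.

```python
TOTAL_GENES = 1_101_834    # protein-coding genes recovered (from the post)
NUM_SPECIES = 67           # mammalian genomes annotated (from the post)
ASSIGNED_FRACTION = 0.981  # share of genes placed in orthogroups (from the post)

# Average number of protein-coding genes per species.
mean_genes_per_species = TOTAL_GENES / NUM_SPECIES

# Genes inside and outside orthogroups, derived from the rounded 98.1%.
assigned_genes = round(TOTAL_GENES * ASSIGNED_FRACTION)
unassigned_genes = TOTAL_GENES - assigned_genes

# Conservative bound: treat every unassigned gene as a false positive.
false_positive_upper_bound = unassigned_genes / TOTAL_GENES

print(f"mean genes per species: {mean_genes_per_species:.0f}")
print(f"unassigned genes (approx.): {unassigned_genes}")
print(f"false positive upper bound: {false_positive_upper_bound:.1%}")
```

The mean comes out at roughly 16,445 genes per species, matching the figure above, and the bound is 1.9%, which the text rounds to 2%.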


What did we do?


Maker-based annotation: A key constraint is that we wanted a standardized gene annotation in each species, but we don’t have transcriptome data for every species. As such, we devised a strategy to leverage the power of homology, along with the fact that we have extensive knowledge of gene content in mammals.


Specifically, we elected to use the mammal-specific subset of UniProtKB/Swiss-Prot, a manually annotated, non-redundant protein sequence database. We believe this is a reasonable first approach, given the broad taxonomic and genic coverage of the genomic dataset. The Swiss-Prot subset used is available here.


So, critically, the annotations contained here are based on SwissProt mammals. This does a pretty good job and identifies the vast majority of protein-coding genes. However, note that we did not endeavor to annotate noncoding transcripts – that’s something we will be doing in the future.



Each annotation took between 18 and 35 hours to run across 48 cores, for a total of about 80,000 core-hours.
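A rough consistency check of the compute budget, assuming one annotation run per species across all 67 genomes (the post does not state the number of runs explicitly, so that is our assumption):

```python
NUM_SPECIES = 67   # assumed: one annotation run per mammalian genome
CORES = 48         # cores used per run (from the post)
MIN_HOURS = 18     # shortest run, wall-clock hours (from the post)
MAX_HOURS = 35     # longest run, wall-clock hours (from the post)
REPORTED_TOTAL = 80_000  # approximate total core-hours (from the post)


def core_hours(wall_hours, cores=CORES, runs=NUM_SPECIES):
    """Total core-hours if every run took `wall_hours` on `cores` cores."""
    return wall_hours * cores * runs


lower_bound = core_hours(MIN_HOURS)
upper_bound = core_hours(MAX_HOURS)

# Average wall-clock time per run implied by the reported total.
implied_mean_hours = REPORTED_TOTAL / (CORES * NUM_SPECIES)

print(f"lower bound: {lower_bound} core-hours")
print(f"upper bound: {upper_bound} core-hours")
print(f"implied mean run time: {implied_mean_hours:.1f} hours")
```

Under that assumption the reported total sits between the two bounds and corresponds to an average run of about 25 hours.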


Orthogroups: In addition to the annotations, we generated orthogroups using OrthoFinder2 (Emms, Kelly, bioRxiv, 2018); this is incredibly useful information for comparative biologists. We’ve even included a tree, see below. This tree was constructed within OrthoFinder using the default settings (e.g., using FastTree), and is based on 1,023 orthogroups. Note that there may be some inconsistencies in the topology, as it’s our first stab at this, and further refinements are upcoming! One obvious way to improve the robustness of the tree is to apply a more appropriate model of sequence evolution for each partition, e.g., using IQ-TREE, but this is very time consuming.

Phylogenetic tree based on 1,023 orthogroups and constructed by OrthoFinder2 (Emms, Kelly, 2018). Scale bar refers to a phylogenetic distance of 0.04 nucleotide substitutions per site.
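For readers who want to work with the orthogroups directly, here is a minimal sketch of picking out orthogroups with exactly one gene in every species, a common starting point for species-tree inference. It assumes an OrthoFinder-style tab-separated table (first column the orthogroup ID, then one column per species with comma-separated gene IDs). The post does not say how the 1,023 orthogroups behind the tree were selected, so this is an illustration and not a description of that step; the species names and gene IDs below are hypothetical.

```python
import csv
import io


def single_copy_orthogroups(tsv_handle):
    """Return IDs of orthogroups with exactly one gene in every species."""
    reader = csv.reader(tsv_handle, delimiter="\t")
    header = next(reader)
    species = header[1:]  # first column holds the orthogroup ID
    selected = []
    for row in reader:
        if not row:
            continue
        orthogroup_id, cells = row[0], row[1:]
        # Pad short rows so that missing trailing species count as empty.
        cells = cells + [""] * (len(species) - len(cells))
        gene_counts = [
            len([gene for gene in cell.split(",") if gene.strip()])
            for cell in cells
        ]
        if all(count == 1 for count in gene_counts):
            selected.append(orthogroup_id)
    return selected


# Hypothetical table for illustration only.
example = io.StringIO(
    "Orthogroup\tspeciesA\tspeciesB\tspeciesC\n"
    "OG0000001\ta1, a2\tb1\tc1\n"  # two copies in speciesA
    "OG0000002\ta3\tb2\tc2\n"      # one copy everywhere
    "OG0000003\ta4\t\tc3\n"        # missing in speciesB
)
single_copy = single_copy_orthogroups(example)
print(single_copy)
```

Only the second orthogroup in the example passes the filter, since the first has a duplicate in one species and the third is missing from another.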

What’s next?


1. We know there are some genes that we’ve missed, and we have been developing a better approach, without sacrificing too much time. In addition to this, our current approach is missing ncRNA - stay tuned, because v2 annotation will contain these critically important elements!


2. Tell us what you want! What would make this even more useful? Let us know!


3. Is your favorite gene missing? Let us know and we can see where it went.


© 2018-2022 by the Aiden Lab.
