2020-09-29 JC (1)

Bioinformatics

Volume 36, Issue 13, July 2020

Motivation

It is well known that integrating different data sources is reliable because it can unveil new functionalities of genomic expression that might remain dormant in a single-source analysis. Moreover, several studies have justified the greater power of multi-platform analyses. To this end, we consider the omics profile of circadian genes, namely copy number changes and RNA-sequencing data, along with their survival response. We develop a Bayesian structural equation model coupled with linear regressions and a log-normal accelerated failure-time regression to integrate the information between these two platforms and predict the survival of the subjects. We place conjugate priors on the regression parameters and derive a Gibbs sampler from their conditional distributions.

Results

Our extensive simulation study shows that the integrative model provides a better fit to the data than its closest competitor. The analyses of glioblastoma cancer data and the breast cancer data from TCGA, the largest genomics and transcriptomics database, support our findings.

Key: integrating data from different sources; circadian genes; omics

Motivation

Microbial communities have been shown to be closely related to many diseases. Identifying differentially abundant microbial species is clinically meaningful for finding disease-related pathogenic or probiotic bacteria. However, certain characteristics of microbiome data have hindered the accuracy and effectiveness of differential abundance analysis. The abundances or counts of microbiome species are usually on different scales and exhibit zero-inflation and over-dispersion. Normalization is a crucial step before the differential abundance test. However, existing normalization methods typically try to adjust counts on different scales to a common scale by constructing size factors under the assumption that count distributions across samples are equivalent up to a certain percentile. These methods often yield undesirable results when differentially abundant species are of low to medium abundance. For differential abundance analysis, existing methods often use a single distribution to model the dispersion of species, which lacks the flexibility to capture the distinctiveness of individual species. These methods tend to detect many false positives and often lack power when the effect size is small.

Results

We develop a novel framework for differential abundance analysis on sparse high-dimensional marker gene microbiome data. Our methodology relies on a novel network-based normalization technique and a two-stage zero-inflated mixture count regression model (RioNorm2). Our normalization method aims to find a group of relatively invariant microbiome species across samples and conditions in order to construct the size factor. Another contribution of the paper is that our testing approach takes under-sampling and over-dispersion into consideration by separating microbiome species into two groups and modeling them separately. In comprehensive simulation studies, our method is consistently powerful and robust across settings with different sample sizes, library sizes and effect sizes. We also demonstrate the effectiveness of our framework on a published dataset of metastatic melanoma and draw biological insights from the results.
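The size-factor idea in this abstract can be sketched in a few lines. This is only an illustration, not the RioNorm2 procedure: the set of invariant species is assumed to be given (the paper finds it with a network-based method), and scaling by the median total is an assumption made here. The names `size_factors`, `normalize` and the toy counts are hypothetical.

```python
from statistics import median


def size_factors(counts, invariant_taxa):
    """Return one size factor per sample.

    counts: list of samples, each a list of per-species counts.
    invariant_taxa: indices of species assumed stable across samples
                    (hypothetical input; not derived here).
    """
    # Total count of the assumed-invariant species in each sample.
    totals = [sum(sample[t] for t in invariant_taxa) for sample in counts]
    reference = median(totals)
    if reference == 0 or any(t == 0 for t in totals):
        raise ValueError("invariant species must have non-zero counts in every sample")
    # Scale so that the median sample has a size factor of 1 (an assumption).
    return [t / reference for t in totals]


def normalize(counts, factors):
    """Divide every count in a sample by that sample's size factor."""
    return [[c / f for c in sample] for sample, f in zip(counts, factors)]


# Toy data: 3 samples x 4 species; species 0 and 1 are treated as invariant.
toy_counts = [
    [10, 20, 0, 5],
    [20, 40, 3, 0],
    [5, 10, 1, 1],
]
toy_factors = size_factors(toy_counts, invariant_taxa=[0, 1])
toy_normalized = normalize(toy_counts, toy_factors)
```

After normalization the two invariant species have the same value in every sample, so remaining differences in the other species reflect abundance rather than sequencing depth.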

Key: microbiome normalization; abundance; find a group of relatively stable microbiome species to construct the size factor

Motivation

De novo assembly of reference-quality genomes used to require enormously laborious tasks. In particular, it is extremely time-consuming to build genome markers for ordering assembled contigs along chromosomes; thus, they are only available for well-established model organisms. To resolve this issue, recent studies demonstrated that Hi-C could be a powerful and cost-effective means to output chromosome-length scaffolds for non-model species with no genome marker resources, because the Hi-C contact frequency between a pair of loci can be a good estimator of their genomic distance, even if there is a large gap between them. Indeed, state-of-the-art methods such as 3D-DNA are now widely used for locating contigs in chromosomes. However, it remains challenging to reduce errors in contig orientation because shorter contigs have fewer contacts with their neighboring contigs. These orientation errors lower the accuracy of gene prediction, read alignment, and synteny block estimation in comparative genomics.

Results

To reduce these contig orientation errors, we propose a new algorithm, named HiC-Hiker, which has a firm grounding in probabilistic theory, rigorously models Hi-C contacts across contigs, and effectively infers the most probable orientations via the Viterbi algorithm. We compared HiC-Hiker and 3D-DNA using human and worm genome contigs generated from short reads, evaluated their performances, and observed a remarkable reduction in the contig orientation error rate from 4.3% (3D-DNA) to 1.7% (HiC-Hiker). Our algorithm can consider long-range information between distal contigs and precisely estimates Hi-C read contact probabilities among contigs, which may also be useful for determining the ordering of contigs.
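As a rough illustration of how the Viterbi algorithm can pick contig orientations, the sketch below treats each contig as a two-state variable (0 = forward, 1 = reverse) and uses log-scores between adjacent contigs only. This is a simplification: the abstract states that HiC-Hiker also uses long-range information between distal contigs, and the scores here are invented numbers, not values derived from Hi-C contacts. The function name `viterbi_orientations` is hypothetical.

```python
def viterbi_orientations(pair_scores):
    """Most probable orientation (0 = forward, 1 = reverse) for each contig.

    pair_scores[i][a][b] is a log-score for contig i having orientation a
    and contig i + 1 having orientation b (hypothetical inputs).
    """
    best = [0.0, 0.0]  # best log-score of any path ending in each state
    backpointers = []
    for scores in pair_scores:
        new_best, pointers = [], []
        for b in (0, 1):
            candidates = [best[a] + scores[a][b] for a in (0, 1)]
            a_star = 0 if candidates[0] >= candidates[1] else 1
            new_best.append(candidates[a_star])
            pointers.append(a_star)
        best = new_best
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = 0 if best[0] >= best[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Three contigs, hence two adjacent pairs; the numbers are made up.
example_scores = [
    [[-0.1, -3.0], [-3.0, -2.0]],
    [[-2.5, -0.2], [-1.0, -3.0]],
]
example_path = viterbi_orientations(example_scores)
```

In this example the first two contigs stay forward and the third is flipped, because that combination has the highest total log-score.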

Key: determining contig orientation

Motivation

Understanding the ever-increasing number of metagenomic sequences accumulating in our databases demands approaches that rapidly ‘explore’ the content of multiple and/or large metagenomic datasets with respect to specific domain targets, avoiding full domain annotation and full assembly.

Results

S3A is a fast and accurate domain-targeted assembler designed for rapid functional profiling. It is based on a novel construction and a fast traversal of the Overlap-Layout-Consensus graph, designed to reconstruct coding regions from domain-annotated metagenomic sequence reads. S3A relies on high-quality domain annotation to efficiently assemble metagenomic sequences and on a new confidence measure for fast evaluation of overlapping reads. Its implementation is highly generic and can be applied to any type of annotation. On simulated data, S3A achieves a level of accuracy similar to that of classical metagenomic assembly tools while permitting faster and sensitive profiling of domains of interest. When studying a few dozen functional domains (a typical scenario), S3A is up to an order of magnitude faster than general-purpose metagenomic assemblers, thus enabling the analysis of a larger number of datasets in the same amount of time. S3A opens new avenues for the fast exploration of the rapidly increasing number of ever-larger metagenomic datasets.

key: metagenomic domain-targeted assembly

Motivation

Peptides are promising candidates for therapeutic and diagnostic development owing to their great physiological versatility and structural simplicity. Thus, identifying therapeutic peptides and investigating their properties are fundamentally important. As an inexpensive and fast approach, machine learning-based predictors have shown their strength in therapeutic peptide identification because they excel at processing massive data. To date, no reported therapeutic peptide predictor can perform high-quality generic prediction and identify informative physicochemical properties (IPPs) simultaneously.

Results

In this work, the Physicochemical Property-based Therapeutic Peptide Predictor (PPTPP), a Random Forest-based prediction method, is presented to address this issue. A novel feature encoding and learning scheme was introduced to produce and rank physicochemical property-related features. Besides predicting multiple types of therapeutic peptides with performance comparable to established predictors, the presented method can also identify the peptides’ IPPs. The results not only illustrate the soundness of the method but also demonstrate its potential for investigating other therapeutic peptides.

key: novel therapeutic peptide prediction method; short sequences, simple synthesis; random forest

Motivation

TNT (a widely used program for phylogenetic analysis) includes an interpreter for a scripting language, but that implementation is nonstandard and uses several conventions of its own. This article describes the implementation and basic usage of a C interpreter (with all the ISO essentials) now included in TNT. A phylogenetic library includes functions that can be used for manipulating trees and data, as well as other phylogeny-specific tasks. This greatly extends the capabilities of TNT.

key: phylogenetic analysis program; functions for manipulating trees and data

Motivation

Understanding how antibodies specifically interact with their antigens can enable better drug and vaccine design, as well as provide insights into natural immunity. Experimental structural characterization can detail the ‘ground truth’ of antibody–antigen interactions, but computational methods are required to efficiently scale to large-scale studies. To increase prediction accuracy as well as to provide a means to gain new biological insights into these interactions, we have developed a unified deep learning-based framework to predict binding interfaces on both antibodies and antigens.

Results

Our framework leverages three key aspects of antibody–antigen interactions to learn predictive structural representations: (i) since interfaces are formed from multiple residues in spatial proximity, we employ graph convolutions to aggregate properties across local regions in a protein; (ii) since interactions are specific between antibody–antigen pairs, we employ an attention layer to explicitly encode the context of the partner; (iii) since more data are available for general protein–protein interactions, we employ transfer learning to leverage this data as a prior for the specific case of antibody–antigen interactions. We show that this single framework achieves state-of-the-art performance at predicting binding interfaces on both antibodies and antigens, and that each of its three aspects drives additional improvement in the performance. We further show that the attention layer not only improves performance, but also provides a biologically interpretable perspective into the mode of interaction.
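Point (i) above, aggregating properties over spatially close residues, can be illustrated with a generic mean-aggregation graph convolution. This is a minimal sketch under stated assumptions, not the architecture from the paper: the neighbor lists, feature vectors and weight matrix are made-up values, and the function name `graph_conv` is hypothetical.

```python
def graph_conv(features, neighbors, weight):
    """One mean-aggregation graph convolution layer with a ReLU.

    features:  list of per-residue feature vectors.
    neighbors: dict mapping a residue index to the indices of residues
               that are spatially close to it (hypothetical input).
    weight:    matrix of shape (out_dim, in_dim), given as a list of rows.
    """
    output = []
    for i, own in enumerate(features):
        group = [i] + list(neighbors.get(i, []))  # the residue plus its neighborhood
        mean = [
            sum(features[j][d] for j in group) / len(group)
            for d in range(len(own))
        ]
        # Linear map followed by ReLU.
        output.append([max(0.0, sum(w * x for w, x in zip(row, mean))) for row in weight])
    return output


toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_neighbors = {0: [1], 1: [0, 2], 2: [1]}
toy_weight = [[1.0, -1.0], [0.5, 0.5]]
toy_hidden = graph_conv(toy_features, toy_neighbors, toy_weight)
```

Each output vector depends on the residue's local neighborhood rather than on the residue alone, which is the property the abstract relies on for interface prediction.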

key: antibody–antigen binding interface prediction; graph convolutional neural network; attention mechanism; transfer learning

Motivation

Molecular docking is a widely used technique for large-scale virtual screening of the interactions between small-molecule ligands and their target proteins. However, docking methods often perform poorly for metalloproteins due to additional complexity from the three-way interactions among amino-acid residues, metal ions and ligands. This is a significant problem because zinc proteins alone comprise about 10% of all available protein structures in the Protein Data Bank. Here, we developed GM-DockZn, which is dedicated to ligand docking to zinc proteins. Unlike the existing docking methods developed specifically for zinc proteins, GM-DockZn samples ligand conformations directly using a geometric grid around the ideal zinc-coordination positions of seven discovered coordination motifs, which were found from a survey of known zinc proteins complexed with a single ligand.

Results

GM-DockZn has the best performance in sampling near-native poses with correct coordination atoms and numbers within the top 50 and top 10 predictions when compared to several state-of-the-art techniques. This is true not only for a non-redundant dataset of zinc proteins but also for a homolog set of different ligand and zinc-coordination systems for the same zinc proteins. Similar superior performance of GM-DockZn for near-native-pose sampling was also observed for docking to apo-structures and cross-docking between different ligand complex structures of the same protein. The highest success rate for sampling nearest near-native poses within top 5 and top 1 was achieved by combining GM-DockZn for conformational sampling with GOLD for ranking. The proposed geometry-based sampling technique will be useful for ligand docking to other metalloproteins.

key: metalloprotein docking

Motivation

Synthesizing proteins in heterologous hosts is an important tool in biotechnology. However, the genetic code is degenerate and codon usage is biased in many organisms. Synonymous codon changes that are customized for each host organism may have a significant effect on the level of protein expression. This effect can be measured using metrics such as the codon adaptation index, codon pair bias, relative codon bias and relative codon pair bias. Codon optimization is the design of codon sequences that improve one or more of these objectives. Currently available algorithms and software solutions either rely on heuristics without providing optimality guarantees or are very rigid in modeling different objective functions and restrictions.

Results

We develop an effective mixed integer linear programming (MILP) formulation, which considers multiple objectives. Our numerical study shows that this formulation can be effectively used to generate (Pareto) optimal codon designs even for very long amino acid sequences using a standard commercial solver. We also show that one can obtain designs in the efficient frontier in reasonable solution times and incorporate other complex objectives, such as mRNA secondary structures, in codon design using MILP formulations.
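Of the objectives named in the motivation, the codon adaptation index (CAI) is the easiest to compute by hand. The sketch below uses the usual definition: each codon gets a weight equal to its frequency divided by the frequency of the most used synonymous codon, and the CAI is the geometric mean of those weights over the sequence. It is not the MILP formulation from the paper, and the usage counts are invented for illustration.

```python
import math


def relative_adaptiveness(usage):
    """Map each codon to its count divided by the maximum count in its synonymous family.

    usage: dict amino acid -> dict codon -> count in a reference gene set
           (hypothetical numbers below).
    """
    weights = {}
    for family in usage.values():
        top = max(family.values())
        for codon, count in family.items():
            weights[codon] = count / top
    return weights


def codon_adaptation_index(codons, weights):
    """Geometric mean of the codon weights along a coding sequence."""
    logs = [math.log(weights[c]) for c in codons]
    return math.exp(sum(logs) / len(logs))


# Invented usage table for two amino acids (Lys and Glu).
toy_usage = {
    "K": {"AAA": 30, "AAG": 10},
    "E": {"GAA": 20, "GAG": 20},
}
toy_weights = relative_adaptiveness(toy_usage)
cai_preferred = codon_adaptation_index(["AAA", "GAA"], toy_weights)
cai_mixed = codon_adaptation_index(["AAG", "GAG"], toy_weights)
```

A design that uses only the most frequent codon of each family scores 1; replacing AAA with the rarer AAG lowers the score, which is the kind of trade-off an optimizer has to balance against the other objectives.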

key: codon optimization

Summary

Single-cell RNA sequencing technology provides a novel means to analyze the transcriptomic profiles of individual cells. The technique is vulnerable, however, to a type of noise called dropout effects, which lead to zero-inflated distributions in the transcriptome profile and reduce the reliability of the results. Single-cell RNA sequencing data, therefore, need to be carefully processed before in-depth analysis. Here, we describe a novel imputation method that reduces dropout effects in single-cell sequencing. We construct a cell correspondence network and adjust gene expression estimates based on transcriptome profiles for the local subnetwork of cells of the same type. We comprehensively evaluated this method, called PRIME (PRobabilistic IMputation to reduce dropout effects in Expression profiles of single-cell sequencing), on synthetic and eight real single-cell sequencing datasets and verified that it improves the quality of visualization and accuracy of clustering analysis and can discover gene expression patterns hidden by noise.

key: a probabilistic imputation method that reduces dropout effects in single-cell RNA sequencing; local subnetwork of cells of the same type

Motivation

Matrix factorization is an important way to analyze coregulation patterns in transcriptomic data, which can reveal tumor signal perturbation status and subtype classification. However, current matrix factorization methods do not provide a clear bicluster structure. Furthermore, these algorithms assume a linear combination, which may not be sufficient to capture the coregulation patterns.

Results

We present a new algorithm for Boolean matrix factorization (BMF) via expectation maximization (BEM). BEM is more aligned with the molecular mechanism of transcriptomic coregulation and can scale to matrices with over 100 million data points. Synthetic experiments showed that BEM outperformed other BMF methods in terms of reconstruction error. Real-world applications demonstrated that BEM is applicable to all kinds of transcriptomic data, including bulk RNA-seq, single-cell RNA-seq and spatial transcriptomic datasets. Given appropriate binarization, BEM was able to extract coregulation patterns consistent with disease subtypes, cell types or spatial anatomy.
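To make the Boolean model concrete, the sketch below computes a Boolean matrix product (OR of ANDs instead of sum of products) and counts mismatches against a binarized data matrix. It only illustrates the model and the reconstruction error mentioned above; it does not implement the expectation-maximization fitting of BEM, and the matrices are toy values.

```python
def boolean_product(a, b):
    """Boolean product: entry (i, j) is 1 if a[i][k] and b[k][j] are both 1 for some k."""
    return [
        [int(any(a[i][k] and b[k][j] for k in range(len(b)))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def reconstruction_error(x, a, b):
    """Number of entries where the binary matrix x differs from the Boolean product of a and b."""
    approx = boolean_product(a, b)
    return sum(xv != av for x_row, a_row in zip(x, approx) for xv, av in zip(x_row, a_row))


# Rows could be samples and columns genes; the two patterns overlap in column 1.
sample_membership = [[1, 0], [1, 1], [0, 1]]
gene_patterns = [[1, 1, 0, 0], [0, 1, 1, 0]]
observed = [[1, 1, 0, 0], [1, 1, 1, 1], [0, 1, 1, 0]]

reconstructed = boolean_product(sample_membership, gene_patterns)
errors = reconstruction_error(observed, sample_membership, gene_patterns)
```

Where two patterns overlap, the Boolean product stays at 1, whereas an ordinary matrix product would give 2; this is the difference from the linear-combination assumption noted in the motivation.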

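key: mining coregulation patterns in transcriptomics via Boolean matrix factorization; revealing tumor signal perturbation status and subtype classification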

Motivation

Emerging evidence indicates that circular RNA (circRNA) plays a crucial role in human disease. Using circRNA as a biomarker offers a new perspective on diagnosing diseases and understanding disease pathogenesis. However, detecting circRNA–disease associations by biological experiments alone is often undirected, small in scale, costly and time-consuming. Therefore, there is an urgent need for reliable computational methods to rapidly infer potential circRNA–disease associations on a large scale and to provide the most promising candidates for biological experiments.

Results

In this article, we propose an efficient computational method based on multi-source information combined with a deep convolutional neural network (CNN) to predict circRNA–disease associations. The method first fuses multi-source information, including disease semantic similarity, disease Gaussian interaction profile kernel similarity and circRNA Gaussian interaction profile kernel similarity, then extracts hidden deep features through the CNN, and finally sends them to an extreme learning machine classifier for prediction. The 5-fold cross-validation results show that the proposed method achieves 87.21% prediction accuracy with 88.50% sensitivity and an area under the curve of 86.67% on the CIRCR2Disease dataset. In comparison with the state-of-the-art SVM classifier and other feature extraction methods on the same dataset, the proposed model achieves the best results. In addition, we obtained experimental support for the prediction results by searching the published literature: 7 of the top 15 circRNA–disease pairs with the highest scores were confirmed. These results demonstrate that the proposed model is a suitable method for predicting circRNA–disease associations and can provide reliable candidates for biological experiments.
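The Gaussian interaction profile (GIP) kernel similarity used as an input feature can be sketched as follows. This follows a commonly used definition, in which the bandwidth is normalized by the mean squared norm of the interaction profiles; whether the paper uses exactly this normalization is an assumption. The profiles are toy binary vectors (one row per circRNA, one column per disease), and `gip_kernel` is a hypothetical name.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """Pairwise Gaussian interaction profile kernel similarity.

    profiles: list of binary interaction profiles, one per entity.
    gamma_prime: bandwidth parameter before normalization (assumed default of 1).
    """
    squared_norms = [sum(v * v for v in p) for p in profiles]
    mean_norm = sum(squared_norms) / len(squared_norms)
    if mean_norm == 0:
        raise ValueError("at least one profile must contain a known association")
    gamma = gamma_prime / mean_norm
    n = len(profiles)
    return [
        [
            math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(profiles[i], profiles[j])))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Three circRNAs against three diseases; 1 marks a known association (toy data).
toy_profiles = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
toy_similarity = gip_kernel(toy_profiles)
```

Entities with identical association profiles get a similarity of 1, and the similarity decays as the profiles diverge.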

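key: an efficient method based on multi-source information that predicts circRNA–disease associations with a deep convolutional neural network; extreme learning machine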

Motivation

The advent of in vivo automated techniques for single-cell lineaging, sequencing and analysis of gene expression has begun to dramatically increase our understanding of organismal development. We applied novel meta-analysis and visualization techniques to the EPIC single-cell-resolution developmental gene expression dataset for Caenorhabditis elegans from Bao, Murray, Waterston et al. to gain insights into regulatory mechanisms governing the timing of development.

Results

Our meta-analysis of the EPIC dataset revealed that a simple linear combination of the expression levels of the developmental genes is strongly correlated with the developmental age of the organism, irrespective of the cell division rate of different cell lineages. We uncovered a pattern of collective sinusoidal oscillation in gene activation, in multiple dominant frequencies and in multiple orthogonal axes of gene expression, pointing to the existence of a coordinated, multi-frequency global timing mechanism. We developed a novel method based on Fisher’s Discriminant Analysis to identify gene expression weightings that maximally separate traits of interest, and found that remarkably, simple linear gene expression weightings are capable of producing sinusoidal oscillations of any frequency and phase, adding to the growing body of evidence that oscillatory mechanisms likely play an important role in the timing of development. We cross-linked EPIC with gene ontology and anatomy ontology terms, employing Fisher’s Discriminant Analysis methods to identify previously unknown positive and negative genetic contributions to developmental processes and cell phenotypes. This meta-analysis demonstrates new evidence for direct linear and/or sinusoidal mechanisms regulating the timing of development. We uncovered a number of previously unknown positive and negative correlations between developmental genes and developmental processes or cell phenotypes. Our results highlight both the continued relevance of the EPIC technique, and the value of meta-analysis of previously published results. The presented analysis and visualization techniques are broadly applicable across developmental and systems biology.

key: meta-analysis; meta-analysis of C. elegans single-cell developmental data reveals multi-frequency oscillation in gene activation

Motivation

Many ordinary differential equation (ODE) models have been introduced to replace linear regression models for inferring gene regulatory relationships from time-course gene expression data. However, because the observed data are usually not direct measurements of the gene products, or because there is an unknown time lag in gene regulation, it is problematic to directly apply traditional ODE models or linear regression models.

Results

We introduce a lagged ODE model to infer lagged gene regulatory relationships from time-course measurements, which are modeled as a linear transformation of the gene products. A time-course microarray dataset from a yeast cell-cycle study is used for simulation assessment of the methods and for real data analysis. The results show that our method, by considering both time lag and measurement scaling, performs much better than other linear and ODE models. This indicates the necessity of explicitly modeling the time lag and measurement scaling in ODE gene regulatory models.

key: can ODE gene regulatory models neglect time lag or measurement scaling?; time-course gene expression

Motivation

The outbreak of COVID-2019, which began in Wuhan, China, has become a global threat through rapid transmission and severe fatalities. Recent studies have uncovered the whole-genome sequence of SARS-CoV-2 (the cause of COVID-2019). In addition, lung metagenomic studies on infected patients revealed overrepresented Prevotella spp. producing certain proteins in abundance. We performed host–pathogen protein–protein interaction analysis between SARS-CoV-2 and overrepresented Prevotella proteins and the human proteome. We also performed functional overrepresentation analysis of the interacting proteins to understand their role in COVID-2019 severity.

Results

It was found that overexpressed Prevotella proteins can promote viral infection. According to the results, Prevotella proteins, but not viral proteins, are involved in multiple interactions with NF-κB, which is involved in increasing the clinical severity of COVID-2019. Prevotella may have a role in the COVID-2019 outbreak and should be considered when studying disease mechanisms and improving treatment outcomes.

key: PPI

Motivation

Understanding the underlying biological mechanisms and respective interactions of a disease remains an elusive, time consuming and costly task. Computational methodologies that propose pathway/mechanism communities and reveal respective relationships can be of great value as they can help expedite the process of identifying how perturbations in a single pathway can affect other pathways.

Results

We present a random-walks-based methodology called PathWalks, where a walker crosses a pathway-to-pathway network under the guidance of a disease-related map. The latter is a gene network that we construct by integrating multi-source information regarding a specific disease. The most frequent trajectories highlight communities of pathways that are expected to be strongly related to the disease under study.

We apply the PathWalks methodology to Alzheimer's disease and idiopathic pulmonary fibrosis and establish that it can highlight pathways that are also identified by other pathway analysis tools and are backed by bibliographic references. More importantly, PathWalks produces additional new pathways that are functionally connected with those already established, giving insight for further experimentation.
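The guided random walk described above can be illustrated with a weighted walk on a small pathway-to-pathway network. This is a simplified sketch, not the PathWalks implementation: the edge weights stand in for the guidance that the paper derives from a disease-related gene map, and the pathway names, weights and the function name `guided_walk` are hypothetical.

```python
import random
from collections import Counter


def guided_walk(edges, start, steps, seed=0):
    """Count how often each undirected edge is traversed by a weighted random walk.

    edges: dict pathway -> dict neighboring pathway -> weight, where a larger
           weight means the (hypothetical) disease map favors that transition.
    """
    rng = random.Random(seed)  # fixed seed so that repeated runs agree
    visits = Counter()
    node = start
    for _ in range(steps):
        nbrs = list(edges[node])
        weights = [edges[node][n] for n in nbrs]
        nxt = rng.choices(nbrs, weights=weights, k=1)[0]
        visits[frozenset((node, nxt))] += 1
        node = nxt
    return visits


# Toy network of three pathways; the A-B link is strongly favored.
toy_edges = {
    "A": {"B": 5.0, "C": 1.0},
    "B": {"A": 5.0, "C": 1.0},
    "C": {"A": 1.0, "B": 1.0},
}
toy_visits = guided_walk(toy_edges, start="A", steps=2000)
top_edge, top_count = toy_visits.most_common(1)[0]
```

The most frequently traversed edges play the role of the highlighted trajectories: here the walker spends most of its steps moving between A and B.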

key: pathway/mechanism communities

Summary

Recently, novel machine-learning algorithms have shown potential for predicting undiscovered links in biomedical knowledge networks. However, dedicated benchmarks for measuring algorithmic progress have not yet emerged. With OpenBioLink, we introduce a large-scale, high-quality and highly challenging biomedical link prediction benchmark to transparently and reproducibly evaluate such algorithms. Furthermore, we present preliminary baseline evaluation results.

key: benchmark dataset