z
Bioinformatics
Volume 36, Issue 10, 15 May 2020
Motivation
Hi-C is currently the method of choice to investigate the global 3D organization of the genome. A major limitation of Hi-C is the sequencing depth required to robustly detect loops in the data. A popular approach used to mitigate this issue, even in single-cell Hi-C data, is genome-wide averaging (piling-up) of peaks, or other features, annotated in high-resolution datasets, to measure their prominence in less deeply sequenced data. However, current tools do not provide a computationally efficient and versatile implementation of this approach.
Results
Here, we describe coolpup.py—a versatile tool to perform pile-up analysis on Hi-C data. We demonstrate its utility by replicating previously published findings regarding the role of cohesin and CTCF in 3D genome organization, as well as discovering novel details of Polycomb-driven interactions. We also present a novel variation of the pile-up approach that can aid the statistical analysis of looping interactions. We anticipate thatcoolpup.pywill aid in Hi-C data analysis by allowing easy to use, versatile and efficient generation of pile-ups
Key: Hi-C数据的堆积分析
Motivation
The studies have indicated that not only microRNAs (miRNAs) or long non-coding RNAs (lncRNAs) play important roles in biological activities, but also their interactions affect the biological process. A growing number of studies focus on the miRNA–lncRNA interactions, while few of them are proposed for plant. The prediction of interactions is significant for understanding the mechanism of interaction between miRNA and lncRNA in plant.
Results
This article proposes a new method for fulfilling plant miRNA–lncRNA interaction prediction (PmliPred). The deep learning model and shallow machine learning model are trained using raw sequence and manually extracted features, respectively. Then they are hybridized based on fuzzy decision for prediction. PmliPred shows better performance and generalization ability compared with the existing methods. Several new miRNA–lncRNA interactions inSolanum lycopersicumare successfully identified using quantitative real time–polymerase chain reaction from the candidates predicted by PmliPred, which further verifies its effectiveness.
Key: miRNA和lncRNA互作 深度学习
Motivation
One of the key computational problems in comparative genomics is the reconstruction of genomes of ancestral species based on genomes of extant species. Since most dramatic changes in genomic architectures are caused by genome rearrangements, this problem is often posed as minimization of the number of genome rearrangements between extant and ancestral genomes. The basic case of three given genomes is known as thegenome median problem. Whole-genome duplications (WGDs) represent yet another type of dramatic evolutionary events and inspire the reconstruction of preduplicated ancestral genomes, referred to as thegenome halving problem. Generalization of WGDs to whole-genome multiplication events leads to thegenome aliquoting problem.
Results
In this study, we propose polynomial-size integer linear programming (ILP) formulations for the aforementioned problems. We further obtain such formulations for the restricted and conserved versions of the median and halving problems, which have been recently introduced to improve biological relevance of the solutions. Extensive evaluation of solutions to the different ILP problems demonstrates their good accuracy. Furthermore, since the ILP formulations for the conserved versions have linear size, they provide a novel practical approach to ancestral genome reconstruction, which combines the advantages of homology- and rearrangements-based methods.
Motivation
With the emerging of high-dimensional genomic data, genetic analysis such as genome-wide association studies (GWAS) have played an important role in identifying disease-related genetic variants and novel treatments. Complex longitudinal phenotypes are commonly collected in medical studies. However, since limited analytical approaches are available for longitudinal traits, these data are often underutilized. In this article, we develop a high-throughput machine learning approach for multilocus GWAS using longitudinal traits by coupling Empirical Bayesian Estimates from mixed-effects modeling with a novel ℓ0-norm algorithm.
Results
Extensive simulations demonstrated that the proposed approach not only provided accurate selection of single nucleotide polymorphisms (SNPs) with comparable or higher power but also robust control of false positives. More importantly, this novel approach is highly scalable and could be approximately >1000 times faster than recently published approaches, making genome-wide multilocus analysis of longitudinal traits possible. In addition, our proposed approach can simultaneously analyze millions of SNPs if the computer memory allows, thereby potentially allowing a true multilocus analysis for high-dimensional genomic data. With application to the data from Alzheimer's Disease Neuroimaging Initiative, we confirmed that our approach can identify well-known SNPs associated with AD and were much faster than recently published approaches (≥6000 times).
Key: 高吞吐量机器学习GWAS
Motivation
Methodological advances in metagenome assembly are rapidly increasing in the number of published metagenome assemblies. However, identifying misassemblies is challenging due to a lack of closely related reference genomes that can act as pseudo ground truth. Existing reference-free methods are no longer maintained, can make strong assumptions that may not hold across a diversity of research projects, and have not been validated on large-scale metagenome assemblies.
Results
We present DeepMAsED, a deep learning approach for identifying misassembled contigs without the need for reference genomes. Moreover, we provide anin silicopipeline for generating large-scale, realistic metagenome assemblies for comprehensive model training and testing. DeepMAsED accuracy substantially exceeds the state-of-the-art when applied to large and complex metagenome assemblies. Our model estimates a 1% contig misassembly rate in two recent large-scale metagenome assembly publications.
Conclusions
DeepMAsED accurately identifies misassemblies in metagenome-assembled contigs from a broad diversity of bacteria and archaea without the need for reference genomes or strong modeling assumptions. Running DeepMAsED is straight-forward, as well as is model re-training with our dataset generation pipeline. Therefore, DeepMAsED is a flexible misassembly classifier that can be applied to a wide range of metagenome assembly projects.
Motivation
Knowledge of protein–ligand binding residues is important for understanding the functions of proteins and their interaction mechanisms. From experimentally solved protein structures, how to accurately identify its potential binding sites of a specific ligand on the protein is still a challenging problem. Compared with structure-alignment-based methods, machine learning algorithms provide an alternative flexible solution which is less dependent on annotated homogeneous protein structures. Several factors are important for an efficient protein–ligand prediction model, e.g. discriminative feature representation and effective learning architecture to deal with both the large-scale and severely imbalanced data.
Results
In this study, we propose a novel deep-learning-based method called DELIA for protein–ligand binding residue prediction. In DELIA, a hybrid deep neural network is designed to integrate 1D sequence-based features with 2D structure-based amino acid distance matrices. To overcome the problem of severe data imbalance between the binding and nonbinding residues, strategies of oversampling in mini-batch, random undersampling and stacking ensemble are designed to enhance the model. Experimental results on five benchmark datasets demonstrate the effectiveness of proposed DELIA pipeline.
Key: 蛋白配体结合残基预测 深度学习
Motivation
Cell-penetrating peptides (CPPs) are a vehicle for transporting into living cells pharmacologically active molecules, such as short interfering RNAs, nanoparticles, plasmid DNAs and small peptides, thus offering great potential as future therapeutics. Existing experimental techniques for identifying CPPs are time-consuming and expensive. Thus, the prediction of CPPs from peptide sequences by using computational methods can be useful to annotate and guide the experimental process quickly. Many machine learning-based methods have recently emerged for identifying CPPs. Although considerable progress has been made, existing methods still have low feature representation capabilities, thereby limiting further performance improvements.
Results
We propose a method called StackCPPred, which proposes three feature methods on the basis of the pairwise energy content of the residue as follows: RECM-composition, PseRECM and RECM–DWT. These features are used to train stacking-based machine learning methods to effectively predict CPPs. On the basis of the CPP924 and CPPsite3 datasets with jackknife validation, StackDPPred achieved 94.5% and 78.3% accuracy, which was 2.9% and 5.8% higher than the state-of-the-art CPP predictors, respectively. StackCPPred can be a powerful tool for predicting CPPs and their uptake efficiency, facilitating hypothesis-driven experimental design and accelerating their applications in clinical therapy.
Key: 从肽序列中预测细胞膜穿透肽(CPP)
Motivation
Searching the Longest Common Subsequences of many sequences is called a Multiple Longest Common Subsequence (MLCS) problem which is a very fundamental and challenging problem in many fields of data mining. The existing algorithms cannot be applicable to problems with long and large-scale sequences due to their huge time and space consumption. To efficiently handle large-scale MLCS problems, a Path Recorder Directed Acyclic Graph (PRDAG) model and a novel Path Recorder Algorithm (PRA) are proposed.
Results
Searching the Longest Common Subsequences of many sequences is called a Multiple Longest Common Subsequence (MLCS) problem which is a very fundamental and challenging problem in many fields of data mining. The existing algorithms cannot be applicable to problems with long and large-scale sequences due to their huge time and space consumption. To efficiently handle large-scale MLCS problems, a Path Recorder Directed Acyclic Graph (PRDAG) model and a novel Path Recorder Algorithm (PRA) are proposed.
Key: MLCS 算法优化
Motivation
Single-cell RNA-sequencing (scRNA-seq) allows us to dissect transcriptional heterogeneity arising from cellular types, spatio-temporal contexts and environmental stimuli. Transcriptional heterogeneity may reflect phenotypes and molecular signatures that are often unmeasured or unknown a priori. Cell identities of samples derived from heterogeneous subpopulations are then determined by clustering of scRNA-seq data. These cell identities are used in downstream analyses. How can we examine if cell identities are accurately inferred? Unlike external measurements or labels for single cells, using clustering-based cell identities result in spurious signals and false discoveries.
Results
We introduce non-parametric methods to evaluate cell identities by testing cluster memberships in an unsupervised manner. Diverse simulation studies demonstrate accuracy of the jackstraw test for cluster membership. We propose a posterior probability that a cell should be included in that clustering-based subpopulation. Posterior inclusion probabilities (PIPs) for cluster memberships can be used to select and visualize samples relevant to subpopulations. The proposed methods are applied on three scRNA-seq datasets. First, a mixture of Jurkat and 293T cell lines provides two distinct cellular populations. Second, Cell Hashing yields cell identities corresponding to eight donors which are independently analyzed by the jackstraw. Third, peripheral blood mononuclear cells are used to explore heterogeneous immune populations. The proposed P-values and PIPs lead to probabilistic feature selection of single cells that can be visualized using principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE) and others. By learning uncertainty in clustering high-dimensional data, the proposed methods enable unsupervised evaluation of cluster membership.
Key: scRNA-seq 成员细胞身份聚类评估
Motivation
Batch effect is a frequent challenge in deep sequencing data analysis that can lead to misleading conclusions. Existing methods do not correct batch effects satisfactorily, especially with single-cell RNA sequencing (RNA-seq) data.
Results
We present scBatch, a numerical algorithm for batch-effect correction on bulk and single-cell RNA-seq data with emphasis on improving both clustering and gene differential expression analysis. scBatch is not restricted by assumptions on the mechanism of batch-effect generation. As shown in simulations and real data analyses, scBatch outperforms benchmark batch-effect correction methods.
Key: RNA-seq批量效应校正
Motivation
Single-cell RNA sequencing (scRNA-seq) has enabled the simultaneous transcriptomic profiling of individual cells under different biological conditions. scRNA-seq data have two unique challenges that can affect the sensitivity and specificity of single-cell differential expression analysis: a large proportion of expressed genes with zero or low read counts ('dropout' events) and multimodal data distributions.
Results
We have developed a zero-inflation-adjusted quantile (ZIAQ) algorithm, which is the first method to account for both dropout rates and complex scRNA-seq data distributions in the same model. ZIAQ demonstrates superior performance over several existing methods on simulated scRNA-seq datasets by finding more differentially expressed genes. When ZIAQ was applied to the comparison of neoplastic and non-neoplastic cells from a human glioblastoma dataset, the ranking of biologically relevant genes and pathways showed clear improvement over existing methods.
Key: scRNA-seq下游差异表达分析(DEA)的分位数回归方法
Motivation
Single-cell RNA sequencing (scRNA-seq) methods make it possible to reveal gene expression patterns at single-cell resolution. Due to technical defects, dropout events in scRNA-seq will add noise to the gene-cell expression matrix and hinder downstream analysis. Therefore, it is important for recovering the true gene expression levels before carrying out downstream analysis.
Results
In this article, we develop an imputation method, called scTSSR, to recover gene expression for scRNA-seq. Unlike most existing methods that impute dropout events by borrowing information across only genes or cells, scTSSR simultaneously leverages information from both similar genes and similar cells using a two-side sparse self-representation model. We demonstrate that scTSSR can effectively capture the Gini coefficients of genes and gene-to-gene correlations observed in single-molecule RNA fluorescence in situ hybridization (smRNA FISH). Down-sampling experiments indicate that scTSSR performs better than existing methods in recovering the true gene expression levels. We also show that scTSSR has a competitive performance in differential expression analysis, cell clustering and cell trajectory inference.
Key: 恢复scRNA-seq真实的基因表达水平 双向稀疏自表达模型
Motivation
Single-cell RNA-sequencing (scRNA-seq) technology provides a powerful tool for investigating cell heterogeneity and cell subpopulations by allowing the quantification of gene expression at single-cell level. However, scRNA-seq data analysis remains challenging because of various technical noises such as dropout events (i.e. excessive zero counts in the expression matrix).
Results
By taking consideration of the association among cells and genes, we propose a novel collaborative matrix factorization-based method called CMF-Impute to impute the dropout entries of a given scRNA-seq expression matrix. We test CMF-Impute and compare it with the other five state-of-the-art methods on six popular real scRNA-seq datasets of various sizes and three simulated datasets. For simulated datasets, CMF-Impute outperforms other methods in imputing the closest dropouts to the original expression values as evaluated by both the sum of squared error and Pearson correlation coefficient. For real datasets, CMF-Impute achieves the most accurate cell classification results in spite of the choice of different clustering methods like SC3 or T-SNE followed by K-means as evaluated by both adjusted rand index and normalized mutual information. Finally, we demonstrate that CMF-Impute is powerful in reconstructing cell-to-cell and gene-to-gene correlation, and in inferring cell lineage trajectories.
Key: 协作矩阵因子 scRNA-seq的dropout恢复
Motivation
Single cell RNA-sequencing (scRNA-seq) technology enables whole transcriptome profiling at single cell resolution and holds great promises in many biological and medical applications. Nevertheless, scRNA-seq often fails to capture expressed genes, leading to the prominent dropout problem. These dropouts cause many problems in down-stream analysis, such as significant increase of noises, power loss in differential expression analysis and obscuring of gene-to-gene or cell-to-cell relationship. Imputation of these dropout values can be beneficial in scRNA-seq data analysis.
Results
In this article, we model the dropout imputation problem as robust matrix decomposition. This model has minimal assumptions and allows us to develop a computational efficient imputation method called scRMD. Extensive data analysis shows that scRMD can accurately recover the dropout values and help to improve downstream analysis such as differential expression analysis and clustering analysis.
Key: dropout估算