Briefings in Bioinformatics
Volume 21, Issue 4, July 2020
Cleft palate (CP) is the second most common congenital birth defect. The etiology of CP is complicated, with involvement of various genetic and environmental factors. To investigate the gene regulatory mechanisms, we designed a powerful regulatory analytical approach to identify the conserved regulatory networks in humans and mice, from which we identified critical microRNAs (miRNAs), target genes and regulatory motifs (miRNA–TF–gene) related to CP. Using our manually curated genes and miRNAs with evidence in CP in humans and mice, we constructed miRNA and transcription factor (TF) co-regulation networks for both humans and mice. A consensus regulatory loop (miR17/miR20a–FOXE1–PDGFRA) and eight miRNAs (miR-140, miR-17, miR-18a, miR-19a, miR-19b, miR-20a, miR-451a and miR-92a) were discovered in both humans and mice. The role of miR-140, which had the strongest association with CP, was investigated in both human and mouse palate cells. The overexpression of miR-140-5p, but not miR-140-3p, significantly inhibited cell proliferation. We further examined whether miR-140 overexpression could suppress the expression of its predicted target genes (BMP2, FGF9, PAX9 and PDGFRA). Our results indicated that miR-140-5p overexpression suppressed the expression of BMP2 and FGF9 in cultured human palate cells and Fgf9 and Pdgfra in cultured mouse palate cells. In summary, our conserved miRNA–TF–gene regulatory network approach is effective in detecting consensus miRNAs, motifs, and regulatory mechanisms in human and mouse CP.
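Brief: Identifies cleft palate-related transcription factors, miRNAs and mRNAs, infers their regulatory relationships, and validates part of them experimentally.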
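As a rough illustration of what searching for miRNA–TF–gene motifs can look like, the sketch below enumerates three-node loops in a small directed network. The function name `find_regulatory_loops` and the toy edge lists are hypothetical; the edges only mirror the loop named in the abstract and are not the authors' curated data or their actual pipeline.

```python
def find_regulatory_loops(mirna_targets, tf_targets, tfs):
    """Return (miRNA, TF, gene) triples in which the miRNA targets both the
    TF and the gene, and the TF also regulates the same gene."""
    loops = []
    for mirna, targets in sorted(mirna_targets.items()):
        # TFs that this miRNA is assumed to repress
        for tf in sorted(targets & tfs):
            # genes shared between the TF's targets and the miRNA's targets
            for gene in sorted(tf_targets.get(tf, set()) & targets):
                if gene != tf:
                    loops.append((mirna, tf, gene))
    return loops


# Toy edges, assumed for illustration only.
mirna_targets = {
    "miR-17": {"FOXE1", "PDGFRA"},
    "miR-20a": {"FOXE1", "PDGFRA"},
    "miR-140": {"BMP2"},
}
tf_targets = {"FOXE1": {"PDGFRA"}}
tfs = {"FOXE1"}

loops = find_regulatory_loops(mirna_targets, tf_targets, tfs)
print(loops)
```

Running the same search on a human network and a mouse network and intersecting the two result lists would be one simple way to look for conserved loops, although the abstract does not say how the authors did this.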
Somatic mutation and gene expression dysregulation are considered two major tumorigenesis factors. While independent investigations of either factor pervade, studies of associations between somatic mutations and gene expression changes have been sporadic and nonsystematic. Utilizing genomic data collected from 11 315 subjects of 33 distinct cancer types, we constructed MutEx, a pan-cancer integrative genomic database. This database records the relationships among gene expression, somatic mutation and survival data for cancer patients. MutEx can be used to swiftly explore the relationship between these genomic/clinic features within and across cancer types and, more importantly, search for corroborating evidence for hypothesis inception. Our database also incorporated Gene Ontology and several pathway databases to enhance functional annotation, and elastic net and a gene expression composite score to aid in survival analysis. To demonstrate the usability of MutEx, we provide several application examples, including top somatic mutations associated with the most extensive expression dysregulation in breast cancer, differential mutational burden downstream of DNA mismatch repair gene mutations and composite gene expression score-based survival difference in breast cancer. MutEx can be accessed at http://www.innovebioinfo.com/Databases/Mutationdb_About.php.
Brief: Pan-cancer database linking somatic mutations, gene expression and survival.
With the increasing awareness of heterogeneity in cancers, better prediction of cancer prognosis is much needed for more personalized treatment. Recently, extensive efforts have been made to explore the variations in gene expression for better prognosis. However, the prognostic gene signatures predicted by most existing methods have little robustness among different datasets of the same cancer. To improve the robustness of the gene signatures, we propose a novel high-frequency sub-pathways mining approach (HiFreSP), integrating a randomization strategy with gene interaction pathways. We identified a six-gene signature (CCND1, CSF3R, E2F2, JUP, RARA and TCF7) in esophageal squamous cell carcinoma (ESCC) by HiFreSP. This signature displayed a strong ability to predict the clinical outcome of ESCC patients in two independent datasets (log-rank test, P = 0.0045 and 0.0087). To further show the predictive performance of HiFreSP, we applied it to two other cancers: pancreatic adenocarcinoma and breast cancer. The identified signatures show high predictive power in all testing datasets of the two cancers. Furthermore, compared with the two popular prognosis signature predicting methods, the least absolute shrinkage and selection operator penalized Cox proportional hazards model and the random survival forest, HiFreSP showed better predictive accuracy and generalization across all testing datasets of the above three cancers. Lastly, we applied HiFreSP to 8137 patients involving 20 cancer types in the TCGA database and found high-frequency prognosis-associated pathways in many cancers. Taken together, HiFreSP shows higher prognostic capability and greater robustness, and the identified signatures provide clinical guidance for cancer prognosis. HiFreSP is freely available via GitHub: https://github.com/chunquanlipathway/HiFreSP.
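Brief: Tool for selecting cancer prognostic genes by combining a randomization strategy with gene interaction pathways; outperforms LASSO-penalized Cox regression and random survival forest.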
Depression is a seriously disabling psychiatric disorder with a significant burden of disease. Metabolic abnormalities have been widely reported in depressed patients and animal models. However, there are few systematic efforts that integrate meaningful biological insights from these studies. Herein, available metabolic knowledge in the context of depression was integrated to provide a systematic and panoramic view of metabolic characterization. After screening more than 10 000 citations from five electronic literature databases and five metabolomics databases, we manually curated 5675 metabolite entries from 464 studies, including human, rat, mouse and non-human primate, to develop a new metabolite-disease association database, called MENDA (http://menda.cqmu.edu.cn:8080/index.php). The standardized data extraction process was used for data collection, a multi-faceted annotation scheme was developed, and a user-friendly search engine and web interface were integrated for database access. To facilitate data analysis and interpretation based on MENDA, we also proposed a systematic analytical framework, including data integration and biological function analysis. Case studies were provided that identified the consistently altered metabolites using the vote-counting method, and that captured the underlying molecular mechanism using pathway and network analyses. Collectively, we provided a comprehensive curation of metabolic characterization in depression. Our model of a specific psychiatry disorder may be replicated to study other complex diseases.
Brief: Metabolic atlas of depression, built by integrating existing databases and literature, with related functional annotation.
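The vote-counting method mentioned in the abstract can be sketched as follows. This is a minimal version under stated assumptions: each record is a (metabolite, direction) pair from one study, the metabolite names are placeholders, and the `min_studies` cutoff is a hypothetical parameter rather than a value taken from MENDA.

```python
from collections import Counter


def vote_count(records, min_studies=3):
    """Tally 'up' and 'down' reports per metabolite and keep metabolites
    that were reported often enough and have a net direction."""
    ups, downs = Counter(), Counter()
    for metabolite, direction in records:
        if direction == "up":
            ups[metabolite] += 1
        elif direction == "down":
            downs[metabolite] += 1

    result = {}
    for metabolite in set(ups) | set(downs):
        total = ups[metabolite] + downs[metabolite]
        score = ups[metabolite] - downs[metabolite]  # net vote
        if total >= min_studies and score != 0:
            result[metabolite] = ("up" if score > 0 else "down", score, total)
    return result


# Placeholder records, not real MENDA entries.
records = (
    [("metabolite_A", "up")] * 3 + [("metabolite_A", "down")]
    + [("metabolite_B", "down")] * 3
    + [("metabolite_C", "up"), ("metabolite_C", "down")] * 2
    + [("metabolite_D", "up")]
)
votes = vote_count(records)
print(votes)
```

Metabolites with tied votes or too few reports are dropped, which is the conservative choice when the goal is to list consistently altered metabolites.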
Circular RNAs (circRNAs) are a group of newly discovered non-coding RNAs with a closed-loop structure, which play critical roles in various biological processes. Identifying associations between circRNAs and diseases is critical for exploring the complex disease mechanism and facilitating disease-targeted therapy. Although several computational predictors have been proposed, their performance is still limited. In this study, a novel computational method called iCircDA-MF is proposed. Because the circRNA-disease associations with experimental validation are very limited, the potential circRNA-disease associations are calculated based on the circRNA similarity and disease similarity extracted from the disease semantic information and the known associations of circRNA-gene, gene-disease and circRNA-disease. The circRNA-disease interaction profiles are then updated by the neighbour interaction profiles so as to correct the false negative associations. Finally, matrix factorization is performed on the updated circRNA-disease interaction profiles to predict the circRNA-disease associations. The experimental results on a widely used benchmark dataset showed that iCircDA-MF outperforms other state-of-the-art predictors and can identify new circRNA-disease associations effectively.
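Brief: Tool for predicting circRNA-disease associations; inputs are textual (disease semantic) information and known relationships among circRNAs, genes and diseases; based on matrix factorization.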
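The step in which interaction profiles are "updated by the neighbour interaction profiles" can be sketched as below. The abstract does not give the formula, so the weighting scheme here (a similarity-weighted average over the k most similar circRNAs, combined with the original profile by element-wise maximum) is an assumption, as are the function name `update_profiles` and the toy matrices.

```python
def update_profiles(assoc, sim, k=2):
    """Blend each circRNA's association row with the similarity-weighted
    rows of its k most similar neighbours. Known associations are never
    lowered because the element-wise maximum is kept."""
    n = len(assoc)
    updated = []
    for i in range(n):
        # indices of the k most similar other circRNAs
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sim[i][j],
            reverse=True,
        )[:k]
        weight_sum = sum(sim[i][j] for j in neighbours)
        row = []
        for d in range(len(assoc[i])):
            if weight_sum > 0:
                inferred = sum(sim[i][j] * assoc[j][d] for j in neighbours) / weight_sum
            else:
                inferred = 0.0
            row.append(max(assoc[i][d], inferred))
        updated.append(row)
    return updated


# Toy data: 3 circRNAs x 2 diseases, plus a circRNA similarity matrix.
assoc = [[1, 0], [1, 1], [0, 0]]
sim = [
    [1.0, 0.75, 0.25],
    [0.75, 1.0, 0.5],
    [0.25, 0.5, 1.0],
]
updated = update_profiles(assoc, sim, k=2)
print(updated)
```

The third circRNA has no known associations, yet it receives non-zero scores from its neighbours; such softened zeros are what a subsequent matrix factorization would then be fitted to.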
Microbial communities (MCs) have a great impact on mediating complex disease indications, biogeochemical cycling and agricultural productivity, which makes metaproteomics a powerful technique for quantifying the diverse and dynamic composition of proteins or peptides. The key role of biostatistical strategies in MC studies is reported to be underestimated; in particular, the appropriate application of feature selection methods (FSMs) is largely ignored. Although extensive efforts have been devoted to assessing the performance of FSMs, previous studies focused only on their classification accuracy without considering their ability to correctly and comprehensively identify the spiked proteins. In this study, the performances of 14 FSMs were comprehensively assessed based on two key criteria (both sample classification and spiked protein discovery) using a variety of metaproteomics benchmarks. First, the classification accuracies of those 14 FSMs were evaluated. Then, their abilities in identifying the proteins of different spiked concentrations were assessed. Finally, seven FSMs (FC, LMEB, OPLS-DA, PLS-DA, SAM, SVM-RFE and T-Test) were identified as performing consistently superior or good under both criteria, with PLS-DA performing consistently superior. In summary, this study serves as a comprehensive analysis of the performance of current FSMs and could provide a valuable guideline for researchers in metaproteomics.
Brief: Metaproteomics (assessment of feature selection methods).
Streptococcus pneumoniae is the most common human respiratory pathogen, and β-lactam antibiotics have been employed to treat infections caused by S. pneumoniae for decades. β-lactam resistance is steadily increasing in pneumococci and is mainly associated with the alteration in penicillin-binding proteins (PBPs) that reduce binding affinity of antibiotics to PBPs. However, the high variability of PBPs in clinical isolates and their mosaic gene structure hamper the prediction of resistance level according to the PBP gene sequences. In this study, we developed a systematic strategy for applying supervised machine learning to predict S. pneumoniae antimicrobial susceptibility to β-lactam antibiotics. We combined published PBP sequences with minimum inhibitory concentration (MIC) values as labelled data and the sequences from NCBI database without MIC values as unlabelled data to develop an approach, using only a fragment from pbp2x (750 bp) and a fragment from pbp2b (750 bp) to predict the cefuroxime and amoxicillin resistance. We further validated the performance of the supervised learning model by constructing mutants containing the randomly selected pbps and testing more clinical strains isolated from Chinese hospital. In addition, we established the association between resistance phenotypes and serotypes and sequence type of S. pneumoniae using our approach, which facilitates the understanding of the worldwide epidemiology of S. pneumoniae.
Brief: Supervised learning to predict Streptococcus pneumoniae susceptibility to antibiotics.
Functional annotation of protein sequence with high accuracy has become one of the most important issues in modern biomedical studies, and computational approaches of significantly accelerated analysis process and enhanced accuracy are greatly desired. Although a variety of methods have been developed to elevate protein annotation accuracy, their ability in controlling false annotation rates remains either limited or not systematically evaluated. In this study, a protein encoding strategy, together with a deep learning algorithm, was proposed to control the false discovery rate in protein function annotation, and its performances were systematically compared with that of the traditional similarity-based and de novo approaches. Based on a comprehensive assessment from multiple perspectives, the proposed strategy and algorithm were found to perform better in both prediction stability and annotation accuracy compared with other de novo methods. Moreover, an in-depth assessment revealed that it possessed an improved capacity of controlling the false discovery rate compared with traditional methods. All in all, this study not only provided a comprehensive analysis on the performances of the newly proposed strategy but also provided a tool for the researcher in the fields of protein function annotation.
Brief: Improves protein function annotation accuracy using a protein encoding strategy and deep learning.
Essential genes are those whose loss of function compromises organism viability or results in profound loss of fitness. Recent gene-editing technologies have provided new opportunities to characterize essential genes. Here, we present an integrated analysis that comprehensively and systematically elucidates the genetic and regulatory characteristics of human essential genes. First, we found that essential genes act as ‘hubs’ in protein–protein interaction networks, chromatin structure and epigenetic modification. Second, essential genes represent conserved biological processes across species, although gene essentiality changes differently among species. Third, essential genes are important for cell development due to their discriminate transcription activity in embryo development and oncogenesis. In addition, we developed an interactive web server, the Human Essential Genes Interactive Analysis Platform (http://sysomics.com/HEGIAP/), which integrates abundant analytical tools to enable global, multidimensional interpretation of gene essentiality. Our study provides new insights that improve the understanding of human essential genes.
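Brief: Database (web platform) on the genetic and regulatory characteristics of human essential genes, building on gene-editing data.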
Long non-coding RNAs (lncRNAs) are of fundamental biological importance; however, their functional role is often unclear or loosely defined as experimental characterization is challenging and bioinformatic methods are limited. We developed a novel integrated method protocol for the annotation and detailed functional characterization of lncRNAs within the genome. It combines annotation, normalization and gene expression with sequence-structure conservation, functional interactome and promoter analysis. Our protocol allows an analysis based on the tissue and biological context, and is powerful in functional characterization of experimental and clinical RNA-Seq datasets including existing lncRNAs. This is demonstrated on the uncharacterized lncRNA GATA6-AS1 in dilated cardiomyopathy.
Brief: lncRNA functional annotation.
Nucleic Acids Res
Volume 48, Issue 16, 18 September 2020
Microbial and viral communities transform the chemistry of Earth's ecosystems, yet the specific reactions catalyzed by these biological engines are hard to decode due to the absence of a scalable, metabolically resolved, annotation software. Here, we present DRAM (Distilled and Refined Annotation of Metabolism), a framework to translate the deluge of microbiome-based genomic information into a catalog of microbial traits. To demonstrate the applicability of DRAM across metabolically diverse genomes, we evaluated DRAM performance on a defined, in silico soil community and previously published human gut metagenomes. We show that DRAM accurately assigned microbial contributions to geochemical cycles and automated the partitioning of gut microbial carbohydrate metabolism at substrate levels. DRAM-v, the viral mode of DRAM, established rules to identify virally-encoded auxiliary metabolic genes (AMGs), resulting in the metabolic categorization of thousands of putative AMGs from soils and guts. Together DRAM and DRAM-v provide critical metabolic profiling capabilities that decipher mechanisms underpinning microbiome function.
Brief: Extraction of information from microbial genomes.
The most popular RNA secondary structure prediction programs utilize free energy (ΔG°37) minimization and rely upon thermodynamic parameters from the nearest neighbor (NN) model. Experimental parameters are derived from a series of optical melting experiments; however, acquiring enough melt data to derive accurate NN parameters with modified base pairs is expensive and time consuming. Given the multitude of known natural modifications and the continuing use and development of unnatural nucleotides, experimentally characterizing all modified NNs is impractical. This dilemma necessitates a computational model that can predict NN thermodynamics where experimental data is scarce or absent. Here, we present a combined molecular dynamics/quantum mechanics protocol that accurately predicts experimental NN ΔG°37 parameters for modified nucleotides with neighboring Watson–Crick base pairs. NN predictions for Watson-Crick and modified base pairs yielded an overall RMSD of 0.32 kcal/mol when compared with experimentally derived parameters. NN predictions involving modified bases without experimental parameters (N6-methyladenosine, 2-aminopurineriboside, and 5-methylcytidine) demonstrated promising agreement with available experimental melt data. This procedure not only yields accurate NN ΔG°37 predictions but also quantifies stacking and hydrogen bonding differences between modified NNs and their canonical counterparts, allowing investigators to identify energetic differences and providing insight into sources of (de)stabilization from nucleotide modifications.
Brief: RNA secondary structure prediction.
The differentiation and regeneration of skeletal muscle from myoblasts to myotubes involves myogenic transcription factors, such as myocardin-related transcription factor A (MRTF-A) and serum response factor (SRF). In addition, post-transcriptional regulation by miRNAs is required during myogenesis. Here, we provide evidence for novel mechanisms regulating MRTF-A during myogenic differentiation. Endogenous MRTF-A protein abundance and activity decreased during C2C12 differentiation, which was attributable to miRNA-directed inhibition. Conversely, overexpression of MRTF-A impaired differentiation and myosin expression. Applying miRNA trapping by RNA affinity purification (miTRAP), we identified miRNAs which directly regulate MRTF-A via its 3′UTR, including miR-1a-3p, miR-206-3p, miR-24-3p and miR-486-5p. These miRNAs were upregulated during differentiation and specifically recruited to the 3′UTR of MRTF-A. Concomitantly, Ago2 recruitment to the MRTF-A 3′UTR was considerably increased, whereas Dicer1 depletion or 3′UTR deletion elevated MRTF-A and inhibited differentiation. MRTF-A protein expression was inhibited by ectopic miRNA expression in murine C2C12 and primary human myoblasts. 3′UTR reporter activity diminished upon differentiation or miRNA expression, whereas deletion of the predicted binding sites reversed these effects. Furthermore, TGF-β abolished MRTF-A reduction and decreased miR-486-5p expression. Our findings implicate miR-24-3p and miR-486-5p in the repression of MRTF-A and suggest a complex network of transcriptional and post-transcriptional mechanisms regulating myogenesis.
Brief: Role of miRNAs in the regulation of myogenic differentiation (experimental article).
Infertility is a complex multifactorial disease that affects up to 10% of couples across the world. However, many mechanisms of infertility remain unclear due to the lack of studies based on systematic knowledge, leading to ineffective treatment and/or transmission of genetic defects to offspring. Here, we developed an infertility disease database to provide a comprehensive resource featuring various factors involved in infertility. Features in the current IDDB version were manually curated as follows: (i) a total of 307 infertility-associated genes in human and 1348 genes associated with reproductive disorder in 9 model organisms; (ii) a total of 202 chromosomal abnormalities leading to human infertility, including aneuploidies and structural variants; and (iii) a total of 2078 pathogenic variants from infertility patients’ samples across 60 different diseases causing infertility. Additionally, the characteristics of clinically diagnosed infertility patients (i.e. causative variants, laboratory indexes and clinical manifestations) were collected. To the best of our knowledge, the IDDB is the first infertility database serving as a systematic resource for biologists to decipher infertility mechanisms and for clinicians to achieve better diagnosis/treatment of patients from disease phenotype to genetic factors. The IDDB is freely available at http://mdl.shsmu.edu.cn/IDDB/.
Brief: Infertility disease database, including reproduction-related gene variants.
PULs (polysaccharide utilization loci) are discrete gene clusters of CAZymes (Carbohydrate Active EnZymes) and other genes that work together to digest and utilize carbohydrate substrates. While PULs have been extensively characterized in Bacteroidetes, there exist PULs from other bacterial phyla, as well as archaea and metagenomes, that remain to be catalogued in a database for efficient retrieval. We have developed an online database dbCAN-PUL (http://bcb.unl.edu/dbCAN_PUL/) to display experimentally verified CAZyme-containing PULs from literature with pertinent metadata, sequences, and annotation. Compared to other online CAZyme and PUL resources, dbCAN-PUL has the following new features: (i) Batch download of PUL data by target substrate, species/genome, genus, or experimental characterization method; (ii) Annotation for each PUL that displays associated metadata such as substrate(s), experimental characterization method(s) and protein sequence information, (iii) Links to external annotation pages for CAZymes (CAZy), transporters (UniProt) and other genes, (iv) Display of homologous gene clusters in GenBank sequences via integrated MultiGeneBlast tool and (v) An integrated BLASTX service available for users to query their sequences against PUL proteins in dbCAN-PUL. With these features, dbCAN-PUL will be an important repository for CAZyme and PUL research, complementing our other web servers and databases (dbCAN2, dbCAN-seq).
Brief: Carbohydrate-active enzyme (CAZyme) sequences and annotation.
Many studies have indicated that non-coding RNA (ncRNA) dysfunction is closely related to numerous diseases. Recently, accumulated ncRNA–disease associations have made related databases insufficient to meet the demands of biomedical research. The constant updating of ncRNA–disease resources has become essential. Here, we have updated the mammal ncRNA–disease repository (MNDR, http://www.rna-society.org/mndr/) to version 3.0, containing more than one million entries, four-fold increment in data compared to the previous version. Experimental and predicted circRNA–disease associations have been integrated, increasing the number of categories of ncRNAs to five, and the number of mammalian species to 11. Moreover, ncRNA–disease related drug annotations and associations, as well as ncRNA subcellular localizations and interactions, were added. In addition, three ncRNA–disease (miRNA/lncRNA/circRNA) prediction tools were provided, and the website was also optimized, making it more practical and user-friendly. In summary, MNDR v3.0 will be a valuable resource for the investigation of disease mechanisms and clinical treatment strategies.
Brief: ncRNA database providing ncRNA associations with diseases and drugs, plus ncRNA subcellular localization and interactions.
Although cancer is the leading cause of disease-related mortality in children, the relative rarity of pediatric cancers poses a significant challenge for developing novel therapeutics to further improve prognosis. Patient-derived xenograft (PDX) models, which are usually developed from high-risk tumors, are a useful platform to study molecular driver events, identify biomarkers and prioritize therapeutic agents. Here, we develop PDX for Childhood Cancer Therapeutics (PCAT), a new integrated portal for pediatric cancer PDX models. Distinct from previously reported PDX portals, PCAT is focused on pediatric cancer models and provides intuitive interfaces for querying and data mining. The current release comprises 324 models and their associated clinical and genomic data, including gene expression, mutation and copy number alteration. Importantly, PCAT curates preclinical testing results for 68 models and 79 therapeutic agents manually collected from individual agent testing studies published since 2008. To facilitate comparisons of patterns between patient tumors and PDX models, PCAT curates clinical and molecular data of patient tumors from the TARGET project. In addition, PCAT provides access to gene fusions identified in nearly 1000 TARGET samples. PCAT was built using R-shiny and MySQL. The portal can be accessed at http://pcat.zhenglab.info or http://www.pedtranscriptome.org.
Brief: Pediatric cancer PDX database with 324 models and their clinical, gene expression and mutation data, plus drug treatment data.
Brief: Database of marine microbial sequencing data and the physicochemical properties of the corresponding water samples.
Housekeeping (HK) genes are constitutively expressed genes that are required for the maintenance of basic cellular functions. Despite their importance in the calibration of gene expression, as well as the understanding of many genomic and evolutionary features, important discrepancies have been observed in studies that previously identified these genes. Here, we present Housekeeping and Reference Transcript Atlas (HRT Atlas v1.0, www.housekeeping.unicamp.br), a web-based database which addresses some of the previously observed limitations in the identification of these genes, and offers a more accurate database of human and mouse HK genes and transcripts. The database was generated by mining massive human and mouse RNA-seq data sets, including 11 281 and 507 high-quality RNA-seq samples from 52 human non-disease tissues/cells and 14 healthy tissues/cells of C57BL/6 wild type mouse, respectively. Users can visualize the expression and download lists of 2158 human HK transcripts from 2176 HK genes and 3024 mouse HK transcripts from 3277 mouse HK genes. HRT Atlas also offers the most stable and suitable tissue selective candidate reference transcripts for normalization of qPCR experiments. Specific primers and predicted modifiers of gene expression for some of these HK transcripts are also proposed. HRT Atlas has also been integrated with a regulatory elements resource from Epiregio server.
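Brief: Database of mouse and human housekeeping genes, mined from RNA-seq data. Can be used to look up reference genes suited to specific tissues, and includes primer sequences.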
PathDIP was introduced to increase proteome coverage of literature-curated human pathway databases. PathDIP 4 now integrates 24 major databases. To further reduce the number of proteins with no curated pathway annotation, pathDIP integrates pathways with physical protein–protein interactions (PPIs) to predict significant physical associations between proteins and curated pathways. For human, it provides pathway annotations for 5366 pathway orphans. Integrated pathway annotation now includes six model organisms and ten domesticated animals. A total of 6401 core and ortholog pathways have been curated from the literature or by annotating orthologs of human proteins in the literature-curated pathways. Extended pathways are the result of combining these pathways with protein-pathway associations that are predicted using organism-specific PPIs. Extended pathways expand proteome coverage from 81 088 to 120 621 proteins, making pathDIP 4 the largest publicly available pathway database for these organisms and providing a necessary platform for comprehensive pathway-enrichment analysis. PathDIP 4 users can customize their search and analysis by selecting organism, identifier and subset of pathways. Enrichment results and detailed annotations for input list can be obtained in different formats and views. To support automated bioinformatics workflows, Java, R and Python APIs are available for batch pathway annotation and enrichment analysis. PathDIP 4 is publicly available at http://ophid.utoronto.ca/pathDIP.
Brief: Pathway annotation dataset covering six model organisms and ten domesticated animals.
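For orientation, pathway-enrichment analysis of the kind mentioned in the abstract is commonly framed as an over-representation test. The sketch below computes an upper-tail hypergeometric p-value with the standard library only. It is a generic illustration: the abstract does not describe pathDIP's own statistics, and the function name `enrichment_pvalue` and the toy counts are assumptions.

```python
from math import comb


def enrichment_pvalue(n_background, n_pathway, n_query, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap): the chance
    of seeing at least n_overlap pathway members in a random query list of
    size n_query drawn from n_background proteins."""
    upper = min(n_pathway, n_query)
    total = comb(n_background, n_query)
    hits = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return hits / total


# Toy counts: 10 background proteins, 4 in the pathway,
# 3 in the query list, 2 of which fall in the pathway.
p_value = enrichment_pvalue(10, 4, 3, 2)
print(p_value)
```

In practice one p-value is computed per pathway and the set is corrected for multiple testing; extending pathway membership with predicted associations enlarges `n_pathway` and can therefore change which pathways reach significance.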
Genomics, Proteomics & Bioinformatics
Volume 17, Issue 5, Pages 473-550 (October 2019)
Brief: Deep learning prediction of protein-RNA binding.
Accurate identification of compound–protein interactions (CPIs) in silico may deepen our understanding of the underlying mechanisms of drug action and thus remarkably facilitate drug discovery and development. Conventional similarity- or docking-based computational methods for predicting CPIs rarely exploit latent features from currently available large-scale unlabeled compound and protein data and often limit their usage to relatively small-scale datasets. In the present study, we propose DeepCPI, a novel general and scalable computational framework that combines effective feature embedding (a technique of representation learning) with powerful deep learning methods to accurately predict CPIs at a large scale. DeepCPI automatically learns the implicit yet expressive low-dimensional features of compounds and proteins from a massive amount of unlabeled data. Evaluations of the measured CPIs in large-scale databases, such as ChEMBL and BindingDB, as well as of the known drug–target interactions from DrugBank, demonstrated the superior predictive performance of DeepCPI. Furthermore, several interactions among small-molecule compounds and three G protein-coupled receptor targets (glucagon-like peptide-1 receptor, glucagon receptor, and vasoactive intestinal peptide receptor) predicted using DeepCPI were experimentally validated. The present study suggests that DeepCPI is a useful and powerful tool for drug discovery and repositioning. The source code of DeepCPI can be downloaded from https://github.com/FangpingWan/DeepCPI.
Brief: Predicts compound-protein interactions using feature embedding and deep learning, to guide drug screening.
Bioinformatics
Volume 36, Issue 12, 15 June 2020
Next-generation sequencing technologies have accelerated the discovery of single nucleotide variants in the human genome, stimulating the development of predictors for classifying which of these variants are likely functional in disease, and which neutral. Recently, we proposed CScape, a method for discriminating between cancer driver mutations and presumed benign variants. For the neutral class, this method relied on benign germline variants found in the 1000 Genomes Project database. Discrimination could, therefore, be influenced by the distinction of germline versus somatic, rather than neutral versus disease driver. This motivates this article in which we consider predictive discrimination between recurrent and rare somatic single point mutations based solely on using cancer data, and the distinction between these two somatic classes and germline single point mutations.
Brief: Prediction of driver mutations.
We studied the problem of discriminating early- and late-stage tumors of several cancers using genomic information while enforcing interpretability on the solutions. To this end, we developed a multitask multiple kernel learning (MTMKL) method with a co-clustering step based on a cutting-plane algorithm to identify the relationships between the input tasks and kernels. We tested our algorithm on 15 cancer cohorts and observed that, in most cases, MTMKL outperforms other algorithms (including random forests, support vector machine and single-task multiple kernel learning) in terms of predictive power. Using the aggregate results from multiple replications, we also derived similarity matrices between cancer cohorts, which are, in many cases, in agreement with available relationships reported in the relevant literature.
Brief: Discriminates early- from late-stage cancers from gene expression using a multiple kernel learning method.
We describe a new iteration of ICGS that outperforms state-of-the-art scRNA-Seq detection workflows when applied to well-established benchmarks. This approach combines multiple complementary subtype detection methods (HOPACH, sparse non-negative matrix factorization, cluster ‘fitness’, support vector machine) to resolve rare and common cell-states, while minimizing differences due to donor or batch effects. Using data from multiple cell atlases, we show that the PageRank algorithm effectively downsamples ultra-large scRNA-Seq datasets, without losing extremely rare or transcriptionally similar yet distinct cell types and while recovering novel transcriptionally distinct cell populations. We believe this new approach holds tremendous promise in reproducibly resolving hidden cell populations in complex datasets.
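Brief: Identifies single-cell subpopulation clusters through a multi-step pipeline (non-negative matrix factorization, clustering, support vector machine, etc.).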
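The PageRank-based downsampling mentioned in the abstract can be pictured with the sketch below, which runs a plain power-iteration PageRank on a small cell-cell graph and keeps the highest-ranked cells. How the graph is built and how cells are chosen from the ranking are not stated in the abstract, so the adjacency dictionary, the `downsample` helper and the "keep the top-ranked cells" rule are all assumptions.

```python
def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank on an adjacency dict {node: [neighbours]}."""
    nodes = sorted(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = adj[v]
            if out:
                share = damping * rank[v] / len(out)
                for u in out:
                    new[u] += share
            else:
                # dangling node: spread its rank evenly over all nodes
                for u in nodes:
                    new[u] += damping * rank[v] / n
        rank = new
    return rank


def downsample(adj, keep):
    """Return the `keep` cells with the highest PageRank score."""
    rank = pagerank(adj)
    return sorted(adj, key=lambda v: rank[v], reverse=True)[:keep]


# Toy neighbour graph of five cells (hypothetical).
adj = {
    "c0": ["c1", "c2", "c3", "c4"],
    "c1": ["c0", "c2"],
    "c2": ["c0", "c1"],
    "c3": ["c0"],
    "c4": ["c0"],
}
kept = downsample(adj, keep=3)
print(kept)
```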
Here, we present a Bayesian ridge regression-based method (B-GEX) to infer gene expression profiles of multiple tissues from blood gene expression profile. For each gene in a tissue, a low-dimensional feature vector was extracted from whole blood gene expression profile by feature selection. We used GTEx RNAseq data of 16 tissues to train inference models to capture the cross-tissue expression correlations between each target gene in a tissue and its preselected feature genes in peripheral blood. We compared B-GEX with least square regression, LASSO regression and ridge regression. B-GEX outperforms the other three models in most tissues in terms of mean absolute error, Pearson correlation coefficient and root-mean-squared error. Moreover, B-GEX infers expression level of tissue-specific genes as well as those of non-tissue-specific genes in all tissues. Unlike previous methods, which require genomic features or gene expression profiles of multiple tissues, our model only requires whole blood expression profile as input. B-GEX helps gain insights into gene expressions of uncollected tissues from more accessible data of blood.
Brief: Infers expression profiles of multiple tissues from the blood gene expression profile using Bayesian ridge regression.
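The three evaluation metrics named in the abstract (mean absolute error, Pearson correlation coefficient and root-mean-squared error) can be computed as in the sketch below. The numbers are made-up expression values for one gene across four samples, used only to show the calculation; the regression model itself is not reproduced here.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and inferred values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between observed and inferred values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Made-up observed vs. inferred expression of one gene in four samples.
observed = [1.0, 2.0, 3.0, 4.0]
inferred = [1.5, 2.5, 2.5, 4.5]

mae = mean_absolute_error(observed, inferred)
rmse = root_mean_squared_error(observed, inferred)
r = pearson_r(observed, inferred)
print(mae, rmse, r)
```

MAE and RMSE measure how far the inferred values are from the observed ones, while Pearson correlation measures whether the ordering across samples is preserved, which is why comparisons of this kind usually report all three.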
Gene network inference and master regulator analysis (MRA) have been widely adopted to define specific transcriptional perturbations from gene expression signatures. Several tools exist to perform such analyses but most require a computer cluster or large amounts of RAM to be executed.
We developed corto, a fast and lightweight R package to infer gene networks and perform MRA from gene expression data, with optional corrections for copy-number variations and able to run on signatures generated from RNA-Seq or ATAC-Seq data. We extensively benchmarked it to infer context-specific gene networks in 39 human tumor and 27 normal tissue datasets.
Brief: Fast inference of regulatory networks from gene expression data.
Complex diseases are due to the dense interactions of many disease-associated factors that dysregulate genes that in turn form the so-called disease modules, which have been shown to be a powerful concept for understanding pathological mechanisms. There exist many disease module inference methods that rely on somewhat different assumptions, but there is still no gold standard or best-performing method. Hence, there is a need for combining these methods to generate robust disease modules.
We developed MODule IdentiFIER (MODifieR), an ensemble R package of nine disease module inference methods from transcriptomics networks. MODifieR uses standardized input and output allowing the possibility to combine individual modules generated from these methods into more robust disease-specific modules, contributing to a better understanding of complex diseases.
Brief: Infers disease-related networks (modules) from an expression matrix.
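One simple way to combine modules from several inference methods into a more robust module is to keep genes supported by a minimum number of methods. The sketch below shows that idea only; the abstract does not say how MODifieR combines modules, and the method names, gene names and `min_support` threshold are placeholders.

```python
from collections import Counter


def consensus_module(modules, min_support=2):
    """Keep genes that appear in at least `min_support` of the modules.
    `modules` maps a method name to the list of genes it returned."""
    counts = Counter(
        gene for genes in modules.values() for gene in set(genes)
    )
    return sorted(gene for gene, c in counts.items() if c >= min_support)


# Placeholder output of three hypothetical module inference methods.
modules = {
    "method_a": ["GENE1", "GENE2", "GENE3"],
    "method_b": ["GENE2", "GENE3", "GENE4"],
    "method_c": ["GENE3", "GENE5"],
}
consensus = consensus_module(modules, min_support=2)
print(consensus)
```

Raising `min_support` trades sensitivity for robustness: with all three methods required, only genes found by every method remain.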