2021-06-16

1 Introduction
2 GDCRNATools package installation
3 Quick start
- 3.1 Data preparation
  - 3.1.1 Normalization of HTSeq-Counts data
  - 3.1.2 Parse metadata
- 3.2 ceRNAs network analysis
4 Case study: TCGA-CHOL
5 sessionInfo
6 References

1 Introduction

GDCRNATools is an R package that provides a standard, easy-to-use, and comprehensive pipeline for downloading, organizing, and integratively analyzing RNA expression data from the GDC portal, with an emphasis on deciphering the lncRNA-mRNA related ceRNAs regulatory network in cancer.

Competing endogenous RNAs (ceRNAs) are RNAs that indirectly regulate other transcripts by competing for shared miRNAs. Although only a fraction of long non-coding RNAs (lncRNAs) have been functionally characterized, increasing evidence shows that lncRNAs harboring multiple miRNA response elements (MREs) can act as ceRNAs to sequester miRNA activity and thus reduce the inhibition that miRNAs exert on their targets. Deregulation of the ceRNAs network may lead to human diseases.

The Genomic Data Commons (GDC) maintains standardized genomic, clinical, and biospecimen data from National Cancer Institute (NCI) programs including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research To Generate Effective Treatments (TARGET). It also accepts high-quality datasets from non-NCI supported cancer research programs, such as genomic data from Foundation Medicine.

Many analyses can be performed using GDCRNATools, including differential gene expression analysis (limma, edgeR, and DESeq2), univariate survival analysis (CoxPH and KM), competing endogenous RNA network analysis (hypergeometric test, Pearson correlation analysis, regulation similarity analysis, and sensitivity Pearson partial correlation), and functional enrichment analysis (GO, KEGG, DO). Besides routine visualization methods such as volcano plots, scatter plots, and bubble plots, three simple shiny apps are provided in GDCRNATools that allow users to visualize the results on a local webpage. All the figures are plotted with the ggplot2 package unless otherwise specified.

This user-friendly package allows researchers to perform the analysis by simply running a few functions and to easily integrate their own pipelines, such as molecular subtype classification, weighted correlation network analysis (WGCNA), and TF-miRNA co-regulatory network analysis, into the workflow. This could open a door to accelerate the study of crosstalk among different classes of RNAs and their regulatory relationships in cancer.

2 GDCRNATools package installation

The R software for running GDCRNATools can be downloaded from The Comprehensive R Archive Network (CRAN). The GDCRNATools package can be installed from Bioconductor.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

#BiocManager::install("GDCRNATools")

library(GDCRNATools)

3 Quick start

In GDCRNATools, some functions are built for users to download and process GDC data efficiently. Users can also use their own data processed by other tools such as the UCSC Xena GDC hub, TCGAbiolinks, or TCGA-Assembler.

Here we use a small dataset to show the most basic steps of the ceRNAs network analysis. More detailed instructions for each step are given in the Case study section.

3.1 Data preparation

3.1.1 Normalization of HTSeq-Counts data

library(DT)

### load RNA counts data
data(rnaCounts)

### load miRNAs counts data
data(mirCounts)

####### Normalization of RNAseq data #######
rnaExpr <- gdcVoomNormalization(counts = rnaCounts, filter = FALSE)

####### Normalization of miRNAs data #######
mirExpr <- gdcVoomNormalization(counts = mirCounts, filter = FALSE)

3.1.2 Parse metadata

####### Parse and filter RNAseq metadata #######
metaMatrix.RNA <- gdcParseMetadata(project.id = 'TCGA-CHOL',
                                   data.type  = 'RNAseq', 
                                   write.meta = FALSE)

metaMatrix.RNA <- gdcFilterDuplicate(metaMatrix.RNA)
metaMatrix.RNA <- gdcFilterSampleType(metaMatrix.RNA)
metaMatrix.RNA[1:5,]

##                                                             file_name
## TCGA-3X-AAV9-01A 725eaa94-5221-4c22-bced-0c36c10c2c3b.htseq.counts.gz
## TCGA-3X-AAVA-01A b6a2c03a-c8ad-41e9-8a19-8f5ac53cae9f.htseq.counts.gz
## TCGA-3X-AAVB-01A c2765336-c804-4fd2-b45a-e75af2a91954.htseq.counts.gz
## TCGA-3X-AAVC-01A 8b20cba8-9fd5-4d56-bd02-c6f4a62767e8.htseq.counts.gz
## TCGA-3X-AAVE-01A 4082f7d5-5656-476a-9aaf-36f7cea0ac55.htseq.counts.gz
##                                               file_id      patient
## TCGA-3X-AAV9-01A 85bc7f81-51fb-4446-b12d-8741eef4acee TCGA-3X-AAV9
## TCGA-3X-AAVA-01A 42b8d463-6209-4ea0-bb01-8023a1302fa0 TCGA-3X-AAVA
## TCGA-3X-AAVB-01A 6e2031e9-df75-48df-b094-8dc6be89bf8b TCGA-3X-AAVB
## TCGA-3X-AAVC-01A 19e8fd21-f6c8-49b0-aa76-109eef46c2e9 TCGA-3X-AAVC
## TCGA-3X-AAVE-01A 1ace0df3-9837-467e-85de-c938efda8fc8 TCGA-3X-AAVE
##                           sample     submitter_id          entity_submitter_id
## TCGA-3X-AAV9-01A TCGA-3X-AAV9-01 TCGA-3X-AAV9-01A TCGA-3X-AAV9-01A-72R-A41I-07
## TCGA-3X-AAVA-01A TCGA-3X-AAVA-01 TCGA-3X-AAVA-01A TCGA-3X-AAVA-01A-11R-A41I-07
## TCGA-3X-AAVB-01A TCGA-3X-AAVB-01 TCGA-3X-AAVB-01A TCGA-3X-AAVB-01A-31R-A41I-07
## TCGA-3X-AAVC-01A TCGA-3X-AAVC-01 TCGA-3X-AAVC-01A TCGA-3X-AAVC-01A-21R-A41I-07
## TCGA-3X-AAVE-01A TCGA-3X-AAVE-01 TCGA-3X-AAVE-01A TCGA-3X-AAVE-01A-11R-A41I-07
##                   sample_type gender age_at_diagnosis tumor_stage tumor_grade
## TCGA-3X-AAV9-01A PrimaryTumor   male            26349      stagei        <NA>
## TCGA-3X-AAVA-01A PrimaryTumor female            18303     stageii        <NA>
## TCGA-3X-AAVB-01A PrimaryTumor female            25819    stageivb        <NA>
## TCGA-3X-AAVC-01A PrimaryTumor female            26493      stagei        <NA>
## TCGA-3X-AAVE-01A PrimaryTumor   male            21943     stageii        <NA>
##                  days_to_death days_to_last_follow_up vital_status project_id
## TCGA-3X-AAV9-01A           339                     NA         Dead  TCGA-CHOL
## TCGA-3X-AAVA-01A           445                     NA         Dead  TCGA-CHOL
## TCGA-3X-AAVB-01A            NA                    402        Alive  TCGA-CHOL
## TCGA-3X-AAVC-01A            NA                    709        Alive  TCGA-CHOL
## TCGA-3X-AAVE-01A            NA                    650        Alive  TCGA-CHOL

3.2 ceRNAs network analysis

3.2.1 Identification of differentially expressed genes (DEGs)

DEGAll <- gdcDEAnalysis(counts     = rnaCounts, 
                        group      = metaMatrix.RNA$sample_type, 
                        comparison = 'PrimaryTumor-SolidTissueNormal', 
                        method     = 'limma')
DEGAll[1:5,]

##                 symbol          group     logFC   AveExpr         t
## ENSG00000143257  NR1I3 protein_coding -6.916825  7.023129 -17.29086
## ENSG00000205707 ETFRF1 protein_coding -2.492182  9.515997 -16.06753
## ENSG00000134532   SOX5 protein_coding -4.871118  6.228227 -15.03589
## ENSG00000141338  ABCA8 protein_coding -5.653794  7.520581 -14.86069
## ENSG00000066583  ISOC1 protein_coding -2.370131 10.466194 -14.56532
##                       PValue          FDR        B
## ENSG00000143257 4.244355e-22 2.419282e-19 40.04288
## ENSG00000205707 8.353256e-21 2.380678e-18 37.19751
## ENSG00000134532 1.168746e-19 2.220617e-17 34.49828
## ENSG00000141338 1.851519e-19 2.638414e-17 34.11581
## ENSG00000066583 4.053959e-19 4.621513e-17 33.35640

### All DEGs
deALL <- gdcDEReport(deg = DEGAll, gene.type = 'all')

### DE long-noncoding
deLNC <- gdcDEReport(deg = DEGAll, gene.type = 'long_non_coding')

### DE protein coding genes
dePC <- gdcDEReport(deg = DEGAll, gene.type = 'protein_coding')

3.2.2 ceRNAs network analysis of DEGs

ceOutput <- gdcCEAnalysis(lnc         = rownames(deLNC), 
                          pc          = rownames(dePC), 
                          lnc.targets = 'starBase', 
                          pc.targets  = 'starBase', 
                          rna.expr    = rnaExpr, 
                          mir.expr    = mirExpr)

## Step 1/3: Hypergenometric test done !

## Step 2/3: Correlation analysis done !
## Step 3/3: Regulation pattern analysis done !

ceOutput[1:5,]

##           lncRNAs           Genes Counts listTotal popHits popTotal
## 1 ENSG00000234456 ENSG00000107864      2         2      95      277
## 2 ENSG00000234456 ENSG00000135111      2         2      24      277
## 3 ENSG00000234456 ENSG00000165672      2         2       8      277
## 4 ENSG00000234456 ENSG00000100934      2         2      20      277
## 5 ENSG00000234456 ENSG00000117500      2         2      28      277
##     foldEnrichment          hyperPValue                          miRNAs
## 1 2.91578947368421    0.116805315753675 hsa-miR-374b-5p,hsa-miR-374a-5p
## 2 11.5416666666667   0.0072202166064982 hsa-miR-374b-5p,hsa-miR-374a-5p
## 3           34.625 0.000732485742688222 hsa-miR-374b-5p,hsa-miR-374a-5p
## 4            13.85  0.00497043896824151 hsa-miR-374b-5p,hsa-miR-374a-5p
## 5 9.89285714285714  0.00988855752629099 hsa-miR-374b-5p,hsa-miR-374a-5p
##         cor    corPValue    regSim          sppc
## 1 0.6737432 1.963579e-07 0.3481546 -7.963190e-03
## 2 0.6467307 7.943945e-07 0.8878253  6.185822e-04
## 3 0.4626116 6.880428e-04 0.4289101  7.057739e-05
## 4 0.7080350 2.665317e-08 0.3733481 -8.430679e-03
## 5 0.6195919 2.836509e-06 0.4051700 -1.232672e-03

3.2.3 Export ceRNAs network to Cytoscape

ceOutput2 <- ceOutput[ceOutput$hyperPValue<0.01 
    & ceOutput$corPValue<0.01 & ceOutput$regSim != 0,]

### Export edges
edges <- gdcExportNetwork(ceNetwork = ceOutput2, net = 'edges')
edges[1:5,]

##           fromNode          toNode altNode1Name
## 1  ENSG00000234456 hsa-miR-374b-5p    MAGI2-AS3
## 2  ENSG00000234456 hsa-miR-374a-5p    MAGI2-AS3
## 47 ENSG00000234741     hsa-miR-137         GAS5
## 50 ENSG00000255717  hsa-miR-377-3p        SNHG1
## 51 ENSG00000255717     hsa-miR-421        SNHG1

### Export nodes
nodes <- gdcExportNetwork(ceNetwork = ceOutput2, net = 'nodes')
nodes[1:5,]

##              gene symbol type numInteractions
## 1 ENSG00000003989 SLC7A2   pc               2
## 2 ENSG00000004799   PDK4   pc               5
## 3 ENSG00000021826   CPS1   pc               3
## 4 ENSG00000047634  SCML1   pc               3
## 5 ENSG00000049246   PER3   pc               3

4 Case study: TCGA-CHOL

In this section, we use the whole dataset of the TCGA-CHOL project as an example to illustrate how GDCRNATools works in detail.

4.1 Data download

Two methods are provided for downloading Gene Expression Quantification (HTSeq-Counts), Isoform Expression Quantification (BCGSC miRNA Profiling), and Clinical (Clinical Supplement) data:

4.1.1 Automatic download

To give users a convenient way to download data, by default we use the API method developed in the GenomicDataCommons package to download data automatically by specifying the data.type and project.id arguments. An alternative method using the gdc-client for automatic download is also provided in case the API method fails.

project <- 'TCGA-CHOL'
rnadir <- paste(project, 'RNAseq', sep='/')
mirdir <- paste(project, 'miRNAs', sep='/')

####### Download RNAseq data #######
gdcRNADownload(project.id     = 'TCGA-CHOL', 
               data.type      = 'RNAseq', 
               write.manifest = FALSE,
               directory      = rnadir)

####### Download miRNAs data #######
gdcRNADownload(project.id     = 'TCGA-CHOL', 
               data.type      = 'miRNAs', 
               write.manifest = FALSE,
               directory      = mirdir)

4.1.2 Manual download

Users can also download data manually by providing the manifest file that is downloaded from the GDC cart.

Step 1: Download the GDC Data Transfer Tool from the GDC website
Step 2: Add data to the GDC cart, then download the manifest file and metadata of the cart
Step 3: Download data using the gdcRNADownload() function by providing the manifest file

4.2 Data organization and DE analysis

4.2.1 Parse metadata

Metadata can be parsed either by providing the metadata file (.json) downloaded in the data download step, or by specifying project.id and data.type in the gdcParseMetadata() function. The parsed metadata contains information about the data in the manifest file, which facilitates data organization, and basic clinical information of patients such as age, stage, and gender for data analysis.

gdcFilterDuplicate() keeps only one sample if a sample has been sequenced more than once. gdcFilterSampleType() filters out samples that are neither Primary Tumor (code: 01) nor Solid Tissue Normal (code: 11).

####### Parse RNAseq metadata #######
metaMatrix.RNA <- gdcParseMetadata(project.id = 'TCGA-CHOL',
                                   data.type  = 'RNAseq', 
                                   write.meta = FALSE)

####### Filter duplicated samples in RNAseq metadata #######
metaMatrix.RNA <- gdcFilterDuplicate(metaMatrix.RNA)

####### Filter non-Primary Tumor and non-Solid Tissue Normal samples in RNAseq metadata #######
metaMatrix.RNA <- gdcFilterSampleType(metaMatrix.RNA)

####### Parse miRNAs metadata #######
metaMatrix.MIR <- gdcParseMetadata(project.id = 'TCGA-CHOL',
                                   data.type  = 'miRNAs', 
                                   write.meta = FALSE)

####### Filter duplicated samples in miRNAs metadata #######
metaMatrix.MIR <- gdcFilterDuplicate(metaMatrix.MIR)

####### Filter non-Primary Tumor and non-Solid Tissue Normal samples in miRNAs metadata #######
metaMatrix.MIR <- gdcFilterSampleType(metaMatrix.MIR)
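The sample type codes mentioned above are part of the TCGA barcode. As a rough base R illustration (not the package's internal code), the sketch below reads the two-digit code from characters 14-15 of the entity_submitter_id values shown in the Quick start output, keeps codes 01 and 11, and then keeps one entry per sample ID. The names `barcodes`, `sample_code`, and `kept` are made up for this example, and the last two barcodes are hypothetical entries added only to show the filter at work.

```r
# First three barcodes are copied from the Quick start output;
# the last two are hypothetical and exist only for illustration.
barcodes <- c("TCGA-3X-AAV9-01A-72R-A41I-07",
              "TCGA-3X-AAVA-01A-11R-A41I-07",
              "TCGA-3X-AAVB-01A-31R-A41I-07",
              "TCGA-XX-0000-11A-11R-0000-07",  # hypothetical Solid Tissue Normal
              "TCGA-XX-0001-02A-11R-0000-07")  # hypothetical code 02, to be dropped

# Characters 14-15 of a TCGA barcode hold the sample type code
sample_code <- substr(barcodes, 14, 15)

# Keep only Primary Tumor (01) and Solid Tissue Normal (11)
kept <- barcodes[sample_code %in% c("01", "11")]

# The first 16 characters identify the sample (e.g. TCGA-3X-AAV9-01A);
# keep the first entry when a sample appears more than once
kept <- kept[!duplicated(substr(kept, 1, 16))]

data.frame(sample = substr(kept, 1, 16), code = substr(kept, 14, 15))
```

Which replicate to keep when a sample was sequenced more than once is a design choice; this sketch simply keeps the first one it encounters.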

4.2.2 Merge raw counts data

gdcRNAMerge() merges the raw counts data of RNAseq into a single expression matrix whose rows are Ensembl IDs and whose columns are samples. Total read counts for the 5p and 3p strands of miRNAs are processed from the isoform quantification files and then merged into a single expression matrix whose rows are miRBase v21 identifiers and whose columns are samples.

####### Merge RNAseq data #######
rnaCounts <- gdcRNAMerge(metadata  = metaMatrix.RNA, 
                         path      = rnadir, 
                         data.type = 'RNAseq')

####### Merge miRNAs data #######
mirCounts <- gdcRNAMerge(metadata  = metaMatrix.MIR,
                         path      = mirdir,
                         data.type = 'miRNAs')

4.2.3 TMM normalization and voom transformation

By running the gdcVoomNormalization() function, raw counts data are normalized by the TMM method implemented in edgeR and further transformed by the voom method provided in limma. Low-expression genes (logcpm < 1 in more than half of the samples) are filtered out by default. All the genes can be kept by setting filter=FALSE in gdcVoomNormalization().

####### Normalization of RNAseq data #######
rnaExpr <- gdcVoomNormalization(counts = rnaCounts, filter = FALSE)

####### Normalization of miRNAs data #######
mirExpr <- gdcVoomNormalization(counts = mirCounts, filter = FALSE)
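To make the low-expression rule concrete, here is a minimal base R sketch of a filter that drops genes with logcpm < 1 in more than half of the samples. It is an approximation under stated assumptions, not the package's implementation: it uses plain log2 counts per million without TMM scaling factors or prior counts, and the count matrix is a hypothetical toy example.

```r
# Toy counts: 3 genes x 4 samples (hypothetical values, filled column by column)
counts <- matrix(c(500000, 499999, 1,
                   500000, 499999, 1,
                   500000, 499999, 1,
                   500000, 499999, 1),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 paste0("sample", 1:4)))

# log2 counts per million (assumption: no TMM factors, no prior count)
lib_size <- colSums(counts)
log_cpm  <- log2(sweep(counts, 2, lib_size, "/") * 1e6)

# Number of samples in which each gene falls below the threshold
low_in <- rowSums(log_cpm < 1)

# Drop genes that are low in more than half of the samples
filtered <- counts[low_in <= ncol(counts) / 2, , drop = FALSE]
rownames(filtered)
```

Here geneC has 1 count per million in every sample, so it is removed, while geneA and geneB are retained.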

4.2.4 Differential gene expression analysis

Usually, people are interested in genes that are differentially expressed between groups (e.g., Primary Tumor vs. Solid Tissue Normal). gdcDEAnalysis(), a convenience wrapper, provides three widely used methods, limma, edgeR, and DESeq2, to identify differentially expressed genes (DEGs) or miRNAs between any two groups defined by users. Note that DESeq2 may be slow with a single core; multiple cores can be specified with the nCore argument when DESeq2 is in use. Users are encouraged to consult the vignette of each method for more detailed information.

DEGAll <- gdcDEAnalysis(counts     = rnaCounts, 
                        group      = metaMatrix.RNA$sample_type, 
                        comparison = 'PrimaryTumor-SolidTissueNormal', 
                        method     = 'limma')

All DEGs, DE long non-coding genes, DE protein coding genes, and DE miRNAs can be reported separately by setting the gene.type argument in gdcDEReport(). Gene symbols and biotypes based on the Ensembl 90 annotation are reported in the output.

data(DEGAll)

### All DEGs
deALL <- gdcDEReport(deg = DEGAll, gene.type = 'all')

### DE long-noncoding
deLNC <- gdcDEReport(deg = DEGAll, gene.type = 'long_non_coding')

### DE protein coding genes
dePC <- gdcDEReport(deg = DEGAll, gene.type = 'protein_coding')

4.3 Competing endogenous RNAs network analysis

Three criteria are used to determine the competing endogenous interactions between lncRNA-mRNA pairs:

The lncRNA and mRNA must share a significant number of miRNAs
Expression of lncRNA and mRNA must be positively correlated
Those common miRNAs should play similar roles in regulating the expression of lncRNA and mRNA

4.3.1 Hypergeometric test

A hypergeometric test is performed to test whether a lncRNA and an mRNA share a significant number of miRNAs.

A newly developed algorithm, spongeScan, is used to predict MREs in lncRNAs acting as ceRNAs. Databases such as starBase v2.0, miRcode, and mirTarBase release 7.0 are used to collect predicted and experimentally validated miRNA-mRNA and/or miRNA-lncRNA interactions. Gene IDs in these databases are updated to the Ensembl 90 annotation of the human genome, and miRNA names are updated to miRBase 21 identifiers. Users can also provide their own datasets of miRNA-lncRNA and miRNA-mRNA interactions.

The figure and equation below illustrate how the hypergeometric test works.

image.png

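$$p = 1 - \sum_{k=0}^{m-1} \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$

Here m is the number of shared miRNAs, N is the total number of miRNAs in the database, n is the number of miRNAs targeting the lncRNA, and K is the number of miRNAs targeting the protein coding gene, so p is the probability of observing at least m shared miRNAs by chance. The sum stops at m−1 because this is the form that reproduces the hyperPValue column in the Quick start output.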
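As a quick numerical check, the sketch below evaluates the test in base R with phyper(), taking the p-value as the probability of observing at least m shared miRNAs. The inputs are read from row 3 of the Quick start ceOutput table, under the assumption that Counts, listTotal, popHits, and popTotal correspond to m, n, K, and N. The fold enrichment formula is likewise inferred from that table rather than taken from the package source.

```r
# Values taken from row 3 of the Quick start ceOutput table
m <- 2    # Counts:    miRNAs shared by the lncRNA and the gene
n <- 2    # listTotal: miRNAs targeting the lncRNA
K <- 8    # popHits:   miRNAs targeting the protein coding gene
N <- 277  # popTotal:  miRNAs in the database

# Upper-tail probability P(X >= m) for X ~ Hypergeometric(N, K, n)
p <- phyper(m - 1, K, N - K, n, lower.tail = FALSE)

# Fold enrichment, assumed to be (m / n) / (K / N)
fold <- (m / n) / (K / N)

c(hyperPValue = p, foldEnrichment = fold)
```

With these inputs, p equals 28/38226 (about 0.000732) and the fold enrichment is 34.625, in agreement with the hyperPValue and foldEnrichment printed for that row.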

4.3.2 Pearson correlation analysis

The Pearson correlation coefficient measures the strength of a linear association between two variables. miRNAs are negative regulators of gene expression: if more of the shared miRNAs are occupied by a lncRNA, fewer of them will bind to the target mRNA, which increases the expression level of the mRNA. The expression of the lncRNA and the mRNA in a ceRNA pair should therefore be positively correlated.
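A minimal sketch of this step in base R, using cor.test() on hypothetical expression values for one lncRNA-mRNA pair. The one-sided alternative is a choice made for this example because a positive correlation is expected; it is not necessarily what gdcCEAnalysis() uses internally.

```r
# Hypothetical normalized expression of one lncRNA and one mRNA in 8 samples
lnc_expr  <- c(5.1, 6.3, 4.8, 7.2, 6.9, 5.5, 8.0, 6.1)
mrna_expr <- c(7.9, 9.1, 7.5, 10.4, 9.8, 8.2, 11.0, 9.0)

# Pearson correlation with a one-sided test for positive association
res <- cor.test(lnc_expr, mrna_expr,
                method = "pearson", alternative = "greater")

c(cor = unname(res$estimate), corPValue = res$p.value)
```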

4.3.3 Regulation pattern analysis

Two methods are used to measure the regulatory role of miRNAs on the lncRNA and mRNA:

Regulation similarity

We define a measure, the regulation similarity score, to check the similarity between the miRNA-lncRNA expression correlations and the miRNA-mRNA expression correlations.

$$\text{Regulation similarity score} = 1 - \frac{1}{M}\sum_{k=1}^{M}\left[\frac{|corr(m_k,l) - corr(m_k,g)|}{|corr(m_k,l)| + |corr(m_k,g)|}\right]^{M}$$

where M is the total number of shared miRNAs, k indexes the kth shared miRNA, and corr(m_k,l) and corr(m_k,g) represent the Pearson correlation between the kth miRNA and the lncRNA and between the kth miRNA and the mRNA, respectively.
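The score can be sketched as a small base R function. This reads the formula as the bracketed ratio raised to the power M and then averaged over the M shared miRNAs. The function name `reg_sim`, its arguments, and the correlation values are hypothetical and only illustrate the arithmetic.

```r
# Regulation similarity score for one lncRNA-mRNA pair.
# cor_ml: Pearson correlations between each shared miRNA and the lncRNA
# cor_mg: Pearson correlations between each shared miRNA and the mRNA
reg_sim <- function(cor_ml, cor_mg) {
  stopifnot(length(cor_ml) == length(cor_mg))
  M <- length(cor_ml)
  ratio <- abs(cor_ml - cor_mg) / (abs(cor_ml) + abs(cor_mg))
  1 - mean(ratio^M)
}

# Hypothetical correlations for M = 2 shared miRNAs
score_partial   <- reg_sim(c(-0.4, -0.2), c(-0.2, -0.4))  # somewhat different regulation
score_identical <- reg_sim(c(-0.5, -0.3), c(-0.5, -0.3))  # identical regulation

c(partial = score_partial, identical = score_identical)
```

In the first call each ratio is 1/3, so the score is 1 − 1/9 = 8/9; in the second call every ratio is 0 and the score reaches its maximum of 1.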

Sensitivity correlation

Sensitivity correlation is defined by Paci et al. to measure whether the correlation between a lncRNA and an mRNA is mediated by a miRNA in the lncRNA-miRNA-mRNA triplet. We take the average over all triplets formed by a lncRNA-mRNA pair and their shared miRNAs as the sensitivity correlation between the selected lncRNA and mRNA.

$$\text{Sensitivity correlation} = corr(l,g) - \frac{1}{M}\sum_{k=1}^{M}\frac{corr(l,g) - corr(m_k,l)\,corr(m_k,g)}{\sqrt{1 - corr(m_k,l)^2}\,\sqrt{1 - corr(m_k,g)^2}}$$

where M is the total number of shared miRNAs, k indexes the kth shared miRNA, and corr(l,g), corr(m_k,l), and corr(m_k,g) represent the Pearson correlation between the long non-coding RNA and the protein coding gene, between the kth miRNA and the lncRNA, and between the kth miRNA and the mRNA, respectively.
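The same quantity can be sketched in base R: the lncRNA-mRNA correlation minus the average partial correlation given each shared miRNA. The function name `sens_cor`, its arguments, and the input values are hypothetical and serve only to show the arithmetic.

```r
# Sensitivity correlation for one lncRNA-mRNA pair.
# cor_lg: Pearson correlation between the lncRNA and the mRNA
# cor_ml: correlations between each shared miRNA and the lncRNA
# cor_mg: correlations between each shared miRNA and the mRNA
sens_cor <- function(cor_lg, cor_ml, cor_mg) {
  stopifnot(length(cor_ml) == length(cor_mg))
  # Partial correlation of lncRNA and mRNA given each miRNA
  partial <- (cor_lg - cor_ml * cor_mg) /
    (sqrt(1 - cor_ml^2) * sqrt(1 - cor_mg^2))
  cor_lg - mean(partial)
}

# Hypothetical pair with a single shared miRNA
s <- sens_cor(cor_lg = 0.6, cor_ml = -0.5, cor_mg = -0.5)
s
```

For these inputs the partial correlation is 0.35 / 0.75, so the sensitivity correlation is 0.6 − 0.4667 ≈ 0.133. A positive value means that controlling for the miRNA weakens the lncRNA-mRNA correlation.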

4.3.4 ceRNAs network analysis

The hypergeometric test of shared miRNAs, expression correlation analysis of lncRNA-mRNA pair, and regulation pattern analysis of shared miRNAs are all implemented in the gdcCEAnalysis() function.

4.3.4.1 ceRNAs network analysis using internal databases

Users can use the internally incorporated databases of miRNA-mRNA (starBase v2.0, miRcode, and mirTarBase v7.0) and miRNA-lncRNA (starBase v2.0, miRcode, spongeScan) interactions to perform the ceRNAs network analysis.

ceOutput <- gdcCEAnalysis(lnc         = rownames(deLNC), 
                          pc          = rownames(dePC), 
                          lnc.targets = 'starBase', 
                          pc.targets  = 'starBase', 
                          rna.expr    = rnaExpr, 
                          mir.expr    = mirExpr)

4.3.4.2 ceRNAs network analysis using user-provided datasets

gdcCEAnalysis() can also take user-provided miRNA-mRNA and miRNA-lncRNA interaction datasets, such as miRNA-target interactions predicted by TargetScan, miRanda, or Diana Tools, for the ceRNAs network analysis.

### load miRNA-lncRNA interactions
data(lncTarget)

### load miRNA-mRNA interactions
data(pcTarget)
pcTarget[1:3]

## $ENSG00000138829
##  [1] "hsa-miR-200b-3p" "hsa-miR-429"     "hsa-miR-101-3p"  "hsa-miR-137"    
##  [5] "hsa-miR-9-5p"    "hsa-miR-139-5p"  "hsa-miR-200c-3p" "hsa-miR-136-5p" 
##  [9] "hsa-miR-494-3p"  "hsa-miR-495-3p"  "hsa-miR-154-5p"  "hsa-miR-410-3p" 
## [13] "hsa-miR-211-5p"  "hsa-miR-140-5p"  "hsa-miR-22-3p"   "hsa-miR-33b-5p" 
## [17] "hsa-miR-144-3p"  "hsa-miR-133a-3p" "hsa-miR-23a-3p"  "hsa-miR-217"    
## [21] "hsa-miR-33a-5p"  "hsa-miR-218-5p"  "hsa-miR-133b"    "hsa-miR-876-5p" 
## [25] "hsa-miR-204-5p"  "hsa-miR-23b-3p"  "hsa-miR-23c"    
## 
## $ENSG00000113615
##  [1] "hsa-miR-200b-3p" "hsa-miR-200a-3p" "hsa-miR-429"     "hsa-miR-30e-5p" 
##  [5] "hsa-miR-30c-5p"  "hsa-miR-92b-3p"  "hsa-miR-199a-5p" "hsa-miR-181b-5p"
##  [9] "hsa-miR-181a-5p" "hsa-miR-107"     "hsa-miR-200c-3p" "hsa-miR-141-3p" 
## [13] "hsa-miR-26a-5p"  "hsa-miR-16-5p"   "hsa-miR-15a-5p"  "hsa-miR-1297"   
## [17] "hsa-miR-92a-3p"  "hsa-miR-136-5p"  "hsa-miR-300"     "hsa-miR-381-3p" 
## [21] "hsa-miR-539-5p"  "hsa-miR-7-5p"    "hsa-miR-132-3p"  "hsa-miR-212-3p" 
## [25] "hsa-miR-195-5p"  "hsa-miR-497-5p"  "hsa-miR-144-3p"  "hsa-miR-27a-3p" 
## [29] "hsa-miR-23a-3p"  "hsa-miR-181c-5p" "hsa-miR-181d-5p" "hsa-miR-371a-5p"
## [33] "hsa-miR-128-3p"  "hsa-miR-26b-5p"  "hsa-miR-103a-3p" "hsa-miR-15b-5p" 
## [37] "hsa-miR-367-3p"  "hsa-miR-30a-5p"  "hsa-miR-653-5p"  "hsa-miR-25-3p"  
## [41] "hsa-miR-182-5p"  "hsa-miR-183-5p"  "hsa-miR-490-3p"  "hsa-miR-30b-5p" 
## [45] "hsa-miR-30d-5p"  "hsa-miR-31-5p"   "hsa-miR-23b-3p"  "hsa-miR-27b-3p" 
## [49] "hsa-miR-32-5p"   "hsa-miR-199b-5p" "hsa-miR-23c"     "hsa-miR-374b-5p"
## [53] "hsa-miR-374a-5p" "hsa-miR-363-3p"  "hsa-miR-424-5p" 
## 
## $ENSG00000112144
##  [1] "hsa-miR-200b-3p" "hsa-miR-429"     "hsa-miR-30e-5p"  "hsa-miR-30c-5p" 
##  [5] "hsa-miR-101-3p"  "hsa-miR-202-3p"  "hsa-miR-139-5p"  "hsa-miR-200c-3p"
##  [9] "hsa-miR-26a-5p"  "hsa-miR-1297"    "hsa-miR-543"     "hsa-miR-300"    
## [13] "hsa-miR-382-5p"  "hsa-miR-410-3p"  "hsa-miR-144-3p"  "hsa-miR-23a-3p" 
## [17] "hsa-miR-217"     "hsa-miR-26b-5p"  "hsa-miR-218-5p"  "hsa-miR-367-3p" 
## [21] "hsa-miR-30a-5p"  "hsa-miR-383-5p"  "hsa-miR-30b-5p"  "hsa-miR-30d-5p" 
## [25] "hsa-miR-23b-3p"  "hsa-miR-374b-5p" "hsa-miR-374a-5p" "hsa-miR-448"

ceOutput <- gdcCEAnalysis(lnc         = rownames(deLNC), 
                          pc          = rownames(dePC), 
                          lnc.targets = lncTarget, 
                          pc.targets  = pcTarget, 
                          rna.expr    = rnaExpr, 
                          mir.expr    = mirExpr)

4.3.5 Network visualization in Cytoscape

lncRNA-miRNA-mRNA interactions can be reported by gdcExportNetwork() and visualized in Cytoscape. The edges table should be imported as the network, and the nodes table should be imported as the feature table.

ceOutput2 <- ceOutput[ceOutput$hyperPValue<0.01 & 
    ceOutput$corPValue<0.01 & ceOutput$regSim != 0,]

edges <- gdcExportNetwork(ceNetwork = ceOutput2, net = 'edges')
nodes <- gdcExportNetwork(ceNetwork = ceOutput2, net = 'nodes')

write.table(edges, file='edges.txt', sep='\t', quote=FALSE)
write.table(nodes, file='nodes.txt', sep='\t', quote=FALSE)

image.png

4.3.6 Correlation plot on a local webpage

shinyCorPlot() is an interactive plot function based on the shiny package that can be operated by simply selecting genes in each drop-down box (in the GUI window). Running shinyCorPlot() opens a local webpage that automatically shows the correlation plot between a lncRNA and an mRNA.

shinyCorPlot(gene1    = rownames(deLNC), 
             gene2    = rownames(dePC), 
             rna.expr = rnaExpr, 
             metadata = metaMatrix.RNA)

image.gif

4.4 Other downstream analyses

Downstream analyses such as univariate survival analysis and functional enrichment analysis are provided in the GDCRNATools package to facilitate the identification of genes in the ceRNAs network that play important roles in prognosis or are involved in important pathways.

4.4.1 Univariate survival analysis

Two methods are provided to perform univariate survival analysis: the Cox Proportional-Hazards (CoxPH) model and Kaplan-Meier (KM) analysis based on the survival package. The CoxPH model treats the expression value as a continuous variable, while KM analysis divides patients into high-expression and low-expression groups by a user-defined threshold such as the median or mean. gdcSurvivalAnalysis() takes a list of genes as input and reports the hazard ratio, 95% confidence intervals, and test significance of each gene for overall survival.

4.4.1.1 CoxPH analysis

####### CoxPH analysis #######
survOutput <- gdcSurvivalAnalysis(gene     = rownames(deALL), 
                                  method   = 'coxph', 
                                  rna.expr = rnaExpr, 
                                  metadata = metaMatrix.RNA)

4.4.1.2 KM analysis

####### KM analysis #######
survOutput <- gdcSurvivalAnalysis(gene     = rownames(deALL), 
                                  method   = 'KM', 
                                  rna.expr = rnaExpr, 
                                  metadata = metaMatrix.RNA, 
                                  sep      = 'median')
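For readers who want to see what the two approaches compute for a single gene, the sketch below calls the survival package directly. It is an illustration under stated assumptions rather than the body of gdcSurvivalAnalysis(). The clinical columns are copied from the first five rows of the Quick start metadata, the expression values are hypothetical, and overall survival time is assumed to be days_to_death when observed and days_to_last_follow_up otherwise.

```r
library(survival)

# Clinical columns copied from the first five rows of the Quick start metadata
meta <- data.frame(
  days_to_death          = c(339, 445, NA, NA, NA),
  days_to_last_follow_up = c(NA, NA, 402, 709, 650),
  vital_status           = c("Dead", "Dead", "Alive", "Alive", "Alive")
)

# Hypothetical normalized expression of one gene in the same five samples
meta$expr <- c(6.0, 7.4, 8.1, 4.9, 5.2)

# Overall survival: time to death if observed, otherwise time to last follow-up
meta$os_time  <- ifelse(is.na(meta$days_to_death),
                        meta$days_to_last_follow_up, meta$days_to_death)
meta$os_event <- as.integer(meta$vital_status == "Dead")

# CoxPH: expression enters the model as a continuous variable
cox <- coxph(Surv(os_time, os_event) ~ expr, data = meta)
hr  <- summary(cox)$conf.int[1, c("exp(coef)", "lower .95", "upper .95")]

# KM-style comparison: split at the median, then a log-rank test
meta$group <- ifelse(meta$expr > median(meta$expr), "high", "low")
lr   <- survdiff(Surv(os_time, os_event) ~ group, data = meta)
km_p <- pchisq(lr$chisq, df = 1, lower.tail = FALSE)

list(hazard_ratio = hr, logrank_p = km_p)
```

Five samples are far too few for a meaningful estimate; the point is only to show how the hazard ratio, its 95% confidence interval, and the log-rank p-value arise from the metadata columns.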

4.4.1.3 KM plot on a local webpage by shinyKMPlot

The shinyKMPlot() function is also a simple shiny app that allows users to conveniently view KM plots (based on the R package survminer) of all genes of interest on a local webpage.

shinyKMPlot(gene = rownames(deALL), rna.expr = rnaExpr, 
            metadata = metaMatrix.RNA)

image.gif

4.4.2 Functional enrichment analysis

gdcEnrichAnalysis() can perform Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Disease Ontology (DO) functional enrichment analyses of a list of genes simultaneously. GO and KEGG analyses are based on the R/Bioconductor packages clusterProfiler and DOSE. Redundant GO terms can be removed by specifying simplify=TRUE in the gdcEnrichAnalysis() function, which uses the simplify() function in the clusterProfiler package.

enrichOutput <- gdcEnrichAnalysis(gene = rownames(deALL), simplify = TRUE)

4.4.2.1 Barplot

data(enrichOutput)

gdcEnrichPlot(enrichOutput, type = 'bar', category = 'GO', num.terms = 10)

image.png

4.4.2.2 Bubble plot

gdcEnrichPlot(enrichOutput, type='bubble', category='GO', num.terms = 10)

image.png

4.4.2.3 View pathway maps on a local webpage

shinyPathview() allows users to view and download pathways of interest by simply selecting the pathway terms on a local webpage.

library(pathview)

deg <- deALL$logFC
names(deg) <- rownames(deALL)
pathways <- as.character(enrichOutput$Terms[enrichOutput$Category=='KEGG'])

shinyPathview(deg, pathways = pathways, directory = 'pathview')

image.gif

5 sessionInfo

sessionInfo()

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] zh_CN.UTF-8/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] DT_0.15           GDCRNATools_1.8.0
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1                backports_1.1.10           
##   [3] fastmatch_1.1-0             BiocFileCache_1.12.1       
##   [5] plyr_1.8.6                  igraph_1.2.6               
##   [7] splines_4.0.2               BiocParallel_1.22.0        
##   [9] pathview_1.28.1             GenomeInfoDb_1.24.2        
##  [11] ggplot2_3.3.2               urltools_1.7.3             
##  [13] digest_0.6.25               htmltools_0.5.0            
##  [15] GOSemSim_2.14.2             viridis_0.5.1              
##  [17] GO.db_3.11.4                magrittr_1.5               
##  [19] memoise_1.1.0               openxlsx_4.2.2             
##  [21] limma_3.44.3                Biostrings_2.56.0          
##  [23] readr_1.3.1                 annotate_1.66.0            
##  [25] graphlayouts_0.7.0          matrixStats_0.57.0         
##  [27] askpass_1.1                 enrichplot_1.8.1           
##  [29] prettyunits_1.1.1           colorspace_1.4-1           
##  [31] blob_1.2.1                  rappdirs_0.3.1             
##  [33] ggrepel_0.8.2               haven_2.3.1                
##  [35] xfun_0.17                   dplyr_1.0.2                
##  [37] crayon_1.3.4                RCurl_1.98-1.2             
##  [39] jsonlite_1.7.1              graph_1.66.0               
##  [41] scatterpie_0.1.5            genefilter_1.70.0          
##  [43] zoo_1.8-8                   survival_3.2-3             
##  [45] glue_1.4.2                  survminer_0.4.9            
##  [47] GenomicDataCommons_1.12.0   polyclip_1.10-0            
##  [49] gtable_0.3.0                zlibbioc_1.34.0            
##  [51] XVector_0.28.0              DelayedArray_0.14.1        
##  [53] car_3.0-9                   Rgraphviz_2.32.0           
##  [55] BiocGenerics_0.34.0         abind_1.4-5                
##  [57] scales_1.1.1                DOSE_3.14.0                
##  [59] DBI_1.1.0                   edgeR_3.30.3               
##  [61] rstatix_0.6.0               Rcpp_1.0.5                 
##  [63] viridisLite_0.3.0           xtable_1.8-4               
##  [65] progress_1.2.2              gridGraphics_0.5-0         
##  [67] foreign_0.8-80              bit_4.0.4                  
##  [69] europepmc_0.4               km.ci_0.5-2                
##  [71] stats4_4.0.2                htmlwidgets_1.5.1          
##  [73] httr_1.4.2                  fgsea_1.14.0               
##  [75] gplots_3.1.0                RColorBrewer_1.1-2         
##  [77] ellipsis_0.3.1              pkgconfig_2.0.3            
##  [79] XML_3.99-0.5                farver_2.0.3               
##  [81] dbplyr_1.4.4                locfit_1.5-9.4             
##  [83] labeling_0.3                ggplotify_0.0.5            
##  [85] tidyselect_1.1.0            rlang_0.4.8                
##  [87] reshape2_1.4.4              later_1.1.0.1              
##  [89] AnnotationDbi_1.50.3        cellranger_1.1.0           
##  [91] munsell_0.5.0               tools_4.0.2                
##  [93] downloader_0.4              generics_0.0.2             
##  [95] RSQLite_2.2.1               broom_0.7.0                
##  [97] ggridges_0.5.2              evaluate_0.14              
##  [99] stringr_1.4.0               fastmap_1.0.1              
## [101] yaml_2.2.1                  org.Hs.eg.db_3.11.4        
## [103] knitr_1.30                  bit64_4.0.5                
## [105] tidygraph_1.2.0             zip_2.1.1                  
## [107] survMisc_0.5.5              caTools_1.18.0             
## [109] purrr_0.3.4                 KEGGREST_1.28.0            
## [111] ggraph_2.0.3                mime_0.9                   
## [113] KEGGgraph_1.48.0            DO.db_2.9                  
## [115] xml2_1.3.2                  biomaRt_2.44.4             
## [117] compiler_4.0.2              png_0.1-7                  
## [119] curl_4.3                    ggsignif_0.6.0             
## [121] tibble_3.0.3                tweenr_1.0.1               
## [123] geneplotter_1.66.0          stringi_1.5.3              
## [125] forcats_0.5.0               lattice_0.20-41            
## [127] Matrix_1.2-18               KMsurv_0.1-5               
## [129] vctrs_0.3.4                 pillar_1.4.6               
## [131] lifecycle_0.2.0             BiocManager_1.30.10        
## [133] triebeard_0.3.0             data.table_1.13.0          
## [135] cowplot_1.1.0               bitops_1.0-6               
## [137] httpuv_1.5.4                GenomicRanges_1.40.0       
## [139] qvalue_2.20.0               R6_2.4.1                   
## [141] promises_1.1.1              rio_0.5.16                 
## [143] KernSmooth_2.23-17          gridExtra_2.3              
## [145] IRanges_2.22.2              MASS_7.3-53                
## [147] gtools_3.8.2                assertthat_0.2.1           
## [149] SummarizedExperiment_1.18.2 rjson_0.2.20               
## [151] openssl_1.4.3               DESeq2_1.28.1              
## [153] S4Vectors_0.26.1            GenomeInfoDbData_1.2.3     
## [155] parallel_4.0.2              hms_0.5.3                  
## [157] clusterProfiler_3.16.1      grid_4.0.2                 
## [159] prettydoc_0.4.1             tidyr_1.1.2                
## [161] rmarkdown_2.3               rvcheck_0.1.8              
## [163] carData_3.0-4               ggpubr_0.4.0               
## [165] ggforce_0.3.2               Biobase_2.48.0             
## [167] shiny_1.5.0

6 References

2 `GDCRNATools` package installation