【scRW】[4]Single-cell RNA-seq: Quality control set-up

Learning Objectives:

Understand how to bring in data from single-cell RNA-seq experiments


Quality control set-up

1. Exploring the example dataset

在本次研讨会上,我们将使用单细胞RNA-seq数据集,该数据集是Kang等人(2017年)进行的一项较大研究的一部分。在本文中,作者提出了一种利用遗传变异(eQTL)来确定基因的计算算法。 每个包含单个细胞的液滴的遗传同一性(单一),并鉴定包含来自不同个体的两个细胞的液滴(双重)。
用于测试其算法的数据包括从八名狼疮患者中提取的合并的外周血单个核细胞(PBMC),分为对照和干扰素β治疗(刺激)条件。

[图片上传失败...(image-e2393b-1590137860500)]*
](https://upload-images.jianshu.io/upload_images/11904209-6ba372a42250af69.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

2. Raw data

This dataset is available on GEO (GSE96583), however the available counts matrix lacked mitochondrial reads, so we downloaded the BAM files from the SRA (SRP102802). These BAM files were converted back to FASTQ files, then run through Cell Ranger to obtain the count data that we will be using.

NOTE: The counts for this dataset is also freely available from 10X Genomics and is used as part of the Seurat tutorial.

3. Metadata

除了原始数据,我们还需要收集有关数据的信息; 这就是所谓的 metadata元数据 。 通常有一种 temptation ,就是只是开始探索数据,但如果我们不知道该数据所源自的样本,那并不是很有意义。

以下提供了我们数据集的一些相关metadata元数据:

  • The libraries were prepared using 10X Genomics version 2 chemistry;
  • The samples were sequenced on the Illumina NextSeq 500;
  • 将来自八名狼疮患者的PBMC样本分别分成两等份;
    = 用100 U / mL重组IFN-β激活一份PBMC,持续6小时。
    = The second aliquot分试样 was left untreated.
    = 6小时后,将每种条件的8个样品一起收集到两个最终的收集池中(受激细胞和对照细胞)。 我们将使用这两个合并的样本。
  • 分别鉴定了12138和12167个细胞(去除双峰后)作为对照样品和刺激后的合并样品。
  • 由于样本是PBMC,因此我们期望有免疫细胞,例如:
    = B cells
    = T cells
    = NK cells
    = monocytes
    = macrophages
    = possibly megakaryocytes

建议您在执行QC之前最好知道预期的细胞类型。因为这样当任何细胞类型的低复杂度(只有部分基因的转录本)或线粒体表达水平较高的细胞可以帮助我们提出剔除QC不合格的细胞群体。

预期上述细胞类型都不应该具有低复杂度或具有高线粒体含量。

4. Setting up the R environment

研究涉及大量数据的最重要部分之一就是如何最好地对其进行管理。 我们倾向于对分析进行优先级划分,但是数据管理中还有许多其他重要方面经常被人们忽略,因此他们对新数据没有先见之明。 HMS Data Management Working Group
深入讨论了除数据创建和分析之外要考虑的一些事项。

数据管理的一个重要方面是组织。 对于您进行和分析数据的每个实验,通过 a planned storage space (directory structure) 来进行组织是最佳实践。 我们将对单细胞分析进行此操作。

打开 RStudio 并创建一个名为single_cell_rnaseq的新R项目。 然后,创建以下目录:

single_cell_rnaseq/
├── data
├── results
└── figures

5. Download data & New script

Control sample
Stimulated sample

# February 2020
# HBC single-cell RNA-seq workshop

# Single-cell RNA-seq analysis - QC

将Rscript保存为quality_control.R。 您的工作目录应如下所示:

New script

6.Loading libraries

Now, we can load the necessary libraries:

# Load libraries
library(SingleCellExperiment)
library(Seurat)
library(tidyverse)
library(Matrix)
library(scales)
library(cowplot)
library(RCurl)

7.Loading single-cell RNA-seq count data

Regardless of the technology or pipeline used to process your single-cell RNA-seq sequence data, the output will generally be the same. That is, for each individual sample you will have the following three files:

1.a file with the cell IDs, representing all cells quantified
2.a file with the gene IDs, representing all genes quantified
3.a matrix of counts per gene for every cell

We can explore these files by clicking on the data/ctrl_raw_feature_bc_matrix folder:

7.1 barcodes.tsv

这是一个文本文件,其中包含该样品存在的所有细胞条形码。 条形码以矩阵文件中显示的数据顺序列出(即,这些是列名; i.e. these are the column names)。

barcodes.tsv

7.2. features.tsv

This is a text file which contains the identifiers of the quantified genes. The source of the identifier can vary depending on what reference (i.e. Ensembl, NCBI, UCSC) you use in the quantification methods, but most often these are official gene symbols. The order of these genes corresponds to the order of the rows in the matrix file (i.e. these are the row names).

image.png

7.3 matrix.mtx

这是一个文本文件,其中包含计数值矩阵。 行与上面的基因ID相关联,列与细胞条形码对应。 请注意,此矩阵中有许多零值。

matrix.mtx

Loading this data into R requires us to use functions that allow us to efficiently combine these three files into a single count matrix. However, instead of creating a regular matrix data structure, the functions we will use create a sparse matrix to improve the amount of space, memory and CPU required to work with our huge count matrix.

  • Different methods for reading in data include:
  1. readMM(): This function is from the Matrix package and will turn our standard matrix into a sparse matrix. The features.tsv file and barcodes.tsv must first be individually loaded into R and then they are combined. For specific code and instructions on how to do this please see our additional material.
  2. Read10X(): This function is from the Seurat package and will use the Cell Ranger output directory as input. In this way individual files do not need to be loaded in, instead the function will load and combine them into a sparse matrix for you. We will be using this function to load in our data!

7.4 Reading in a single sample (read10X())

When working with 10X data and its proprietary software Cell Ranger, you will always have an outs directory. Within this directory you will find a number of different files including:

  • web_summary.html: report that explores different QC metrics, including the mapping metrics, filtering thresholds, estimated number of cells after filtering, and information on the number of reads and genes per cell after filtering.
  • BAM alignment files: files used for visualization of the mapped reads and for re-creation of FASTQ files, if needed
  • filtered_feature_bc_matrix: folder containing all files needed to construct the count matrix using data filtered by Cell Ranger
  • raw_feature_bc_matrix: folder containing all files needed to construct the count matrix using the raw unfiltered data

We are mainly interested in the raw_feature_bc_matrix as we wish to perform our own QC and filtering while accounting for the biology of our experiment/biological system.

If we had a single sample, we could generate the count matrix and then subsequently create a Seurat object:

# How to read in 10X data for a single sample (output is a sparse matrix)
ctrl_counts <- Read10X(data.dir = "data/ctrl_raw_feature_bc_matrix")

# Turn count matrix into a Seurat object (output is a Seurat object)
ctrl <- CreateSeuratObject(counts = ctrl_counts,
                           min.features = 100)

min.features 参数指定每个细胞需要检测的最小基因数量。 此参数将过滤掉质量较差的单元,这些单元可能只是封装了随机条形码 random barcodes 而没有任何单元any cell present。 通常,检测不到100个基因的细胞不考虑进行分析。

Seurat automatically creates some metadata for each of the cells when you use the Read10X() function to read in data. This information is stored in the meta.data slot within the Seurat object (see more in the note below).

The Seurat object is a custom list-like object 自定义列表状对象 that has well-defined spaces to store specific information/data. You can find more information about the slots in the Seurat object at this link.

# Explore the metadata
head(ctrl@meta.data)

What do the columns of metadata mean?

  • orig.ident: this often contains the sample identity if known, but will default to “SeuratProject”
  • nCount_RNA: number of UMIs per cell
  • nFeature_RNA: number of genes detected per cell

7.5 Reading in multiple samples with a for loop

In practice, you will likely have several samples that you will need to read in data for using one of the 2 functions we discussed earlier (Read10X() or readMM()). So, to make the data import into R more efficient we can use a for loop, that will interate over a series of commands for each of the inputs given.

In R, it has the following structure/syntax:

## DO NOT RUN

for (variable in input){
    command1
    command2
    command3
}

The for loop we will be using today will iterate over the two sample “files” and execute two commands for each sample - (1) read in the count data (Read10X()) and (2) create the Seurat objects from the read in data (CreateSeuratObject()):

# Create each individual Seurat object for every sample
for (file in c("ctrl_raw_feature_bc_matrix", "stim_raw_feature_bc_matrix")){
        seurat_data <- Read10X(data.dir = paste0("data/", file))
        seurat_obj <- CreateSeuratObject(counts = seurat_data, 
                                         min.features = 100, 
                                         project = file)
        assign(file, seurat_obj)
}

We can break down the for loop to describe the different lines of code:

Step1: specify input

For this dataset, we have two samples that we would like to create a Seurat object for:
ctrl_raw_feature_bc_matrix
stim_raw_feature_bc_matrix

We can specify these samples in the input part for our for loop as elements of a vector using c(). We are assigning these to a variable and we can call that variable anything we would like (try to give it a name that makes sense). In this example, we called the variable file.

During the execution of the above loop, file will first contain the value “ctrl_raw_feature_bc_matrix”, run through the commands all the way through to assign(). Next, it will contain the value “stim_raw_feature_bc_matrix” and once again run through all the commands. If you had 15 folders as input, instead of 2, the above code will run through 15 times, for each of your data folders.

## DO NOT RUN

# Create each individual Seurat object
for (file in c("ctrl_raw_feature_bc_matrix", "stim_raw_feature_bc_matrix")){
}
Step 2: Read in data for the input

We can continue our for loop by adding a line to read in data with Read10X():

Here, we need to specify the path to the file, so we will prepend the data/ directory to our sample folder name using the paste0() function.

Step 3: Create Seurat object from the 10X count data

Now, we can create the Seurat object by using the CreateSeuratObject() function, adding in the argument project, where we can add the sample name.

## DO NOT RUN

        seurat_obj <- CreateSeuratObject(counts = seurat_data, 
                                         min.features = 100, 
                                         project = file) 
Step 4: Assign Seurat object to a new variable based on sample

The last command assigns the Seurat object created (seurat_obj) to a new variable. In this way, when we iterate and move on to the next sample in our input we will not overwrite the Seurat object created in the previous iteration:

## DO NOT RUN
  
        assign(file, seurat_obj)
}

Now that we have created both of these objects, let’s take a quick look at the metadata to see how it looks:

# Check the metadata in the new Seurat objects
head(ctrl_raw_feature_bc_matrix@meta.data)
head(stim_raw_feature_bc_matrix@meta.data)

Next, we need to merge these objects together into a single Seurat object. This will make it easier to run the QC steps for both sample groups together and enable us to easily compare the data quality for all the samples.

We can use the merge() function from the Seurat package to do this:

# Create a merged Seurat object
merged_seurat <- merge(x = ctrl_raw_feature_bc_matrix, 
                       y = stim_raw_feature_bc_matrix, 
                       add.cell.id = c("ctrl", "stim"))

Because the same cell IDs can be used for different samples, we add a sample-specific prefix to each of our cell IDs using the add.cell.id argument. If we look at the metadata of the merged object we should be able to see the prefixes in the rownames:

# Check that the merged object has the appropriate sample-specific prefixes
head(merged_seurat@meta.data)
tail(merged_seurat@meta.data)
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 203,098评论 5 476
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 85,213评论 2 380
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 149,960评论 0 336
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,519评论 1 273
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,512评论 5 364
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,533评论 1 281
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 37,914评论 3 395
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,574评论 0 256
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 40,804评论 1 296
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,563评论 2 319
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,644评论 1 329
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,350评论 4 318
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 38,933评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,908评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,146评论 1 259
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 42,847评论 2 349
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,361评论 2 342