【r<-生信|包】Biostrings

在这篇文章中，我们将学习如何操控R中的字符串，主要用的是Biostrings包。

Getting started

如果你是第一次使用，请先用以下命令安装Biostrings包：

source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")

用以下命令导入：

require(Biostrings)

Biostrings可以加引号。如果你想要查看这个包的说明文档，请点击链接http://www.bioconductor.org/packages/release/bioc/manuals/Biostrings/man/Biostrings.pdf，我们建议你在这次实验操作中保持文档打开以便于查询。

我们将通过实际操作一些Biostrings包提供的函数去熟悉它做的是什么，又是如何实现的。

Generating DNA alphabets

R 提供了函数生成大写和小写的字母表。

letters

##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"

LETTERS

##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

一种生成 “ACGT” 的方式是

sample(LETTERS[c(1,3,7,20)], size=10, replace=TRUE)

##  [1] "C" "G" "T" "A" "T" "T" "G" "G" "A" "A"

Biostrings 使用的DNA_ALPHABET 变量相比起来更加方便。

DNA_ALPHABET

##  [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
## [18] "."

seq = sample(DNA_ALPHABET[1:4], size=10, replace=TRUE)
seq

##  [1] "A" "G" "C" "T" "C" "G" "T" "C" "T" "T"

可以将它们转换成一个字符串：

seq = paste(seq, collapse="")
seq

## [1] "AGCTCGTCTT"

The XString class

XString 类允许我们创建、存储和使用不同类型的字符串。不过我们只被允许使用XString的一些子类: BString, DNAString, RNAString, 和AAString.

让我们先创建一个BString对象：

bstring = BString("I am a BString object")
bstring

##   21-letter "BString" instance
## seq: I am a BString object

用R的 “length” 的函数获取对象的长度：

length(bstring)

## [1] 21

现在创建一个 DNAString对象：

dnastring = DNAString("TTGAAA-CTC-N")
dnastring

##   12-letter "DNAString" instance
## seq: TTGAAA-CTC-N

和刚才一样，我们可以计算对象的长度：

length(dnastring)

## [1] 12

根据使用手册，一个BString对象的区别在于：

只允许使用符合 IUPAC编码字符以及 gap letter (-)
输入 DNAString函数的每一个字符参数在被存储为DNAString对象前都被特殊方式编码过。

我们能够用str()函数查看任意 XString对象的slot：

str(dnastring)

## Formal class 'DNAString' [package "Biostrings"] with 5 slots
##   ..@ shared         :Formal class 'SharedRaw' [package "XVector"] with 2 slots
##   .. .. ..@ xp                    :<externalptr> 
##   .. .. ..@ .link_to_cached_object:<environment: 0x632f648> 
##   ..@ offset         : int 0
##   ..@ length         : int 12
##   ..@ elementMetadata: NULL
##   ..@ metadata       : list()

slotNames(dnastring)

## [1] "shared"          "offset"          "length"          "elementMetadata"
## [5] "metadata"

一些有用的函数：

alphabetFrequency(dnastring)

## A C G T M R W S Y K V H D B N - + . 
## 3 2 1 3 0 0 0 0 0 0 0 0 0 0 1 2 0 0

alphabetFrequency(dnastring, baseOnly=TRUE, as.prob=TRUE)

##          A          C          G          T      other 
## 0.25000000 0.16666667 0.08333333 0.25000000 0.25000000

letterFrequency(dnastring, "A", as.prob=TRUE)

##    A 
## 0.25

reverseComplement(dnastring)

##   12-letter "DNAString" instance
## seq: N-GAG-TTTCAA

我们也能够通过 [] 操作符取出单个字符：

bstring[3]

##   1-letter "BString" instance
## seq: a

dnastring[2]

##   1-letter "DNAString" instance
## seq: T

也可以通过连续索引取子字符串：

dnastring[7:12]

##   6-letter "DNAString" instance
## seq: -CTC-N

记住，R使用的是基于1的索引系统，它意味着字符串或任意向量的第一个元素的索引是1。虽然这样用来获取子字符串和简单和直观，但是我们强烈不推荐使用，特别是对大的字符串。 Biostrings 提供了 “subseq” 函数可以完成一样的操作：

subseq(bstring, start=3, end=3)

##   1-letter "BString" instance
## seq: a

subseq(bstring, 3, 3)

##   1-letter "BString" instance
## seq: a

subseq(dnastring, 7, 12)

##   6-letter "DNAString" instance
## seq: -CTC-N

我们可以将XString对象转换为字符向量：

toString(bstring)

## [1] "I am a BString object"

toString(dnastring)

## [1] "TTGAAA-CTC-N"

字符向量的长度是1，我们可以检查：

length(toString(bstring))

## [1] 1

Comparing XStrings

我们可以比较XStrings与字符或其他XStrings对象。

bb = subseq(bstring, 3, 6)
bb == "am a"

## [1] TRUE

bb == BString("am a")

## [1] TRUE

不要尝试比较不同的XString类。（也就是相同的类可以，类和字符可以比较）

bstring == dnastring

## Error in bstring == dnastring: comparison between a "BString" instance and a "DNAString" instance is not supported

bstring != dnastring

## Error in e1 == e2: comparison between a "BString" instance and a "DNAString" instance is not supported

The XStringViews class

这是可视化一个XString对象多个子序列的的有效方式。

view = Views(dnastring, start=3:0, end=5:8)
view

##   Views on a 12-letter DNAString subject
## subject: TTGAAA-CTC-N
## views:
##     start end width
## [1]     3   5     3 [GAA]
## [2]     2   6     5 [TGAAA]
## [3]     1   7     7 [TTGAAA-]
## [4]     0   8     9 [ TTGAAA-C]

可以查看views的数目：

length(view)

## [1] 4

同样的，我们也可以选择它的子集：

view[4:2]

##   Views on a 12-letter DNAString subject
## subject: TTGAAA-CTC-N
## views:
##     start end width
## [1]     0   8     9 [ TTGAAA-C]
## [2]     1   7     7 [TTGAAA-]
## [3]     2   6     5 [TGAAA]

返回的对象任然是一个XStringViews对象，即使我们选择的是一个元素。我们可以通过[[]] operator将它抽取为一个XStrings对象。

view[[2]]

##   5-letter "DNAString" instance
## seq: TGAAA

查看除了第三个的其他元素（体现的就是切片操作）

view[-3]

##   Views on a 12-letter DNAString subject
## subject: TTGAAA-CTC-N
## views:
##     start end width
## [1]     3   5     3 [GAA]
## [2]     2   6     5 [TGAAA]
## [3]     0   8     9 [ TTGAAA-C]

XStringSet

XStringSet objects可以非常方便的将几个XString objects存储在一起。

set = NULL
for (i in 1:4)
  set = c(set, paste(sample(DNA_ALPHABET[1:4], 10, replace=T), collapse=""))
set

## [1] "AGACCACTCC" "GCATGTAGCT" "GTGGTACGGC" "TCAAACGGCT"

length(set)

## [1] 4

现在将它转换为 DNAStrings:

set = DNAStringSet(set)
set

##   A DNAStringSet instance of length 4
##     width seq
## [1]    10 AGACCACTCC
## [2]    10 GCATGTAGCT
## [3]    10 GTGGTACGGC
## [4]    10 TCAAACGGCT

length(set)

## [1] 4

“reverseComplement”, “alphabetFrequency”, “width”, and “subseq”函数 work on each of the XStrings in a XStringSet.

reverseComplement(set)

##   A DNAStringSet instance of length 4
##     width seq
## [1]    10 GGAGTGGTCT
## [2]    10 AGCTACATGC
## [3]    10 GCCGTACCAC
## [4]    10 AGCCGTTTGA

alphabetFrequency(set, baseOnly=T, as.prob=T)

##        A   C   G   T other
## [1,] 0.3 0.5 0.1 0.1     0
## [2,] 0.2 0.2 0.3 0.3     0
## [3,] 0.1 0.2 0.5 0.2     0
## [4,] 0.3 0.3 0.2 0.2     0

letterFrequency(set, "A", as.prob=T)

##        A
## [1,] 0.3
## [2,] 0.2
## [3,] 0.1
## [4,] 0.3

width(set)

## [1] 10 10 10 10

subseq(set, 1, 3)

##   A DNAStringSet instance of length 4
##     width seq
## [1]     3 AGA
## [2]     3 GCA
## [3]     3 GTG
## [4]     3 TCA

你可以为每一个 XString命名：

names(set) = paste("name", 1:4, sep="")

Exercise

We’ll look at the Escherichia coli APEC O1 genome (NC 008563). This can be found in the BSgenome package (along with genomes from other model organisms). Download and install the package data:

biocLite("BSgenome.Ecoli.NCBI.20080805")

We can now load the genome with

require(BSgenome.Ecoli.NCBI.20080805)
eco = Ecoli$NC_008563

Sample the genome by generating 1000 random views of random widths (50 - 100).
Use the alphabetFrequency and reverseComplement functions on these views.
Repeat 1-2 but this time with only 100 samples.
Compare the results of alphabetFrequency. How do both of them compare with the entire genome?

Generating the views:

views1000 = Views(eco, start=sample(length(eco), 1000, replace=T), width=sample(50:100, 1000, replace=T))
views100 = Views(eco, start=sample(length(eco), 100, replace=T), width=sample(50:100, 100, replace=T))

Computing the base frequencies:

alphabetFrequency(views1000, baseOnly=T, as.prob=T, collapse=T)

##            A            C            G            T        other 
## 0.2436190835 0.2540654784 0.2542674062 0.2480211082 0.0000269237

alphabetFrequency(views100, baseOnly=T, as.prob=T, collapse=T)

##         A         C         G         T     other 
## 0.2511641 0.2521227 0.2515749 0.2451383 0.0000000

alphabetFrequency(eco, baseOnly=T, as.prob=T)

##            A            C            G            T        other 
## 2.471704e-01 2.529128e-01 2.525602e-01 2.473315e-01 2.518681e-05

可以看到采样越多数值越接近总体值。

IRanges

就像Views能够被用来查看子序列。一个通常的任务是描述染色体一系列的起始位点，并接着查看每个起始位点给定长度后的子序列。

ir1 = IRanges(start=1:10, width=10:1) # 只要定义好start,width,end其中两个，序列就被唯一确定
ir2 = IRanges(start=1:10, end=11)
ir3 = IRanges(end=11, width=10:1)

让我们看看IRanges对象的结构：

str(ir1)

## Formal class 'IRanges' [package "IRanges"] with 6 slots
##   ..@ start          : int [1:10] 1 2 3 4 5 6 7 8 9 10
##   ..@ width          : int [1:10] 10 9 8 7 6 5 4 3 2 1
##   ..@ NAMES          : NULL
##   ..@ elementType    : chr "integer"
##   ..@ elementMetadata: NULL
##   ..@ metadata       : list()

我们可以使用下面函数提取出起点、终点以及宽度：

start(ir1)

##  [1]  1  2  3  4  5  6  7  8  9 10

end(ir1)

##  [1] 10 10 10 10 10 10 10 10 10 10

width(ir1)

##  [1] 10  9  8  7  6  5  4  3  2  1

像Views对象一样，对IRanges对象取子集会得到新的IRanges对象。

ir1[1:4]

## IRanges of length 4
##     start end width
## [1]     1  10    10
## [2]     2  10     9
## [3]     3  10     8
## [4]     4  10     7

而且可以用逻辑表达式进行选择：

ir1[start(ir1) <= 6]

## IRanges of length 6
##     start end width
## [1]     1  10    10
## [2]     2  10     9
## [3]     3  10     8
## [4]     4  10     7
## [5]     5  10     6
## [6]     6  10     5

我们现在来创建一个函数，用基本的R绘制函数画一下IRanges对象。

library(ggplot2)

plotRanges = function(x, xlim=x, main=deparse(substitute(x)), sep=0.5){
  height = 1
  if (is(xlim, "Ranges"))
    xlim = c(min(start(xlim))-sep, max(end(xlim))+sep)
  bins = disjointBins(IRanges(start(x), end(x)+1))  
  
  plot.new()
  plot.window(xlim, c(0, max(bins) * (height + sep)))
  ybottom = bins * (sep + height) - height
  rect(start(x) - 0.5, ybottom, end(x) + 0.5, ybottom + height, col = "blue")
  title(main)
  axis(1)
}
plotRanges(ir1)

我们也可以用ggplot来画：

library(ggplot2)

plotRanges = function(x, xlim=x, main=deparse(substitute(x)), sep=0.5){
  height = 1
  if (is(xlim, "Ranges"))
    xlim = c(min(start(xlim))-sep, max(end(xlim))+sep)
  bins = disjointBins(IRanges(start(x), end(x)+1))  
 
  ybottom = bins*(sep+height) - height
  df = data.frame(ybottom = ybottom, xleft = start(x) -.5, 
  xright = end(x) + .5, ytop = ybottom + height)
  ggplot(df) + geom_rect(aes(xmax = xright, xmin = xleft, ymax = ytop, ymin = ybottom)) + 
    labs(title = main)
}

注意我们的范围是有重叠部分的，要是我们不要重叠呢？

reduce(ir1)

## IRanges of length 1
##     start end width
## [1]     1  10    10

plotRanges(reduce(ir1))

可以看到结果是一个NormalRanges对象。一个正常的IRanges对象表现为： 1. 非空 (i.e. 它的宽度不为0); 2. 没有重叠; 3. 顺序是从左到右; 4. 不会连接在一起 (i.e. 两个连续的IRanges不会有空的间隔).

CpG islands in real data

GC含量是基因组或DNA片段中"G""C"的百分比。要计算GC含量，我们需要对"G""C"出现的次数计数，然后除以问题中涉及的字符串长度。

我们将会使用来自UCSC基因组资源库的人类基因组版本19中8号染色体的数据(If you have plenty of space and a larger computer, you can prepare the data yourself or take a different chromosome by using the Bioconductor data package BSgenome.Hsapiens.UCSC.hg19, otherwise you will be able to load just a subset of chr8).

当一个基因组窗口GC含量大于50%并且观察与期望的CG比率大于0.6时被定义为CpG岛，CpG岛被认为是基因组上非常重要的区域，因为65%的基因启动子区域都能在CpG岛上找到。

We want to look at this for the Human Chromosome 8. You can get the data either from installing the data from bioconductor (about 800Mb, in a nice format for later reference) which we would like you to do. You can also get the data from downloading from the class website if you haven’t already gotten the data (about 140Mb). If you do not have time to do this right now, you can skip ahead and load the data seqChr8Islands and seqChr8NonIslands by downloading this file and using load("cpg.RData").

如果你还没有下载好bioconductor包，可以使用下面的命令：

source("http://bioconductor.org/biocLite.R")
biocLite("BSgenome.Hsapiens.UCSC.hg19")

如果你已经装好了基因组包，运行：

require(BSgenome.Hsapiens.UCSC.hg19)

## Loading required package: BSgenome.Hsapiens.UCSC.hg19

seqChr8 = unmasked(Hsapiens$chr8)

如果你还没有装好，你可以仅仅下载这次学习所需要的140Mb数据。当然，你也可以下载 seqChr8.fasta 到你的工作目录，然后运行：

seqChr8 = readDNAStringSet("seqChr8.fasta")[[1]]

下一步，我们用R下载CpG岛在基因组上的位置数据。 Irizarray’s 网站: http://rafalab.jhsph.edu/CGI/model-based-cpg-islands-hg19.txt. 我们的网站上也有model-based-cpg-islands-hg19.txt 这个文件。从这里下载： model-based-cpg-islands-gh19.txt.。把它保存到你的工作目录，并用下面的命令把它读进R:

cpglocs=read.table("model-based-cpg-islands-hg19.txt",header=T)

第一个链接好像失效了，直接用下面命令直接下载进R：

cpglocs=read.table("http://bios221.stanford.edu/data/model-based-cpg-islands-hg19.txt",header=T)

然后我们仅保留8号染色体上CpG的起始和终止位点：

cpglocs8=cpglocs[which(cpglocs[,1]=="chr8"),2:3]
#Look at the dimensions of the cpglocs object,
dim(cpglocs8)

## [1] 2855    2

#Look at the first and last few rows:
cpglocs8[1:5,]

##        start    end
## 56961  32437  33708
## 56962  40354  40656
## 56963  44536  46203
## 56964  99457 100721
## 56965 155954 156761

cpglocs8[2850:2855,]

##           start       end
## 59810 146125639 146125885
## 59811 146126089 146128165
## 59812 146175867 146176338
## 59813 146228006 146228907
## 59814 146232962 146233086
## 59815 146276730 146278360

# 我们创建了一个新的矩阵，它包含了什么呢？
nonilocs8=matrix(0,ncol=2,nrow=2856)
nonilocs8[1,1]=10000
nonilocs8[1,2]=cpglocs8[1,1]-1
nonilocs8[2:2856,1]=cpglocs8[1:2855,2]+1
nonilocs8[2:2855,2]=cpglocs8[2:2855,1]-1
nonilocs8[2856,2]=146304021

我们现在将创建DNAStringSets对象，它包含了岛内和岛外两个。

seqChr8Islands = DNAStringSet(seqChr8, start=cpglocs8[,1], end=cpglocs8[,2])
seqChr8NonIslands = DNAStringSet(seqChr8, start=nonilocs8[,1], end=nonilocs8[,2])

看一下两个数据集中GC的频率：

freqIslands = vcountPattern("CG", seqChr8Islands) / width(seqChr8Islands)
freqNonIslands = vcountPattern("CG", seqChr8NonIslands) / width(seqChr8NonIslands)
freqCombined = rbind(data.frame(freq = freqIslands, type = "islands"), 
  data.frame(freq = freqNonIslands, type = "non-islands"))
b <- seq(0, 0.25, length = 50)
hist(freqNonIslands, col = rgb(0, 0, 1, 0.5), xlim = c(0, 0.25), breaks = b, 
    main = "Frequencies of CG Digrams")
hist(freqIslands, col = rgb(1, 0, 0, 0.5), add = T, breaks = b)
legend("topright", c("in NonIsland regions", "in Island regions"), fill = c(rgb(0, 
    0, 1, 0.5), rgb(1, 0, 0, 0.5)))

用ggplot来画：

ggplot(freqCombined) + 
geom_histogram(aes(fill = type, x = freq), alpha = .7, position = "identity", 
  binwidth = .005) + 
labs(title = "Frequency of CG digrams")

Exercise: Display GC content in a running window along the sequence of Staphylococcus Aureus. For this excercise we read in a fasta file sequence from a file:

staph = readDNAStringSet("./data/staphsequence.ffn.txt", "fasta")
staph[1:3]

##   A DNAStringSet instance of length 3
##     width seq                                          names               
## [1]  1362 ATGTCGGAAAAAGAAATTTGG...AAGAAATAAGAAATGTATAA lcl|NC_002952.2_c...
## [2]  1134 ATGATGGAATTCACTATTAAA...TACCAATCAGAACTTACTAA lcl|NC_002952.2_c...
## [3]   246 GTGATTATTTTGGTTCAAGAA...TTCATCAAGGTGAACAATGA lcl|NC_002952.2_c...

This shows the first 3 sequences in the set. If you type in ‘staph’ it will print out all the sequences. Now, suppose we want to find the average GC content in the sequence windows of width 100 bases.

letterFrequency(staph[[1]], letters="ACGT", OR=0)

##   A   C   G   T 
## 522 219 229 392

GCstaph = data.frame(ID=names(staph), GC=rowSums(alphabetFrequency(staph)[, c(2,3)]/width(staph))*100)

Now we would like to display the GC content in a sliding window.

window = 100
# compute the GC content in a sliding window (as a fraction) for a sequence no. 364
gc = rowSums(letterFrequencyInSlidingView(staph[[364]], window, c("G", "C")))/window
plot(gc, type = 'l')

Now, we would like to add a smoothing line to observe the overall trends.

plot(1:length(gc), gc)
lines(lowess(x = 1:length(gc), y= gc, f = 0.10), col = 12, lwd = 2)

原文链接：

https://web.stanford.edu/class/bios221/labs/biostrings/lab_1_biostrings.html