用R获取芯片探针与基因的对应关系三部曲-bioconductor
现有的基因芯片种类不要太多了!soft
和miniml
都是表示该platform的基础信息,比如GPL编号
,上传日期等,soft文件的部分内容如下
但是重要而且常用的芯片并不多!
一般分析芯片数据都需要把探针的ID切换成基因的ID,我一般喜欢用基因的entrez ID。
一般有三种方法可以得到芯片探针与gene的对应关系。
金标准当然是去基因芯片的厂商的官网直接去下载啦!!!
一种是直接用bioconductor的包
一种是从NCBI里面下载文件来解析好!
首先,我们说官网,肯定可以找到,不然这种芯片出来就没有意义了!
然后,我们看看NCBI下载的,会比较大
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL6947 这两种方法都比较麻烦,需要一个个的来!
所以我接下来要讲的是用R的bioconductor包来批量得到芯片探针与gene的对应关系!
一般重要的芯片在R的bioconductor里面都是有包的,用一个R包可以批量获取有注释信息的芯片平台,我选取了常见的物种,如下:
0 | gpl | organism | bioc_package.db |
1 | GPL32 | Mus musculus | mgu74a |
2 | GPL33 | Mus musculus | mgu74b |
3 | GPL34 | Mus musculus | mgu74c |
6 | GPL74 | Homo sapiens | hcg110 |
7 | GPL75 | Mus musculus | mu11ksuba |
8 | GPL76 | Mus musculus | mu11ksubb |
9 | GPL77 | Mus musculus | mu19ksuba |
10 | GPL78 | Mus musculus | mu19ksubb |
11 | GPL79 | Mus musculus | mu19ksubc |
12 | GPL80 | Homo sapiens | hu6800 |
13 | GPL81 | Mus musculus | mgu74av2 |
14 | GPL82 | Mus musculus | mgu74bv2 |
15 | GPL83 | Mus musculus | mgu74cv2 |
16 | GPL85 | Rattus norvegicus | rgu34a |
17 | GPL86 | Rattus norvegicus | rgu34b |
18 | GPL87 | Rattus norvegicus | rgu34c |
19 | GPL88 | Rattus norvegicus | rnu34 |
20 | GPL89 | Rattus norvegicus | rtu34 |
22 | GPL91 | Homo sapiens | hgu95av2 |
23 | GPL92 | Homo sapiens | hgu95b |
24 | GPL93 | Homo sapiens | hgu95c |
25 | GPL94 | Homo sapiens | hgu95d |
26 | GPL95 | Homo sapiens | hgu95e |
27 | GPL96 | Homo sapiens | hgu133a |
28 | GPL97 | Homo sapiens | hgu133b |
29 | GPL98 | Homo sapiens | hu35ksuba |
30 | GPL99 | Homo sapiens | hu35ksubb |
31 | GPL100 | Homo sapiens | hu35ksubc |
32 | GPL101 | Homo sapiens | hu35ksubd |
36 | GPL201 | Homo sapiens | hgfocus |
37 | GPL339 | Mus musculus | moe430a |
38 | GPL340 | Mus musculus | mouse4302 |
39 | GPL341 | Rattus norvegicus | rae230a |
40 | GPL342 | Rattus norvegicus | rae230b |
41 | GPL570 | Homo sapiens | hgu133plus2 |
42 | GPL571 | Homo sapiens | hgu133a2 |
43 | GPL886 | Homo sapiens | hgug4111a |
44 | GPL887 | Homo sapiens | hgug4110b |
45 | GPL1261 | Mus musculus | mouse430a2 |
49 | GPL1352 | Homo sapiens | u133x3p |
50 | GPL1355 | Rattus norvegicus | rat2302 |
51 | GPL1708 | Homo sapiens | hgug4112a |
54 | GPL2891 | Homo sapiens | h20kcod |
55 | GPL2898 | Rattus norvegicus | adme16cod |
60 | GPL3921 | Homo sapiens | hthgu133a |
63 | GPL4191 | Homo sapiens | h10kcod |
64 | GPL5689 | Homo sapiens | hgug4100a |
65 | GPL6097 | Homo sapiens | illuminaHumanv1 |
66 | GPL6102 | Homo sapiens | illuminaHumanv2 |
67 | GPL6244 | Homo sapiens | hugene10sttranscriptcluster |
68 | GPL6947 | Homo sapiens | illuminaHumanv3 |
69 | GPL8300 | Homo sapiens | hgu95av2 |
70 | GPL8490 | Homo sapiens | IlluminaHumanMethylation27k |
71 | GPL10558 | Homo sapiens | illuminaHumanv4 |
72 | GPL11532 | Homo sapiens | hugene11sttranscriptcluster |
73 | GPL13497 | Homo sapiens | |
74 | GPL13534 | Homo sapiens | IlluminaHumanMethylation450k |
75 | GPL13667 | Homo sapiens | hgu219 |
76 | GPL15380 | Homo sapiens | GGHumanMethCancerPanelv1 |
77 | GPL15396 | Homo sapiens | hthgu133b |
78 | GPL17897 | Homo sapiens | hthgu133a |
这些包首先需要都下载
gpl_info=read.csv("GPL_info.csv",stringsAsFactors = F)
### first download all of the annotation packages from bioconductor
for (i in 1:nrow(gpl_info)){
print(i)
platform=gpl_info[i,4]
platform=gsub('^ ',"",platform) ##主要是因为我处理包的字符串前面有空格
#platformDB='hgu95av2.db'
platformDB=paste(platform,".db",sep="")
if( platformDB %in% rownames(installed.packages()) == FALSE) {
BiocInstaller::biocLite(platformDB)
#source("http://bioconductor.org/biocLite.R");
#biocLite(platformDB )
}
}
下载完了所有的包, 就可以进行批量导出芯片探针与gene的对应关系!
for (i in 1:nrow(gpl_info)){
print(i)
platform=gpl_info[i,4]
platform=gsub('^ ',"",platform)
#platformDB='hgu95av2.db'
platformDB=paste(platform,".db",sep="")
if( platformDB %in% rownames(installed.packages()) != FALSE) {
library(platformDB,character.only = T)
#tmp=paste('head(mappedkeys(',platform,'ENTREZID))',sep='')
#eval(parse(text = tmp))
###重点在这里,把字符串当做命令运行
all_probe=eval(parse(text = paste('mappedkeys(',platform,'ENTREZID)',sep='')))
EGID <- as.numeric(lookUp(all_probe, platformDB, "ENTREZID"))
##自己把内容写出来即可
}
}
参考:http://blog.sina.com.cn/s/blog_62b37bfe0101jbuq.html
从GEO的soft文件中提取探针序列信息
问题
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL16956 我需要注释这个平台,想要下载它的表格,可是直接表格点下面的view full table会显示在网页中,有五万多行,网页只能显示两万多行,根本不完整。
思路
所以我只能下载文件了,隐约记得应该下载soft文件,可我不记得怎么读取了。
解决方案
1.搜索:
找到了第一个链接:
https://warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/geo/
哦 原来是这个样子,getGEO可以。但是我的soft好大,接近三百兆,很容易下载不完整。
2.从Rstudio自带的terminal运行linux,下载soft文件到本地
用wget来下载到本地再读取,刚好Rstudio可以运行linux哦,之前发过一次:https://www.jianshu.com/p/c16c7095e4b2
wget -c ftp://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL16nnn/GPL16956/soft/GPL16956_family.soft.gz
gunzip GPL16956_family.soft.gz
3.读取数据并处理
下载的是压缩包,解压不解压都行的。都可以读取
刚才的帖子里有用信息就是读取方法:
library(GEOquery)
x=getGEO(filename = "GPL16956_family.soft")
step1:从GPL中提取有用的部分,只需要数据框的’ID’,'SEQUENCE’两列。
y=x@dataTable@table[,c('ID','SEQUENCE')]
head(y)
## ID
## 1 ASHGA5P000001
## 2 ASHGA5P000002
## 3 ASHGA5P000003
## 4 ASHGA5P000004
## 5 ASHGA5P000005
## 6 ASHGA5P000006
## SEQUENCE
## 1 GAGATAACTAGAACAGTGTCCCTCCCCTTTTATAACCTGTGTTTTTAGATTTCAAAAAGA
## 2 CTAACTGTCAAACAAAGATGGGCTAATTAAACGGATGCCAACTAATCGGAAAACTTCTGG
## 3 CCAGAAGCTCCAGCATGGTCCACCTACCAGAAGCTCCAGCATGATTTACCCATGAGTAGC
## 4 TAAGTACGGTAATGCAAGAAGGTGCAGCAGGCGTAGTGCATTACAGCAACACTGAAAAAC
## 5 TATGAGAGATGAAGAGATAAGGTCTAACTGTTACCTGGGCTGAACCCTTGGACTTCTAGG
## 6 CAGTAGCGAGTATTTACTAAGTACTTTCTATTTGCGAGGCCCTGATAAAAGTACTGTCCT
#断线好几次 保存一下
save(y,file = "16956.Rdata")
step2.重复值统计
几千万行肯定不对,进行去重统计,发现每个都重复了八百多次!不明白为什么。
library(tidyverse)
x=count(y,ID,sort = T)
head(x)
## # A tibble: 6 x 2
## ID n
## <chr> <int>
## 1 !Sample_channel_count = 1 875
## 2 !Sample_molecule_ch1 = total RNA 875
## 3 !Sample_organism_ch1 = Homo sapiens 875
## 4 !Sample_platform_id = GPL16956 875
## 5 !sample_table_begin 875
## 6 !sample_table_end 875
step3.去重复,两种方法
test1=distinct(y,ID,SEQUENCE)
test2=y[!duplicated(y),]
发现有一些叹号开头的行没有被跳过,有NA。所以去除NA。与GEO网页上的行数进行对比,发现一致,说明正确了。保存一下
test3= na.omit(test2)
nrow(test3)
## [1] 58944
save(test3,file = "GPL16956.Rdata")
write.csv(test3,file = "GPL16956.csv",row.names = F)
test4=read.csv("GPL16956.csv")
head(test4)
## ID
## 1 ASHGA5P000001
## 2 ASHGA5P000002
## 3 ASHGA5P000003
## 4 ASHGA5P000004
## 5 ASHGA5P000005
## 6 ASHGA5P000006
## SEQUENCE
## 1 GAGATAACTAGAACAGTGTCCCTCCCCTTTTATAACCTGTGTTTTTAGATTTCAAAAAGA
## 2 CTAACTGTCAAACAAAGATGGGCTAATTAAACGGATGCCAACTAATCGGAAAACTTCTGG
## 3 CCAGAAGCTCCAGCATGGTCCACCTACCAGAAGCTCCAGCATGATTTACCCATGAGTAGC
## 4 TAAGTACGGTAATGCAAGAAGGTGCAGCAGGCGTAGTGCATTACAGCAACACTGAAAAAC
## 5 TATGAGAGATGAAGAGATAAGGTCTAACTGTTACCTGGGCTGAACCCTTGGACTTCTAGG
## 6 CAGTAGCGAGTATTTACTAAGTACTTTCTATTTGCGAGGCCCTGATAAAAGTACTGTCCT
sessionInfo()
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-conda_cos6-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
##
## Matrix products: default
## BLAS/LAPACK: /pub/anaconda3/lib/R/lib/libRblas.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] knitr_1.22 forcats_0.4.0 stringr_1.4.0
## [4] dplyr_0.8.0.1 purrr_0.3.2 readr_1.3.1
## [7] tidyr_0.8.3 tibble_2.1.1 ggplot2_3.1.0
## [10] tidyverse_1.2.1 GEOquery_2.50.5 Biobase_2.42.0
## [13] BiocGenerics_0.28.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 cellranger_1.1.0 pillar_1.3.1 compiler_3.5.1
## [5] plyr_1.8.4 tools_3.5.1 evaluate_0.13 lubridate_1.7.4
## [9] jsonlite_1.6 nlme_3.1-137 gtable_0.3.0 lattice_0.20-38
## [13] pkgconfig_2.0.2 rlang_0.3.3 cli_1.1.0 rstudioapi_0.10
## [17] yaml_2.2.0 xfun_0.5 haven_2.1.0 withr_2.1.2
## [21] httr_1.4.0 xml2_1.2.0 generics_0.0.2 hms_0.4.2
## [25] grid_3.5.1 tidyselect_0.2.5 glue_1.3.1 R6_2.4.0
## [29] fansi_0.4.0 readxl_1.3.1 limma_3.38.3 modelr_0.1.4
## [33] magrittr_1.5 backports_1.1.3 scales_1.0.0 rvest_0.3.2
## [37] assertthat_0.2.1 colorspace_1.4-1 utf8_1.1.4 stringi_1.4.3
## [41] lazyeval_0.2.2 munsell_0.5.0 broom_0.5.1 crayon_1.3.4