Last update: 2018-02-02
Code version: 106f0fe3ceb59d559c7cc55e0482b2b4ad248ae6
Also loading important libraries and custom functions for analysis.
seq_dir <- "/Volumes/PAULHOOK/sc-da-parkinsons/data"
file_dir <- "/Volumes/PAULHOOK/sc-da-parkinsons/output"
Rdata_dir <- "/Volumes/PAULHOOK/sc-da-parkinsons/data"
Script_dir <- "/Volumes/PAULHOOK/sc-da-parkinsons/code"
source(file.path(Script_dir,'init.R'))
source(file.path(Script_dir,"tools_R.r"))
library(biomaRt)
First, we need to download all GWAS SNPs from the NHGRI GWAS catalog. The current NHGRI catalog was downloaded from https://www.ebi.ac.uk/gwas/api/search/downloads/full on December 1, 2017. Information about column headers are here (https://www.ebi.ac.uk/gwas/docs/fileheaders). The data was then loaded in to R and data was cleaned.
gwas.cat.meta <- read.delim(file.path(file_dir,"gwas_catalog_v1.0-associations_e90_r2017-11-27.tsv")) %>%
dplyr::select(SNP_ID_CURRENT, DISEASE.TRAIT, PUBMEDID) %>%
dplyr::mutate(snp_id = paste0("rs",SNP_ID_CURRENT)) %>%
dplyr::rename(trait = DISEASE.TRAIT, pmid = PUBMEDID) %>%
dplyr::select(snp_id, trait, pmid)
## Warning: package 'bindrcpp' was built under R version 3.3.2
head(gwas.cat.meta)
## snp_id trait pmid
## 1 rs4950928 YKL-40 levels 18403759
## 2 rs7993214 Psoriasis 18369459
## 3 rs8034191 Lung cancer 18385676
## 4 rs2808630 Lung cancer 18385676
## 5 rs7626795 Lung cancer 18385676
## 6 rs8034191 Lung cancer 18385738
# Using GRCh38 - run on 12-1-17 - important to use 87
snpmart<-useEnsembl(biomart = "snp",dataset = 'hsapiens_snp', version = 87)
# Looking up the SNPs. This is going to take a while.
snpPos<-getBM(attributes = c('refsnp_id','chr_name','chrom_start','chrom_end'),
filters = c('snp_filter'),
values = as.character(gwas.cat.meta$snp_id),
mart = snpmart)
snpPos<-snpPos[!duplicated(snpPos$refsnp_id),]
names(snpPos) <- c('snp','chr','start','end')
snpPos <- snpPos[,c(2,3,4,1)]
head(snpPos)
## chr start end snp
## 1 21 33060745 33060745 rs1000005
## 2 4 144312789 144312789 rs10000225
## 3 5 97197765 97197765 rs1000083
## 4 5 150860514 150860514 rs1000113
## 5 4 12578222 12578222 rs10002492
## 6 4 38783103 38783103 rs10004195
mega.snp <- snpPos
# +/- 1 Mb
mega.snp$upstream <- mega.snp$start-1e+06
mega.snp$downstream <- mega.snp$end+1e+06
# add chromosome coordinates
mega.snp$coordinates <- paste0(mega.snp$chr,":",mega.snp$upstream,":", mega.snp$downstream)
saveRDS(mega.snp, file = "mega.snp.rds")
# Using biomart to find genes
ensembl<-useEnsembl("ensembl", dataset = "hsapiens_gene_ensembl", version = 87)
# From https://www.biostars.org/p/167818/
mega.coords.list <- as.list(mega.snp$coordinates)
saveRDS(mega.coords.list, file = file.path(Rdata_dir,"mega.coords.list.rds"))
In order to find all the genes within +/- 1 Mb of each lead SNP in the NHGRI GWAS catalog, the following for loop was used. This for loop takes coordinates and pulls out information about all genes within that region. There are ~40,000 SNPs in the GWAS catalog, so this script had to be run in parallel which involved splitting the SNP coordinates in to 10 separate files. All of the SNP-gene tables were combined in to a final list called “gwas.snp_results.txt” that is processed in another script.
# Finding all genes in the intervials in CRCh38, ensembl version 87
mega.results <- data.frame() # creating blank df
for(i in 1:length(mega.coords.list)){
results<-getBM(attributes = c("hgnc_symbol", "chromosome_name", "start_position",
"end_position","gene_biotype"),
filters = c("chromosomal_region"),
values = list(chromosomal_region=mega.coords.list[i]),
mart = ensembl) # finding the genes through biomaRt
results$snp <- mega.snp$snp[i] # populating the SNP column
mega.results <- rbind(mega.results,results) # Adding new rows on to df
mega.results <- mega.results[!(mega.results$hgnc_symbol == ""),] # Removing blank gene names
}
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.6 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid splines stats4 parallel stats graphics grDevices
## [8] utils datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2 biomaRt_2.28.0 ggbiplot_0.55
## [4] scales_0.5.0 SC3_1.1.4 ROCR_1.0-7
## [7] jackstraw_1.1.1 lfa_1.2.2 tsne_0.1-3
## [10] gridExtra_2.3 slackr_1.4.2 vegan_2.4-4
## [13] permute_0.9-4 MASS_7.3-47 gplots_3.0.1
## [16] RColorBrewer_1.1-2 Hmisc_4.0-3 Formula_1.2-2
## [19] survival_2.41-3 lattice_0.20-35 Heatplus_2.18.0
## [22] Rtsne_0.13 pheatmap_1.0.8 tidyr_0.7.1
## [25] dplyr_0.7.4 plyr_1.8.4 heatmap.plus_1.3
## [28] stringr_1.2.0 marray_1.50.0 limma_3.28.21
## [31] reshape2_1.4.3 monocle_2.2.0 DDRTree_0.1.5
## [34] irlba_2.2.1 VGAM_1.0-2 ggplot2_2.2.1
## [37] Biobase_2.32.0 BiocGenerics_0.18.0 Matrix_1.2-11
##
## loaded via a namespace (and not attached):
## [1] RSelenium_1.7.1 colorspace_1.3-2 class_7.3-14
## [4] rprojroot_1.2 htmlTable_1.9 corpcor_1.6.9
## [7] base64enc_0.1-3 bit64_0.9-7 AnnotationDbi_1.34.4
## [10] mvtnorm_1.0-6 codetools_0.2-15 doParallel_1.0.11
## [13] robustbase_0.92-7 knitr_1.17 jsonlite_1.5
## [16] cluster_2.0.6 semver_0.2.0 shiny_1.0.5
## [19] rrcov_1.4-3 httr_1.3.1 backports_1.1.1
## [22] assertthat_0.2.0 lazyeval_0.2.1 acepack_1.4.1
## [25] htmltools_0.3.6 tools_3.3.0 igraph_1.1.2
## [28] gtable_0.2.0 glue_1.1.1 binman_0.1.0
## [31] doRNG_1.6.6 Rcpp_0.12.14 slam_0.1-37
## [34] gdata_2.18.0 nlme_3.1-131 iterators_1.0.8
## [37] mime_0.5 rngtools_1.2.4 gtools_3.5.0
## [40] WriteXLS_4.0.0 XML_3.98-1.9 DEoptimR_1.0-8
## [43] yaml_2.1.15 memoise_1.1.0 pkgmaker_0.22
## [46] rpart_4.1-11 RSQLite_2.0 fastICA_1.2-1
## [49] latticeExtra_0.6-28 stringi_1.1.5 S4Vectors_0.10.3
## [52] pcaPP_1.9-72 foreach_1.4.3 e1071_1.6-8
## [55] checkmate_1.8.4 caTools_1.17.1 rlang_0.1.6
## [58] pkgconfig_2.0.1 matrixStats_0.52.2 bitops_1.0-6
## [61] qlcMatrix_0.9.5 evaluate_0.10.1 purrr_0.2.4
## [64] bindr_0.1 htmlwidgets_0.9 bit_1.1-12
## [67] magrittr_1.5 R6_2.2.2 IRanges_2.6.1
## [70] combinat_0.0-8 DBI_0.7 wdman_0.2.2
## [73] foreign_0.8-69 mgcv_1.8-22 RCurl_1.95-4.9
## [76] nnet_7.3-12 tibble_1.3.4 KernSmooth_2.23-15
## [79] rmarkdown_1.8 data.table_1.10.4 blob_1.1.0
## [82] HSMMSingleCell_0.106.2 digest_0.6.12 xtable_1.8-2
## [85] httpuv_1.3.5 openssl_0.9.7 munsell_0.4.3
## [88] registry_0.3
This R Markdown site was created with workflowr