Creating GWAS SNP-Gene Master Table

Last update: 2018-02-02

Code version: 106f0fe3ceb59d559c7cc55e0482b2b4ad248ae6

Setting important directories.

Also loading important libraries and custom functions for analysis.

seq_dir <- "/Volumes/PAULHOOK/sc-da-parkinsons/data"
file_dir <- "/Volumes/PAULHOOK/sc-da-parkinsons/output"
Rdata_dir <- "/Volumes/PAULHOOK/sc-da-parkinsons/data"
Script_dir <- "/Volumes/PAULHOOK/sc-da-parkinsons/code"
source(file.path(Script_dir,'init.R'))
source(file.path(Script_dir,"tools_R.r"))

library(biomaRt)

Downloading and loading in the NHGRI data

First, we need to download all GWAS SNPs from the NHGRI GWAS catalog. The current NHGRI catalog was downloaded from https://www.ebi.ac.uk/gwas/api/search/downloads/full on December 1, 2017. Information about column headers are here (https://www.ebi.ac.uk/gwas/docs/fileheaders). The data was then loaded in to R and data was cleaned.

gwas.cat.meta <- read.delim(file.path(file_dir,"gwas_catalog_v1.0-associations_e90_r2017-11-27.tsv")) %>%
  dplyr::select(SNP_ID_CURRENT, DISEASE.TRAIT, PUBMEDID) %>%
  dplyr::mutate(snp_id = paste0("rs",SNP_ID_CURRENT)) %>%
  dplyr::rename(trait = DISEASE.TRAIT, pmid = PUBMEDID) %>%
  dplyr::select(snp_id, trait, pmid)

## Warning: package 'bindrcpp' was built under R version 3.3.2

head(gwas.cat.meta)

##      snp_id         trait     pmid
## 1 rs4950928 YKL-40 levels 18403759
## 2 rs7993214     Psoriasis 18369459
## 3 rs8034191   Lung cancer 18385676
## 4 rs2808630   Lung cancer 18385676
## 5 rs7626795   Lung cancer 18385676
## 6 rs8034191   Lung cancer 18385738

Getting SNP coordinates

# Using GRCh38 - run on 12-1-17 - important to use 87
snpmart<-useEnsembl(biomart = "snp",dataset = 'hsapiens_snp', version = 87)

# Looking up the SNPs. This is going to take a while.
snpPos<-getBM(attributes = c('refsnp_id','chr_name','chrom_start','chrom_end'), 
      filters = c('snp_filter'), 
      values = as.character(gwas.cat.meta$snp_id),
      mart = snpmart)

snpPos<-snpPos[!duplicated(snpPos$refsnp_id),]

names(snpPos) <- c('snp','chr','start','end')

snpPos <- snpPos[,c(2,3,4,1)]

head(snpPos)

##   chr     start       end        snp
## 1  21  33060745  33060745  rs1000005
## 2   4 144312789 144312789 rs10000225
## 3   5  97197765  97197765  rs1000083
## 4   5 150860514 150860514  rs1000113
## 5   4  12578222  12578222 rs10002492
## 6   4  38783103  38783103 rs10004195

Calculating +/- 1 Mb lead SNP coordinates

mega.snp <- snpPos

# +/- 1 Mb
mega.snp$upstream <- mega.snp$start-1e+06
mega.snp$downstream <- mega.snp$end+1e+06

# add chromosome coordinates
mega.snp$coordinates <- paste0(mega.snp$chr,":",mega.snp$upstream,":", mega.snp$downstream)
saveRDS(mega.snp, file = "mega.snp.rds")

# Using biomart to find genes
ensembl<-useEnsembl("ensembl", dataset = "hsapiens_gene_ensembl", version = 87)

# From https://www.biostars.org/p/167818/
mega.coords.list <- as.list(mega.snp$coordinates)
saveRDS(mega.coords.list, file = file.path(Rdata_dir,"mega.coords.list.rds"))

Finding all genes

In order to find all the genes within +/- 1 Mb of each lead SNP in the NHGRI GWAS catalog, the following for loop was used. This for loop takes coordinates and pulls out information about all genes within that region. There are ~40,000 SNPs in the GWAS catalog, so this script had to be run in parallel which involved splitting the SNP coordinates in to 10 separate files. All of the SNP-gene tables were combined in to a final list called “gwas.snp_results.txt” that is processed in another script.

# Finding all genes in the intervials in CRCh38, ensembl version 87
mega.results <- data.frame() # creating blank df
for(i in 1:length(mega.coords.list)){
  results<-getBM(attributes = c("hgnc_symbol", "chromosome_name", "start_position",
                                "end_position","gene_biotype"),
                 filters = c("chromosomal_region"),
                 values = list(chromosomal_region=mega.coords.list[i]), 
                 mart = ensembl) # finding the genes through biomaRt
  results$snp <- mega.snp$snp[i] # populating the SNP column
  mega.results <- rbind(mega.results,results) # Adding new rows on to df
  mega.results <- mega.results[!(mega.results$hgnc_symbol == ""),] # Removing blank gene names
}

Session Info

sessionInfo()

## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.6 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
##  [1] grid      splines   stats4    parallel  stats     graphics  grDevices
##  [8] utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2        biomaRt_2.28.0      ggbiplot_0.55      
##  [4] scales_0.5.0        SC3_1.1.4           ROCR_1.0-7         
##  [7] jackstraw_1.1.1     lfa_1.2.2           tsne_0.1-3         
## [10] gridExtra_2.3       slackr_1.4.2        vegan_2.4-4        
## [13] permute_0.9-4       MASS_7.3-47         gplots_3.0.1       
## [16] RColorBrewer_1.1-2  Hmisc_4.0-3         Formula_1.2-2      
## [19] survival_2.41-3     lattice_0.20-35     Heatplus_2.18.0    
## [22] Rtsne_0.13          pheatmap_1.0.8      tidyr_0.7.1        
## [25] dplyr_0.7.4         plyr_1.8.4          heatmap.plus_1.3   
## [28] stringr_1.2.0       marray_1.50.0       limma_3.28.21      
## [31] reshape2_1.4.3      monocle_2.2.0       DDRTree_0.1.5      
## [34] irlba_2.2.1         VGAM_1.0-2          ggplot2_2.2.1      
## [37] Biobase_2.32.0      BiocGenerics_0.18.0 Matrix_1.2-11      
## 
## loaded via a namespace (and not attached):
##  [1] RSelenium_1.7.1        colorspace_1.3-2       class_7.3-14          
##  [4] rprojroot_1.2          htmlTable_1.9          corpcor_1.6.9         
##  [7] base64enc_0.1-3        bit64_0.9-7            AnnotationDbi_1.34.4  
## [10] mvtnorm_1.0-6          codetools_0.2-15       doParallel_1.0.11     
## [13] robustbase_0.92-7      knitr_1.17             jsonlite_1.5          
## [16] cluster_2.0.6          semver_0.2.0           shiny_1.0.5           
## [19] rrcov_1.4-3            httr_1.3.1             backports_1.1.1       
## [22] assertthat_0.2.0       lazyeval_0.2.1         acepack_1.4.1         
## [25] htmltools_0.3.6        tools_3.3.0            igraph_1.1.2          
## [28] gtable_0.2.0           glue_1.1.1             binman_0.1.0          
## [31] doRNG_1.6.6            Rcpp_0.12.14           slam_0.1-37           
## [34] gdata_2.18.0           nlme_3.1-131           iterators_1.0.8       
## [37] mime_0.5               rngtools_1.2.4         gtools_3.5.0          
## [40] WriteXLS_4.0.0         XML_3.98-1.9           DEoptimR_1.0-8        
## [43] yaml_2.1.15            memoise_1.1.0          pkgmaker_0.22         
## [46] rpart_4.1-11           RSQLite_2.0            fastICA_1.2-1         
## [49] latticeExtra_0.6-28    stringi_1.1.5          S4Vectors_0.10.3      
## [52] pcaPP_1.9-72           foreach_1.4.3          e1071_1.6-8           
## [55] checkmate_1.8.4        caTools_1.17.1         rlang_0.1.6           
## [58] pkgconfig_2.0.1        matrixStats_0.52.2     bitops_1.0-6          
## [61] qlcMatrix_0.9.5        evaluate_0.10.1        purrr_0.2.4           
## [64] bindr_0.1              htmlwidgets_0.9        bit_1.1-12            
## [67] magrittr_1.5           R6_2.2.2               IRanges_2.6.1         
## [70] combinat_0.0-8         DBI_0.7                wdman_0.2.2           
## [73] foreign_0.8-69         mgcv_1.8-22            RCurl_1.95-4.9        
## [76] nnet_7.3-12            tibble_1.3.4           KernSmooth_2.23-15    
## [79] rmarkdown_1.8          data.table_1.10.4      blob_1.1.0            
## [82] HSMMSingleCell_0.106.2 digest_0.6.12          xtable_1.8-2          
## [85] httpuv_1.3.5           openssl_0.9.7          munsell_0.4.3         
## [88] registry_0.3

This R Markdown site was created with workflowr