Read in the data we were given
setwd("o:/Training/Advanced R/Advanced R Data/")
read.delim("methylation.txt") -> methylation
read.delim("expression.txt") -> expression
Remove any lines from the methylation data where the promoter methylation equals -1 (ie not actually measured), and see what effect this has on the number of rows of data we have
nrow(methylation)
## [1] 25077
methylation[methylation$Promoter_meth != -1,] -> methylation
nrow(methylation)
## [1] 23974
Deduplicate the expression data based on probe names. We’re going to leave one instance of each duplicated name behind.
expression[!duplicated(expression$Probe),] -> expression
Find out how many olfactory receptors we have in our expression data
sum(grepl("Olfr",expression$Probe))
## [1] 1118
Change the chromosome names from 1,2,3 etc, to chr1,chr2,chr3 etc.
expression$Chromosome <- paste("chr",expression$Chromosome,sep="")
Find out which probes are present in the methylation data, but not the expression data
methylation$Probe[!(methylation$Probe %in% expression$Probe)]
## [1] AC027184.1 AC044807.1 AC044864.2 AC069308.1
## [5] AC087117.2 AC087556.1 AC087898.1 AC093346.1
## [9] AC099934.1 AC100600.1 AC102240.1 AC102463.1
## [13] AC102525.1 AC102689.1 AC102815.1 AC103359.1
## [17] AC103368.1 AC107711.2 AC107770.1 AC108409.1
## [21] AC108434.1 AC109138.1 AC109196.1 AC110211.1
## [25] AC110235.1 AC113244.1 AC113316.1 AC113598.1
## [29] AC114657.1 AC114668.1 AC115005.1 AC115697.1
## [33] AC117657.1 AC119813.1 AC120136.3 AC120540.1
## [37] AC121599.1 AC121600.1 AC121784.1 AC121784.4
## [41] AC122247.1 AC122490.1 AC122520.2 AC122537.1
## [45] AC122546.1 AC122860.1 AC123610.1 AC123616.1
## [49] AC123882.1 AC124180.1 AC124466.1 AC124472.1
## [53] AC124489.1 AC124584.1 AC124613.1 AC124807.1
## [57] AC125201.1 AC126679.1 AC126690.2 AC127419.1
## [61] AC132367.1 AC133195.1 AC133646.1 AC134596.1
## [65] AC136021.1 AC137708.1 AC138311.2 AC139241.1
## [69] AC139317.1 AC140327.1 AC141881.8 AC142212.1
## [73] AC147101.1 AC148990.1 AC149593.1 AC149599.1
## [77] AC150277.1 AC152062.1 AC152063.1 AC152453.1
## [81] AC152826.1 AC153556.1 AC154187.1 AC154247.1
## [85] AC154473.1 AC154495.1 AC156026.1 AC156032.1
## [89] AC156953.1 AC157822.1 AC158667.1 AC158898.1
## [93] AC158995.1 AC159379.1 AC159624.1 AC160757.1
## [97] AC161246.1 AC161366.1 AC162464.1 AC162799.1
## [101] AC163215.1 AC163703.1 AC164408.1 AC165361.1
## [105] AC166172.1 AC166290.1 AC166574.2 AC167242.1
## [109] AC167245.1 AC170189.1 AI324046 AL589670.1
## [113] AL590969.1 AL591146.1 AL591712.1 AL591905.1
## [117] AL596123.1 AL596258.1 AL603708.1 AL603828.1
## [121] AL606908.1 AL626775.1 AL626784.1 AL645470.1
## [125] AL645971.1 AL646096.2 AL663027.1 AL669947.1
## [129] AL669982.1 AL671866.1 AL671882.1 AL671913.1
## [133] AL672070.1 AL672076.1 AL672246.1 AL683882.1
## [137] AL713978.1 AL732564.1 AL732626.1 AL772271.1
## [141] AL772271.4 AL772272.1 AL772349.1 AL772404.1
## [145] AL806517.1 AL807811.1 AL831731.1 AL844480.1
## [149] AL844840.2 AL844869.1 AL844871.1 AL845174.1
## [153] AL928544.2 AL929011.1 AL929371.2 AL954388.2
## [157] Art2a-ps BX005053.1 BX470216.1 BX537299.1
## [161] BX537327.1 BX649539.1 Cdk3-ps CPEB3_ribozyme
## [165] CR956641.1 CR974466.1 CT010479.2 CT025673.1
## [169] CT030240.1 CT030723.1 CT485616.1 CT573086.1
## [173] Gm10895 Gm10907 Gm13892 Gm13893
## [177] Gm13909 Gm13934 Gm13950 Gm14030
## [181] Gm16603 Gm16611 Gm16612 Gm16886
## [185] Gm16888 Gm16891 Gm16931 Gm16971
## [189] Gm16979 Gm17002 Gm17004 Gm17011
## [193] Gm17015 Gm17347 Gm5074 H2-Bl
## [197] H2-T10 Ighg Ighg1 Ighm
## [201] Ighv3-3 Igkc Igkv12-38 Igkv12-89
## [205] Igkv18-36 Igkv6-32 Mir1197 Mir1199
## [209] Mir1247 Mir1249 Mir124a-3 Mir126
## [213] Mir129-2 Mir132 Mir137 Mir138-2
## [217] Mir149 Mir154 Mir1893 Mir1901
## [221] Mir1905 Mir1932 Mir1938 Mir1945
## [225] Mir1964 Mir1983 Mir210 Mir212
## [229] Mir218-1 Mir218-2 Mir219-2 Mir26a-1
## [233] Mir320 Mir323 Mir339 Mir341
## [237] Mir375 Mir411 Mir425 Mir431
## [241] Mir434 Mir486 Mir543 Mir574
## [245] Mir598 Mir666 Mir680-1 Mir702
## [249] Mir707 Mir718 Mir760 Mir762
## [253] Mir770 Mir92b Mir99a Mir99b
## [257] mmu-mir-2134-1 mmu-mir-2134-2 mmu-mir-2134-4 mmu-mir-689-2
## [261] mt-Rnr2 n-R5-8s1 n-R5s109 n-R5s151
## [265] n-R5s156 n-R5s172 n-R5s175 n-R5s193
## [269] n-R5s194 n-R5s2 n-R5s28 n-R5s33
## [273] n-R5s56 n-R5s71 n-R5s92 n-R5s96
## [277] Nlrp1c-ps NRON Ptcra RNase_MRP
## [281] Scarna10 Scarna13 SCARNA13 Scarna2
## [285] Scnm1 Serpinb10-ps Slc22a13b SNORA53
## [289] Snora61 SNORA71 Snora73a Snora73b
## [293] Snora78 Snord104 Snord13 snoU54
## [297] Tcra-V8 Tcrg-C Tcrg-V4 Tcrg-V5
## [301] Tcrg-V6 Tcrg-V7 Telomerase-vert Trac
## [305] Trav1 Trav10n Trav13-2 Trav17
## [309] Trav18 Trav2 Trav3-1 Trav3-3
## [313] Trav3-4 Trav6d-3 Trav6d-5 Trav7-6
## [317] Trav8d-1 Trbc1 Trbc2 Trbv1
## [321] Trbv13-3 Trbv16 Trbv19 Trbv2
## [325] Trbv31 Trbv4 Trbv5 Trdv1
## [329] Trdv2-1 Trdv5 U11 U12
## [333] U2 U3 U4atac
## 25077 Levels: 0610005C13Rik 0610007C21Rik 0610007L01Rik ... Zzz3
Merge the expression and methylation datasets together. The only column name in common is Probe, which is what we want to merge on anyway.
merge(expression,methylation) -> merged.data
Loop over the expression/methylation columns to calculate the range of values in each.
apply(merged.data[,6:8],2,range)
## Expression Gene_body_meth Promoter_meth
## [1,] -9.743026 0 0
## [2,] 11.321041 100 100
Calculate the mean gene body methylation for each chromosome, then plot this as a barplot
tapply(merged.data$Gene_body_meth,merged.data$Chromosome,mean)
## chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16
## 43.58484 46.10247 39.33736 45.42464 42.77284 41.64486 40.85106 44.89734
## chr17 chr18 chr19 chr2 chr3 chr4 chr5 chr6
## 42.08896 43.12389 43.66956 39.22779 40.02630 41.98396 48.95195 40.50904
## chr7 chr8 chr9 chrMT chrX
## 41.69334 47.86813 43.31258 16.70353 40.79321
barplot(tapply(merged.data$Gene_body_meth,merged.data$Chromosome,mean))
Set up a function to calculate the SEM
sem <- function (x,absent=NA) {
if (sum(is.na(x)) > 0) return (absent)
return (sd(x)/sqrt(length(x)))
}
Calculate the gene body methylation SEM for each chromosome.
tapply(merged.data$Gene_body_meth,merged.data$Chromosome,sem)
## chr1 chr10 chr11 chr12 chr13 chr14 chr15
## 0.8997813 1.0365418 0.7527817 1.2446569 1.1330741 1.1697267 1.1343304
## chr16 chr17 chr18 chr19 chr2 chr3 chr4
## 1.2531691 0.9920431 1.4073080 1.1942370 0.7315191 0.9978256 0.8167108
## chr5 chr6 chr7 chr8 chr9 chrMT chrX
## 0.8783216 0.9507895 0.7829240 0.9606002 0.9308829 3.7439534 1.0516666
Install the vioplot library and use it to draw violin plots for the 3 datasets we have.
library(vioplot)
par(mfrow=c(1,3))
sapply(colnames(merged.data)[6:8],function(x) {
vioplot(merged.data[[x]])
title(x)
}
)