  • Exercise 1 - Simple point and line plots
    • Weight chart
    • Chromosome position
    • Genomes
  • Exercise 2 - Barplots and Distributions
    • Small file
    • Expression
    • Cancer
    • Child variants
  • Exercise 3 - Annotation, Scaling and Colours
    • Themes
    • Cancer
    • Brain Bodyweight
  • Exercise 4 - Summary Overlays
    • Tidy1
  • Exercise 5 - Faceting and Highlighting
    • Up down expression
    • Download Festival

This is a worked set of answers to the exercises in the ggplot course.

Exercise 1 - Simple point and line plots

First we are going to load the main tidyverse library.

library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Weight chart

We’ll plot out the data in the weight_chart.txt file. Let’s load it and look first.

read_tsv("weight_chart.txt") -> weight
## Parsed with column specification:
## cols(
##   Age = col_double(),
##   Weight = col_double()
## )
weight
##   Age Weight
## <dbl>  <dbl>
##     0    3.6
##     1    4.4
##     2    5.2
##     3    6.0
##     4    6.6
##     5    7.2
##     6    7.8
##     7    8.4
##     8    8.8
##     9    9.2

We’ll start with a simple plot, just setting the minimum aesthetics.

weight %>%
  ggplot(aes(x=Age, y=Weight)) +
  geom_point()

Now we can customise this a bit by adding fixed aesthetics to the geom_point() function.

weight %>%
  ggplot(aes(x=Age, y=Weight)) +
  geom_point(size=3, colour="blue2")
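
To make the difference between a fixed and a mapped aesthetic concrete, here is a minimal sketch. It rebuilds the weight data by hand from the values printed earlier so that it runs without the file; the names `fixed_plot` and `mapped_plot` are only illustrative.

```r
library(ggplot2)

# Values copied from the weight table printed earlier
weight <- data.frame(
  Age = 0:9,
  Weight = c(3.6, 4.4, 5.2, 6.0, 6.6, 7.2, 7.8, 8.4, 8.8, 9.2)
)

# Fixed aesthetic: set outside aes(), so every point gets the same colour
fixed_plot <- ggplot(weight, aes(x = Age, y = Weight)) +
  geom_point(size = 3, colour = "blue2")

# Mapped aesthetic: set inside aes(), so colour varies with the data
mapped_plot <- ggplot(weight, aes(x = Age, y = Weight, colour = Weight)) +
  geom_point(size = 3)

# layer_data() returns the values ggplot2 will actually draw
unique(layer_data(fixed_plot)$colour)
length(unique(layer_data(mapped_plot)$colour))
```

The fixed version produces a single colour for the whole layer, whereas the mapped version produces a range of colours and would also get a legend.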

Now repeat but with a different geometry.

weight %>%
  ggplot(aes(x=Age, y=Weight)) +
  geom_line()

Finally, combine the two geometries.

weight %>%
  ggplot(aes(x=Age, y=Weight)) +
  geom_line()+
  geom_point(size=3, colour="blue2")

Chromosome position

Now let’s look at the chromosome_position_data.txt file.

read_tsv("chromosome_position_data.txt") -> chr.data
## Parsed with column specification:
## cols(
##   Position = col_double(),
##   Mut1 = col_double(),
##   Mut2 = col_double(),
##   WT = col_double()
## )
head(chr.data)
## Position     Mut1   Mut2    WT
##    <dbl>    <dbl>  <dbl> <dbl>
## 91757273 2.707354 1.3375 1.250
## 91757323 2.707354 1.3000 1.250
## 91757373 5.414707 1.1375 1.250
## 91757423 2.707354 1.5750 1.875
## 91757473 2.707354 1.1875 1.250
## 91757523 2.707354 2.8250 2.500

We have the data in three separate columns at the moment so we need to use pivot_longer to put them into a single column.

chr.data %>%
  pivot_longer(cols=-Position, names_to = "sample", values_to = "value") -> chr.data

head(chr.data)
## Position sample    value
##    <dbl> <chr>     <dbl>
## 91757273 Mut1   2.707354
## 91757273 Mut2   1.337500
## 91757273 WT     1.250000
## 91757323 Mut1   2.707354
## 91757323 Mut2   1.300000
## 91757323 WT     1.250000
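
As a self-contained check of what the restructuring does, the sketch below repeats it on just the first two positions, typed in by hand from the output above. The object names `wide` and `long` are illustrative and not part of the course data.

```r
library(tidyr)

# First two positions from the wide table printed earlier
wide <- data.frame(
  Position = c(91757273, 91757323),
  Mut1 = c(2.707354, 2.707354),
  Mut2 = c(1.3375, 1.3000),
  WT = c(1.25, 1.25)
)

# Every column except Position is stacked into sample/value pairs
long <- pivot_longer(wide, cols = -Position,
                     names_to = "sample", values_to = "value")

# 2 positions x 3 samples gives 6 rows
nrow(long)
long
```

Each original row becomes one row per pivoted column, which is the layout ggplot needs in order to map sample onto colour.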

Now we can plot out a line graph of the position vs value for each of the samples. We’ll use colour to distinguish the lines for each sample.

chr.data %>%
  ggplot(aes(x=Position, y=value, colour=sample)) +
  geom_line(size=1)

Genomes

Finally we’re going to look at the genome size vs number of chromosomes and colour it by domain in our genomes data.

read_csv("genomes.csv") -> genomes
## Parsed with column specification:
## cols(
##   Organism = col_character(),
##   Groups = col_character(),
##   Size = col_double(),
##   Chromosomes = col_double(),
##   Organelles = col_double(),
##   Plasmids = col_double(),
##   Assemblies = col_double()
## )
head(genomes)
## Organism
## <chr>
## 'Brassica napus' phytoplasma
## 'Candidatus Kapabacteria' thiocyanatum
## 'Catharanthus roseus' aster yellows phytoplasma
## 'Chrysanthemum coronarium' phytoplasma
## 'Echinacea purpurea' witches'-broom phytoplasma
## 'Osedax' symbiont bacterium Rs2_46_30_T18

To get at the Domain we’ll need to split apart the Groups field.

genomes %>%
  separate(col=Groups, into=c("Domain","Kingdom","Class"), sep=";") -> genomes

head(genomes)
## Organism                                        Domain
## <chr>                                           <chr>
## 'Brassica napus' phytoplasma                    Bacteria
## 'Candidatus Kapabacteria' thiocyanatum          Bacteria
## 'Catharanthus roseus' aster yellows phytoplasma Bacteria
## 'Chrysanthemum coronarium' phytoplasma          Bacteria
## 'Echinacea purpurea' witches'-broom phytoplasma Bacteria
## 'Osedax' symbiont bacterium Rs2_46_30_T18       Bacteria
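
The Groups column itself is not visible in the output above, so the following sketch uses made-up strings to show how the split behaves. Only the ";" separator and the three new column names come from the answer; the organism names and group values are hypothetical.

```r
library(tidyr)

# Hypothetical Groups strings; only the ";" separator is taken from the text
toy <- data.frame(
  Organism = c("organism_a", "organism_b"),
  Groups = c("Bacteria;GroupX;ClassY", "Eukaryota;GroupZ;ClassW")
)

# Groups is replaced by one new column per piece
split_toy <- separate(toy, col = Groups,
                      into = c("Domain", "Kingdom", "Class"), sep = ";")
split_toy
```

The original Groups column is dropped and the three pieces take its place, which is why Domain can then be used directly as a colour aesthetic.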

Now we can draw the plot.

genomes %>%
  ggplot(aes(x=log10(Size),y=Chromosomes, colour=Domain)) +
  geom_point()

Exercise 2 - Barplots and Distributions

Small file

We want a barplot of the lengths of samples in category A.

read_tsv("small_file.txt") -> small.file
## Parsed with column specification:
## cols(
##   Sample = col_character(),
##   Length = col_double(),
##   Category = col_character()
## )
head(small.file)
## Sample Length Category
## <chr>   <dbl> <chr>
## x_1        45 A
## x_2        82 B
## x_3        81 C
## x_4        56 D
## x_5        96 A
## x_6        85 B

Since there is only one measure per sample there is no summarisation to be done so we use geom_col rather than geom_bar.

small.file %>%
  filter(Category=="A") %>%
  ggplot(aes(x=Sample,y=Length)) +
  geom_col()
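
To illustrate why geom_col is the right choice here, the sketch below compares the two geometries on the six rows printed above, typed in by hand so that it runs without the file. The object names are illustrative.

```r
library(ggplot2)

# First six rows of the small file as printed earlier
small <- data.frame(
  Sample = paste0("x_", 1:6),
  Length = c(45, 82, 81, 56, 96, 85),
  Category = c("A", "B", "C", "D", "A", "B")
)

# geom_col(): bar height is the y value already present in the data
col_plot <- ggplot(small, aes(x = Sample, y = Length)) + geom_col()

# geom_bar(): bar height is the number of rows in each x group
bar_plot <- ggplot(small, aes(x = Category)) + geom_bar()

layer_data(col_plot)$y      # the Length values themselves
layer_data(bar_plot)$count  # number of rows per Category
```

With geom_bar the heights are counts of observations, so it suits questions such as "how many samples are in each category" rather than "how long is each sample".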

Next we want a stripchart (geom_jitter) of all of the lengths for each category. We need to use height=0 in the geom_jitter so that the points are only spread out horizontally and their y values (the lengths) are left unchanged.

small.file %>%
  ggplot(aes(x=Category, y=Length)) +
  geom_jitter(height=0)
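
The effect of height=0 can be checked directly. This is a small sketch on the six rows printed earlier (typed in by hand), with an explicit width of 0.3 chosen purely for illustration.

```r
library(ggplot2)

# First six rows of the small file as printed earlier
small <- data.frame(
  Sample = paste0("x_", 1:6),
  Length = c(45, 82, 81, 56, 96, 85),
  Category = c("A", "B", "C", "D", "A", "B")
)

jitter_plot <- ggplot(small, aes(x = Category, y = Length)) +
  geom_jitter(height = 0, width = 0.3)

drawn <- layer_data(jitter_plot)

# y values are untouched by the jitter
all(sort(drawn$y) == sort(small$Length))

# x values move at most 0.3 away from the centre of their category
max(abs(as.numeric(drawn$x) - round(as.numeric(drawn$x))))
```

Leaving the height at its default would add random noise to the lengths as well, so the plotted values would no longer match the data.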

Whilst this worked it’s not very easy to tell the categories apart so we’ll tweak it to make it clearer.

small.file %>%
  ggplot(aes(x=Category, y=Length, colour=Category)) +
  geom_jitter(height=0, width=0.3, show.legend = FALSE, size=4)

Expression

Plot the distribution of expression values.

read_tsv("expression.txt") -> expression
## Parsed with column specification:
## cols(
##   Gene = col_character(),
##   Expression = col_double()
## )
head(expression)
## Gene   Expression
## <chr>       <dbl>
## Xkr4    -5.037614
## Gm1992  -2.167190
## Rp1     -6.964203
## Sox17   -2.672370
## Mrpl15   3.347706
## Lypla1   1.833350

Let’s try the plots in a couple of ways.

expression %>%
  ggplot(aes(Expression)) +
  geom_histogram(fill="yellow",colour="black")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

expression %>%
  ggplot(aes(Expression)) +
  geom_density(fill="yellow",colour="black")

We could also play around with the resolution in either of these plots.

Either increasing the resolution:

expression %>%
  ggplot(aes(Expression)) +
  geom_histogram(fill="yellow",colour="black", binwidth = 0.2)

...or decreasing it.

expression %>%
  ggplot(aes(Expression)) +
  geom_density(fill="yellow",colour="black", bw=2)

Cancer

Plot the number of male deaths for all sites.

read_csv("cancer_stats.csv") -> cancer
## Parsed with column specification:
## cols(
##   Class = col_character(),
##   Site = col_character(),
##   `Male Cases` = col_double(),
##   `Female Cases` = col_double(),
##   `Male Deaths` = col_double(),
##   `Female Deaths` = col_double()
## )
head(cancer)
## Class            Site            Male Cases Female Cases Male Deaths
## <chr>            <chr>                <dbl>        <dbl>       <dbl>
## Oral             Tongue               12550         4510        2220
## Oral             Mouth                 8430         5880        1800
## Oral             Pharynx              14450         3420        2660
## Digestive System Eosophagus           13750         3900       13020
## Digestive System Stomach              17230        10280        6800
## Digestive System Small intestine       5610         1980         890

cancer %>%
  ggplot(aes(x=Site, y=`Male Deaths`)) + 
  geom_col()
## Warning: Removed 5 rows containing missing values (position_stack).

We can’t see all of the labels as there isn’t enough space. We’ll fix this later, but for now let’s just show the 5 highest.

cancer %>%
  arrange(desc(`Male Deaths`)) %>%
  slice(1:5) %>%
  ggplot(aes(x=Site, y=`Male Deaths`)) + 
  geom_col()

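Now it works, but even though we fed it sorted data the plot still comes out in alphabetical order. This is because Site is a character column, and ggplot orders a discrete axis by factor level (alphabetically for plain text) rather than by row order.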
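
One way to control the order, sketched here under the assumption that we only have the six sites printed by head(cancer), is to turn Site into a factor whose levels follow the data using base R's reorder(). The column name `MaleDeaths` and the object `sites` are illustrative stand-ins for the real data.

```r
library(ggplot2)

# Sites and male deaths as printed by head(cancer) earlier
sites <- data.frame(
  Site = c("Tongue", "Mouth", "Pharynx", "Eosophagus", "Stomach",
           "Small intestine"),
  MaleDeaths = c(2220, 1800, 2660, 13020, 6800, 890)
)

# reorder() sorts the factor levels by the second argument;
# negating it gives the largest value first
sites$Site <- reorder(sites$Site, -sites$MaleDeaths)
levels(sites$Site)

ordered_plot <- ggplot(sites, aes(x = Site, y = MaleDeaths)) +
  geom_col()
```

Because the ordering lives in the factor levels, it survives any later sorting or filtering of the rows.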

Child variants

Plot the MutantReads distributions for good (QUAL==200) and bad (QUAL<200) variants.

read_csv("Child_Variants.csv", guess_max = 1000000) -> child
## Parsed with column specification:
## cols(
##   CHR = col_character(),
##   POS = col_double(),
##   dbSNP = col_character(),
##   REF = col_character(),
##   ALT = col_character(),
##   QUAL = col_double(),
##   GENE = col_character(),
##   ENST = col_character(),
##   MutantReads = col_double(),
##   COVERAGE = col_double(),
##   MutantReadPercent = col_double()
## )
head(child)
## CHR      POS dbSNP      REF   ALT    QUAL GENE   ENST            MutantReads COVERAGE
## <chr>  <dbl> <chr>      <chr> <chr> <dbl> <chr>  <chr>                 <dbl>    <dbl>
## 1      69270 .          A     G        16 OR4F5  ENST00000335137           3        4
## 1      69511 rs75062661 A     G       200 OR4F5  ENST00000335137          24       27
## 1      69761 .          A     T       200 OR4F5  ENST00000335137           8        8
## 1      69897 rs75758884 T     C        59 OR4F5  ENST00000335137           3        3
## 1     877831 rs6672356  T     C       200 SAMD11 ENST00000342066          10       11
## 1     881627 rs2272757  G     A       200 NOC2L  ENST00000327044          52       56

We need to make the good/bad category column.

child %>%
  mutate(`Good or not` = if_else(QUAL==200,"Good","Bad")) -> child

head(child)
## CHR      POS dbSNP      REF   ALT    QUAL GENE   ENST            MutantReads COVERAGE
## <chr>  <dbl> <chr>      <chr> <chr> <dbl> <chr>  <chr>                 <dbl>    <dbl>
## 1      69270 .          A     G        16 OR4F5  ENST00000335137           3        4
## 1      69511 rs75062661 A     G       200 OR4F5  ENST00000335137          24       27
## 1      69761 .          A     T       200 OR4F5  ENST00000335137           8        8
## 1      69897 rs75758884 T     C        59 OR4F5  ENST00000335137           3        3
## 1     877831 rs6672356  T     C       200 SAMD11 ENST00000342066          10       11
## 1     881627 rs2272757  G     A       200 NOC2L  ENST00000327044          52       56
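
The new column sits at the end of the table and is not visible in the output above, so here is a small sketch of what it contains, using only the six QUAL values printed there. The data frame `variants` is a hand-typed stand-in for the real data.

```r
library(dplyr)

# QUAL values from the six rows printed earlier
variants <- data.frame(QUAL = c(16, 200, 200, 59, 200, 200))

# Same classification rule as used for the full dataset
labelled <- variants %>%
  mutate(`Good or not` = if_else(QUAL == 200, "Good", "Bad"))

labelled

# Tally how many variants fall into each class
count(labelled, `Good or not`)
```

Note that with this rule anything other than exactly 200 is labelled "Bad", and a missing QUAL would give a missing label rather than either class.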

Now we can plot it. I did it on a log scale to make it a bit easier to look at.

child %>%
  ggplot(aes(x = `Good or not`, y=log2(MutantReads))) +
  geom_violin(fill="yellow", colour="black")

Exercise 3 - Annotation, Scaling and Colours

Themes

Set a theme and then redraw the previous plot to see that it changes.

theme_set(theme_bw(base_size=16))
child %>%
  ggplot(aes(x = `Good or not`, y=log2(MutantReads))) +
  geom_violin(fill="yellow", colour="black")

Yes, that definitely looks different, as will every plot from now on.

Cancer

Redraw the previous bargraph but with the axes flipped so we can see all of the categories and we don’t have to filter them. I’m also going to order the results by the data to make the plot clearer, and I’ve removed the cancers which males can’t get.

cancer %>%
  filter(!is.na(`Male Deaths`)) %>%
  ggplot(aes(x=reorder(Site,`Male Deaths`), y=`Male Deaths`)) + 
  geom_col() +
  xlab("Site")+
  coord_flip()

Brain Bodyweight

Plot a scatterplot of brainweight vs bodyweight and make various customisations.

  • Put the title in the centre

  • Make the axes log scale

  • Colour by Category but using a ColorBrewer palette

  • Change the ordering of the categories

read_tsv("brain_bodyweight.txt") -> brain
## Parsed with column specification:
## cols(
##   Species = col_character(),
##   Category = col_character(),
##   body = col_double(),
##   brain = col_double()
## )
head(brain)
## Species        Category         body  brain
## <chr>          <chr>           <dbl>  <dbl>
## Cow            Domesticated   465.00  423.0
## Grey Wolf      Wild            36.33  119.5
## Goat           Domesticated    27.66  115.0
## Guinea Pig     Wild             1.04    5.5
## Diplodocus     Extinct      11700.00   50.0
## Asian Elephant Wild          2547.00 4603.0

brain %>%
  mutate(Category=factor(Category,levels=c("Domesticated","Wild","Extinct"))) %>%
  ggplot(aes(x=brain, y=body, colour=Category))+
  geom_point(size=4)+
  ggtitle("Brain vs Body weight")+
  xlab("Brainweight (g)") +
  ylab("Bodyweight (kg)") +
  scale_y_log10() +
  scale_x_log10() +
  scale_colour_brewer(palette = "Set1") +
  theme(plot.title = element_text(hjust = 0.5))

Finally do a barplot of all species showing their brainweight, but coloured by their bodyweight and using a custom colour scheme.

brain %>%
  ggplot(aes(x=Species, y=brain, fill=log(body))) +
  geom_col() +
  coord_flip() +
  scale_fill_gradientn(colours=c("blue2","purple", "green2","red2","yellow"))

Exercise 4 - Summary Overlays

Tidy1

Plot a stripchart with a boxplot overlay to summarise the data in the 4 categories.

read_csv("tidy_data1.csv") -> tidy1
## Parsed with column specification:
## cols(
##   DMSO = col_double(),
##   `TGX-221` = col_double(),
##   PI103 = col_double(),
##   Akt1 = col_double()
## )
tidy1
##      DMSO   TGX-221    PI103      Akt1
##     <dbl>     <dbl>    <dbl>     <dbl>
## 144.43930  99.61073 41.95241 111.80130
## 135.71670 115.35760 57.46430 124.18050
##  57.88828 106.44840 41.01954 126.77380
##  66.71269 115.89830 63.12587 130.95770
##  73.36981  75.96729       NA  88.62730
##  83.43180        NA       NA 147.88130
##  97.41048        NA       NA  72.68707
##  97.91444        NA       NA  82.73766
## 107.97630        NA       NA  49.60179
## 113.44100        NA       NA        NA

First we restructure the data.

tidy1 %>%
  pivot_longer(cols=everything(), names_to = "sample", values_to = "value") %>%
  filter(!is.na(value)) -> tidy1

tidy1
## sample      value
## <chr>       <dbl>
## DMSO    144.43930
## TGX-221  99.61073
## PI103    41.95241
## Akt1    111.80130
## DMSO    135.71670
## TGX-221 115.35760
## PI103    57.46430
## Akt1    124.18050
## DMSO     57.88828
## TGX-221 106.44840
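
As an aside, the pivot and the NA filter can be combined into one step with the values_drop_na argument of pivot_longer. The sketch below tries this on rows 5 and 6 of the wide table, typed in by hand; it is an alternative to the approach used above, not a replacement for it.

```r
library(tidyr)

# Rows 5 and 6 of the wide table, including their missing values
wide <- data.frame(
  DMSO = c(73.36981, 83.43180),
  `TGX-221` = c(75.96729, NA_real_),
  PI103 = c(NA_real_, NA_real_),
  Akt1 = c(88.62730, 147.88130),
  check.names = FALSE  # keep the hyphen in the TGX-221 column name
)

# values_drop_na = TRUE discards the NA rows while pivoting
long <- pivot_longer(wide, cols = everything(),
                     names_to = "sample", values_to = "value",
                     values_drop_na = TRUE)
long
```

Of the 8 cells in these two rows, 3 are missing, so 5 rows remain after the pivot.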

Now we can do the plotting.

tidy1 %>%
  ggplot(aes(x=sample, y=value,colour=sample)) +
  geom_boxplot(color="grey", size=2) +
  geom_jitter(height=0, width=0.15, show.legend = FALSE, size=5)

We can do the same thing but just showing a mean bar instead of a full boxplot.

tidy1 %>%
  ggplot(aes(x=sample, y=value,colour=sample)) +
  stat_summary(geom="errorbar", fun.ymax = mean, fun.ymin=mean, colour="grey", size=2) +
  geom_jitter(height=0, width=0.15, show.legend = FALSE, size=5)

Now we can plot the same thing as a barplot.

tidy1 %>%
  ggplot(aes(x=sample, y=value)) +
  geom_bar(stat="summary", fill="yellow",color="grey", size=2) + 
  stat_summary(geom="errorbar", width=0.3, color="grey", size=2)
## No summary function supplied, defaulting to `mean_se()
## No summary function supplied, defaulting to `mean_se()
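
The message above says that the bars and error bars were computed with mean_se(). To show what that function returns, here is a minimal sketch applied to the four PI103 values from the table earlier, typed in by hand.

```r
library(ggplot2)

# PI103 values copied from the tidy1 table
pi103 <- c(41.95241, 57.46430, 41.01954, 63.12587)

# mean_se() returns the mean (y) and the mean minus/plus one standard error
summary_default <- mean_se(pi103)
summary_default

# The same numbers by hand: SEM = sd / sqrt(n)
sem <- sd(pi103) / sqrt(length(pi103))
c(mean = mean(pi103), lower = mean(pi103) - sem, upper = mean(pi103) + sem)
```

So the default error bars show the standard error of the mean, which is narrower than the standard deviation by a factor of the square root of the group size.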

We could also have done the same thing using pre-calculated values. We’ll use the STDEV instead of the SEM.

tidy1 %>%
  group_by(sample) %>%
  summarise(mean=mean(value),stdev=sd(value)) -> tidy1.summary

tidy1.summary
## sample       mean    stdev
## <chr>       <dbl>    <dbl>
## Akt1    103.91649 32.14626
## DMSO    100.80126 24.18533
## PI103    50.89053 11.10922
## TGX-221 102.65646 16.37553

tidy1.summary %>%
  ggplot(aes(x=sample, y=mean, ymin=mean-stdev, ymax=mean+stdev)) +
  geom_col(fill="yellow",color="grey", size=2) +
  geom_errorbar(size=2, colour="grey", width=0.3)

Exercise 5 - Faceting and Highlighting

Up down expression

Plot out a scatterplot of the two datasets against each other and customise the colouring.

read_tsv("up_down_expression.txt") -> up.down
## Parsed with column specification:
## cols(
##   Gene = col_character(),
##   Condition1 = col_double(),
##   Condition2 = col_double(),
##   State = col_character()
## )
head(up.down)
## Gene       Condition1 Condition2 State
## <chr>           <dbl>      <dbl> <chr>
## A4GNT      -3.6808610 -3.4401355 unchanging
## AAAS        4.5479580  4.3864126 unchanging
## AASDH       3.7190695  3.4787276 unchanging
## AATF        5.0784720  5.0151916 unchanging
## AATK        0.4711421  0.5598642 unchanging
## AB015752.4 -3.6808610 -3.5921390 unchanging

Let’s do a simple, uncustomised plot first.

up.down %>%
  ggplot(aes(x=Condition1, y=Condition2, colour=State)) +
  geom_point(size=0.5)

Now let’s improve the appearance and add some custom labels.

up.down %>%
  filter(Condition1 > -1 & Condition2 > -1 & abs(Condition1 - Condition2) > 3) -> up.down.interesting

up.down.interesting
## Gene           Condition1 Condition2 State
## <chr>               <dbl>      <dbl> <chr>
## BMP3            3.1715820 -0.8551732 down
## COL1A2          6.9458620  2.5371442 down
## COL3A1          7.0950236  2.7096112 down
## EEF1A2          0.5586050  4.6631520 up
## NTS             1.8634596  5.8172520 up
## PGR             1.6004251  4.6259530 up
## RP11-1070N10.6  3.7472668  0.7297893 down
## S100A14        -0.9438953  3.6788050 up
## SCGB1D2         2.6609666 -0.8551732 down
## SPTSSB         -0.3589329  6.6001540 up

library(ggrepel)
up.down %>%
  ggplot(aes(x=Condition1, y=Condition2, colour=State, label=Gene)) +
  geom_point(size=1.5) +
  scale_colour_manual(values=c("blue2","grey","red2")) +
  theme(legend.position="none") +
  geom_abline(slope = 1, intercept = 0, colour="darkgrey", size=1) +
  geom_text_repel(data=up.down.interesting,col="black", box.padding = 1)

Download Festival

Clean up the data (restructure and remove NA values).

Draw a stripchart of cleanliness for males and females and facet by the day of the festival. Colour the males and females differently and add a line to show the mean values.

read_csv("DownloadFestival.csv") -> festival
## Parsed with column specification:
## cols(
##   ticknumb = col_double(),
##   gender = col_character(),
##   day1 = col_double(),
##   day2 = col_double(),
##   day3 = col_double()
## )
head(festival)
## ticknumb gender  day1  day2  day3
##    <dbl> <chr>  <dbl> <dbl> <dbl>
##     2111 Male    2.64  1.35  1.61
##     2229 Female  0.97  1.41  0.29
##     2338 Male    0.84    NA    NA
##     2384 Female  3.03    NA    NA
##     2401 Female  0.88  0.08    NA
##     2405 Male    0.85    NA    NA

festival %>%
  pivot_longer(cols=starts_with("day"), names_to = "day", values_to = "cleanliness") %>%
  filter(!is.na(cleanliness)) -> festival

head(festival)
## ticknumb gender day   cleanliness
##    <dbl> <chr>  <chr>       <dbl>
##     2111 Male   day1         2.64
##     2111 Male   day2         1.35
##     2111 Male   day3         1.61
##     2229 Female day1         0.97
##     2229 Female day2         1.41
##     2229 Female day3         0.29

Now we can plot it out.

festival %>%
  ggplot(aes(x=gender, y=cleanliness, colour=gender)) +
  geom_jitter(height=0, width=0.3, alpha=0.5, stroke=NA) +
  scale_colour_manual(values = c("blue2","red2")) +
  stat_summary(geom="errorbar", fun.y = mean, fun.ymax = mean, fun.ymin = mean, colour="darkgrey", size=3) +
  facet_grid(cols=vars(day))

Finally we can draw the plot above but split by both day and attendance, where attendance is the number of days for which each ticket holder has a measurement.

festival %>%
  group_by(ticknumb) %>%
  count() %>%
  right_join(festival) %>%
  rename(attended=n) -> festival
## Joining, by = "ticknumb"
head(festival)
## ticknumb attended gender day   cleanliness
##    <dbl>    <int> <chr>  <chr>       <dbl>
##     2111        3 Male   day1         2.64
##     2111        3 Male   day2         1.35
##     2111        3 Male   day3         1.61
##     2229        3 Female day1         0.97
##     2229        3 Female day2         1.41
##     2229        3 Female day3         0.29

festival %>%
  ggplot(aes(x=gender, y=cleanliness, colour=gender)) +
  geom_jitter(height=0, width=0.3, alpha=0.5, stroke=NA) +
  scale_colour_manual(values = c("blue2","red2")) +
  stat_summary(geom="errorbar", fun.y = mean, fun.ymax = mean, fun.ymin = mean, colour="darkgrey", size=3) +
  facet_grid(cols=vars(day), rows=vars(attended))
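
The attendance column used for the final facet was built with a group, count and join. As a hedged alternative, dplyr's add_count() should give the same column in one step; the sketch below tries it on three ticket holders from the data shown earlier, typed in by hand after restructuring and NA removal.

```r
library(dplyr)

# Three ticket holders from the festival data, already in long form
festival_long <- data.frame(
  ticknumb = c(2111, 2111, 2111, 2338, 2401, 2401),
  gender = c("Male", "Male", "Male", "Male", "Female", "Female"),
  day = c("day1", "day2", "day3", "day1", "day1", "day2"),
  cleanliness = c(2.64, 1.35, 1.61, 0.84, 0.88, 0.08)
)

# add_count() appends the number of rows per ticknumb to every row
with_attendance <- festival_long %>%
  add_count(ticknumb, name = "attended")

with_attendance
```

This keeps the original row order and avoids the join, at the cost of needing a dplyr version that supports the name argument.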