--- title: "MSStats Example" output: html_document: df_print: paged toc: true toc_float: true --- Introduction ============ This is an example of loading in a search result from MaxQuant and analysing it in MSstats. We can do this using their shiny application, but doing it in R directly gives us more control over the process and lets us examine intermediate steps in the processing. The data we are using comes from PRIDE accession PXD043985. It's a yeast dataset which contains both whole proteome and an affinity pull down using the eIF2a protein. The interest here is to look at the proteins which are enriched in the pull down compared to the whole proteome. Setup ===== We're going to base this analysis on the MSstats package, but will supplement this with the standard tidyverse packages. We'll use pheatmap for drawing heatmaps of the hits. ```{r message=FALSE} library(MSstats) library(MSstatsConvert) library(tidyverse) theme_set(theme_bw()) ``` Loading Data ============ We are going to import two files from the MaxQuant output. These are: 1. ```evidence.txt``` this is the main quantified file at the peptide level 2. ```proteinGroups.txt``` this file provides protein level quantitation and shows how the peptides were combined We'll read in the evidence file first to look at some of the properties of the data, but we can then get MSStats to convert this into its standard format. We're importing data from MaxQuant, but this would equally work with data from other search platforms. Read in evidence file --------------------- We read in the file using the standard ```read_delim``` but we use the same column name repair that ```read.delim`` would use as MSstats expects the names to be in this format. ```{r message=FALSE} read_delim( "evidence.txt", name_repair = "universal" ) -> evidence head(evidence) ``` Read in protein file -------------------- We can also load the protein level information. ```{r message=FALSE} read_delim( "proteinGroups.txt", name_repair = "universal" ) -> protein_groups head(protein_groups) ``` Create annotation file ---------------------- For the downstream analysis we also need to make up a tibble of annotations to say which group each file belongs to. There are only two groups here, there full proteomes and the affinity tag pull downs. We'll make the annotation from the data in the evidence file. We're not doing a mass tagged experiment so we need to say that all of the samples are using light (L), ie normal masses. ```{r} evidence %>% distinct(`Raw.file`) %>% mutate(Condition = str_replace(Raw.file,"^.*-","")) %>% mutate(Condition = str_replace(Condition,"TAP_Prot","Prot")) %>% mutate(Condition = str_replace(Condition,"_Rep.","")) %>% arrange(Raw.file) %>% group_by(Condition) %>% mutate(BioReplicate = 1:n()) %>% ungroup() %>% add_column(IsotypeLabelType="L") -> annotation annotation ``` Properties of input data ======================== We've already looked at the QC of this data using PTXQC, but we can also look directly into the evidence and protein data to see what we're working with. MSstats will do some filtering for us when the data is loaded, but the exact metrics which are used will vary between different search programs. Retention time -------------- It's good to see that we're getting a nice even spread of peptides coming into the experiment through the duration of the retention time. We can see this visually. ```{r fig.width=12, fig.height=5} evidence %>% ggplot(aes(x=Retention.time, colour=Raw.file)) + geom_density() ``` Number of peptides per sample ----------------------------- ```{r} evidence %>% group_by(Raw.file) %>% summarise( true_matches = sum(is.na(Match.score)), transfer_matches = sum(!is.na(Match.score)) ) %>% ungroup() %>% pivot_longer( cols = -Raw.file, names_to="match_type", values_to="peptide_count" ) %>% ggplot(aes(x=Raw.file, y=peptide_count, fill=match_type)) + geom_col() + coord_flip() + scale_fill_brewer(palette = "Set1") ``` All of the plots we could make in PTXQC we could remake here. We can see that the TAP samples show fewer peptides than the proteomes. Amount of contamination ----------------------- ```{r} evidence %>% group_by(Raw.file) %>% summarise( contaminants = sum(!is.na(Potential.contaminant)), non_contaminants = sum(is.na(Potential.contaminant)) ) %>% left_join(annotation) %>% ggplot(aes(x=non_contaminants, y=contaminants, colour=Condition, label=BioReplicate)) + geom_point(size=7) + geom_text(colour="black") ``` Again, we can see that the full proteomes have many more peptides in them, but also many fewer contaminants. The TAP samples are less complex (as you'd expect) and more contaminated. We can also see that there is a small batch effect where samples 1,2 and 3 in the full proteome are less abundant than samples 4,5 and 6. PEP scores ---------- The PEP scores represent the probability that the reported match is correct and are based on the strength of the hits to the real proteome reference vs the hits to the reversed reference. ```{r fig.width=7, fig.height=12} evidence %>% filter(is.na(Potential.contaminant)) %>% select(Raw.file,PEP) %>% mutate(PHRED = -10*log10(PEP)) %>% filter(!is.nan(PHRED)) %>% mutate(PHRED = replace(PHRED, PHRED > 50,50)) %>% left_join(annotation) %>% ggplot(aes(x=Raw.file, y=PHRED, fill=Condition)) + geom_violin() + coord_flip() ``` We can see that although a lot of the matches are of high quality the Q1 files, especially the Q1 TAP have more hits with lower quality matches. The Q2 files look much nicer. It's only the PEP scores with a PHRED of less than 20 (99% confidence) where we'd have any concerns. Number of peptides per protein ------------------------------ ```{r} protein_groups %>% select(starts_with("Peptides.")) %>% pivot_longer( cols=everything(), names_to="Raw.File", values_to="peptide_count" ) %>% filter(peptide_count > 0) %>% mutate(Raw.File=str_replace(Raw.File,"Peptides.","")) %>% ggplot(aes(x=peptide_count, group=Raw.File)) + geom_density() + coord_cartesian(xlim=c(0,25)) + scale_x_continuous(breaks=0:25) ``` There are quite a lot of proteins identified by only a single peptide. These would usually be removed during the data import and processing. Number and type of proteins --------------------------- We can also look at the total number of proteins observed in each sample and how it was found (directly observed or via MS1 transfer). ```{r} protein_groups %>% select(starts_with("Identification.type.")) %>% pivot_longer( cols=everything(), names_to="Raw.File", values_to="IdentificationType" ) %>% group_by(Raw.File,IdentificationType) %>% count() %>% ggplot(aes(x=Raw.File, y=n, fill=IdentificationType)) + geom_col() + coord_flip() ``` We can see that there are many fewer observed proteins in the TAP samples, and that the Q1 replicates had a higher proportion of MS2 identifications than the Q2. Having taken a basic look at the raw data we can start working through the MSstats pipeline. Running MSstats =============== Converting the input data ------------------------- ```{r} MaxQtoMSstatsFormat( evidence = evidence, annotation = annotation, proteinGroups = protein_groups ) -> raw_data ``` ```{r} raw_data %>% as_tibble() %>% head() ``` We can see a simplified, standardised version of the data. Quantifying the data -------------------- ```{r message=FALSE, warning=FALSE} dataProcess( raw_data, logTrans = 2, normalization = "equalizeMedians" ) -> quantified_data quantified_data$FeatureLevelData %>% head() quantified_data$ProteinLevelData %>% head() ``` We now have data which has been quantiated and normalised. The ```dataProcess``` function has additional options for changing the default methods used for filtering, quantitation and normalisation. By default it normalises at the PSM level to equalise the median intensity between files. It summarises the intensities of multiple PSMs in a protein using a Tukey polished median value and it keeps all features in the data. It will impute missing values rather than leaving them empty. The review of statistics said that just using the default maxquant quantitation, with no additional normalisation is also a good approach, so you could Checking peptide normalisation ------------------------------ We can look at the distribution of normalised intensity values in the peptides. We should see that their medians are all the same because of the method used. ```{r fig.width=10, fig.height=4} quantified_data$FeatureLevelData %>% filter(!censored) %>% ggplot(aes(x=originalRUN, y=newABUNDANCE, fill=GROUP)) + geom_boxplot() + coord_flip() ``` These seem fairly comparable from this level of view. Whilst this type of normalisation would generally work well with whole proteomes we will check how it performs on this data later, since we are looking at enriched data. Checking protein normalisation ------------------------------ We can also look at the protein level. ```{r fig.width=10, fig.height=4} quantified_data$ProteinLevelData %>% ggplot(aes(x=originalRUN, y=LogIntensities, fill=GROUP)) + geom_boxplot() + coord_flip() ``` At the protein level we can see that the normalisation doesn't look as nice and that the distribution of intensities in the total protein are higher than the TAP. We saw earlier that the TAP was a less complex mix so there are many fewer proteins in it. We would also expect that they would be distributed differently since they are only proteins which are affinity pulled-down so we might expect a smaller number of more highly abundant proteins. For whole proteome comparisons normalisation should be fairly straight forward, but for enriched mixes it can be more complex. We can look at the distributions in a different way, looking at the detail of the distribution for each sample. ```{r fig.width=10, fig.height=4} quantified_data$ProteinLevelData %>% ggplot(aes(x=LogIntensities, colour=GROUP, group=originalRUN)) + geom_density(linewidth=1) ``` Again, we can see the distributions look different, but now it's clearer that we can see an outgroup with high abundance in the TAP samples, and the majority of the data with relatively low abundance. It's the high abundance proteins in the TAP relative to Total that we're likely to be most interested in. Let's look at the scatterplot of Total vs TAP at the protein level. ```{r fig.width=6, fig.height=6} quantified_data$ProteinLevelData %>% group_by(Protein, GROUP) %>% summarise( LogIntensities = mean(LogIntensities, na.rm = TRUE) ) %>% pivot_wider( names_from=GROUP, values_from=LogIntensities, values_fill = 15 ) %>% ggplot(aes(x=ProtTot, y=TAP)) + geom_jitter(width=0.1, height=0.1) + geom_abline(slope=1, intercept = 0, colour="red", linewidth=1) ``` Because we have a lot of proteins which were only observed in one condition or the other I've added an artificial value of 15 to these missing values, and then gave these a bit of noise so we could get a sense of how many there were. We can see a lot of proteins in Total which are absent in TAP, and only a few which go the other way. There is a diagonal set of points which look higher in TAP, but those might be the true unchanging proteins, so we could later opt to re-normalise to center the data on those proteins, which will give us a very different answer for the changing proteins. Differential Expression ======================= We'd like to do some statistics to compare the two conditions. We need to build a contrast, which is super simple where we only have two conditions. we need a "1" for one conditions and "-1" for the other. ```{r} levels(quantified_data$FeatureLevelData$GROUP) matrix(c(-1,1),nrow=1) -> contrasts row.names(contrasts) <- "TAP_vs_Total" colnames(contrasts) <- levels(quantified_data$FeatureLevelData$GROUP) contrasts ``` Now we can run the statistics. This runs the mixed effects model built into MSstats. ```{r message=FALSE, warning=FALSE} groupComparison(contrast.matrix = contrasts, data=quantified_data) -> comparison_result ``` Review stats results -------------------- Let's take a look at the results. We're only really interested in proteins which are higher in TAP than Total. ```{r} comparison_result$ComparisonResult %>% filter(log2FC >0) %>% arrange(adj.pvalue) %>% head(n=20) ``` We can see that the most significant hits are proteins which appeared in one sample but not the other. Let's have a look at those which are in both. ```{r} comparison_result$ComparisonResult %>% filter(log2FC >0 & !is.infinite(log2FC)) %>% arrange(adj.pvalue) %>% head(n=20) ``` Volcano Plot ------------ We can plot this out as a volcano plot. ```{r fig.width=6, fig.height=5} comparison_result$ComparisonResult %>% filter(!is.na(log2FC) & !is.na(adj.pvalue)) %>% mutate(PHRED=-10*log10(adj.pvalue)) %>% mutate(PHRED=replace(PHRED,PHRED>55,55)) %>% mutate(log2FC = replace(log2FC, log2FC< -10, -10)) %>% mutate(log2FC = replace(log2FC, log2FC> 10, 10)) %>% mutate(significant = log2FC>0 & adj.pvalue<0.01) -> volcano_data volcano_data %>% ggplot(aes(x=log2FC, y=PHRED, colour=significant, label=Protein)) + geom_jitter(show.legend = FALSE, width=0.2, height=1) + scale_colour_manual(values=c("grey","blue2")) + geom_vline(xintercept = 0, linewidth=1) + geom_text(data = volcano_data %>% filter(significant & log2FC > 4), colour="black", size=3, vjust=1.5) ``` These statistics give a uniformly high value to the points which are consistently observed in one sample but not the other. This would be one of the major differences between this method and other statistical methods. Scatterplot ----------- We can plot these out on the same scatterplot as before. ```{r fig.width=6, fig.height=6} quantified_data$ProteinLevelData %>% group_by(Protein, GROUP) %>% summarise( LogIntensities = mean(LogIntensities, na.rm = TRUE) ) %>% pivot_wider( names_from=GROUP, values_from=LogIntensities, values_fill = 15 ) %>% mutate(up_in_TAP = Protein %in% (comparison_result$ComparisonResult %>% filter(log2FC >0 & adj.pvalue < 0.01) %>% pull(Protein))) %>% arrange(up_in_TAP) %>% ggplot(aes(x=ProtTot, y=TAP, colour=up_in_TAP)) + geom_jitter(width=0.1, height=0.1, show.legend = FALSE) + geom_abline(slope=1, intercept = 0, colour="red", linewidth=1) + scale_colour_manual(values=c("grey","blue2")) ``` Validating quantitative hits ---------------------------- If we wanted to make a strong claim about an individual protein then it would be good to review the information we have for it. Let's look at a few of the top hits to see how well they behave across samples. ```{r} comparison_result$ComparisonResult %>% filter(log2FC >0 & !is.infinite(log2FC)) %>% arrange(adj.pvalue) %>% head(n=20) %>% pull(Protein) %>% as.character() -> top_20_measured top_20_measured ``` These are the top 20 hits which were measured in both conditions. ```{r fig.width=10, fig.height=10} quantified_data$ProteinLevelData %>% filter(Protein %in% top_20_measured) %>% ggplot(aes(x=GROUP, y=LogIntensities, size=NumMeasuredFeature, colour=GROUP)) + geom_jitter(show.legend = FALSE) + stat_summary(geom="crossbar", colour="black", fun=mean, show.legend = FALSE) + facet_wrap(vars(Protein)) ``` We can see a nice consistent separation of the two groups. We can see from the point sizes that there is variability in the number of peptides seen in each sample and that some samples are better observed that others. Peptide level differences ------------------------- The view we see above comes from the protein level aggregation, and each of those measurements derives from multiple peptide level measurements. We can go back to the peptides to see how well the evidence there stacks up. Let's look at one of the moderate hits - Q12460 ```{r} quantified_data$FeatureLevelData %>% filter(PROTEIN == "Q12460") %>% arrange(GROUP,originalRUN) %>% mutate(originalRUN=factor(originalRUN, levels=unique(originalRUN))) %>% ggplot(aes(x=originalRUN, y=newABUNDANCE, colour=GROUP, fill=GROUP, shape=censored, group=PEPTIDE)) + geom_line(linewidth=1)+ geom_point(size=3, colour="black") + scale_shape_manual(values=c(21,24)) ``` Now we can see the more detailed peptide level view of the data showing the degree of overall consistency. With the shape colours we can also see which values were actually measured, and which were imputed. Categorical Hits ---------------- We can also look at the hits which were present in one condition and absent in the other to see how strong that difference is to gauge how confident we should be that this is a difference which derives from actual underlying protein abundance changes. ```{r} comparison_result$ComparisonResult %>% filter(is.infinite(log2FC), log2FC>0) %>% pull(Protein) %>% as.character() -> categorical_hits categorical_hits ``` We can now see how high and consistent these are in the TAP condition, since we know that they will be absent in the ProtTot ```{r fig.width=6, fig.height=12} quantified_data$ProteinLevelData %>% filter(Protein %in% categorical_hits) %>% filter(GROUP=="TAP") %>% ggplot(aes(x=Protein, y=LogIntensities)) + geom_jitter(colour="darkgrey", size=4, width=0.2) + stat_summary(geom="crossbar", fun.data=mean_se) + coord_flip() ``` We can see that there are a lot of samples here which are near the limit of detection (which was around 20 in this data), and that some of the samples show a lot of variability between the replicates which were measured, and had some replicates missing (there should be 6) so we'd need to be cautious about believing some of these candidates. Renormalising ============= With this type of data it's not obvious where the correct position for protein level normalisation should be. We have a few different normalisation options but most of them are based around matching some properties of the distributions. In our case we don't necessarily expect the distributions to match nicely because of the TAP enrichment. Instead we can pull out genes in the central high intensity band and use those to direct the normalisation. Selecting Genes --------------- It looks like I can select genes with an intensity above 23 in both samples and use those for normalisation. ```{r} quantified_data$ProteinLevelData %>% group_by(Protein, GROUP) %>% summarise( LogIntensities = mean(LogIntensities, na.rm = TRUE) ) %>% pivot_wider( names_from=GROUP, values_from=LogIntensities, values_fill = 15 ) %>% filter(ProtTot > 23 & TAP > 23) %>% pull(Protein) -> genes_to_normalise_with ``` Let's have a look at those. ```{r fig.width=6, fig.height=6} quantified_data$ProteinLevelData %>% group_by(Protein, GROUP) %>% summarise( LogIntensities = mean(LogIntensities, na.rm = TRUE) ) %>% pivot_wider( names_from=GROUP, values_from=LogIntensities, values_fill = 15 ) %>% mutate(normalise = Protein %in% genes_to_normalise_with) %>% arrange(normalise) %>% ggplot(aes(x=ProtTot, y=TAP, colour=normalise)) + geom_jitter(width=0.1, height=0.1, show.legend = FALSE) + geom_abline(slope=1, intercept = 0, colour="red", linewidth=1) + scale_colour_manual(values=c("grey","blue2")) ``` That doesn't look too bad. Re-run normalisation --------------------- ```{r message=FALSE} dataProcess( raw_data, normalization = "globalStandards", nameStandards = genes_to_normalise_with ) -> quantified_data2 ``` Let's check the re-normalised data ```{r fig.width=10, fig.height=4} quantified_data2$ProteinLevelData %>% ggplot(aes(x=LogIntensities, colour=GROUP, group=originalRUN)) + geom_density(linewidth=1) ``` From a distribution point of view this actually looks worse, but we can observe the patterning in the scatterplot which makes more sense. Rerun comparison ---------------- ```{r message=FALSE, warning=FALSE} groupComparison(contrast.matrix = contrasts, data=quantified_data2) -> comparison_result2 ``` Replotting ---------- ```{r fig.width=6, fig.height=6} quantified_data2$ProteinLevelData %>% group_by(Protein, GROUP) %>% summarise( LogIntensities = mean(LogIntensities, na.rm = TRUE) ) %>% pivot_wider( names_from=GROUP, values_from=LogIntensities, values_fill = 15 ) %>% mutate(up_in_TAP = Protein %in% (comparison_result2$ComparisonResult %>% filter(log2FC >0 & adj.pvalue < 0.01) %>% pull(Protein))) %>% arrange(up_in_TAP) %>% ggplot(aes(x=ProtTot, y=TAP, colour=up_in_TAP)) + geom_jitter(width=0.1, height=0.1, show.legend = FALSE) + geom_abline(slope=1, intercept = 0, colour="red", linewidth=1) + scale_colour_manual(values=c("grey","blue2")) ``` We can see that we get a smaller collection of hits with the modified normalisation. The data has shifted to center on the genes we selected for normalisation. It's not completely clear which result is more correct since we have no absolute measurements, but this method gives us the ability to rerun normalisation with any selection of genes we care to pick.