# # TUTORIAL ON HUMAN AFFYMETRIX DATA PRE-PROCESSING AND GENE FILTERING USING MAID # # To pre-process human Affymetrix data and then filter significantly up- or down-regulated genes # using MAID (MA-plot-based signal Intensity-Dependent fold-change criterion). # # # author: Michael Hecker # date: 12-05-2010 # e-mail address: michael.hecker@rocketmail.com # address 1: Leibniz Institute for Natural Product Research and Infection Biology - Hans-Knoell-Institute, # Beutenbergstr. 11a, D-07745 Jena, Germany # address 2: Steinbeis Transfer Center for Proteome Analysis, # Schillingallee 68, D-18057 Rostock, Germany # # author's comment: # Please feel free to use, modify or redistribute this file without any restrictions. # For further information: http://www.hki-jena.de/index.php/0/2/491 # # tested on: i686-pc-linux-gnu, R version 2.11.0 # #We start by loading all required libraries... library(affy) library(affyPLM) library(robustbase) library(gahgu133acdf) #Please note: GeneAnnot-based custom chip definition files (CDFs) have to be installed (gahgu133acdf)! #The latest version can also be downloaded from http://www.xlab.unimo.it/GA_CDF/customcdf.php?version=latest #Here we use Packages Version 2.1.0 (which is associated to GeneCards Version 2.41.1 and GeneAnnot Version 1.9). #The RData-file provides some information for each GeneCard in the GeneCards database (55546 entries), #e.g. corresponding gene symbol, official full name and genomic location... #On our institute's webpage you will also find the information of older/newer Genecards versions. load("GeneCards_2.41.1.RData") #In order to make the tutorial more readable, some R functions were defined in a separate R file, #which we will include by the source-command... source("MAID_tutorial_codes.r") #We create directories for the resulting images and data tables... make_directories() #Next, we load an HG-U133A Affymetrix sample data set. This data set provides gene expression levels of three #rheumatoid arthritis (RA) patients starting Etanercept therapy. For each patient the expression was measured at #baseline (t0) as well as after three days post therapy initiation (t1). In this tutorial we will identify #genes consistently up- or down-regulated in this group of patients. Please make sure that you have #extracted the CELfiles-directory contained in MAID_sample_data.zip in the working directory of the R process... files = dir(path="CELfiles") data = ReadAffy(celfile.path="CELfiles") #We use the GeneAnnot-based custom CDF and the MAS5 algorithm to pre-process the data, i.e. to generate #signal intensities from the raw CEL-files. Notably, the custom CDFs also work with other pre-processing methods #such as RMA, but the MAID analysis is rather designed for MAS5-processed data (the difference is that RMA #already includes a variance-stabilizing step and normalizes the data by default using quantile normalisation). #Here, we end up with a signal matrix of 12175 rows (genes) and 6 columns (microarrays)... data@cdfName = paste("ga", data@cdfName, sep="") data@annotation = paste("ga", data@annotation, sep="") ga_eset = mas5(data) signals = exprs(ga_eset) colnames(signals) = substr(colnames(signals), 1, nchar(colnames(signals))-6) rownames(signals) = substr(rownames(signals), 1, nchar(rownames(signals))-3) #For quality control we can plot pseudo-images of the microarrays, i.e. the microarray scan is #reconstructed based on the measured data. The residual visualisation is particularly useful to detect #specks and smears on a chip. For further information on quality assessment please have a look at: #http://www.bioconductor.org/packages/bioc/vignettes/affyPLM/inst/doc/QualityAssess.pdf #Here, the pseudo-images show a good quality for all the six microarrays. pfit = fitPLM(data) make_pseudoimages() #To correct for systematic effects in the data we perform a loess normalisation, which is a very flexible #method. Especially the parameter span has to be set with necessary care. The (rounded) normalized signal intensities #are then written to a tab-separated file in the Tables-directory... signals.normalized <- normalize.loess(signals, family.loess="symmetric", subset=1:nrow(signals), span=0.05) write.table(file="Tables/signals_normalized.txt", data.frame(GeneCards_ID=rownames(signals.normalized),round(signals.normalized)), row.names=F, quote=F, sep="\t") #MA-plots are a common way to visualise microarray data and the effects of the loess normalisation. #Here we plot the data for each pair-wise combination of microarrays. The red line in these plots indicates the #loess regression curve... make_maplots() make_maplots_normalized() #Boxplots and densities of raw and normalized signal intensities could also be helpful to evaluate the comparability #of the microarray measurements. Again we see how the normalisation affects the data... make_boxplots() make_densityplots() #To identify genes with significant expression changes we use two criteria: a t-statistic and so-called MAID-scores. #Here, we aim to detect common transcriptional effects during the first three days of therapy. #Therefore, we use a paired t-test to compare the baseline levels with those at t1 (when comparing two independent #patient groups etc. one might instead apply the two-sample t-test or the Welch-test). Moreover, we calculate #so-called MAID-scores that represent signal intensity-dependent fold-changes. In this way, the MAID approach takes #into account that the variability in the log fold-changes increases as the measured signal intensity decreases. #The filtering procedure is realised by the function filtergenes() that requires two cut-offs: one for the #t-test p-value (default 0.05) and one for the MAID-scores (default 2.0, here we choose 3.0 which is more stringent). #It also produces an MA-plot that provides an intuitive visualisation of the gene filtering (blue = measured genes, #yellow = genes with p-value<0.05, green = filtered genes with p-value<0.05 and |MAID-score|>3, #grey = IQR's calculated in a sliding window, black = IQR's used to compute the regression curve, #dashed red line = fitted MAID regression curve, solid red line = actual MAID-score cut-off curves). #We see, that in particular weakly expressed genes are rejected by MAID. patientids = names(table(substr(files,1,4))) maid.t0t1 = filtergenes(signals.normalized[,paste(patientids,"_t0",sep="")], signals.normalized[,paste(patientids,"_t1",sep="")], thresh=3, pvalthresh=0.05, visualize=T, print="Images/Filtering_MAplot_t0t1.png") signals.normalized.filtered = signals.normalized[maid.t0t1$filtered,] #Finally, we plot the time courses for all the filtered genes and write an output file. #(In general we might use the GeneCards information to further reduce the filtering result, #e.g. by considering only protein-coding transcripts.) make_timecourseplots() write_filteredgenes_information() #As a result, four genes could be identified as being responsive to the therapy #(CXCR4, TUBA1A, MOSC1 and PRB4). However, within this short tutorial we analysed microarray data of only three #patients. Obviously, there is a lack of statistical power rendering the result less reliable. The full data set #comprising 19 rheumatoid arthritis patients is available from ArrayExpress (accession number E-MTAB-11) #and was published by Koczan et al. (2008). The transcriptional effects of Etanercept in RA patients were #studied in more detail in Hecker et al. (2009).