Wednesday, August 17, 2011

Dissecting an ExpressionSet object

Last time, we were in R using some Bioconductor code. If we do this:

library('ALL')
data('ALL')

then the variable ALL is available

> ALL
ExpressionSet (storageMode: lockedEnvironment)
assayData: 12625 features, 128 samples
element names: exprs
protocolData: none
phenoData
sampleNames: 01005 01010 ... LAL4 (128 total)
varLabels: cod diagnosis ... date last seen (21 total)
varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
pubMedIds: 14684422 16243790
Annotation: hgu95av2

(note: your output may vary slightly depending on the R and Bioconductor version.)

Obviously the object that ALL refers to is complicated. Its components are called "slots" and they are accessible with the syntax: object_name@slot_name. For example:

pData = ALL@phenoData

You can find out which slots are available by parsing the above output or just doing:

> slotNames(ALL)
[1] "assayData" "phenoData"
[3] "featureData" "experimentData"
[5] "annotation" "protocolData"
[7] ".__classVersion__"



In turn the slots are also complex objects. Technically, a phenoData object is:

> class(ALL@phenoData)
[1] "AnnotatedDataFrame"
attr(,"package")
[1] "Biobase"

More important, its sub-objects are available with the syntax slot_name$sub-object_name. But first we must discover which names are available:

> pData = ALL@phenoData
> varLabels(pData)
[1] "cod" "diagnosis" "sex"
[4] "age" "BT" "remission"
[7] "CR" "date.cr" "t(4;11)"
[10] "t(9;22)" "cyto.normal" "citog"
[13] "mol.biol" "fusion protein" "mdr"
[16] "kinet" "ccr" "relapse"
[19] "transplant" "f.u" "date last seen"

So, for example:

> pData$sex[1:5]
[1] M M F M M
Levels: F M

> pData$mol.biol[1:5]
[1] BCR/ABL NEG BCR/ABL ALL1/AF4 NEG
Levels: ALL1/AF4 BCR/ABL E2A/PBX1 NEG NUP-98 p15/p16

> levels(pData$mol.biol)
[1] "ALL1/AF4" "BCR/ABL" "E2A/PBX1" "NEG"
[5] "NUP-98" "p15/p16"

At this level, the objects are reasonably simple.

> class(pData$mol.biol)
[1] "factor"
> class(pData$sex)
[1] "factor"



Don't forget the other method names:

> sampleNames(pData)[1:5]
[1] "01005" "01010" "03002" "04006" "04007"

> featureNames(ALL)[1:8]
[1] "1000_at" "1001_at" "1002_f_at" "1003_s_at"
[5] "1004_at" "1005_at" "1006_at" "1007_s_at"

> head(varMetadata(pData))
labelDescription
cod Patient ID
diagnosis Date of diagnosis
sex Gender of the patient
age Age of the patient at entry
BT does the patient have B-cell or T-cell ALL
remission Complete remission(CR), refractory(REF) or NA. Derived from CR

And, I should point out that (without actually knowing it at the time) I shadowed a real function in the namespace by assigning to pData above.

If we reboot R and do:

library('ALL')
data('ALL')

we can use a different method for access:

> p = pData(ALL)
> class(p)
[1] "data.frame"
> names(p)
[1] "cod" "diagnosis" "sex"
[4] "age" "BT" "remission"
[7] "CR" "date.cr" "t(4;11)"
[10] "t(9;22)" "cyto.normal" "citog"
[13] "mol.biol" "fusion protein" "mdr"
[16] "kinet" "ccr" "relapse"
[19] "transplant" "f.u" "date last seen"
> p$sex[1:5]
[1] M M F M M
Levels: F M



The very first slot holds the gene expression data.

> aData = ALL@assayData
> class(aData)
[1] "environment"

I'm not really sure what that class is. And, unfortunately, it is not easy to find which sub-object names are available for this slot:

> names(aData)
NULL
> elements(aData)
Error: could not find function "elements"
> elementNames(aData)
Error: could not find function "elementNames"

except we can probably guess that exprs is one (based on the output of the call `ALL` above:

> e = aData$exprs
> class(e)
[1] "matrix"
> e[1:3,1:3]
01005 01010 03002
1000_at 7.597323 7.479445 7.567593
1001_at 5.046194 4.932537 4.799294
1002_f_at 3.900466 4.208155 3.886169
> dim(e)
[1] 12625 128
> rownames(e)[1:8]
[1] "1000_at" "1001_at" "1002_f_at" "1003_s_at"
[5] "1004_at" "1005_at" "1006_at" "1007_s_at"
> colnames(e)[1:3]
[1] "01005" "01010" "03002"

-------------------------------------------------------
So far, so good.

Now, obviously we can select rows and columns of a matrix like ALL@assayData$exprs, which we've assigned to the variable e:

> e[1:2,1:3]
01005 01010 03002
1000_at 7.597323 7.479445 7.567593
1001_at 5.046194 4.932537 4.799294

What the Bioconductor guys have done is make this assignment valid for the ExpressionSet object itself:

> ALL[1:2,1:3]
ExpressionSet (storageMode: lockedEnvironment)
assayData: 2 features, 3 samples
element names: exprs
protocolData: none
phenoData
sampleNames: 01005 01010 03002
varLabels: cod diagnosis ... date last seen (21 total)
varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
pubMedIds: 14684422 16243790
Annotation: hgu95av2

That's either very cool or very confusing depending on where you sit.


Finally, the annotation. This requires another package:

ann = ALL@annotation
> ann
[1] "hgu95av2"

library('annaffy')
probeids <- featureNames(ALL)
symbols <- aafSymbol(probeids, 'hgu95av2')

> symbols <- aafSymbol(probeids, 'hgu95av2')
[1] "You are missing hgu95av2 looking to see if it is available."
Warning: unable to access index for repository http://brainarray.mbni.med.umich.edu/bioc/bin/macosx/leopard/contrib/2.11
Package hgu95av2 is available for download, would you like to install? [y/n]
y
Warning: unable to access index for repository http://brainarray.mbni.med.umich.edu/bioc/bin/macosx/leopard/contrib/2.11
trying URL 'http://bioconductor.org/packages/2.6/data/annotation/bin/macosx/leopard/contrib/2.11/hgu95av2_2.2.0.tgz'
Content type 'application/x-gzip' length 9340297 bytes (8.9 Mb)
opened URL
==================================================
downloaded 8.9 Mb


The downloaded packages are in
/var/folders/3g/3gHxNStTGZaWkGngOL39-++++TI/-Tmp-//RtmptUUJr1/downloaded_packages
Loading required package: hgu95av2

******* Deprecation warning *******:

The package 'hgu95av2' is deprecated and will not be supported in the future.

Instead we strongly reccomend that you should start using the 'hgu95av2.db' package.

We hope you will enjoy these new packages. If you still have questions, you can search
(http://dir.gmane.org/gmane.science.biology.informatics.conductor) or ask the mailing list
(http://bioconductor.org/docs/mailList.html).

To repeat:

library('annaffy')
probeids <- featureNames(ALL)
symbols <- aafSymbol(probeids, 'hgu95av2')
> class(symbols)
[1] "aafList"
> genenames = getText(symbols)
> genenames[1:3]
[1] "MAPK3" "TIE1" "CYP2C19"
> probeids[1:3]
[1] "1000_at" "1001_at" "1002_f_at"

We'll write this all down on a couple post-it notes and put it on the monitor for reference.