Python for Bioinformatics: Dissecting an ExpressionSet object

Last time, we were in R using some Bioconductor code. If we do this:



library('ALL')

data('ALL')

then the variable ALL is available



> ALL

ExpressionSet (storageMode: lockedEnvironment)

assayData: 12625 features, 128 samples 

  element names: exprs 

protocolData: none

phenoData

  sampleNames: 01005 01010 ... LAL4 (128 total)

  varLabels: cod diagnosis ... date last seen (21 total)

  varMetadata: labelDescription

featureData: none

experimentData: use 'experimentData(object)'

  pubMedIds: 14684422 16243790 

Annotation: hgu95av2

(note: your output may vary slightly depending on the R and Bioconductor version.)

Obviously the object that ALL refers to is complicated. Its components are called "slots" and they are accessible with the syntax: object_name@slot_name. For example:



pData = ALL@phenoData

You can find out which slots are available by parsing the above output or just doing:



> slotNames(ALL)

[1] "assayData"         "phenoData"        

[3] "featureData"       "experimentData"   

[5] "annotation"        "protocolData"     

[7] ".__classVersion__"

In turn the slots are also complex objects. Technically, a phenoData object is:



> class(ALL@phenoData)

[1] "AnnotatedDataFrame"

attr(,"package")

[1] "Biobase"

More important, its sub-objects are available with the syntax slot_name$sub-object_name. But first we must discover which names are available:



> pData = ALL@phenoData

> varLabels(pData)

 [1] "cod"            "diagnosis"      "sex"           

 [4] "age"            "BT"             "remission"     

 [7] "CR"             "date.cr"        "t(4;11)"       

[10] "t(9;22)"        "cyto.normal"    "citog"         

[13] "mol.biol"       "fusion protein" "mdr"           

[16] "kinet"          "ccr"            "relapse"       

[19] "transplant"     "f.u"            "date last seen"

So, for example:



> pData$sex[1:5]

[1] M M F M M

Levels: F M



> pData$mol.biol[1:5]

[1] BCR/ABL  NEG      BCR/ABL  ALL1/AF4 NEG     

Levels: ALL1/AF4 BCR/ABL E2A/PBX1 NEG NUP-98 p15/p16



> levels(pData$mol.biol)

[1] "ALL1/AF4" "BCR/ABL"  "E2A/PBX1" "NEG"     

[5] "NUP-98"   "p15/p16"

At this level, the objects are reasonably simple.



> class(pData$mol.biol)

[1] "factor"

> class(pData$sex)

[1] "factor"

Don't forget the other method names:



> sampleNames(pData)[1:5]

[1] "01005" "01010" "03002" "04006" "04007"



> featureNames(ALL)[1:8]

[1] "1000_at"   "1001_at"   "1002_f_at" "1003_s_at"

[5] "1004_at"   "1005_at"   "1006_at"   "1007_s_at"



> head(varMetadata(pData))

                                                         labelDescription

cod                                                            Patient ID

diagnosis                                               Date of diagnosis

sex                                                 Gender of the patient

age                                           Age of the patient at entry

BT                             does the patient have B-cell or T-cell ALL

remission  Complete remission(CR), refractory(REF) or NA. Derived from CR

And, I should point out that (without actually knowing it at the time) I shadowed a real function in the namespace by assigning to pData above.

If we reboot R and do:



library('ALL')

data('ALL')

we can use a different method for access:



> p = pData(ALL)

> class(p)

[1] "data.frame"

> names(p)

 [1] "cod"            "diagnosis"      "sex"           

 [4] "age"            "BT"             "remission"     

 [7] "CR"             "date.cr"        "t(4;11)"       

[10] "t(9;22)"        "cyto.normal"    "citog"         

[13] "mol.biol"       "fusion protein" "mdr"           

[16] "kinet"          "ccr"            "relapse"       

[19] "transplant"     "f.u"            "date last seen"

> p$sex[1:5]

[1] M M F M M

Levels: F M

The very first slot holds the gene expression data.



> aData = ALL@assayData

> class(aData)

[1] "environment"

I'm not really sure what that class is. And, unfortunately, it is not easy to find which sub-object names are available for this slot:



> names(aData)

NULL

> elements(aData)

Error: could not find function "elements"

> elementNames(aData)

Error: could not find function "elementNames"

except we can probably guess that exprs is one (based on the output of the call `ALL` above:



> e = aData$exprs

> class(e)

[1] "matrix"

> e[1:3,1:3]

             01005    01010    03002

1000_at   7.597323 7.479445 7.567593

1001_at   5.046194 4.932537 4.799294

1002_f_at 3.900466 4.208155 3.886169

> dim(e)

[1] 12625   128

> rownames(e)[1:8]

[1] "1000_at"   "1001_at"   "1002_f_at" "1003_s_at"

[5] "1004_at"   "1005_at"   "1006_at"   "1007_s_at"

> colnames(e)[1:3]

[1] "01005" "01010" "03002"

-------------------------------------------------------
So far, so good.

Now, obviously we can select rows and columns of a matrix like ALL@assayData$exprs, which we've assigned to the variable e:



> e[1:2,1:3]

           01005    01010    03002

1000_at 7.597323 7.479445 7.567593

1001_at 5.046194 4.932537 4.799294

What the Bioconductor guys have done is make this assignment valid for the ExpressionSet object itself:



> ALL[1:2,1:3]

ExpressionSet (storageMode: lockedEnvironment)

assayData: 2 features, 3 samples 

  element names: exprs 

protocolData: none

phenoData

  sampleNames: 01005 01010 03002

  varLabels: cod diagnosis ... date last seen (21 total)

  varMetadata: labelDescription

featureData: none

experimentData: use 'experimentData(object)'

  pubMedIds: 14684422 16243790 

Annotation: hgu95av2

That's either very cool or very confusing depending on where you sit.

Finally, the annotation. This requires another package:



ann = ALL@annotation

> ann

[1] "hgu95av2"



library('annaffy')

probeids <- featureNames(ALL)

symbols <- aafSymbol(probeids, 'hgu95av2')



> symbols <- aafSymbol(probeids, 'hgu95av2')

[1] "You are missing hgu95av2 looking to see if it is available."

Warning: unable to access index for repository http://brainarray.mbni.med.umich.edu/bioc/bin/macosx/leopard/contrib/2.11

Package hgu95av2 is available for download, would you like to install? [y/n] 

y

Warning: unable to access index for repository http://brainarray.mbni.med.umich.edu/bioc/bin/macosx/leopard/contrib/2.11

trying URL 'http://bioconductor.org/packages/2.6/data/annotation/bin/macosx/leopard/contrib/2.11/hgu95av2_2.2.0.tgz'

Content type 'application/x-gzip' length 9340297 bytes (8.9 Mb)

opened URL

==================================================

downloaded 8.9 Mb





The downloaded packages are in

	/var/folders/3g/3gHxNStTGZaWkGngOL39-++++TI/-Tmp-//RtmptUUJr1/downloaded_packages

Loading required package: hgu95av2



******* Deprecation warning *******:



The package 'hgu95av2' is deprecated and will not be supported in the future. 



Instead we strongly reccomend that you should start using the 'hgu95av2.db' package. 



We hope you will enjoy these new packages.  If you still have questions, you can search 

(http://dir.gmane.org/gmane.science.biology.informatics.conductor) or ask the mailing list 

(http://bioconductor.org/docs/mailList.html).

To repeat:



library('annaffy')

probeids <- featureNames(ALL)

symbols <- aafSymbol(probeids, 'hgu95av2')

> class(symbols)

[1] "aafList"

> genenames = getText(symbols)

> genenames[1:3]

[1] "MAPK3"   "TIE1"    "CYP2C19"

> probeids[1:3]

[1] "1000_at"   "1001_at"   "1002_f_at"

We'll write this all down on a couple post-it notes and put it on the monitor for reference.

Wednesday, August 17, 2011

Dissecting an ExpressionSet object