Python for Bioinformatics: Bioonductor: summary

Saturday, August 20, 2011

Bioonductor: summary

[ Update: Sorry for the dup. I tried using the new formatting on this one, but found a problem, which I hope is fixed now.]

I promise this is the last one. Here we load the sample data from ExpressionSet. I show (once again) how to obtain all the relevant stuff: expression data, sample names, feature names, and gene names.

One of the nice things about ExpressionSet objects is the very fancy indexing one can do with them. A great example is given on p. 166 of the Bioconductor book.

I made an stab at implementing range selectors here but it's almost trivial. Also, in this one, the Python code does not contain any informative whitespace, so I've used the <code> tag.
The script is first and then the output. If you can't figure something out, just ask.

script.py

import rpy2.robjects as robjects 
from rpy2.robjects.packages import importr  
import rpy2.rinterface as ri  
import bioc.biobase as biobase 

r = robjects.r 
g = robjects.globalenv 
#gd = importr('grDevices') 
dim = r['dim'] 

r('data("sample.ExpressionSet")') 
sample_ExpressionSet = r['sample.ExpressionSet'] 
eset = sample_ExpressionSet 
#print tuple(ri.globalenv) 
print eset 

print dim(eset) 
ed = eset.exprs 
print ed.rx(1,1) 

# no way to do es[x,y] from Python? 
# just wrap it 
r(''' 
do_sel <-function(obj,sel,bycol=FALSE) {
if (bycol) { return(obj[,sel]) } 
else { return(obj[sel,]) } 
} 
''') 

do_sel = r['do_sel'] 
# both work: 
v = robjects.IntVector([2,3,6]) 
v = robjects.BoolVector([True,False]*13) 
eset_sub = do_sel(eset,v,bycol=True) 
eset_sub = do_sel(eset_sub,v) 
print dim(eset_sub) 

# no explicit featureData, just from data rows 
rownames = r['rownames'] 
fd = rownames(ed) 
print fd[:2] 

# get the gene names 
importr('annaffy') 
r(''' 
get.genenames <- function(probeids) {
symbols <- aafSymbol(probeids, 'hgu95av2') 
getText(symbols) } 
''') 
get_genenames = r['get.genenames'] 
gn = get_genenames(fd) 

# slightly fancy, no dups, original order 
from collections import OrderedDict 
z = zip(gn,[None]*len(gn)) 
D = OrderedDict(z) 
L = D.keys() 
for i in range(4):  print L[i*5:i*5+5] 
print 

pd = eset.do_slot('phenoData') 
data = pd.do_slot('data') 
v = robjects.IntVector(range(1,4)) 
print do_sel(data,v)

> python script.py 
ExpressionSet (storageMode: lockedEnvironment) 
assayData: 500 features, 26 samples  
element names: exprs, se.exprs  
protocolData: none 
phenoData 
sampleNames: A B ... Z (26 total) 
varLabels: sex type score 
varMetadata: labelDescription 
featureData: none 
experimentData: use 'experimentData(object)' 
Annotation: hgu95av2  

Features  Samples  
500       26  

[1] 192.742 

Features  Samples  
3        3  

[1] "AFFX-MurIL2_at"  "AFFX-MurIL10_at" 


******* Deprecation warning *******: 

The package 'hgu95av2' is deprecated and will not be supported in the future.  

Instead we strongly reccomend that you should start using the 'hgu95av2.db' package.  

We hope you will enjoy these new packages.  If you still have questions, you can search  
(http://dir.gmane.org/gmane.science.biology.informatics.conductor) or ask the mailing list  
(http://bioconductor.org/docs/mailList.html).   


Loading required package: hgu95av2 
['', 'STAT1', 'GAPDH', 'ACTB', 'TFRC'] 
['C15orf31', 'ACTL6B', 'GLRA1', 'KCNB2', 'MGAT5'] 
['BMP3', 'RPL14', 'ZCWPW2', 'KITLG', 'GPR12'] 
['TRA@', 'ATP8A2', 'C1orf68', 'FAM20B', 'SLC34A1'] 

sex    type score 
A Female Control  0.75 
B   Male    Case  0.40 
C   Male Control  0.73 

>