Your example isn't reproducible (for me anyway). Tryand see what information that tells you. I suspect it's a list you can access the pieces of and only part of it is printed.Code:str(res)
Hi! I'm new to R and I'm stuck in a problem with a particular package but perhaps someone can help me.
I'm using the cosmo package to find repeated subsequences in sequences of letters
http://www.bioconductor.org/packages...tml/cosmo.html
To run it with the dataset below I just execute 3 commands:
>library(cosmo)
> seqal <- system.file("Exfiles/short1.FASTA", package="cosmo")
> res <- cosmo(seqs=seqal, constraints="None", minW=10, maxW=10, models="TCM")
As result I get a list of motifs that I paste below but I need ALL the motifs not only the first 16, as I understand from the documentation "motifs" is a Character object how could I display it completely?
1 1146 91 1 TTTTATTTTT 0.9797432
2 1145 89 1 TTTTTTTTTT 0.9762943
3 1187 87 1 TTTTTTTTTA 0.8750529
4 1145 26 1 TTTTTTTTTA 0.8741454
5 1196 73 -1 CAAAAAAAAA 0.8727398
6 1196 97 -1 CAAAAAAAAA 0.8628938
7 1250 79 -1 TAAAAAAAAA 0.8587207
8 1157 40 -1 TAAAATAAAA 0.7922084
9 1256 92 -1 AAAACTAAAA 0.7366366
10 1145 102 1 TTTTTTTTAA 0.6705112
11 1250 13 -1 TAACCAAAAA 0.6118567
12 1114 70 -1 AAACCAAAAA 0.5864355
13 1088 86 -1 CAAACTAAAA 0.5631808
14 1122 73 -1 AAACAAAAAA 0.5310371
15 1088 124 1 TATTATTTTA 0.4966246
16 1187 8 -1 CAACAAAAAA 0.4187826
>1068
CACCACAACCACCACCACCCCCCACTTCATATATCAATATACAACCAACATCCTCCCTCAAACTAATCCAAAACCTCCCCCCAAAAA
>1073
CCAAAAAACAAATCAAAAATCCAATAATTCCACTTTTTAAATTACCTTCTCTATATATAACTCAAACAACA
>1088
CCCACAATCACCTCCTCCTTCTCCCTAATATACACCCCCACACCACACATCCAATCAATCATCTTTTTTTCCAAAATTTTTCCCCCAAACTAAAAACATATCCCCATACCCCCCCCCCCCCCTTATTATTTTAA
>1114
ACTCACACAAAATTCCCAATCTAAAATCACATTCACTATTCACTTTCTAATATTTCCTTAAAATACATAAAACCAAAAATCCTATTTTCACTACACAC
>1122
CAAAAACCCCAAAAACCTTTCCTTTCTTCCCCTTTAAAATAAACTCCATCAACCCTAAATAAAAATCCTCCCAAACAAAAAA
>1145
TTTCCTCTTTTTTTTTTTTTTCTCCTTTTTTTTTATTATTTATTATATATCTTAACATTTAATAATTACTTAATTTTATATTCTCTCCTTTTTTTTTTCTCTTTTTTTTAAATTTCCTAATTTTAATTATTTCACAATTCTAATTTCCACATTTATA
>1146
CCCTACTCATACTCTCCACTACCCAACACTCCTTCAATTAAATTTAATTCATTACCTTATACTACCATTAACACCCAAAAAATTTCTCACTTTTATTTTTAATTTTTCTATCAAATTAAATCCATCCATTA
>1157
CTCTAATTTTCATCCATTTCCCACCTTTACAACAAAACCTAAAATAAAATTTCCCAATTCAACTTCCAAACAATCTCTTAATTATCTTCCAATAATACTCATTAAAAAAATTCTCCAACCAACC
>1161
ACACTCATATCCAATTTATATCTCCAAAAAACTCATATAATACCCTACATTACTTATTCCCAACCACCTACAACAAAATCTCATCATTTAAA
>1164
TCTCCTATCTTCTTATAA
>1187
AAAACCACAACAAAAAAACCAAAAAACCCAAAAACTCCCCTTCTCCTTCTCTCAATAAATCATAATTCTCCTTTTCTCCTTTCTTCTTTTTTTTTATACCATAAATATACCTAATAA
>1196
CTCCACACACTCTACAAATCATCAACTATCCCCCAACACTCTCCCTTCCCCTTCCCCCCCCTAAACTAATCCCAAAAAAAAACCCCACCACAACCCCAAAAAAAAA
>1206
CTTCAAACCCCACAAACCTACATATACCTACAAAATTCACCAACAA
>1250
CCAACTTCATCATAACCAAAAATCCATACCCACTTAACATCCACATTATATAACAACAACCCCCAATATATACCTAATTAAAAAAAAACCACTAACAAAA
>1256
ACCAAAAACCCTTACAAAACTCCATAACACCTCAACAAACAAAAAAAAAAAAACCCACAAAACCCCTCAAAACACAAAAAACAATCACAAAAAAACTAAAAAAA
>1258
CCCTATATAATTACTACTATACTAACCCTTTCATTTAACCCTCTAATCTCCAATCTTACTACCCTACTCACACCCTCTACCTACACCCTAACTAACTCCTAATCTAACACTCCCAAATCACTA
>1293
TCTATCATCATATTCCTACATTATACTCACTCTCCTCCCTTCCCCTTCAACCTATTCAAATCTTCTCCCACCCCCTCCCCTCTTATTCCATTANG
Thanks
Ana
Your example isn't reproducible (for me anyway). Tryand see what information that tells you. I suspect it's a list you can access the pieces of and only part of it is printed.Code:str(res)
"If you torture the data long enough it will eventually confess."
-Ronald Harry Coase -
anarichardson (09-29-2012)
Try accessing the slot named motif of the returned cosmo class using the @ symbol or slot function
but there may not be more than 16Code:res@motifs slot(res, "motif")
anarichardson (09-29-2012)
I tried your example, was not reproducible for me either.
Instead, here's one example from the vignette. Str does the trick as trinker said.
Code:seqFile <- system.file("Exfiles/seq.fasta", package="cosmo") res <- cosmo(seqs=seqFile, constraints="None" , minW=7, maxW=8, models=c("OOPS", "TCM"))Code:summary(res) ## Displays the motifs seq pos orient motif prob 1 Seq1 69 1 CGCCAGCG 1.0000000 2 Seq6 21 1 CCCATGGG 1.0000000 3 Seq3 86 1 CGGACGCG 1.0000000 4 Seq10 14 1 GGCACGCG 1.0000000 5 Seq2 35 1 CGCGCGCG 1.0000000 6 Seq4 15 1 CCGGAGCG 1.0000000 7 Seq8 25 1 GGGCTGCC 1.0000000 8 Seq7 69 1 CGGGCGGG 0.9241186 9 Seq5 79 1 GGCCGGCG 0.9022922 10 Seq9 7 1 CGCCCGCG 0.8581486Code:str(res@seqs) ## displays the whole sequence $ :List of 2 ..$ seq : chr "GCTTCTGGTTCTATTGAAAAAATTCAGGTGGGGATGTCCAGGTGAAAAATATTTTTAGAATAGAAAAACGCCAGCGTTACATATCTTTAGGGAGAAGCGG" ..$ desc: chr "Seq1" $ :List of 2 ..$ seq : chr "GATTGTGTTTAAAATAGATGTTTAAAGTAGCTCTCGCGCGCGCACTTTAGATGAATTTAAATGATATACATTGAGGACGATATTCTTCAAAACCGCCGCA" ..... etc etc
Oh Thou Perelman! Poincare's was for you and Riemann's is for me.
anarichardson (09-29-2012)
Just an aside, you can paste your R codes and outputs using
[CODE] [/CODE]
For Example
[CODE] a1<-lm(y~x, data=mydata) [/CODE] displays
Or simply select your code/output and hastag them.Code:a1<-lm(y~x, data=mydata)
Oh Thou Perelman! Poincare's was for you and Riemann's is for me.
anarichardson (09-29-2012), trinker (09-29-2012)
Thanks so much for your comments, I've run
and got thisCode:> str(res@motifs)
I attach the short1.txt file with my sequences, the extension must be .FASTA to reproduce my case the .txt needs to be saved as .FASTA.Code:Formal class 'align' [package "cosmo"] with 6 slots ..@ seq : chr [1:16] "1250" "1145" "1145" "1122" ... ..@ pos : int [1:16] 79 89 101 73 56 87 26 87 73 70 ... ..@ orient: num [1:16] -1 1 1 -1 -1 1 1 -1 -1 -1 ... ..@ motif : chr [1:16] "TAAAAAAAAA" "TTTTTTTTTT" "CTTTTTTTTA" "AAACAAAAAA" ... ..@ prob : num [1:16] 0.91 0.908 0.863 0.827 0.794 ... ..@ eval : num 2.31e-24
But I can't find out how could I get more than 16 motifs.
I also tried
and gotCode:> str(res@motifs@motif)
and I also triedCode:chr [1:16] "TAAAAAAAAA" "TTTTTTTTTT" "CTTTTTTTTA" "AAACAAAAAA" "TAAATAAAAA" "TTTTTTTTTA" ...
and gotCode:> slot(res, "motifs")
If I runCode:seq pos orient motif prob 1 1250 79 -1 TAAAAAAAAA 0.9097117 2 1145 89 1 TTTTTTTTTT 0.9075282 3 1145 101 1 CTTTTTTTTA 0.8628318 4 1122 73 -1 AAACAAAAAA 0.8273926 5 1122 56 -1 TAAATAAAAA 0.7940559 6 1187 87 1 TTTTTTTTTA 0.7787428 7 1145 26 1 TTTTTTTTTA 0.7719197 8 1088 87 -1 AAACTAAAAA 0.7601947 9 1196 73 -1 CAAAAAAAAA 0.7511989 10 1114 70 -1 AAACCAAAAA 0.7446275 11 1196 97 -1 CAAAAAAAAA 0.7347967 12 1146 91 1 TTTTATTTTT 0.6645410 13 1088 123 1 TTATTATTTT 0.6589068 14 1256 93 -1 AAACTAAAAA 0.6422225 15 1250 13 -1 TAACCAAAAA 0.6247923 16 1157 40 -1 TAAAATAAAA 0.5360757
I get this matrix with non zero values in the G row, so there should be some G in the motifs so there should be more motifs other than those 16 that I don't know how to seeCode:> summary(res@pwm)
Code:Position weight matrix: 1 2 3 4 5 6 7 8 9 10 A 0.0000 0 0.0566 0.0000 0.119 0.1931 0.0000 0 0 0.2833 C 0.0941 0 0.0000 0.0000 0.000 0.0000 0.0000 0 0 0.0000 G 0.0000 0 0.0000 0.0281 0.000 0.1656 0.3166 0 0 0.0908 T 0.9059 1 0.9434 0.9719 0.881 0.6413 0.6834 1 1 0.6259
The PWM defines a way to score similar motifs - some motifs can score really well, others relatively poorly, and often one prefers to look at only the top scoring motifs. The contribution of the G is small in the PWM, thus most likely won't show up in any top scoring motifs. I'm guessing the cosmo function has some scoring cutoff (hence my comment above). Don't know if it has a way to alter the threshold to show lower scoring motifs, but you can investigate your sequences directly through Biostrings
Code:library(Biostrings) #Load the biostrings library seqs = readFASTA ( "Exfiles/short1.FASTA");#read in the Sequences #Analyze the first sequence with the PWM, specifying a minimum score m = matchPWM(res@pwm@pwm, seqs[[1]]$seq, min.score="75%")
anarichardson (10-03-2012)
Thanks copeg, I think that finding that cutoff value could be the key.. I reviewed all the input parameters and cannot find it..
That's why I'm trying to debug it but it's also confusing..
I'm a psychologist so my background in software is limited..
If I get it right you mean that I could use Biostrings package as another way to find motifs?
I mean that you can use the cosmo package to find the Position Weight Matrix (PWM), then you can use the Biostrings package to search a sequence for a particular motif using a PWM (it does not look like the cosmo function has a parameter to set a threshold). Using Biostrings gives you the liberty to not only manually set the score threshold, but also search any sequence (not just those in the training set)
Dason (10-03-2012)
But I've already found this motifs
How should I search a sequence for a particular motif? which particular motif? and which sequence?Code:1 1146 91 1 TTTTATTTTT 0.9797432 2 1145 89 1 TTTTTTTTTT 0.9762943 3 1187 87 1 TTTTTTTTTA 0.8750529 4 1145 26 1 TTTTTTTTTA 0.8741454 5 1196 73 -1 CAAAAAAAAA 0.8727398 6 1196 97 -1 CAAAAAAAAA 0.8628938 7 1250 79 -1 TAAAAAAAAA 0.8587207 8 1157 40 -1 TAAAATAAAA 0.7922084 9 1256 92 -1 AAAACTAAAA 0.7366366 10 1145 102 1 TTTTTTTTAA 0.6705112 11 1250 13 -1 TAACCAAAAA 0.6118567 12 1114 70 -1 AAACCAAAAA 0.5864355 13 1088 86 -1 CAAACTAAAA 0.5631808 14 1122 73 -1 AAACAAAAAA 0.5310371 15 1088 124 1 TATTATTTTA 0.4966246 16 1187 8 -1 CAACAAAAAA 0.4187826
I think that the easier way is to try to make cosmo show all the motifs and not only 16.. but I don't know how to show variables in R, for sure that the motifs I'm searching for are in some of the variables in the cosmo.R script but I don't know how to show them..
I tried printing some messages and rebuilding the package but didn't work.
See post #7. Does the code make sense?How should I search a sequence for a particular motif? which particular motif? and which sequence?
Why is this easier? Did you try the code I posted above? Again, there may be more than 16 sequences, but the probabilities are going to be low (and thus chance of false positives high)I think that the easier way is to try to make cosmo show all the motifs and not only 16.. but I don't know how to show variables in R, for sure that the motifs I'm searching for are in some of the variables in the cosmo.R script but I don't know how to show them..
I tried printing some messages and rebuilding the package but didn't work
Begin my .02 - while I don't know what you are looking for (promoters?) I'd be skeptical of TTTTTTTTTT being amongst the top hits (it has low complexity and found in a great many sequences). You seem to be convinced there are more motifs than those that cosmo reported...and while there are (defined by the PWM) their probabilities are low (hence not reported by cosmo), and the harder you look - the less restrictive you are with scoring tolerances - your chances of false positives dramatically increases. A recommendation: DUST the sequences (see the NCBI toolkit) prior to analysis (presuming cosmo accepts dusted sequences), and do some post-hoc type of analysis in which you attempt to make sure the PWM you find is enriched in the training sequences relative to another list of similar sequences.
End My .02
anarichardson (10-04-2012)
I think I should have explained the original problem before trying to solve it.. my sequences aren't biological data, they are not DNA sequences, so in my sequences I don't have the G letter, only A,C and T.. so when I run cosmo it crashes. Then I added one G in one of my sequences so cosmo at least does not crash and I get some motifs.
I thought that I could find all the motifs that cosmo uses for building the pwm and then delete the motifs with G (they have low probabilities) and manually create a new pwm.. but I think that this is not the best solution.. I don't even know what a promoter is.. what I'm trying to find are short sequences that appear (with higher probability) in my sequences
|
|