+ Reply to Thread
Results 1 to 5 of 5

Thread: Finding start and stop codons in a DNA sequence

  1. #1
    Points: 675, Level: 13
    Level completed: 50%, Points required for next Level: 25

    Posts
    15
    Thanks
    7
    Thanked 0 Times in 0 Posts

    Finding start and stop codons in a DNA sequence



    Hi

    I copied and pasted texts below that is found in a website (http://a-little-book-of-r-for-bioinf.../chapter7.html). But that didn't work. Instead I got "Error in .Method(..., na.last = na.last, decreasing = decreasing) : argument 1 is not a vector"

    What do I need to do to get this one correct???

    Code: 
    library("Biostrings")
    s1 <- "aaaatgcagtaacccatgccc"
    matchPattern("atg", s1) # Find all ATGs in the sequence s1
    #  Views on a 21-letter BString subject
    #subject: aaaatgcagtaacccatgccc
    #views:
    #    start end width
    #[1]     4   6     3 [atg]
    #[2]    16  18     3 [atg]
    
    findPotentialStartsAndStops <- function(sequence)
      {
         # Define a vector with the sequences of potential start and stop codons
         codons            <- c("atg", "taa", "tag", "tga")
         # Find the number of occurrences of each type of potential start or stop codon
         for (i in 1:4)
         {
            codon <- codons[i]
            # Find all occurrences of codon "codon" in sequence "sequence"
            occurrences <- matchPattern(codon, sequence)
            # Find the start positions of all occurrences of "codon" in sequence "sequence"
            codonpositions <- attr(occurrences,"start")
            # Find the total number of potential start and stop codons in sequence "sequence"
            numoccurrences <- length(codonpositions)
            if (i == 1)
            {
               # Make a copy of vector "codonpositions" called "positions"
               positions <- codonpositions
               # Make a vector "types" containing "numoccurrences" copies of "codon"
               types <- rep(codon, numoccurrences)
            }
            else
            {
               # Add the vector "codonpositions" to the end of vector "positions":
               positions   <- append(positions, codonpositions, after=length(positions))
               # Add the vector "rep(codon, numoccurrences)" to the end of vector "types":
               types       <- append(types, rep(codon, numoccurrences), after=length(types))
            }
         }
         # Sort the vectors "positions" and "types" in order of position along the input sequence:
         indices <- order(positions)
         positions <- positions[indices]
         types <- types[indices]
         # Return a list variable including vectors "positions" and "types":
         mylist <- list(positions,types)
         return(mylist)
      }
    
    s1 <- "aaaatgcagtaacccatgccc"
    findPotentialStartsAndStops(s1)

  2. #2
    RotParaTon
    Points: 47,092, Level: 100
    Level completed: 0%, Points required for next Level: 0
    Awards:
    Discussion EnderPosting AwardCommunity AwardMaster TaggerFrequent Poster
    Dason's Avatar
    Location
    Ames, IA
    Posts
    9,184
    Thanks
    212
    Thanked 1,639 Times in 1,401 Posts

    Re: Finding start and stop codons in a DNA sequence

    Well whoever coded that originally isn't very good at coding "the R way" but I'll forget about that for now. The problem occurs with
    Code: 
    codonpositions <- attr(occurrences, "start")
    that doesn't work since an XStringViews object doesn't have a "start" attribute. It has a ranges attribute that has start values but really there's an easier way to get those start values... Just change that line to this:

    Code: 
    codonpositions <- start(occurrences)
    and everything should work fine. But yeah... that code could be much improved.
    "His programming is malfunctioning. It begins! Get your weapons, he's going to become a killbot!!!" - bryangoodrich

  3. #3
    RotParaTon
    Points: 47,092, Level: 100
    Level completed: 0%, Points required for next Level: 0
    Awards:
    Discussion EnderPosting AwardCommunity AwardMaster TaggerFrequent Poster
    Dason's Avatar
    Location
    Ames, IA
    Posts
    9,184
    Thanks
    212
    Thanked 1,639 Times in 1,401 Posts

    Re: Finding start and stop codons in a DNA sequence

    But really this would be a much nicer way to do the same task:

    Code: 
    
    library("Biostrings")
    
    findPotentialStartsAndStops <- function(sequence, codons = c("atg", "taa", "tag", "tga")){
        # find the starting position of the matches for all the codons
        startpositions <- sapply(codons, function(x){start(matchPattern(x, sequence))})
        # combine them into a vector and sort by position
        output <- sort(do.call(c, startpositions))
        
        # If we really *need* a list
        #output <- list(as.numeric(output), names(output))
        
        return(output)
    
    }
    
    s1 <- "aaaatgcagtaacccatgccc"
    findPotentialStartsAndStops(s1)
    I personally like the results more as a named vector but if you really want the results as a list then you can just uncomment the list part of the function and you should get the same results as before - just faster... and the code is much more R like.
    "His programming is malfunctioning. It begins! Get your weapons, he's going to become a killbot!!!" - bryangoodrich

  4. The Following User Says Thank You to Dason For This Useful Post:

    youngukoo (09-23-2012)

  5. #4
    Points: 675, Level: 13
    Level completed: 50%, Points required for next Level: 25

    Posts
    15
    Thanks
    7
    Thanked 0 Times in 0 Posts

    Re: Finding start and stop codons in a DNA sequence

    Thanks for your quick reply!

    That worked

    I encountered another problem with finding open reading frames using texts below. I got "numeric (0)". I would appreciate it if you could help me with this.

    > findORFsinSeq <- function(sequence)
    {
    require(Biostrings)
    # Make vectors "positions" and "types" containing information on the positions of ATGs in the sequence:
    mylist <- findPotentialStartsAndStops(sequence)
    positions <- mylist[[1]]
    types <- mylist[[2]]
    # Make vectors "orfstarts" and "orfstops" to store the predicted start and stop codons of ORFs
    orfstarts <- numeric()
    orfstops <- numeric()
    # Make a vector "orflengths" to store the lengths of the ORFs
    orflengths <- numeric()
    # Print out the positions of ORFs in the sequence:
    # Find the length of vector "positions"
    numpositions <- length(positions)
    # There must be at least one start codon and one stop codon to have an ORF.
    if (numpositions >= 2)
    {
    for (i in 1numpositions-1))
    {
    posi <- positions[i]
    typei <- types[i]
    found <- 0
    while (found == 0)
    {
    for (j in (i+1):numpositions)
    {
    posj <- positions[j]
    typej <- types[j]
    posdiff <- posj - posi
    posdiffmod3 <- posdiff %% 3
    # Add in the length of the stop codon
    orflength <- posj - posi + 3
    if (typei == "atg" && (typej == "taa" || typej == "tag" || typej == "tga") && posdiffmod3 == 0)
    {
    # Check if we have already used the stop codon at posj+2 in an ORF
    numorfs <- length(orfstops)
    usedstop <- -1
    if (numorfs > 0)
    {
    for (k in 1:numorfs)
    {
    orfstopk <- orfstops[k]
    if (orfstopk == (posj + 2)) { usedstop <- 1 }
    }
    }
    if (usedstop == -1)
    {
    orfstarts <- append(orfstarts, posi, after=length(orfstarts))
    orfstops <- append(orfstops, posj+2, after=length(orfstops)) # Including the stop codon.
    orflengths <- append(orflengths, orflength, after=length(orflengths))
    }
    found <- 1
    break
    }
    if (j == numpositions) { found <- 1 }
    }
    }
    }
    }
    # Sort the final ORFs by start position:
    indices <- order(orfstarts)
    orfstarts <- orfstarts[indices]
    orfstops <- orfstops[indices]
    # Find the lengths of the ORFs that we have
    orflengths <- numeric()
    numorfs <- length(orfstarts)
    for (i in 1:numorfs)
    {
    orfstart <- orfstarts[i]
    orfstop <- orfstops[i]
    orflength <- orfstop - orfstart + 1
    orflengths <- append(orflengths,orflength,after=length(orflengths))
    }
    mylist <- list(orfstarts, orfstops, orflengths)
    return(mylist)
    }

    > s1 <- "aaaatgcagtaacccatgccc"
    > findORFsinSeq(s1)
    > findORFsinSeq(s1)
    [[1]]
    numeric(0)

    [[2]]
    numeric(0)

    [[3]]
    [1] NA

    >

  6. #5
    RotParaTon
    Points: 47,092, Level: 100
    Level completed: 0%, Points required for next Level: 0
    Awards:
    Discussion EnderPosting AwardCommunity AwardMaster TaggerFrequent Poster
    Dason's Avatar
    Location
    Ames, IA
    Posts
    9,184
    Thanks
    212
    Thanked 1,639 Times in 1,401 Posts

    Re: Finding start and stop codons in a DNA sequence


    You should really make a separate thread for new questions. It really sounds like you just need some practice with R though so I would recommend checking out some of the resources in this thread. One other thing that would really help is if you would wrap your code in code tags like this:


    [code]
    a <- 2 + 2
    [/code]


    which gets changed to
    Code: 
    a <- 2 + 2
    which is much nicer because inside code tags indentation doesn't get stripped away.
    "His programming is malfunctioning. It begins! Get your weapons, he's going to become a killbot!!!" - bryangoodrich

+ Reply to Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts








Advertise on Talk Stats