
Thread: remove bracketed items of a character string

  1. #1
    FormerlyKnownAsRaptor
    Points: 24,248, Level: 94
    Level completed: 90%, Points required for next Level: 102
    trinker's Avatar
    Location
    Buffalo, NY
    Posts
    3,156
    Thanks
    882
    Thanked 544 Times in 492 Posts

    remove bracketed items of a character string



    It is the convention of transcript writing in my field to use bracketed items such as [unintelligible] to denote things that were not actually spoken. Obviously these need to be removed before textual analysis can be performed. I'm working towards an eventual release of a text-mining R package geared towards discourse analysis (it's either going to get me accolades or booted out of the qualitative-driven literacy program [I doubt the latter, because many actually seem to like numbers but are uncomfortable with them]).

    Typically I go through and remove these by determining what was used in the transcript and then deleting each one by "hand". I think it would be better to have a more generalizable function. I was thinking one that takes a character string and deletes anything in between the brackets and including the brackets.

    How can I approach this?

    Here's how I typically do the process:
    Code: 
    x <- "What kind of cheese isn't your cheese? [wonder] Nacho cheese! [groan] [Groan] [Laugh]"
    
    x <- gsub('[wonder]', "", x, fixed=TRUE)
    x <- gsub('[groan]', "", x, fixed=TRUE)
    x <- gsub('[Groan]', "", x, fixed=TRUE)
    x <- gsub('[Laugh]', "", x, fixed=TRUE)
    x
    It'd be nicer to type:
    Code: 
    nonDialgueR(x)
    And it returns:
    Code: 
    "What kind of cheese isn't your cheese?  Nacho cheese!   "
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  2. #2
    Probably A Mammal
    Points: 14,462, Level: 78
    Level completed: 3%, Points required for next Level: 388
    bryangoodrich's Avatar
    Location
    Sacramento, California, United States
    Posts
    1,951
    Thanks
    221
    Thanked 419 Times in 387 Posts

    Re: remove bracketed items of a character string

    Thank you Stacker.

    Code: 
    x <- "What kind of cheese isn't your cheese? [wonder] Nacho cheese! [groan] [Groan] [Laugh]" 
    gsub("\\[.*?\\]", "", x)  # Outputs: [1] "What kind of cheese isn't your cheese?  Nacho cheese!   "
    Explanation about regular expressions? Usually "\[" would be enough to escape the special meaning of "[" and treat it literally. However, this is R, where the backslash is also the escape character inside string literals, so you need to "double escape" the sequence with "\\[". Thus, with the string "\\[.*?\\]" we are escaping the opening and closing brackets. Done. (Note, if you wanted to search for a literal "\" you would need to double escape it, too; you'd need "\\\\".)
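    To make the double escaping concrete, here is a minimal sketch in base R. The variable names and the sample path are made up for illustration; the point is only that the string R stores is shorter than what you type.

```r
# What you type versus what the regex engine receives.
pattern <- "\\["
n_chars <- nchar(pattern)   # 2 characters: a backslash and a bracket
writeLines(pattern)         # shows the pattern exactly as the regex engine sees it

# Matching a literal backslash takes four backslashes in the R source:
# the string holds two, and the regex engine reads those two as one literal.
win_path <- "C:\\temp\\file.txt"   # hypothetical path, stored as C:\temp\file.txt
unix_style <- gsub("\\\\", "/", win_path)

# With fixed = TRUE there is no regex at all, so nothing needs escaping.
no_open_bracket <- gsub("[", "", "a [b] c", fixed = TRUE)
```

    Printing a pattern with writeLines() before using it is a quick way to check what actually reaches the regex engine.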

    The real trick then, is understanding the sequence ".*?" in the expression. As any reference will tell you, the "." matches any character. If you were to do

    Code: 
    gsub(".", "", x)
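    You'd get back an empty string. Why? Because it matches every single character in your string! The "*" is a quantifier saying "match it zero or more times." On its own, "?" means "zero or one times," but placed right after another quantifier it makes that quantifier lazy, so "*?" becomes a "lazy star": it still matches the item zero or more times, but takes as few characters as it can rather than as many as it can. The solution I provided will work with a "lazy plus" ("+?", one or more times, as few as possible), too. To get a taste for it, contrive a simple example.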

    Code: 
    # The dot alone takes out one item
    gsub("\\[.\\]", "", "blah [b] blah")
    # [1] "blah  blah"
    
    # Works with multiple occurrences of the pattern
    gsub("\\[.\\]", "", "blah [b] blah [d]")
    # [1] "blah  blah "
    
    # Multiple characters in between [..] require more "." which requires quantification
    gsub("\\[.\\]", "", "blah [bbb] blah [ddd]")
    # [1] "blah [bbb] blah [ddd]"
    
    # The problem is that now it's matching "bbb] blah [ddd" as the stuff between brackets
    gsub("\\[.+\\]", "", "blah [bbb] blah [ddd]")
    # [1] "blah "
    
    # Removes the one instance without requiring a "lazy star/plus" because the above problem doesn't arise
    gsub("\\[.+\\]", "", "blah [bbb] blah")
    # [1] "blah  blah"
    
    # Thus, the solution requires a "lazy star/plus" to keep our quantification from being "greedy"
    gsub("\\[.+?\\]", "", "blah [bbb] blah [ddd]")
    # [1] "blah  blah "
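    As an aside, the greedy problem above can also be sidestepped without a lazy quantifier by forbidding the closing bracket inside the match. The sketch below is one way to do that, assuming perl = TRUE so the escaped "]" inside the class is read the PCRE way; the whitespace tidy-up at the end is an optional extra, not something the original question asked for.

```r
x <- "What kind of cheese isn't your cheese? [wonder] Nacho cheese! [groan] [Groan] [Laugh]"

# Lazy version from this post.
lazy <- gsub("\\[.*?\\]", "", x)

# Negated character class: "[", then anything that is not "]", then "]".
negated <- gsub("\\[[^\\]]*\\]", "", x, perl = TRUE)

# Optional tidy-up: squeeze runs of whitespace and trim the ends.
tidy <- trimws(gsub("\\s+", " ", negated))
```

    Both patterns give the same result on this string; the negated class has the advantage of also working in tools that have no lazy quantifiers.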
    Since you're doing a lot of textual analysis, I recommend checking out a regular expressions book of some sort from your library. Regex is a very, very powerful tool. You can spend a lot of time building very short expressions that will do a lot. If they become natural to you, you can use them all over the place. Pretty much every good programming language you use will have an implementation (Perl, Python, R), and if you get into Linux, it and its tools (the shell, awk, sed) make use of regex as well. You just have to tweak your expressions to work in the given environment (e.g., R needs the doubled backslashes in escape sequences, and R takes the expression as a string passed to a function, whereas some other languages take it between slashes, as in /... regex .../).

    From what I'm learning with awk (which you can also run on Windows), you could probably process text documents pretty well. Say you wanted to clean your transcripts before you process them (basically, do the preprocessing of the data outside of R). You could apply the above idea with awk's gsub function or sed's s command (sed I don't find to be a clean language to look at, but it is very succinct). Note that awk and sed have no lazy quantifiers, so there you'd write the pattern with a negated class, \[[^]]*\], instead. Then write the result to a file or even pipe it into an R process (on Linux there are ways to make R scripts executable like, say, a Python script) and have the preprocessed text go straight into your R script for processing. I'm not sure what kind of processing time you might face, but I'm just saying, this could be a good workflow in some instances. That, and I like sharing my new passions, and I'm finding ways to love awk, haha.


  4. #3
    FormerlyKnownAsRaptor
    trinker's Avatar

    Re: remove bracketed items of a character string

    Bryangoodrich

    Thank you, that's perfect. I set up the function with switch so it can remove curly, round, or square brackets, or all of them. Very nice.
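    A function along those lines might look like the sketch below. This is a guess at the shape, not the actual code from this post: the name remove_bracketed, the bracket argument, and its option names are all hypothetical, and perl = TRUE is an assumption made so the escaped braces and the alternation behave predictably.

```r
# Hypothetical sketch: remove bracketed spans of the chosen type(s).
remove_bracketed <- function(text, bracket = c("square", "round", "curly", "all")) {
  bracket <- match.arg(bracket)  # defaults to "square", rejects unknown values
  pattern <- switch(bracket,
    square = "\\[.*?\\]",
    round  = "\\(.*?\\)",
    curly  = "\\{.*?\\}",
    all    = "\\[.*?\\]|\\(.*?\\)|\\{.*?\\}"
  )
  # gsub is vectorised, so text can be a whole column of transcript lines.
  gsub(pattern, "", text, perl = TRUE)
}

sample_line <- "x[b]y(c)z{d}"
only_square <- remove_bracketed(sample_line)
only_round  <- remove_bracketed(sample_line, "round")
everything  <- remove_bracketed(sample_line, "all")
```

    Nested brackets such as [a [b] c] are not handled by these patterns; each lazy match stops at the first closing bracket it finds.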

    Good call on needing to learn some regular expression stuff. It's my goal to eventually learn perl and it sounds like the learning here will be transferable to that.

    Do you think you could call awk from within R? I'm hoping to make it all internal because people in my field aren't programming savvy (let's put it this way: I'd be considered a top-notch programmer in the literacy field). They like stuff to just work. Getting them to do it in R will be a stretch, but they will if they see the fruits [insert hopeful here].

  5. #4
    Probably A Mammal
    bryangoodrich's Avatar

    Re: remove bracketed items of a character string


    This is why I think learning Python could be useful. While R may have the capabilities to interface with a lot of stuff, I don't believe it is as rich in that regard as Python, and Python can call R code, either by running R as an external process or by instantiating an R session and calling commands from within Python. Something similar to

    Code: 
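    # One possible way (an assumption, not named in this post): the rpy2 package
    from rpy2.robjects import r, IntVector
    r.plot(IntVector([1, 2, 3]), IntVector([1, 4, 9]))
    print(r.rnorm(10))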
    Learning a bit of Python could be good for you. Perl is useful, especially in textual analysis, but when it comes to using regular expressions, like I said, they're pretty much available everywhere. As for awk, I haven't found much on integrating it with R. The only way I can tell is to make a system call from within R, but then awk needs to be available on the system. The alternative would be to write similar functions in another language, like Python or Perl, which would still need to be installed on those systems, but at least Python is much more likely to be there.

    EDIT:

    I did find this LINK about awk. Apparently it comes with RTools, so you'll have the Windows version of gawk when you install RTools. You can make a system call to it for quick processing, especially of larger files that you'd not want to load into R, e.g., preformatting text such as formatted currency that you want as a pure number, or reformatting a date field to fit the format expected for Date input. The example from that Help page was about grabbing a select set of columns from a large data file that would otherwise be taxing or cause R to crash. I'm dealing with that problem myself. My solution is to store all my vital statistics in a large SQLite database and then connect to the database, letting SQLite handle the basic data parsing with my SQL queries. I could also program C functions, I learned today, that could run aggregation processes in SQLite, which could further customize some processing for me outside of R and remain very (computationally) efficient.

    In any case, awk can be used from R, and apparently you can even pipe the results with a function so named, pipe(). At the shell, for instance, I could take the output of ls -l and process it in awk: ls -l | awk '{ printf "%s, %s\n", $1, $3 }' should return the first and third fields of the long-format directory listing, treated as whitespace separated. There's another utility mentioned called cut, which I doubt is like the R function. I'm going to have to look into it and see what it does. It may be another useful data management utility to add to my Swiss army knife!
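    For the "call awk from within R" part, a minimal sketch using base R's system2() could look like this. The helper name run_awk and the sample records are made up for illustration, and the whole thing assumes an awk executable is on the PATH (on Windows, for example, the one that ships with RTools); quoting rules differ between shells, so treat it as a starting point.

```r
# Hypothetical helper: send lines of text through awk and capture what it prints.
run_awk <- function(lines, program) {
  if (!nzchar(Sys.which("awk"))) {
    stop("awk was not found on the PATH")
  }
  # shQuote() keeps the shell from expanding $1, $3, etc. in the awk program.
  # input = lines is written to a temporary file and fed to awk's standard input.
  system2("awk", args = shQuote(program), input = lines, stdout = TRUE)
}

records <- c("406378 Sales Itorre", "031762 Marketing Nasium")
if (nzchar(Sys.which("awk"))) {
  print(run_awk(records, "{ print $1, $3 }"))
}
```

    Because the helper returns an ordinary character vector, the people using the package never have to see awk at all, but it does still have to be installed on their machines.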

    EDIT EDIT

    Okay, I looked it up real quick. cut is freaking awesome! Just look at some of these examples

    Suppose this is your data, call it company.data

    Code: 
    406378:Sales:Itorre:Jan
    031762:Marketing:Nasium:Jim
    636496:Research:Ancholie:Mel
    396082:Sales:Jucacion:Ed
    and you want the first 6 character columns (the serials) or just the 4th digit of the serial and the first letter of the department (column 8). Your command is

    Code: 
    cut -c1-6 company.data
    cut -c4,8 company.data
    Since the data is ":" delimited, you can specify that parameter and grab fields, too

    Code: 
    cut -d: -f3 company.data
    Honestly, the syntax makes complete sense ("d" is the parameter for the delimiter, and ":" is the specified delimiter; "f" is for field, and "3" is which delimited field). These are very simple commands that can help quickly process data, which makes them ripe for batch processing, in Windows or Linux (and in Windows, you can always use Cygwin if the tools don't have a native port).
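    For comparison, and for anyone who wants to stay inside R, the same three cut calls can be approximated in base R. This is only a sketch on the four sample records above, held in a character vector whose name is made up here; for a real file you would first read the lines in with readLines().

```r
company <- c(
  "406378:Sales:Itorre:Jan",
  "031762:Marketing:Nasium:Jim",
  "636496:Research:Ancholie:Mel",
  "396082:Sales:Jucacion:Ed"
)

# Like cut -c1-6: character columns 1 through 6 (the serials).
serials <- substr(company, 1, 6)

# Like cut -c4,8: character column 4 followed by character column 8.
picked <- paste0(substr(company, 4, 4), substr(company, 8, 8))

# Like cut -d: -f3: split on ":" and keep the third field.
third_field <- vapply(strsplit(company, ":", fixed = TRUE), `[`, character(1), 3)
```

    The trade-off is the one discussed above: this keeps everything internal to R, while cut avoids loading a very large file into memory in the first place.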
