As principal partner of DataCurl LLC, Dan Wilson runs both the consulting practice and ChallengeWave.com, a way to help employees start and stick with healthier lifestyles. Before launching DataCurl, Dan held numerous senior program and development positions in such industries as Technical Consulting, Health Care, Online Publishing and Government Contracting. Dan is an avid participant in technology communities; an Adobe Community Professional, manager of the Triangle ColdFusion User Group in Research Triangle Park, North Carolina, Managing Director of the popular Model-Glue framework and contributor to numerous open source projects based on ColdFusion, Flex and AIR platforms. Dan presents on ColdFusion, Flex and Rapid Development Techniques at popular conferences around the world. You can find his thoughts on ColdFusion, Flex, AIR and other technology matters at http://www.nodans.com and some occasional ramblings on food at http://blog.chefdanwilson.com.

So You Wanna Learn Regex?

09.27.2009

I've had a set of blog posts stewing in my brain for a while. Last year, Steve Nelson helped me out with a Regular Expression (Regex), and I made it a point to practice my Regex skills more. This series will show how to use Regular Expressions in Eclipse, and we'll learn some helpful tips along the way.

This series is for you if you are the kind of developer that reads Ben Nadel's blog posts containing Regular Expressions, and has no idea what the heck he is talking about. Seriously Ben, this is unintelligible to us mere mortals:

<cfset blogContent = reReplace( blogContent, "</?\w+(\s*[\w:]+\s*=\s*(""[^""]*""|'[^']*'))*\s*/?>", " ", "all" ) />

(It looks like a catnip-crazed kitty went for a prance on a keyboard, doesn't it?)

Enough guffaws and such. On with the learning.

Editor's Note:

Simply reading these blog posts isn't going to help you. Open Eclipse, and copy/paste this stuff into your find/replace dialog. You'll learn more, or your money back!

So, firstly we need a use case. Let's pretend we are going through some old code and looking to add an id attribute to each form input, using the same value as its name attribute.

Assume this set of declarations:

<input name="fred" value="willy" />
<input name="bill" value="mickey" />
<input name="erin" value="harry" />
<input name="baz" value="pissette" />

What we want, is to turn: <input name="fred" value="willy" /> into: <input name="fred" id="fred" value="willy" />

Normally, this would be a forearm/wrist fatiguing flail on the keyboard, furiously cutting/pasting and generally flapping about. Not so with Regular Expressions. A Regex is a pattern matcher, and it can do stuff. We can see our code is repetitive and the pattern we want is: make a new attribute called 'id' and populate it with the value from the attribute 'name'... which is what we'd do over and over via cut/paste/etc.

We can define this pattern in the gobbledegook that defines a regular expression, of course, else I'd be writing this post about Cute LOLCats, not Cute Regexes, wouldn't I? We'll go through the exercise, then look at why it worked.

In Eclipse, perform the following:

  1. Open a new file and paste the above set of declarations: ( remember the chunk above starting with <input name="fred" value="willy" />...)
  2. Open the find dialogue (I use CTRL+F) and make sure the Regular Expression option is ticked
  3. Enter the following in the Find input: name="([^"]+)"
  4. Enter the following in the Replace input: name="$1" id="$1"
  5. Press Find and make sure the pattern matches what we want
  6. Lastly, press Replace All

You Should Have This:

<input name="fred" id="fred" value="willy" />
<input name="bill" id="bill" value="mickey" />
<input name="erin" id="erin" value="harry" />
<input name="baz" id="baz" value="pissette" />

(if not, you missed a step. Look at the image and compare with what you have in your Find/Replace dialog. Make sure there is no extra whitespace in the find expression)
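If you'd like to try the same find/replace outside Eclipse, here is a minimal sketch in Python. Python is not part of the original exercise, and the names `html` and `add_id` are made up for illustration. One difference to watch for: Python's `re` module spells the backreference `\1` where the Eclipse dialog uses `$1`.

```python
import re

# Two of the sample declarations from the exercise above.
html = '''<input name="fred" value="willy" />
<input name="bill" value="mickey" />'''

def add_id(markup):
    # Group 1 captures everything between the quotes of the name attribute.
    # The replacement writes it back twice: once for name, once for id.
    return re.sub(r'name="([^"]+)"', r'name="\1" id="\1"', markup)

result = add_id(html)
print(result)
```

The value attribute is never touched, so whatever was in it before the replace is still there afterwards.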

Blamo! Your code is now properly sorted out with the new ID attribute and you didn't even get carpal tunnel syndrome! Let's decode the code, shall we?

Here is the find portion of the regular expression: name="([^"]+)"

  • name="  The first character chunk is the word 'name' followed by an equals sign, then a double quote. These are all literals and need no escaping.
  • (  The next character is an open parenthesis. This defines the beginning of a group. Remember, we want to use the value of the name attribute to populate the value of the id attribute.
  • [^"]+  The next chunk defines any character that is not a double quote. Note it starts with an open bracket, used to define a set. Inside the open bracket is a caret. This means it is opposite day and our set should NOT INCLUDE whatever follows. What follows is a double quote, because the value of an attribute is inside the boundaries of the double quotes. We close this character set with the close bracket, then a plus symbol because a plus symbol defines 1 or more of the previous item in the expression, which here is the whole character set. We definitely want at least one character before the closing double quote, else we don't want a match.
  • )  Lastly, we have the closing parenthesis defining the end of our group and another double quote symbolizing the end of our matching boundary.
All of that defines boundaries for a character-walking regular expression gnome to take the stuff inside the attribute and hold on to it.
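Where the plus sign sits relative to the parentheses matters more than it looks. The sketch below, in Python rather than Eclipse (an assumption made purely for illustration), compares the plus inside the group with the plus outside it, and shows that an empty attribute does not match at all.

```python
import re

line = '<input name="fred" value="willy" />'

# Plus inside the group: the group holds the whole attribute value.
inside = re.search(r'name="([^"]+)"', line)

# Plus outside the group: the group itself is repeated, so it only
# keeps the last character it matched.
outside = re.search(r'name="([^"])+"', line)

print(inside.group(1))   # fred
print(outside.group(1))  # d

# The plus requires at least one character, so an empty name is skipped.
empty = re.search(r'name="([^"]+)"', '<input name="" />')
print(empty)  # None
```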

Then in the Replace section, we used: name="$1" id="$1"

  • The 'name' and 'id' attributes, along with both equal signs and both sets of double quotes are all literal, no escaping needed.
  • The $1 refers to the group we defined in the Find input and we use it twice. $n is called a backreference.

So in plain English, we asked the regular expression find/replace gnome to: Take the stuff inside the 'name' attribute, and stick it back in the 'name' attribute and also inside of a new 'id' attribute.

I'm sure you can agree this was much easier than a copy/paste extravaganza.

Part Two

In our last exercise, we looked at a simple way to add a new attribute to an HTML tag. This was accomplished by making a pattern, defining a group and using a backreference. This time we will look at a slightly more complicated use case.

Assume this set of declarations:

product.setColor(arguments.color);
product.setSize(arguments.size);
product.setCondition(arguments.condition);
product.setRating(arguments.rating);
product.setReliability(arguments.reliability);
product.setNeedsBatteries(arguments.needsBatteries);

What we want, is to turn: product.setColor(arguments.color); into: product.setColor( htmlEditFormat(arguments.color) );

Normally, this would be a forearm/wrist fatiguing flail on the keyboard, furiously cutting/pasting and generally flapping about. Not so with Regular Expressions. A Regex is a pattern matcher, and it can do stuff. We can see our code is repetitive and the pattern we want is: Take Everything Inside The Parentheses, and Wrap It In A htmlEditFormat() Function. (Same stuff we'd do over and over via cut/paste/etc, isn't it?)

We can define this pattern in the gobbledegook defining a regular expression. When read one chunk at a time, these actually make sense. We'll go through the exercise, then look at why it worked.

In Eclipse, perform the following:

  1. Open a new file and paste the above set of declarations: ( remember the chunk above starting with product.setColor(arguments.color);...)
  2. Open the find dialogue (I use CTRL+F) and make sure the Regular Expression option is ticked
  3. Enter the following in the Find input: (\([^\)]+\))
  4. Enter the following in the Replace input: ( htmlEditFormat$1 )
  5. Press Find and make sure the pattern matches what we want
  6. Lastly, press Replace All

You Should Have This:

product.setColor( htmlEditFormat(arguments.color) );
product.setSize( htmlEditFormat(arguments.size) );
product.setCondition( htmlEditFormat(arguments.condition) );
product.setRating( htmlEditFormat(arguments.rating) );
product.setReliability( htmlEditFormat(arguments.reliability) );
product.setNeedsBatteries( htmlEditFormat(arguments.needsBatteries) );

(if not, you missed a step. Look at the image and compare with what you have in your Find/Replace dialog. Make sure there is no extra whitespace in the find expression)
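Here is the same replacement as a minimal Python sketch, in case you want to check it outside Eclipse. Python and the names `code` and `wrap_args` are assumptions made for illustration, and Python writes the backreference as `\1` rather than `$1`.

```python
import re

# Two of the sample setter calls from the exercise above.
code = '''product.setColor(arguments.color);
product.setSize(arguments.size);'''

def wrap_args(source):
    # Group 1 captures the parentheses and everything between them,
    # so the replacement only has to add the function name and the
    # new outer parentheses around it.
    return re.sub(r'(\([^\)]+\))', r'( htmlEditFormat\1 )', source)

wrapped = wrap_args(code)
print(wrapped)
```

Note that the pattern has no idea whether a call has already been wrapped. Running the replacement a second time wraps it again, so apply it once.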

Blamo! Your code is now all properly HTMLEditFormatted and you didn't even get carpal tunnel syndrome! Let's decode the code, shall we?

Here is the find portion of the regular expression: (\([^\)]+\))

  • (  The first character chunk is an open parenthesis. This basically defines a group. You can see the entire expression is surrounded by parentheses, so we will be treating what is found as a group.
  • \(  The next chunk is a backslash. Most often this is an escaping character, which means treat the next character as a literal character we want to find in our string. Taken with the next character, we can see we want to find an open parenthesis.
  • [^\)]  The next chunk defines any character that is not a close parenthesis. Note it starts with an open bracket, used to define a set. Inside the open bracket is a caret. This means it is opposite day and our set should NOT INCLUDE whatever follows. What follows is a backslash and a close parenthesis, regexese for a literal ). Then the close bracket.
  • +\)  The next chunk is a plus symbol, followed by a backslash and close parenthesis. A plus symbol defines 1 or more of the previous item in the expression, which here is the whole character set, so we match one or more characters that are not a close parenthesis. The backslash and close parenthesis then match the literal ) that ends the argument list.
  • )  Last chunk, the closing parenthesis defining the end of our group.
All of that defines boundaries for a character walking regular expression gnome to take the stuff inside the parenthesis and hold on to it.
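To see the difference between the grouping parentheses and the escaped literal ones, here is a small Python sketch (Python is an assumption for illustration, not something the exercise requires). It prints what the group holds, then moves the group inside the escaped parentheses for comparison.

```python
import re

line = 'product.setColor(arguments.color);'

# Group wraps the escaped parentheses: the capture includes them.
outer = re.search(r'(\([^\)]+\))', line)
print(outer.group(1))  # (arguments.color)

# Group sits inside the escaped parentheses: only the argument is captured.
inner = re.search(r'\(([^\)]+)\)', line)
print(inner.group(1))  # arguments.color

# Empty parentheses do not match, because + needs at least one character.
no_args = re.search(r'(\([^\)]+\))', 'product.save();')
print(no_args)  # None
```

Because the capture already carries its own parentheses, the replace text only needs to put htmlEditFormat in front of it.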

Then in the Replace section, we used: ( htmlEditFormat$1 )

  • The surrounding parentheses are literal, as is the htmlEditFormat.
  • The $1 refers to the group we defined in the Find input. (remember the term backreference?)

So in plain English, we asked the regular expression find/replace gnome to: Take the parentheses and the stuff inside them, and wrap it with ( HTMLEditFormat+GROUPTEXT+ ).

From http://www.nodans.com

Published at DZone with permission of its author, Dan Wilson.

Comments

ed oconnor replied on Tue, 2009/09/29 - 7:57pm

turn: <input name="fred" value="willy" /> into: <input name="fred" id="fred" value="willy" />

Off topic, but may be of interest to some: In REBOL 3, this would be:

parse/all html [any [
    thru {name="} copy name to {"}
    skip
    insert rejoin [{ id="} name {" }
    ]
    to end
]

turn: product.setColor(arguments.color); into: product.setColor( htmlEditFormat(arguments.color) );

parse/all html [any [
    thru ".set"
    thru "("
    insert " htmlEditFormat("
    thru ")"
    insert " );"
    ]
    to end
]

Docs here. REBOL 3 (alpha) here.

Mohammed Yousuff replied on Thu, 2009/10/08 - 4:02am

Dan, thank you so much. I was looking for a way to group the text for a long time... thanks again for that :)
