Thursday, June 5, 2008

Regular Expressions in Java

A regular expression works by matching a String against a template or pattern (a Pattern object in Java), and in its simplest form, returning a boolean to say "yes, the string does look like the pattern" or "no, that doesn't match".

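As of release 1.4 of Java, there's a standard package, java.util.regex, that's shipped with the JRE, and that's what we'll look at in this module.

For example,

"^\\S+@\\S+$"

is a regular expression (written as a Java string literal, hence the doubled backslashes) for a rough check that the string entered looks like an e-mail address: one or more non-space characters, an @ sign, then one or more non-space characters.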

Ex:

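import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One or more non-space characters, then "cat", then one or more non-space characters
Pattern pattern = Pattern.compile("^\\S+cat\\S+$");

Matcher match = pattern.matcher("vindicated");
// matches() succeeds only if the whole string fits the pattern
if (match.matches()) {
    System.out.println("Matches vindicated pattern");
} else {
    System.out.println("No Match found");
}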
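
The e-mail pattern mentioned earlier can be tried out the same way. This is only a sketch: the class name EmailCheck and the method looksLikeEmail are made up for illustration, and the pattern is a rough shape check, not a real address validator.

```java
import java.util.regex.Pattern;

class EmailCheck {
    // Compile once and reuse; Pattern objects are immutable.
    private static final Pattern ROUGH_EMAIL = Pattern.compile("^\\S+@\\S+$");

    // Hypothetical helper: true when the whole input looks like "something@something".
    static boolean looksLikeEmail(String input) {
        return ROUGH_EMAIL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("someone@example.com")); // true
        System.out.println(looksLikeEmail("no at sign here"));     // false
        System.out.println(looksLikeEmail("a@b c"));               // false, contains a space
    }
}
```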


If you want to match at the start of a String, start your pattern with a ^ character; if you want to match at the end, conclude your pattern with a $ character. Should you specify both a ^ and a $, then you're looking to match the complete String to your regular expression.

The ^ and $ elements are known as "anchors" as they tie the start and/or the end of the String down; this group as a whole is also known as "assertions" because they don't match any specific characters in the incoming string, they just assert that while the match is running a certain condition must occur at the given point in the match.
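
To see what the anchors do, it helps to use Matcher.find(), which looks for the pattern anywhere in the string, rather than matches(), which always tests the whole string. A small sketch (AnchorDemo and found are illustrative names only):

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // find() searches anywhere in the input, so only the anchors
    // pin the match to the start and/or the end.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "concatenate";
        System.out.println(found("cat", input));   // true: "cat" occurs in the middle
        System.out.println(found("^cat", input));  // false: the string does not start with "cat"
        System.out.println(found("^con", input));  // true: the string starts with "con"
        System.out.println(found("ate$", input));  // true: the string ends with "ate"
        System.out.println(found("^cat$", input)); // false: the whole string is not "cat"
    }
}
```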


If you write

[abcdef]


in your regular expression, then you're matching any one character from the list given (a b c d e or f). You can expand this capability further by using a minus sign to specify a character range, thus

[a-z]
any lower case letter
[0-9a-fA-F]
any hexadecimal character


If you want to match any character except one from a list, you can start the character list with a ^ character, for example:

[^a-z]
any character except a lower case letter
[^%0-9]
any character except a digit or a % character
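
Here is a short sketch of the character lists above in use, both the plain and the negated form. The class and method names are invented for the example; String.replaceAll is used simply as a quick way to show which characters a list picks out.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Each of these patterns matches exactly one character.
    private static final Pattern HEX_DIGIT = Pattern.compile("[0-9a-fA-F]");
    private static final Pattern NOT_LOWER = Pattern.compile("[^a-z]");

    static boolean isHexDigit(char c) {
        return HEX_DIGIT.matcher(String.valueOf(c)).matches();
    }

    static boolean isNotLowerCase(char c) {
        return NOT_LOWER.matcher(String.valueOf(c)).matches();
    }

    // Deletes every character that is neither a digit nor a % sign.
    static String keepDigitsAndPercent(String s) {
        return s.replaceAll("[^%0-9]", "");
    }

    public static void main(String[] args) {
        System.out.println(isHexDigit('B'));      // true
        System.out.println(isHexDigit('g'));      // false
        System.out.println(isNotLowerCase('Q'));  // true
        System.out.println(isNotLowerCase('q'));  // false
        System.out.println(keepDigitsAndPercent("50% off 3 items")); // 50%3
    }
}
```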

Writing out every list by hand would get messy really fast, so there are some common groupings available in Java's regular expressions:

\s
any white space character
\d
any digit
\w
any word character (letter, digit, underscore)


If you want any character except one of these, use a capital letter:

\S
any character that is not a white space
\D
any character that is not a digit
\W
any character that is not a word character
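
A quick way to get a feel for these groupings is to delete whatever they match. The sketch below does that with String.replaceAll (the helper names are made up). Note that in a Java string literal each backslash has to be doubled, so \s is written "\\s".

```java
class ShorthandDemo {
    // Removes every white space character.
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s", "");
    }

    // Removes every character that is not a digit.
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // Removes every character that is not a letter, digit or underscore.
    static String wordCharsOnly(String s) {
        return s.replaceAll("\\W", "");
    }

    public static void main(String[] args) {
        System.out.println(stripWhitespace("a b\tc\n"));         // abc
        System.out.println(digitsOnly("Room 101, Floor 3"));     // 1013
        System.out.println(wordCharsOnly("hello, big_world!"));  // hellobig_world
    }
}
```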


Sequences such as \s will be familiar to you if you use Perl's regular expressions, but there are other character groups too. These are based on the POSIX character classes, extended with Unicode blocks and categories, and they are written in a \p{...} form rather than in the POSIX [:alpha:] bracket form:

\p{Space}
Alternative to \s for "any white space"
\p{Blank}
Space or tab character
\p{Alpha}
Any letter (upper or lower case)
\p{Graph}
Any visible character
\p{InGreek}
Any character in the Greek Unicode block
\p{Sc}
A currency symbol


You can negate these groups using \P rather than \p thus

\P{Graph}
Any character that is not visible
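
The following sketch tries a few of these groups against single characters. PosixClassDemo and matchesClass are illustrative names; the non-ASCII test characters are written as Unicode escapes so the file encoding doesn't matter.

```java
import java.util.regex.Pattern;

class PosixClassDemo {
    // True when the one-character string s belongs to the given group.
    static boolean matchesClass(String charClass, String s) {
        return Pattern.matches(charClass, s);
    }

    public static void main(String[] args) {
        System.out.println(matchesClass("\\p{Alpha}", "x"));         // true
        System.out.println(matchesClass("\\p{Alpha}", "7"));         // false
        System.out.println(matchesClass("\\p{Blank}", "\t"));        // true
        System.out.println(matchesClass("\\p{Sc}", "$"));            // true
        System.out.println(matchesClass("\\p{Sc}", "\u20AC"));       // true (euro sign)
        System.out.println(matchesClass("\\p{InGreek}", "\u03B1"));  // true (Greek small alpha)
        System.out.println(matchesClass("\\P{Graph}", " "));         // true: a space is not visible
        System.out.println(matchesClass("\\p{Graph}", " "));         // false
    }
}
```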


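One final grouping, the ultimate group if you like, is the "." (full stop or period) character, which matches virtually any character: by default everything except a line terminator, and line terminators too if the pattern is compiled with the Pattern.DOTALL flag.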
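
The "virtually" is there because, by default, the dot in java.util.regex does not match a line terminator such as \n; compiling the pattern with Pattern.DOTALL lifts that restriction. A minimal sketch (DotDemo and dotMatches are names invented for the example):

```java
import java.util.regex.Pattern;

class DotDemo {
    // Tests the whole string against "c.t" with the given compile flags.
    static boolean dotMatches(String s, int flags) {
        return Pattern.compile("c.t", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(dotMatches("cat", 0));                 // true
        System.out.println(dotMatches("c@t", 0));                 // true
        System.out.println(dotMatches("c\nt", 0));                // false: newline is excluded by default
        System.out.println(dotMatches("c\nt", Pattern.DOTALL));   // true
    }
}
```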


COUNTS


The fourth main group (after anchors, literal characters, and character groups) is the counts; you use these in regular expressions if you want to give a quantity to a literal character or group, and you add the count character into your pattern directly after the element to which it applies. There are three very common counts:

+
one or more
*
zero or more
?
zero or one
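
The three counts can be compared side by side with a small sketch like this one (CountDemo and whole are illustrative names; Pattern.matches tests the complete string):

```java
import java.util.regex.Pattern;

class CountDemo {
    // True when the whole string s fits the regular expression.
    static boolean whole(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        System.out.println(whole("\\d+", "2008"));        // true: one or more digits
        System.out.println(whole("\\d+", ""));            // false: + needs at least one
        System.out.println(whole("\\d*", ""));            // true: * accepts zero
        System.out.println(whole("colou?r", "color"));    // true: the u is optional
        System.out.println(whole("colou?r", "colour"));   // true
        System.out.println(whole("colou?r", "colouur"));  // false: ? allows at most one u
    }
}
```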

Source: http://www.wellho.net/solutions/java-regular-expressions-in-java.html

Fun exploring new things. :)
