Extracting Content with a Regular Expression

Regular expressions are not an easy skill for any developer to master, which is why so few actually leverage this amazing feature of the .NET Framework.  This week I have been working on a little application that needed to extract a specific piece of content from a long piece of text, where the content follows a predictable pattern.  For a real-world example, think of a phone number in a resume.  There are two options available to me for this task: I could write a complex piece of parsing code that would be long to write and long to debug, or I could invest a little of that effort in the Regex class built into .NET.

Extracting content with a regular expression can be pretty easy, and it can execute faster than hand-written parsing built from a bunch of IndexOf and Contains calls on the String class.  There are several supporting classes in the System.Text.RegularExpressions namespace of the .NET Framework that we need to be concerned with: Regex, Match, MatchCollection, Group and GroupCollection.  Another class is Capture, but I will not be using it in this example.

So here is my code:

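    Imports System.Text
    Imports System.Text.RegularExpressions

    Public Function GetUSPhoneNumber(ByVal sText As String) As String()

        ' Optional parentheses around the area code, optional separators,
        ' and three capture groups: area code, exchange, last four digits.
        Dim r As String = _
            "\(?(\d{3})\)?[- .]?(\d{3})[- .]?(\d{4})"

        Dim options As RegexOptions = RegexOptions.IgnoreCase Or _
            RegexOptions.Multiline Or RegexOptions.IgnorePatternWhitespace Or RegexOptions.CultureInvariant

        Dim matches As MatchCollection = Regex.Matches(sText, r, options)

        Dim sRet As New StringBuilder

        For Each m As Match In matches

            ' Append each matched number followed by a tab delimiter.
            sRet.Append(m.Value).Append(vbTab)

        Next

        ' Split on the tab and drop the empty entry left by the trailing tab.
        Return sRet.ToString().Split(New Char() {ControlChars.Tab}, StringSplitOptions.RemoveEmptyEntries)

    End Function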

I am not going to go into much depth about the regular expression itself, but it will match many common formats of US phone numbers.  What I do want to point out is that our pattern will 'grab' the matched number each time it finds a run of text that fits our regular expression.  These matches are stored in the MatchCollection returned by Regex.Matches.  We can 'walk' this collection to see what matches we have.  In this case we should have a nice collection of phone numbers.  I am appending them to a StringBuilder with a tab added to the end of each entry.  I then return this string, split on the tab, so the function returns an array of strings, or phone numbers in this case.
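
To make the walk concrete, here is a minimal, self-contained sketch run against a made-up sample string.  The module name and the sample text are only for illustration, and it collects the matches into a List(Of String) rather than using the StringBuilder and Split step described above; that is a substitution to keep the sketch short, not a change to the function.

    Imports System
    Imports System.Collections.Generic
    Imports System.Text.RegularExpressions

    Module PhoneMatchDemo

        Sub Main()

            ' Sample text is invented for this example.
            Dim sText As String = "Home: (555) 123-4567, cell: 555.987.6543"

            Dim r As String = "\(?(\d{3})\)?[- .]?(\d{3})[- .]?(\d{4})"

            ' A List(Of String) stands in for the StringBuilder and Split step.
            Dim numbers As New List(Of String)

            ' Walk the MatchCollection; each Match holds one phone number.
            For Each m As Match In Regex.Matches(sText, r)
                numbers.Add(m.Value)
            Next

            ' Prints:
            '   (555) 123-4567
            '   555.987.6543
            For Each n As String In numbers
                Console.WriteLine(n)
            Next

        End Sub

    End Module

Calling numbers.ToArray() would give the same kind of String() result the function returns.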

Another neat trick with this expression is to walk the groups generated by the expression and return those as an array.  A group in this case will be the area code (3 digits), the local exchange (3 digits) and the last four digits.  Groups are defined by surrounding part of the regular expression with ( and ); (\d{4}), for example, captures the last four digits.  Note that the first entry in a match's Groups collection, Groups(0), is always the entire match, so the groups we defined start at index 1.

        For Each m As Match In matches

            ' Groups(0) is the entire match; the captured groups start at index 1.
            For i As Integer = 1 To m.Groups.Count - 1

                sRet.Append(m.Groups(i).Value).Append(vbTab)

            Next

        Next
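
As a rough illustration of what the Groups collection holds for a single match, the sketch below uses Regex.Match on one made-up number and prints each group by index.  The module name and the sample input are assumptions for the example only.

    Imports System
    Imports System.Text.RegularExpressions

    Module PhoneGroupDemo

        Sub Main()

            Dim r As String = "\(?(\d{3})\)?[- .]?(\d{3})[- .]?(\d{4})"

            ' Sample number is invented for this example.
            Dim m As Match = Regex.Match("(555) 123-4567", r)

            If m.Success Then
                Console.WriteLine(m.Groups(0).Value) ' (555) 123-4567  (entire match)
                Console.WriteLine(m.Groups(1).Value) ' 555             (area code)
                Console.WriteLine(m.Groups(2).Value) ' 123             (exchange)
                Console.WriteLine(m.Groups(3).Value) ' 4567            (last four digits)
            End If

        End Sub

    End Module

The parentheses around the area code show up in Groups(0) but not in Groups(1), because \(? and \)? sit outside the capturing group.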


You could also use the power of the Regex object to replace text based on the pattern and much more.  I hope to cover some more great Regular Expression features in the coming days.
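
As a small taste of replacing, here is a sketch that uses Regex.Replace with the same pattern to rewrite every phone number into one format.  The target format, the module name and the sample text are my own choices for illustration; $1, $2 and $3 in the replacement string refer to the three groups in the pattern.

    Imports System
    Imports System.Text.RegularExpressions

    Module PhoneReplaceDemo

        Sub Main()

            ' Sample text is invented for this example.
            Dim sText As String = "Home: 555.123.4567, cell: 555-987-6543"

            Dim r As String = "\(?(\d{3})\)?[- .]?(\d{3})[- .]?(\d{4})"

            ' Rebuild each match from its three captured groups.
            Dim normalized As String = Regex.Replace(sText, r, "($1) $2-$3")

            ' Prints: Home: (555) 123-4567, cell: (555) 987-6543
            Console.WriteLine(normalized)

        End Sub

    End Module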

