
Regular Expressions in C# (including a new comprehensive email pattern)

Of course C# supports regular expressions. I happened to learn regular expressions through my work with FreeBSD, shell scripting, PHP, and other open source projects. So naturally I want to add this as a skill as I develop in C#.

What is a Regular Expression?

A regular expression is a way to describe, in code or in a script, the format or pattern of a string. For example, look at an email address:

someuser@somedomain.tld

It is important to understand that we are not trying to compare the email string against another string; we are trying to compare the string against a pattern.

Verifying that the email is in the correct format using String functions would take dozens of function calls, one after another. With a regular expression, however, the format of an email address can be verified in a single function call.

So a regular expression is really a language, almost like a scripting language in itself, for defining character patterns.
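As a minimal sketch of that idea, the snippet below checks a string against a pattern with a single call to Regex.IsMatch. The class name PatternVersusString, the method LooksLikeEmail, and the deliberately loose pattern are illustrative assumptions, not a recommended email check.

```csharp
using System;
using System.Text.RegularExpressions;

static class PatternVersusString
{
    // Loose illustrative pattern: some text, an @, some text, a period, some text.
    public const string LoosePattern = @"^[^@\s]+@[^@\s]+\.[^@\s]+$";

    // One call compares the input against the pattern, not against another string.
    public static bool LooksLikeEmail(string input)
    {
        return Regex.IsMatch(input, LoosePattern);
    }

    static void Main()
    {
        Console.WriteLine(LooksLikeEmail("someuser@somedomain.tld")); // True
        Console.WriteLine(LooksLikeEmail("someuser.somedomain.tld")); // False: no @
    }
}
```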

Most characters represent themselves. However, some characters have a special meaning and only represent themselves when escaped with a backslash. Here is a table of those characters.

Expression  Meaning
*           Any number (including zero) of the previous character or character group.
+           One or more of the previous character or character group.
^           Beginning of line or string.
$           End of line or string.
?           Zero or one of the previous character or character group.
.           Any single character (except a newline, by default).
[ … ]       This forms a character class expression, which matches any one of the listed characters
( … )       This forms a group of items

You should look up more regular expression rules. I don’t explain them all here. This is just to give you an idea.
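To make these characters concrete, here is a small sketch that tries a few of them with Regex.IsMatch. The class name MetacharacterDemo, the helper Matches, and the sample inputs are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class MetacharacterDemo
{
    // Thin wrapper so each line below reads as "input, pattern".
    public static bool Matches(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    static void Main()
    {
        Console.WriteLine(Matches("ac", "^ab*c$"));        // True:  * allows zero b's
        Console.WriteLine(Matches("ac", "^ab+c$"));        // False: + needs at least one b
        Console.WriteLine(Matches("color", "^colou?r$"));  // True:  ? makes the u optional
        Console.WriteLine(Matches("abc", "^a.c$"));        // True:  . stands for one character
        Console.WriteLine(Matches("ac", "^a.c$"));         // False: . cannot stand for nothing
        Console.WriteLine(Matches("abc", @"^a\.c$"));      // False: \. is a literal period
        Console.WriteLine(Matches("cab", "^[abc]+$"));     // True:  every character is in the class
        Console.WriteLine(Matches("aba", "^(ab)+$"));      // False: the group ab must repeat whole
    }
}
```

The ^ and $ anchors are used in every pattern here so that the whole input has to fit the pattern, not just a piece of it.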

Example 1 – Parameter=Value

Here is a quick example of a regular expression that matches String=String. At first you might think this is easy and you can use this expression:

.*=.*

While that might work, it is very open. And it allows for zero characters before and after the equals, which should not be allowed.

This next pattern is at least correct but still very open.

.+=.+

What if the first value is limited to only alphanumeric characters?

[a-zA-Z0-9]+=.+

What if the second value has to be a valid Windows file path or URL? We will also anchor the pattern so it covers the string from start to finish.

^[0-9a-zA-Z]+=[^<>|?*"]+$

See how the more restrictions you put in place, the more complex the expression gets?
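A quick way to see the difference is to run the same inputs through each pattern. This is only a sketch: the class name ParameterValueDemo and the sample inputs are assumptions, and the third pattern is written as [a-zA-Z0-9]+=.+ so that the name can be more than one character long.

```csharp
using System;
using System.Text.RegularExpressions;

static class ParameterValueDemo
{
    public static readonly string[] Patterns =
    {
        ".*=.*",                          // anything, even nothing, around the =
        ".+=.+",                          // at least one character on each side
        "[a-zA-Z0-9]+=.+",                // alphanumeric name, but not anchored
        "^[0-9a-zA-Z]+=[^<>|?*\"]+$"      // anchored, with a restricted value
    };

    public static bool Matches(int patternIndex, string input)
    {
        return Regex.IsMatch(input, Patterns[patternIndex]);
    }

    static void Main()
    {
        string[] inputs = { "=", "a=b", "my key=value", "path=a<b" };
        foreach (string input in inputs)
        {
            // One row per input, one True/False column per pattern.
            Console.Write("{0,-14}", input);
            for (int i = 0; i < Patterns.Length; i++)
            {
                Console.Write(" {0,-5}", Matches(i, input));
            }
            Console.WriteLine();
        }
    }
}
```

Only the first pattern accepts a bare "=", and only the last pattern rejects "my key=value" and "path=a<b". The third pattern still accepts "my key=value" because, without ^ and $, it is happy to match just the "key=value" part.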

Example 2 – The email address

The pattern of an email is as follows: (Reference: wikipedia)

  1. It will always have a single @ sign
  2. 1 to 64 characters before the @ sign, called the local-part. It can contain the characters a-z, A-Z, 0-9, ! # $ % & ' * + - / = ? ^ _ ` { | } ~, and a period (.) as long as the period is not the first or last character of the local-part.
  3. Some characters after the @ sign, called the domain, with the following pattern:
    1. It will always have a period “.”.
    2. One or more character before the period.
    3. Two to four characters after the period.

So simple patterns for an email address would be something like these:

  1. This one just makes sure there are characters before and after the @
    .+@.+
  2. This one makes sure there are characters before and after the @ as well as a character before and after the . in the domain.
    .+@.+\..+
  3. This one makes sure that there is only one @ symbol.
    ^[^@]+@[^@]+\.[^@]+$

These are all quick and easy examples and will not work in every instance, but they are usually accurate enough for casual programs.
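Here is a rough sketch of how the three simple patterns behave on a few addresses. The class and constant names are assumptions for illustration; the second pattern is written as .+@.+\..+ (a valid .NET expression) and the third is anchored with ^ and $ so that a second @ is actually rejected.

```csharp
using System;
using System.Text.RegularExpressions;

static class SimpleEmailPatterns
{
    public const string HasAt = ".+@.+";                     // characters around the @
    public const string HasAtAndDot = @".+@.+\..+";          // plus a period in the domain
    public const string SingleAt = @"^[^@]+@[^@]+\.[^@]+$";  // plus exactly one @

    public static bool Matches(string pattern, string email)
    {
        return Regex.IsMatch(email, pattern);
    }

    static void Main()
    {
        string[] emails = { "joe@home", "joe@bob@home.com", "joe.@bob.com" };
        foreach (string email in emails)
        {
            Console.WriteLine(email);
            Console.WriteLine("  HasAt:       " + Matches(HasAt, email));
            Console.WriteLine("  HasAtAndDot: " + Matches(HasAtAndDot, email));
            Console.WriteLine("  SingleAt:    " + Matches(SingleAt, email));
        }
    }
}
```

All three patterns accept "joe.@bob.com", even though a period at the end of the local-part is not allowed. That is the kind of case a more comprehensive pattern has to handle.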

But a comprehensive example is much more complex.

  1. I wrote one myself that is the shortest and gets the best results of any I have found:
    ^[\w!#$%&'*+\-/=?\^_`{|}~]+(\.[\w!#$%&'*+\-/=?\^_`{|}~]+)*@((([\-\w]+\.)+[a-zA-Z]{2,4})|(([0-9]{1,3}\.){3}[0-9]{1,3}))$
    
  2. Here is another complex one I found, shown as it appears inside a C# verbatim string (where "" stands for one double-quote character): [reference]
    ^(([^<>()[\]\\.,;:\s@\""]+(\.[^<>()[\]\\.,;:\s@\""]+)*)|(\"".+\""))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$
    

Let me explain the first one, the one I wrote, as it passes my unit tests below:

^ The start
[\w!#$%&'*+\-/=?\^_`{|}~]+ At least one valid local-part character, not including a period.
(\.[\w!#$%&'*+\-/=?\^_`{|}~]+)* Any number (including zero) of a group that starts with a single period and has at least one valid local-part character after the period.
@ The @ character
( Start group 1
( Start group 2
([\-\w]+\.)+ At least one group of at least one valid word character or hyphen followed by a period
[a-zA-Z]{2,4} Any two to four letters for the top level domain.
) End group 2
| An OR (alternation)
( Start group 3
([0-9]{1,3}\.){3}[0-9]{1,3} A regular expression for an IPv4 address.
) End group 3
) End group 1
$ The end
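One way to see the two halves of this pattern at work is to wrap them in named groups and read them back from the match. This is a sketch under assumptions: the group names local and domain, the class EmailParts, and the method TrySplit are mine, and the 64-character check is added in code because the pattern by itself does not enforce the local-part length limit.

```csharp
using System;
using System.Text.RegularExpressions;

static class EmailParts
{
    // The same pattern, with the local-part and the domain in named groups.
    const string Pattern =
        @"^(?<local>[\w!#$%&'*+\-/=?\^_`{|}~]+(\.[\w!#$%&'*+\-/=?\^_`{|}~]+)*)"
      + "@"
      + @"(?<domain>(([\-\w]+\.)+[a-zA-Z]{2,4})|(([0-9]{1,3}\.){3}[0-9]{1,3}))$";

    public static bool TrySplit(string email, out string local, out string domain)
    {
        local = string.Empty;
        domain = string.Empty;

        Match match = Regex.Match(email, Pattern);
        // Reject non-matches and local-parts longer than 64 characters.
        if (!match.Success || match.Groups["local"].Value.Length > 64)
        {
            return false;
        }

        local = match.Groups["local"].Value;
        domain = match.Groups["domain"].Value;
        return true;
    }

    static void Main()
    {
        string local, domain;
        if (TrySplit("joe.bob@his.home.com", out local, out domain))
        {
            Console.WriteLine(local);   // joe.bob
            Console.WriteLine(domain);  // his.home.com
        }
    }
}
```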

Code for both examples

Here is code for both examples.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace RegularExpressionsTest
{
    class Program
    {
        static void Main(string[] args)
        {
            // Example 1 - Parameter=value
            // Match any character before and after the =
            // String thePattern = @"^.+=.+$";

            // Match only upper- and lowercase letters and digits before
            // the = as the parameter name, and after the = match any
            // character that is allowed in a file's full path.
            //
            // ^[0-9a-zA-Z]+    One or more letters (upper or lower case)
            //                  or digits 0 through 9 at the string's beginning.
            //
            // =                Matches the = character exactly.
            //
            // [^<>|?*\"]+$     One or more characters other than < > | ? * "
            //                  through the end of the string, as those are
            //                  not valid in a file path or URL.

            String theNameEqualsValue = @"abcd=http://";

            String theParameterEqualsValuePattern = "^[0-9a-zA-Z]+=[^<>|?*\"]+$";
            bool isParameterEqualsValueMatch = Regex.IsMatch(theNameEqualsValue, theParameterEqualsValuePattern);
            Log(isParameterEqualsValueMatch);

            // Example 2 - Email address formats

            String theEmailPattern = @"^[\w!#$%&'*+\-/=?\^_`{|}~]+(\.[\w!#$%&'*+\-/=?\^_`{|}~]+)*"
                                   + "@"
                                   + @"((([\-\w]+\.)+[a-zA-Z]{2,4})|(([0-9]{1,3}\.){3}[0-9]{1,3}))$";

            // The pattern from the page below does not work in all instances.
            // http://www.cambiaresearch.com/c4/bf974b23-484b-41c3-b331-0bd8121d5177/Parsing-Email-Addresses-with-Regular-Expressions.aspx
            //String theEmailPattern = @"^(([^<>()[\]\\.,;:\s@\""]+(\.[^<>()[\]\\.,;:\s@\""]+)*)|(\"".+\""))"
            //                       + "@"
            //                       + @"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])"
            //                       + "|"
            //                       + @"(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$";

            Console.WriteLine("Bad emails");
            foreach (String email in GetBadEmails())
            {
                Log(Regex.IsMatch(email, theEmailPattern));
            }

            Console.WriteLine("Good emails");
            foreach (String email in GetGoodEmails())
            {
                Log(Regex.IsMatch(email, theEmailPattern));
            }
        }

        private static void Log(bool inValue)
        {
            if (inValue)
            {
                Console.WriteLine("It matches the pattern");
            }
            else
            {
                Console.WriteLine("It doesn't match the pattern");
            }
        }

        private static List<string> GetBadEmails()
        {
            List<string> emails = new List<string>();
            emails.Add("joe"); // should fail
            emails.Add("joe@home"); // should fail
            emails.Add("a@b.c"); // should fail because .c is only one character but must be 2-4 characters
            emails.Add("joe-bob[at]home.com"); // should fail because [at] is not valid
            emails.Add("joe@his.home.place"); // should fail because place is 5 characters but must be 2-4 characters
            emails.Add("joe.@bob.com"); // should fail because there is a dot at the end of the local-part
            emails.Add(".joe@bob.com"); // should fail because there is a dot at the beginning of the local-part
            emails.Add("john..doe@bob.com"); // should fail because there are two dots in the local-part
            emails.Add("john.doe@bob..com"); // should fail because there are two dots in the domain
            emails.Add("joe<>bob@bob.come"); // should fail because <> are not valid
            emails.Add("joe@his.home.com."); // should fail because it can't end with a period
            emails.Add("a@10.1.100.1a");  // Should fail because of the extra character
            return emails;
        }

        private static List<string> GetGoodEmails()
        {
            List<string> emails = new List<string>();
            emails.Add("joe@home.org");
            emails.Add("joe@joebob.name");
            emails.Add("joe&bob@bob.com");
            emails.Add("~joe@bob.com");
            emails.Add("joe$@bob.com");
            emails.Add("joe+bob@bob.com");
            emails.Add("o'reilly@there.com");
            emails.Add("joe@home.com");
            emails.Add("joe.bob@home.com");
            emails.Add("joe@his.home.com");
            emails.Add("a@abc.org");
            emails.Add("a@192.168.0.1");
            emails.Add("a@10.1.100.1");
            return emails;
        }
    }
}
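The program above prints a line per address and leaves it to the reader to check the output. As a rough sketch of a self-checking variant, the code below counts the addresses whose result differs from what is expected. The class EmailPatternCheck and the method CountUnexpected are hypothetical names, and the address lists are the same ones used above.

```csharp
using System;
using System.Text.RegularExpressions;

static class EmailPatternCheck
{
    const string EmailPattern =
        @"^[\w!#$%&'*+\-/=?\^_`{|}~]+(\.[\w!#$%&'*+\-/=?\^_`{|}~]+)*"
      + "@"
      + @"((([\-\w]+\.)+[a-zA-Z]{2,4})|(([0-9]{1,3}\.){3}[0-9]{1,3}))$";

    public static readonly string[] BadEmails =
    {
        "joe", "joe@home", "a@b.c", "joe-bob[at]home.com", "joe@his.home.place",
        "joe.@bob.com", ".joe@bob.com", "john..doe@bob.com", "john.doe@bob..com",
        "joe<>bob@bob.come", "joe@his.home.com.", "a@10.1.100.1a"
    };

    public static readonly string[] GoodEmails =
    {
        "joe@home.org", "joe@joebob.name", "joe&bob@bob.com", "~joe@bob.com",
        "joe$@bob.com", "joe+bob@bob.com", "o'reilly@there.com", "joe@home.com",
        "joe.bob@home.com", "joe@his.home.com", "a@abc.org", "a@192.168.0.1",
        "a@10.1.100.1"
    };

    // Returns how many addresses did NOT give the expected match result.
    public static int CountUnexpected(string[] emails, bool expectedMatch)
    {
        int unexpected = 0;
        foreach (string email in emails)
        {
            if (Regex.IsMatch(email, EmailPattern) != expectedMatch)
            {
                Console.WriteLine("Unexpected result for: " + email);
                unexpected++;
            }
        }
        return unexpected;
    }

    static void Main()
    {
        int failures = CountUnexpected(BadEmails, false)
                     + CountUnexpected(GoodEmails, true);
        Console.WriteLine(failures == 0 ? "All checks passed" : failures + " check(s) failed");
    }
}
```

Returning a count instead of only printing makes it easy to move the same lists into a real unit test framework later.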