I was writing some validation code for a desktop app, and some of the fields require web page URLs. My first thought was to use a regular expression to validate the input, and then this quote came to mind.
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
Regular expressions can solve many problems, but if there is a method in the .Net Framework that will solve the problem, I’ll use that first.
I started off using the IsWellFormedUriString method from the Uri class.
I created a string extension method called ValidateUrl() and tried it with the following examples
You can see the results of those examples with the following .Net Fiddle:
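The Fiddle itself isn't reproduced here, so the following is only a minimal sketch of what such an extension method might look like, assuming it does nothing more than wrap Uri.IsWellFormedUriString with UriKind.Absolute. Apart from what://foo.com, the sample inputs are placeholders I made up for illustration, not the original test data.

```csharp
using System;

public static class StringExtensions
{
    // Sketch only: reports whether the string is a well-formed absolute URI.
    // Note that nothing here looks at which scheme was used.
    public static bool ValidateUrl(this string url)
    {
        return Uri.IsWellFormedUriString(url, UriKind.Absolute);
    }
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical inputs, except for what://foo.com.
        string[] samples =
        {
            "http://www.example.com",
            "www.example.com",
            "ftp://ftp.example.com/file.txt",
            "what://foo.com"
        };

        foreach (var sample in samples)
        {
            Console.WriteLine($"{sample} -> {sample.ValidateUrl()}");
        }
    }
}
```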
The first is valid and we expected that. The 2nd is not valid and is expected. The 3rd is valid and I didn’t account for that in the code. The last one, what://foo.com, that was unexpected. That is obviously not a valid URL. Or at least not valid for the purposes of my app.
But I wasn’t checking whether the URL was valid; I was checking whether the URI was valid. A URI consists of a scheme name (“http”, “ftp”, “mailto”, etc.), followed by the “:” character, and then the path to the resource. The Uri.IsWellFormedUriString method does not validate the scheme name; it only validates that the URI was assembled correctly.
Validating for a web page URL would be on me. I can use Uri.IsWellFormedUriString, but I would need to add an additional check to verify that the URI started with http:// or https://. With that modification, the example from above becomes this Fiddle:
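Since the Fiddle isn't shown here, this is a rough sketch of how that extra check could be written, assuming a simple case-insensitive prefix test is acceptable. The sample inputs are again placeholders, apart from what://foo.com.

```csharp
using System;

public static class StringExtensions
{
    // Sketch only: well-formed absolute URI AND an http:// or https:// prefix.
    public static bool ValidateUrl(this string url)
    {
        // IsWellFormedUriString returns false for null, so the
        // StartsWith calls below are never reached with a null string.
        if (!Uri.IsWellFormedUriString(url, UriKind.Absolute))
        {
            return false;
        }

        return url.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
            || url.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
    }
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical inputs, except for what://foo.com.
        string[] samples =
        {
            "http://www.example.com",
            "https://www.example.com",
            "ftp://ftp.example.com/file.txt",
            "what://foo.com"
        };

        foreach (var sample in samples)
        {
            Console.WriteLine($"{sample} -> {sample.ValidateUrl()}");
        }
    }
}
```

Comparing the Scheme property of a parsed Uri against Uri.UriSchemeHttp and Uri.UriSchemeHttps would be another way to express the same rule; the prefix test just stays closest to the description above.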
And with the same inputs, I now get the results that I need. The one that starts with “ftp:” is a valid URL, but I’m returning it as invalid because that meets the requirements of my app. Your mileage may vary.
While testing the code, I came across an example of a valid URL that IsWellFormedUriString will return as invalid. It fails on URLs with embedded Unicode characters, like this one:
This apparently broke in .NET 4.5 and was logged as a bug. According to the notes attached to that bug, it’s not going to be fixed any time soon. For this project, I know that the URLs entered by the user will not have Unicode characters, so I don’t have to worry about that.
If I did have to handle Unicode characters, that’s when I would use a regular expression. Someone already went in search of the perfect URL validation regex and found one from Diego Perini that handles all sorts of use cases.
Weighing in at 502 characters, this handles just about any valid URL.
In that format, that regex is what you would call write-once code. Diego posted a commented version of it as a gist, which you can read without your eyes bleeding.