The 3 Deadliest Sins of Form Validation

Why You Probably Shouldn’t Validate an Email Address with Regex and Other Deadly Sins

Published on Jul 25, 2019

Introduction

Forms are an integral part of the web. For businesses, they provide an essential way of growing and maintaining a customer base. For users, they’re tedious but necessary. As web developers, it’s our job to cater to both groups. We should aim to make forms as quick and painless as possible, while also ensuring the data we collect is useful and valid.

And yet the web is full of terrible form validation. Maybe it’s because we underestimate the complexity involved. Maybe we find forms as boring as our users do. But the end result is frustrating for everyone: we end up with forms that don’t accept legitimate data, which either turns users away or forces them to submit incorrect information.

In this article, I’ll share what I believe to be the three deadliest sins of form validation — and what you can do to avoid other people’s mistakes.

1. Email Addresses

Several of the ‘deadly sins’ in this article involve regular expressions, a.k.a. regex. Used well, regular expressions provide a succinct way to find and manipulate substrings. But the internet is also full of less-than-perfect regex, which is either too strict, preventing some users from submitting perfectly legitimate information, or too broad.

Email is a great example. There are many regular expressions out there designed to validate email addresses, but the vast majority are too strict. For instance, here’s a simple regex for email validation:

/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

This expression likely covers 95% of emails out there. But it fails to validate many email addresses which, though less common, are still perfectly acceptable. For example, the following valid email addresses would be rejected by the above expression:

joe/bloggs@example.com
%jane!doe$@example.com
"john smith"@example.com
company+department@example.com

The mistake many developers make is assuming that, if 19 out of 20 users can enter their email without trouble, that’s good enough. But why settle for 19 users when you could have 20?
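To see those rejections for yourself, here is a minimal sketch that runs the simple expression from this section against the four addresses listed above. The variable names are my own, and jane.doe@example.com is an extra address added purely as a passing case.

```javascript
// The simple (too strict) expression quoted earlier in this section.
const simpleEmailRegex = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// The four valid-but-unusual addresses listed above.
const unusualButValid = [
  'joe/bloggs@example.com',
  '%jane!doe$@example.com',
  '"john smith"@example.com',
  'company+department@example.com',
];

// Collect every address the expression wrongly turns away.
const wronglyRejected = unusualButValid.filter(
  (address) => !simpleEmailRegex.test(address)
);

console.log(wronglyRejected.length); // 4: every one of them is rejected
console.log(simpleEmailRegex.test('jane.doe@example.com')); // true: an everyday address passes
```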

What’s the definition of a valid email address?

Several RFCs have been written to define what is and what isn’t allowed in an email address, and the latest core specification (RFC 5322) is over 17,000 words long. Does that make it regex-proof? Not exactly, but it’s certainly too much trouble for most of us to come up with from scratch.

The following expression purports to cover the entire RFC specification, and it’s over 6,000 characters long!

The mistake many developers make is trying to come up with the expression from scratch. For production-level code, it’s better to rely on a more exhaustive expression like the one above — or, better still, to depend on a trusted and regularly updated library like validator.js, which has a convenient isEmail method.

2. Passwords

Whether or not you think this is a ‘deadly sin’ will depend on your school of thought. It’s popular on the web to enforce certain restrictions for user passwords (for example, including at least one digit), but many developers are strongly opposed to this: they consider it bad practice to enforce any kind of password rules.

The ‘No Rules’ Point of View

For some, there’s no good reason to put limitations on people’s passwords, as this only reduces the total number of possible strings, making the entire database that much less secure.

The one practical exception is size: you probably don’t want someone to submit a 1TB password and crash your server. In that case, a simple regex like /^.{0,128}$/ may be all you need.
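Here is a rough sketch of that size limit in practice. The helper name isWithinLengthLimit is hypothetical, and 128 is simply the figure from the expression above. Two quirks are worth knowing: the dot does not match line breaks, so a password containing one fails the regex, and {0,128} also lets an empty string through. A plain length comparison avoids the first quirk.

```javascript
// The size-limit expression from the paragraph above.
const lengthLimitRegex = /^.{0,128}$/;

// Hypothetical helper: a plain length check with the same 128 limit.
function isWithinLengthLimit(password, max = 128) {
  return password.length <= max;
}

console.log(lengthLimitRegex.test('a'.repeat(128))); // true
console.log(lengthLimitRegex.test('a'.repeat(129))); // false
console.log(lengthLimitRegex.test(''));              // true: empty strings pass too
console.log(lengthLimitRegex.test('line1\nline2'));  // false: "." skips line breaks
console.log(isWithinLengthLimit('line1\nline2'));    // true
```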

In a system like this, you’re trusting your users not to choose easily guessable strings like password, qwerty or 123456. People on this side of the debate tend to think that, if users choose a weak password, it’s their problem if their account gets compromised.

The ‘Nanny State’ Point of View

By contrast, many companies opt to protect people from themselves, at the cost of fewer possible passwords overall. For example, it is common to require one capital, one number and one special character in users’ passwords, using a regex like this:

/^(?=.{8,})(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@"£$%^&*()_+\-=~\[\]]).*$/
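One thing to watch when writing an expression like this by hand: inside a character class, an unescaped hyphen between two characters defines a range, so a fragment such as +-= quietly matches digits as well. The sketch below is one possible version with the hyphen escaped, the pattern anchored and an explicit digit rule. The names are hypothetical and the exact list of special characters is an assumption.

```javascript
// An unescaped hyphen creates a range: "+" to "=" includes the digits 0-9.
const unescapedHyphen = /[+-=]/;
const escapedHyphen = /[+\-=]/;

// Hypothetical policy: 8+ characters with a lowercase letter, a capital,
// a digit and one character from an assumed list of specials.
const strictPasswordRegex =
  /^(?=.{8,})(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@"£$%^&*()_+\-=~\[\]]).*$/;

function meetsPolicy(password) {
  return strictPasswordRegex.test(password);
}

console.log(unescapedHyphen.test('1')); // true: "1" falls inside the range
console.log(escapedHyphen.test('1'));   // false
console.log(meetsPolicy('Passw0rd!'));  // true
console.log(meetsPolicy('Password1'));  // false: no special character
```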

How Much Less Secure Is It?

To keep things (relatively) simple, let’s compare two 10-character passwords: one with restrictions and one without. For the sake of example, I’ll assume there are 94 options for each character in the passwords.

Without any restrictions, a 10-character password could have almost 54 quintillion (roughly 53,900,000,000,000,000,000) possible variations!

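Now let’s imagine a 10-character password with the following restrictions: 1 character has to be a capital letter, 1 character has to be a numeral, and 1 character has to be one of 32 special characters. If we simplify by fixing those three characters in set positions, the number of possible combinations is 26 × 10 × 32 × 94^7, which reduces dramatically to 0.5 quintillion, just 1% of the previous number! Letting the required characters fall anywhere in the password shrinks the pool far less, so treat this as a worst case.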
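The arithmetic behind those figures can be sketched with BigInt, which keeps the large numbers exact. I’m assuming the 94 characters break down into 26 capitals, 26 lowercase letters, 10 numerals and 32 special characters, and the variable names are my own. The 1% figure corresponds to pinning each required character to a set position. For comparison, the sketch also counts passwords where the required characters may fall anywhere, using inclusion-exclusion.

```javascript
const LENGTH = 10n;
const ALL = 94n; // assumed: 26 capitals + 26 lowercase + 10 numerals + 32 specials
const CAPITALS = 26n;
const NUMERALS = 10n;
const SPECIALS = 32n;

// No restrictions: 94 choices for each of the 10 characters.
const unrestricted = ALL ** LENGTH;

// Required characters pinned to three set positions, 94 choices elsewhere.
const fixedPositions = CAPITALS * NUMERALS * SPECIALS * ALL ** (LENGTH - 3n);

// Count of passwords that avoid every character in the excluded classes.
const without = (...excluded) =>
  (ALL - excluded.reduce((sum, n) => sum + n, 0n)) ** LENGTH;

// Required characters anywhere: subtract passwords missing one class,
// add back those missing two classes, subtract those missing all three.
const anywhere =
  unrestricted
  - without(CAPITALS) - without(NUMERALS) - without(SPECIALS)
  + without(CAPITALS, NUMERALS) + without(CAPITALS, SPECIALS) + without(NUMERALS, SPECIALS)
  - without(CAPITALS, NUMERALS, SPECIALS);

// Whole-number percentage of the unrestricted total (rounded down).
const percentOf = (count) => (count * 100n) / unrestricted;

console.log(unrestricted.toString());   // a 20-digit number, roughly 5.4e19
console.log(percentOf(fixedPositions)); // 1n
console.log(percentOf(anywhere));       // well over half of the total
```

On these assumptions, the fixed-position count comes out at about 1% of the total, while the count with the required characters anywhere is well over half of it.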

But, using a brute-force method, it could still take as many as 9 years to crack a password in that less secure system. So the method you choose depends on your priorities: would you prefer fewer of your users to have weak passwords or a system that is more secure overall?

3. Phone Numbers

This is the last ‘deadly sin’. Unlike email, which — though complex — has a standardised definition, there is no global standard for local phone numbers. In some parts of the world, valid numbers can be as short as 3 digits.

If your users are all from one country, it’s possible to create a regex for country-specific validation, but even then there’s plenty of room for error. In the UK, where I live, most landlines have an area code followed by 6 digits. But my parents have an area code followed by 5 digits, which many online forms don’t accept, forcing them to submit an incorrect phone number!

As with email, the only certain way to know that a phone number is valid (and active) is to send a verification text.

And another thing!

Have you ever entered your phone number online, only for the form to chop off the initial 0? That happens because the site is storing the phone number as a Number, not as a String.

That’s annoying, but it shouldn’t prevent a website owner from contacting a user. However, in JavaScript, there’s an additional risk, because a 0 at the beginning of a number can be read as shorthand for an octal number. This applies to number literals in non-strict code and, in older engines, to parseInt called without a radix. If someone’s phone number happens to lack the digits 8 or 9, it could be silently converted to a completely different decimal. You could end up storing 07123 123456 as 961324846!
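Here is a small sketch of both problems, using the phone number from the example above. Because strict-mode JavaScript rejects octal-style literals such as 07123123456 outright, the octal reading is reproduced here with parseInt and an explicit radix of 8. The variable names are my own.

```javascript
const phoneInput = '07123123456';

// Stored as a Number, the leading 0 disappears.
const asNumber = Number(phoneInput);

// Read as octal, the same digits become a completely different value.
const asOctal = parseInt(phoneInput, 8);

// Stored as a String, nothing is lost.
const asString = String(phoneInput);

console.log(asNumber); // 7123123456
console.log(asOctal);  // 961324846
console.log(asString); // 07123123456
```

Keeping the value as a String from the moment it leaves the form avoids both outcomes.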

Overall, form design on the web could be a lot better if — instead of reaching for a quick regex shortcut — more developers spent time becoming aware of edge cases and understanding the best practices for each particular field.

Or maybe I’m just being picky!

© 2024 Bret Cameron