Password strength estimation - by Dan Wheeler

Speaker: Dan Wheeler (Dropbox)
Date: Friday 31 January 2014
Duration: 1h
Location: MIT, Cambridge, MA

This talk was based on a blog post (archives) authored by the speaker.

Generalities about passwords

If an attacker knows that a password is made of 4 words, the chances he discovers it still are very low. It proves us that xkcd's example is still a good example.

Common substitutions (a → @, e → 3, ...) are well known by crackers and do not add a lot of entropy. However, it makes passwords harder to remember. Having an uppercase as a first letter is also common, as well as ending a password with 1 or 2 digits or special characters. Crackers can specialize their brute force attacks using these rules and find passwords more easily. Rotating letters, appending one random character at the end, ... Crackers try these patterns too. Adding a few letters in front of a word found in common dictionaries word is better but not enough.

Based on 6 million leaked actual passwords, a study ran by Mark Burnett showed that 99.8% occured a top 10,000 word list, with 91% in the top 1,000 (although these numbers must be used carefully as the bias was important).

State of the art of existing password strength estimators

Existing password estimator usually give bad advice. For example, passwordmeter.com gives a score of 27% to cremeuniversebruledinosaur and 83% to cremeuniversebrUledinosaur! You cannot reason about these rules. Twitter's estimator is based on dictionary recognition and recognizes password as a bad password. However, passw0rd is scored as okay and paSsw0rd123456 is scored as perfect.

The speaker published a table showing feedback given by common websites estimators for qwER43@!, Tr0ub4dour&3 and correcthorsebatterystaple. Most systems, such as Gmail or Yahoo!, will recommend the weakest passwords (the first two ones) over the best one. Bank of America will refuse them all. Citibank outputs that the best password (the third one) is the worst one. Twitter's estimator says that all of them are perfect.

hashcat is a tool described as advanced password recovery. It is actually more of a password hacking tool.

zxcvbn, the free and open source password strengh estimator by Dropbox

Dropbox's estimator has its caveats. For example, troubador87 is rated as excellent while troubadour87 is rated as weak because troubadour is part of a dictionary and troubador is not. cremedinosaur is weak because it is just the concatenation of 2 words existing in common dictionaries, but cremedinosau is good.

A password is rated as So-so when it is considered safe against an online attack, that is when the website database has not been compromised. A password is good when an attacker will be hardly able to guess a password or impersonate a user although he has access to the database.

The project is executed on the client side for two reasons:

it is lightweight and easy to include to another project, does not need a specific server
the tested password never get sent on the network, limiting the risks of network traffic interception.

The project is open source. There are many port of the project, such as a Java port for Android, a version in Scala, ... There is no port for iOS so far.

Data

The first thing needed to build a good estimator is to have a good dictionary word on hand. 30,000 words is a good figure, because more makes the list too heavy to be downloaded on the client side while less is not enough to test against. A frequency is also needed. A list from a Wikipedia project was used because it shows a frequency list of common words used in television and movie scripts, which matches popular culture.

A script generating keyboard layouts (QUERTY, Dvorak, keypad, ...) is also useful to predict what will be the next letter based on key sequences.

Other kind of data might include popular dates, common names and surnames, ...

Matching

The script first enumerates all the different matches it detects on a given password. These matches could be heavily overlapping (for example, a password can contain damnation which also contains dam and nation).

This step also create alternative matches including substitutions.

It does a spatial match: for every letter entered, does the next one follow a pattern from a keyboard layout? It also matches repeats and sequences, as well as digits, dates, ... using regular expressions.

The result of the matching step is a sorted list of substrings of the password.

Scoring

For each one of the entries matched, we want to attach a number, which is the entropy.

The script is simple but conservative: it is more appropriate to minimize the entropy estimation of a password (hence the time needed to crack it) rather than spending a long time to compute the accurate entropy.

The ultimate score would be to determinate how long does it take to crack a given password. However, obtaining ultimate accuracy of the estimation would be to code an actual cracker to check the password. Here, we just want a sound advice to give to the user in a reasonable computation time.

Example of a trivial scoring: for a given word ranked in a dictionary based on its frequency, the entropy would be log₂(rank).

This step scores different patterns. Example of spatial pattern on a QWERTY layout: the password qwerfvcx is ranked low in equation because there are not many key patterns of 8 characters including 2 turns. It does the similar computations for lower and upper case letters: changing from password to Password is common so it will just add 1 bit of entropy. Using uppercase letters only is also common, as well as all but one or two. Therefore entropy is maximum when uppercase and lowercase are half and half.

Search

The search step removes the underlapping sequences and keep the simplest set of password, that is, with the lower entropy.

The algorithm also adds one extra character meant to be brute-forced (adding an entropy of log₂(size of the keyboard layout)) and checks if the resulting entropy is lower than the one computed with an exact matching of substrings.

The choice of algorithm really matters.

Threat model

Considering the worst cast scenario (attackers can brute force an offline database using a GPU farm), how long does it take to crack a given password?

Giving feedback to the user using a percentage or an integer value is not intuitive because it depends on what level of compromisation is necessary to find the password.

The math to go from an entropy to an actual number of seconds to decode is simply a linear function.

Downsides of zxcvbn

zxcvbn does not test against reverse patterns, missing letters, foreign dictionaries, n-grams, ... It is difficult to test against all the existing patterns in a short time. It is a problem though, because for every pattern that the tool ignores, it compute an overestimated entropy, giving erroneous feedback to the user.

Another downside to adding these patterns is that it would result in a much bigger dictionary, which would not be downloaded in a reasonable time.

Password strength estimation by Dan Wheeler