






An analysis of passwords, discussing the importance of strong passwords, common password patterns, and the use of a probabilistic approach to password analysis. The document also includes a large dataset of passwords and their frequencies, as well as an explanation of how to calculate the probabilistic likelihood of a password using n-grams.
A B S T R A C T
Password authentication is still the most used authentication mechanism in today’s computer systems. In most systems, the password is set by the user and must adhere to certain password requirements. Additionally, password checkers rank the strength of a password to give the user an indication of how secure their password is. In this paper, we take a look at a large database of user-chosen passwords to determine the current state of affairs. In the end, we extract a model from the database and provide our own password checker, which ranks passwords in various ways. We ran this checker against our dataset, which shows that over 90% of the passwords are highly insecure.
I N T R O D U C T I O N
As we perform increasingly important tasks from our living room computer, the topic of computer security also becomes increasingly important. And indeed, many advances have been made in the field of computer security to protect home users from digital crime. For instance, the wireless transmission protocol, Wi-Fi, has had several (much overdue) security overhauls to protect home users from being eavesdropped on or worse (1). However, when we look at what we typically use to authenticate ourselves, we are stuck with a system that dates back to the Roman empire (2): passwords.
Although an ancient concept, password authentication is, and most likely will be for a long time, the most used authentication mechanism for computer users.
Password authentication requires no specialized hardware (unlike, for example, fingerprint authentication), can be easily implemented by developers and is just as easily used by users. In short: it’s usable.

But is it safe? The topic of “Security versus Usability” has always been the subject of much debate in the computer security world (3). Users must be protected from harm by security controls, but these controls should not interfere (much) with the tasks the users want to perform. For instance, a firewall that simply blocks all traffic can be considered secure, but heavily impedes the overall usability of the system.

Security experts or system developers are usually the ones who have to make this trade-off, but with password authentication, this task is essentially passed on to the user (3): one can either choose a short and simple password, which is easy to remember but also easy to crack, or a very long and complicated password, which is hard to remember but also hard to crack.

Unfortunately, most users do not see a trade-off: they see an obstacle, and they will choose the path of least resistance to overcome that obstacle. Thus, short and simple passwords that are easy to remember, but also easy to guess, are used (4).

To stop users from using weak passwords, most systems enforce certain requirements that a password must meet before it gets set. Examples of common requirements are a minimum password length, the occurrence of uppercase letters, digits and/or symbols in the password, and inequality with the user’s username or e-mail address (5).
However, holding on to the principle of the path of least resistance, one can expect users to try to ‘circumvent’ these requirements in a predictable manner (4). For instance, given that a user starts out with an actual word such as ‘house’ and the password is required to contain at least one digit, one might expect the user to simply suffix the word with a single digit. In the findings which we will present, you will see that 15% of the passwords were a word or name suffixed with the number one.
In this paper, we will first discuss the dataset which we used to perform our research. After this, we will show you the results of our preliminary tests on this dataset. In the next chapter, we will go one step further and extract actual patterns from the passwords. After this, we will show you how we used a probabilistic approach to password analysis. In the final chapter, we will present our password checker, which combines the results of all the previous chapters.
D A T A S E T
On December 4th, 2009, a hacker breached a company database of RockYou!¹ containing the usernames and unencrypted passwords of about 32 million users (3). This database was subsequently published on the internet and is now in wide circulation. Obviously, we don’t condone hacking, but the presence of this database gives us a unique opportunity to perform a large-scale empirical study on passwords.
The most notable fact about the aforementioned affair is not that a large database was hacked, but that the passwords inside the database were stored in unencrypted form (so-called plaintext passwords). These days, it is common practice to salt and hash passwords before permanently storing them, which makes it generally hard to study passwords even when granted access to the right databases.
¹ RockYou! (originally known as RockMySpace), based in Redwood City, California, is a publisher and developer of applications and other social network services. As of December 2007, it is the most successful widget maker for the Facebook platform in terms of total installations.
The RockYou! database was acquired in the form of a long text file where each password resides on its own line. The file contained no other information, such as the usernames. Since the source of the data was questionable at best, we ran various tests and filters to ensure the quality of the data. Various noise factors were discovered:
P R E L I M I N A R Y T E S T S
Before we started with the actual analysis, various basic tests were run to gain some insight into the database. These include a letter frequency analysis, a character type analysis, a length distribution analysis and a common password analysis.
Letter frequency analysis helps us in various ways. Knowing the frequency of each letter gives us the ability to define a more fine-grained metric for measuring password strength. By grading the chance of occur-
This would suggest that a significant part of our dataset consists of words or names. Another troublesome result is the share of passwords that consist solely of digits, which represents 16% of the dataset. Numeric passwords have limited complexity, since each position can take only one of 10 values.
Length distribution analysis gives us insight into the common lengths of user-chosen passwords.
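As a rough illustration of a character type analysis, the sketch below labels each password by the kinds of characters it contains and reports the share of each label. The function names `classify` and `class_shares` and the four labels are hypothetical choices made for this example, not the categories used in the original study.

```python
import string
from collections import Counter


def classify(password: str) -> str:
    """Return a label such as 'digit' or 'digit+lower' for one password."""
    kinds = set()
    for ch in password:
        if ch in string.ascii_lowercase:
            kinds.add("lower")
        elif ch in string.ascii_uppercase:
            kinds.add("upper")
        elif ch in string.digits:
            kinds.add("digit")
        else:
            kinds.add("other")  # symbols, spaces, non-ASCII characters
    return "+".join(sorted(kinds)) or "empty"


def class_shares(passwords):
    """Map each label to the fraction of passwords carrying that label."""
    counts = Counter(classify(p) for p in passwords)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


if __name__ == "__main__":
    sample = ["123456", "password", "abc123", "Princess", "12345"]
    for label, share in sorted(class_shares(sample).items()):
        print(f"{label:12s} {share:.0%}")
```

On the full dataset, the share reported under the `digit` label would correspond to the purely numeric passwords discussed above.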
Figure 3 – Password length distribution
The results of the analysis, as shown in Figure 3, do not show a normal distribution, but rather a truncated one. We argue that this is a result of minimum password length requirements. Strangely, the database contained entries as short as one character. At the moment of writing, the RockYou! website enforces a minimum password length of 8 characters, but this was most likely lower (or non-existent) in the past (10).
In the previous chapter we already noted that we ignored any entry longer than 30 characters. The range of passwords of size 15 through 29 covered about 2 percent of the database. Half the passwords of length 20 and above were e-mail addresses, thus explaining their unusual length.
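A length distribution of this kind can be computed in a few lines. The following is a minimal sketch, assuming the passwords are available as an iterable of strings; the function name `length_distribution` is hypothetical, and the cut-off of 30 characters mirrors the filter mentioned in the text.

```python
from collections import Counter


def length_distribution(passwords, max_length=30):
    """Return {length: fraction} for entries of 1..max_length characters."""
    lengths = Counter(
        len(p) for p in passwords if 0 < len(p) <= max_length
    )
    total = sum(lengths.values())
    return {n: lengths[n] / total for n in sorted(lengths)}


if __name__ == "__main__":
    sample = ["123456", "12345", "password", "iloveyou", "abc123"]
    for length, share in length_distribution(sample).items():
        print(f"{length:2d} characters: {share:.1%}")
```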
Finally, we take a look at the most common passwords. We expect these passwords to be really weak, as a password that is widely used is most likely one that you can logically guess, that you can find in a common password listing, or that is context dependent (such as the name of the service).
Password     Count    Percentage
123456       290731   0.8918%
12345        79078    0.2426%
123456789    76790    0.2356%
password     59463    0.1824%
iloveyou     49952    0.1532%
princess     33291    0.1021%
1234567      21727    0.0666%
rockyou      20903    0.0641%
12345678     20553    0.0630%
abc123       16648    0.0511%
Table 1 – Top 10 passwords
Table 1 shows the top 10 passwords with their absolute counts. The use of numerical sequences immediately stands out, as 5 of the 10 passwords belong to this class. When we break the numbers down to percentages, two troublesome conclusions can be drawn:
Although these overlapping lists strengthen our belief that the RockYou! dataset is representative, it saddens us at the same time to see that so many users still use very predictable passwords.
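A common password analysis boils down to counting identical entries and expressing each count as a percentage of the whole dataset. The sketch below shows one way to do this; `top_passwords` is a hypothetical helper, and the input is assumed to be an iterable with one password per entry.

```python
from collections import Counter


def top_passwords(passwords, n=10):
    """Return (password, count, percentage) for the n most common entries."""
    counts = Counter(passwords)
    total = sum(counts.values())
    return [
        (pw, count, 100.0 * count / total)
        for pw, count in counts.most_common(n)
    ]


if __name__ == "__main__":
    sample = ["123456", "123456", "123456", "password", "password", "abc123"]
    for pw, count, pct in top_passwords(sample, n=3):
        print(f"{pw:10s} {count:3d} {pct:7.4f}%")
```

The percentage column is simply the count divided by the number of entries, so a count of 290731 at 0.8918% implies a dataset of roughly 32.6 million entries.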
[Figure: password length (5 to 12) on the horizontal axis, percentage (0% to 30%) on the vertical axis]
P A T T E R N A N A L Y S I S
The previous chapter showed several ways of defining characteristics of a password. Now, we want to go a step further and extract actual patterns from passwords. This will help us to better understand how passwords are formed and ultimately allow us to construct an improved password checker.
M. Dell’Amico (6) studied a much smaller database of roughly 10,000 entries. Various regular expressions were run against the database in an attempt to recognize patterns in passwords. We repeat their experiment on the RockYou! database and present both their results and ours below in Table 3.
Expression         Example   IIMS      RockYou
[a-z]+             abcdef    51.20%    41.69%
[A-Z]+             ABCDEF    0.29%     1.50%
[A-Za-z]+          AbCdEf    53.74%    44.05%
[0-9]+             123456    9.10%     15.93%
[a-zA-Z0-9]+       A1b2C3    93.43%    96.20%
[a-z]+[0-9]+       abc123    14.51%    27.69%
[a-zA-Z]+[0-9]+    aBc123    16.30%    30.16%
[0-9]+[a-zA-Z]+    123aBc    1.80%     2.75%
[0-9]+[a-z]+       123abc    1.65%     2.53%
Table 3 – Regular expressions
The most notable difference is the use of digits. Most percentages in the IIMS database indicate a reliance on lowercase characters, whilst in the RockYou! database patterns that contain digits are more prominent. The patterns that describe a word that is suffixed by a number are almost twice as popular in the RockYou! database. Passwords that are made purely of digits are also more popular in the RockYou! database.
Based on these results we can argue that the users of RockYou! have been trained to create more secure passwords, most likely by other applications which enforce stricter requirements. One exception would be the passwords that consist purely of digits. These passwords are used more often in the RockYou! dataset, and we consider this kind of password to be very insecure, due to its limited complexity.
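The regular expression experiment can be reproduced along the following lines. This is a sketch under one explicit assumption: each expression must match the password as a whole, which is why `re.fullmatch` is used. That reading is consistent with Table 3 (for instance, `[A-Za-z]+` covers slightly more passwords than `[a-z]+`), but the original matching procedure is not spelled out. The names `PATTERNS` and `pattern_shares` are hypothetical.

```python
import re

# The expressions from Table 3.
PATTERNS = [
    r"[a-z]+",
    r"[A-Z]+",
    r"[A-Za-z]+",
    r"[0-9]+",
    r"[a-zA-Z0-9]+",
    r"[a-z]+[0-9]+",
    r"[a-zA-Z]+[0-9]+",
    r"[0-9]+[a-zA-Z]+",
    r"[0-9]+[a-z]+",
]


def pattern_shares(passwords, patterns=PATTERNS):
    """Return {pattern: fraction of passwords matched in full}."""
    compiled = [(p, re.compile(p)) for p in patterns]
    hits = {p: 0 for p in patterns}
    total = 0
    for pw in passwords:
        total += 1
        for pattern, regex in compiled:
            if regex.fullmatch(pw):  # assumption: whole-password match
                hits[pattern] += 1
    return {p: (hits[p] / total if total else 0.0) for p in patterns}


if __name__ == "__main__":
    sample = ["abcdef", "ABCDEF", "abc123", "123456", "123abc"]
    for pattern, share in pattern_shares(sample).items():
        print(f"{pattern:18s} {share:.2%}")
```

Note that the expressions overlap: a password such as `abcdef` counts towards `[a-z]+`, `[A-Za-z]+` and `[a-zA-Z0-9]+` at the same time, so the shares do not add up to 100%.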
Besides the aforementioned expressions, we propose our own set of supplemental expressions. We are most interested in passwords that start with letters and end with digits, as they make up about one third of our dataset and regular expressions can help us better understand this set. Therefore, we repeat the regular expression ‘[a-zA-Z]+[0-9]+’ and keep track of all the numbers that are matched. We then look at the most popular numbers and convert these back to regular expressions.
Number   Count     Percentage
1        1476941   16.37%
123      325963    3.61%
2        284354    3.15%
12       213870    2.37%
3        166762    1.85%
13       150069    1.66%
7        147951    1.64%
11       122630    1.36%
5        120376    1.33%
22       107444    1.19%
23       106425    1.18%
01       102756    1.14%
4        101573    1.13%
07       100693    1.12%
21       100370    1.11%
14       95288     1.06%
10       92655     1.03%
06       86495     0.96%
08       86065     0.95%
8        83819     0.93%
15       83708     0.93%
69       81299     0.90%
16       78506     0.87%
6        76798     0.85%
18       71343     0.79%
Table 4 – Top 25 numeric suffixes
Table 4 shows the 25 most used numeric suffixes, which together make up half of the passwords matched by the regular expression ‘[a-zA-Z]+[0-9]+’. We expected single digits to be popular suffixes and indeed, the digit ‘1’ covers over a million passwords that use it as a suffix. However, double-digit numbers occur twice as often as single digits, which we did not expect. Interestingly, the digit 9 is only present once in the listing (and not even as a single-digit number).
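Keeping track of the matched numbers amounts to adding a capture group to the expression. The sketch below does this with `[a-zA-Z]+([0-9]+)`; the helper name `suffix_counts` is hypothetical. Suffixes are kept as strings rather than converted to integers, so that `01` and `1` stay distinct, as they are in Table 4.

```python
import re
from collections import Counter

# Letters followed by digits; the digits are captured as the suffix.
SUFFIX_RE = re.compile(r"[a-zA-Z]+([0-9]+)")


def suffix_counts(passwords):
    """Count the numeric suffixes of passwords of the form letters+digits."""
    counts = Counter()
    for pw in passwords:
        match = SUFFIX_RE.fullmatch(pw)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = ["house1", "love1", "abc123", "jan01", "123456", "a1b2"]
    counts = suffix_counts(sample)
    matched = sum(counts.values())
    for suffix, count in counts.most_common():
        print(f"{suffix:6s} {count:3d} {count / matched:.2%}")
```

In this sketch the percentages are taken relative to the number of matched passwords, which is an assumption about how the percentage column of Table 4 was derived.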
Besides the top 25, we present a second list which contains entries longer than 2 characters from the top 100.
Number   Count   Percentage
101      51065   0.57%
1234     49619   0.55%
2007     30731   0.34%
2006     29122   0.32%
666      24317   0.27%
2008     24300   0.27%
12345    20276   0.22%
2005     18694   0.21%
007      18261   0.20%
420      16470   0.18%
123456   15811   0.18%
1994     14288   0.16%
1993     13695   0.15%
Since we are calculating the product of a series of probabilities, the assumption is made that the collection of occurrences is mutually independent, which means that we assume that the following property of mutual independence holds:

$$P\left(\bigcap_{i=1}^{k} A_i\right) = \prod_{i=1}^{k} P(A_i)$$

Although we know that this property does not actually hold, the assumption that it does is required for us to simplify the problem. Through this independence assumption, our model assumes the Markov property, which enables reasoning and computation with the model that would otherwise be intractable.
The length of the password determines the number of iterations, which generally means that the longer the password, the more iterations there are and the smaller the overall outcome of the formula will be. This is taken into account by explicitly including the probability of the length of the password, denoted by $\lambda(|\alpha|)$, in our formula.
In the event that $i < n$, which happens in the first $n - 1$ iterations of the product, the substring references a non-existing character index of zero or smaller. These references are substituted with the ‘»’ sign to denote the fact that we are dealing with a starting sequence. Thus, we get the convention:

$$(\forall i \le 0 \mid \alpha_i = \text{»})$$

The set of possible characters is determined by the distinct characters which we encounter in the dataset. Thus, we do not include the full Unicode character set by default. Given the previous description of $\nu$, the corresponding formula looks like:

$$\nu(c_1 \cdots c_n \mid c_1 \cdots c_{n-1}) = \frac{\sigma(c_1 \cdots c_n)}{\sum_{\bar{c} \in C} \sigma(c_1 \cdots c_{n-1}\,\bar{c})}$$

The function $\sigma$ denotes the number of occurrences of the input string inside the model. The numerator of the fraction represents the number of occurrences of the string that we are interested in (the n-gram), whilst the denominator represents the number of occurrences of all strings that share its first $n - 1$ characters and end in any possible character. In the denominator, $\bar{c} \in C$ indicates a character from the set of all possible characters as encountered in our dataset.
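To make the pieces concrete, the sketch below counts n-grams, computes the conditional frequency as the ratio of n-gram occurrences to occurrences of all n-grams sharing the same first n - 1 characters, and multiplies these ratios together with the length probability. The overall product form and the names `train`, `nu` and `probability` are assumptions pieced together from the description in this section, not the authors' actual implementation. Leading positions are padded with the ‘»’ sign, following the convention for indices of zero or smaller.

```python
from collections import Counter

START = "»"  # stands in for character positions before the password begins


def train(passwords, n):
    """Count n-grams, their (n-1)-character contexts, and password lengths."""
    ngrams = Counter()    # sigma(c_1 ... c_n)
    contexts = Counter()  # sum over c_bar of sigma(c_1 ... c_{n-1} c_bar)
    lengths = Counter()
    for pw in passwords:
        lengths[len(pw)] += 1
        padded = START * (n - 1) + pw
        for i in range(len(pw)):
            gram = padded[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts, lengths


def nu(gram, ngrams, contexts):
    """Relative frequency of the n-gram among all n-grams with its context."""
    denominator = contexts[gram[:-1]]
    return ngrams[gram] / denominator if denominator else 0.0


def probability(password, n, model):
    """Length probability times the product of nu over all positions."""
    ngrams, contexts, lengths = model
    total = sum(lengths.values())
    p = lengths[len(password)] / total if total else 0.0
    padded = START * (n - 1) + password
    for i in range(len(password)):
        p *= nu(padded[i:i + n], ngrams, contexts)
    return p


if __name__ == "__main__":
    model = train(["ab", "ab", "ac"], n=2)
    print(probability("ab", 2, model))  # seen twice out of three
    print(probability("ad", 2, model))  # contains an unseen bigram
```

Because the score is a plain product, a single n-gram that never occurs in the training data drives the whole result to zero.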
The actual value of $n$ has a huge impact on the behavior of the model. In the next few paragraphs, we will discuss various values of $n$ and their implications.

Using a value of one for $n$ (unigrams) will make the algorithm contextless, as the algorithm will only look at the frequency of single letters. This is synonymous with using the character frequency analysis from the previous chapter to grade or predict passwords.

If one were to choose a very large $n$, then the algorithm would lose its ability to dissect passwords and simply grade a password by its entire string of characters. This is synonymous with calculating the frequency of a given password within the original dataset. A password generator with such a model would (almost) never generate any passwords outside the dataset.

On another note, a large $n$ would also impose problems of the computational kind. As the value of $n$ increases, the number of possible n-grams increases exponentially. Given a dataset of our size and an $n$ of 5, our algorithm used up nearly a gigabyte of system memory. For a larger $n$ we had to resort to using a smaller subset of our dataset.

Finally, a large $n$ also imposes another problem with the effectiveness of our algorithm. As we said earlier, one of the powerful features of using n-grams is that they can be used to analyze or produce new (unseen) data based on known data.

However, this imposes some requirements on the known data. Say we want to calculate the strength of the phrase “password” with $n = 4$, thus giving the formula $P_4(\text{"password"})$. This would mean calculating the product of the probabilistic frequencies of all 4-grams contained within the phrase “password”. Consider the 4-gram “word”, which requires the calculation as shown at the top of this page. Now, consider a model which has never seen the 4-gram “word”. This would result in $\sigma(\text{"word"}) = 0$ and thus $\nu(\text{"word"} \mid \text{"wor"}) = 0$.
As we’re taking the product of $\nu$ in our algorithm, this would ultimately lead to $P_4(\text{"password"}) = 0$. One might argue that this is favorable, since we cannot grade something when we have no background data. However, given the previous example, we do have actual values of $\nu$ for all the 4-grams preceding “word”. Thus there is certainly some measure of occurrence present in the calculation, which is lost when we multiply it by zero. We overcame this limitation by altering the algorithm to raise all occurrences by one, so that any item that never occurs is counted as one instead of zero. This has little impact on n-grams that occur many times and makes our algorithm more robust when dealing with unseen data. The downside is that n-grams that occur very few times will have a less accurate representation.
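Raising every count by one can be sketched as follows. This is one possible reading of the adjustment: the numerator becomes the n-gram count plus one, and the denominator sums the raised counts over every character of the alphabet, so the values for a given context still add up to one. Whether the authors adjusted the denominator in exactly this way is an assumption, and `smoothed_nu` is a hypothetical name.

```python
def smoothed_nu(gram, ngram_counts, alphabet):
    """Add-one estimate for the last character of gram given the rest.

    ngram_counts maps n-grams to their number of occurrences; n-grams that
    were never seen are simply absent and count as zero before the raise.
    """
    context = gram[:-1]
    numerator = ngram_counts.get(gram, 0) + 1
    denominator = sum(
        ngram_counts.get(context + c, 0) + 1 for c in alphabet
    )
    return numerator / denominator


if __name__ == "__main__":
    # Toy counts: the model has seen "wora" and "work", but never "word".
    counts = {"wora": 3, "work": 1}
    alphabet = "adk"
    for ch in alphabet:
        print("wor" + ch, smoothed_nu("wor" + ch, counts, alphabet))
```

With these toy counts, the unseen 4-gram “word” receives 1/7 instead of 0, while the frequent “wora” drops from 3/4 to 4/7. This illustrates the trade-off described above: unseen data no longer zeroes out the product, at the cost of some distortion for rarely observed n-grams.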
We used two thirds of our dataset as a training set for our algorithm and the remaining one third as our test set.
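A two-thirds/one-third split could be produced as in the sketch below. The text does not say how the entries were assigned to either set, so the shuffle with a fixed seed is an assumption made here for reproducibility, and `split_dataset` is a hypothetical helper.

```python
import random


def split_dataset(passwords, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the data and cut it into training and test sets."""
    shuffled = list(passwords)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    data = [f"password{i}" for i in range(9)]
    train_set, test_set = split_dataset(data)
    print(len(train_set), len(test_set))
```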
N=1      N=2      N=3        N=
aaaaa    123456   123456     123456
eeeee    123123   1234567    1234567
ale      12345    12345      12345678
aanaa    121234   12345678   123456789
aaaan    112345   234567     passwo
aaaaaa   123234   milove     12345
aaeaaa   123412   112345     passwor
aeaeae   123450   lover      ilovey
ann      123452   012345     password
Table 7 – Most frequent passwords
Table 7 shows the top 10 most likely passwords as graded by our algorithm for various sizes of n. As we noted earlier, an analysis based on unigrams (1-grams) is basically a character frequency analysis. As vowels are the most frequently occurring characters in natural language, passwords that contain vowels are graded higher than those that do not. When we look further down the list we see mostly short, pronounceable passwords.
When using bigrams (2-grams), numbers take the upper hand, as they form the most frequent two-character pairs in the dataset. As we’ve seen in the previous chapter, there are simply a lot of passwords that consist of or contain numerical sequences. Moreover, the chance that some digit follows another is larger than the chance that some letter follows another, simply because there are only 10 digits as opposed to 26 letters.
When using trigrams (3-grams), text-based passwords start to show up, but numerical sequences still prevail. This top 10, however, shows similarities to the actual top 10 of most occurring passwords in the RockYou! dataset.
Finally, when using quadrigrams (4-grams), numerical sequences are still overrepresented, but the larger list of results shows a great deal of overlap with our own top 1000.
From our findings we concluded that using trigrams was a good trade-off between accuracy and usability (in terms of performance).
P A S S W O R D C H E C K E R
The previous chapters illustrated various ways in which we can grade password strength. In this chapter, we combine these metrics into a program that can help users to assess the safety of their passwords. Each step employs one or more metrics, which leads to one or more results. Each result has a name, a short description in natural language of what the result represents, and a severity level to indicate the impact of the result. Some results are purely informative and therefore have no severity impact. At the core, our checker does various things.
a. If any words were found during the pattern analysis step, they will be checked against common dictionaries of passwords, words and names. We prefer to use small dictionaries of a specific type, since this gives us more power and certainty in describing what the word represents.
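The result structure and the dictionary step could be modelled as in the sketch below. All names here (`Result`, `dictionary_check`, the severity value 2, and the dictionary labels) are hypothetical and chosen only for illustration; the checker's actual data model and severity scale are not given in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    name: str
    description: str
    severity: Optional[int] = None  # None marks a purely informative result


def dictionary_check(words, dictionaries):
    """Report every word that appears in one of the typed dictionaries.

    dictionaries maps a type label (e.g. "name") to a set of lowercase
    entries, so the description can say what kind of word was found.
    """
    results = []
    for word in words:
        for kind, entries in dictionaries.items():
            if word.lower() in entries:
                results.append(Result(
                    name=f"{kind} match",
                    description=f"'{word}' appears in the {kind} dictionary.",
                    severity=2,  # hypothetical severity level
                ))
    return results


if __name__ == "__main__":
    dictionaries = {
        "password": {"iloveyou", "princess"},
        "name": {"alice", "bob"},
    }
    for result in dictionary_check(["Alice", "xqzt"], dictionaries):
        print(result)
```

Keeping one small dictionary per type, rather than a single merged word list, is what lets the description state whether the match was a name, a common word, or a known password.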
Romanosky, Sasha and Cranor, Lorrie F. s.l. : ACM, 2006. pp. 67--78.