Monday, April 7th

Code breaker project

It will be helpful to be able to convert easily between strings of characters and lists of their corresponding ASCII codes.

Exercise: Write functions str_to_ascii and ascii_to_str that will convert between strings and lists of ASCII codes. We can use the ord() and chr() functions to convert any particular character or ASCII code.

In [1]:
def str_to_ascii(s):
#     ascii_list = []
#     for c in s:
#         ascii_list.append(ord(c))
#     return ascii_list

    return [ord(c) for c in s]
In [2]:
str_to_ascii('This is MTH 337')
Out[2]:
[84, 104, 105, 115, 32, 105, 115, 32, 77, 84, 72, 32, 51, 51, 55]
In [3]:
def ascii_to_str(ascii_list):
#     c_list = []
#     for n in ascii_list:
#         c_list.append(chr(n))
#     s = ''.join(c_list)
#     return s

    return ''.join([chr(n) for n in ascii_list])
In [11]:
ascii_to_str([104, 101, 108, 108, 111])
Out[11]:
'hello'

Consider the example described in the project page. Suppose we want encrypt the message "Top secret!" using the secret key "buffalo".

In [12]:
message = 'Top secret!'
key = 'buffalo'

Let's convert both to lists of ASCII codes:

In [14]:
message_ascii = str_to_ascii(message)
key_ascii = str_to_ascii(key)

print(message_ascii)
print(key_ascii)
[84, 111, 112, 32, 115, 101, 99, 114, 101, 116, 33]
[98, 117, 102, 102, 97, 108, 111]

Problem: Our message has more characters than our key, so we need to duplicate our key enough times to match the length of the message.

In [16]:
print(len(message_ascii))
print(len(key_ascii))
11
7

One idea: Count how many times we need to duplicate the key_ascii list to match or exceed the length of message_ascii:

In [23]:
num_repeats = len(message_ascii) // len(key_ascii) + 1

padded_key_ascii = key_ascii * num_repeats
In [24]:
print(padded_key_ascii)
[98, 117, 102, 102, 97, 108, 111, 98, 117, 102, 102, 97, 108, 111]

Another idea: use a while loop:

In [25]:
padded_key_ascii = []
while len(padded_key_ascii) < len(message_ascii):
    for n in key_ascii:
        padded_key_ascii.append(n)
In [26]:
print(padded_key_ascii)
[98, 117, 102, 102, 97, 108, 111, 98, 117, 102, 102, 97, 108, 111]
In [27]:
len(padded_key_ascii)
Out[27]:
14

We've padded out our key to be sufficiently long, now let's go number by number to encrypt:

In [28]:
encrypted_ascii = []

for padded_key_n, message_n in zip(padded_key_ascii, message_ascii):
    encrypted_ascii.append((padded_key_n + message_n) % 128)
In [29]:
encrypted_ascii
Out[29]:
[54, 100, 86, 6, 84, 81, 82, 84, 90, 90, 7]

Exercise: Write a function encrypt(message_ascii, key_ascii) that return the encrypted version of message_ascii using the secret key key_ascii (based on the code above).

Exercise: Write a companion function decrypt(encrypted_ascii, key_ascii) that returns the decrypted message.

In [32]:
decrypted_ascii = []

for padded_key_n, encrypted_n in zip(padded_key_ascii, encrypted_ascii):
    decrypted_ascii.append((encrypted_n - padded_key_n) % 128)
In [33]:
decrypted_ascii
Out[33]:
[84, 111, 112, 32, 115, 101, 99, 114, 101, 116, 33]
In [34]:
ascii_to_str(decrypted_ascii)
Out[34]:
'Top secret!'

Working with text files

I've downloaded the dictionary.txt file and the msmith37.txt file and placed them into my weekly notebook directory.

The open function can be used to open a file for reading or writing. We will also use the with construct to have Python manage the closing of our file when we're finished.

In [4]:
with open('dictionary.txt') as f:  # This opens dictionary.txt and names the contents `f`
    s = f.read()                   # This reads the contents of the file into a string
In [5]:
print(s[:100])
A
a
Aachen
Aalborg
aardvark
Aarhus
Aaron
AB
Ab
abaci
aback
abacus
Abadan
abaft
abalone
abandon
aband

A more appropriate structure would be to split the string on space/new lines to get a list of words in the dictionary:

In [6]:
dictionary = s.split()

print(dictionary[:10])
['A', 'a', 'Aachen', 'Aalborg', 'aardvark', 'Aarhus', 'Aaron', 'AB', 'Ab', 'abaci']

Similarly, we can read in an encrypted message:

In [7]:
with open('msmith37.txt') as f:
    s = f.read()
In [43]:
print(s[:100])
77 49 83 102 88 73 40 93 14 103 84 4 55 87 83 100 70 4 58 80 97 30 1 69 54 15 55 18 73 69 57 84 14 8

Again, we can split the string on spaces to produce a list of "integers".

In [44]:
s.split()[:10]
Out[44]:
['77', '49', '83', '102', '88', '73', '40', '93', '14', '103']

We need to convert each of these "integer" strings into actual integers:

In [73]:
encrypted_ascii = []

for str_n in s.split():
    encrypted_ascii.append(int(str_n))
In [74]:
encrypted_ascii[:10]
Out[74]:
[77, 49, 83, 102, 88, 73, 40, 93, 14, 103]

Let's just try to blindly decrypt this message:

In [89]:
key = 'bad luck'

key_ascii = str_to_ascii(key)

padded_key_ascii = []
while len(padded_key_ascii) < len(encrypted_ascii):
    for n in key_ascii:
        padded_key_ascii.append(n)
        
decrypted_ascii = []
for padded_key_n, encrypted_n in zip(padded_key_ascii, encrypted_ascii):
    decrypted_ascii.append((encrypted_n - padded_key_n) % 128)
    
decrypted_message = ascii_to_str(decrypted_ascii)

print(decrypted_message)
#decrypted_message
1trapa\pj`xR4"ldIw}rbPN},
b%Im+hmg<ckwWw|FjTS0,e)wiysx#BDueghO.y@SEgw=%E^+~`v;{fl`}xriWE${i=wln-#37g?\D}@iPNx,y^(w\}vtjGC"lsmovDZPD},r&Fr+ve#C>oflhF6rV]D$rpdKipmj0t`fbUsmFjaApx
;@ns'skDodll^T<*VaLsp%KnynvR|nbe`Fr*D^VHx,rc8wl`qH=i_`BwxEiTlq1j-Qtpu,p@Bv+AF.r3YySyz|b2w]sldnR{"^rFzvAlCsyi)Oczu+#@ouqpZJurFQAgw=%ElzbhS8e_lQsmF!yArp=;@ns'glRocoklr|Ae_Eh81q,<{hkpRoqcaB|nE^Uxro(J&+ydvD<dic]ox\^SOp:1Q,<opqhBCqo*To~;hUIip1q,<lubkNA"e_]uyAYHsxu)dD[olkHB"t_rkopFPNh,^8w^z~m#@<qkelU.E#7i,vu'?[yndg0"ccpyDYbpmf0P(+HewDAy^p]T~:ZaE$rpdJcwlmfDoqk[Po|6cHi,
Y
In [86]:
decrypted_message[22]
Out[86]:
'\x1e'
In [90]:
dictionary[100:120]
Out[90]:
['ablation',
 'ablative',
 'ablaut',
 'ablaze',
 'able',
 'ableism',
 'ableist',
 'abloom',
 'ablution',
 'ablutions',
 'ably',
 'ABM',
 'Abnaki',
 'abnegate',
 'abnegation',
 'abnegator',
 'Abner',
 'abnormal',
 'abnormality',
 'abnormally']

Note: We can use the in operator to check whether something is an element of a list (or other iterable):

In [53]:
'aardvark' in dictionary
Out[53]:
True
In [54]:
'Aardvark' in dictionary
Out[54]:
False

Wednesday, April 9th

Today, let's look at working with text data. We can get some sample text from the Project Gutenberg website.

I've downloaded the text from the book "Frankenstein" into the file frankenstein.txt and placed it into my weekly notebook folder.

Note: This particular file has some special characters that can't be decoded with the default decoder when opening, so I've included the keyword argument encoding='utf-8' to use a different decoder.

In [3]:
with open('frankenstein.txt',encoding='utf-8') as f:
    text = f.read()
In [8]:
print(text[2000:3000])
ions towards
which I am advancing, gives me a foretaste of those icy climes.
Inspirited by this wind of promise, my daydreams become more fervent
and vivid. I try in vain to be persuaded that the pole is the seat of
frost and desolation; it ever presents itself to my imagination as the
region of beauty and delight. There, Margaret, the sun is for ever
visible, its broad disk just skirting the horizon and diffusing a
perpetual splendour. There—for with your leave, my sister, I will put
some trust in preceding navigators—there snow and frost are banished;
and, sailing over a calm sea, we may be wafted to a land surpassing in
wonders and in beauty every region hitherto discovered on the habitable
globe. Its productions and features may be without example, as the
phenomena of the heavenly bodies undoubtedly are in those undiscovered
solitudes. What may not be expected in a country of eternal light? I
may there discover the wondrous power which attracts the needle and may
regulate a thousan

Exercise: What is the most commonly used word in Frankenstein?

Idea:

  • Construct a dictionary whose keys are words and whose values are how many times that word appears in Frankenstein.
  • To construct this dictionary:
    • Look through every word in Frankenstein.
    • If the word is not in our dictionary, add it to the dictionary with a value of 1.
    • If the word is in the dictionary, increment the value associated to that word by 1.
In [10]:
# Initialize an empty dictionary
# The keys will be words
# The values will be how many times that word appears in Frankenstein
word_count_dict = {}

# Split the text on white space to get a list of words
frankenstein_words = text.split()

# Iterate through each word in Frankenstein
for word in frankenstein_words:
    # Check if word is a key in word_count_dict
    if word in word_count_dict:
        # If so, increment the count by 1
        word_count_dict[word] += 1
    else:
        # If not, add it to the dictionary with a value of 1
        word_count_dict[word] = 1
In [12]:
#word_count_dict

We've generated a dictionary relating words to their counts, but which word has the highest count?

In [13]:
max(word_count_dict)
Out[13]:
'\ufeffThe'

By default, taking the maximum of a dictionary returns the maximum of the keys of that dictionary. In this case, the keys are strings, so it returns the last in alphabetical order.

In [14]:
max(['a','b','c'])
Out[14]:
'c'

We'd rather take the maximum of the values. We can use word_count_dict.values() to get a "list" of the values.

In [19]:
max(word_count_dict.values())
Out[19]:
4066

This gives us the maximum count, but we want the word that has this maximum count.

Do manage this, we can supply our own key function to the maximum function.

Finding maximums with a custom key

In [20]:
help(max)
Help on built-in function max in module builtins:

max(...)
    max(iterable, *[, default=obj, key=func]) -> value
    max(arg1, arg2, *args, *[, key=func]) -> value
    
    With a single iterable argument, return its biggest item. The
    default keyword-only argument specifies an object to return if
    the provided iterable is empty.
    With two or more arguments, return the largest argument.

Consider the following toy example:

In [21]:
my_list = ['One', 'Two', 'Three', 'Four']
In [22]:
max(my_list)
Out[22]:
'Two'

Let's make a custom key function to pass to max:

In [23]:
def my_key(s):
    if s == 'One':
        return 1
    if s == 'Two':
        return 2
    if s == 'Three':
        return 3
    if s == 'Four':
        return 4
In [24]:
max(my_list, key=my_key)
Out[24]:
'Four'

We can do the same thing with min:

In [25]:
min(my_list)
Out[25]:
'Four'
In [26]:
min(my_list, key=my_key)
Out[26]:
'One'

For the Frankenstein problem, we want to find the "maximum" word, where the size of the word is given by the count of that word.

In [27]:
def word_count_key(word):
    # Look up the value associated to `word`
    return word_count_dict[word]
In [29]:
most_common_word = max(word_count_dict, key=word_count_key)
print(most_common_word)
print(word_count_dict[most_common_word])
the
4066

Sticking with this Frankenstein text a little, suppose we want to find the 10 most commonly used words. In this case, we would like to sort our dictionary and take just the 10 words (and their counts).

We can use the sorted function to sort an iterable:

In [32]:
sorted(word_count_dict)[:20]
Out[32]:
['#84]',
 '$5,000)',
 '($1',
 '(801)',
 '(Godwin)',
 '(I',
 '(July',
 '(May',
 '(a)',
 '(although',
 '(and',
 '(any',
 '(as',
 '(b)',
 '(c)',
 '(does',
 '(foolish',
 '(for',
 '(if',
 '(inexperience']

By default, applying sorted to a dictionary will sort the keys. In this case, the keys are strings, so they are sorted alphabetically. For our purposes, we want to sort the keys based on their values (i.e. the word counts). We can again use the key keyword argument in sorted to change the sorting behavior:

In [35]:
sorted(word_count_dict, key=word_count_key)[-10:]
Out[35]:
['that', 'was', 'in', 'a', 'my', 'to', 'I', 'of', 'and', 'the']

We can also use the keyword argument reverse=True to sort in the reverse direction:

In [40]:
sorted_words = sorted(word_count_dict, key=word_count_key, reverse=True)[:10]
sorted_counts = [word_count_dict[word] for word in sorted_words]

for word, count in zip(sorted_words, sorted_counts):
    print(word, count)
the 4066
and 2968
of 2745
I 2719
to 2144
my 1631
a 1394
in 1129
was 994
that 986

Note: many of the "words" in our dictionary contain some extra characters. For example, "(for", "for", "For", are all words in our dictionary with differents associated to them.

Pre-processing our text

It might be helpful to perform the following steps before calculating word counts:

  • Convert all letters to lowercase
  • Remove all punctuation
  • Remove some other special characters (e.g. (, ', ", #, @, %, etc.)

We can convert any string to all lowercase using the .lower() method:

In [43]:
s = 'ThiS Is a StRiNg with UppEr and LoweR cASe letTeRs'
print(s)
print(s.lower())
ThiS Is a StRiNg with UppEr and LoweR cASe letTeRs
this is a string with upper and lower case letters

We can use the .replace method to replace punctuation with nothing.

In [54]:
s = "This is a string, it contains some punctuation. Here's some more: !?.,"
print(s)

#punctuations = ['.', ',', '!', '?', ':', ';']
punctuations = '.,!?:;()#@%'

for punctuation in punctuations:
    s = s.replace(punctuation,'')

print(s.lower())
This is a string, it contains some punctuation. Here's some more: !?.,
this is a string it contains some punctuation here's some more 

Connecting with the code breakers project

We have a list, frankenstein_words of all of the words that appear in Frankenstein.

In [59]:
frankenstein_words[:10]
Out[59]:
['\ufeffThe',
 'Project',
 'Gutenberg',
 'eBook',
 'of',
 'Frankenstein;',
 'Or,',
 'The',
 'Modern',
 'Prometheus']

How many of these words appear in our dictionary.txt file?

In [60]:
with open('dictionary.txt') as f:
    s = f.read()
words = s.split()
In [61]:
count = 0

for word in words:
    for frankenstein_word in frankenstein_words:
        if word == frankenstein_word:
            count += 1
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[61], line 5
      3 for word in words:
      4     for frankenstein_word in frankenstein_words:
----> 5         if word == frankenstein_word:
      6             count += 1

KeyboardInterrupt: 
In [63]:
len(words), len(frankenstein_words)
Out[63]:
(61406, 78101)
In [64]:
len(words) * len(frankenstein_words)
Out[64]:
4795870006

Let's time things:

In [66]:
from time import time
In [67]:
t0 = time()

count = 0

for frankenstein_word in frankenstein_words:
    if frankenstein_word in words:
        count += 1
        
t1 = time()

print(t1 - t0)
27.272066593170166

Note: There are many kinds of iterable structures in Python. For example, we've dealt with lists, arrays, tuples, dictionaries, generators. Another type of iterable structure is sets. These are just unordered collections of elements.

What happens if we convert our list of words into a set of words?

In [68]:
words_set = set(words)
In [75]:
t0 = time()

count = 0

for frankenstein_word in frankenstein_words:
    if frankenstein_word in words_set:
        count += 1
        
t1 = time()

print(len(frankenstein_words))
print(count)

print(t1 - t0)
78101
60082
0.024001359939575195

This was approximately ~200 times faster to perform the same check. It seems like working with sets may be faster. Can we do more to work with sets exclusively?

Note: sets don't include duplicate elements. If we convert frankenstein_words into a set, it will only one entry for each word that appers.

We can then ask a related question: Of the set of words that appear on Frankentstein, how many are in our dictionary?

In [70]:
frankenstein_words_set = set(frankenstein_words)
In [76]:
t0 = time()

count = 0

for frankenstein_word in frankenstein_words_set:
    if frankenstein_word in words_set:
        count += 1
        
t1 = time()

print(len(frankenstein_words_set))
print(count)
print(t1 - t0)
12175
4620
0.0040051937103271484