It will be helpful to be able to convert easily between strings of characters and lists of their corresponding ASCII codes.
Exercise: Write functions str_to_ascii and ascii_to_str that convert between strings and lists of ASCII codes. We can use the ord() and chr() functions to convert any particular character or ASCII code.
def str_to_ascii(s):
    # ascii_list = []
    # for c in s:
    #     ascii_list.append(ord(c))
    # return ascii_list
    return [ord(c) for c in s]
str_to_ascii('This is MTH 337')
def ascii_to_str(ascii_list):
    # c_list = []
    # for n in ascii_list:
    #     c_list.append(chr(n))
    # s = ''.join(c_list)
    # return s
    return ''.join([chr(n) for n in ascii_list])
ascii_to_str([104, 101, 108, 108, 111])
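As a quick sanity check, the two functions should undo each other. A minimal sketch, with the definitions repeated so that the block runs on its own:

```python
def str_to_ascii(s):
    """Return the list of ASCII codes for the characters in s."""
    return [ord(c) for c in s]

def ascii_to_str(ascii_list):
    """Return the string whose characters have the given ASCII codes."""
    return ''.join(chr(n) for n in ascii_list)

sample = 'This is MTH 337'
codes = str_to_ascii(sample)
print(codes[:4])            # codes for 'T', 'h', 'i', 's'
print(ascii_to_str(codes))  # recovers the original string
```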
Consider the example described in the project page. Suppose we want to encrypt the message "Top secret!" using the secret key "buffalo".
message = 'Top secret!'
key = 'buffalo'
Let's convert both to lists of ASCII codes:
message_ascii = str_to_ascii(message)
key_ascii = str_to_ascii(key)
print(message_ascii)
print(key_ascii)
Problem: Our message has more characters than our key, so we need to duplicate our key enough times to match the length of the message.
print(len(message_ascii))
print(len(key_ascii))
One idea: Count how many times we need to duplicate the key_ascii list to match or exceed the length of message_ascii:
num_repeats = len(message_ascii) // len(key_ascii) + 1
padded_key_ascii = key_ascii * num_repeats
print(padded_key_ascii)
Another idea: use a while loop:
padded_key_ascii = []
while len(padded_key_ascii) < len(message_ascii):
    for n in key_ascii:
        padded_key_ascii.append(n)
print(padded_key_ascii)
len(padded_key_ascii)
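A third option, which is not part of the original notes, is to let the standard library repeat the key. The sketch below rebuilds message_ascii and key_ascii from the example strings so that it is self-contained, then uses itertools.cycle and itertools.islice to produce a padded key of exactly the message length.

```python
from itertools import cycle, islice

message_ascii = [ord(c) for c in 'Top secret!']
key_ascii = [ord(c) for c in 'buffalo']

# cycle() repeats the key endlessly; islice() stops after len(message_ascii) items
padded_key_ascii = list(islice(cycle(key_ascii), len(message_ascii)))
print(padded_key_ascii)
print(len(padded_key_ascii) == len(message_ascii))
```

Unlike the two approaches above, this version never overshoots the message length, although zip would ignore any extra key entries anyway.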
We've padded out our key to be sufficiently long. Now let's go number by number to encrypt:
encrypted_ascii = []
for padded_key_n, message_n in zip(padded_key_ascii, message_ascii):
    encrypted_ascii.append((padded_key_n + message_n) % 128)
encrypted_ascii
Exercise: Write a function encrypt(message_ascii, key_ascii) that returns the encrypted version of message_ascii using the secret key key_ascii (based on the code above).
Exercise: Write a companion function decrypt(encrypted_ascii, key_ascii) that returns the decrypted message.
decrypted_ascii = []
for padded_key_n, encrypted_n in zip(padded_key_ascii, encrypted_ascii):
    decrypted_ascii.append((encrypted_n - padded_key_n) % 128)
decrypted_ascii
ascii_to_str(decrypted_ascii)
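One possible way to package the loops above into the encrypt and decrypt functions from the exercises is sketched below. This is not the official solution; the helper name pad_key is hypothetical, and the sketch assumes the key is non-empty.

```python
def pad_key(key_ascii, length):
    """Repeat key_ascii until it has exactly `length` entries (hypothetical helper)."""
    num_repeats = length // len(key_ascii) + 1
    return (key_ascii * num_repeats)[:length]

def encrypt(message_ascii, key_ascii):
    """Add the padded key to the message, code by code, modulo 128."""
    padded = pad_key(key_ascii, len(message_ascii))
    return [(m + k) % 128 for m, k in zip(message_ascii, padded)]

def decrypt(encrypted_ascii, key_ascii):
    """Subtract the padded key from the encrypted codes, modulo 128."""
    padded = pad_key(key_ascii, len(encrypted_ascii))
    return [(e - k) % 128 for e, k in zip(encrypted_ascii, padded)]

message_ascii = [ord(c) for c in 'Top secret!']
key_ascii = [ord(c) for c in 'buffalo']

encrypted_ascii = encrypt(message_ascii, key_ascii)
decrypted_ascii = decrypt(encrypted_ascii, key_ascii)
print(''.join(chr(n) for n in decrypted_ascii))
```

Decryption undoes encryption because ((m + k) % 128 - k) % 128 equals m whenever m is a valid ASCII code between 0 and 127.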
I've downloaded the dictionary.txt file and the msmith37.txt file and placed them into my weekly notebook directory.
The open function can be used to open a file for reading or writing. We will also use the with construct to have Python manage the closing of our file when we're finished.
with open('dictionary.txt') as f: # This opens dictionary.txt and binds the open file object to `f`
    s = f.read() # This reads the contents of the file into a string
print(s[:100])
A more appropriate structure would be a list of the words in the dictionary, which we can get by splitting the string on whitespace (spaces and newlines):
dictionary = s.split()
print(dictionary[:10])
Similarly, we can read in an encrypted message:
with open('msmith37.txt') as f:
    s = f.read()
print(s[:100])
Again, we can split the string on spaces to produce a list of "integers".
s.split()[:10]
We need to convert each of these "integer" strings into actual integers:
encrypted_ascii = []
for str_n in s.split():
    encrypted_ascii.append(int(str_n))
encrypted_ascii[:10]
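The same conversion can also be written as a list comprehension. The sample string below is made up for illustration and is not taken from msmith37.txt.

```python
sample = '54 100 22\n7 119'   # hypothetical file contents
encrypted_sample = [int(str_n) for str_n in sample.split()]
print(encrypted_sample)
```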
Let's just try to blindly decrypt this message:
key = 'bad luck'
key_ascii = str_to_ascii(key)
padded_key_ascii = []
while len(padded_key_ascii) < len(encrypted_ascii):
    for n in key_ascii:
        padded_key_ascii.append(n)
decrypted_ascii = []
for padded_key_n, encrypted_n in zip(padded_key_ascii, encrypted_ascii):
    decrypted_ascii.append((encrypted_n - padded_key_n) % 128)
decrypted_message = ascii_to_str(decrypted_ascii)
print(decrypted_message)
#decrypted_message
decrypted_message[22]
dictionary[100:120]
Note: We can use the in operator to check whether something is an element of a list (or other iterable):
'aardvark' in dictionary
'Aardvark' in dictionary
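The in operator compares strings exactly, so capitalization matters. The sketch below uses a small made-up word list as a stand-in for the real dictionary and assumes its entries are stored in lowercase; lowercasing the word before the check is one way to make the test case-insensitive.

```python
small_dictionary = ['aardvark', 'abacus', 'abandon']  # hypothetical stand-in

print('aardvark' in small_dictionary)
print('Aardvark' in small_dictionary)
print('Aardvark'.lower() in small_dictionary)
```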
Today, let's look at working with text data. We can get some sample text from the Project Gutenberg website.
I've downloaded the text from the book "Frankenstein" into the file frankenstein.txt and placed it into my weekly notebook folder.
Note: This particular file has some special characters that can't be decoded with the default decoder when opening, so I've included the keyword argument encoding='utf-8' to use a different decoder.
with open('frankenstein.txt', encoding='utf-8') as f:
    text = f.read()
print(text[2000:3000])
Exercise: What is the most commonly used word in Frankenstein?
Idea: Build a dictionary that maps each word to the number of times it appears, then find the word with the largest count.
# Initialize an empty dictionary
# The keys will be words
# The values will be how many times that word appears in Frankenstein
word_count_dict = {}
# Split the text on white space to get a list of words
frankenstein_words = text.split()
# Iterate through each word in Frankenstein
for word in frankenstein_words:
    # Check if word is a key in word_count_dict
    if word in word_count_dict:
        # If so, increment the count by 1
        word_count_dict[word] += 1
    else:
        # If not, add it to the dictionary with a value of 1
        word_count_dict[word] = 1
#word_count_dict
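The standard library's collections.Counter performs the same tally in one step. The sketch below runs on a short made-up sentence rather than the full Frankenstein text.

```python
from collections import Counter

sample_text = 'the cat and the dog and the bird'   # hypothetical sample
sample_counts = Counter(sample_text.split())

print(sample_counts['the'])
print(sample_counts['and'])
print(sample_counts['fish'])   # a missing key is reported with a count of 0
```

A Counter behaves like a dictionary, so the lookups and the max and sorted calls used in these notes should work on it unchanged.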
We've generated a dictionary relating words to their counts, but which word has the highest count?
max(word_count_dict)
By default, taking the maximum of a dictionary returns the maximum of its keys. In this case, the keys are strings, so it returns the key that comes last in lexicographic order (strings are compared by character code, so uppercase letters sort before lowercase ones).
max(['a','b','c'])
We'd rather take the maximum of the values. We can use word_count_dict.values() to get a "list" of the values.
max(word_count_dict.values())
This gives us the maximum count, but we want the word that has this maximum count.
To manage this, we can supply our own key function to the max function.
help(max)
Consider the following toy example:
my_list = ['One', 'Two', 'Three', 'Four']
max(my_list)
Let's make a custom key function to pass to max:
def my_key(s):
    if s == 'One':
        return 1
    if s == 'Two':
        return 2
    if s == 'Three':
        return 3
    if s == 'Four':
        return 4
max(my_list, key=my_key)
We can do the same thing with min:
min(my_list)
min(my_list, key=my_key)
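Built-in functions can serve as the key as well. As a small illustration that is not in the original notes, passing len compares the strings by length instead of alphabetically:

```python
my_list = ['One', 'Two', 'Three', 'Four']

longest = max(my_list, key=len)
shortest = min(my_list, key=len)   # ties go to the first such item in the list
print(longest, shortest)
```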
For the Frankenstein problem, we want to find the "maximum" word, where the size of the word is given by the count of that word.
def word_count_key(word):
    # Look up the value associated to `word`
    return word_count_dict[word]
most_common_word = max(word_count_dict, key=word_count_key)
print(most_common_word)
print(word_count_dict[most_common_word])
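Instead of defining a separate key function, the dictionary's own get method can play the same role, since calling get with a word returns that word's count. A sketch on a small made-up dictionary:

```python
toy_counts = {'the': 3, 'and': 2, 'bird': 1}   # hypothetical counts

most_common = max(toy_counts, key=toy_counts.get)
print(most_common, toy_counts[most_common])
```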
Sticking with the Frankenstein text a little longer, suppose we want to find the 10 most commonly used words. In this case, we would like to sort our dictionary by count and take just the top 10 words (and their counts).
We can use the sorted function to sort an iterable:
sorted(word_count_dict)[:20]
By default, applying sorted to a dictionary will sort the keys. In this case, the keys are strings, so they are sorted alphabetically. For our purposes, we want to sort the keys based on their values (i.e. the word counts). We can again use the key keyword argument in sorted to change the sorting behavior:
sorted(word_count_dict, key=word_count_key)[-10:]
We can also use the keyword argument reverse=True to sort in the reverse direction:
sorted_words = sorted(word_count_dict, key=word_count_key, reverse=True)[:10]
sorted_counts = [word_count_dict[word] for word in sorted_words]
for word, count in zip(sorted_words, sorted_counts):
    print(word, count)
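An alternative sketch sorts the (word, count) pairs directly, which avoids building a separate list of counts. The dictionary below is made up for illustration.

```python
toy_counts = {'the': 3, 'and': 2, 'bird': 1, 'cat': 5}   # hypothetical counts

# Each item is a (word, count) tuple; sort by the count in position 1
top_two = sorted(toy_counts.items(), key=lambda pair: pair[1], reverse=True)[:2]
for word, count in top_two:
    print(word, count)
```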
Note: Many of the "words" in our dictionary contain some extra characters. For example, "(for", "for", and "For" are all separate keys in our dictionary, each with its own count.
It might be helpful to perform the following steps before calculating word counts:
1. Convert all words to lowercase.
2. Remove punctuation ((, ', ", #, @, %, etc.).
We can convert any string to all lowercase using the .lower() method:
s = 'ThiS Is a StRiNg with UppEr and LoweR cASe letTeRs'
print(s)
print(s.lower())
We can use the .replace method to replace punctuation with nothing.
s = "This is a string, it contains some punctuation. Here's some more: !?.,"
print(s)
#punctuations = ['.', ',', '!', '?', ':', ';']
punctuations = '.,!?:;()#@%'
for punctuation in punctuations:
    s = s.replace(punctuation, '')
print(s.lower())
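The two steps can be combined into a single cleaning function applied to each word before counting. The sketch below makes a slightly different choice from the replace loop above: it uses str.strip with string.punctuation, which removes punctuation only from the ends of a word, so an apostrophe inside a word such as "Here's" is kept. The name clean_word is hypothetical.

```python
import string

def clean_word(word):
    """Lowercase a word and strip punctuation from both ends (hypothetical helper)."""
    return word.lower().strip(string.punctuation)

raw_words = '(for For for, "For'.split()
cleaned = [clean_word(w) for w in raw_words]
print(cleaned)
print(clean_word("Here's"))
```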
We have a list, frankenstein_words, of all of the words that appear in Frankenstein.
frankenstein_words[:10]
How many of these words appear in our dictionary.txt file?
with open('dictionary.txt') as f:
    s = f.read()
    words = s.split()
count = 0
for word in words:
    for frankenstein_word in frankenstein_words:
        if word == frankenstein_word:
            count += 1
len(words), len(frankenstein_words)
len(words) * len(frankenstein_words)
Let's time things:
from time import time
t0 = time()
count = 0
for frankenstein_word in frankenstein_words:
    if frankenstein_word in words:
        count += 1
t1 = time()
print(t1 - t0)
Note: There are many kinds of iterable structures in Python. For example, we've dealt with lists, arrays, tuples, dictionaries, and generators. Another type of iterable structure is the set. A set is an unordered collection of elements.
What happens if we convert our list of words into a set of words?
words_set = set(words)
t0 = time()
count = 0
for frankenstein_word in frankenstein_words:
    if frankenstein_word in words_set:
        count += 1
t1 = time()
print(len(frankenstein_words))
print(count)
print(t1 - t0)
This was roughly 200 times faster for the same check. Testing membership in a list scans the elements one by one, while a set uses hashing to look an element up directly, so working with sets can be much faster. Can we do more to work with sets exclusively?
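The timeit module gives a more careful way to compare the two membership tests than calling time() by hand. The sketch below uses made-up data (strings of numbers standing in for dictionary words) and a hypothetical helper count_hits; the actual timings will vary from machine to machine.

```python
from timeit import timeit

words = [str(n) for n in range(5000)]   # hypothetical stand-in for the dictionary
words_set = set(words)
queries = ['4999', 'missing'] * 50      # half the queries are found, half are not

def count_hits(container):
    """Count how many queries are found in container."""
    return sum(1 for q in queries if q in container)

# Run each check 5 times and report the total elapsed time in seconds
list_time = timeit(lambda: count_hits(words), number=5)
set_time = timeit(lambda: count_hits(words_set), number=5)
print(count_hits(words), count_hits(words_set))
print(list_time, set_time)
```

Both containers give the same count; only the time needed to find it differs.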
Note: Sets don't include duplicate elements. If we convert frankenstein_words into a set, it will contain only one entry for each word that appears.
We can then ask a related question: Of the set of words that appear in Frankenstein, how many are in our dictionary?
frankenstein_words_set = set(frankenstein_words)
t0 = time()
count = 0
for frankenstein_word in frankenstein_words_set:
    if frankenstein_word in words_set:
        count += 1
t1 = time()
print(len(frankenstein_words_set))
print(count)
print(t1 - t0)
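Once both collections are sets, the loop can be replaced by a set intersection, which keeps exactly the elements that appear in both. The sketch below uses two small made-up sets in place of the real ones.

```python
words_set = {'monster', 'creature', 'ice', 'zebra'}   # hypothetical dictionary words
frankenstein_words_set = {'monster', 'creature', 'ice', 'Frankenstein', '(for'}

common = frankenstein_words_set & words_set   # words present in both sets
print(len(common))
print(sorted(common))
```

The length of the intersection is the same count that the loop above computes.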