Wednesday, November 5th, 2025
Last class, we were exploring the text of Frankenstein; Or, The Modern Prometheus by Mary Shelley.
with open('frankenstein.txt', 'r', encoding='utf-8-sig') as f: # Opens the file for reading
    text = f.read() # Reads the content into a string
In particular, we had constructed a dictionary word_count_dict whose keys were words appearing in Frankenstein and whose values counted the number of times that word appeared in the text.
words = text.split() # Separate the text by white space to get a list of words.
word_count_dict = {} # Initialize an empty dictionary.
for word in words: # Iterate through each word in *Frankenstein*.
    if word not in word_count_dict: # If it's a new word,
        word_count_dict[word] = 1 # add it to our dictionary with an appearance count of 1.
    else: # Otherwise,
        word_count_dict[word] += 1 # increase the appearance count by 1.
We also discussed how to use a custom key function when sorting (or finding maximums/minimums) in order to find the most commonly appearing words.
def word_count_key(word):
    return word_count_dict[word]
sorted_words = sorted(word_count_dict, key=word_count_key, reverse=True)
for word in sorted_words[:10]:
    print(word, word_count_dict[word])
the 4263
and 2966
of 2902
I 2794
to 2234
my 1680
a 1447
in 1175
was 1030
that 1012
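The same key-function idea works with max and min, not just sorted. Here is a minimal sketch on a toy dictionary; the names toy_counts and toy_count_key and the counts themselves are made up for illustration and are not taken from Frankenstein.

```python
# Toy counts standing in for word_count_dict (hypothetical values).
toy_counts = {'the': 5, 'monster': 2, 'and': 3}

def toy_count_key(word):
    # Look up how many times the word appears.
    return toy_counts[word]

# max/min iterate over the dictionary's keys and compare them by count.
most_common = max(toy_counts, key=toy_count_key)
least_common = min(toy_counts, key=toy_count_key)
print(most_common, least_common)
```

Iterating over a dictionary yields its keys, so max and min return a word rather than a count.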
Pre-processing text for analysis
Last class, we noticed that some words appear as several distinct keys in our dictionary. For example, the strings for, For, and FOR each appear as separate keys in word_count_dict.
for word in ['for', 'For', 'FOR']:
    print('The word "{}" appears {} times in Frankenstein.'.format(word, word_count_dict[word]))
The word "for" appears 500 times in Frankenstein.
The word "For" appears 23 times in Frankenstein.
The word "FOR" appears 3 times in Frankenstein.
We discussed how we can use the .lower method to convert all uppercase letters in a string to lowercase, which addresses this issue.
words = text.split() # Separate the text by white space to get a list of words.
word_count_dict = {} # Initialize an empty dictionary.
for word in words: # Iterate through each word in *Frankenstein*.
    lowercase_word = word.lower() # Convert all uppercase letters in word to lowercase.
    if lowercase_word not in word_count_dict: # If it's a new word,
        word_count_dict[lowercase_word] = 1 # add it to our dictionary with an appearance count of 1.
    else: # Otherwise,
        word_count_dict[lowercase_word] += 1 # increase the appearance count by 1.
word_count_dict['for']
526
Another issue with our code above is its handling of punctuation. For example, the strings 'study' and 'study,' appear as distinct keys in word_count_dict.
for word in ['study', 'study,']:
    print('The word "{}" appears {} times in Frankenstein.'.format(word, word_count_dict[word]))
The word "study" appears 13 times in Frankenstein.
The word "study," appears 5 times in Frankenstein.
We can use the .replace method to replace all instances of ',' with an empty string before calculating word counts. This is illustrated in the example below.
s = "This is a string, it contains some punctuation. Here's some more: !?.,"
print(s)
print(s.replace(',',''))
We can expect that this might happen with several other types of punctuation, such as '.', '!', '?', '(', ')', "'", '"', '#', '@', '%', etc.
Exercise: Use the .replace method to remove punctuation marks and other non-letter characters from the text of Frankenstein. Also use the .lower method to convert all uppercase letters to lowercase, and save this modified text to a new variable cleaned_text. Finally, re-compute word_count_dict using cleaned_text in place of text.
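One possible approach to this exercise is sketched below on a short sample string rather than the full novel. The helper names clean_text and count_words are hypothetical, and using string.punctuation as the list of marks to remove is an assumption: it only covers ASCII punctuation, so curly quotes or long dashes in the actual file would need to be added by hand.

```python
import string

def clean_text(raw_text):
    """Lowercase raw_text and remove ASCII punctuation using str.replace."""
    cleaned = raw_text.lower()
    for mark in string.punctuation:  # '!', '"', '#', ..., '~'
        cleaned = cleaned.replace(mark, '')
    return cleaned

def count_words(cleaned):
    """Count appearances of each whitespace-separated word."""
    counts = {}
    for word in cleaned.split():
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts

sample = "Study, study; STUDY! For what? For this."
sample_counts = count_words(clean_text(sample))
print(sample_counts)
```

Note that removing the apostrophe and hyphen turns "here's" into "heres" and joins hyphenated words, which may or may not be what we want for a given analysis.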
Project 5 - Code breakers
Our next project deals with trying to break an encryption to discover the meaning of a secret message. Let's look at how the encryption process works.
Background: ASCII codes
Each character on a computer keyboard is assigned an ASCII code, which is an integer in the range 0-127. The ASCII code of a character can be obtained using the ord() function:
for c in "This is MTH 337":
    print("'{}' -> {}".format(c, ord(c)))
'T' -> 84
'h' -> 104
'i' -> 105
's' -> 115
' ' -> 32
'i' -> 105
's' -> 115
' ' -> 32
'M' -> 77
'T' -> 84
'H' -> 72
' ' -> 32
'3' -> 51
'3' -> 51
'7' -> 55
Conversely, the function chr() converts ASCII codes into characters:
char_list = []
for n in [104, 101, 108, 108, 111]:
    char_list.append(chr(n))
txt = ''.join(char_list)
print(txt)
hello
It will be helpful to be able to convert easily between strings of characters and lists of their corresponding ASCII codes.
Exercise: Write functions str_to_ascii and ascii_to_str that will convert between strings and lists of ASCII codes. We can use the ord() and chr() functions to convert any particular character or ASCII code.
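A minimal sketch of one way to write these two functions, following the loop-and-append pattern used above. The parameter names s and codes are our own choice.

```python
def str_to_ascii(s):
    """Return the list of ASCII codes of the characters in s."""
    codes = []
    for c in s:
        codes.append(ord(c))
    return codes

def ascii_to_str(codes):
    """Return the string whose characters have the given ASCII codes."""
    char_list = []
    for n in codes:
        char_list.append(chr(n))
    return ''.join(char_list)

hello_codes = str_to_ascii('hello')
print(hello_codes)                # [104, 101, 108, 108, 111]
print(ascii_to_str(hello_codes))  # hello
```

The two functions undo each other, so ascii_to_str(str_to_ascii(s)) should give back s for any string s.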
Text encryption
In order to securely send a confidential message one usually needs to encrypt it in some way to conceal its content. Here we consider the following encryption scheme:
- One selects a secret key, which is a sequence of characters. This key is used to both encrypt and decrypt the message.
- Characters of the secret key and characters of the message are converted into ASCII codes. In this way the key is transformed into a sequence of integers $(k_1, k_2, \dots, k_r)$, and the message becomes another sequence of integers $(m_1, m_2, \dots, m_s)$. If $r<s$, then the secret key sequence is extended by repeating it as many times as necessary until it matches the length of the message.
- Let $c_i$ be the remainder from the division of $m_i + k_i$ by $128$. The sequence of numbers $(c_1, c_2, \dots, c_s)$ is the encrypted message.
For example, if the message is 'Top secret!' and the secret key is 'buffalo' then the encrypted message is: [54, 100, 86, 6, 84, 81, 82, 84, 90, 90, 7]. Let's develop some code that will allow us to perform this encryption ourselves.
message = 'Top secret!'
key = 'buffalo'
First, let's convert both to lists of ASCII codes using the str_to_ascii function.
for c in message:
    print("'{}' -> {}".format(c, ord(c)))
'T' -> 84
'o' -> 111
'p' -> 112
' ' -> 32
's' -> 115
'e' -> 101
'c' -> 99
'r' -> 114
'e' -> 101
't' -> 116
'!' -> 33
for c in key:
    print("'{}' -> {}".format(c, ord(c)))
'b' -> 98
'u' -> 117
'f' -> 102
'f' -> 102
'a' -> 97
'l' -> 108
'o' -> 111
(84 + 98) % 128
54
Problem: Our message has more characters than our key, so we need to duplicate our key enough times to match the length of the message.
One idea: use a while loop to keep duplicating key_ascii until it matches or exceeds the length of message_ascii.
Another idea: Use integer division to count how many times we need to duplicate the key_ascii list to match or exceed the length of message_ascii.
Another idea: use modular arithmetic on the index of key_ascii, dividing by the length of key_ascii to keep looping through key_ascii until we have enough entries.
Exercise: Write a function get_padded_key_ascii that takes in arguments key_ascii and length and returns a padded version of key_ascii of length length, obtained by repeating key_ascii as many times as necessary.
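Here is a sketch of the modular-arithmetic idea; the while-loop and integer-division ideas would work just as well. It assumes key_ascii is a non-empty list, since an empty key would cause a division by zero.

```python
def get_padded_key_ascii(key_ascii, length):
    """Repeat key_ascii cyclically until the result has exactly `length` entries."""
    padded = []
    for i in range(length):
        # i % len(key_ascii) wraps the index back to 0 after the last key entry.
        padded.append(key_ascii[i % len(key_ascii)])
    return padded

key_ascii = [98, 117, 102, 102, 97, 108, 111]  # ASCII codes of 'buffalo'
padded_key = get_padded_key_ascii(key_ascii, 11)
print(padded_key)  # [98, 117, 102, 102, 97, 108, 111, 98, 117, 102, 102]
```

If length is smaller than the length of the key, this version simply returns the first length entries of the key.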
Exercise: Write a function encrypt(message_ascii, key_ascii) that returns the encrypted version of message_ascii using the secret key key_ascii (based on the code above).
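One way encrypt might look is sketched below. To keep the example self-contained, it cycles through the key with a modular index instead of calling a separate padding function; that is a design choice of this sketch, not a requirement of the exercise.

```python
def encrypt(message_ascii, key_ascii):
    """Add key codes to message codes entry by entry, modulo 128."""
    encrypted = []
    for i in range(len(message_ascii)):
        k = key_ascii[i % len(key_ascii)]  # Cycle through the key as needed.
        encrypted.append((message_ascii[i] + k) % 128)
    return encrypted

message_ascii = [ord(c) for c in 'Top secret!']
key_ascii = [ord(c) for c in 'buffalo']
encrypted_ascii = encrypt(message_ascii, key_ascii)
print(encrypted_ascii)  # [54, 100, 86, 6, 84, 81, 82, 84, 90, 90, 7]
```

The output agrees with the encrypted message given in the example above for 'Top secret!' and 'buffalo'.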
In order to decrypt the message we work backwards: for each number $c_i$, we compute the remainder from the division of $c_i - k_i$ by $128$. This number is equal to $m_i$, so converting it into a character gives the $i$-th character of the message.
Exercise: Write a function decrypt(encrypted_ascii, key_ascii) that returns the decrypted message.
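A possible sketch of decrypt is given below, under the assumption that it should return a list of ASCII codes, which we then convert to a string. The difference $c_i - k_i$ is often negative, but in Python the % operator with a positive modulus always returns a value between 0 and 127, e.g. (54 - 98) % 128 is 84.

```python
def decrypt(encrypted_ascii, key_ascii):
    """Subtract key codes from encrypted codes entry by entry, modulo 128."""
    decrypted = []
    for i in range(len(encrypted_ascii)):
        k = key_ascii[i % len(key_ascii)]  # Cycle through the key as needed.
        # Python's % returns a non-negative result here even if the difference is negative.
        decrypted.append((encrypted_ascii[i] - k) % 128)
    return decrypted

encrypted_ascii = [54, 100, 86, 6, 84, 81, 82, 84, 90, 90, 7]
key_ascii = [ord(c) for c in 'buffalo']
decrypted_ascii = decrypt(encrypted_ascii, key_ascii)
decrypted_message = ''.join(chr(n) for n in decrypted_ascii)
print(decrypted_message)  # Top secret!
```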
Image denoising thoughts
i = 0
j = 0
#grid = noisy_img[i-1:i+2, j-1:j+2]
grid = noisy_img[i:i+2, j:j+2]
import numpy as np

def get_padded_img(img, pad):
    nrows, ncols = img.shape
    padded_img = np.ones((nrows + 2*pad, ncols + 2*pad))/2 # Create an array of 0.5 that will store our padded image
    padded_img[pad:pad+nrows, pad:pad+ncols] = img # Copy img into the center (this also works when pad is 0)
    return padded_img
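The sketch below shows why padding helps, assuming the goal is to look at the 3×3 neighborhood around each pixel. At a corner pixel the slice i-1:i+2 starts at -1, which NumPy reads as the last row, so the slice comes back empty. The array toy_img is a made-up stand-in for noisy_img, and the function is repeated here only so the example runs on its own.

```python
import numpy as np

def get_padded_img(img, pad):
    nrows, ncols = img.shape
    padded_img = np.ones((nrows + 2*pad, ncols + 2*pad)) / 2  # Border value 0.5
    padded_img[pad:pad+nrows, pad:pad+ncols] = img
    return padded_img

# Hypothetical 3x3 image standing in for noisy_img.
toy_img = np.array([[0.0, 1.0, 0.0],
                    [1.0, 0.0, 1.0],
                    [0.0, 1.0, 0.0]])
i, j = 0, 0

# Without padding, the slice -1:2 is empty for the corner pixel.
bad_grid = toy_img[i-1:i+2, j-1:j+2]
print(bad_grid.shape)  # (0, 0)

# With one layer of padding, pixel (i, j) of toy_img sits at (i+1, j+1),
# so its 3x3 neighborhood is padded[i:i+3, j:j+3].
padded = get_padded_img(toy_img, 1)
grid = padded[i:i+3, j:j+3]
print(grid)
```

The center entry grid[1, 1] is the original pixel toy_img[i, j], and the entries that fall outside the image take the padding value 0.5.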