Monday, April 20th, 2026¶

Last week, we looked at processing text data obtained from Project Gutenberg, a repository of public domain eBooks.

Project 5 - Code breakers¶

Our next project deals with trying to break an encryption to discover the meaning of a secret message. Let's look at how the encryption process works.

Background: ASCII codes¶

Each character on a computer keyboard is assigned an ASCII code, which is an integer in the range 0-127. The ASCII code of a character can be obtained using the ord() function:

In [1]:
for c in "This is MTH 337":
    print("'{}'  ->  {}".format(c, ord(c)))
'T'  ->  84
'h'  ->  104
'i'  ->  105
's'  ->  115
' '  ->  32
'i'  ->  105
's'  ->  115
' '  ->  32
'M'  ->  77
'T'  ->  84
'H'  ->  72
' '  ->  32
'3'  ->  51
'3'  ->  51
'7'  ->  55

Conversely, the function chr() converts ASCII codes into characters:

In [3]:
char_list = []
for n in [104, 101, 108, 108, 111]:
    char_list.append(chr(n))
    
txt = ''.join(char_list)
print(txt)
hello
In [15]:
char_list = []
for n in [98, 117, 102, 102, 97, 108, 111]:
    char_list.append(chr(n))
    
txt = ''.join(char_list)
print(txt)
txt
buffalo
Out[15]:
'buffalo'

It will be helpful to be able to convert easily between strings of characters and lists of their corresponding ASCII codes.

Exercise: Write functions str_to_ascii and ascii_to_str that will convert between strings and lists of ASCII codes. We can use the ord() and chr() functions to convert any particular character or ASCII code.

In [16]:
def str_to_ascii(s):
    ascii_codes = []
    for c in s:
        ascii_codes.append(ord(c))
    return ascii_codes
In [17]:
def str_to_ascii(s):
    return [ord(c) for c in s]
In [21]:
def ascii_to_str(ascii_codes):
    char_list = []
    for n in ascii_codes:
        char_list.append(chr(n))
    return ''.join(char_list)
In [23]:
def ascii_to_str(ascii_codes):
    return ''.join([chr(n) for n in ascii_codes])
In [24]:
s = 'This is MTH 337!'

ascii_codes = str_to_ascii(s)
new_s = ascii_to_str(ascii_codes)

print(s)
print(new_s)
This is MTH 337!
This is MTH 337!

Text encryption¶

In order to securely send a confidential message one usually needs to encrypt it in some way to conceal its content. Here we consider the following encryption scheme:

  • One selects a secret key, which is sequence of characters. This key is used to both encrypt and decrypt the message.
  • Characters of the secret key and characters of the message are converted into ASCII codes. In this way the key is transformed into a sequence of integers $(k_1, k_2, \dots, k_r)$, and the message becomes another sequence of integers $(m_1, m_2, \dots, m_s)$. If $r<s$, then the secret key sequence is extended by repeating it as many times as necessary until it matches the length of the message.
  • Let $c_i$ be the reminder from the division of $m_i + k_i$ by $128$. The sequence of numbers $(c_1, c_2, \dots, c_s)$ is the encrypted message.

For example, if the message is 'Top secret!' and the secret key is 'buffalo' then the encrypted message is: [54, 100, 86, 6, 84, 81, 82, 84, 90, 90, 7]. Let's develop some code that will allow us to perform this encryption ourselves.

In [25]:
message = 'Top secret!'
key = 'buffalo'

First, let's convert both to lists of ASCII codes using the str_to_ascii function.

In [40]:
message_ascii = str_to_ascii(message)
key_ascii = str_to_ascii(key)
In [41]:
print(message_ascii)
[84, 111, 112, 32, 115, 101, 99, 114, 101, 116, 33]
In [42]:
print(key_ascii)
[98, 117, 102, 102, 97, 108, 111]

Problem: Our message has more characters than our key, so we need to duplicate our key enough times to match the length of the message.

One idea: use a while loop to keep duplicating key_ascii until it matches or exceeds the length of message_ascii.

In [43]:
padded_key_ascii = key_ascii.copy()

while len(padded_key_ascii) < len(message_ascii):
    padded_key_ascii += key_ascii

padded_key_ascii = padded_key_ascii[:len(message_ascii)]
In [44]:
print(message_ascii)
print(padded_key_ascii)
[84, 111, 112, 32, 115, 101, 99, 114, 101, 116, 33]
[98, 117, 102, 102, 97, 108, 111, 98, 117, 102, 102]

Another idea: Use integer division to count how many times we need to duplicate the key_ascii list to match or exceed the length of message_ascii.

In [45]:
print(len(message))
print(len(key))
11
7
In [46]:
repetitions = len(message) // len(key) + 1
padded_key_ascii = (repetitions * key_ascii)[:len(message)]

print(message_ascii)
print(padded_key_ascii)
[84, 111, 112, 32, 115, 101, 99, 114, 101, 116, 33]
[98, 117, 102, 102, 97, 108, 111, 98, 117, 102, 102]

Another idea: use modular arithmetic on the index of key_ascii, dividing by the length of key_ascii to keep looping through key_ascii until we enough entries.

In [49]:
padded_key_ascii = []

for i in range(len(message)):
    padded_key_ascii.append(key_ascii[i % len(key_ascii)])
In [52]:
padded_key_ascii = [key_ascii[i % len(key_ascii)] for i in range(len(message))]
In [53]:
print(message_ascii)
print(padded_key_ascii)
[84, 111, 112, 32, 115, 101, 99, 114, 101, 116, 33]
[98, 117, 102, 102, 97, 108, 111, 98, 117, 102, 102]

Exercise: Write a function get_padded_key_ascii that takes in arguments key_ascii and length and returns a padded version of key_ascii of length length, obtained by repeating key_ascii as many times as necessary.

In [ ]:
 
In [ ]:
 

Exercise: Write a function encrypt(message_ascii, key_ascii) that return the encrypted version of message_ascii using the secret key key_ascii (based on the code above).

In [ ]:
def encrypt(message_ascii, key_ascii):
    padded_key_ascii = get_padded_key_ascii(key_ascii, len(message_ascii))
    for m,n in zip(message_ascii, padded_key_ascii):
        
In [ ]:
 

Text decryption¶

In order to decrypt the message we work backwards: for each number $c_i$, we compute the reminder from the division of $c_i - k_i$ by $128$. This number is equal to $m_i$, so converting it into a character we get the $i$-th letter of the message.

Exercise: Write a function decrypt(encrypted_ascii, key_ascii) that returns the decrypted message.

In [ ]:
 
In [ ]:
 
In [ ]:
 

Exercise: Test your functions by defining your own message and key strings (with message longer than key), and:

  1. Converting both strings into sequences of ASCII codes (using str_to_ascii function),
  2. Generating the encrypted message as a sequence of ASCII codes (using the encrypt function),
  3. Decrypting the encrypted message (using the decrypt function),
  4. Converting the decrypted message back into a string (using the ascii_to_str function).

The final string should match the original message string.

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 

Loading in your assigned encrypted message¶

Let's read in a file containing an encrypted message to play around with.

In [54]:
with open('adwoakis.txt', 'r') as f:
    text = f.read()

Looking into the text variable, we see that we have a string representing a sequence of integers, each separated by a space.

In [56]:
print(text[:100])
54 79 20 68 8 69 94 106 91 70 102 3 78 81 97 101 82 69 20 69 97 2 99 111 92 1 92 82 93 85 84 107 25 

We need to convert this string into a list of integers which we can then decrypt. We can use .split to separate this into a list of strings, each of which represents one integer.

In [59]:
encrypted_ascii_strs = text.split()
print(encrypted_ascii_strs[:10])
['54', '79', '20', '68', '8', '69', '94', '106', '91', '70']

In order to decrypt this message, we need to convert this list of strings (each of which represents an integer) into a list of integers.

In [60]:
encrypted_ascii = []
for encrypted_ascii_str in encrypted_ascii_strs:
    encrypted_ascii.append(int(encrypted_ascii_str))
In [61]:
print(encrypted_ascii[:10])
[54, 79, 20, 68, 8, 69, 94, 106, 91, 70]
In [63]:
ascii_to_str(encrypted_ascii)
Out[63]:
"6O\x14D\x08E^j[Ff\x03NQaeRE\x14Ea\x02co\\\x01\\R]UTk\x19\x01cI\x08YWaPI\x14RVG\x0fYQWUQKGS\x18ZPfH\x08VWY[\x01hKMl^lUFf\x0f\x08UW]\rTYD\\GS\x18UFfVMNU\x18QPkQ\x08C]\\\rDcZMTT\\\rUcJMVW]_\x0f\x14+MT\x0fdVUhOM\x02U]RU~VPG\x0f`NE\x14GZCff\rD`R[G\x0fm]\x01hR\x08Jo]_\r\x14E]V\x0fkUF\x14JZGf\x18PP`GMT\x0fY[E\x14FWNS]_\r\x14DVF\x0fl\\\x01[RrJ^eR\x01gKM\x02SaQ\x01bR\\\x02e][UiUM\x0e\x0f^\\S\x14VPG\x0f`NE\x14QWV\x0fk\\MX\x03IPh\x18ZBhFPGb\x18NOX\x03KQddQkbR\\\x02QjVO[\x03I\x02UY_U\\LVI\x0fgS\x01aRVGh2\rGfRU\x02W]_\x01ZD\\JTj\rT\\cM\x02fgbMX\x03KGalNJbOa\x02V]akVOWYb$\rBbG\x03\x02Pl\rIcPM\x02Xl\rXUV\x08E^dQ\x01hRW\x0e\x0f^\\S\x14DJQe]\rIYU\x08UW]\rIUG\x08Q]df\x01hKM\x02ag\\G m\\JagbH\\\x03_JX[U\x01hKM\x02fa[E\x14ZPKblYFX\x0f\x08Ge][\x01hKWWV`\rU\\H\x08NPjTFgW\x08EaYPLg\x03_Ga]wThRXRT\\\rVd\x03_Kc`\rThUIY\x0fY[E\x14UIIb&wk<HZ\x02[aaU`H\x08JPfQT\x14ZMTT\x18NMaR[V\x0ffbNVHL\x02faaI\x14FWNS&\r0\\\x04\x08C\x0feNUWK\x08OX_UU\x14DNH^jQk\\HZ\x02P\x18dPfOL\x02^^\rDcPNQal\x19\x01]I\x08UW]\rPbOa\x02SY_FX\x03\\CZ]\rB\x14VQPVdR\x01cQM\x02^ma\x01cI\x08VW]wCiQLNT$\rEfD_\x02Xl\rB[DQPbl\rU\\H\x08YPdY\r\x14DVF\x0foNSa\x03PGa\x18SJbJMTb\x18OZ\x14L\\\x10\x0fKUF\x14GZGf\x02\\OY\x03WWc&\r\x08FL[EWl\x0e\x08\x14KWY\x0faa\x01VOI\\T\\\x19\x01\\R_\x02Xl\rCiUVV\x10\x186U\x14ZIU\x0fY\rXUUU\x0e\x0fZ_J[K\\lUdNNY\x0f\x08NXcR\x01U\x03KC]\\mMY\x0f\x08Cb\x18`a\\H\x08JTdQ\x01\\HZ\x02WY[Eg\x03WXTj\rJh\x1d\x08Kc\x18dBg\x03I\x02fg[EYUNW[\x02YJ[K\\\x10\x0fAa\x01gHMOT\\\rSYDTNh\x18aP\x14WPG\x0fdVUhOM\x02\\YVEYQ\x08Cb\x18aIcXOJ\x0fkUF\x14ZMTT\x18`JhWQPV\x02OFZRZG\x0fY\rMUUOG\x0fa_Pb\x03[V^nR\r\x14ZQVW\x18OVfQQUW]Q\x01VUIUb\x18SFYW\x08C]\\\rB\x14EZCbkwPfQIOTfa\x01UW\x08V^h\x1b\x01HKM\x02Ua_F\x14E]T]]Q\x01kL\\J\x0fkbD\\\x03JNTk`FX\x03QPUdbFbFM\x1d\x0faa\x01kDZOT\\wTc\x03LG[aTIhI]N[q\x1b\x01HKM\x02[aaU`H\x08IXjY\x01\\DL\x02Pd_FUGa\x02bl_FhFPGS\x18\\Vh\x03PGa\x18SFYW\x08V^\x02dBfP\x08VW]Z\x01hRW\x1d\x0fZbU!\x10\\JT\x18`NUOT\x02UdNNY\x03_G]l\rPiW\x14\x02c`R\x01gWWXT\x18cBbL[JT\\'\x01gKMlWYQ\x01cQT[\x0flUF\x14UMOPa[T\x14RN\x02c`R\x01VXZPc%\\Vh\x03UCc[U\x01]Q\x08JTj\rIUQL\x10"
In [62]:
print(ascii_to_str(encrypted_ascii))
IUQLc`RgWWXTcBbL[JT\'gKMlWYQcQT[lUFUMOPa[TRNc`RVXZPc%\VhUCc[U]JTjcLG[aTIhI]N[qHKM[aaU`IXjY\DLPd_FUGabl_FhFPGS\VhPGaSFYV^dBfVW]ZhRWZbU!\JT`NUOTUdNNY_G]l
In [70]:
ascii_to_str(encrypted_ascii).split()
Out[70]:
['6O\x14D\x08E^j[Ff\x03NQaeRE\x14Ea\x02co\\\x01\\R]UTk\x19\x01cI\x08YWaPI\x14RVG\x0fYQWUQKGS\x18ZPfH\x08VWY[\x01hKMl^lUFf\x0f\x08UW]',
 'TYD\\GS\x18UFfVMNU\x18QPkQ\x08C]\\',
 'DcZMTT\\',
 'UcJMVW]_\x0f\x14+MT\x0fdVUhOM\x02U]RU~VPG\x0f`NE\x14GZCff',
 'D`R[G\x0fm]\x01hR\x08Jo]_',
 '\x14E]V\x0fkUF\x14JZGf\x18PP`GMT\x0fY[E\x14FWNS]_',
 '\x14DVF\x0fl\\\x01[RrJ^eR\x01gKM\x02SaQ\x01bR\\\x02e][UiUM\x0e\x0f^\\S\x14VPG\x0f`NE\x14QWV\x0fk\\MX\x03IPh\x18ZBhFPGb\x18NOX\x03KQddQkbR\\\x02QjVO[\x03I\x02UY_U\\LVI\x0fgS\x01aRVGh2',
 'GfRU\x02W]_\x01ZD\\JTj',
 'T\\cM\x02fgbMX\x03KGalNJbOa\x02V]akVOWYb$',
 'BbG\x03\x02Pl',
 'IcPM\x02Xl',
 'XUV\x08E^dQ\x01hRW\x0e\x0f^\\S\x14DJQe]',
 'IYU\x08UW]',
 'IUG\x08Q]df\x01hKM\x02ag\\G',
 'm\\JagbH\\\x03_JX[U\x01hKM\x02fa[E\x14ZPKblYFX\x0f\x08Ge][\x01hKWWV`',
 'U\\H\x08NPjTFgW\x08EaYPLg\x03_Ga]wThRXRT\\',
 'Vd\x03_Kc`',
 'ThUIY\x0fY[E\x14UIIb&wk<HZ\x02[aaU`H\x08JPfQT\x14ZMTT\x18NMaR[V\x0ffbNVHL\x02faaI\x14FWNS&',
 '0\\\x04\x08C\x0feNUWK\x08OX_UU\x14DNH^jQk\\HZ\x02P\x18dPfOL\x02^^',
 'DcPNQal\x19\x01]I\x08UW]',
 'PbOa\x02SY_FX\x03\\CZ]',
 'B\x14VQPVdR\x01cQM\x02^ma\x01cI\x08VW]wCiQLNT$',
 'EfD_\x02Xl',
 'B[DQPbl',
 'U\\H\x08YPdY',
 '\x14DVF\x0foNSa\x03PGa\x18SJbJMTb\x18OZ\x14L\\\x10\x0fKUF\x14GZGf\x02\\OY\x03WWc&',
 '\x08FL[EWl\x0e\x08\x14KWY\x0faa\x01VOI\\T\\\x19\x01\\R_\x02Xl',
 'CiUVV\x10\x186U\x14ZIU\x0fY',
 'XUUU\x0e\x0fZ_J[K\\lUdNNY\x0f\x08NXcR\x01U\x03KC]\\mMY\x0f\x08Cb\x18`a\\H\x08JTdQ\x01\\HZ\x02WY[Eg\x03WXTj',
 'Jh',
 '\x08Kc\x18dBg\x03I\x02fg[EYUNW[\x02YJ[K\\\x10\x0fAa\x01gHMOT\\',
 'SYDTNh\x18aP\x14WPG\x0fdVUhOM\x02\\YVEYQ\x08Cb\x18aIcXOJ\x0fkUF\x14ZMTT\x18`JhWQPV\x02OFZRZG\x0fY',
 'MUUOG\x0fa_Pb\x03[V^nR',
 '\x14ZQVW\x18OVfQQUW]Q\x01VUIUb\x18SFYW\x08C]\\',
 'B\x14EZCbkwPfQIOTfa\x01UW\x08V^h\x1b\x01HKM\x02Ua_F\x14E]T]]Q\x01kL\\J\x0fkbD\\\x03JNTk`FX\x03QPUdbFbFM',
 '\x0faa\x01kDZOT\\wTc\x03LG[aTIhI]N[q\x1b\x01HKM\x02[aaU`H\x08IXjY\x01\\DL\x02Pd_FUGa\x02bl_FhFPGS\x18\\Vh\x03PGa\x18SFYW\x08V^\x02dBfP\x08VW]Z\x01hRW',
 '\x0fZbU!\x10\\JT\x18`NUOT\x02UdNNY\x03_G]l',
 "PiW\x14\x02c`R\x01gWWXT\x18cBbL[JT\\'\x01gKMlWYQ\x01cQT[\x0flUF\x14UMOPa[T\x14RN\x02c`R\x01VXZPc%\\Vh\x03UCc[U\x01]Q\x08JTj",
 'IUQL\x10']

Once we have a working decrypt function, we can start trying to decrypt this message using different possible secret keys chosen from the dictionary.txt file (linked on the project page).

The goal of the project will be to use Python to judge how "good" a possible decrypted message is. We can then have Python try all ~$60,000$ possible secret keys to decrypt the message, then take the secret key that gives the "best" decrypted message.

Suggestion: Create a dictionary whose keys are the words used to decrypt the message and whose values represent how "good" the corresponding decrypted message is. We can then take the "maximum" of this dictionary (according to the values, see Week 12 notebook) to find the word that gave the "best" decryption.

In [64]:
with open('dictionary.txt', 'r') as f:
    text = f.read()
In [68]:
words = text.split()
In [69]:
print(words[:10])
['A', 'a', 'Aachen', 'Aalborg', 'aardvark', 'Aarhus', 'Aaron', 'AB', 'Ab', 'abaci']

Ideas for judging how "good" a decryption is¶

One idea is to look at the number of actual words that appear in a decrypted message. We can use the contents of dictionary.txt to help with this.

Reminder: We can check whether an object is part of a list using the in operator. For example, a in my_list will return True if my_list contains a and False otherwise.

In [71]:
my_list = ['Hello', 'Goodbye', 'Good morning']
In [72]:
print('Hello' in my_list)
print('bye' in my_list)
True
False

To help with efficiency, we can make use of another iterable data type in Python, namely sets. Sets are unordered collections of objects (like dictionary keys). We can similarly use the in operator to test whether a given element is in a set. How does the timing of this check compare to a similar check for lists? Consider the following experiment.

We first define a list of integers 0, 1, 2, ..., 999999, then create set version of the same data.

In [73]:
N = 10**6

my_list = [i for i in range(N)]
my_set = set(my_list)

Let's see how long it takes to check if 999999 is in the list versus in the set. Since this is a relatively quick operation, let's time how long it takes to make each check 100 times.

In [74]:
import time

n = N-1                        # Number to test for membership
num_checks = 100               # Number of times to test for membership

t0 = time.time()               # Store the starting time
for check in range(num_checks):
    n in my_list               # Check if n is a member of my_list
t1 = time.time()               # Store the stopping time
print('The list membership check took {} seconds.'.format(t1 - t0))

t0 = time.time()               # Store the starting time
for check in range(num_checks):
    n in my_set                # Check if n is a member of my_set
t1 = time.time()               # Store the stopping time
print('The set membership check took {} seconds.'.format(t1 - t0))
The list membership check took 2.3139748573303223 seconds.
The set membership check took 0.00015425682067871094 seconds.

This speed difference can be explained through the following analogy:

Suppose someone (e.g. someone making a Python list) has placed a sequence of $1,000,000$ books in a particular order. We are then asked whether Frankenstein is one of the books in this sequence. We would have no choice but to look through the entire sequence of books until we've either found Frankenstein or reached the end.

Suppose instead that a librarian (e.g. Python when a set is defined) places the books into appropriate shelves (perhaps by letter) to keep them organized. If we are asked whether Frankenstein is one for the books in this collection, we go to the shelf for books that start with "F" and check only that shelf.