Monday, November 3rd, 2025
We've seen several functions that can be used for processing text (str) data. For example, the .split, .replace, and .join methods can be used to manipulate strings. Today, we will look more into working with text data in Python.
Project Gutenberg is a repository containing over 75,000 free eBooks that are in the public domain. The code cell below will download the text from Frankenstein; Or, The Modern Prometheus by Mary Shelley and save it as a text file frankenstein.txt.
Note: Feel free to browse through the Project Gutenberg library and select any other eBook of your choice. Just make sure to download a plain-text version of the eBook (not PDF, EPUB, HTML, or any other formats).
import requests
data = requests.get('https://www.gutenberg.org/cache/epub/42324/pg42324.txt')
with open('frankenstein.txt', 'wb') as f:
    f.write(data.content)
Working with text files in Python
To get started, we will need to be able to load the downloaded text file into Python. The open function can be used in Python to open a file for reading or writing. For example, open('frankenstein.txt', 'r') will open the frankenstein.txt file for reading (signified by the argument 'r'). Once a file is opened, we can use the .read() method to read the contents of the file into a string.
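This particular text file uses UTF-8 encoding. The default encoding that the open function uses for text data depends on the platform and is not always UTF-8, so it is safer to state the encoding explicitly. We can include the optional argument encoding='utf-8-sig' when opening the file. This reads the file as UTF-8 and also discards a leading byte order mark (BOM) if the file has one.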
f = open('frankenstein.txt', 'r', encoding='utf-8-sig') # Opens the file for reading
text = f.read() # Reads the contents into a string
print(text[2000:3000])
rue that I am very averse to bringing myself forward in print; but as my account will only appear as an appendage to a former production, and as it will be confined to such topics as have connection with my authorship alone, I can scarcely accuse myself of a personal intrusion. It is not singular that, as the daughter of two persons of distinguished literary celebrity, I should very early in life have thought of writing. As a child I scribbled; and my favourite pastime, during the hours given me for recreation, was to "write stories." Still I had a dearer pleasure than this, which was the formation of castles in the air--the indulging in waking dreams--the following up trains of thought, which had for their subject the formation of a succession of imaginary incidents. My dreams were at once more fantastic and agreeable than my writings. In the latter I was a close imitator--rather doing as others had done, than putting down the suggestions of my own mind. What I wrote was intended at
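The effect of utf-8-sig can be seen on a small file. The sketch below is only an illustration: it writes a hypothetical file bom_demo.txt (not part of these notes) to a temporary directory, starting with the UTF-8 byte order mark (BOM), and then reads it back with two different encodings.

```python
import codecs
import os
import tempfile

# Hypothetical demo file, written to a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'bom_demo.txt')

# Write the UTF-8 byte order mark followed by some text
with open(demo_path, 'wb') as f:
    f.write(codecs.BOM_UTF8 + 'Frankenstein'.encode('utf-8'))

# Plain 'utf-8' keeps the BOM as an invisible first character
with open(demo_path, 'r', encoding='utf-8') as f:
    with_bom = f.read()

# 'utf-8-sig' discards the BOM if one is present
with open(demo_path, 'r', encoding='utf-8-sig') as f:
    without_bom = f.read()

print(len(with_bom), len(without_bom))  # 13 12
```

The extra character read with plain 'utf-8' is the BOM, which would otherwise end up attached to the first word of the text.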
When opening a file in Python, the file remains open to Python until it is closed using the .close method. If we forget to close the file, undesirable things can happen (for example, when writing to a file, buffered data may not be saved to disk until the file is closed).
f.close() # Closes the file
To ensure that we do not forget to close the file, we can use the with construct to have Python automatically close the file when we are done reading from it. To use this with construct, we assign a temporary variable name to an opened file. We then carry out any desired operations on this temporary variable in an indented block. Once the code exits the indented block, the file is closed.
The code below demonstrates how this works.
with open('frankenstein.txt', 'r', encoding='utf-8-sig') as f: # Opens the file for reading
    text = f.read() # Reads the content into a string
# The file `f` is now closed, since we have exited the `with` block.
print(text[2000:3000])
rue that I am very averse to bringing myself forward in print; but as my account will only appear as an appendage to a former production, and as it will be confined to such topics as have connection with my authorship alone, I can scarcely accuse myself of a personal intrusion. It is not singular that, as the daughter of two persons of distinguished literary celebrity, I should very early in life have thought of writing. As a child I scribbled; and my favourite pastime, during the hours given me for recreation, was to "write stories." Still I had a dearer pleasure than this, which was the formation of castles in the air--the indulging in waking dreams--the following up trains of thought, which had for their subject the formation of a succession of imaginary incidents. My dreams were at once more fantastic and agreeable than my writings. In the latter I was a close imitator--rather doing as others had done, than putting down the suggestions of my own mind. What I wrote was intended at
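One way to confirm that the with construct really closes the file is to inspect the file object's .closed attribute. The sketch below uses a hypothetical file with_demo.txt in a temporary directory so that it does not touch frankenstein.txt.

```python
import os
import tempfile

# Hypothetical demo file, written to a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')

with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('It was on a dreary night of November')
    open_inside = f.closed   # False: we are still inside the with block

closed_after = f.closed      # True: leaving the with block closed the file
print(open_inside, closed_after)  # False True
```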
Analyzing text data
Now that we have some text to work with, what questions can we ask/answer? Suppose that we want to explore the frequency of the words that appear in Frankenstein.
First, it will be helpful to obtain a list of the words that appear in Frankenstein. As a starting point, we can use the .split method to separate the text string into a list of substrings separated by any amount of white space. This will (generally) give us the list of words in text.
words = text.split()
print(words[:10])
['The', 'Project', 'Gutenberg', 'eBook', 'of', 'Frankenstein;', 'Or,', 'The', 'Modern', 'Prometheus']
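The difference between .split() with no argument and .split(' ') is easy to miss. The short sketch below uses a made-up sample string to show that the no-argument form treats any run of spaces, tabs, or newlines as a single separator, and that punctuation stays attached to the neighbouring word.

```python
sample = 'the  modern\tPrometheus\n'

# No argument: split on any run of whitespace, with no empty strings in the result
print(sample.split())      # ['the', 'modern', 'Prometheus']

# Explicit ' ': split on every single space only
print(sample.split(' '))   # ['the', '', 'modern\tPrometheus\n']

# Punctuation is not whitespace, so it stays attached to the word
print('Frankenstein; Or,'.split())  # ['Frankenstein;', 'Or,']
```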
Exercise: Construct a dictionary word_count_dict whose keys are words and whose values count how many times that word appears in Frankenstein.
To construct this dictionary:
- Look through every word in Frankenstein.
- If the word is not in our dictionary, add it to the dictionary with a value of 1.
  - We can test whether the dictionary word_count_dict contains the key word using the Boolean expression word in word_count_dict. This will be True if word is a defined key in word_count_dict and False otherwise.
- If the word is in the dictionary, increment the value associated with that word by 1.
word_count_dict = {}
for word in words:
    if word not in word_count_dict:
        word_count_dict[word] = 1
    else:
        word_count_dict[word] += 1
How many times does the word 'monster' appear in the text?
word_count_dict['monster']
21
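The same counting pattern can be written more compactly. The sketch below is an alternative, not the approach used in these notes: it counts a made-up list of words once with the dictionary method .get (which supplies 0 for words not yet seen) and once with collections.Counter from the standard library.

```python
from collections import Counter

# Made-up sample, not taken from the Frankenstein text
sample_words = 'the monster saw the creator and the monster fled'.split()

# Manual approach: .get returns 0 when the word is not yet a key
counts = {}
for word in sample_words:
    counts[word] = counts.get(word, 0) + 1

# Counter builds the same word-to-count mapping in one step
counter = Counter(sample_words)

print(counts['monster'], counter['monster'])  # 2 2
print(dict(counter) == counts)                # True
```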
We've generated a dictionary mapping words to their counts, but which word has the highest count? We can try using the max function to find the most frequently appearing word.
max(word_count_dict)
'•'
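What is happening here? Is the symbol above truly the most frequently appearing word in Frankenstein?
By default, taking the maximum of a dictionary returns the maximum of the keys of that dictionary. In this case, the keys are strings, which Python compares character by character using Unicode code points. For ordinary lowercase words this matches alphabetical order, but uppercase letters sort before lowercase ones, and symbols such as '•' sort after all of the English letters. The simple example below shows the maximum being taken over the keys.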
my_dict = {'a': 5,
'b': 3,
'c': 1}
max(my_dict)
'c'
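To see why a symbol can come out as the "maximum" string, it helps to look at the code points that Python compares. The list below is made up for illustration.

```python
symbols = ['zebra', 'Apple', 'apple', '•']

# Strings are ordered by the Unicode code points of their characters
ordered = sorted(symbols)
print(ordered)                       # ['Apple', 'apple', 'zebra', '•']
print([ord(s[0]) for s in ordered])  # [65, 97, 122, 8226]
print(max(symbols))                  # •
```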
In order to find the most frequently appearing word in word_count_dict, we'd rather take the maximum of the values (that is, of the word counts). We can use word_count_dict.values() to get a "list" of the values, and then take the maximum using the max function.
max(word_count_dict.values())
4263
This gives us the count for the most frequently appearing word, but it does not tell us which word has this count. This is illustrated in the simple example below.
max(my_dict.values())
5
What we'd really like to do is to find the "maximum" key, where we measure the "size" of a key by its associated value.
Finding maximums/minimums with a custom key
When using max to find the maximum value of an iterable, we can optionally include an argument key (a function) that tells the max function how the items are to be compared. This key function must be able to take in any of the items in the iterable, and return some type of data that can be compared (like integers, floats, strings, etc.).
For example, suppose we have a list of strings and we want to find the string that is the longest. We could use key=len inside the max function to find the longest string.
my_list = ['Two', 'One', 'Four', 'Three']
max(my_list)
'Two'
max(my_list, key=len)
'Three'
We can also supply a custom key when using the min function to find minimums.
min(my_list)
'Four'
min(my_list, key=len)
'Two'
Note: If there are multiple items that are maximums or minimums, Python will return the first of these items encountered in the list. In the example above, both 'Two' and 'One' have the minimum length of 3. Python returns 'Two' since it appears before 'One' in the list.
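The first-encountered rule for ties can be checked directly. The lists below are made up for this sketch; only their order differs.

```python
tied = ['Two', 'One', 'Six', 'Four']
reordered = ['One', 'Two', 'Six', 'Four']

# Three strings share the minimum length 3; the first one in the list wins
print(min(tied, key=len))       # Two
print(min(reordered, key=len))  # One

# The same rule applies to max when several items share the maximum
print(max(['Four', 'Five', 'One'], key=len))  # Four
```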
Let's define our own function that can be used to compare the strings in my_list.
def my_key(s):
    if s == 'One':
        return 1
    elif s == 'Two':
        return 2
    elif s == 'Three':
        return 3
    elif s == 'Four':
        return 4
min(my_list, key=my_key)
'One'
max(my_list, key=my_key)
'Four'
Returning to the Frankenstein text, we want to find the "maximum" word, where the size of each word is given by the number of times that word appears in the text.
Exercise: Write a function word_count_key that takes in a string word and returns the number of times that word appears in Frankenstein. You should use the word_count_dict that was defined earlier.
def word_count_key(word):
    return word_count_dict[word]
word_count_key('monster')
21
Exercise: Use the word_count_key function to find the most frequently appearing word in Frankenstein, along with the number of times that the word appears.
most_common_word = max(word_count_dict, key=word_count_key)
print(most_common_word)
the
most_common_word_count = word_count_dict[most_common_word]
print(most_common_word_count)
4263
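A separate key function is not strictly required. As a sketch of two alternatives (shown on a made-up toy_counts dictionary rather than word_count_dict), the dictionary's own .get method can serve as the key, or max can be applied to the (word, count) pairs directly.

```python
toy_counts = {'the': 5, 'monster': 2, 'and': 3}

# toy_counts.get(word) returns the count for word, so it works as a key function
top_word = max(toy_counts, key=toy_counts.get)

# Alternatively, compare (word, count) pairs by their second entry
top_pair = max(toy_counts.items(), key=lambda pair: pair[1])

print(top_word)  # the
print(top_pair)  # ('the', 5)
```

The second form returns the word and its count together, which avoids a second dictionary lookup.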
Sorting
The sorted function can take in an iterable structure (e.g. list, dictionary, etc.) and return a sorted list of those items.
my_list
['Two', 'One', 'Four', 'Three']
sorted(my_list)
['Four', 'One', 'Three', 'Two']
Just like the max and min functions, we can supply a key input argument to change the way that the items are sorted.
sorted(my_list, key=len)
['Two', 'One', 'Four', 'Three']
By default, the sorted function will sort in ascending order (i.e. from smallest to largest). We can switch to descending order using the optional argument reverse=True.
sorted(my_list, key=my_key)
['One', 'Two', 'Three', 'Four']
sorted(my_list, key=my_key, reverse=True)
['Four', 'Three', 'Two', 'One']
Note: When sorting a dictionary, the sorted function will return a list of sorted keys (and will drop the dictionary structure and associated values).
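The sketch below illustrates this note on a made-up toy_counts dictionary. Sorting the dictionary itself gives only the keys; sorting its .items() keeps each key paired with its value.

```python
toy_counts = {'monster': 2, 'the': 5, 'and': 3}

# Sorting the dictionary sorts its keys and drops the values
print(sorted(toy_counts))  # ['and', 'monster', 'the']

# Sorting the (key, value) pairs by value keeps the counts attached
pairs = sorted(toy_counts.items(), key=lambda pair: pair[1], reverse=True)
print(pairs)  # [('the', 5), ('and', 3), ('monster', 2)]
```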
Exercise: Use the sorted function along with the word_count_key function to sort the keys of word_count_dict from most frequently appearing to least frequently appearing. Then print out the 10 most frequently appearing words along with their corresponding word counts.
sorted_words = sorted(word_count_dict, key=word_count_key, reverse=True)
for word in sorted_words[:10]:
    print(word, word_count_dict[word])
the 4263
and 2966
of 2902
I 2794
to 2234
my 1680
a 1447
in 1175
was 1030
that 1012
Suppose we want to know where a word lies in the sorted_words list, i.e. what its word count ranking is. The .index method can be used to find where an item first appears in a list.
For example, sorted_words.index('monster') will return the index i such that sorted_words[i] == 'monster'.
sorted_words.index('monster')
371
Since list indices start at 0, the word 'monster' is the 372nd most frequently used word.
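Each call to .index searches the list from the start, and it raises a ValueError if the item is missing. If many rankings are needed, one option (a sketch on a made-up, already sorted toy_sorted list) is to build a dictionary from each word to its 1-based rank once.

```python
# Hypothetical list, assumed to be sorted from most to least frequent
toy_sorted = ['the', 'and', 'of', 'monster']

# enumerate(..., start=1) pairs each word with its 1-based position
rank = {word: position for position, word in enumerate(toy_sorted, start=1)}

print(rank['monster'])                  # 4
print(toy_sorted.index('monster') + 1)  # 4, the same rank via .index
print(rank.get('daemon'))               # None, since 'daemon' is not in the list
```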
Pre-processing text for analysis
In the code above, we looked through the text of Frankenstein and counted how many times each word appears. However, a single word may appear as several different keys in the word_count_dict dictionary, with each key being a slight variation of the word.
For example, the strings for, For, and FOR all appear as distinct keys in word_count_dict.
word_count_dict['for']
500
word_count_dict['For']
23
word_count_dict['FOR']
3
Can we modify our code to account for this? That is, can we make it so that only the string 'for' appears in word_count_dict, and each appearance of for, For, or FOR will be included in the word count for the string 'for'?
We can use the .lower method on a string to convert all upper case letters to lowercase, as shown in the example below. Similarly, the .upper method will convert all lowercase letters to uppercase.
s = 'ThiS Is a StRiNg with UppEr and LoweR cASe letTeRs'
print(s)
print(s.lower())
print(s.upper())
ThiS Is a StRiNg with UppEr and LoweR cASe letTeRs
this is a string with upper and lower case letters
THIS IS A STRING WITH UPPER AND LOWER CASE LETTERS
Exercise: Modify the code above that was used to create word_count_dict so that all keys are lowercase and the associated word counts include all instances of the word (regardless of capitalization).
words = text.split()
word_count_dict = {}
for word in words:
    lowercase_word = word.lower()
    if lowercase_word not in word_count_dict:
        word_count_dict[lowercase_word] = 1
    else:
        word_count_dict[lowercase_word] += 1
word_count_dict['for']
526
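The merged count of 526 is the sum of the three separate counts found earlier (500 + 23 + 3). The same behaviour can be checked on a small made-up sentence, as in the sketch below. It also shows that lowercasing alone does not remove punctuation attached to a word.

```python
# Made-up sample sentence, not taken from the Frankenstein text
sample = 'For the monster and for the creator, FOR ever'

counts = {}
for word in sample.split():
    key = word.lower()                    # 'For', 'for', and 'FOR' all become 'for'
    counts[key] = counts.get(key, 0) + 1

print(counts['for'])       # 3
print('For' in counts)     # False: only lowercase keys remain
print('creator' in counts, 'creator,' in counts)  # False True
```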