We've seen some examples of where dictionaries are a useful structure. Let's talk a little more about defining dictionaries.
One way is to create an empty dictionary (using {}) and add elements to that dictionary.
my_dict = {}
my_dict['a'] = 'Hello'
my_dict[1] = 'I go with 1.'
my_dict['Hello'] = 'Goodbye'
my_dict
We can also define a dictionary with a comma-separated list of <key>:<value> pairs:
my_dict = {'a':'Hello', 1:'I go with 1.', 'Hello':'Goodbye'}
my_dict
We can also use syntax similar to list comprehension to define dictionaries:
my_dict = {letter:number for number, letter in enumerate('abcde')}
my_dict
Once we have a dictionary, there are times when we may want to access just the keys, just the values, or the key/value pairs.
Use the .keys() method to get a "list" of the keys.
Use the .values() method to get a "list" of the values.
Use the .items() method to get a "list" of key/value pairs.
for key in my_dict.keys():
    print(key)
for value in my_dict.values():
    print(value)
for item in my_dict.items():
    print(item)
for key, value in my_dict.items():
    print(key, value)
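The quotation marks around "list" matter: these methods return view objects rather than true lists. Here is a minimal sketch of the difference, using a small dictionary `d` made up for illustration:

```python
d = {'a': 1, 'b': 2}
keys = d.keys()          # a view object, not a list
d['c'] = 3               # the view reflects later changes to the dictionary
key_list = list(keys)    # convert to a real list when we need indexing
print(key_list)          # ['a', 'b', 'c']
print(key_list[0])       # a
```

If we only need to loop over the keys, the view is enough; converting with list() is only needed for list operations such as indexing.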
Recall: We can define functions to take in a specified number of positional arguments and some optional default arguments.
def f(a,b,c,d=7,e=8):
    return a+b+c+d+e
f(1,2,3)
f(1,2,3,4)
f(1,2,3,e=4)
Sometimes, it's helpful to be able to write functions when we don't know how many inputs they will accept, and/or we don't know what kinds of keyword arguments will be supplied.
For example, the max function can take in a comma-separated list of values and will return the largest.
help(max)
max(1,2,3,4,5,6,7,8,9,10,11,12)
To do this, we can use an * when defining a function to collect any unaccounted-for positional arguments:
def f(a,b,*args):
    print(a,b)
    print(args)
f(1,2)
f(1,2,3,4,5)
def my_sum(*args):
    return sum(args)
my_sum(1,2,3,4)
def product(*args):
    prod = 1
    for arg in args:
        prod *= arg
    return prod
product(1,2,3,6,7)
We can also use an * to unpack a list as positional arguments for a function:
product(*[1,2,3,4])
Suppose we have a function that takes in some input variables a, b, c, d. Suppose we also have a dictionary with keys 'a', 'b', 'c', and 'd'.
def f(a,b,c,d):
    return a + b + c + d
my_dict = {'a':'Hello', 'b': ' ', 'c':'Goodbye', 'd':'!'}
We can use the dictionary to define keyword arguments to plug into our function using **:
f(**my_dict)
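Just as * can collect extra positional arguments in a function definition, ** can collect extra keyword arguments into a dictionary. Here is a minimal sketch; the function name `describe` and its arguments are made up for illustration:

```python
def describe(*args, **kwargs):
    # args is a tuple of the extra positional arguments
    # kwargs is a dictionary of the extra keyword arguments
    return args, kwargs

pos, kw = describe(1, 2, color='red', size=3)
print(pos)  # (1, 2)
print(kw)   # {'color': 'red', 'size': 3}
```

This is the piece that lets a function accept keyword arguments it did not name in advance.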
import matplotlib.pyplot as plt
#help(plt.plot)
Let's see an example in practice. Suppose that I want to plot several curves, each with the same line style and line width.
import numpy as np
import matplotlib.pyplot as plt
t = np.linspace(0,2*np.pi,1000)
y1 = np.cos(t)
y2 = np.sin(t)
y3 = np.cos(t)**2
y4 = np.sin(t)**2
my_dict = {'linestyle':'dashed', 'linewidth':3}
plt.plot(t,y1, **my_dict)
plt.plot(t,y2, **my_dict)
plt.plot(t,y3, **my_dict)
plt.plot(t,y4, **my_dict)
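If one curve should differ slightly from the shared style, we can unpack the shared dictionary inside a new dictionary and override individual entries. This is a small sketch using plain dictionaries only; the names `shared` and `thick` are made up for illustration:

```python
shared = {'linestyle': 'dashed', 'linewidth': 3}
# Later entries win, so 'linewidth' is overridden while 'linestyle' is kept.
thick = {**shared, 'linewidth': 6, 'color': 'red'}
print(thick)   # {'linestyle': 'dashed', 'linewidth': 6, 'color': 'red'}
print(shared)  # {'linestyle': 'dashed', 'linewidth': 3}
```

The new dictionary could then be passed along in the same way, e.g. plt.plot(t, y1, **thick), while the original shared dictionary is left unchanged.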
The regular expressions package, re, is a powerful tool for processing text data.
import re
We've seen how to use the .replace method to replace some text with new text:
s = "This string doesn't use contractions. The last sentence wasn't a lie."
s.replace("doesn't", 'does not')
In the above example, we explicitly replaced the string doesn't with does not.
Very often, we may want to make more flexible replacements. The re package gives us this flexibility. Some of the key functions that we will use are:
re.findall: this is used to find substrings that match a given pattern
re.sub: this is used to replace substrings with desired text
help(re.findall)
s = 'This is a test string. Lalalala. Happy Monday! Tis the season.'
re.findall('la', s)
The patterns that we use for regular expressions can be much more general than just explicit strings. For example,
\w can be used to represent any word character
\d can be used to represent any digit character
\s can be used to represent any white space character
\b can be used to represent the boundary of a word (either the start or the end)
re.findall(r'\w\ws\s', s)
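To get a feel for these special sequences, here is a small sketch that applies a few of them to a short string. The string `sample` is made up for illustration:

```python
import re

sample = 'Room 42 opens at 9am.'
digits = re.findall(r'\d', sample)      # every digit character
spaces = re.findall(r'\s', sample)      # every white space character
firsts = re.findall(r'\b\w', sample)    # the first character of each word
print(digits)  # ['4', '2', '9']
print(spaces)  # [' ', ' ', ' ', ' ']
print(firsts)  # ['R', '4', 'o', 'a', '9']
```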
Last time, I downloaded the text of Frankenstein. Let's read it in and explore it with regular expressions:
with open('frankenstein.txt',encoding='utf-8') as f:
    text = f.read()
Let's find all four-letter words that end with s:
#re.findall('\s\w\w\ws\s', text)
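Rather than using \s to signify a space before or after our words, we can also use \b to designate a word boundary. A boundary is not a character itself: it is the zero-width position between a word character and a non-word character (such as white space or punctuation), or the start or end of the string.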
set(re.findall(r'\b\w\w\w\w\w\w\ws\b', text))
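The difference between \s and \b shows up at the start of a string, at the end of a string, and next to punctuation. Here is a small sketch on a made-up string `sample`:

```python
import re

sample = 'cats hats. Maps'
# \s needs an actual white space character on both sides of the word.
with_s = re.findall(r'\s\w\w\ws\s', sample)
# \b only needs a boundary, so string edges and punctuation are fine.
with_b = re.findall(r'\b\w\w\ws\b', sample)
print(with_s)  # []
print(with_b)  # ['cats', 'hats', 'Maps']
```

Another advantage of \b is that it doesn't consume any characters, so the matches contain just the words, without surrounding spaces.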
Instead of manually typing \w over and over, we can write \w{8} to denote exactly 8 \w characters.
set(re.findall(r'\b\w{14}s\b', text))
s = 'Today is 4/14/2025. Tomorrow is 4/15/2025. The time is 4:43PM. In eight hours, it will be 12:43AM.'
re.findall(r'\d{1,2}/\d{1,2}/\d{4}', s)
We can use square brackets to specify a collection of valid characters to match:
re.findall(r'\d{1,2}:\d{2}[AP]M', s)
Suppose we want to replace dates that are in the format mm/dd/YYYY with a new format, mm - dd - YYYY.
help(re.sub)
def reformat_date(date):
    # date is a match object; group(0) is the full matched text
    return date.group(0).replace('/', ' - ')
re.sub(r'\d{1,2}/\d{1,2}/\d{4}', reformat_date, s)
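As an alternative to passing a function, re.sub also accepts a replacement string that refers back to parenthesized groups in the pattern with \1, \2, and so on. Here is a sketch that reorders the date pieces; the YYYY-mm-dd target format is just an assumption for illustration:

```python
import re

s = 'Today is 4/14/2025. Tomorrow is 4/15/2025.'
# Three groups: month, day, year.
pattern = r'(\d{1,2})/(\d{1,2})/(\d{4})'
# \3 is the year, \1 the month, \2 the day.
reordered = re.sub(pattern, r'\3-\1-\2', s)
print(reordered)  # Today is 2025-4-14. Tomorrow is 2025-4-15.
```

A replacement function is more flexible, but a replacement string is often enough when we only need to rearrange pieces of the match.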
Last time, we started looking at using the Regular Expressions package (re) to process text data.
import re
s = 'This is a sample string. It has some numbers: 1, 2, 100, 5000.'
Suppose we want to find any numbers that are in the string s.
We would like to find as many adjacent digit characters as possible. We can use \d for a digit character along with the quantifier + (one or more) to seek as many digit characters as possible.
pattern = r'\d'
matches = re.findall(pattern, s)
print(s)
print(matches)
pattern = r'\d+'
matches = re.findall(pattern, s)
print(s)
print(matches)
We can also create groups of acceptable characters for matching by using square brackets [].
Let's try to find any word that has only lowercase letters.
Warning: Whenever we construct patterns for use with regular expressions, it is advisable to use "raw" strings. Python won't do any special interpretation of backslashes in raw strings. We can define a raw string by prefixing the opening quote with an r:
pattern = r'\b[abcdefghijklmnopqrstuvwxyz]+\b'
matches = re.findall(pattern, s)
print(s)
print(matches)
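To see why raw strings matter, consider \b. In an ordinary Python string, '\b' is a single backspace character, so the regular expression never sees the word-boundary code. Here is a small sketch on a made-up string:

```python
import re

plain_len = len('\b')    # 1: a single backspace character
raw_len = len(r'\b')     # 2: a backslash followed by the letter b
sample = 'a cat sat'
plain_matches = re.findall('\bcat\b', sample)   # looks for backspace characters
raw_matches = re.findall(r'\bcat\b', sample)    # looks for word boundaries
print(plain_len, raw_len)  # 1 2
print(plain_matches)       # []
print(raw_matches)         # ['cat']
```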
Note: We can use hyphens to designate a range of letters:
pattern = r'\b[a-z]+\b'
matches = re.findall(pattern, s)
print(s)
print(matches)
s = 'This is a sample string. It has some numbers: 1, 2, 100, 5000. NOW I AM YELLING!'
pattern = r'\b[A-Z]+\b'
matches = re.findall(pattern, s)
print(s)
print(matches)
s = 'This is a sample string. It has some numbers: 1, 2, 100, 5000. NOW I AM YELLING!'
pattern = r'\b[A-m]+\b'
matches = re.findall(pattern, s)
print(s)
print(matches)
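Ranges such as [A-m] follow the order of the underlying character codes, not the alphabet. In that ordering all uppercase letters come before all lowercase letters, with a few punctuation characters in between, so a range like [A-z] matches more than just letters. Here is a small sketch with a made-up test string:

```python
import re

# Characters whose codes fall between 'Z' and 'a'
between = [chr(c) for c in range(ord('Z') + 1, ord('a'))]
print(between)  # ['[', '\\', ']', '^', '_', '`']

loose = re.findall(r'[A-z]+', 'snake_case [x]')
strict = re.findall(r'[A-Za-z]+', 'snake_case [x]')
print(loose)   # ['snake_case', '[x]']
print(strict)  # ['snake', 'case', 'x']
```

When we want letters only, [A-Za-z] is the safer choice.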
Let's find all words that start with a capital letter and are otherwise lowercase:
s = 'This is a sample string. It has some numbers: 1, 2, 100, 5000. NOW I AM YELLING!'
pattern = r'\b[A-Z][a-z]*\b'
matches = re.findall(pattern, s)
print(s)
print(matches)
Let's get some more "interesting" text data to work with. We'll use the requests package to download information from the Math Department's webpage.
import requests
help(requests.get)
url = 'https://www.buffalo.edu/cas/math/people/faculty.html'
webpage = requests.get(url)
We can use the .text attribute on the object returned by requests.get to get the text:
text = webpage.text
To make the text a little more readable, let's replace any > with >\n:
text = text.replace('>','>\n')
#print(text)
Can we find my name somewhere in this HTML code?
pattern = r'Jonathan Lottes'
re.findall(pattern, text)
Looking through the HTML code, it looks like we can identify faculty members based on the following:
<a aria-label="Faculty:FACULTY MEMBER"
pattern = r'<a aria-label="Faculty:[A-Za-z ]+"'
matches = re.findall(pattern, text)
for match in matches[:10]:
    print(match)
Note: Our pattern for identification included extra text that we don't actually want. We just want the names. We can use parentheses () inside our pattern to denote match groups; re.findall then returns only the text captured by the group.
s = 'My name is Jon Lottes. I am 35 years old.'
pattern = r'My name is ([A-Za-z ]+)'
re.findall(pattern, s)
pattern = r'I am (\d+ years) old'
re.findall(pattern, s)
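When a pattern contains more than one group, re.findall returns a list of tuples, with one entry per group. Here is a small sketch that pulls out the first name, last name, and age in one pass; the combined pattern is made up for illustration:

```python
import re

s = 'My name is Jon Lottes. I am 35 years old.'
# Three groups: first name, last name, age.
pattern = r'My name is ([A-Za-z]+) ([A-Za-z]+)\. I am (\d+)'
found = re.findall(pattern, s)
print(found)  # [('Jon', 'Lottes', '35')]

first, last, age = found[0]
print(first, last, age)  # Jon Lottes 35
```

Note that the age comes back as the string '35'; we would need int(age) to use it as a number.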
pattern = r'<a aria-label="Faculty:([A-Za-z ]+)"'
matches = re.findall(pattern, text)
for match in matches[:10]:
    print(match)
Can we find phone numbers?
It looks like my information is contained in a block of HTML code. Can we use regular expressions to identify these blocks?
It looks like <div class="clearfix"> comes after each block of information, and <div class="profileinfo-teaser teaser-block"> comes before each block.
We want to match any number of any characters between these starting and ending strings. One way to do this is to use opposites in square brackets, e.g. [\w\W]*, which matches any character at all, including newlines.
block_start = r'<div class="profileinfo-teaser teaser-block">'
block_end = r'<div class="clearfix">'
pattern = block_start + r'[\w\W]*' + block_end
matches = re.findall(pattern, text)
len(matches)
Note: By default, the quantifiers * and + are "greedy": they match as many characters as possible.
s = 'This is a test! Hello to you! Goodbye!'
# Let's find runs of characters that end in an exclamation point.
pattern = r'[\w\W]+!'
re.findall(pattern, s)
If we want each match to stop as soon as possible (a "lazy" match), we can put a question mark after the + quantifier (and similarly for *).
pattern = r'[\w\W]+?!'
re.findall(pattern, s)
For the HTML information blocks, we want to stop as soon as we find a suitable match.
block_start = r'<div class="profileinfo-teaser teaser-block">'
block_end = r'<div class="clearfix">'
pattern = block_start + r'[\w\W]*?' + block_end
matches = re.findall(pattern, text)
len(matches)
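An alternative to [\w\W] is the dot, which matches any character except a newline by default. Passing the re.DOTALL flag lets the dot match newlines as well. Here is a small sketch on a made-up snippet of HTML:

```python
import re

html = '<p>one\ntwo</p><p>three</p>'
# Without the flag, '.' cannot cross the newline inside the first paragraph.
no_flag = re.findall(r'<p>.*?</p>', html)
# With re.DOTALL, '.' matches newlines too.
with_flag = re.findall(r'<p>.*?</p>', html, re.DOTALL)
# The [\w\W] approach gives the same result without a flag.
opposites = re.findall(r'<p>[\w\W]*?</p>', html)
print(no_flag)    # ['<p>three</p>']
print(with_flag)  # ['<p>one\ntwo</p>', '<p>three</p>']
print(with_flag == opposites)  # True
```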
Let's try to pick out name information, phone number information, and email address information from each block.
match = matches[22]
#print(match)
name_pattern = r'alt="([\w ,\._]+)\. "'
name = re.findall(name_pattern,match)[0]
print(name)
email_pattern = r'[\w\d\.]+@[\w\d]+\.[\w\d]+'
email = re.findall(email_pattern, match)[0]
print(email)
phone_pattern = r'Phone: (\(\d{3}\) \d{3}-\d{4})'
phone = re.findall(phone_pattern, match)[0]
print(phone)
Now we can iterate through each block and pull out the name, email, and phone.
block_start = r'<div class="profileinfo-teaser teaser-block">'
block_end = r'<div class="clearfix">'
pattern = block_start + r'[\w\W]*?' + block_end
faculty_blocks = re.findall(pattern, text)
names = []
phones = []
emails = []
for faculty_block in faculty_blocks:
    name = re.findall(name_pattern,faculty_block)[0]
    email = re.findall(email_pattern, faculty_block)[0]
    phone = re.findall(phone_pattern, faculty_block)[0]
    names.append(name)
    phones.append(phone)
    emails.append(email)
for name, phone, email in zip(names, phones, emails):
    print(name)
    print(phone)
    print(email)
    print()
It looks like we need to revisit our name matching pattern to make some adjustments. It's based on the picture included for each faculty member, but not every member has a picture.
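Until the name pattern is adjusted, one way to keep the loop from failing on a block with no match is to fall back to a default value instead of indexing with [0] directly. This is only a sketch: the helper `first_match` and the two sample blocks are made up for illustration and are not taken from the actual webpage.

```python
import re

def first_match(pattern, text, default=None):
    """Return the first match of pattern in text, or default if there is none."""
    found = re.findall(pattern, text)
    return found[0] if found else default

# Hypothetical blocks for illustration: the second has no picture alt text.
blocks = ['<img alt="Ada Lovelace. "> Phone: (716) 555-0100',
          'Phone: (716) 555-0101']
name_pattern = r'alt="([\w ,\._]+)\. "'
phone_pattern = r'Phone: (\(\d{3}\) \d{3}-\d{4})'

names = [first_match(name_pattern, block) for block in blocks]
phones = [first_match(phone_pattern, block) for block in blocks]
print(names)   # ['Ada Lovelace', None]
print(phones)  # ['(716) 555-0100', '(716) 555-0101']
```

The None entries also make it easy to see which blocks the name pattern still misses.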