In this assignment, you’ll take what you learned about using loops and mutation over the past few weeks, and apply these programming techniques to a few new problem domains. In Parts 1 and 2, you will generate English texts by creating and using simple computational models. In Part 3, you will learn about sentiment analysis and debug a small program that analyses the sentiments found in movie reviews. And finally, in Part 4 you will work with the Canadian Forest Fire Weather Index system. What an exciting assignment!
To obtain the starter files for this assignment:
- In your csc110/assignments/ folder, put all the starter files in a new a3 folder. This should look similar to what you had for Assignment 1.
- Mark the a3 folder as “Sources Root” by right-clicking on it and selecting Mark Directory as -> Sources Root.

This assignment contains a mixture of both written and programming questions. All of your written work should be completed in the a3.tex starter file using the LaTeX typesetting language. You went through the process of getting started with LaTeX in the Software Installation Guide, but for a quick start we recommend using the online platform Overleaf for LaTeX. Overleaf also provides many tutorials to help you get started with LaTeX.
Your programming work should be completed in the different starter files provided (each part has its own starter files). We have provided code at the bottom of each file for running tests and/or PythonTA on each file. We are not grading doctests on this assignment, but encourage you to add some as a way to understand each function we’ve asked you to complete. We are using PythonTA to grade your work, so please run that on every Python file you submit using the code we’ve provided.
Warning: one of the purposes of this assignment is to evaluate your understanding and mastery of the concepts that we have covered so far. So on this assignment, you may only use parts of the Python programming language that we have covered in the first five weeks of lecture (i.e., up to and including all of Chapter 5 in the notes). Other parts are not allowed, and parts of your submissions that use them may receive a grade as low as zero for doing so.
- +=)
- text.lower() rather than str.lower(text))

Computers have become very good at predicting the next word in a sentence. Consider, for example, the touch-based keyboards on a smartphone. The accuracy of these predictions is based on two things: the model and the data used to “train” that model. In Parts 1 and 2, we will see how we can use textual data to create a model that generates a series of semi-sensible sentences.
Consider the short sentence 'Hello Hello Amy was here'. We can create a dictionary that maps each word in the sentence to the number of times it appears in the sentence: {'Hello': 2, 'Amy': 1, 'was': 1, 'here': 1}.
One naive way to generate new sentences from this data is to randomly select words from this sentence and join them together to generate a new sentence. Here is the start of a function to do so:

def generate_text_uniform(model: dict[str, int], n: int) -> str:
    """Return a string of n randomly-generated words chosen from the given model.

    Each word in the returned string is separated by a single space.

    Preconditions:
        - n >= 0
        - model != {}
    """

For example, here is how we might call this function on the words in our given sentence:

>>> words = {'Hello': 2, 'Amy': 1, 'was': 1, 'here': 1}
>>> generate_text_uniform(words, 5)
'was Amy was was here'
>>> generate_text_uniform(words, 5)
'Hello Hello Amy here Hello'

This function makes n random choices from the given words, and can be implemented using a loop with the following structure:

import random


def generate_text_uniform(model: dict[str, int], n: int) -> str:
    """Return a string of n randomly-generated words chosen from the given model.

    Each word in the returned string is separated by a single space.

    Preconditions:
        - n >= 0
        - model != {}
    """
    # Unpack model into two lists whose indexes correspond to one another
    words = []
    word_frequencies = []
    for word in model:
        words.append(word)
        word_frequencies.append(model[word])

    new_words = random.choices(words, weights=word_frequencies, k=n)
    return str.join(' ', new_words)

First, we create two parallel lists: words (a list[str]) and word_frequencies (a list[int]). The string at index 0 in words appears word_frequencies[0] times, and likewise for every other index. We pass these parallel lists to random.choices, a function we have not seen yet (not to be confused with random.choice). Our call to random.choices randomly chooses strings from words based on the weights in word_frequencies. In addition, we specify the k argument so that random.choices returns a list[str] containing n elements. Finally, recall that the built-in method str.join takes a separator str and a list[str], and returns a new string containing all the words in the given list, separated by the separator.
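To get a feel for random.choices before relying on it, here is a minimal sketch that can be pasted into the Python console. The seed value and the variable names picks and sentence are illustrative assumptions, not part of the starter code; seeding only makes repeated runs produce the same choices.

```python
import random

# Parallel lists: words[i] appears word_frequencies[i] times in the sentence.
words = ['Hello', 'Amy', 'was', 'here']
word_frequencies = [2, 1, 1, 1]

# Seeding is optional; it only makes the "random" choices repeatable.
random.seed(110)

# 'Hello' has weight 2 out of a total weight of 5, so each individual pick
# is 'Hello' with probability 2/5 and any other word with probability 1/5.
picks = random.choices(words, weights=word_frequencies, k=5)

# str.join takes the separator first, then the list of strings.
sentence = str.join(' ', picks)
print(sentence)
```

Because the same word can be picked more than once, the result may repeat words even when their frequency in the model is 1.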
1. Open a3.tex and answer the following questions.
   (a) Complete the loop accumulation table for the example call to generate_text_uniform(words, 5) where words = {'Hello': 2, 'Amy': 1, 'was': 1, 'here': 1}. For this question, you can assume that the for loop goes in the order in which the key-value pairs appear in (so 'Hello', then 'Amy', then…).
   (b) Explain why it wouldn’t be a good idea to include this example for generate_text_uniform to check with doctest:
   (c) What inputs to generate_text_uniform could we write doctest examples for? (i.e., inputs where the problem you described in part (b) doesn’t apply.)
2. Open a3_part1.py and complete the function create_model_uniform according to its docstring. You may want to check out a3_sample_tests.py, which does not need to be submitted.
generate_text_uniform doesn’t return meaningful sentences, whether we give it a “small” dictionary of words or the entire collection of words in the English language. The reason is its model, which makes independent random choices for each word in the generated output. The reason independent word choices generate nonsensical sentences is that in English (and every other language), word order matters: each word in a sentence is very much tied to the previous and next words.
In this part, we’ll develop and use a text model that stores not just the words from our input data, but some context of how each word is used as well.
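Consider the following sentence: 'I really really like chocolate'.
The one-word context model for this text is a dictionary where:
- each key is a word in the text that is followed by at least one other word; and
- the corresponding value is the list of words that immediately follow that word in the text (its follow list).
The one-word context model for 'I really really like chocolate' is the following: {'I': ['really'], 'really': ['really', 'like'], 'like': ['chocolate']}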
This model tells us:
- 'I' can be followed by 'really'. Its follow list is ['really'].
- 'really' can be followed by 'really' or 'like'.
- 'like' can be followed by 'chocolate'.
- 'chocolate' has no words that follow it. So it is not included as a key.

What about punctuation? To simplify our model, we consider punctuation as part of the word. This means that the word 'chocolate' is not the same as 'chocolate.' nor is it the same as 'chocolate!'.
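These observations can be checked directly in the Python console. The snippet below is only an illustration of reading a one-word context model; the variable names are made up for this example.

```python
# The one-word context model for 'I really really like chocolate'.
model = {
    'I': ['really'],
    'really': ['really', 'like'],
    'like': ['chocolate'],
}

# Looking up a key gives the follow list for that word.
follow_really = model['really']

# 'chocolate' ends the sentence, so it never becomes a key.
has_chocolate_key = 'chocolate' in model

# Punctuation is part of the word, so these are two different strings.
same_word = 'chocolate' == 'chocolate.'

print(follow_really, has_chocolate_key, same_word)
```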
(Not to be handed in, but useful to complete) answer the following questions:
Write the one-word context model for the following string:
Write a string whose one-word context model is the following:
In a3.tex, write the one-word context model for the following string:
In a3_part2.py, implement the functions update_follow_list and create_model_owc.
Some details about the model:
- 'love' and 'Love' are treated as different words, and have different follow lists.

Now we’re going to use our one-word context model to generate text, by implementing the following function:

def generate_text_owc(count: int, transitions: dict[str, list[str]]) -> str:
    """Return a string containing (count - 1) randomly generated words based on the data in
    transitions, which maps each word to the list of words that follow it.

    A randomly generated word is selected from the keys of transitions when:
        - it is the first word; or
        - the last randomly generated word is not a key in transitions.

    A randomly generated word is selected from the follow list of a key in transitions when the
    last randomly generated word is a key in transitions. In addition, one occurrence of the word
    selected from the follow list is removed from the follow list (i.e., mutation). When there are
    no words in the follow list for a key, the key-value pair is also removed from transitions
    (i.e., mutation).

    Your implementation MUST use the helper functions: choose_from_keys and choose_from_follow_list.
    We recommend completing these functions first, as they are simpler and will get you thinking about
    how to use them here.

    Preconditions:
        - transitions is in the format described by the assignment handout
    """
    # ACCUMULATOR: a list of the randomly-generated words so far
    words_so_far = []

    # We've provided this template as a starting point; you may modify it as necessary.
    current_word = ''
    for _ in range(count - 1):
        ...

    return str.join(' ', words_so_far)

This function behaves similarly to generate_text_uniform, except that now the random choices are not independent of each other. Instead, whenever a word w is chosen, the next word to be chosen must come from the follow list of w stored in transitions. And for each word w, each occurrence of a word in the follow list of w can only be selected once. For example, in this model:

    {
        'I': ['like', 'really'],
        'like': ['chocolate.', 'chocolate.'],
        'really': ['really', 'like']
    }

The word 'like' can only follow the word 'I' once in the generated sentence. But the word 'chocolate.' can follow 'like' twice. What is the longest sentence we can generate from this model?
To accomplish the above, we will mutate transitions and the list[str] values in transitions as we “randomly choose” words from it. Sometimes we will need to randomly choose a key from transitions, like when we are generating the very first word or the last word is not a key in transitions (see the choose_from_keys helper function). Other times, when the last word (i.e., the context) is in transitions, we will randomly choose a string from a follow list in transitions. This is when we mutate (see the choose_from_follow_list helper function). Let’s try an example.
Consider our earlier model:

>>> my_model = {
...     'I': ['like', 'really'],
...     'like': ['chocolate.', 'chocolate.'],
...     'really': ['really', 'like']
... }

At the beginning, we need to select the “first word” for our randomly generated text. The possible first words are the keys of my_model: 'I', 'like', and 'really'.
Suppose we randomly selected 'like'. This means our second word will come from the follow list that corresponds with the key 'like': ['chocolate.', 'chocolate.'].
Suppose we randomly select 'chocolate.' (though we didn’t have much choice!) and add it to our randomly generated text. We also need to remove one occurrence of 'chocolate.' from the follow list. So our string becomes: 'like chocolate.'. And our model is now: {'I': ['like', 'really'], 'like': ['chocolate.'], 'really': ['really', 'like']}
The word 'chocolate.' is not a key in my_model. So we will need to choose a random word from the keys of my_model, just like we did for our very first word. Suppose we randomly select 'like' again. Its follow list is down to just: ['chocolate.']. So when this second 'chocolate.' word is selected and removed from the follow list, 'like' maps to an empty list []. When this happens, we remove the key-value pair from the model. So our string becomes: 'like chocolate. like chocolate.'. And our model is now: {'I': ['like', 'really'], 'really': ['really', 'like']}
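The two mutations in this walkthrough can be replayed by hand. The sketch below hard-codes the choices instead of making them randomly, and uses list.remove and dict.pop as one possible way to perform the mutations; it is not the required implementation of the helper functions.

```python
my_model = {
    'I': ['like', 'really'],
    'like': ['chocolate.', 'chocolate.'],
    'really': ['really', 'like'],
}

# First visit to 'like': remove ONE occurrence of the chosen follower.
list.remove(my_model['like'], 'chocolate.')
after_first = list(my_model['like'])  # a copy, kept only for inspection

# Second visit to 'like': the follow list becomes empty...
list.remove(my_model['like'], 'chocolate.')
if my_model['like'] == []:
    # ...so the whole key-value pair is removed from the model.
    dict.pop(my_model, 'like')

print(after_first)
print(my_model)
```

Note that 'like' still appears inside the follow lists of 'I' and 'really'; it has only stopped being a key.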
These words are rather meaningless. When you’re done, be sure to try out run_example on one of our sample text files and see what comes out! Here is a small taste:
>>> run_example('data/texts/sample_text_raw.txt')
'has no control about the Little Blind Text, that where it came from the far away, behind the countries Vokalia and again. And if she reached the headline of the way. When she had a large language ocean. A small river named Duden flows by their place and supplies it with the all powerful Pointing the word mountains, far World of blind texts. Separated they live the subline of the first hills of sentences fly into their agency, where they abused her seven versalia, put her drunk with Longe and the belt and devious Semikoli, but the copy warned the Little Blind Text should turn around and so it didn\'t listen. She packed her way. On her and return to leave for their projects again and made herself on the Little Blind Text didn\'t take long until a last view back on the necessary regelialia. It is a copy. The Big Oxmox advised her not to do so, because there live in which roasted parts of Alphabet Village and Consonantia, there were thousands of her cheek, then they are still using her. a small line of Grammar. The copy said could convince her initial into the name of bad Commas, wild Question Marks and everything that was left from its origin would have been rewritten, then she continued her into your mouth. Even the Italic Mountains, she met a paradisematic country, in Bookmarksgrove right at the blind texts it is an almost unorthographic life One day however a thousand times and Parole and the blind text by Copy Writers ambushed her, made her own road, the coast of Lorem Ipsum decided to its own, safe country. But nothing the Line Lane. Pityful rethoric question ran over her for the word "and" and dragged her way she hasn\'t been rewritten a few insidious would be the skyline of her hometown Bookmarksgrove, the Semantics, a far from it'

In a3_part2.py, complete the functions generate_text_owc, choose_from_keys, and choose_from_follow_list (the latter two are helper functions for generate_text_owc).
Test your functions carefully in the Python console (you can copy-and-paste examples from this handout, and make up your own). You may also want to check out a3_sample_tests.py, which does not need to be submitted.
A movie review typically includes a critic’s comments and a score, but the score doesn’t portray the sentiment and raw emotion in the critic’s comments. Professor Mario has decided to build his own movie review platform that labels reviews as positive, negative, or neutral based on the intensity of the words used in the review.
Professor Mario uses the VADER Lexicon, which we represent as a mapping from words to a positive/negative intensity score. For example, here are two key words and their intensities:
| Word | Intensity |
|---|---|
| awesome | 3.1 |
| awful | -2.0 |
The polarity of a review is one of {'positive', 'negative', 'neutral'}, and is calculated by finding the average intensity of the lexicon words used in the review. Words that don’t appear in the VADER Lexicon are ignored when calculating the average.
| Polarity | Average intensity |
|---|---|
| positive | |
| neutral | |
| negative | |
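The averaging step can be sketched as follows. This is a rough illustration, not Professor Mario's program: the lexicon holds only the two words shown above, the review is made up, and returning 0.0 for a review with no lexicon words is an assumption of this sketch.

```python
# A tiny stand-in lexicon using the two example intensities shown earlier.
lexicon = {'awesome': 3.1, 'awful': -2.0}

# A made-up review, already split into lowercase words.
review_words = ['an', 'awesome', 'cast', 'but', 'an', 'awful', 'script']

# Keep only the intensities of words that appear in the lexicon;
# every other word is ignored.
intensities = [lexicon[word] for word in review_words if word in lexicon]

# Average over the lexicon words only. The 0.0 fallback for a review with
# no lexicon words is an assumption made for this sketch.
if intensities == []:
    average = 0.0
else:
    average = sum(intensities) / len(intensities)

print(average)
```

Only two of the seven words contribute here, so the average is taken over two values rather than seven.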
In a3_part3.py, Professor Mario has written a small program to read review data from a csv file, extract the lexicon words from the review, and then compute the review’s average intensity and polarity. Professor Mario has painstakingly gone through three reviews and manually calculated their polarity and average intensity. Unfortunately, when Professor Mario compares his calculated results to the return values from his program, he finds that his program has some errors!
Answer the following questions about this program and pytest report. Write your responses in a3.tex, except for 2(b).
1. Run the program to generate a pytest report.
   Based on the report, state which tests, if any, passed, and which tests failed. You can just state the name of the tests and whether they passed/failed, and do not need to give explanations here.
2. For each failing test from the pytest report:
   (a) Explain what error is causing the test to fail.
       Hint: each test refers to a data file under data/reviews; the last column of each csv file stores the text of the review.
   (b) Edit a3_part3.py by fixing the function code so that all tests pass. Unlike Assignment 1, the tests do not contain errors.
       The changes should be small and must be to fix errors only; the original purpose of all functions and tests must remain the same. The expected value in the tests is correct; do not change it.
3. For each test that passed on the original code (before your changes in question 2), explain why that test passed even though there were errors in the Python file.
In Part 4, our problem domain is forest fires. Specifically, we take a close look at the Canadian Forest Fire Weather Index system. The system calculates several fire danger indices based on measurements made by fire weather stations. The measurements and calculations are typically made daily, and the calculations are based on complex models that are well beyond the scope of this course. Our focus in Part 4 is on developing an understanding of the system (rather than its underlying models), using existing code to generate outputs based on measurements in file data, and testing existing code via the “white-box” method.
The Forest Fire Weather Index (FWI) System takes, as daily inputs, the following weather measurements:
- temperature
- relative humidity
- wind speed
- precipitation
The month and day these measurements are taken is also relevant for the models when making calculations for the system’s outputs. As output, the system calculates three moisture codes (sometimes called “subindexes”):
- Fine Fuel Moisture Code (FFMC)
- Duff Moisture Code (DMC)
- Drought Code (DC)
The higher the values (with type float) of these codes, the drier the conditions of the environment. Based on these codes, three additional outputs of “fire behaviour indexes” can be calculated:
- Initial Spread Index (ISI)
- Buildup Index (BUI)
- Fire Weather Index (FWI)
These six components act as a way to monitor the danger of fire in an environment. A summary of the Canadian Forest FWI System can be found here. To see some concrete photo examples of how FWI values (ranging from 9 to 34) correspond to real forest fires, check out this link.
Included in the starter files is a3_ffwi_system.py, which is based on the publicly available source code found here. We have adapted the original code to follow our course conventions and divided parts of the code into smaller functions. Rather than reading the body of the functions in detail, you should instead focus on the data classes, their docstring descriptions, and the docstring descriptions of the functions.
Your first task is to complete a3_part4.py. This task involves implementing functions that load data from a csv file, calculate all six outputs of the Forest FWI System, and plot the data. In order to calculate the daily outputs of the system, you will need to use the functions in a3_ffwi_system appropriately. Note that, when calculating moisture codes, the previous day’s moisture code calculation is required. For the very first day, you should use the INITIAL_FFMC, INITIAL_DMC, and INITIAL_DC constants defined in a3_ffwi_system.
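The "previous day" requirement is a loop pattern worth seeing in isolation. In the sketch below, TOY_INITIAL_CODE and toy_moisture_code are hypothetical stand-ins with made-up arithmetic; they are not the constants or formulas from a3_ffwi_system, and only the way the previous value is carried through the loop is the point.

```python
# Hypothetical stand-ins: NOT the real constants or formulas from a3_ffwi_system.
TOY_INITIAL_CODE = 85.0


def toy_moisture_code(previous_code: float, precipitation: float) -> float:
    """Return a made-up moisture code: rain lowers it, a dry day raises it."""
    if precipitation > 0.0:
        return max(0.0, previous_code - 10.0 * precipitation)
    else:
        return previous_code + 1.0


daily_precipitation = [0.0, 0.0, 2.0, 0.0]  # made-up measurements, one per day

# ACCUMULATOR: the codes computed so far, one per day
codes_so_far = []
previous_code = TOY_INITIAL_CODE  # the very first day uses the initial constant

for precipitation in daily_precipitation:
    code = toy_moisture_code(previous_code, precipitation)
    list.append(codes_so_far, code)
    previous_code = code  # today's code is tomorrow's "previous day" value

print(codes_so_far)
```

The key line is the reassignment of previous_code at the end of the loop body; forgetting it would make every day use the initial value.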
For data, we have included the csv file data/ffwi/sample_data.csv. We will also be releasing additional data from the government on Quercus under a special license. It will appear under the Assignment 3 module on Quercus a little while after this assignment is available. Please note that if you wish to use this additional data, you must understand and adhere to the license.
Your second task is to complete a3_part4_tests.py. This task lets you practice two forms of testing. Complete only the test cases in the starter file; you should not add more. First, you will do some “white-box testing”, a concept we originally discussed in Section 3.4 of the course notes when we introduced conditional execution. The branches you need to test for are described in the docstring description of the unit tests. To verify that your tests are reaching the right branches, consider using the debugger and breakpoints.
Second, you will do a new kind of testing that compares the outputs of the Forest FWI system to the “ground truth”. In this case, the ground truth is the set of values provided in the data; the term ground truth implies that these are the “correct” (or expected) values. Meanwhile, the values obtained via the calculate_ functions in a3_ffwi_system are the actual values. Your test_ffmc_against_ground_truth should contain a loop, and the loop body should consist of multiple assert statements (one for each system output you are validating).
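As a rough sketch of comparing actual values to ground truth inside a loop, consider the following. All numbers are made up, only one output is compared, and the tolerance passed to math.isclose is an assumption of this sketch; check the starter file for how closely values are expected to match.

```python
import math

# Made-up numbers for illustration; a real test would use values loaded
# from the data file and values returned by the calculate_ functions.
expected_values = [86.2, 87.7, 67.0]          # "ground truth" from the data
actual_values = [86.20004, 87.69998, 67.0]    # what the calculations returned

for i in range(len(expected_values)):
    # Floats rarely match exactly, so compare within a small tolerance
    # (the tolerance of 0.001 is an assumption).
    assert math.isclose(actual_values[i], expected_values[i], abs_tol=1e-3)
```

With several outputs to validate, the loop body would contain one such assert per output.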
Using what you know about the object-based Python memory model, answer the following questions in a3.tex. You may find it helpful to draw some memory model diagrams (in fact, we encourage it), but you should not include any memory model diagrams in your submission. Your answers should be brief; long answers may be penalized.
The function calculate_ffmc calls calculate_mr, passing the precipitation from a WeatherMetrics object referred to by wm as the first argument to calculate_mr. Why is it not possible for calculate_mr to mutate the precipitation attribute of the WeatherMetrics object referred to by wm in calculate_ffmc?
In the function calculate_dc, why is the local variable temperature assigned instead of simply re-using/re-assigning wm.temperature when the temperature is below -2.8 degrees Celsius?
In a3_part4, the return type of load_data is tuple[list[WeatherMetrics], list[FfwiOutput]]. We know that a tuple is immutable. Yet the list objects inside the tuple (i.e., list[WeatherMetrics], list[FfwiOutput]) can be mutated. So what do we mean when we say that this tuple return type is immutable?
Please proofread, test, and fix PythonTA errors in your work before your final submission!
Log in to MarkUs.
Go to Assignment 3, then the “Submissions” tab.
Submit the following files: a3.tex, a3.pdf (which must be generated from your a3.tex file), a3_part1.py, a3_part2.py, a3_part3.py, a3_part4.py, and a3_part4_tests.py. Please note that MarkUs is picky with filenames, and so your filenames must match these exactly, including using lowercase letters.
Note: for your Python code files, please make sure they run (in the Python console) before submitting them! Code submissions that do not run will receive a grade of zero for that part, which we do not want to happen to any of you, so please check your work carefully.
Refresh the page, and then download each file to make sure you submitted the right version.
Remember, you can submit your files multiple times before the due date. So you can aim to submit your work early, and if you find an error or a place to improve before the due date, you can still make your changes and resubmit your work.
After you’ve submitted your work, please give yourself a well-deserved pat on the back and go take a rest or do something fun or eat some chocolate!
