TDM 10200: Python Project 5 — Spring 2025
Motivation: Real world data has a lot of missing data. It is also helpful to be able to take a subset of data.
Context: It is worthwhile to be prepared to have missing data and to know how to work with it.
Scope: Dealing with missing data, and taking subsets of data.
Dataset(s)
This project will use the following datasets:
-
/anvil/projects/tdm/data/death_records/DeathRecords.csv -
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv -
/anvil/projects/tdm/data/beer/reviews_sample.csv -
/anvil/projects/tdm/data/election/itcont1980.txt -
/anvil/projects/tdm/data/flights/subset/1990.csv
Questions
Example 1:
Example 2:
Example 3:
Example 4:
Example 5:
Question 1 (2 pts)
In the death records file:
/anvil/projects/tdm/data/death_records/DeathRecords.csv
-
Build a subset of the data for which
Sex=='F'and check the head of the subset to make sure that you only have 'F' values in theSexcolumn of your subset. -
Make a table of the
Agevalues from the subset of female data in question 1a. Put the values in order byAgeusingsort_index()instead ofsort_values(). Then plot the table of theseAgevalues. (Notice that 999 is used when theAgevalue is missing in part 1b!) -
Now revise your subset from question 1a, so that you build a subset of the data for which
Sex=='F' & Age!=999and then make of table of theAgevalues from this revised subset of female data. Put these values in order byAgeusingsort_index()instead ofsort_values()(in particular, the age values of 999 should be removed now), and plot the table of theseAgevalues.
-
a. The head of the subset of data for which
Sex=='F' -
b. Plot of the table of
Agevalues for the subset in 1a. -
c. Revise questions 1a and 1b so that
Sex=='F' & Age!=999
Question 2 (2 pts)
In the grocery store file:
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
there are more than 10 million lines of data, as we can see if we check myDF.shape. Each line corresponds to the purchase of an item. The SPEND column is negative when a purchase is refunded, i.e., the item is returned and the money is given back to the customer.
Create a smaller data set called refundsDF that contains only the lines of data for which the SPEND column is negative. Make a table of the STORE_R values in this refundsDF subset, and show the number of times that each STORE_R value appears in the refundsDF subset.
When you read in the grocery data, you need to assign column names, as follows:
myDF = pd.read_csv("/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv")
myDF.columns = ["BASKET_NUM", "HSHD_NUM", "PURCHASE", "PRODUCT_NUM", "SPEND", "UNITS", "STORE_R", "WEEK_NUM", "YEAR"]
-
Show the number of refunds for each
STORE_Rvalue in therefundsDFsubset. (For instance,CENTRALstores had 2750 refunds.)
Question 3 (2 pts)
In this file of beer reviews /anvil/projects/tdm/data/beer/reviews_sample.csv
Make a subset of the beers that have (score != 5) & (overall == 5) (in other words the score value is not equal to 5 but the overall value is equal to 5). How many lines of data are in this subset?
-
How many lines of data are in the subset that has
(score != 5) & (overall == 5)?
Question 4 (2 pts)
Read in the 1980 election data using:
/anvil/projects/tdm/data/election/itcont1980.txt
Remember that we need to specify the column names for the 1980 election donation data:
myDF = pd.read_csv("/anvil/projects/tdm/data/election/itcont1980.txt", header=None, sep='|')
myDF.columns = ["CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID"]
There are only 9 entries in which the TRANSACTION_DT value is missing, namely: one donation from CURCIO, BARBARA G and two donations from WOLFF, GARY W. and six donations from who?? (find their identity)! Find the name of the person who made six donations in 1980 with a missing TRANSACTION_DT.
-
Find the name of the person who made 6 donations in 1980 with a missing
TRANSACTION_DT.
Question 5 (2 pts)
Consider the 1990 flight data:
/anvil/projects/tdm/data/flights/subset/1990.csv
This data set has information about 5270893 flights.
-
For how many flights is the
DepDelaymissing and also (simultaneously) theArrDelayis missing too? -
For how many flights is the
DepDelaygiven but theArrDelayis missing? -
For how many flights is the
ArrDelaygiven but theDepDelayis missing?
-
a. Find the number of flights for which the
DepDelayis missing and also (simultaneously) theArrDelayis missing too. -
b. Find the number of flights for which the
DepDelayis given but theArrDelayis missing. -
c. Find the number of flights for which the
ArrDelayis given but theDepDelayis missing.
Submitting your Work
Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.
If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours.
Prior to submitting your work, you need to put your work into the project template, and re-run all of the code in your Jupyter notebook and make sure that the results of running that code is visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.
Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points.
-
firstname_lastname_project5.ipynb
|
It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template. You must double check your Please take the time to double check your work. See here for instructions on how to double check this. You will not receive full credit if your |