STAT 29000: Project 4 — Fall 2021
Extracting and summarizing data in bash
Motivation: Becoming comfortable chaining commands and getting used to navigating files in a terminal is important for every data scientist to do. By learning the basics of a few useful tools, you will have the ability to quickly understand and manipulate files in a way which is just not possible using tools like Microsoft Office, Google Sheets, etc. While it is always fair to whip together a script using your favorite language, you may find that these UNIX tools are a better fit for your needs.
Context: We’ve been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called piping.
Scope: grep, regular expression basics, UNIX utilities, redirection, piping
Dataset(s)
The following questions will use the following dataset(s):
-
/depot/datamine/data/stackoverflow/unprocessed/*
-
/depot/datamine/data/stackoverflow/processed/*
-
/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt
Questions
Question 1
One of the first things to do when first looking at a dataset is reading the first few lines of data in the file. Typically, there will be some headers which describe the data, and you get to see what some of the data looks like. Use the UNIX head
command to read the first few lines of the data in unprocessed/2011.csv
.
As you will quickly see, this dataset is just too wide — there are too many columns — to be useful. Let’s try and count the number of columns using head
, tr
, and wc
. If we can get the first row, replace ,’s with newlines, then use `wc -l
to count the number of lines, this should work, right? What happens?
The newline character in UNIX is |
-
Code used to solve this problem.
-
Output from running the code.
Question 2
As you can see, csv files are not always so straightforward to parse. For this particular set of questions, we want to focus on using other UNIX tools that are more useful on semi-clean datasets. Take a look at the first few lines of the data in processed/2011.csv
. How many columns are there?
-
Code used to solve this problem.
-
Output from running the code.
Question 3
Let’s switch gears, and look at a larger dataset with more data to analyze. Check out iowa_liquor_sales_cleaner.txt
. What are the 5 largest orders by number of bottles sold?
-
Code used to solve this problem.
-
Output from running the code.
Question 4
What are the different sizes (in ml) that a bottle of liquor comes in?
-
Code used to solve this problem.
-
Output from running the code.
Question 5
Which store has the most invoices? There are 2 columns you could potentially use to solve this problem, which should you use and why? For this dataset, does it end up making a difference?
This may take a few minutes to run. Grab a coffee. To prevent wasting time, try practicing on the |
Be very careful when using
|
-
Code used to solve this problem.
-
Output from running the code.
Question 6
sort
is a particularly powerful function, albeit not always the most user friendly when compared to other tools.
For the largest sale (in USD), what was the volume sold in liters?
For the largest sale (in liters of liquor sold), what was the total cost (in USD)?
Use the |
To remove a dollar sign from text using
|
-
Code used to solve this problem.
-
Output from running the code.
Question 7
Use head
, grep
, sort
, uniq
, wc
, and any other UNIX utilities you feel comfortable using to answer a data-driven question about the iowa_liquor_sales_cleaner.txt
dataset.
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. |