Learning objectives: ##HStrategies for processing many files Generating output for use in another program Programatically processing general#G input data Introduction to visualization using ggplot2
Skills
coordination + communication (1/6)
organization + planning (3/6)
teamwork (3/6)
programming + tools (5/6)
strategy (4/6)
visualization (6/6)
(*)[The skill scale is from 0 (Fundamental Awareness) to 6 (Main Focus).]
Image description
A pair of glasses. Image source
Overview
In this lab we will again be working in new pairs. With your partner, choose whose lab from last week will be the starting place. Again, a solution to last weeks lab is available, but using it will result in a 10% reduction to your grade for this weeks lab.
Task 1 Description
Retrieve a copy of your findFirstNames.pl Perl script from last weeks lab. Decide with your new partner which of your implementations to use. If you did not get the previous assignment working, you can download working code from the CourseLink site for but you will lose 10% of your grade for this lab. You will see this available as Emergency Kit: Solution for Lab 3.
We will now change gears slightly to create code that will track a names popularity across a number of years. This will require changes to the number of files that will be read in so let us start there.
Copy findFirstName.pl to another file named firstNamesByTime.pl.
Change your code so that instead of reading in one SS name year file you will read in any number of input files. Your command line will look like the following:
$ perl firstNamesByTime . pl 1900 2000 20 querynames . txt
Here, the first two parameters (1900, 2000) denote the start and end years that you want to cover in the SS name files, i.e.; yob1900.txt and yob2000.txt.
The next parameter is increment in years for the files in between the start and end years. So in our example we would be considering all of these files: yob1900.txt, yob1920.txt, yob1940.txt, yob1960.txt, yob1980.txt, and yob2000.txt.
The last parameter is the file containing the names that you want to examine for their popularity in the years described in the first three parameters.
The querynames.txt file will have the following format:name,sex where sex = F or M (female or male)For example, see the file listing shown to the right: | Andrew ,MKassandra , FDavis ,MJulia , F |
For each name and sex in the querynames.txt file, we want to print out the ranking from each of the indicated SS files in the format of a new .csv file. This file should have a header line consisting of the field names Name,Year,Ranking and the remaining lines should consist of the data values for one of the names for a given year. All the data for a given name should appear together, and the years should be in ascending order.
For example, if we run the following command using the above querynames.txt file, we should see this output:
$ perl firstNamesByTime . pl 1990 2000 3 queryNames . txt
Name, Year , Ranking
Andrew,1990,7
Andrew,1993,10
Andrew,1996,10
Andrew,1999,7 Kassandra ,1990,312
Kassandra ,1993,120 Kassandra ,1996,203
Kassandra ,1999,264 Davis ,1990,598 Davis ,1993,461
Davis ,1996,409
Davis ,1999,371
Julia ,1990,83
Julia ,1993,72
Julia ,1996,48
Julia ,1999,30
Pair Programming 3:
- Overview
Learning objectives: ##HStrategies for processing many files Generating output for use in another program Programatically processing general#G input data Introduction to visualization using ggplot2
Skills
coordination + communication (1/6)
organization + planning (3/6)
teamwork (3/6)
programming + tools (5/6)
strategy (4/6)
visualization (6/6)
(*)[The skill scale is from 0 (Fundamental Awareness) to 6 (Main Focus).]
Image description
A pair of glasses. Image source
We will now explore using the Statistics::R package to use the powerful ggplot2 library to produce plots of our data.
You will need to download all of the YoB (Year of Birth) files from CourseLink. You will find the file, names.zip in the Labs section on CourseLink. This contains all the YoB files.
Now you can download the Perl script named createNameRankPlot.pl and test it out with your new output.
Run your code and redirect the output into a file (note that here we are using every years data, not skipping 3 as in the last run):
$ perl firstNamesByTime . pl 1990 2000 1 queryNames . txt > plot1 . txt
To run the plotting programs (even in the THRN labs) you must do the following before running createNameRankPlot.pl:
- Go to the Applications Folder on the machine and double click on R (this is a statistics program)
- In R type the following command to load the plotting library (the > is the R prompt):
> install.packages(ggplot2)
- Now you can continue on with your perl programming and pl will produce lovely PDFs of your plots.
Then run the plotting script:
$ perl createNameRankPlot . pl plot1 . txt plot1 . pdf
Then open the PDF file to see what you have created:
$ open plot1 . pdf
You should see a plot likt this:
Popularity of Names
Overview
Learning objectives: ##HStrategies for processing many files Generating output for use in another program Programatically processing general#G input data Introduction to visualization using ggplot2
Skills
coordination + communication (1/6)
organization + planning (3/6)
teamwork (3/6)
programming + tools (5/6)
strategy (4/6)
visualization (6/6)
(*)[The skill scale is from 0 (Fundamental Awareness) to 6 (Main Focus).]
Image description
A pair of glasses. Image source
Ranking Category
We can calculate different presentations of data. Examine the script convertRankingToRankCategory.pl. This script reads a .csv file and will convert the values in a Ranking column according to the conversion shown to the right.Run this script to convert the plot1.txt file to a new plot2.txt file: | 0>20001000200050099920049910019950991049 | 012 345 67 |
110 | 8 |
$ perl convertRankingToRankCategory . pl plot1 . txt > plot2 . txt
The file open plot2.txt will contain the following:
Name, Year , RankCategory
Andrew,1990,8 Andrew,1993,8
Andrew,1996,8
Andrew,1999,8 Kassandra ,1990,4 Kassandra ,1993,5 Kassandra ,1996,4
Kassandra ,1999,4
Davis ,1990,3
Davis ,1993,4
Davis ,1996,4
Davis ,1999,4 Julia ,1990,6 Julia ,1993,6 Julia ,1996,7
Julia ,1999,7
Creating a plot and then viewing it with produce the plot below:
$ perl createNameRankCategoryPlot . pl plot2 . txt plot2 . pdf
Popularity of Names
Overview
Learning objectives: ##HStrategies for processing many files Generating output for use in another program Programatically processing general#G input data Introduction to visualization using ggplot2
Skills
coordination + communication (1/6)
organization + planning (3/6)
teamwork (3/6)
programming + tools (5/6)
strategy (4/6)
visualization (6/6)
(*)[The skill scale is from 0 (Fundamental Awareness) to 6 (Main Focus).]
Image description
A pair of glasses. Image source
As a final lab task, consider what the differences are between createNameRankPlot.pl and createNameRankCategoryPlot.pl.
Why does createNameRankPlot.pl not work (it gives an error) when it is run on the data in plot2.txt?
Complete the Quiz in Courselink that is part of Lab 4 to provide your answer, and be sure to upload your firstNameByTime.pl file, along with the plot1.pdf and plot2.pdf visualization files that you created.
Reviews
There are no reviews yet.