P4: Text Statistics
Project Overview
In this project, you are going to implement a program that analyzes text files to generate some useful statistics. We will provide plain text versions of several books and articles (e.g. Alice in Wonderland, The Gettysburg Address) for you to analyze using your program. Text analysis is used in programs all the time. Your program will provide basic statistics like the number of characters, words, and lines, the average word length, and the number of occurrences of each letter and each word length. However, the same technique could be used to collect more statistics and create a program to ascertain characteristics of books and classify them for various purposes.
Booklamp was a Boise startup (two of our CS department's alumni were part of it!) that created a sophisticated system for analyzing the text in books using techniques similar to what you will be doing in this lab. Booklamp software matches readers to books through an analysis of writing styles. The Booklamp technology allows us to find books that are written with a similar tone, tense, perspective, action level, description level and dialogue level. Booklamp was acquired by Apple on 25th July, 2014.
Objectives
- Use arrays.
- Use command-line arguments.
- Read from text files using the Scanner class.
- Parse and analyze text input.
- Implement an interface.
Getting Started
- Create a new Eclipse project for this assignment.
- The easiest way to import all the files into your project is to download and unzip the starter files directly into your project workspace directory (using the command line or Dolphin). The starter files are available here: http://cs.boisestate.edu/~cs121/projects/p4/stubs (You should download p4-stubs.zip).
- After you unzip the files into your workspace directory outside of Eclipse, go back to Eclipse and refresh your project.
- Create ProcessText and TextStatistics classes in your project.
- Start by implementing ProcessText. You can ignore the command-line arguments to start. Just hard-code a file name so you can test your TextStatistics class as you write it. Create a File object and check to see that the file actually exists.
  - If the file does exist, your program will create a TextStatistics object for that file and print out the statistics for the file to the console.
  - If the file does not exist, a meaningful error message needs to be printed to the user.
- Next you can start implementing TextStatistics according to the specifications.
- At this point, you should go back and add command-line argument processing to ProcessText as described in the specifications. To make sure it correctly handles command-line arguments, run it from the command line with no arguments, files that don't exist, and files that do exist.
- Make sure to test your program thoroughly. We are giving you the test program and scripts we will use to grade your program, so take advantage of this and make sure they pass!
Specification
Project files
For this assignment you are going to implement two classes. TextStatistics will be the class that reads a text file, parses it, and stores the information about the words and characters in the file. ProcessText is the driver class that gets a list of one or more filenames from the command line and collects statistics on each of the files using an instance of the TextStatistics class.
- Classes that you will create: ProcessText.java, TextStatistics.java
- Existing interface and class that you will use: TextStatisticsInterface.java, TextStatisticsTest.java
You should be able to develop this program incrementally in such a way that you can turn in a program that runs even if you don't succeed in implementing all the specified functionality.
ProcessText.java
The driver class with the main method, which processes one or more files to determine some interesting statistics about them.
- Command-line validation
  The names of the files to process will be given as command-line arguments. Make sure to validate the number of command-line arguments. There should be at least one file name given. If no files are given on the command line, your program must print a usage message and exit the program immediately. The message should read as follows.
    Usage: java ProcessText file1 [file2 ...]
  This lets the user know how they should run the program without having to go look up the documentation.
- Processing command-line arguments
  If valid filenames are given on the command line, your program will process each command-line argument by creating a File object from it and checking to see that the file actually exists.
  - If a file does exist, your program will create a TextStatistics object for that file and print out the statistics for the file to the console.
  - If a file does not exist, a meaningful error message needs to be printed to the user. Continue processing the next file. An invalid file in the list should not result in the program crashing or exiting before all files have been processed.
  The example, CmdLineArgs.java, shows how to use command-line arguments in your program. The args parameter of the main method is an array of String objects that contains the command-line arguments to the program. For your program, the array should contain the names of the files to be processed.
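The validation and per-file checks described above could be structured roughly as follows. This is only a sketch: the class name ProcessTextSketch and the helper messageFor are hypothetical, and the place where a real driver would build and print a TextStatistics object is marked with a comment instead of being implemented.

```java
import java.io.File;

/**
 * Illustrative driver sketch. The class name and the messageFor helper are
 * hypothetical; they are not part of the starter files.
 */
class ProcessTextSketch {

    /** Returns the line this sketch prints for one command-line argument. */
    static String messageFor(String path) {
        File file = new File(path);
        if (file.exists()) {
            // A real driver would create a TextStatistics for the file here
            // and print the result of its toString() method.
            return "Statistics for " + path;
        }
        return "Invalid file path: " + path;
    }

    public static void main(String[] args) {
        // args holds the command-line arguments; with none there is nothing to do.
        if (args.length == 0) {
            System.out.println("Usage: java ProcessText file1 [file2 ...]");
            return; // leaving main ends the program immediately
        }
        for (String arg : args) {
            // A bad path is reported and the loop moves on to the next file.
            System.out.println(messageFor(arg));
        }
    }
}
```

Because the existence check sits inside the loop, one missing file produces one error line and does not stop the remaining files from being processed.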
TextStatistics.java
An instantiable class that reads a given text file, parses it, and stores the generated statistics.
- Implement the Interface
  Your TextStatistics class must implement the given TextStatisticsInterface (don't modify the interface; it just provides a list of methods that your class must include). To implement an interface, you must modify your class header as follows.
    public class TextStatistics implements TextStatisticsInterface{}
  Adding implements TextStatisticsInterface will cause an error in Eclipse. Select the quick fix option to Add unimplemented methods and it will stub out the required methods for you.
- Instance variables
  Include a reference to the processed File. Include variables for all of the statistics that are computed for the file. Look at the list of accessor methods in the TextStatisticsInterface to determine which statistics will be stored.
- Constructor
  Takes a File object as a parameter. The constructor should open the file and read the entire file line-by-line, processing each line as it reads it.
  - Your constructor needs to handle the FileNotFoundException that can occur when the File is opened in a Scanner. Use a try-catch statement to do this. Don't just throw the exception.
  - As each line is read, collect the following statistics:
    - The number of characters and lines in the file. The number of characters should include all whitespace characters, punctuation, etc. The number of lines should include any blank lines in the file.
    - The number of words in the file. You must use a Scanner on each line to count the number of words in each line of the text file.
        private static final String DELIMITERS = "[\\W\\d_]+";
      Use useDelimiter(DELIMITERS) on your line Scanner to set the delimiters that the Scanner will use for separating words in the file. The scanner will not return any of the delimiter characters. For example, using lineScan.next() on the string
        .scheme, and the "plan" (for us)
      will give the following tokens.
        scheme
        and
        the
        plan
        for
        us
      The UseScannerDelimiter.java example shows the different results you get using the default delimiters and user-specified delimiters.
    - The number of words of each length that appear in the file. Assume that the maximum word length is 23. You do not need to print lengths that have a count of zero.
    - The average word length for the file.
    - The number of times each letter appears in the file. Do not separate upper and lower case; just convert all characters to lower case before counting. See LetterCount.java for a similar approach.
- Getter (accessor) methods
  Implement the accessor methods for the number of characters, number of words, number of lines, average word length and for the arrays that contain the number of words of each length and the number of times each letter occurs in the file.
- toString method
  Write a toString() method that generates and returns a String that can be printed to summarize the statistics for the file as shown in the sample output.
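The word-splitting step from the constructor description can be tried on its own before wiring it into TextStatistics. The sketch below applies the specification's delimiter pattern to the example string given there. The class DelimiterSketch and its wordsIn method are hypothetical names used only for this illustration, and a List is used just to make the tokens easy to print.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

/** Hypothetical demo class; not part of the starter files. */
class DelimiterSketch {
    // Same pattern as in the specification: any run of non-word characters,
    // digits, or underscores separates two words.
    private static final String DELIMITERS = "[\\W\\d_]+";

    /** Splits one line of text into words using the delimiter pattern. */
    static List<String> wordsIn(String line) {
        List<String> words = new ArrayList<>();
        Scanner lineScan = new Scanner(line);
        lineScan.useDelimiter(DELIMITERS);
        while (lineScan.hasNext()) {
            words.add(lineScan.next());
        }
        lineScan.close();
        return words;
    }

    public static void main(String[] args) {
        System.out.println(wordsIn(".scheme, and the \"plan\" (for us)"));
        // prints [scheme, and, the, plan, for, us]
    }
}
```

Note that the backslashes are doubled inside the Java string literal so that the regular expression itself contains \W and \d.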
Testing
You must test your program thoroughly before submitting it. We will be using similar testing strategies when we grade your program, so you should have a good idea whether or not your code will pass our tests before submitting.
TextStatisticsTest.java: Automated testing based on the interface.
We have provided a test program that tests your TextStatistics class using three sample text files. This test program will not compile unless you have properly implemented the required interface. This is available in your starter files.
autograde.sh: Testing based on program output.
- autograde.sh: A shell script (a program made up of shell commands) that you can run to see if your program is going to work with the shell script used for grading the programs.
- testfile.txt and etext: Sample text files used by autograde.sh.
- testresults: The expected output of autograde.sh. Your output should match the contents of this file.
To use these files to test your program, copy them into your program directory on onyx and make sure that autograde.sh is executable. (Running ls -l on the file should show -rwx------. If the x is missing, type chmod +x autograde.sh.)
Now you run the test by typing
./autograde.sh
If your program does not compile and run, you need to fix it if you want any points for the program. Make sure all the files have the names specified.
Sample Sessions
Sample output for bad arguments/non-existing files
[[email protected] p4]$ java ProcessText
Usage: java ProcessText file1 [file2 ...]
[[email protected] p4]$ java ProcessText not-a-file.txt
Invalid file path: not-a-file.txt
[[email protected] p4]$ java ProcessText not-a-file.txt testfile.txt
Invalid file path: not-a-file.txt
Statistics for testfile.txt
==========================================================
11 lines
79 words
465 characters
------------------------------
 a = 27     n = 25
 b = 1      o = 26
 c = 11     p = 5
 d = 10     q = 0
 e = 33     r = 21
 f = 9      s = 30
 g = 7      t = 35
 h = 24     u = 7
 i = 25     v = 1
 j = 0      w = 10
 k = 2      x = 1
 l = 18     y = 2
 m = 5      z = 0
------------------------------
 length  frequency
 ------  ---------
      1          3
      2         13
      3         24
      4         13
      5         10
      6          2
      7          5
      8          3
      9          1
     10          3
     11          2
Average word length = 4.24
==========================================================
Sample output for the input file testfile.txt
[[email protected] p4]$ java ProcessText testfile.txt
Statistics for testfile.txt
==========================================================
11 lines
79 words
465 characters
------------------------------
 a = 27     n = 25
 b = 1      o = 26
 c = 11     p = 5
 d = 10     q = 0
 e = 33     r = 21
 f = 9      s = 30
 g = 7      t = 35
 h = 24     u = 7
 i = 25     v = 1
 j = 0      w = 10
 k = 2      x = 1
 l = 18     y = 2
 m = 5      z = 0
------------------------------
 length  frequency
 ------  ---------
      1          3
      2         13
      3         24
      4         13
      5         10
      6          2
      7          5
      8          3
      9          1
     10          3
     11          2
Average word length = 4.24
==========================================================
Expected output for TextStatisticsTest
[[email protected] p4]$ java TextStatisticsTest
Testing on data file:testfile.txt
Passed! getCharCount()
Passed! getWordCount()
Passed! getLineCount()
Passed! getAverageWordLength()
Passed! Arrays frequencies
Passed! Letter frequencies
Testing on data file:etext/Gettysburg-Address.txt
Passed! getCharCount()
Passed! getWordCount()
Passed! getLineCount()
Passed! getAverageWordLength()
Passed! Arrays frequencies
Passed! Letter frequencies
Testing on data file:etext/Alice-in-Wonderland.txt
Passed! getCharCount()
Passed! getWordCount()
Passed! getLineCount()
Passed! getAverageWordLength()
Passed! Arrays frequencies
Passed! Letter frequencies
Submitting Your Project
Documentation
Javadoc Comments
If you haven't already, add javadoc comments to your program. They should be located immediately before the class header and before each method. If you forgot how to do this, review the Documenting Your Program section from lab.
- Have a class javadoc comment before the class.
- Have javadoc comments before every method that you wrote. Comments must include @param and @return tags as appropriate.
- To build and view your comments, run the following commands.
    javadoc -author -d doc *.java
    google-chrome doc/index.html
README
Include a plain-text file called README that describes your program and how to use it. Expected formatting and content are described in README_TEMPLATE. See README_EXAMPLE for an example.
You should only have to read through each file once if you are doing this program properly. By the end of the constructor, the TextStatistics object should have collected all of its statistics and calls to its accessor methods will simply return the stored values.
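One way to picture this single pass is sketched below. It is a sketch under several assumptions, not the required design. The class SinglePassSketch and its field names are hypothetical. It takes an already opened Scanner so it can be tried on a plain string, whereas the real constructor takes a File and handles FileNotFoundException. It also assumes every line ends with exactly one terminator character when counting characters, which you should check against the expected output in testresults.

```java
import java.util.Scanner;

/** Hypothetical illustration of collecting every statistic in one pass. */
class SinglePassSketch {
    private static final String DELIMITERS = "[\\W\\d_]+";
    private static final int MAX_WORD_LENGTH = 23;

    int lineCount;
    int charCount;
    int wordCount;
    int totalWordLength;
    int[] wordLengthCount = new int[MAX_WORD_LENGTH + 1]; // index = word length
    int[] letterCount = new int[26];                      // index 0 = 'a'

    /** Reads each line exactly once and updates all counters as it goes. */
    SinglePassSketch(Scanner fileScan) {
        while (fileScan.hasNextLine()) {
            String line = fileScan.nextLine();
            lineCount++;
            // Assumption: one terminator character per line, which nextLine() strips.
            charCount += line.length() + 1;

            // Count letters without distinguishing upper and lower case.
            for (int i = 0; i < line.length(); i++) {
                char c = Character.toLowerCase(line.charAt(i));
                if (c >= 'a' && c <= 'z') {
                    letterCount[c - 'a']++;
                }
            }

            // Count words and word lengths with a second Scanner on this line.
            Scanner lineScan = new Scanner(line);
            lineScan.useDelimiter(DELIMITERS);
            while (lineScan.hasNext()) {
                String word = lineScan.next();
                wordCount++;
                totalWordLength += word.length();
                if (word.length() <= MAX_WORD_LENGTH) {
                    wordLengthCount[word.length()]++;
                }
            }
            lineScan.close();
        }
    }

    /** Returns the stored average; no file access happens here. */
    double averageWordLength() {
        return wordCount == 0 ? 0.0 : (double) totalWordLength / wordCount;
    }
}
```

For example, new SinglePassSketch(new Scanner("Hello, world!\nIt is 42.\n")) ends up with 2 lines, 24 characters, 4 words, and an average word length of 3.5, and none of those values require reading the input again.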