Assignment
Most email systems automatically filter spam into a separate folder. The filtering process is typically based on a spam list that identifies spam when email is loaded into your inbox. In this assignment youre going to create a C++ program that is able to
- load a spam list,
- display the contents of a spam list,
- check a single email address to see if its spam,
- filter an email list and output the resulting emails to a file
For example, to the right is a screenshot of the program in action, loading spamlist1.txt, displaying the contents, checking a couple email addresses to see if they are spam, and then filtering an email list in emails1.txt to eliminate spam emails. The currently loaded spam list defines what is considered spam; the user can load different spam lists.
Pay close attention to the formatting of the output, as the Gradescope submission system will require you to adhere to this output format. Also note that when the program ends, some stats are output based on how efficiently the problem was solved. If M is the # of emails processed and N is the # of emails in the spam list, your solution should insert at most O(N) elements, and access at most O(MlgN) elements.
Assignment details
The goals of this assignment are to (1) practice creating functions, (2) work with C++ strings and string functions, (3) work with vectors, and (4) read and write files. Many of you are new to C++, so working with strings, vectors, and files are an important introduction to C++.
When you load a spam list, you are required to input the data into 1 or more vectors. You cannot use any other data structure, and you cannot use a C-based array you must use a vector. In fact, youll use an implementation of vector we provide: ourvector<T>, which is available on the course dropbox under Projects, project01files. Your program must support 4 commands: load, display, check, and filter. Each of these commands must be implemented by a function that is called by main(). You can (and should) have additional functions, but these 4 are required. Finally, since efficiency is an important component of this class, you are required to search the spam list using binary search. Binary search requires only O(lgN) time, whereas linear search takes O(N) time. Example: if N = 1,000,000 elements, binary search takes roughly 20 comparisons to find an element, whereas linear search takes an average of 500,000 comparisons. Big difference.
Note that a spam list is guaranteed to be in sorted order by domain, and if two spam entries have the same domain, they are ordered by username (where * is naturally sorted first). This implies that binary search can be used without the need to sort the spam list.
Two larger lists are provided for you to test the efficiency of your program. As shown in the screenshot below, the file spamlist_10k.txt contains 10,000 spam entries. This list is then used to filter an email list of 1,000 emails (emails_1k.txt). Let M = 1,000, and N = 10,000. If your solution is efficient, searching the spam list once should required O(lgN) = log2(10,000) = 14 comparisons / vector accesses. This search is repeated 1,000 times (once for each email), so an efficient filtering should require roughly 14,000 vector accesses. As shown below in the stats, the vector was accessed 12,949 times, denoting an efficient O(M * lgN) solution.
Input and Output file formats
A spam list contains 1 or more lines, where each line contains a single string denoting a spam address. There are 2 possible formats for a spam address: domain:* or domain:username. The * means any username associated with that domain is spam; otherwise an email is spam only if the username and domain both match. Heres the contents of spamlist1.txt that was shown earlier (and again in screenshot to right):
aaa.com:user bestbuy.com:coupons bestbuy.com:offers groupon.com:* groupon.com:reseller important.com:dont_ignore massemail.com:* organicfoods.com:noreply uic.edu:accc uic.edu:chancellor uic.edu:rewards xyz.com:user1
The file is guaranteed to be in alphabetical order by domain; if 2 strings have the same domain, they will be ordered by username (* will proceed any other username).
The other possible input file is an email list, which is processed by the filter command. An email list consists of 1 or more lines, which each line contains 3 values:
MsgID EmailAddress Subject
For example, here is the contents of the email1.txt file used in the screenshot shown above:
- [email protected] Question about grading
- [email protected] This week only: free TV with purchase
- [email protected] Please update profile
- [email protected] This is not spam, seriously, not spam
- [email protected] This is also not spam, seriously, spam it is not
- [email protected] Can you help?
- [email protected] Please buy more stuff!
- [email protected] Please update your profile
33 [email protected] Important msg frm user
- [email protected] A really non-important msg
- [email protected] Seriously, this one is important
- [email protected] Yup, you got me, spam
99 [email protected] youll like this offer
121 [email protected] Please update your pwd
123 [email protected] You cant miss this deal
155 [email protected] Just kidding
- [email protected] VPN is now operational
- [email protected] VPN is down
254 [email protected] Tuition waivers for all!
- [email protected] Subway sandwiches free every friday
- [email protected] Testing
The message ID is always an integer, the email address is a single string (i.e. no spaces), and the subject is a string containing 1 or more words. The message IDs may or may not be in ascending order. As noted later on in the section entitled Learning C++, use the >> operator to input the message ID and email address, and use the getline() function to input the multi-word subject.
Based on a spam list and an email list, the filter command filters the email and produces a corresponding output file containing the email list with spam emails removed. Given the spam list and email lists shown above, the program would filter and produce the following output file (output1.txt in the screenshot):
9 [email protected] Question about grading
- [email protected] Please update profile
- [email protected] This is not spam, seriously, not spam
- [email protected] This is also not spam, seriously, spam it is not
21 [email protected] Can you help?
23 [email protected] Please update your profile
33 [email protected] Important msg frm user
42 [email protected] Seriously, this one is important
123 [email protected] You cant miss this deal 254 [email protected] Tuition waivers for all!
261 [email protected] Testing
Sample input files both spam lists and email lists are provided in the course dropbox under Projects, project01files.
Program functionality
Email filtering is based on the currently loaded spam list. Your program should support filtering by different spam lists. An example is shown here:
Hint: your program should have a spam vector with the currently loaded spam list. When the user loads a
new spam list, clear the vector before loading the new spam entries.
Your program is required to work properly if the input or output files cannot be opened, the user specifies an unknown command, the user inputs a badly-formed email address, or the user tries to display / check / filter before loading a spam list. Examples with expected output:
Keep in mind that our Gradescope submission system will be auto-grading your output, so you need to match the responses shown above *exactly*.
Learning C++
You may be new to C++, or perhaps new to the features of C++ needed in this assignment. Homework #01 and #02 will help you prepare, so we strongly recommend you complete those HW exercises before starting.
Youll need some string processing, namely finding characters within a string, and extracting a substring. While you can certainly write these functions yourself, its expected that youll use the .find() and .substr() functions built into the string class provided by C++: http://www.cplusplus.com/reference/string/string/. Dont forget to #include <string>.
Your solution is required to store all data in a vector<T> class to be precise, in an vector-compatible implementation we are providing: ourvector<T>. For the purposes of this assignment, always start with an empty vector, and then add data to the vector by calling the push_back() member function. When you need to access an element of the vector, use the .at() function, or the more convenient and familiar [ ] syntax. To empty a vector in order to load a new spam list, use the .clear() function. For more info on vector<T>: http://www.cplusplus.com/reference/vector/vector/. Dont forget to #include ourvector.h, which is available on the course dropbox under Projects, project01files.
Finally, to work with files, #include <fstream>. To read from a file, use an ifstream object, and use the >> operator when inputting a single value (e.g. integer or single word). When you need to input 1 or more words into a single string variable, such as the email subject, use getline(infile, variable). Example of reading a file that contains one string per line, where each string is just one word (i.e. no spaces):
string infilename; infilename = ;
ifstream infile(infilename); // use infile object to read from file
if (!infile.good()) // unable to open input file: { } else
{ string oneWord;
infile >> oneWord; while (!infile.eof()) // until we hit the end-of-file: {
.
. // process
.
infile >> oneWord;
}
infile.close();
}
To write to an output file, use an ofstream object and the << operator. Heres an example of writing the string hello world! to a file:
string outfilename; outfilename = ;
ofstream outfile(outfilename); // use outfile object to write to file
if (!outfile.good()) // unable to open output file:
{ } else
{ outfile << hello world! << endl;
outfile.close(); // make sure contents are written by closing file: }
Programming Environment
You are free to program on whatever platform you want, using whatever compiler / programming environment you want. Input files and ourvector.h are available in the course dropbox under Projects, project01files. If you do not have a C++ programming environment that you like, we recommend Codio; see the appendix at the end of this document for getting started with Codio.
Be aware that we are using Gradescope as our grading platform, and its common for C++ programs to
work on my platform but fail on Gradescope. This is due to logic errors in *your* program, not an error with Gradescope. The most common mistake is a memory-related error, e.g. using an uninitialized variable or accessing memory outside the bounds of an array or via an invalid pointer. These errors are hard to find; the tools valgrind and cppcheck (available on Codio) can help.
Gradescope is running on Ubuntu Linux, and we are compiling via g++ with -std=C++11. Do not ask us to change the C++ version; we are compiling against C++ 11.
Requirements
- You must use ourvector<T> (ourvector.h) for storing all data. No other data structures may be used.
- Each input file may be opened and read exactly once; store the data in ourvector<T> if need be.
- You must use binary search to search the spam list, and you must write the algorithm yourself; do not call the built-in binary search function provided by C++.
- The # of inserts must be O(N), where N is the # of spam list entries in the input file(s).
- The # of accesses must be O(MlgN), where M is the # of emails in the input file(s).
- Your main.cpp program file must have a header comment with your name and a program overview. Each function must have a header comment above the function, explaining the functions purpose, parameters, and return value (if any). Inline comments should be supplied as appropriate; comments like declares variable or increments counter are useless. Comments that explain non-obvious assumptions or behavior *are* appropriate.
- Each command (load, display, check, filter) must be implemented using a function; this implies a complete solution must have at least 4 functions.
- No global variables; use parameter passing and function return.
- The cyclomatic complexity (CCN) of any function may not exceed 15, including main(). This will be reported when you submit on Gradescope, and youll be warned if you exceed this threshold. In short, cyclomatic complexity is a representation of code complexity e.g. nesting of loops, nesting of if-statements, etc. You should strive to keep code as simple as possible, which generally means encapsulating complexity within functions (and calling those function) instead of explicitly nesting code. Heres an example of simpler code with low CCN:
while () {
if (searchFunctionFindsWhatWeNeed()) doSomething();
next value;
}
Heres an example of complex code with high CCN:
while () {
for () // loop to do search
{
if (search condition is met)
{ for () // now do something:
{
}
}
} next value;
}
As a general principle, if we see code that has more than 2 levels of explicit looping an example of which is shown above we will score that code as 0, even if its technically correct. The solution is to move one or more loops into a function, and call the function.
Have a question? Use Piazza, not email
As discussed in the syllabus, questions must be posted to our course Piazza site questions via email will be ignored. Remember the guidelines for using Piazza:
- Look before you post the main advantage of Piazza is that common questions are already answered, so search for an existing answer before you post a question. Posts are categorized to help you search, e.g. HW.
- Post publicly only post privately when asked by the staff, or when its absolutely necessary (e.g. the question is of a personal nature). Private posts defeat the purpose of piazza, which is answering questions to the benefit of everyone.
- Ask pointed questions do not post a big chunk of code and then ask help, please fix this. Staff and other students are willing to help, but we arent going to type in that chunk of code to find the error. You need to narrow down the problem, and ask a pointed question, e.g. on the 3rd line I get this error, I dont understand what that means.
- Post a screenshot sometimes a picture captures the essence of your question better than text. Piazza allows the posting of images, so dont hesitate to take a screenshot and post; see http://www.takeaorg/ .
- Dont post your entire answer / code if you do, you just gave away the answer to the ENTIRE CLASS. Sometimes you will need to post code when asking a question in that case post only the fragment that denotes your question, and omit whatever details you can. If you must post the entire code, then do so privately theres an option to create a private post (visible to staff only).
Grading, electronic submission, and Gradescope
Your score on this project is based on two factors: (1) correctness as determined by your Gradescope submission, and (2) manual review by the TAs for commenting, style, and approach (e.g. use of functions and efficiency of solution). The entire project is worth 150 points: 100 points for correctness, and 50 points for commenting, style, and approach. In this class we expect all submissions to compile, run, and pass at least some of the test cases; do not expect partial credit with regards to correctness. Example: if you submit to Gradescope and your score is a reported as a 0, then thats your correctness score. The only way to raise your correctness score is to re-submit and obtain a higher score by passing one or more test cases. You have unlimited submissions on this project assignment.
Note that the TAs will also review for adherence to requirements; breaking a requirement can result in a final score of 0 out of 150. We take all requirements seriously.
A bonus of 10% is earned for submitting by the Bonus deadline on page 1. Bonus points are accumulated and can be applied to a future project. To earn bonus points, your submission must post before the Bonus deadline, earn a correctness score of 100, *and* meet all project requirements in this case required # of functions, efficiency, etc. In other words, you cannot submit a correct but inefficient solution and expect to earn bonus points. Bonus points are reserved for early and well-written submissions.
To submit to Gradescope, you must first create an account; check your UIC email for an invitation to
Gradescope. If you cannot find this invitation email, post privately on Piazza and we will send another invite; you cannot register yourself. Gradescope is running on Ubuntu Linux, and we are using g++ with -std=C++11. Do not ask us to change the C++ version; we are compiling against C++ 11. Our grading script assumes your program will be found in a single main.cpp file; do not create additional .h, .hpp, or .cpp files. You may submit as many files as you want (e.g. the .zip produced by Codios Project >> Export as Zip command), but we extract and run only your main.cpp file.
By default, we grade your last submission. Gradescope keeps a complete submission history, so you can activate an earlier submission if you want us to grade a different one; this must be done before the due date. We assume *every* submission on your Gradescope account is your own work; do not submit someone elses work for any reason, otherwise it will be considered academic misconduct.
Policy
Late work *is* accepted. You may submit as late as 24 hours after the deadline for a penalty of 10%. After 24 hours, no submissions will be accepted.
All work submitted for grading *must* be done individually. While we encourage you to talk to your peers and learn from them (e.g. your iClicker teammates), this interaction must be superficial with regards to all work submitted for grading. This means you *cannot* work in teams, you cannot work side-by-side, you cannot submit someone elses work (partial or complete) as your own. The Universitys policy is available here:
https://dos.uic.edu/conductforstudents.shtml .
In particular, note that you are guilty of academic dishonesty if you extend or receive any kind of unauthorized assistance. Absolutely no transfer of program code between students is permitted (paper or electronic), and you may not solicit code from family, friends, or online forums. Other examples of academic dishonesty include emailing your program to another student, copying-pasting code from the internet, working in a group on a homework assignment, and allowing a tutor, TA, or another individual to write an answer for you. It is also considered academic dishonesty if you click someone elses iClicker with the intent of answering for that student, whether for a quiz, exam, or class participation. Academic dishonesty is unacceptable, and penalties range from a letter grade drop to expulsion from the university; cases are handled via the official student conduct process described at https://dos.uic.edu/conductforstudents.shtml .
Page
Appendix: Codio Programming Environment
Heres a quick summary of how to get started with Codio. Codio is cloud-based, and accessible via a web browser. Its platform-neutral and works from any platform, but you must be online to use it. The first step is to create a Codio account and join the class (CS 251 Spring 2020):
https://codio.com/p/joinclass?token=forumjamaica
Be sure to register using your UIC email address, especially since you may be using this account in future CS classes. The above link will provide access to Codio, and the resources associated with CS 251.
Once you successfully login to Codio, youll see the project cs251-project01-spam-filter pinned to the top of your dashboard. This represents a container-based C++ programming environment think light-weight virtual machine (VM). When you are ready to start programming, click Ready to go and Codio will startup the VM and within a few seconds youll have a complete Ubuntu environment at your disposal. In particular, youll have access to C++ via the GNU g++ compiler. You also have super-user (root) access, so you can install additional software if you want (sudo aptget install XYZ).
Once I login, I normally split the right-side into 2 windows: the top as my editor pane, and the bottom as my terminal window. This can be done via the View menu, Panels, Split Horizontal. Then click on the bottom window, drop the Tools menu, and select Terminal. Heres a snapshot showing student.cpp open in the editor, and the terminal window open below it:
For the purposes of project #01, nothing is provided other than a makefile and some input files for testing purposes. Use the File menu to create a new main.cpp file, and then open a terminal windows and use the provided makefile to compile and run: type make build to compile, and assuming no compilation errors, type make run to execute.
Page
Reviews
There are no reviews yet.