Overview
UWA, like every university around the country (probably around the planet), is very worried about ghost-written submissions for assignments. This is also known as contract cheating. Whatever you call it, ghost-writing is about getting someone else to do your work, but submitting it as if it were only your work. In this case we are concerned with essays. The incidence is believed to be low, but it's clearly not a good thing.
Coming from a different angle, debates have raged at various times about whether different authors' works were actually by those authors. For example, were all the works attributed to William Shakespeare actually by him? One approach to examining both of these issues is to use stylometry. That is, rather than looking directly at the content of texts, as one does when looking for suspected plagiarism, stylometry looks for stylistic similarities. In other words, it looks for similarities in the ways a particular author uses language, rather than similarities in the actual words on the page, on the assumption that an author will use a similar style for similar sorts of content: fiction, non-fiction, etc.
What you will do for this Project is write a program that reads in either one or two text files containing the works to be analysed and builds a profile for each. Then either the profile is listed, or if there are two text files, the two profiles are compared, returning a score which reflects the distance between the two works in terms of their style; low scores, down to 0, imply that the same author is likely responsible for both works, while large scores imply different authors.
Specification: What your program will need to do
Input
Your program must define the function main with the following signature:
def main(textfile1, arg2, normalize=False)
The first, compulsory argument is the name of a text file containing a work to be analysed. The second, compulsory argument will either be the name of a second text file to be analysed or the string "listing". The final, optional argument (default value False) is whether the profile values, excluding sentences per paragraph and words per sentence (discussed below), are to be normalised.
Output
The output will either be some text with the score from a pairwise comparison, or the listing of the first file's profile. These will be printed on standard output.
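As a rough sketch of how main might dispatch between the two modes: the helper names build_profile and compare_profiles are hypothetical placeholders, and the exact wording of the printed output is an assumption, since only the signature is specified.

```python
def build_profile(filename, normalize):
    # Hypothetical placeholder: a real version would read `filename` and
    # count the features described in the detailed specification.
    return {}


def compare_profiles(profile1, profile2):
    # Hypothetical placeholder for the distance computation.
    return 0.0


def main(textfile1, arg2, normalize=False):
    profile1 = build_profile(textfile1, normalize)
    if arg2 == "listing":
        # Listing mode: print the first file's profile, one item per line.
        for key in sorted(profile1):
            print(key, profile1[key])
    else:
        # Comparison mode: arg2 is taken to be the name of a second file.
        profile2 = build_profile(arg2, normalize)
        # The wording of this message is an assumption.
        print("The distance between the two texts is:",
              compare_profiles(profile1, profile2))
```

Keeping main this thin means each helper can be tested on its own before the pieces are joined.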
A more detailed specification
For the purposes of this project, a sentence is a sequence of words followed by either a full stop, question mark or exclamation mark, which in turn must be followed either by a quotation mark (so the sentence is the end of a quote or spoken utterance), or white space (space, tab or new-line character). Thus:
This is some text. This is yet more text
contains one sentence followed by the start of another sentence.
Your program will need to count the number of occurrences of certain words and certain pieces of punctuation. Specifically, the list of words to be counted is:
also, although, and, as, because, before, but, for, if, nor, of, or, since, that, though, until, when, whenever, whereas, which, while, yet
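The sentence rule can be sketched as a single pass over the text. This is a minimal sketch with two assumptions: "quotation mark" is taken to mean either a double or a single quote, and a terminator that is the very last character of the text (with nothing after it) is not counted, following the rule literally.

```python
def count_sentences(text):
    terminators = ".?!"
    # Assumption: both " and ' count as quotation marks.
    followers = "\"' \t\n"
    count = 0
    # Stop one short of the end so text[i + 1] always exists.
    for i in range(len(text) - 1):
        if text[i] in terminators and text[i + 1] in followers:
            count += 1
    return count
```

For the example above, count_sentences("This is some text. This is yet more text") gives 1, because the second sentence never reaches a terminator.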
If you are wondering why that particular set of words is being used, they are conjunctions, which can indicate more complex sentences without relying on the content of the text.
Your program should also count certain pieces of punctuation: comma and semicolon. In addition, your program should count single-quote and hyphen, but only under certain circumstances. Specifically, your program should count single-quote marks, but only when they appear as apostrophes surrounded by letters, i.e. indicating a contraction such as shouldn't or won't. (Apostrophe is being included as an indication of more informal writing, perhaps direct speech.)
Finally, your program should count dash (minus) signs, but only when they are surrounded by letters, indicating a compound word, such as compound-word.
Any other punctuation or characters, e.g. "." when not at the end of a sentence, should be regarded as white space, and so serve to end words. For these purposes, strings of digits are also words, as they convey information. Therefore, in the unlikely event that a floating point number, such as 3.142, appears, that is regarded as two words. Note: Some of the texts we will use include a double hyphen, i.e. "--". This is to be regarded as a space character.
Each of the words and punctuation symbols should be placed, together with their respective counts, in a dictionary, which I shall call a profile.
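Both the apostrophe rule and the hyphen rule come down to the same test: the symbol has a letter on each side. A minimal sketch, where the function name count_between_letters is only illustrative:

```python
def count_between_letters(text, symbol):
    count = 0
    # The first and last characters cannot have a letter on both sides,
    # so the loop skips them.
    for i in range(1, len(text) - 1):
        if (text[i] == symbol
                and text[i - 1].isalpha()
                and text[i + 1].isalpha()):
            count += 1
    return count
```

With this, count_between_letters(text, "'") counts contractions and count_between_letters(text, "-") counts compound words, while opening or closing quote marks and free-standing dashes are ignored.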
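One way to turn these rules into a word splitter is sketched below. It makes three assumptions that the rules above do not settle: words are lower-cased before counting, apostrophes and hyphens that sit between two letters stay inside the word, and the double hyphen is replaced before anything else is examined.

```python
def split_words(text):
    # Assumption: counting is case-insensitive, so lower-case first.
    # A double hyphen is treated as a space character.
    text = text.lower().replace("--", " ")
    cleaned = []
    for i, ch in enumerate(text):
        if ch.isalnum():
            cleaned.append(ch)
        elif (ch in "'-" and 0 < i < len(text) - 1
              and text[i - 1].isalpha() and text[i + 1].isalpha()):
            # Assumption: apostrophes and hyphens between letters are
            # kept as part of the word.
            cleaned.append(ch)
        else:
            # Everything else acts as white space and ends the word.
            cleaned.append(" ")
    return "".join(cleaned).split()
```

Note that 3.142 comes out as the two words 3 and 142, as required.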
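A sketch of filling such a dictionary is given below. It assumes the text has already been split into lower-case words, and it uses the punctuation symbols themselves as keys, which is an assumption about naming. Apostrophe and hyphen counts are left out here because they need the letters-on-both-sides check; they would be added to the dictionary in the same way once computed.

```python
CONJUNCTIONS = ["also", "although", "and", "as", "because", "before", "but",
                "for", "if", "nor", "of", "or", "since", "that", "though",
                "until", "when", "whenever", "whereas", "which", "while",
                "yet"]


def count_features(words, text):
    # Start every conjunction at zero so that absent words still appear.
    profile = {}
    for word in CONJUNCTIONS:
        profile[word] = 0
    for word in words:
        if word in profile:
            profile[word] += 1
    # Commas and semicolons are counted wherever they occur.
    profile[","] = text.count(",")
    profile[";"] = text.count(";")
    return profile
```

Starting every key at zero matters later: two profiles built this way always have the same keys, so they can be compared entry by entry.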
You should also add to the profile two further parameters relating to the text: the average number of words per sentence and the average number of sentences per paragraph, where a paragraph is any number of sentences followed by a blank line or by the end of the text.
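The two averages can be sketched as follows. The sketch assumes that every run of non-blank lines counts as a paragraph (including short runs such as headings), that a line containing only spaces or tabs counts as blank, and that an average is reported as 0.0 when its divisor is zero. The function names are illustrative.

```python
def count_paragraphs(text):
    paragraphs = 0
    in_paragraph = False
    for line in text.split("\n"):
        if line.strip() == "":
            # A blank line closes the current paragraph, if any.
            in_paragraph = False
        elif not in_paragraph:
            # First non-blank line after a gap starts a new paragraph.
            in_paragraph = True
            paragraphs += 1
    return paragraphs


def averages(num_words, num_sentences, num_paragraphs):
    # Assumption: report 0.0 rather than dividing by zero.
    words_per_sentence = num_words / num_sentences if num_sentences else 0.0
    sentences_per_paragraph = (num_sentences / num_paragraphs
                               if num_paragraphs else 0.0)
    return words_per_sentence, sentences_per_paragraph
```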
If the third, optional parameter in main is set to True, the profile values are to be normalised. That is, except for the words per sentence and sentences per paragraph parameters, each of the others is to be divided by the number of sentences in the respective text. (Clearly, if two texts are nominated, and normalize is set to True, both profiles are to be normalised.)
If the second argument is the string "listing", the profile corresponding to the first file should be printed on standard output, one item per line. On the other hand, if the second argument is another text file, the distance between the corresponding profiles should be computed using the standard distance formula:
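Normalisation might be sketched like this. The key names words_per_sentence and sentences_per_paragraph are assumptions, as is the choice to leave values unchanged when a text contains no sentences.

```python
def normalise(profile, num_sentences,
              exempt=("words_per_sentence", "sentences_per_paragraph")):
    result = {}
    for key, value in profile.items():
        if key in exempt or num_sentences == 0:
            # The two averages are never rescaled; a text with no
            # sentences is left as-is to avoid dividing by zero.
            result[key] = value
        else:
            result[key] = value / num_sentences
    return result
```

Returning a new dictionary, rather than changing the profile in place, keeps the raw counts available if they are needed later.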
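Assuming "the standard distance formula" refers to Euclidean distance, that is, the square root of the sum of squared differences between corresponding profile values, the computation could be sketched as below. The sketch also assumes both profiles contain exactly the same keys.

```python
import math


def distance(profile1, profile2):
    total = 0.0
    # Assumption: both profiles share the same keys.
    for key in profile1:
        total += (profile1[key] - profile2[key]) ** 2
    return math.sqrt(total)
```

Under this reading, a text compared with itself scores 0, which matches the statement that low scores suggest the same author.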
Example
An example interaction, which you can find here, is based on three files: sample1.txt and sample2.txt, both excerpts taken from Life on the Mississippi, by Mark Twain, and sample3.txt, which is taken from Banjo Paterson's collection of stories, Three Elephant Power.
Some Text Files to Examine
Here are some files for you to try out. All of the texts, apart from Kangaroo, were obtained from Project Gutenberg. The Project Gutenberg files come with a long passage at the end containing the Project Gutenberg license and terms of use. I have linked the Gutenberg terms and license here rather than leaving them in the texts, because they may affect the profiles.
Author                          | Title                          | Fiction/Non-fiction
Henry Lawson                    | Children of the Bush           | Fiction
D. H. Lawrence                  | Fantasia of the Unconscious    | Non-fiction
Mark Twain                      | Life on the Mississippi        | Non-fiction
D. H. Lawrence                  | Sea and Sardinia               | Non-fiction
D. H. Lawrence                  | Kangaroo                       | Fiction
Mark Twain                      | Adventures of Huckleberry Finn | Fiction
Andrew Barton "Banjo" Paterson  | Three Elephant Power           | Fiction
A small note of warning. If you decide to download your own texts from Project Gutenberg (recommended), please be aware that many of the texts include spurious Unicode characters. Unfortunately, the file input-output functions we use in CITS1401 (and I use on a daily basis) only work with the standard ASCII character set, so will cause an exception if Unicode characters are in the text. While Python is well able to deal with Unicode, special input-output functions are needed, which are beyond the scope of this unit. What I have done is use the Unix command:
cat -vet filename
to make the Unicode characters visible in the ASCII character set, and then use a text editor to remove them. (Tedious.)
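If the cat command is not available, the offending lines can also be located from Python. The following is only a sketch: it works on raw bytes, which goes a little beyond the file handling used in the unit, and the function name is illustrative.

```python
def lines_with_non_ascii(data):
    """Return 1-based numbers of lines holding any byte above 127."""
    flagged = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        # Iterating over bytes yields integers; ASCII stops at 127.
        if any(byte > 127 for byte in line):
            flagged.append(number)
    return flagged
```

The bytes would come from opening the file in binary mode, for example open(filename, "rb").read(), and the reported line numbers show where to look in the text editor.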
Important
You will have noticed that you have not been asked to write specific functions. That has been left to you. However, as in Project 1, it is important that your program defines the top-level function main() as described above. main() should then call the other functions. (Of course, these may call further functions.) The reason this is important is that when I test your program, my testing program will call your main() function. So, if you fail to define main(), or define it with a different signature, my program will not be able to test your program.
Things to avoid
There are a couple of things for your program to avoid.
Please do not import any Python module other than math or, if you wish, os. While using a module such as re is a perfectly sensible thing to do (and the way I often may do it), it takes away much of the point of different aspects of the project, which is about getting practice creating code to accurately extract the parts of strings that you need, and use of basic Python structures, in this case dictionaries.
Please do not assume that the input file names will end in .txt. File name suffixes such as .csv and .txt are not mandatory in systems other than Microsoft Windows.
Please make sure your program does NOT call the input() function. That will cause your program to hang, waiting for input that my automated testing system will not provide. In fact, what will happen is that the marking program detects the call(s), and will not test your code at all.