### Goal
The goal of this assignment is to familiarize you with Python and help you understand information retrieval by building a simple information retrieval system.
---
### What to do
In this assignment you will build a simple information retrieval engine that processes a set of documents, creates an inverted index to speed up information retrieval, and returns a ranked list of document filenames that match the specified keyword search, along with relevance scores and a breakdown of weights for all keywords.
You will actually need to build two different Python programs: `pine-index.py` and `pine-search.py`, as follows.
#### Pitt INformation retrieval Engine / Indexer Program (`pine-index.py`)
This program will read all files that are in subdirectory `input/` (assuming they are all text files) and construct an inverted index. Before constructing the inverted index, the program should:
* convert all characters to lower case,
* eliminate punctuation,
* eliminate numbers, and
* perform stemming using the `nltk` stemmer.
For every word, the inverted index should store: a list of documents the word appears in, along with how many times the word appears in each document. Additionally, for every word, there should be a count of how many documents the word appears in (needed to compute its inverse document frequency). There should also be a count of the total number of documents. All this information needs to be in a single data structure (we suggest a *dict*).
Before the program terminates, it should store the created data structure as a JSON object in a file named `inverted-index.json`.
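The index-building steps above can be sketched as follows. This is only one possible layout for the single-dict structure (the `"N"`/`"words"`/`"df"`/`"tf"` keys are illustrative, not required); it falls back to an identity stemmer if `nltk` is not installed so the sketch stays runnable.

```python
# A minimal sketch of one possible index layout -- the exact structure is up
# to you, as long as it keeps all the required counts in a single dict.
import glob
import json
import os
import re
import string

try:
    from nltk.stem import PorterStemmer  # the required nltk stemmer
    stem = PorterStemmer().stem
except ImportError:                      # fallback so this sketch runs anywhere
    stem = lambda w: w

def tokenize(text):
    """Lowercase, drop punctuation and digits, then stem each word."""
    text = text.lower()
    text = re.sub(r"[%s0-9]" % re.escape(string.punctuation), " ", text)
    return [stem(w) for w in text.split()]

def build_index(docs):
    """docs: {filename: text}. Returns the single-dict index."""
    index = {"N": len(docs), "words": {}}
    for fname, text in docs.items():
        for w in tokenize(text):
            entry = index["words"].setdefault(w, {"df": 0, "tf": {}})
            if fname not in entry["tf"]:
                entry["df"] += 1          # document frequency
            entry["tf"][fname] = entry["tf"].get(fname, 0) + 1
    return index

if __name__ == "__main__":
    docs = {f: open(f).read() for f in glob.glob(os.path.join("input", "*"))}
    with open("inverted-index.json", "w") as out:
        json.dump(build_index(docs), out)
```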
> Directory `input/` has nine sample documents (`doc1.txt` … `doc9.txt`) that can be used for testing your program. You are encouraged to test your program with other documents as well; we will grade your programs with additional document collections.
>
#### Pitt INformation retrieval Engine / Search Program (`pine-search.py`)
This program will read the JSON object from file `inverted-index.json` and a set of keywords from file `keywords.txt` and will produce an ordered list of document filenames, along with their relevance scores and a breakdown of weights for all keywords.
The keyword file will contain one set of space-separated keywords per line. It may have one or more lines.
> A sample `keywords.txt` file is provided.
>
The relevance score for each document will be computed as explained in class.
> w(key, doc) = (1 + log2 freq(key, doc)) * log2(N / n(key))
> where n(key) is the number of documents that contain keyword key and N is the total number of documents
>
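The formula above translates directly into code; this sketch assumes the convention that a keyword absent from a document contributes a weight of zero:

```python
import math

def weight(freq, n_key, N):
    """w(key, doc): freq is freq(key, doc), n_key is n(key), N is the
    total number of documents."""
    if freq == 0:
        return 0.0  # keyword does not appear in this document
    return (1 + math.log2(freq)) * math.log2(N / n_key)

# A document's relevance score is then the sum of its keyword weights:
# score(doc) = sum of weight(freq(k, doc), n(k), N) over all keywords k
```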
The output of your program should be in the following format:
```
----------------------------------------
keywords = pittsburgh steelers
[1] file=doc3.txt score=3.532495
weight(pittsburgh)=0.362570
weight(steelers)=3.169925
...
```
> A sample output, for the keywords contained in the `keywords.txt` file is provided in the file `output.txt`
>
Note that in cases of documents with the same relevance score, their ranks are tied and the same number is used for all (for example, there are four documents ranked [5] in the sample output). You should order tied documents by filename.
### Important notes about grading
It is absolutely imperative that your Python programs:
* run without any syntax or other errors (using Python 3) — we will run them using the following command:
`python3 pine-index.py` and
`python3 pine-search.py`
* strictly adhere to the format specifications for input and output, as explained above.
Failure in any of the above will result in **severe** point loss.
### Allowed Python Libraries
You are allowed to use the following Python libraries:
```
argparse
collections
csv
json
glob
math
nltk
os
pandas
re
requests
string
sys
time
xml
```
If you would like to use any other libraries, you must ask permission by Sunday, January 27, 2019, using [piazza](https://cs1656.org).
---
### How to submit your assignment
For this assignment, you must use the repository that was created for you after visiting the classroom link. You need to update the repository to include files `pine-index.py` and `pine-search.py`, as described above, and other files that are needed for running your program. You need to make sure to commit your code to the repository provided.
We will clone all repositories shortly after midnight:
* the day of the deadline **Tuesday, February 5th, 2019**
* 24 hours later (for submissions that are one day late / -5 points), and
* 48 hours after the first deadline (for submissions that are two days late / -15 points).
**No submissions will be accepted that are more than two days late, i.e., after midnight on Thursday, February 7th, 2019.**
### About your github account
It is very important that:
* Your github account can do **private** repositories. If this is not already enabled, you can do it by visiting <https://education.github.com/>
* You use the same github account for the duration of the course.
* You use the github account that you already specified using TopHat.
### Goal
The goal of this assignment is for you to gain familiarity with SQL.
---
### What to do
In this assignment you are asked to:
* update a skeleton Python script (`moviepro.py`) in order to read input from CSV files and insert the data into the `cs1656.sqlite` database, and
* provide SQL queries that answer 12 questions.
The provided skeleton Python script includes database initialization commands and also includes commands to run the SQL queries and store their output in separate output files, which you should not modify. What you should update are the parts of the script that are responsible for reading in the input data (and inserting it into the database) and for specifying the 12 SQL queries.
### Database Schema
The schema of the database is embedded in the `moviepro.py` Python script and should not be modified. It is as follows:
* Actors (aid, fname, lname, gender)
* Movies (mid, title, year, rank)
* Directors (did, fname, lname)
* Cast (aid, mid, role)
* Movie_Director (did, mid)
### Reading input from CSV files
Your program should read input from the following CSV files:
* `actors.csv`, containing data for the Actors table,
* `cast.csv`, containing data for the Cast table,
* `directors.csv`, containing data for the Directors table,
* `movie_dir.csv`, containing data for the Movie_Director table, and
* `movies.csv`, containing data for the Movies table.
All the data should be inserted into the appropriate tables in the `cs1656.sqlite` database. Sample insert statements have been provided in the `moviepro.py` script, but you are not restricted to doing the insertions in exactly the same way.
Samples of all these files are provided as part of this repository.
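The read-and-insert step can be sketched as below. The sample rows and the in-memory database are purely illustrative stand-ins for the real `actors.csv` and `cs1656.sqlite`:

```python
# Sketch of loading one CSV file into one table; in the real script you
# would open actors.csv and connect to cs1656.sqlite instead.
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Actors (aid INT, fname TEXT, lname TEXT, gender TEXT)")

# io.StringIO stands in for open('actors.csv') to keep this sketch runnable.
sample = io.StringIO("1,Kevin,Bacon,M\n2,Jodie,Foster,F\n")
rows = list(csv.reader(sample))
conn.executemany("INSERT INTO Actors VALUES (?, ?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM Actors").fetchone()[0]
```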
### Queries
You are asked to provide SQL queries that provide answers for the following questions. Note that **actors** refers to both male and female actors, unless explicitly specified otherwise. Also note that you should not rely on the data provided in the sample CSV files for any of the answers; the datasets will be replaced with bigger files. Finally, please note that you may define views, etc., as part of other queries.
* **[Q01]** List all the actors (first and last name) who acted in at least one film in the 80s (1980-1990, both ends inclusive) and in at least one film in the 21st century (>=2000). Sort alphabetically, by the actor’s last and first name.
* **[Q02]** List all the movies (title, year) that were released in the same year as the movie entitled `"Rogue One: A Star Wars Story"`, but had a better rank (Note: the higher the value in the *rank* attribute, the better the rank of the movie). Sort alphabetically, by movie title.
* **[Q03]** List all the actors (first and last name) who played in a Star Wars movie (i.e., title like '%Star Wars%') in decreasing order of how many Star Wars movies they appeared in. If an actor plays multiple roles in the same movie, count that still as one movie. If there is a tie, use the actor's last and first name to generate a full sorted order.
* **[Q04]** Find the actor(s) (first and last name) who **only** acted in films released before 1985. Sort alphabetically, by the actor’s last and first name.
* **[Q05]** List the top 20 directors in descending order of the number of films they directed (first name, last name, number of films directed). For simplicity, feel free to ignore ties at the number 20 spot (i.e., always show up to 20 only).
* **[Q06]** Find the top 10 movies with the largest cast (title, number of cast members) in decreasing order. Note: show all movies in case of a tie.
* **[Q07]** Find the movie(s) whose cast has more actresses than actors (i.e., gender=female vs gender=male). Show the title, the number of actresses, and the number of actors in the results. Sort alphabetically, by movie title.
* **[Q08]** Find all the actors who have worked with at least 7 different directors. Do not consider cases of self-directing (i.e., when the director is also an actor in a movie), but count all directors in a movie towards the threshold of 7 directors. Show the actor’s first, last name, and the number of directors he/she has worked with. Sort in decreasing order of number of directors.
* **[Q09]** For all actors whose first name starts with an S, count the movies that he/she appeared in his/her debut year (i.e., year of their first movie). Show the actor’s first and last name, plus the count. Sort by decreasing order of the count.
* **[Q10]** Find instances of nepotism between actors and directors, i.e., an actor in a movie and the director having the same last name, but a different first name. Show the last name and the title of the movie, sorted alphabetically by last name.
* **[Q11]** The Bacon number of an actor is the length of the shortest path between the actor and Kevin Bacon in the *"co-acting"* graph. That is, Kevin Bacon has Bacon number 0; all actors who acted in the same movie as him have Bacon number 1; all actors who acted in the same film as some actor with Bacon number 1 have Bacon number 2, etc. List all actors whose Bacon number is 2 (first name, last name). You can familiarize yourself with the concept, by visiting [The Oracle of Bacon](https://oracleofbacon.org).
* **[Q12]** Assume that the *popularity* of an actor is reflected by the average *rank* of all the movies he/she has acted in. Find the top 20 most popular actors (in decreasing order of popularity): list the actor's first/last name, the total number of movies he/she has acted in, and his/her popularity score. For simplicity, feel free to ignore ties at the number 20 spot (i.e., always show up to 20 only).
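Since the handout allows defining views as part of other queries, here is a hedged sketch of the pattern (the table contents and the view name `RecentMovies` are made up purely to keep the snippet runnable; they are not part of the assignment data):

```python
# Sketch: define a helper view with executescript, then reuse it in a query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Movies (mid INT, title TEXT, year INT, rank REAL);
INSERT INTO Movies VALUES (1, 'A', 1984, 7.0), (2, 'B', 2001, 8.5);
CREATE VIEW RecentMovies AS SELECT * FROM Movies WHERE year >= 2000;
""")
titles = [r[0] for r in conn.execute("SELECT title FROM RecentMovies")]
```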
---
### Important notes about grading
It is absolutely imperative that your Python program:
* runs without any syntax or other errors (using Python 3); we will run it using the following command:
`python3 moviepro.py`
* strictly adheres to the format specifications for input and output, as explained above.
Failure in any of the above will result in **severe** point loss.
### Allowed Python Libraries
You are allowed to use the following Python libraries (although not all are needed):
```
argparse
collections
csv
json
glob
math
os
pandas
re
requests
string
sqlite3
sys
time
xml
```
If you would like to use any other libraries, you must ask permission by Monday, February 11, 2019, using [piazza](https://piazza.cs1656.org).
---
### How to submit your assignment
For this assignment, you must use the repository that was created for you after visiting the classroom link. You need to update the file `moviepro.py` as described above, and add other files that are needed for running your program. You need to make sure to commit your code to the repository provided. We will clone all repositories shortly after midnight:
* the day of the deadline **Sunday, February 17, 2019**
* 24 hours later (for submissions that are one day late / -5 points), and
* 48 hours after the first deadline (for submissions that are two days late / -15 points).
Our assumption is that everybody will submit on the first deadline. If you want us to grade a late submission, you need to email us at `[email protected]`
### About your github account
It is very important that:
* Your github account can do **private** repositories. If this is not already enabled, you can do it by visiting <https://education.github.com/>
* You use the same github account for the duration of the course.
* You use the github account that you specified during the test assignment.
The goal of this assignment is for you to gain familiarity with association rule mining and (in the process) to also advance your Python skills.
---
### What to do — arma.py
In this assignment you are asked to implement a simplified version of the A-Priori **a**ssociation **r**ule **m**ining **a**lgorithm, in Python.
You should name your program `arma.py`. It should be called as follows:
`python arma.py input_filename output_filename min_support_percentage min_confidence`
(or just `python` if Python 3 is your default Python setup)
where:
* `input_filename` is the name of the file that contains the market basket data that is the input to your program. The format for the input file is provided below. A sample file `input.csv` is provided together with this repository.
> **Input Format:** The input data should be provided as a CSV file, in the following format:
`transaction_id, item_1, item_2, item_3, …`
> for example:
```
1, A100, A105, A207
2, A207
3, A100, A105
```
> Notes:
> * Item names could consist of either numbers (0-9) or characters (a-zA-Z) or combinations of numbers and characters. No spaces or punctuation characters are allowed in item names.
> * The CSV files may or may not contain whitespace between values.
* `output_filename` is the name of the file that will store the required output of your program. The file should contain the frequent item sets and the association rules that you discovered after processing the submitted input data. The required format for the output file is provided below. Sample output files (matching the input file provided) are provided together with this repository.
> **Output Format:** The output data should be provided as a CSV file, where every row is in one of the following formats:
`S, support_percentage, item_1, item_2, item_3, …`
> to denote that this is a frequent item**s**et or:
`R, support_percentage, confidence, item_4, item_5, …, '=>', item_6, item_7, …`
> to denote that this is an association **r**ule. The keys "S" and "R" are verbatim; no substitution is needed for them.
> It should be noted that the items listed in the frequent itemset case (item 4, item 5, …) should be in lexicographic order; likewise, the items listed to the left of the => sign in the association rule case (item 4, item 5, …) should be in lexicographic order, and so should the items listed on the right side of the => sign (item 6, item 7, …).
> The `support_percentage` should be the support percentage (expressed as a floating number between 0 and 1 with 4 decimal points) for the specific frequent itemset or for the specific association rule (and both should be above the user-specified min_support_percentage).
> The `confidence` should be the confidence percentage (expressed as a floating number between 0 and 1 with 4 decimal points) for the specific association rule (and should be above the user-specified min_confidence).
> You should list in the output file all the frequent itemsets that you discover in the input file (S) and all the association rules that you can generate using the A-Priori method (R), that satisfy the min support percentage and min confidence requirements.
>
> Here’s an example output file:
```
S, 0.3000, A105
S, 0.2500, A100
S, 0.2000, A100, A207
S, 0.2000, A105, A207
S, 0.1500, A100, A105, A207
R, 0.1500, 0.5000, A105, '=>', A100, A207
```
> Important Note: You should print 4 decimal points for all floating point numbers (e.g., use %.4f in your print statement).
> Note: Your program may print additional messages; these will not be considered. You are encouraged to use this mechanism for debugging or progress reporting purposes. Only the results contained in the output file, in the specified format, will be considered.
> The repository contains three output files as follows:
| Input Filename | Output Filename | Minimum Support Percentage | Minimum Confidence |
| --- | --- | --- | --- |
| input.csv | output.sup=0.5,conf=0.7.csv | 0.5 | 0.7 |
| input.csv | output.sup=0.5,conf=0.8.csv | 0.5 | 0.8 |
| input.csv | output.sup=0.6,conf=0.8.csv | 0.6 | 0.8 |
* `min_support_percentage` is the minimum support percentage for an itemset / association rule to be considered frequent, e.g., 5%. This should be provided as a floating point number (out of 1), e.g., 0.05, 0.4, 0.5 are used to denote 5%, 40%, and 50% respectively. You should not include a percent symbol.
* `min_confidence` is the minimum confidence for an association rule to be significant, e.g., 50%. This should be provided as a floating point number (out of 1), e.g., 0.05, 0.4, 0.5 are used to denote 5%, 40%, and 50% respectively. You should not include a percent symbol.
An example call to your program could be as follows:
`python3 arma.py input.csv output.csv 0.5 0.7`
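Reading the four arguments can be sketched with `sys.argv` (argparse, from the allowed list, would work equally well):

```python
# Minimal sketch of parsing the command line described above.
import sys

def parse_args(argv):
    if len(argv) != 5:
        raise SystemExit(
            "usage: python arma.py input_filename output_filename "
            "min_support_percentage min_confidence")
    return argv[1], argv[2], float(argv[3]), float(argv[4])
```

For example, `parse_args(["arma.py", "input.csv", "output.csv", "0.5", "0.7"])` returns `("input.csv", "output.csv", 0.5, 0.7)`.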
---
### What to do — A-Priori Algorithm
The A-Priori algorithm utilizes the subset property for frequent itemsets, enabling significant pruning of the space of possible itemset combinations. Assuming a provided min support percentage and a min confidence, the i-th step of the algorithm works as follows:
**Step i:**
* Consider all the candidate frequent itemsets of size i; call them CFI(i).
* Count how many times each itemset in CFI(i) appears in the input data. This is the support count, which is turned into the support percentage by dividing by the total number of transactions.
* The itemsets in CFI(i) whose support percentage is at least as much as the min support percentage become the verified frequent itemsets, or VFI(i).
* Using the itemsets in VFI(i), generate all plausible candidate itemsets of size i + 1, i.e., CFI(i + 1). This makes use of the subset property. For example, for ABC to be in CFI(3), all of AB, AC, and BC need to be in VFI(2).
This process starts with CFI(1) being all individual items and terminates on Step k, when CFI(k + 1) is empty.
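The level-wise loop above can be sketched as follows; this is a hedged, unoptimized version (transactions are plain sets of items, and the returned dict maps each verified frequent itemset to its support percentage):

```python
# Compact sketch of the CFI(i) -> VFI(i) -> CFI(i+1) loop described above.
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """transactions: list of sets; min_sup: min support percentage (0..1)."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    cfi = [frozenset([i]) for i in items]          # CFI(1): all single items
    vfi_all = {}
    while cfi:
        # support count of each candidate, turned into a percentage
        counts = {c: sum(1 for t in transactions if c <= t) for c in cfi}
        vfi = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_sup}
        vfi_all.update(vfi)
        # CFI(i+1): unions of VFI(i) sets whose every i-subset is also in VFI(i)
        next_size = len(next(iter(cfi))) + 1
        cand = {a | b for a in vfi for b in vfi if len(a | b) == next_size}
        cfi = [c for c in cand
               if all(frozenset(s) in vfi for s in combinations(c, next_size - 1))]
    return vfi_all
```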
The above process generates all the frequent itemsets, i.e., VFI(i), for 1 <= i <= k. For every frequent itemset we need to generate all possible association rules and keep only the rules whose support is greater than or equal to the min support percentage and whose confidence is greater than or equal to the min confidence. To generate all possible rules from a frequent itemset, we generate all possible 2-partitions of the itemset (one will be the left-hand side of the association rule and the other will be the right-hand side), where neither partition is empty. For example, if {A,B,C} is a frequent itemset, then we should check the following association rules:
* A=>B,C
* B=>A,C
* C=>A,B
* A,B=>C
* A,C=>B
* B,C=>A
and compute their support and confidence. Note that the support of all these rules is the same as the support of the frequent itemset from which they came, i.e., {A,B,C}.
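Enumerating the non-empty 2-partitions above can be sketched with `itertools.combinations`; the confidence of each rule is then support(itemset) / support(lhs):

```python
# Sketch: yield every (lhs, rhs) 2-partition of a frequent itemset,
# with both sides non-empty and each side in lexicographic order.
from itertools import combinations

def candidate_rules(itemset):
    items = sorted(itemset)
    for k in range(1, len(items)):
        for lhs in combinations(items, k):
            rhs = tuple(i for i in items if i not in lhs)
            yield lhs, rhs
```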
---
### Important notes about grading
It is absolutely imperative that your Python program:
* runs without any syntax or other errors (using Python 3) — we will run it using the following command:
`python3 arma.py …`
* strictly adheres to the format specifications for input and output, as explained above.
Failure in any of the above will result in **severe** point loss.
### Allowed Python Libraries
You are allowed to use the following Python libraries:
```
argparse
collections
csv
glob
itertools
math
os
pandas
re
requests
string
sys
```
If you would like to use any other libraries, you must ask permission by Friday, March 8, 2019, using [piazza](https://piazza.cs1656.org).
---
### How to submit your assignment
For this assignment, you must use the repository that was created for you after visiting the classroom link. You need to create the file `arma.py` as described above, and add other files that are needed for running your program. You need to make sure to commit your code to the repository provided. We will clone all repositories shortly after midnight:
* the day of the deadline **Sunday, March 24, 2019** (i.e., 12:15am, Monday, March 25, 2019)
* 24 hours later (for submissions that are one day late / -5 points), and
* 48 hours after the first deadline (for submissions that are two days late / -15 points).
Our assumption is that everybody will submit on the first deadline. If you want us to grade a late submission, you need to email us at `[email protected]`
### About your github account
It is very important that:
* Your github account can do **private** repositories. If this is not already enabled, you can do it by visiting <https://education.github.com/>
* You use the same github account for the duration of the course.
* You use the github account that you specified on TopHat.
The goal of this assignment is for you to gain familiarity with Graph Databases in general, and with Neo4j and its query language, Cypher, in particular.
---
### What to do
In this assignment you are asked to:
* download neo4j locally,
* download the Movies database locally,
* provide Cypher queries that answer 8 questions, and
* write a Python script (`movie-queries.py`) that will run your solutions for the 8 queries and store the query output in a file.
### Database Model
We will use the Movies database <https://neo4j.com/developer/movie-database/>, which has the following node labels:
* Actor
* Director
* Movie
* Person
* User
and the following relationship types (i.e., edge labels):
* ACTS_IN
* DIRECTED
* FRIEND
* RATED
The nodes in the Movies database have a number of attributes, including the following:
* name (for Actor/Director/Person/User)
* birthday (for Actor/Director/Person/User)
* title (for Movie)
* genre (for Movie)
### Setup
You are asked to follow the installation instructions and to utilize the lab material provided through `Recitation 09` (March 22, 2019). This will enable you to have a locally running neo4j server, along with an interactive query interface. You will also be able to download the Movies database directly into neo4j.
Please note that although we will use the same database model for testing your submissions, the database itself will not necessarily be identical to the one you download.
### Connecting to neo4j using Python
As part of this repository, you are provided with a sample Python script (`cypher_sample1.py`) that connects to the local graph database (which you have established by following the previous steps).
### Queries
You are asked to provide Cypher queries that provide answers for the following questions. Note that **actors** refers to both male and female actors, unless explicitly specified otherwise.
* **[Q1]** List the first 20 actors in descending order of the number of films they acted in.
*OUTPUT*: actor_name, number_of_films_acted_in
* **[Q2]** List the titles of all movies with a review with at most 3 stars.
*OUTPUT*: movie title
* **[Q3]** Find the movie with the largest cast, out of the list of movies that have a review.
*OUTPUT*: movie_title, number_of_cast_members
* **[Q4]** Find all the actors who have worked with at least 3 different directors (regardless of how many movies they acted in). For example, 3 movies with one director each would satisfy this (provided the directors were different), but a single movie with 3 directors would satisfy it as well.
*OUTPUT*: actor_name, number_of_directors_he/she_has_worked_with
* **[Q5]** The Bacon number of an actor is the length of the shortest path between the actor and Kevin Bacon in the *"co-acting"* graph. That is, Kevin Bacon has Bacon number 0; all actors who acted in the same movie as him have Bacon number 1; all actors who acted in the same film as some actor with Bacon number 1 have Bacon number 2, etc. *List all actors whose Bacon number is exactly 2* (first name, last name). You can familiarize yourself with the concept, by visiting [The Oracle of Bacon](https://oracleofbacon.org).
*OUTPUT*: actor_name
* **[Q6]** List which genres have movies where Tom Hanks starred in.
*OUTPUT*: genre
* **[Q7]** Show which directors have directed movies in at least 2 different genres.
*OUTPUT*: director name, number of genres
* **[Q8]** Show the top 5 pairs of actor, director combinations, in descending order of frequency of occurrence.
*OUTPUT*: director name, actor name, number of times the director directed said actor in a movie
### Output Format (ignore at your own risk!)
You are asked to store the output for running all Cypher queries by your python script in a **single** file, named `output.txt`. For each query, you should have a header line `### Q1 ###`, followed by the results of the query (one row at a time, with commas separating multiple fields). If you do not provide an answer for the query, you should still print the header line in your output file, but leave a blank line after it. Answers should be ordered by query number and separated by a blank line as well.
For example, for the following question:
Q0: show the 3 oldest actors in the database, with the oldest one first.
*OUTPUT*: name, id
The corresponding Cypher query should be:
```
match (n:Actor) return n.name, n.id order by n.birthday ASC LIMIT 3
```
The output file should be as follows:
```
### Q0 ###
Claudia Cardinale, 4959
Oliver Reed, 936
Anthony Hopkins, 4173
```
Finally, there should be an empty line between different results.
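The per-query formatting described above can be sketched as a small helper; `rows` stands in for whatever your neo4j driver query returns (the function name and the sample rows are illustrative):

```python
# Sketch: format one query's results in the required output.txt layout.
def format_block(qnum, rows):
    lines = ["### Q%d ###" % qnum]
    lines += [", ".join(str(field) for field in row) for row in rows]
    return "\n".join(lines) + "\n"   # caller adds the blank line between blocks

block = format_block(0, [("Claudia Cardinale", 4959), ("Oliver Reed", 936)])
```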
---
### Important notes about grading
It is absolutely imperative that your Python program:
* runs without any syntax or other errors (using Python 3) — we will run it using the following command:
`python3 movie-queries.py`
* generates file `output.txt` with the answers of all 8 queries
* strictly adheres to the format specifications for output, as explained above.
Failure in any of the above will result in **severe** point loss.
### Allowed Python Libraries
You are allowed to use the following Python libraries:
```
argparse
collections
csv
glob
json
math
neo4j.v1
numpy
os
pandas
re
requests
statistics
string
sqlite3
sys
```
If you would like to use any other libraries, you must ask permission by Wednesday, April 3rd, 2019, using [piazza](https://piazza.cs1656.org).
---
### How to submit your assignment
For this assignment, you must use the repository that was created for you after visiting the classroom link. You need to create the file `movie-queries.py` as described above, and add other files that are needed for running your program. You need to make sure to commit your code to the repository provided. We will clone all repositories shortly after midnight:
* the day of the deadline **Friday, April 12th, 2019 (i.e., at 12:15am, Saturday, April 13th, 2019)**
* 24 hours later (for submissions that are one day late / -5 points), and
* 48 hours after the first deadline (for submissions that are two days late / -15 points).
### About your github account
It is very important that:
* Your github account can do **private** repositories. If this is not already enabled, you can do it by visiting <https://education.github.com/>
* You use the same github account for the duration of the course.
* You use the github account that you specified during the test assignment.
The goal of this assignment is to familiarize you with classification systems in general and with decision tree classifiers in particular.
### What to do — dec_tree.py
You are asked to write a Python program, called `dec_tree.py` that will
1. read a decision tree (stored in a plain text file),
2. read a test data set (stored in a csv file, with the first row having the variable names), and
3. evaluate the test data using the provided decision tree and provide statistics.
Your program should be invoked as:
```
python3 dec_tree.py tree.txt test.csv
```
### (1) Decision Tree Format
The decision tree will be provided as a text file and will essentially be the output from the ID3 Decision Tree Classifier.
A sample decision tree is provided below and also included as file `tree.txt` within this repository:
```
color black: bad (2)
color blue
| fruit blueberries: good (2)
| fruit grapes: bad (1)
color green
| fruit blueberries: bad (2)
| fruit grapes: good (2)
color red
| fruit blueberries: bad (1)
| fruit grapes: good (1)
```
The above was generated by the `treegen.py` program which is also included within this repository. The format is fairly straightforward: the above tree corresponds to a two-level decision tree, with `color` being the first variable (valid options: `black`, `blue`, `green`, and `red`) and `fruit` the second variable (valid options: `blueberries` and `grapes`). There are only two labels: `good` and `bad`. The numbers in parentheses denote how many samples each rule was built upon.
Your program **should handle decision trees up to 3 levels deep**.
Please note that although you are encouraged to experiment with the `decision-tree-id3` module (https://svaante.github.io/decision-tree-id3/index.html), used by the `treegen.py` program, as part of preparing your assignment, you are **not allowed to use the decision-tree-id3 module in your submission**.
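One possible way to parse the tree format above into a flat list of rules is sketched below. This is a hedged sketch, not the required approach: it assumes lines look like `var value` or `var value: label (n)` and that `| ` repeats once per nesting level, as in the sample.

```python
# Sketch: turn the indented tree text into (conditions, label) rules.
import re

def parse_tree(lines):
    """Return a list of (conditions, label) rules; conditions is a dict."""
    rules, context = [], {}
    for line in lines:
        depth = line.count("| ")                 # "| " marks one nesting level
        line = line.replace("| ", "").strip()
        var, val, label = re.match(
            r"(\w+) (\w+)(?:: (\w+) \(\d+\))?$", line).groups()
        # keep only the conditions from shallower levels, then add this one
        context = dict(list(context.items())[:depth])
        context[var] = val
        if label is not None:                    # leaf line => complete rule
            rules.append((dict(context), label))
    return rules
```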
### (2) Test Data Format
The test data set will be provided as a CSV file. The first row will contain the variable names. A sample test data file, named `test.csv`, is provided within this repository. The first 3 lines of the file are shown below:
```
"day_of_week", "fruit", "color"
"mon", "blueberries", "black"
"mon", "blueberries", "blue"
```
Please note that the number of variables in the test data set is greater than or equal to the number of variables specified in the decision tree file. In this example, `day_of_week` was not part of the decision tree.
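The quoted, space-after-comma style shown above is handled by `csv.reader` with `skipinitialspace=True`; here `io.StringIO` stands in for the real `test.csv`:

```python
import csv
import io

# io.StringIO replaces open("test.csv") so this sketch is self-contained.
sample = io.StringIO('"day_of_week", "fruit", "color"\n'
                     '"mon", "blueberries", "black"\n')
reader = csv.reader(sample, skipinitialspace=True)
header = next(reader)   # variable names from the first row
first = next(reader)
```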
### (3) How to evaluate the decision tree
Given the decision tree and the test data input files, you are asked to do two things:
1. for each row in the test data set, find which rule from the decision tree it will match against, and
2. keep track of how many times each rule in the decision tree was matched and print these statistics
Your program should only print the statistics for all rules, and its output must follow the same format as the decision tree file.
For example, the correct output for running your program with the provided `tree.txt` and `test.csv` files should be the following (included in the repository as `output.txt`):
```
color black: bad (6)
color blue
| fruit blueberries: good (3)
| fruit grapes: bad (2)
color green
| fruit blueberries: bad (4)
| fruit grapes: good (5)
color red
| fruit blueberries: bad (2)
| fruit grapes: good (4)
UNMATCHED: 1
```
Note that you must include the `UNMATCHED:` line at the end if the test data contain rows that were not matched by any decision tree rule.
**Important Hint** In order to solve this assignment, you are strongly encouraged to read the documentation for Python's `exec()` function:
https://docs.python.org/3/library/functions.html#exec
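One way the hint can pay off, sketched below: build the nested if/elif chain as a string (e.g., from the parsed tree) and `exec()` it once per test row. The rule ids, the counter dict, and the hard-coded code string here are all illustrative:

```python
# Sketch: a code string standing in for one generated from the decision tree.
code = """
if row['color'] == 'black':
    counts['rule0'] += 1
elif row['color'] == 'blue':
    if row['fruit'] == 'blueberries':
        counts['rule1'] += 1
"""

counts = {"rule0": 0, "rule1": 0}
for row in [{"color": "black", "fruit": "grapes"},
            {"color": "blue", "fruit": "blueberries"}]:
    exec(code)   # evaluates the generated chain against the current row
```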
### Important: special-cases.txt
If you do something in your code that you would consider a special case, then you are requested to submit an extra file, along with your submission, named `special-cases.txt`, in which you describe in plain text what the special case(s) is/are and how you handled it/them in your program. We will use this mechanism instead of asking such questions on piazza.
### Important notes about grading
It is absolutely imperative that your Python program:
* runs without any syntax or other errors (using Python 3)
* strictly adheres to the format specifications for input and output, as explained above.
Failure in any of the above will result in **severe** point loss.
### Allowed Python Libraries (Updated)
You are allowed to use the following Python libraries (although only a fraction of these will actually be needed):
```
argparse
collections
csv
json
glob
math
numpy
os
pandas
re
requests
string
sys
time
```
If you would like to use any other libraries, you must ask permission by Sunday, April 14th, 2019, using [piazza](https://cs1656.org).
### About your github account
It is very important that:
* Your github account can do **private** repositories. If this is not already enabled, you can do it by visiting <https://education.github.com/>
* You use the same github account for the duration of the course
* You use the github account that you specified at the beginning of the course
### How to submit your assignment
For this assignment, you must use the repository that was created for you after visiting the classroom link. You need to update the repository to include your own python files as described above, and other files that are needed for running your program. You need to make sure to commit your code to the repository provided. We will clone all repositories shortly after midnight the day of the deadline **Saturday, April 20th, 2019 (i.e., at 12:15am, Sunday, April 21st, 2019)**. There are no late submissions allowed for this assignment.
