
[SOLVED] EECS 484 Projects 1 to 4 Solutions

$25

File Name: Eecs_484_project_1_to_4_solutions.zip
File Size: 310.86 KB


Overview In Project 1, you will design a relational database for storing information about your Fakebook social network. You will begin with a detailed description of the content. Then, you will need to systematically go through the conceptual and logical database design process you learned about in class. You can do the project either alone or in a group of two. If working in a group, a single submission is required. Part 1: ER Design You should do this to derive constraints for the relational model and to make sure you can go from requirements to ER model to Relational model. As a starting point, we have done the initial “requirements analysis” for you. The following is a brief description of the data that you will store in your database. (In real life, you would probably begin with much fuzzier information.) All IDs in the specs below are, of course, unique. User Information: There can be an unlimited number of users. Each user has the following information: ● Profile information This includes the following attributes: user ID, first name, last name, year of birth, month of birth, date of birth, gender. ● Hometown Location A user’s hometown includes the following attributes: city, state, country. ● Current Location Exactly the same attributes as hometown location. ● Education History A user’s educational history contains information on each college program attended, if any, with each college program attended containing the following attributes: name of the institution (e.g., University of Michigan), year of graduation, concentration (e.g., CS, EE, etc.), and degree (e.g., BS, MS, PhD, etc.). ● Friendship information Each user can have any number of friends. Each friend must also be a Fakebook user. Photos “Photos” is an important Fakebook application. It records the following information: ● Album information Each photo MUST belong to exactly one album. An album has the following attributes: album_ID, owner_ID (this refers to the owner’s Fakebook ID), album_name, cover_photo_ID (this refers to a photo ID), album_created_time, album_modified_time , album_link and album_visibility. ● Other information Each photo has the following attributes: photo_ID, photo_caption, photo_created_time, photo_modified_time, and photo_ link. ● Photo Tags Users can also interact by tagging each other. A photo tag identifies a Fakebook user in a photo. It has the following associated attributes: tag_photo_id (a Fakebook photo ID), tag_subject_id (a Fakebook user ID), tag_x_coordinate and tag_y_coordinate, and tag_created_time The database does not track who did the tagging. Note that there can be multiple tags at exactly the same (x, y) location. However, there can be only ONE tag for each subject in the photo; Fakebook doesn’t allow multiple tags for the same subject in a single photo. For example, you cannot tag Lady Gaga twice in a photo, even if she appears to be at two separate locations in the photo. Messages: Users can also send private messages to each other. ● Message information sender_ID (a Fakebook user ID), receiver_id (a Fakebook user ID), message_content (the text of the message), and sent_time In this version of Fakebook, there are no group messages. A user can, of course, send zero or more messages to different users. Events: “Events” is another useful Fakebook feature. 
● Basic event information event_ID, event_creator_id (Fakebook user who created the event), event_name, event_tagline, event_description, event_host (this is a string, not a Fakebook user), event_type, event_subtype, event_location, event_city, event_state, event_country, event_start_time, and event_end_time ● Event participants Participants in an event must be Fakebook users. Each participant must have a confirmation status value (attending, declined, unsure, or not‐replied). The sample data does not have information on Event Participants, so you can leave the information on Participants empty. Task for Part 1 Your task in Part 1 is to perform “Conceptual Database Design” using ER Diagrams. There are many ER variants, but for this project, we expect you to use the conventions from the textbook and lecture. Hints for Part 1 You need to identify the entity sets and relationship sets in a reasonable way. We expect there to be multiple correct solutions; ER design is somewhat subjective. Your goal should be to capture the given information using ER constructs that you have learned about in class (participation constraints, key constraints, weak entities, ISA hierarchies and aggregation) as necessary. For the entity set, relationship set and attribute names, you can use the ones we have provided here, or you may also choose your own names, as long as they are intuitive and unambiguous. Before you get started, you should also read Appendix to understand the specifics of the data. Some of the ER diagram constraints are in Appendix. Part 2: Logical Database Design For the second part of the project, your task is to convert your ER diagrams into relational tables. You are required to write SQL DDL statements for this part. You should turn in two files: 1. createTables.sql 2. dropTables.sql As a starting point, we are giving you a set of tables, along with some columns. Your design must use these tables. But, you will need to add any integrity constraints so that the schema is as close to enforcing the requirements as is practical. You can add additional columns as well. Use the most appropriate types for the fields as well.  
The required tables and their schema are given below:

USERS: USER_ID (NUMBER), FIRST_NAME (VARCHAR2(100)), LAST_NAME (VARCHAR2(100)), YEAR_OF_BIRTH (INTEGER), MONTH_OF_BIRTH (INTEGER), DAY_OF_BIRTH (INTEGER), GENDER (VARCHAR2(100))
FRIENDS: USER1_ID (NUMBER), USER2_ID (NUMBER)
CITIES: CITY_ID (INTEGER), CITY_NAME (VARCHAR2(100)), STATE_NAME (VARCHAR2(100)), COUNTRY_NAME (VARCHAR2(100))
USER_CURRENT_CITY: USER_ID (NUMBER), CURRENT_CITY_ID (INTEGER)
USER_HOMETOWN_CITY: USER_ID (NUMBER), HOMETOWN_CITY_ID (INTEGER)
MESSAGE: MESSAGE_ID (INTEGER), SENDER_ID (NUMBER), RECEIVER_ID (NUMBER), MESSAGE_CONTENT (VARCHAR2(2000)), SENT_TIME (TIMESTAMP)
PROGRAMS: PROGRAM_ID (INTEGER), INSTITUTION (VARCHAR2(100)), CONCENTRATION (VARCHAR2(100)), DEGREE (VARCHAR2(100))
EDUCATION: USER_ID (NUMBER), PROGRAM_ID (INTEGER), PROGRAM_YEAR (INTEGER)
USER_EVENTS: EVENT_ID (NUMBER), EVENT_CREATOR_ID (NUMBER), EVENT_NAME (VARCHAR2(100)), EVENT_TAGLINE (VARCHAR2(100)), EVENT_DESCRIPTION (VARCHAR2(100)), EVENT_HOST (VARCHAR2(100)), EVENT_TYPE (VARCHAR2(100)), EVENT_SUBTYPE (VARCHAR2(100)), EVENT_LOCATION (VARCHAR2(100)), EVENT_CITY_ID (INTEGER), EVENT_START_TIME (TIMESTAMP), EVENT_END_TIME (TIMESTAMP)
PARTICIPANTS: EVENT_ID (NUMBER), USER_ID (NUMBER), CONFIRMATION (VARCHAR2(100))
ALBUMS: ALBUM_ID (VARCHAR2(100)), ALBUM_OWNER_ID (NUMBER), ALBUM_NAME (VARCHAR2(100)), ALBUM_CREATED_TIME (TIMESTAMP), ALBUM_MODIFIED_TIME (TIMESTAMP), ALBUM_LINK (VARCHAR2(2000)), ALBUM_VISIBILITY (VARCHAR2(100)), COVER_PHOTO_ID (VARCHAR2(100))
PHOTOS: PHOTO_ID (VARCHAR2(100)), ALBUM_ID (VARCHAR2(100)), PHOTO_CAPTION (VARCHAR2(2000)), PHOTO_CREATED_TIME (TIMESTAMP), PHOTO_MODIFIED_TIME (TIMESTAMP), PHOTO_LINK (VARCHAR2(2000))
TAGS: TAG_PHOTO_ID (VARCHAR2(100)), TAG_SUBJECT_ID (NUMBER), TAG_CREATED_TIME (TIMESTAMP), TAG_X (NUMBER), TAG_Y (NUMBER)

Keep the table and field names exactly as written above. Also, make sure you use the correct field types (e.g., NUMBER or INTEGER) as specified above. Failure to do so may cause you to fail the autograder, since the database is case and type sensitive. (Note: the ID types for the various fields would normally be INTEGERs in practice, but they are not in this project for non-technical reasons, primarily that the input data sets we are importing contain non-integer types for keys -- use it as a learning moment for dealing with IDs of different types!) You need to decide which fields will be primary keys and which fields will be foreign keys (if necessary). Use the smallest candidate keys possible for primary keys.

Hints for Part 2
You should capture as many constraints from your ER diagrams as possible in your createTables.sql file. In your dropTables.sql, you should write the DROP TABLE statements necessary to destroy the tables you have created. (Also, for your own learning, it is good to check whether your ER diagrams map to the tables we gave you. If not, you may want to discuss this in discussion, on Piazza, or in office hours to determine whether the fault lies in your ER diagrams or in the tables we gave you.)
Using Oracle SQL*Plus, you can run your .sql files with the following commands:
sqlplus <username>/<password> @dropTables.sql
sqlplus <username>/<password> @createTables.sql
Please double-check that you can run the following sequence without errors in a single SQL script. Otherwise, you may fail our auto-grading scripts. Also remember to drop any triggers, constraints, etc., that you created.
● createTables.sql
● dropTables.sql
● createTables.sql
● dropTables.sql
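To make the Part 2 requirements concrete, here is a minimal sketch of what the first statements of createTables.sql might look like. It only covers USERS and FRIENDS; the constraint names and the CHECK that orders each friend pair are illustrative choices, not requirements, and your own design may capture the constraints differently.

CREATE TABLE USERS (
    USER_ID        NUMBER        PRIMARY KEY,
    FIRST_NAME     VARCHAR2(100) NOT NULL,
    LAST_NAME      VARCHAR2(100) NOT NULL,
    YEAR_OF_BIRTH  INTEGER,
    MONTH_OF_BIRTH INTEGER,
    DAY_OF_BIRTH   INTEGER,
    GENDER         VARCHAR2(100)
);

CREATE TABLE FRIENDS (
    USER1_ID NUMBER NOT NULL,
    USER2_ID NUMBER NOT NULL,
    -- store each unordered pair once; forcing USER1_ID < USER2_ID is one way
    -- to make (2,7) and (7,2) impossible to store together
    CONSTRAINT FRIENDS_PK    PRIMARY KEY (USER1_ID, USER2_ID),
    CONSTRAINT FRIENDS_FK1   FOREIGN KEY (USER1_ID) REFERENCES USERS (USER_ID),
    CONSTRAINT FRIENDS_FK2   FOREIGN KEY (USER2_ID) REFERENCES USERS (USER_ID),
    CONSTRAINT FRIENDS_ORDER CHECK (USER1_ID < USER2_ID)
);

-- dropTables.sql then drops in reverse dependency order, e.g.:
-- DROP TABLE FRIENDS;
-- DROP TABLE USERS;

The same pattern (a primary key, foreign keys back to USERS or ALBUMS, and NOT NULL where the description says a value must exist) extends to the remaining tables.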
Part 3: Populate Your Database
For this part of the project, you will populate your database with Fakebook data (please see Appendix 1 on where to find the dataset and its description). You should turn in the set of SQL statements (DML) to load data from the public tables (PUBLIC_USER_INFORMATION, etc.) into your tables. You should put all the statements into a file called “loadData.sql”.

Hints for Part 3
There will be some variation depending on the schema that you choose. In most cases, however, you can load data into your tables using very simple SQL commands. Please double-check that you can run the following sequence without errors in a single SQL script. Otherwise, you may fail our auto-grading scripts. Also remember to drop any triggers, constraints, etc., that you created.
● createTables.sql
● loadData.sql
● dropTables.sql
Your loadData.sql must load from our PUBLIC dataset given in the Appendix, not from a private copy. We will be testing your system against hidden datasets and therefore need your loadData.sql to load from the specified dataset. Otherwise, you will fail our tests.
One concern you might have is how to handle the constraint regarding Friend data. For this project, when loading the data, ensure that only the necessary data is loaded. For example, if the original data contains (2,7) and (7,2), only load one of these two rows; loading both, or neither, would be incorrect. (One way to write this load is sketched at the end of this project description.) After the data has been loaded, you only need to ensure that any insertion of new data does not break the no-duplication constraint. This can be done either by rejecting any insert or batch insert that would violate the constraint, or by accepting only the valid rows and rejecting the rest. The first option tends to be easier.

Part 4: Create Views on Your Database
As a final task, you will create some views on your tables. Here is what we would like: define views that recreate the same schemas as the PUBLIC tables (see Appendix). The rows in a view do not have to be in exactly the same order as in the corresponding table in the PUBLIC dataset, but the schema must be identical: the columns must have identical names and types. You can check the schema of the PUBLIC tables by using the “DESC TableName” command. Because the original data in the public dataset satisfies all the integrity constraints, each view will have the same set of rows as the corresponding input table.
Name your views as follows (the correspondence to the public tables should be obvious -- see the Appendix later):
● VIEW_USER_INFORMATION
● VIEW_ARE_FRIENDS
● VIEW_PHOTO_INFORMATION
● VIEW_TAG_INFORMATION
● VIEW_EVENT_INFORMATION
Turn in the following files that create and drop the views:
● createViews.sql
● dropViews.sql

Hints for Part 4:
1. You should check that the following sequence works correctly in a single script (no errors).
● createTables.sql
● loadData.sql
● createViews.sql
● dropViews.sql
● dropTables.sql
2. You should also check, for the provided dataset, that createViews.sql results in tables identical to the provided tables. For example, the following two statements check whether the public dataset is the same as your created view:
○ SELECT * FROM keykholt.PUBLIC_USER_INFORMATION MINUS SELECT * FROM VIEW_USER_INFORMATION;
○ SELECT * FROM VIEW_USER_INFORMATION MINUS SELECT * FROM keykholt.PUBLIC_USER_INFORMATION;
If both queries return no results, then the rows in both are the same.
3.
It is not necessary to exactly recreate the PUBLIC_ARE_FRIENDS table since it is not guaranteed that for every (x,y) row entry, there is a corresponding (y,x) entry. For the VIEW_ARE_FRIENDS, the requirement is that for every (x,y) entry in the public dataset, it either has a (x,y) or (y,x) entry, but not both. For example, if the public dataset has both (2,7) and (7,2), your view should contain only (2,7) or (7,2). 4. DOUBLE CHECK YOUR SCHEMA! Many students forget to check their schema and fail the autograder because of it. Submission Checklist Please put all your files in a single zip file called project1.zip and submit a single file. We will post instructions later. 1. partner.txt:  List the partners who worked together. If you worked alone, just list your name. Remember to follow the Engineering Honor Code and Course policies in Lecture00. 2. ER Diagram. Filename should be er.pdf. 3. Five SQL files a. createTables.sql (Part 2) b. dropTables.sql (Part 2) c. loadData.sql (Part 3) d. createViews.sql (Part 4) e. dropViews.sql (Part 4) Both partners should submit separately, even if the submission is identical. Late policy applies individually (even if submissions are identical). How to create a zip file? Log into a Linux machine. Put all your submission files into one folder % zip ‐r  project1.zip partner.txt er.pdf createTables.sql dropTables.sql loadData.sql createViews.sql dropViews.sql You MUST create the zip file using the above command as exactly typed. That ensures that you include the correct set of files with exactly the right names. You can add in a README.txt file if you wish as well for any additional information. To test that your zip file contains everything, email or copy the zip to another machine or folder and unzip it to make sure you are able to extract all the files. We will update the instructions on how precisely submit your zip file to us by next week. Appendix: Description of the Fake data set for Part 3 This section describes the format of the fake data we will provide you to load into your database Fake social network data Everyone will have access to a fake data set, which is designed to emulate a social network dataset. The fake data includes the following five tables: PUBLIC_USER_INFORMATION PUBLIC_ARE_FRIENDS PUBLIC_PHOTO_INFORMATION PUBLIC_TAG_INFORMATION PUBLIC_EVENT_INFORMATION These tables are stored in the GSI’s account (keykholt). You can access the public tables for the fake data using GSI’s account name (keykholt). For example, to access the PUBLIC_USER_INFORMATION table, you need to refer to the table name as keykholt.PUBLIC_USER_INFORMATION. You can copy the data into your own account with the following command: CREATE TABLE NEW_TABLE_NAME AS (SELECT * FROM keykholt.TABLE_NAME); The data will then be stored into your personal Oracle space. You can login to SQL*Plus to browse the data. The fake data tables we provide actually give you some hints on the previous parts of the assignment. However, these tables are highly “denormalized” (poorly designed), and without any table constraints. As mentioned earlier, the table names are: PUBLIC_USER_INFORMATION PUBLIC_ARE_FRIENDS PUBLIC_PHOTO_INFORMATION PUBLIC_TAG_INFORMATION PUBLIC_EVENT_INFORMATION The fields of those tables are as follows: PUBLIC_USER_INFORMATION table: 1. USER_ID This is the Fakebook unique ID for users 2. FIRST_NAME Every user MUST have a first name on file 3. LAST_NAME Every user MUST have a last name on file 4. YEAR_OF_BIRTH Some users may not provide this information 5. 
MONTH_OF_BIRTH Some users may not provide this information 6. DAY_OF_BIRTH Some users may not provide this information 7. GENDER Some users may not provide this information 8. HOMETOWN_CITY       Some users may not provide this information 9. HOMETOWN_STATE Some users may not provide this information 10. HOMETOWN_COUNTRY Some users may not provide this information 11. CURRENT_CITY Some users may not provide this information 12. CURRENT_STATE Some users may not provide this information 13. CURRENT_COUNTRY Some users may not provide this information 14. INSTITUTION_NAME       Some users may not provide this information. A single person may have studied in multiple       institutions (college and above). 15. PROGRAM_YEAR       Some users may not provide this information. A single person may have enrolled in multiple      programs. 16. PROGRAM_CONCENTRATION      Some users may not provide this information. This is like a short description of the program. 17. PROGRAM_DEGREE       Some users may not provide this information. PUBLIC_ARE_FRIENDS table:  1. USER1_ID 2. USER2_ID Both USER1_ID and USER2_ID refer to the values in the USER_ID field of the USER_INFORMATION table. If two users appear on the same row, it means they are friends; otherwise they are not friends. A pair of users should only appear once in the table (i.e., a pair should only appear in one of the two possible orders). PUBLIC_PHOTO_INFORMATION table:  All attributes must be present unless otherwise specified 1. ALBUM_ID ALBUM_ID is the Fakebook unique ID for albums. 2. OWNER_ID User ID of the album owner. 3. COVER_PHOTO_ID Each album MUST have one cover photo and the photo must be in the album. The values are the Fakebook unique IDs for photos. 4. ALBUM_NAME 5. ALBUM_CREATED_TIME 6. ALBUM_MODIFIED_TIME 7. ALBUM_LINK The unique URL directly to the album 8. ALBUM_VISIBILITY        It is one of the following values: EVERYONE, FRIENDS_OF_FRIENDS, FRIENDS, MYSELF, CUSTOM 9. PHOTO_ID This is the Fakebook unique ID for photos. 10. PHOTO_CAPTION An arbitrary string describing the photo. This field is not necessarily populated. 11. PHOTO_CREATED_TIME 12. PHOTO_MODIFIED_TIME 13. PHOTO_LINK The unique URL directly to the photo PUBLIC_TAG_INFORMATION table: All attributes must be populated. 1. PHOTO_ID Unique Id of the corresponding photo 2. TAG_SUBJECT_ID Unique Id of the corresponding user 3. TAG_CREATED_TIME 4. TAG_X_COORDINATE 5. TAG_Y_COORDINATE PUBLIC_EVENT_INFORMATION table: All required unless otherwise specified 1. EVENT_ID This is the Fakebook unique ID for events. 2. EVENT_CREATOR_ID Unique Id of the user who created this event 3. EVENT_NAME 4. EVENT_TAGLINE Not necessarily provided 5. EVENT_DESCRIPTION Not necessarily provided 6. EVENT_HOST 7. EVENT_TYPE Fakebook has a fixed set of event types to choose from a drop‐down menu. 8. EVENT_SUBTYPE Fakebook has a fixed set of event subtypes to choose from a drop‐down menu. 9. EVENT_LOCATION User entered arbitrary string. For example, “my backyard”. Not necessarily provided 10. EVENT_CITY Not necessarily provided.  11. EVENT_STATE Not necessarily provided. 12. EVENT_COUNTRY Not necessarily provided. 13. EVENT_START_TIME 14. EVENT_END_TIME Oracle and SQL*Plus This section describes how to get started using Oracle and SQL*Plus. Logging in to your Oracle Account First, connect to login.engin.umich.edu using SSH with your UMich account (uniqname and Kerberos password). 
Then execute:
module load eecs484/f16
NOTE: if you add the “module load” command to your ~/.bash_profile, it will always be executed when you log in to your CAEN account. Then, to connect to the Oracle server, you just have to enter the sqlplus command. Enter the user name and password for your Oracle account to log in. The default password is eecsclass. When you log in the first time, you will be prompted to change your password. Oracle passwords can contain any alphanumeric characters plus underscore (_), dollar ($), and number sign (#). Do not use quotation marks or the @ symbol in your new password. If you do, and find that you cannot log in, email one of the instructors to reset your password. After that, you can type SQL commands to interact with the database system. Note that you must end every statement you want to execute with a semicolon. To disconnect from Oracle, execute:
EXIT
Try this early! If you have trouble accessing your Oracle account, please speak to the GSI.
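Tying Parts 3 and 4 together for the trickiest case, the FRIENDS data, below is a sketch of the relevant fragments of loadData.sql and createViews.sql. It assumes a FRIENDS table that stores each pair exactly once with USER1_ID < USER2_ID (as sketched earlier); if your schema canonicalizes pairs differently, the queries change accordingly.

-- loadData.sql (fragment): load each friendship once, whichever order(s)
-- it appears in within the public data.
INSERT INTO FRIENDS (USER1_ID, USER2_ID)
SELECT DISTINCT LEAST(USER1_ID, USER2_ID),
                GREATEST(USER1_ID, USER2_ID)
FROM keykholt.PUBLIC_ARE_FRIENDS;

-- createViews.sql (fragment): the view must expose each pair once, in either
-- order, so the stored table can be exposed directly.
CREATE VIEW VIEW_ARE_FRIENDS (USER1_ID, USER2_ID) AS
SELECT USER1_ID, USER2_ID FROM FRIENDS;

-- dropViews.sql would then contain the matching DROP VIEW VIEW_ARE_FRIENDS;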

Overview
While Project 1 focused primarily on database design, in this project you will focus on writing SQL queries. In addition, you will embed your SQL queries into Java source code (using JDBC) to implement “Fakebook Oracle,” a standalone program that answers several queries about the Fakebook database. For this project, you will use a standardized schema that we provide to you, rather than the schema that you designed in Project 1. You will also have access to our public Fakebook dataset for testing.

1. Files Provided to You
You are provided with 3 Java files: TestFakebookOracle.java, FakebookOracle.java, and MyFakebookOracle.java. We have also provided a jar file, ojdbc6.jar. Place all 4 of these files, including the jar file, under a folder named project2. In addition, we have provided a file solution-public.txt showing sample query results. When submitting your completed project, you only need to turn in MyFakebookOracle.java.
1) TestFakebookOracle.java
This file provides the main function for running the program. You can use it to run and test your program, but you don’t need to turn it in. Please only modify the oracleUserName and password static variables, replacing them with your own information.
2) FakebookOracle.java
DO NOT modify this file, although you are welcome to look at the contents if you are curious. This class defines the query functions (discussed below in Section 3) as abstract functions, which you must implement for Project 2. It also defines some useful data structures and provides formatted print functions for you to make use of.
3) MyFakebookOracle.java
This is a subclass of FakebookOracle, in which you must implement the query functions. You should ONLY fill in the body of each query function. DO NOT make any other changes. In this project, you only need to store the results of the queries in our predefined data structures (which we have provided as member variables in the class). You don’t need to worry about output formatting. In the base class FakebookOracle.java, a set of print functions has been provided for you to view the query results. The MyFakebookOracle class contains parameterized names for the tables you will need to use in your queries, and they are constructed in the class constructor in the provided code. You should always use the corresponding variable when you are referring to a table in a SQL statement to be executed through JDBC. For example, you should use the variable cityTableName instead of a constant value such as ‘PUBLIC_CITIES’ in your Java code. For creating and managing the database connection, you should use the predefined object oracleConnection.
4) solution-public.txt
This file contains the query results from running our official solution implementation on the public dataset provided to you. You can use it to check whether or not your queries are generating the same results from the same input dataset. Note that your submission will be graded using a different input dataset, so producing correct results on the public dataset is not a guarantee that your solution is entirely correct. Make sure that your queries are designed to work more generally on any valid input dataset, not just the sample data we provide. Also, think carefully about the semantics of your queries, since it may not always be possible to test them in all scenarios and you often will not have the benefit of knowing the correct answers in practice.
2. Tables
For this project, your schema will consist of the following tables (here <prefix> is a placeholder for the schema prefix):
1. <prefix>_USERS
2. <prefix>_FRIENDS
3. <prefix>_CITIES
4. <prefix>_PROGRAMS
5. <prefix>_USER_CURRENT_CITY
6. <prefix>_USER_HOMETOWN_CITY
7. <prefix>_EDUCATION
8. <prefix>_USER_EVENTS
9. <prefix>_PHOTOS
10. <prefix>_ALBUMS
11. <prefix>_TAGS
To access the public Fakebook data, <prefix> should be replaced with “PUBLIC”. The public data tables are stored in the GSI’s account (tajik), so you should use the GSI’s account name in order to access the public tables directly within SQL*Plus for testing your queries. For example, to access the public USERS table, you should refer to the table as tajik.PUBLIC_USERS. In the Java files provided to you, the above table names are already pre-configured in the given code.

3. Queries (100 points)
There are 10 queries in total (Query 0 to Query 9). Query 0 is provided to you as an example, and you are left to implement the remaining nine. The points are as shown. The queries vary tremendously in difficulty; if you get stuck on a harder query, try an easier one first and come back to the tough one later. For all of these queries, sample answers on the given data are available in the attached zip file. If the English description is ambiguous, please look at the sample answers. Also, for all of these queries, when feasible, you should try to do most of the heavy lifting within SQL. For example, if a query requires you to present the data in sorted order, use ORDER BY in your query rather than retrieving the result and then sorting it within Java. Also, the grading program we use imposes a time limit on how long it waits for a query. If a query appears to be taking too much time, you should consider rewriting it in a different way to make it faster. Nested queries are usually more expensive to run.

Query 0: Find information about month of birth (0 points)
This function has been implemented for you in MyFakebookOracle.java, so that you can use it as an example. The function computes the month in which the most users were born and the month in which the fewest users were born. The names of these users are also retrieved. The sample function uses the Connection object, oracleConnection, to build a Statement object. Using the Statement object, it issues a SQL query and retrieves a ResultSet. It iterates over the ResultSet object and stores the necessary results in a Java object. Finally, it closes both the Statement and the ResultSet objects.

Query 1: Find information about names (10 points)
The next query asks you to find information about users’ names, including 1) the longest first name, 2) the shortest first name, and 3) the most common first name. If there are ties, you should include all of the matches in your result. The provided code snippet illustrates the data structures that should be constructed. However, it is up to you to add your own JDBC query to answer the question correctly.

Query 2: Find “lonely” users (10 points)
The next query asks you to find information about all users who have no friends in the network. Again, you will place your results into the provided data structures. The sample code in MyFakebookOracle.java illustrates how to do this.

Query 3: Find “world travelers” (10 points)
The next query asks you to find information about all users who no longer live in their hometowns. In other words, the current_city associated with these users should NOT be the same as their hometown_city (neither should be null).
You will place your result into the provided data structures. Query 4: Find information about photo tags (10 points) For this query, you should find the top n photos that have the most tagged users. You will also need to retrieve information about each of the tagged users. If there are ties (i.e. photos with the same number of tagged users), then choose the photo with smaller id first. This will be string lexicographic ordering since the data types are VARCHARs (for instance, “10” will be less than “2”). Query 5: Find users to set up on dates (15 points) For this task, you should find the top n “match pairs” according to the following criteria: (1) One of the users is female, and the other is male (Note: they do not have to be friends of the same person) (2) Their age difference is <= yearDiff (just compare the years of birth for this). (3) They are not friends with one another (4) They should be tagged together in at least one photo You should return up to n “match pairs.” If there are more than n match pairs, you should break ties as follows: (1) First choose the pairs with the largest number of shared photos (2) If there are still ties, choose the pair with the smaller user_id for the female (3)If there are still ties, choose the pair with the smaller user_id for the male 6 Query 6: Suggest friends based on mutual friends (15 points) For this part, you will suggest potential friends to a user based on mutual friends. In particular, you will find the top n pairs of users in the database who have the most common friends, but are not friends themselves. Your output will consist of a set of pairs (user1_id, user2_id). No pair should appear in the result set twice; you should always order the pairs so that user1_id < user2_id. If there are ties, you should give priority to the pair with the smaller user1_id. If there are still ties, then give priority to the pair with the smaller user2_id. Finally, please note that the _FRIENDS table only records binary friend relationships once, where user1_id is always smaller than user2_id. That is, if users 11 and 12 are friends, the pair (11,12) will appear in the _FRIENDS table, but the pair (12,11) will not appear. Query 7: Find the most popular states to hold events (10 points) Find the name of the state with the most events, as well as the number of events in that state. If there is a tie, return the names of all the tied states. Again, you will place your result in the provided data structures, as demonstrated in MyFakebookOracle.java . Query 8: Find oldest and youngest friends (10 points) Given the user_id of a user, your task is to find the oldest and youngest friends of that user. If two friends are exactly the same age, meaning that they were born on the same day, month, and year, then you should assume that the friend with the larger user_id is older. Query 9: Find the pairs of potential siblings (10 points) A pair of users are potential siblings if they have the same last name and hometown, if they are friends, and if they are less than 10 years apart in age. While doing this, you should compute the year-wise difference and not worry about months or days. Pairs of siblings are returned with the lower user_id user first. They are ordered based on the first user_id and, in the event of a tie, the second user_id. 7 4. Compiling and running your code You are provided with an Oracle JDBC Driver (ojdbc6.jar). This driver has been tested with Java JDK 1.7 and JDK 1.8. 
If you are unsure which Java development environment you prefer to use, we suggest that you develop your code in Eclipse. You can do this by creating a Java Project called ‘project2’ inside Eclipse, and then Importing the 3 Java source files to the project. You should also add your JDBC driver’s JAR to Eclipse’s classpath. To do this in Eclipse, go to ‘Project Settings’ then ‘Java Build Path’, and then click on the ‘Libraries’ tab, then ‘Add External JAR’. If you prefer, you can just use an editor (e.g. vi or emacs) to develop your code. In this case, you should create a directory called ‘project2’ and put the three Java source files provided to you in this directory. To compile your code you should change to the directory that contains ‘project2’. In other words, suppose you created the directory ‘project2’ in /your/home/Private/EECS484 . cd /your/home/Private/EECS484 Then, you can compile the Java source files as follows: javac project2/FakebookOracle.java project2/MyFakebookOracle.java project2/TestFakebookOracle.java You can run your program as follows (note that you should set the class path (-cp) for your copy of ojdbc6.jar): java -Xmx64M -cp “project2/ojdbc6.jar:” project2/TestFakebookOracle Note the colon (:) after ojdbc6.jar. Connect from Campus or over a VPN If you get a timeout error from Oracle, make sure you connect from campus or use the University VPN to connect (see this page for information: https://www.itcom.itd.umich.edu/vpn/. It 8 is possible that the guest wireless network may not work for remote access to the database without being on the VPN. Alternatively, use UM Wireless or a CAEN machine directly for your development. 5. Testing A good strategy to write embedded SQL is to first test your SQL statements in a more interactive database client such as SQL*Plus before writing them inside Java code, especially for very complex SQL queries. You have the public dataset available to test your application. We provide you with the output from our official solution querying against the public data (available in solution-public.txt ). You can compare your output with ours to debug. During grading, we will run your code on a second (hidden) dataset. 6. Submission and Grading You only need to turn in MyFakebookOracle.java . Please submit it by going to the online autograder: https://grader484.eecs.umich.edu If you are working in a team, then you should join a team first and then submit. Both partners must submit individually, even for a group project. While no online autograder is currently available, we will be grading your answers to the queries using an automated script, so it is important that you adhere to the given instructions, and that your file, MyFakebookOracle.java , works correctly with an unmodified version of FakebookOracle.java . We might later adjust (reduce) the autograder’s score if we notice poor Java/SQL programming style. Here are the key elements: ● For each of these tasks, think carefully about what computation is best done in SQL, and what computation is best done in Java. Of course, you can always compute the “correct” answer by fetching all of the data from Oracle, and loading it into Java virtual memory, but this is very inefficient! Instead, most of the computation can (and should!) be done in SQL, and you should only retrieve the minimum amount of data necessary. ● Close all SQL resources cleanly. See Appendix below. 9 ● Make sure your queries are nicely formatted and readable. Explain the logic briefly in comments if the query is complicated. 
● You are not being evaluated on optimizing the queries, so there is no need to worry about that. However, if you find that they are taking inordinately long, you should think about simplifying the queries; otherwise, they could fail the tests.
● Generally, non-nested queries are preferred over nested ones, when feasible. It may not be feasible to do so in all cases. Basically, think about simplifying the queries and use comments to explain, if needed, so that someone other than you can understand your logic.

Appendix: Tips on Closing Your JDBC Resources Cleanly
It is important that you close your JDBC connections properly. Otherwise, you risk getting locked out of the database server. Even if you kill your Java program, it is possible that the database server thinks that the client program is still around. It may keep its end of the connection open, waiting for data. This could eventually lock you out of the database if you end up creating too many open connections. Here is a real example of an unfortunate situation posted by a student:
“So I’ve been having this same problem, but I locked myself over 24 hours ago and I still don’t have access. I’m not sure what to do since ITS isn’t open and if CAEN can’t help I’m skeptical anyway. I created this problem by my SQL queries were running on forever (can you even get into an infinite loop in SQL? I think I was) and when you Ctrl+C out it closes SQL improperly. I’ve learned from my mistake now, but it’s too late and I can’t get back in.”
Here are a few tips to help reduce the likelihood of the above type of problem:
1. First, use SQL*Plus to debug your queries rather than using Java. Also make sure you quit your SQL*Plus sessions; otherwise you can still get locked out. It may help to design your queries on paper first and avoid too much nesting of queries. Nested queries tend to run slower, as query optimizers have difficulty handling them. The above student was smart enough to do that but still ran into trouble. So, let’s look at one more thing to do (Step 2) that may help.
2. Make sure you close all Connections, Statements, and ResultSets in your code. This is tricky to do. Read this for some of the nuances: https://blog.shinetech.com/2007/08/04/howtoclosejdbcresourcesproperlyeverytime/ The problem is, even closing a connection can lead to an exception in rare cases. We suggest using try-with-resources statements in your Java code to automatically close any JDBC objects that you create. This is a newer feature introduced in Java SE 7 that makes closing resources easier and more reliable. The following is an example of how this can be done for JDBC connections.

// Example use of a try-with-resources statement
public static void viewTable(Connection con) throws SQLException {
    String query = "SELECT COF_NAME, SUP_ID, PRICE, SALES, TOTAL FROM COFFEES";
    try (Statement stmt = con.createStatement()) {
        ResultSet rs = stmt.executeQuery(query);
        while (rs.next()) {
            String coffeeName = rs.getString("COF_NAME");
            int supplierID = rs.getInt("SUP_ID");
            float price = rs.getFloat("PRICE");
            int sales = rs.getInt("SALES");
            int total = rs.getInt("TOTAL");
            System.out.println(coffeeName + ", " + supplierID + ", " + price + ", " + sales + ", " + total);
        }
    } catch (SQLException e) {
        System.err.println(e.getMessage());
    }
}

This code is adapted from an example at: https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html Feel free to take the code and adapt it to your needs.
The above code assumes that you only need to use a single Statement object, for example. That may not always be appropriate. For example, you may sometimes want to use multiple Statement objects to execute different fixed queries, in which case you could have one try-with-resources block for each Statement object that you create. Since a Connection object is provided to you in the Project 2 code, you do not need to worry about creating a try-with-resources block for the Connection object that you use in your query implementations. If you think about the problem, it is actually pretty hard for a database server to distinguish between a slow client at the other end and a dead client. Remember that the communication between the client and the server occurs over a network. Doing all you can within Java to close the connections in all possible situations (including when queries fail) will help the server greatly.
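To show how the pieces above fit together inside a query function, here is a sketch in the style of Query 2 (“lonely” users). This is only an illustration: oracleConnection is the provided Connection, while usersTableName and friendsTableName stand in for the table-name member variables described in Section 1 (their exact names in the handout may differ), and storing the results in the provided data structures is left as a comment.

// Sketch of a query-function body inside MyFakebookOracle (java.sql.* imported).
// The SQL does the heavy lifting; Java only iterates over the result.
String query =
    "SELECT USER_ID, FIRST_NAME, LAST_NAME FROM " + usersTableName +
    " WHERE USER_ID NOT IN " +
    "(SELECT USER1_ID FROM " + friendsTableName +
    " UNION SELECT USER2_ID FROM " + friendsTableName + ")" +
    " ORDER BY USER_ID";
try (Statement stmt = oracleConnection.createStatement();
     ResultSet rs = stmt.executeQuery(query)) {
    while (rs.next()) {
        long userId = rs.getLong("USER_ID");
        String firstName = rs.getString("FIRST_NAME");
        String lastName = rs.getString("LAST_NAME");
        // ...add (userId, firstName, lastName) to the provided data structure...
    }
} // try-with-resources closes the Statement and ResultSet even if the query fails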

Introduction It’s a sad fact of life: computers crash. One minute everything is working just fine, and the next minute everything that was in volatile memory is gone forever. To deal with this problem on your personal machine, you’ve probably gotten into the habit of saving frequently. In this project, you will implement an algorithm that database systems commonly use to address the same problem without slowing down the system too much. Some Basics of Transactions 1 A transaction is any one execution of a user program in a database management system (DBMS). For example, suppose Ada and Bob both have accounts at the same bank. Ada logs onto the bank’s website and instructs the bank to make a payment of $100 from her account into Bob’s account. If the bank’s computer runs normally, it will deduct $100 from Ada’s account balance, add $100 to Bob’s account, and trigger emails to Ada and Bob confirming the transfer. Those steps—deducting, adding, and confirming that it was done—make up one transaction. But suppose that the moment after the bank’s computer deducts the money from Ada’s account balance, it crashes. When it reboots, the instructions about that $100 that were in its memory are gone. We would like to ensure that if this happens, the $100 has not simply vanished—it is back in Ada’s account, or it is in Bob’s account. Either the whole transaction goes through, or the part of the transaction that happened before the crash gets undone. Another way of saying this is that we want a transaction to be atomic. 1 Transactions are covered in lecture. Check out chapter 16 of the textbook for more in-depth analysis. 1 Suppose the timing of the crash was a little different. The bank’s computer read Ada’s balance into memory and modified it there, read Bob’s balance into memory and modified it there, and triggered the emails. Then, just as it was about to write the updated balances back onto disk, it crashed. We want to make sure that if the system says a transaction completed successfully, the changes the transaction made will persist even through a crash. We call this durability. (In lecture, you also learn about two other properties that transactions should have, called consistency and isolation. These have to do with concurrency and are important to understand, but you don’t need to worry about them for this project.) To ensure atomicity and durability while minimizing the harm to performance, DBMSs use write-ahead logging and with a recovery algorithm, such as ARIES. In this project, you will be implementing a single-threaded version of ARIES. What you need to do: Read. Read sections 16.7 through 16.7.2 and 18.1 through 18.6 in your textbook. This is important. The rest of this project will not make sense until you have understood this material. This will also greatly help for the project. (Since the textbook is required, we assume you already own the textbook. The book can also be checked out for two hours at a time from the Art Architecture & Engineering library’s course reserve desk on the second floor. Older editions of the book Database Management Systems by Raghu Ramakrishnan (1998 and 2000) also have most of the same material — though the chapter numbering is different. Older editions may be available cheaply at Amazon.) Download. Download the recovery simulator from Canvas and unzip it. You will see that the directory contains directories called StorageEngine, StudentComponent, output, correct, and testcases. 
The section on Understanding the Simulator below explains what these pieces do. You’ll also want to spend some time familiarizing yourself with the header files. Implement. Implement ARIES in LogMgr.cpp. You will just be submitting this file to the autograder. Do not alter the other files in any way. The autograder will compile and run your LogMgr.cpp with unaltered copies of all of the other files, so it is in your best interest to test with exactly the same files. Please do not add any other header files; if you need helper functions, you may declare them and define them in LogMgr.cpp. You may use the C++11 STL, except that you may not use do file I/O or interact with the network. The autograder may reject such submissions and your submission will still count. 2 Submit. An autograder will be available by 11/9/2015 (or possibly earlier) at https://grader484.eecs.umich.edu. Submit your LogMgr.cpp file to the autograder. You may submit up to four times per day with autograder feedback. Please take advantage of this, and start submitting early and often. We will keep your best score from various submissions. If you submit before the regular deadline and also during the late days, we will take the best of your scores, max(regular score without late penalty, late score with late penalty). So, there is no disincentive of submitting late. If there is a bug in the autograder, we reserve the right to regrade your past submissions, which may change your best score. In that case, we will notify you of the change as soon as feasible. Ultimately, you are also responsible for testing your LogMgr.cpp with your own test cases. Grading Your grade will be based on your code’s performance on the autograder test cases. We will diff your log file and your recovered db file against our known correct files for each test case. There are hidden test cases on the autograder that are different from the public test cases we have provided you. Passing the public test cases means that you are on the right path, but doesn’t ensure full marks on the private test case. The autograder test cases may also cover corner cases the public test cases do not. Thus, make sure you think carefully about all the cases that the algorithm needs to handle. However, your submission will not be checked against test cases not included in autograder. Hence, the final grade on autograder is your grade. Understanding the Simulator The simulator has two main pieces: the StorageEngine and the LogMgr. The StorageEngine handles all disk reads and writes. The LogMgr handles logging and recovery. If the LogMgr needs to write something to disk, it must go through the StorageEngine. Your implementation of LogMgr.cpp must not include any static variables or functions, and it must not write directly to disk. If LogMgr wants to write to disk, it must use StorageEngine’s public methods (interface). The StorageEngine keeps track of Pages. Each Page has a page_id, a pageLSN, a dirty bit, and a string of data. All Pages are kept on disk. If a page needs to be read or written, 2 StorageEngine adds it to the in-memory page buffer. If the page buffer is full, the StorageEngine chooses a page to flush to disk, informs the LogMgr of the flush through LogMgr’s pageFlushed 2 In a real database, of course, pages contain records, but here we’ve abstracted them away, so you don’t need to worry about them. 3 function (Hint: that probably means LogMgr should do something when the page is flushed), and flushes it, freeing up memory for the next page. 
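The hint above is the write-ahead-logging rule: before a page reaches disk, every log record up to and including that page's pageLSN must already be on the log on disk. A minimal sketch of how pageFlushed might enforce this is below. The members logtail (the in-memory log tail, oldest record first) and se (the StorageEngine pointer), and the LogRecord::getLSN accessor, are assumptions about the provided headers used only for illustration; getLSN(page_id), toString, and updateLog are taken from the description in this spec.

// Hedged sketch (fragment of LogMgr.cpp): flush the log tail up through the
// pageLSN of the page being stolen, so the WAL rule holds.
void LogMgr::pageFlushed(int page_id) {
    int maxLSN = se->getLSN(page_id);      // pageLSN of the page going to disk
    std::string out;
    std::vector<LogRecord*> keep;
    for (LogRecord* rec : logtail) {
        if (rec->getLSN() <= maxLSN) {
            out += rec->toString();        // one record per line on disk
            delete rec;                    // assumes the tail owns its records
        } else {
            keep.push_back(rec);
        }
    }
    if (!out.empty()) {
        se->updateLog(out);                // append the flushed records to the log
    }
    logtail = keep;
}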
StorageEngine’s constructor declares the MEMORY_SIZE that it uses as its buffer pool:

StorageEngine::StorageEngine() : MEMORY_SIZE(10) {
    page_writes_permitted = 0;
}

Thus, in the above, it has 10 pages. If it ever needs to bring an 11th page into memory, it selects a victim page and calls LogMgr::flushPage(page_id). See the StorageEngine::findPage(int page_id) method. For more thorough testing, you may want to change 10 to a smaller or larger value (no other change to StorageEngine should be required, and you are not submitting StorageEngine.cpp). That will cause more or less stealing, respectively. It will also change the output disk and log, and you will have to manually check whether the resulting disk and log are correct.

Sometimes LogMgr will need to write Pages; for example, recovering from a crash and aborting a transaction both require LogMgr to alter Pages. When this is necessary, LogMgr can call StorageEngine::pageWrite. You should always try to minimize the number of Pages LogMgr writes. To force you to do this, StorageEngine limits the number of page writes you are permitted. If you try to exceed the limit, pageWrite will stop writing pages and return false. If this happens, LogMgr should stop what it is doing. (The limit at any given time is set by the testcase. Sometimes we give you enough page writes to finish an operation. Sometimes, to simulate a crash in the middle of an operation, you will run out of page writes early, and the next event in the testcase will be a crash.)

The LogMgr also needs to keep the log. A log is made up of LogRecords of various types. You can keep the most recent part of the log (called the log tail) in memory, but sometimes you will need to write the log to disk. LogMgr can call a LogRecord’s toString method to transform the LogRecord into a string, and then pass a string to StorageEngine::updateLog to append it to the log on disk. The log on disk will have one record per line; you can append multi-line strings to it if you want to add more than one record at once. If LogMgr wants to read old log entries, it can call StorageEngine::getLog(), which will return a (multi-line) string. LogRecord::stringToRecordPtr can parse a line of that string and give you a pointer to a LogRecord for that line.

StorageEngine provides a few other utilities LogMgr might find useful:
● StorageEngine::nextLSN provides a unique ID number, assigned in monotonically increasing order.
● StorageEngine::store_master and StorageEngine::get_master write an int to stable storage and retrieve it.
● StorageEngine::getLSN(int page_id) returns the pageLSN of the specified page.

In the StorageEngine directory, you will also find main.cpp. Main takes the relative path to a testcase as a command-line argument. For example, once you have compiled the whole project, you might call ./main testcases/test00 from the directory to run testcase 0. Main calls the runTestcase function, which will read in the testcase line by line and run a simulation.

Test Cases
Public testcases are given to you in the testcases folder. The first line of the testcase specifies a “database” file (db) to use for this run. For this simulation, we are using an oversimplified format, so that you can see the effects of writing to pages: a database file is a text file, and each line of text represents a page of data on disk.
If the first line of a database file looks like this:

-1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

it means that the page with page_id 1 (because it is the first line of the file) has pageLSN -1, and the data currently written to that page is 50 consecutive x’s.

The remaining lines of the testcase are instructions to write, abort, commit, make a checkpoint, crash, or end the simulation. Each of these is described in more detail below, with examples.
● write (e.g., 1 write 34 27 ABC): An instruction from a particular transaction (1) to write an update (ABC) to a particular page (34) at a particular offset (27). Main’s runTestcase function calls StorageEngine’s write function, which in turn calls LogMgr’s write function, giving it the transaction id, page id, offset, input string, and what was already written at that page-offset location. StorageEngine expects to get back a pageLSN for the write; once it receives this, it updates the page.
● abort (e.g., 1 abort 5): An instruction to abort a particular transaction (1), and a cap on the number of page writes the LogMgr is permitted (5). RunTestcase calls StorageEngine’s abort function; StorageEngine sets the number of page writes permitted to five and then calls LogMgr::abort(1).
● commit (e.g., 1 commit): An instruction to commit a transaction (1). RunTestcase calls LogMgr::commit(1).
● checkpoint (checkpoint): An instruction to create a checkpoint. RunTestcase calls LogMgr::checkpoint().
● crash {v} (e.g., crash {2 5}): where v is a list of integers. An instruction to simulate n crashes and recovery phases, where n is the number of values in the list v. During crash i, v[i] writes are permitted during that recovery phase. In this example, the first crash happens, and the LogMgr gets two writes before another crash happens; for the second crash, LogMgr gets five writes to recover. Sometimes the number of writes will allow a full recovery before the next crash; sometimes it will not, and the second crash will begin with StorageEngine no longer writing when LogMgr requests it. When crashing, RunTestcase will destroy the current instance of LogMgr and create a new one, then call StorageEngine::crash(). StorageEngine will destroy all of the Pages it has in memory, then call LogMgr’s recover method. LogMgr is responsible for recovering according to the ARIES protocol.
● end (end): An instruction to end the simulation. Saves the final state of the on-disk pages and exits the program.

At the end of the testcase, the simulator will have generated two files: a log and a modified version of the input db file. These can be found in the output/log/ and output/dbs/ directories, respectively. Each filename includes the number of the corresponding testcase. NOTE: if you already have a file in the log folder with the same name as the new one, the new log entries will be appended to that file rather than overwriting it. You might want to rename or delete old log files to avoid some confusing bugs.

Reading the Log
The log file is saved as a multi-line string, with one line for each log entry, so that you can easily review it. The format is loosely based on Figure 18.2 in the textbook. For example, a log might have content like this (heading and table borders added for clarity):

LSN | prevLSN | tx_id | type   | page_id | offset | before image | after image
 2  |   -1    |   1   | update |    5    |   0    |   xxxxxxxx   |   update10
 3  |   -1    |   2   | update |    3    |   0    |   xxxxxxxx   |   update20
 4  |    3    |   2   | commit |         |        |              |

End checkpoint log records also include a record of the transaction table and the dirty page table.
Each table is enclosed in curly braces, and each entry in the table is enclosed in square braces. For example:

LSN | prevLSN | tx_id | type           | TX_table               | Dirty Page Table
 5  |    4    |  -1   | end_checkpoint | {[ 1 2 U ] [ 2 3 U ]}  | {[ 3 3 ] [ 5 2 ]}

This transaction table has two entries, and so does the dirty page table.

At times, you will have to read the log from disk, which is in string form (see the sample logs under the correct/logs folder). You will have to add a function to LogMgr.cpp, which is declared in LogMgr.h:

vector<LogRecord*> LogMgr::stringToLRVector(string logstring)

The above function converts a log in string form into a vector of log records. To help do the conversion, a helper function is available in LogRecord.cpp that converts one line to a LogRecord of the appropriate type, allocating the record using new():

LogRecord* LogRecord::stringToRecordPtr(string rec_string)

Testing
We have given you some public test cases. The correct output for these can be found in the correct/logs/ and correct/dbs/ directories. You will likely need to add your own tests. Testing is challenging, as your goal is to write testcases that uncover behavior not exercised by the existing testcases. You can use a GNU test coverage tool called gcov that works with g++. gcov will tell you whether your tests are covering every (or at least most) of the lines in LogMgr.cpp. Makefile2 will build your project for test coverage analysis. Simply run “make” with the -f argument, specifying Makefile2 instead of Makefile:

% make -f Makefile2

Look at the contents of Makefile2 to see what it does. You may have to do a few edits as you add new test cases. It should run on CAEN and, hopefully, on Macs running recent versions of Xcode with command-line tools.

Trouble Debugging?
If you are using gdb on the CAEN computers to debug, you might find that it has some difficulty displaying stl::map and other data structures that your implementation might use. To fix this, you can use Pretty Print: make a folder for it wherever you like to keep such things (mine is ~/gdb_printers) and, from inside that folder, run

svn co svn://gcc.gnu.org/svn/gcc/branches/gcc-4_6-branch/libstdc++-v3/python

Once you have that, create a file ~/.gdbinit (or append to it if the file already exists) and add the following lines to it:

python
import sys
sys.path.insert(0, '/path/to/gdb_printers/python')
from libstdcxx.v6.printers import register_libstdcxx_printers
register_libstdcxx_printers (None)
end

Make sure you replace /path/to/gdb_printers/python with the actual path. Now try running gdb --args ./main.o testcases/test00 again, and you should be able to view the maps properly.

Important Information about the Honor Code
Remember that this assignment is to be done individually. All work on the assignment must be your own. No sharing of code, use of code (or pseudo-code), or sharing of test case files written by others is permitted. We do run cheating detection software against all past and current submissions and will report any suspected violations to the Honor Council. Trying to hack the autograder or trick the autograder into giving you a good grade without doing the assigned work is cheating (though we welcome bug reports about the autograder, including any security vulnerabilities, as long as you don’t use them to your advantage). Any violations of these policies will receive a zero on the project and will be referred to the Honor Council. Your assignment is to implement ARIES within the simulator environment.
As mentioned above, your LogMgr.cpp file must not attempt to read or write any file or use network calls directly, and it must not include any static functions or variables. They are not needed for this assignment. 8 A Few Clarifications: ● When you’re undoing a transaction, any time you see a log record for that transaction, you need to get the prevLSN from that log record and add it to your toUndo list (or end the transaction if the prevLSN is -1). If the record is a CLR or an update record, you then process it the way the book says; if it’s something else (like an abort record), you just move on to the next item in your toUndo list. ● You may have noticed in lecture and some of the problems in the textbook that the log tail is sometimes flushed to disk without a triggering commit or page write or end checkpoint. ARIES permits a regular background flushing of the log tail; for example, a certain amount of space in memory called a log buffer may be set aside, and any time it fills up, the log tail can be flushed to disk. You don’t need to worry about that background flushing for this project. Any reasonably sized log buffer would be bigger than the log tail ever gets in these test cases. ● The book’s description of abort is a little brief. What you need to know is that your abort function should write an abort record, then do basically the same thing as undo, only on a smaller scale. For undo, you need to look at all of the loser transactions and put their most recent LSNs in your toUndo queue. For abort, you just need to find the most recent LSN for one transaction you are aborting. Look at the log record for that LSN. Its prevLSN field will tell you which log record you need to look at next (or if you’re done); its type will tell you if you need to process it the way you process records during undo. When you get to a log record with prevLSN = -1, process that record, then write your end log record, and you’re done.
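One concrete piece worth sketching is the stringToLRVector helper described in the “Reading the Log” section above: it splits the on-disk log into lines and hands each line to the provided parser. This is only a sketch, not the official solution; it assumes stringToRecordPtr takes the line as a std::string, per the declaration quoted earlier.

// Fragment of LogMgr.cpp; assumes LogMgr.h / LogRecord.h are already included.
#include <sstream>
#include <string>
#include <vector>
using namespace std;

vector<LogRecord*> LogMgr::stringToLRVector(string logstring) {
    vector<LogRecord*> result;
    istringstream stream(logstring);
    string line;
    while (getline(stream, line)) {
        if (line.empty()) continue;                            // ignore blank lines
        result.push_back(LogRecord::stringToRecordPtr(line));  // allocated with new()
    }
    return result;
}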

Overview
This project consists of two independent components:
1. Query planning in PostgreSQL using a DVD store database.
2. Exporting Fakebook Project 2 data from Oracle to MongoDB, and then writing some MongoDB queries.
The details of each component are below. You will need to be on the University of Michigan network to do the assignment, since the database servers are only accessible from within the University. If you are off-campus, you must first connect to the University of Michigan VPN. See https://www.itcom.itd.umich.edu/vpn/ for instructions on connecting to the VPN.
You can do 1 and 2 in either order; they are totally independent activities. We talked about MongoDB in the lecture on 11/21, so you may consider starting on that first. Within the MongoDB project, you can do Part 1 or Part 2 in either order as well. Part 1 is about writing a Java program to convert Oracle data from Project 2 to MongoDB format (JSON) and then importing it into MongoDB. Part 2 is about writing 7 queries on the imported data. We have given you a JSON file that you can use for Part 2, if you wish to do Part 2 first.
Our advice for F16: Do the MongoDB part of the project first. That is 80% of the project and is completely auto-graded; it is worth 8% of the overall grade. Do Part 1 or Part 2 of it in either order (or in parallel!). The Postgres portion does not require coding and has no autograder: it requires you to use the Postgres database, read its manual, follow the instructions, and fill out a Google form based on your observations. It is almost like doing a homework (and since it is worth 20% of this project, it ends up being worth the same as the other homeworks: 2% of the overall grade).
PostgreSQL (20 points)
We will have a couple of exercises using PostgreSQL to help you understand query planning and query optimization. The spec and questions can be found at the following Google form link. You need to fill in your answers in the Google form. You don't need to submit any files for this exercise.
https://goo.gl/forms/mqrz6jxjYmh0LAvr1
What follows is an introduction to connecting to Postgres and loading the data, along with some helpful links, to read before you answer the questions at the above Google form.
Logging into Postgres
To log in to Postgres, log in to a CAEN machine. Alternatively, download a Postgres client program such as psql to your machine (check first: you may already have it if you have a Mac or Linux machine).
% psql -h eecs484.eecs.umich.edu -U uniquename
Your default Postgres password is postgres484student. You need to connect from a University machine, be connected to the University network via UM Wireless (not guest), or be on the University of Michigan VPN. Once you log in, you can change your password as follows at the postgres => prompt:
=> \password
Remember your new password. Resetting it is possible but will incur a delay (possibly even 24-48 hours), as it is a manual process. If you need to reset it, you should email the teaching staff and allow us 24-48 hours to respond. Neither CAEN nor ITD supports this database system.
Initializing the Database
This project comes with a zip file (dvd_store_data.zip) that contains the relevant commands to initialize the database. Download and unzip the file and you will find a file called setup_db.sql. You will use this file to populate the database.
Specifically, you can do the following steps on CAEN to initialize the database:
% unzip dvd_store_data.zip
% cd dvd_store_data
% psql -h eecs484.eecs.umich.edu -U uniquename
Password for user uniquename:
You are now connected to the Postgres interactive terminal. Run the following command (note the backslash) to execute the commands from setup_db.sql:
\i setup_db.sql
It can take a few minutes for the database to be initialized. Here is what you may see when initializing (or re-initializing). The error messages about permission being denied on triggers can be ignored.
uniquename=> \i setup_db.sql
psql:pgsqlds2_delete_all.sql:2: ERROR: permission denied: "RI_ConstraintTrigger_19357" is a system trigger
psql:pgsqlds2_delete_all.sql:3: ERROR: permission denied: "RI_ConstraintTrigger_19359" is a system trigger
psql:pgsqlds2_delete_all.sql:4: ERROR: permission denied: "RI_ConstraintTrigger_19409" is a system trigger
psql:pgsqlds2_delete_all.sql:5: ERROR: permission denied: "RI_ConstraintTrigger_19391" is a system trigger
psql:pgsqlds2_delete_all.sql:6: ERROR: permission denied: "RI_ConstraintTrigger_19424" is a system trigger
psql:pgsqlds2_delete_all.sql:7: ERROR: permission denied: "RI_ConstraintTrigger_19383" is a system trigger
DROP TABLE
DROP TABLE
psql:pgsqlds2_create_tbl2.sql:30: NOTICE: CREATE TABLE will create implicit sequence "customers_customerid_seq" for serial column "customers.customerid"
psql:pgsqlds2_create_tbl2.sql:30: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "customers_pkey" for table "customers"
CREATE TABLE
psql:pgsqlds2_create_tbl2.sql:43: NOTICE: CREATE TABLE will create implicit sequence "orders_orderid_seq" for serial column "orders.orderid"
psql:pgsqlds2_create_tbl2.sql:43: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "orders_pkey" for table "orders"
CREATE TABLE
psql:pgsqlds2_create_tbl2.sql:51: NOTICE: CREATE TABLE will create implicit sequence "categories_category_seq" for serial column "categories.category"
psql:pgsqlds2_create_tbl2.sql:51: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "categories_pkey" for table "categories"
CREATE TABLE
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
psql:pgsqlds2_create_tbl2.sql:82: NOTICE: CREATE TABLE will create implicit sequence "products_prod_id_seq" for serial column "products.prod_id"
psql:pgsqlds2_create_tbl2.sql:82: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "products_pkey" for table "products"
CREATE TABLE
CREATE TABLE
CREATE TABLE
psql:pgsqlds2_create_tbl2.sql:115: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "inventory_pkey" for table "inventory"
CREATE TABLE
CREATE TABLE
psql:pgsqlds2_load_orderlines.sql:1: ERROR: permission denied: "RI_ConstraintTrigger_20643" is a system trigger
psql:pgsqlds2_load_orderlines.sql:16: ERROR: permission denied: "RI_ConstraintTrigger_20643" is a system trigger
Helpful Links
Now you should proceed to do the quiz at the Google Forms link given at the beginning. When you do the quiz, you may have to refer to the Postgres documentation and run Postgres commands to find the answers to the quiz questions. Some of the tables that you will be using during this exercise are system catalog tables. In most databases, there are tables that store information about the tables you create.
For example, one of the tables in Postgres is pg_class, which has general information about all the relations, indexes, and so on, including the approximate number of tuples in each. Another is pg_stats, which contains per-column statistics such as the approximate number of distinct values in each column. These two tables are very useful in query optimization: the data in them helps the query optimizer estimate the cost of different ways of evaluating a query (e.g., whether to use an index or to ignore it). For example, ignoring an index and just doing a regular file scan may be more efficient in some cases (e.g., see the lecture notes, where examples of SELECT queries on age being 18 for UM students were discussed).
The following are links to the Postgres documentation relevant to this assignment. Be prepared to look up the documentation as you work on the exercises.
● Full PostgreSQL 8.4.16 documentation: https://www.postgresql.org/docs/8.4/static/index.html
● System catalogs that give you information about tables: https://www.postgresql.org/docs/8.4/static/catalogs.html
● Statistics used by the query planner: https://www.postgresql.org/docs/8.4/static/planner-stats.html
● How to manipulate the query planner (such as disabling the use of a join algorithm): https://www.postgresql.org/docs/8.4/static/runtime-config-query.html
● Syntax of the EXPLAIN command: https://www.postgresql.org/docs/8.4/static/sql-explain.html
● How to use the EXPLAIN command and interpret its output: https://www.postgresql.org/docs/8.4/static/using-explain.html
● Creating an index: https://www.postgresql.org/docs/8.4/static/sql-createindex.html
● Creating a clustered index: https://www.postgresql.org/docs/8.4/interactive/sql-cluster.html
MongoDB (80 points)
Introduction
In this project, you will learn how to transfer data from SQL to MongoDB and how to perform MongoDB queries. There are two parts. In Part 1, you will write a Java program to export a small portion of the Fakebook data stored in your Project 2 tables into one JSON file, which serves as the input to Part 2. In Part 2, you will import this JSON file into MongoDB and write several queries in the form of JavaScript functions. (Note: we have provided the JSON file that should result from your Part 1, to allow you to work on Part 1 and Part 2 in either order, and to help you check the correctness of Part 1. See more on this later in this document.)
JSON objects and arrays
MongoDB uses a format called JSON (JavaScript Object Notation) extensively. JSON is a key-value representation, in which values can themselves be JSON objects. Here are some examples of objects in JSON notation:
Example 1 (JSON object): {"firstName":"John", "lastName":"Doe"}
Here, "firstName" is a key, and the corresponding value is "John". Similarly, "lastName" is a key and "Doe" is a value. Think of it like a map. Here is some JavaScript code that uses the above:
var employee = {"firstName":"John", "lastName":"Doe"};
employee["firstName"]; // displays John
One can also have a JSON array, which is an array of values, such as an array of JSON objects. For example, the variable employees below is an array of JSON objects:
var employees = [
  {"firstName":"John", "lastName":"Doe"},
  {"firstName":"Atul", "lastName":"Prakash"},
  {"firstName":"Barzan", "lastName":"Mozafari"}
];
employees[1]["firstName"]; // displays Atul
Nesting is possible. For example, in a key-value pair, the value can be a JSON array, another JSON object, or just a simple string or number.
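As a small concrete illustration of nesting (the specific values here are made up), a single user object of the kind you will build in Part 1 can embed another object and an array:
var user = {
  "first_name": "John",
  "last_name": "Doe",
  "hometown": { "city": "Ann Arbor", "state": "Michigan", "country": "USA" },
  "friends": [ 12, 47, 203 ]
};
user["hometown"]["city"]; // displays Ann Arbor
user["friends"][0];       // displays 12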
MongoDB does not use SQL, but there are some similarities. Here are a few simple examples; see the following page for more examples: https://docs.mongodb.org/manual/reference/sql-comparison/

SQL                                                      MongoDB
Table                                                    Collection. Initialized using a JSON array.
Tuple or row of a table                                  Document. Corresponds to a JSON object.
SELECT * FROM users;                                     db.users.find();
SELECT * FROM users WHERE name = 'John' AND age = 50;    db.users.find({name : "John", age : 50});
SELECT user_id, addr FROM users WHERE name = 'John';     db.users.find({name : "John"}, {user_id : 1, addr : 1, _id : 0});

Install mongodb on your local machine:
We encourage you to install mongodb locally on your laptop to have a more pleasant development environment and also to explore mongodb's functionality by yourself. You can follow the instructions in the following links. Post on Piazza if you get stuck, and feel free to help other students get mongo installed; a great way is to reply to any installation-related questions on Piazza based on your experience. A Google search on installation error messages can also help.
Install mongodb on Mac: https://docs.mongodb.com/v3.2/tutorial/install-mongodb-on-os-x/ (Note: You may need to use the sudo command at times to temporarily become root if you get Permission Denied errors.)
Install mongodb on Windows: https://docs.mongodb.com/v3.2/tutorial/install-mongodb-on-windows/
Install mongodb on Linux: https://docs.mongodb.com/v3.2/administration/install-on-linux/ (Note: You may need to use the sudo command at times to temporarily become root if you get Permission Denied errors.)
When you have it successfully installed, the following commands should work from a Terminal:
% mongo
% mongoimport
If you have trouble installing locally, you can also use the mongo server on eecs484.eecs.umich.edu directly by supplying a userid and password. See Part 2 below.
Part 1 - Export SQL data to JSON using JDBC
This part does not really make use of MongoDB. It instead relies on your knowledge from Project 2 (SQL!). You will be retrieving data from the public tables of Project 2 and outputting a subset of the information as a JSON array. We have also given you the output file we are expecting, so that you can check your answer.
You are provided with 3 files for this part: GetData.java, Main.java, and a Makefile. You will write code inside GetData.java, resulting in a file output.json. You are also provided sample.json, which is one of the possible correct outputs. Your output.json should contain a JSON array equivalent to the one in sample.json.
1) Main.java
This file contains the main function to run the program. The only modification you need to make to this file is to provide the Oracle (SQL*Plus) username and password that you used in Project 2. Please refer to your Project 2 files in case you forgot what it is, since it is probably embedded in one of the files there.
public class Main {
  static String dataType = "PUBLIC";
  static String oracleUserName = "username"; // replace with your Oracle account name
  static String password = "password";       // replace with your Oracle password
  ...
2) GetData.java
This file contains the function you need to write to export data from Oracle to a JSON file. The function will output a JSON file called "output.json", which you will need to submit.
public JSONArray toJSON() throws SQLException {
  // Your implementation goes here....
  // This is an example usage of JSONArray and JSONObject
  // The array contains a list of objects
  // All user information should be stored in the JSONArray object: users_info
  // You will need to DELETE this stuff. This is just an example.
  // A JSONObject is an unordered collection of name/value pairs. Add a few name/value pairs.
  JSONObject test = new JSONObject();      // declare a new JSONObject
  // A JSONArray consists of multiple JSONObjects.
  JSONArray users_info = new JSONArray();
  test.put("user_id", "testid");           // populate the JSONObject
  test.put("first_name", "testname");
  JSONObject test2 = new JSONObject();
  test2.put("user_id", "test2id");
  test2.put("first_name", "test2name");
  users_info.add(test);                    // add the JSONObject to the JSONArray
  users_info.add(test2);                   // add the JSONObject to the JSONArray
  return users_info;
}
You need to use JDBC to query the relevant Project 2 tables to get information about users and store the information in a JSONArray called users_info. It is OK to use multiple queries; in fact, that may be more convenient. The users_info object is a JSONArray and contains an array of JSONObjects. Each JSONObject should contain information about one user. For each user (stored as a JSONObject), you will need to retrieve the following information:
● user_id
● first_name
● last_name
● gender
● YOB
● MOB
● DOB
● hometown (JSONObject). Hometown contains city, state, and country.
● friends (JSONArray). Friends contains an array of the user_ids that are greater than the user_id of that user.
See the file sample.json for an example of a valid user JSONObject and output (the order of key/value pairs in a JSONObject doesn't matter).
3) Makefile
We provide a simple Makefile to compile the Java files and run them. You may make changes to this file if necessary.
To compile the code, do:
$ make
To run the code, do:
$ make run
An output file output.json should result.
4) output.json
Since the order of attributes inside a JSONObject is not fixed, there are many correct answers for output.json. However, when imported into the database, they should all behave identically for the queries. For your convenience, we have provided sample.json, which is one of the correct answers. To test whether your output.json is correct, you could do Part 2 with your output.json as the input instead of sample.json. If you get the same answers, that is a good sign (though not a proof). If you get different answers, then something is definitely wrong with your output.json.
Part 2 - Query MongoDB
The first step is to import the output.json file from Part 1 into MongoDB as a collection. Alternatively, you can use sample.json as your input file. If you are working on a CAEN machine, use the following commands to import sample.json into the database, for example:
$ module load mongodb
$ mongoimport --host eecs484.eecs.umich.edu --username <uniquename> --password <password> --collection users --db <uniquename> --file sample.json --jsonArray
You can also do this by modifying the Makefile and then doing:
$ make setupsampledb
Alternatively, to use your output.json, you can do:
$ make setupmydb
On the eecs484 server, we have set up a Mongo database for each student. The database name is your uniquename, and the username is also your uniquename. The password is eecs484class for all students. You can change the password for your database; see https://docs.mongodb.org/manual/tutorial/manage-users-and-roles/. What you need to do is:
db.updateUser("<uniquename>", {pwd : "newpassword"})
You can do it either in the mongo shell or run it as a script. (If you are using a private, local copy of mongodb on your personal machine, you can instead just omit the --host, --username, and --password flags.) When importing a JSON file, please use "users" as your collection name, and do not modify this collection. The above mongoimport command does that.
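After the import, a quick sanity check in the mongo shell can confirm that the collection looks right. This is only a minimal sketch; the database name shown is a placeholder for your own, and your counts will depend on your data:
// connect with: mongo -u <uniquename> -p <password> --host eecs484.eecs.umich.edu
use <uniquename>        // switch to your database (placeholder name)
db.users.count()        // should equal the number of user objects in the imported JSON array
db.users.findOne()      // inspect one imported document (user_id, hometown, friends, ...)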
You can create additional collections besides users if you want, as helper collections to answer the queries.
For the second step of this part, you will need to write 7 queries in the form of JavaScript functions to query MongoDB. MongoDB can load JavaScript files. You can find more information via the following link:
https://docs.mongodb.org/manual/tutorial/write-scripts-for-the-mongo-shell/
If a collection is created in a query, you may reuse that collection in subsequent queries to save time.
Note: Since only hometown information is retrieved in Part 1, we assume that "city" in the queries means the hometown city.
Query 1: Find users who live in a certain city
This query should return a JavaScript array of the user_ids of users who live in the specified city. The city is passed as a parameter of the function find_user.
Hint: A possible answer would start out like:
var result = db.users.find(....); // Read the MongoDB tutorials on the find command.
In addition, you may find the following useful:
https://docs.mongodb.org/v3.0/reference/method/cursor.forEach/
Instead of using forEach, you can also iterate over the cursor returned by find and push each user_id into a JavaScript array variable.
function find_user(city, dbname){
  db = db.getSiblingDB(dbname)
  // implementation goes here
  // returns a Javascript array. See test.js for a partial correctness check.
  // This will be an array of integers. The order does not matter.
}
Note: Queries 2-5 assume that the variable db has been initialized as in Query 1 above. Do not drop the db database.
Query 2: Unwind friends
Each document in the collection represents one user's information, including a list of friends' ids. In this query, you need to unwind the friends list such that the resulting collection contains one document per friend pair. You don't need to return anything. The new collection must be named flat_users. You may find this link on MongoDB's unwind helpful:
https://docs.mongodb.org/manual/reference/operator/aggregation/unwind/
function unwind_friends(dbname){
  db = db.getSiblingDB(dbname)
  // implementation goes here
  // returns nothing. It creates a collection instead as specified above.
}
You may also find the following useful:
https://docs.mongodb.org/manual/reference/operator/aggregation/
In particular, besides $unwind, $project and $out can also be useful; $out can create a collection directly. Instead of $out, you can also iterate over the resulting cursor from the query and use the insert operator to insert the documents into flat_users. See the documentation link in the query below as well. (A rough sketch of these operators appears after Query 3 below.)
Query 3: Create a city to user_id mapping
Create a new collection. Documents in the collection should contain two fields: the _id field holds the city name, and the users field holds an array of the user_ids of users who live in that city. You don't need to return anything. The new collection must be named cities. You may find the following link helpful:
https://docs.mongodb.org/manual/reference/operator/aggregation/out/
function cities_table(dbname) {
  db = db.getSiblingDB(dbname)
  // implementation goes here
  // Returns nothing. Instead, it creates a collection inside the database.
}
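To make the hints for Queries 1-3 concrete, here is a rough sketch of the relevant shell patterns. It is only an illustration of the operators using the field names described in Part 1, not necessarily the exact code you will want, and the output collection names here are hypothetical:
// Collect a cursor into a JavaScript array (Query 1 style).
var ids = [];
db.users.find({ "hometown.city": "Ann Arbor" }).forEach(function(u) {
  ids.push(u.user_id);
});

// Unwind an embedded array and write one document per pair to a new collection (Query 2 style).
db.users.aggregate([
  { $unwind: "$friends" },
  { $project: { _id: 0, user_id: 1, friend_id: "$friends" } },
  { $out: "flat_users_sketch" }   // hypothetical name; the assignment requires flat_users
]);

// Group users by a field into an array (Query 3 style).
db.users.aggregate([
  { $group: { _id: "$hometown.city", users: { $push: "$user_id" } } },
  { $out: "cities_sketch" }       // hypothetical name; the assignment requires cities
]);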
Query 4: Recommend friends
Find user pairs such that one is male, the other is female, the difference between their years of birth is less than year_diff, they live in the same city, and they are not friends with each other. Store each such pair as an array (male first, female second), store all pairs as an array of arrays, and return that array at the end of the function.
function suggest_friends(year_diff, dbname) {
  db = db.getSiblingDB(dbname)
  // implementation goes here
  // Return an array of arrays.
}
Query 5: Find the oldest friend
Find the oldest friend for each user who has friends. For simplicity, use only the year of birth to determine age. If there is a tie, use the friend with the smallest user_id. Notice that in the collection, each user only lists friends whose user_id is greater than that user's, due to the requirement in the Fakebook database; you also need to find the information on friends who have smaller user_ids than the user. You should find the ideas of Queries 2 and 3 useful. Return a JavaScript object whose keys are user_ids and whose value for each key is the oldest friend's id. The number of keys in the object should equal the number of users who have friends.
function oldest_friend(dbname){
  db = db.getSiblingDB(dbname)
  // implementation goes here
  // return a JavaScript object as described above
}
Query 6: Find the average friend count for users
We define the "friend count" as the number of friends of a user. The average friend count is the average of the friend counts over a collection of users. In this function, we ask you to find the average friend count for the users collection. Return a decimal that is the average friend count of all users in the "users" collection.
function find_avg_friend_count(dbname){
  db = db.getSiblingDB(dbname)
  // implementation goes here
  // return a decimal as described above
}
Query 7: Find the city average friend count using MapReduce
MapReduce is a very powerful yet simple parallel data processing paradigm. Please refer to Discussion 8 (to be released on 11/30) for more information. In this question, we ask you to use MapReduce to find the "average friend count" at the city level (i.e., the average friend count per user, where the users belong to the same city). We have set up the MapReduce calling point in test.js, and we ask you to write the mapper, reducer, and finalizer (if needed) to find the average friend count per user for each city. You may find the following link helpful:
https://docs.mongodb.com/v3.2/core/map-reduce/
var city_average_friendcount_mapper = function() {
  // implement the map function of average friend count
};
var city_average_friendcount_reducer = function(key, values) {
  // implement the reduce function of average friend count
};
var city_average_friendcount_finalizer = function(key, reduceVal) {
  // We've implemented a simple forwarding finalize function. This implementation is naive:
  // it just forwards reduceVal to the output collection.
  // Feel free to change it if needed. However, please keep this unchanged: the var ret
  // should be the average friend count per user of each city.
  var ret = reduceVal;
  return ret;
};
Sanity check: After running test.js, running db.friend_city_population.find() in the mongo shell should show that the collection friend_city_population has records of the following form:
{ "_id" : "some_city", "value" : 15.23 }
where _id is the name of the city, and value is the average friend count per user.
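For reference, here is a generic sketch of computing an average correctly with mapReduce. Because the reducer may be called repeatedly on partial results, a common pattern is to carry a running sum and count and divide only in the finalizer. The collection and field names below are hypothetical placeholders, not the ones used by test.js:
// Hypothetical example: average of a numeric field "score", grouped by "city".
var exampleMapper = function() {
  emit(this.city, { sum: this.score, count: 1 });
};
var exampleReducer = function(key, values) {
  var acc = { sum: 0, count: 0 };
  values.forEach(function(v) { acc.sum += v.sum; acc.count += v.count; });
  return acc;   // must have the same shape as the values emitted by the mapper
};
var exampleFinalizer = function(key, reduced) {
  return reduced.count === 0 ? 0 : reduced.sum / reduced.count;
};
// db.example_collection.mapReduce(exampleMapper, exampleReducer,
//   { out: "example_output", finalize: exampleFinalizer });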
Sample test
We offer a test.js which will call all 7 query JavaScript files and will print "Query x is correct!" if your query passes the simplistic test in test.js. In test.js, you need to put your database name in the dbname variable. Please make sure that all 7 query JavaScript files are in the same directory as test.js. Also, please note that the autograder will use a program similar to test.js but more exhaustive. In particular, we compare the output of your queries against a reference output in more depth. You are free to make changes to test.js for more exhaustive testing.
To run the tests from the command line, you can do:
$ module load mongodb   # This is only required once per login.
$ mongo -u <uniquename> -p <password> --host eecs484.eecs.umich.edu < test.js
This will access your database (created as your uniquename) on the eecs484 mongodb server and run the script test.js. Alternatively, you can use the Makefile:
$ make mongoquerytest
will basically do the above for you. If you want to open the mongodb shell, you can do:
$ mongo -u <uniquename> -p <password> --host eecs484.eecs.umich.edu
Again, if you are using a local mongodb on your personal computer, just do the following instead:
$ mongo <dbname> < test.js
Note: it may take some time to run Query 4 and Query 5. However, Query 4 should take less than 3 minutes and Query 5 should take less than 6 minutes. You will receive a 0 on a query if it exceeds its time limit.
What to submit
You should submit a zip file named p4.zip. The zip file should include:
GetData.java - your Java program to create the JSON file. Do not change the code for writing to output.json.
query1.js - returns a JavaScript array of user_ids.
query2.js - creates a new collection called flat_users; nothing to return.
query3.js - creates a new collection called cities; nothing to return.
query4.js - returns a JavaScript array of user pairs.
query5.js - returns a JavaScript object; keys are user_ids, and the value for each key is the oldest friend id for that user_id.
query6.js - returns a decimal which is the average friend count for all users.
query7.js - the finished mapper, reducer, and finalizer.
Where to submit
For the MongoDB part of the project, we have made an autograder available at grader484.eecs.umich.edu. You will submit the MongoDB part of the project online. (The Postgres portion is to be submitted via the Google form at the link given earlier in the spec.)
