School of Computer Science Dr. Ying Zhou
COMP5338: Advanced Data Models 2.Sem./2019
Project: NoSQL Schema Design and Query Workload Implementation
Group Work: 20%
20.09.2019
1 Introduction
In this assignment, you will demonstrate that you are able to work with both MongoDB and Neo4j in terms of designing suitable schemas and writing practical queries. You will also demonstrate that you understand the strengths and weaknesses of each system with respect to certain query workload features. You will be given a real-world data set from a question-and-answer site and a set of target queries.
A summary of your tasks is as follows:
Design a schema for each storage system based on the data and query features;
Implement all queries in each system;
For each system, select two queries and provide an alternative implementation for each;
Compare the execution performance of the two implementations;
Document the schema design, query design and performance analysis in a report.
2 Data set
The data that you will use is the latest dump (publication date: 2018-06-05) of the Artificial Intelligence Stack Exchange question and answer site (https://ai.stackexchange.com/). The dump is released and maintained by Stack Exchange: https://archive.org/details/stackexchange. The original dump contains many files in XML format. The assignment uses a subset of the data stored in five CSV files. The data files and the description (readme.txt) can be downloaded from Canvas.
The assignment data set contains the following files:
Posts.csv stores information about posts; each row represents a post, which could be a question or an answer.
Users.csv stores user profiles; each row represents a user; a user can be the author of a post (a question or an answer).
Comments.csv stores comment metadata; each row represents a comment, which can be made on a question or an answer, identified by the PostId column.
Votes.csv stores detailed vote information about posts; each row represents a vote, including the vote type, the date the vote was made and a few other fields.
Tags.csv contains a summary of tag usage on the site.
Two concepts that will appear in many query descriptions are: Topic and User.
Topic: Each question may belong to a few topics. The topic(s) of a question are recorded as a list of keywords in the Tags column in Posts.csv; a possible way to parse this column during loading is sketched at the end of this section. Both answers and comments belonging to a question have the same topic(s) as the question.
User: Questions, answers and comments are all made by registered users. Users are identified by the UserId field in various CSV files. Some users have been removed for various reasons. The removed users no longer have an Id and should be ignored in all queries.
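As an illustration only: in the original Stack Exchange dumps the Tags value uses an angle-bracket format such as "<neural-networks><machine-learning>"; the exact representation in the provided CSV files is described in readme.txt. Assuming that format, a data loading script could turn the column into a list of topic keywords with a small helper like the one below (the helper name parseTags is hypothetical):

    // Hypothetical loading helper; the "<tag1><tag2>" format is an assumption,
    // check readme.txt for the actual representation of the Tags column.
    function parseTags(rawTags) {
        if (!rawTags) return [];
        return rawTags.split("><").map(function (t) {
            return t.replace(/[<>]/g, "");
        });
    }

    // parseTags("<neural-networks><machine-learning>")
    //   -> ["neural-networks", "machine-learning"]

Storing the topics as an array rather than as a raw string makes topic filtering and indexing straightforward in both systems.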
3 Target Queries
[Q1] Find the question that attracts the most discussion in a given topic. We measure the intensity of discussion by the total number of answers and comments a question receives. (An illustrative MongoDB sketch of this query appears after this list of queries.)
[Q2] Find the user with the highest UpVote count in a given topic; return the user's name and UpVote count. Any user who has posted a question, an answer or a comment in this topic is a candidate user.
[Q3] For a given topic, discover the questions that are hardest to answer. Here we measure the difficulty of a question by the time it takes to receive an accepted answer. Questions that do not have an accepted answer will be ignored.
[Q4] Discover questions with an arguable accepted answer. Users can give UpVotes to both questions and answers. Usually the accepted answer of a question receives the highest number of UpVotes among all answers to that question. In rare cases, one or more other answers may receive a higher UpVote count than the accepted answer. In this query, you are asked to discover questions whose accepted answer has fewer UpVotes than at least one of its other answers. Note: we are only interested in questions with at least 5 answers.
[Q5] Given a time period indicated by a starting and an ending date, find the top 5 topics in that period. We rank a topic by the number of users who participated in that topic during the period. Posting a question, answering or commenting are all considered participation.
[Q6] Find the top 5 co-authors of a given user. Consider all users involved in a question as co-authors. This includes users posting the question, answering the question or making comments on either the question or its answers. For a given user, we rank the co-authors by the number of questions in which this user and the co-author appear together.
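As a purely illustrative sketch, not a prescribed solution: assuming a MongoDB schema in which each question document carries a Topics array and embeds all answers and all comments that belong to the question (the names questions, Topics, Answers, Comments and Title are assumptions about one possible design), Q1 could be kept entirely on the database side with a single aggregation:

    // Q1 sketch only; collection and field names are assumptions about one
    // possible embedded schema, not part of the provided data files.
    var topic = "neural-networks";          // example run-time argument
    db.questions.aggregate([
        { $match: { Topics: topic } },
        { $project: {
            Title: 1,
            discussion: { $add: [
                { $size: { $ifNull: [ "$Answers", [] ] } },
                { $size: { $ifNull: [ "$Comments", [] ] } }
            ] }
        } },
        { $sort: { discussion: -1 } },
        { $limit: 1 }
    ]);

A Cypher formulation would look different, but it should likewise keep the counting and ranking inside the database.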
4 Task Details
Your tasks include:
Schema Design for MongoDB and Neo4j
For each storage option, design a proper schema that best supports the query workload and the data set features. For each schema, make sure you utilize features of the storage system such as indexing, aggregation, ordering, filtering and so on.
The original data set follows a relational structure. It may contain data that are not useful or not involved in the queries. During schema design, you may discard data that are not needed. You may also duplicate original data where your schema design calls for it.
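For example, one possible (not prescribed) embedded design for MongoDB could store each question together with its answers and comments and duplicate the author's display name from Users.csv to avoid extra lookups; every field name below is an illustrative assumption:

    // Sketch of one possible document shape for a "questions" collection.
    // The field names and the decision to embed and duplicate are design
    // choices made for illustration, not requirements.
    db.questions.insertOne({
        _id: 10,                                     // question PostId
        Title: "What is the difference between ...?",
        Topics: [ "neural-networks", "deep-learning" ],
        CreationDate: ISODate("2018-01-15T08:30:00Z"),
        Owner: { UserId: 42, DisplayName: "alice" }, // duplicated from Users.csv
        AcceptedAnswerId: 12,
        Answers: [
            { _id: 12, OwnerUserId: 7, Score: 5,
              CreationDate: ISODate("2018-01-15T10:00:00Z"),
              Comments: [ { UserId: 42, Text: "Thanks!" } ] }
        ],
        Comments: [ { UserId: 99, Text: "Please clarify ..." } ]
    });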
Query Design and Implementation
Load the data set (after any necessary preprocessing) into both systems and set up proper indexes that will be used by the target queries. Design and implement all queries in each system. You may implement a query using shell commands (e.g. MongoDB shell or Cypher queries) alone, or a combination of JavaScript and shell commands in the case of MongoDB, or as a Python/Java program. If a programming language is used, make sure that you do the majority of the processing on the database side. Client-side processing should be restricted to activities like collecting the output of a previous database query and sending it as-is to a subsequent one. In particular, you should avoid sorting, filtering and grouping query output on the client side.
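The sketch below illustrates this split for the MongoDB case, under assumed collection and field names (questions, users, Topics, Score, Owner.UserId): the supporting index is created once, the heavy lifting happens inside the aggregation pipeline, and the client merely forwards the value it received into a follow-up lookup without any sorting, filtering or grouping of its own:

    // Sketch only; collection and field names are assumptions.
    db.questions.createIndex({ Topics: 1 });     // supports topic look-ups

    // Step 1: matching, sorting and limiting all happen on the server.
    var best = db.questions.aggregate([
        { $match: { Topics: "neural-networks" } },
        { $sort: { Score: -1 } },
        { $limit: 1 }
    ]).toArray()[0];

    // Step 2: the client only passes on what it received; no client-side
    // sorting, filtering or grouping is performed.
    var owner = db.users.findOne({ _id: best.Owner.UserId });
    printjson(owner);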
Performance Analysis
For each storage option (MongoDB and Neo4j), pick two queries as the performance analysis target queries. Design a different implementation for each of them, then collect execution statistics of each implementation and make a side-by-side comparison.
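For instance, in MongoDB the execution statistics can be collected with explain("executionStats"); Neo4j provides PROFILE for the same purpose. A minimal sketch, using the same assumed collection and field names as above:

    // Sketch: collect execution statistics for one implementation of a query;
    // repeat for the alternative implementation and compare the figures.
    var stats = db.questions.find({ Topics: "neural-networks" })
                            .sort({ Score: -1 })
                            .limit(1)
                            .explain("executionStats");
    print("totalDocsExamined:   " + stats.executionStats.totalDocsExamined);
    print("executionTimeMillis: " + stats.executionStats.executionTimeMillis);

Useful figures for the side-by-side comparison include the number of documents or rows examined, whether an index was used and the execution time.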
5 Deliverable and Submission Guideline
This is a group project; each group can have up to two students. Each group needs to produce the following:
A Written Report.
The report should contain five sections. The first section is a brief introduction of the project. Sections two and three should cover one storage option each. Section four should provide a summary and brief comparison of the two storage systems. Section five should be an appendix containing sample results.
There are no points allocated to section one. It is included to make the report complete, so please keep it short.
Sections two and three should each contain the following three subsections:
Schema Design
In this section, describe the schema with respect to the particular system. Your description should include information at the table and column level as well as possible primary keys/row keys and secondary indexes. You should show sample data based on the schema. For instance, you may show sample documents of each MongoDB collection, or a sample property graph involving all node types and relationship types for Neo4j.
Query Design
In this section, describe the implementation of each query. You should include the entire command and/or code. For each query, briefly explain its behaviour. Examples of MongoDB query descriptions can be found in the week 2 and week 3 lab instructions.
Performance Analysis
In this section, list the two queries you have chosen for performance analysis. For each query, include the entire command and/or code for each implementation. Show the execution statistics in tabular format or as screenshots. Give a brief comparison by highlighting the important execution differences.
In section four, briefly compare the two storage systems with respect to ease of use, query design and schema differences. You can also describe problems encountered in schema design or query design.
In section five, document the sample query results as well as the respective argument(s) you use for queries that take arguments. This would include: a sample topic for Q1, Q2 and Q3; a sample period for Q5; and a sample UserId for Q6.
System Demo
Each group will demo in the week 10 lab. You can run the demo on your own machine, on a lab machine or on a cloud server. Please make sure you prepare the data before the demo. The marker does not need to see your data loading steps. The marker will ask you to run a few randomly selected queries to get an overview of the data model and query design. All members of the group are required to attend the demo. The marker will ask each member a few questions to establish their respective contributions to the project. Members of the same group may get different marks depending on their individual contributions.
Source Code/Script and soft copy of report submission
There will be different links for the script (zip file) and report (PDF file) submissions to facilitate plagiarism detection. The script submission should be a zip file (no rar or 7z) that includes the following:
query script or program code for each option
data loading script.
a Readme document describing how to run the data loading script and the target queries. The instructions should be detailed enough for the markers to quickly prepare the data and run the queries. For instance, you should indicate where and how run-time arguments are supplied (one possible way is sketched below). If you use special features only available in a particular version or environment, indicate that as well.
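For example, a minimal sketch assuming the MongoDB part of Q1 is a mongo shell script named q1.js (the file name and variable name are illustrative): the Readme could state that the topic argument is injected with --eval before the script runs:

    // q1.js (sketch): expects a "topic" variable to be supplied by the caller:
    //   mongo <database> --eval 'var topic="neural-networks"' q1.js
    if (typeof topic === "undefined") {
        print("Usage: mongo <database> --eval 'var topic=\"...\"' q1.js");
        quit();
    }
    print("Running Q1 for topic: " + topic);
    // the actual query would follow here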
Remember, only scripts or source code and the Readme file should be included. There will be a penalty for including data files in the submission.