Final Homework
The JLU news Spider (60%)
You should design and implement a web spider to crawl the OA system.
Start URL: https://www.jlu.edu.cn/index/tzgg.htm
Date range: 2018-01-01 to 2019-06-01
Each crawled news item should include the following fields:
Title
Submission date
Submission department
Main body of the news
Technologies you may use:
Regular expressions
Multithreading
String handling
File I/O
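The following is a minimal sketch of how regular expressions, string handling, and a thread pool might fit together when extracting the four required fields. The HTML markup, the class names (`title`, `date`, `dept`, `body`), the page paths, and the `fetch` stand-in are all assumptions made for illustration; the real OA pages are likely structured differently and would need their own patterns and a real HTTP request.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from datetime import date

# Hypothetical pages: the real markup of the OA system is NOT known here.
SAMPLE_PAGES = {
    "news/1.htm": '<h1 class="title"> Library Notice </h1>'
                  '<span class="date">2019-05-21</span>'
                  '<span class="dept">Library</span>'
                  '<div class="body"><p>Closed on  Monday.</p></div>',
    "news/2.htm": '<h1 class="title">Old Notice</h1>'
                  '<span class="date">2017-12-31</span>'
                  '<span class="dept">Archive</span>'
                  '<div class="body"><p>Out of range.</p></div>',
    "news/3.htm": '<h1 class="title">Exam Schedule</h1>'
                  '<span class="date">2018-03-05</span>'
                  '<span class="dept">Academic Affairs</span>'
                  '<div class="body"><p>See the attachment.</p></div>',
    "news/4.htm": "<html>no article here</html>",
}

# One pattern with named groups for the four required fields.
ITEM_RE = re.compile(
    r'<h1 class="title">(?P<title>.*?)</h1>.*?'
    r'<span class="date">(?P<date>\d{4}-\d{2}-\d{2})</span>.*?'
    r'<span class="dept">(?P<dept>.*?)</span>.*?'
    r'<div class="body">(?P<body>.*?)</div>',
    re.S,
)
TAG_RE = re.compile(r"<[^>]+>")

# Date range taken from the assignment text.
START, END = date(2018, 1, 1), date(2019, 6, 1)


def fetch(url):
    """Stand-in for an HTTP request; returns canned HTML."""
    return SAMPLE_PAGES[url]


def parse(html):
    """Return a dict with the four fields, or None if the page does not match."""
    m = ITEM_RE.search(html)
    if m is None:
        return None
    # Strip tags from the body and collapse runs of whitespace.
    body = " ".join(TAG_RE.sub(" ", m["body"]).split())
    return {
        "title": m["title"].strip(),
        "date": date.fromisoformat(m["date"]),
        "department": m["dept"].strip(),
        "body": body,
    }


def crawl(urls, workers=4):
    """Fetch and parse pages concurrently, keeping items inside the date range."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        items = pool.map(lambda u: parse(fetch(u)), urls)
        return [it for it in items if it and START <= it["date"] <= END]


news = crawl(sorted(SAMPLE_PAGES))
for item in news:
    print(item["date"], item["department"], item["title"])
```

Threads suit this task because most of the time is spent waiting on the network. `pool.map` returns results in the order of the input URLs, which keeps the output deterministic.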
The results are indexed by submission date:
One directory per date
Named by the date, e.g. 2019-05-21
One file per news item
Named by the title
Saved in the directory of its submission date
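The directory layout above could be produced along these lines. This is only a sketch: the helper names `safe_filename` and `save_news`, the `.txt` extension, the field order inside the file, and the sample item are assumptions. Titles usually need cleaning because characters such as `/` or `:` are not valid in file names on every platform.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(title, max_len=100):
    """Replace characters that are unsafe in file names with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_. ")
    return (cleaned or "untitled")[:max_len]


def save_news(root, item):
    """Write one news item to <root>/<date>/<title>.txt and return the path."""
    day_dir = Path(root) / str(item["date"])  # one directory per date
    day_dir.mkdir(parents=True, exist_ok=True)
    path = day_dir / (safe_filename(item["title"]) + ".txt")
    text = "\n".join(
        [item["title"], str(item["date"]), item["department"], "", item["body"]]
    )
    path.write_text(text, encoding="utf-8")
    return path


# Hypothetical item written to a temporary directory.
root = tempfile.mkdtemp()
saved = save_news(root, {
    "title": "Notice: exam schedule 2019/06",
    "date": "2019-05-21",
    "department": "Academic Affairs",
    "body": "See the attachment.",
})
print(saved)
```

Two items with the same title on the same date would overwrite each other in this sketch; appending a counter or an ID to the file name is one way to avoid that.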
Simple analysis of the results:
The total number of crawled news items. The more news you crawl, the higher your score.
The total number of news items per week, shown as a curve [1]. If possible, broken down by department.
The average number of news items per day.
The average number of news items for each weekday, preferably with a boxplot [2].
The average number of news items per department, preferably with a boxplot [2].
Any other statistics you find interesting.
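The counts listed above can be computed with the standard library alone before any plotting. The sketch below uses a small made-up sample of `(date, department)` records. It assumes that "average per day" is taken over every calendar day between the first and last item, including days with no news; a different definition would change the numbers.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta
from statistics import mean

# Hypothetical sample: (submission date, department) for each crawled item.
records = [
    (date(2019, 5, 20), "Library"),            # Monday
    (date(2019, 5, 20), "Academic Affairs"),   # Monday
    (date(2019, 5, 21), "Library"),            # Tuesday
    (date(2019, 5, 27), "Library"),            # Monday
]

total = len(records)
per_day = Counter(d for d, _ in records)

# Weekly totals, keyed by the Monday that starts each week (data for the curve).
per_week = Counter(d - timedelta(days=d.weekday()) for d, _ in records)

# Average per day over the whole calendar span, counting days without news.
first, last = min(per_day), max(per_day)
span_days = (last - first).days + 1
avg_per_day = total / span_days

# Daily counts grouped by weekday (0 = Monday); these lists feed a boxplot.
by_weekday = defaultdict(list)
for offset in range(span_days):
    day = first + timedelta(days=offset)
    by_weekday[day.weekday()].append(per_day.get(day, 0))
avg_by_weekday = {wd: mean(counts) for wd, counts in by_weekday.items()}

# Totals per department.
per_department = Counter(dept for _, dept in records)

print(total, avg_per_day)
print(dict(per_week))
print(avg_by_weekday)
print(dict(per_department))
```

The per-weekday lists in `by_weekday` are kept rather than only their means because a boxplot needs the full distribution of daily counts.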
The Word Cloud Plot of the results (40%)
Segment the news text with Jieba [3].
Remove the stop words.
Extract the keywords of each news item with TF-IDF and TextRank (Jieba).
Present one day's news as a word cloud plot together with the keyword scores (D3 [4-5]).
Design a web page that presents all the results, indexed by date.
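To make the keyword step concrete, here is a plain-Python sketch of stop-word removal and TF-IDF scoring on documents that are already segmented into tokens. It is not Jieba's implementation: Jieba ships its own IDF table, so its scores will differ, and TextRank is not shown. The tiny stop-word set, the sample documents, and the `text`/`score` field names in the JSON output are assumptions; the field names would need to be mapped onto whatever the word cloud layout expects.

```python
import json
import math
from collections import Counter

# Placeholder stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "of", "on", "is"}

# Hypothetical, already segmented documents (segmentation is done elsewhere).
docs = {
    "library-notice": ["the", "library", "is", "closed", "on", "monday"],
    "exam-notice": ["the", "exam", "schedule", "of", "the", "library", "course"],
}


def tf_idf(docs, stop_words):
    """Return {doc: {token: score}} using tf = count/len and idf = log(N/df)."""
    cleaned = {
        name: [t for t in tokens if t not in stop_words]
        for name, tokens in docs.items()
    }
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(t for tokens in cleaned.values() for t in set(tokens))
    n_docs = len(cleaned)
    scores = {}
    for name, tokens in cleaned.items():
        counts = Counter(tokens)
        scores[name] = {
            t: (c / len(tokens)) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()
        }
    return scores


def top_keywords(doc_scores, k=5):
    """Top-k tokens by score, ties broken alphabetically, as word cloud records."""
    ranked = sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"text": t, "score": round(s, 4)} for t, s in ranked[:k]]


scores = tf_idf(docs, STOP_WORDS)
cloud = top_keywords(scores["library-notice"], k=2)
print(json.dumps(cloud, ensure_ascii=False))
```

With this definition of IDF, a token that occurs in every document scores zero, which is why "library" drops out of the top keywords in the sample.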
The Final Thesis
You should submit the following results to our system:
The spider code
The demonstration web page
The thesis
The result files should be uploaded to your private cloud server; submit the download URL.
This is a team project.
Each team has 1 to 5 students.
Every team has a leader.
The leader should specify each member's contribution as a percentage.
Your thesis should include:
Title Page
Abstract
Table of Contents (optional)
Chapter One Introduction
Chapter Two Review of Literature (optional)
Chapter Three Methods
Chapter Four Data Analysis and Results
Chapter Five Conclusion
References
Reference
[1] https://bl.ocks.org/mbostock/3884955
[2] https://bl.ocks.org/mbostock/4061502
[3] https://github.com/anderscui/jieba.NET/
[4] https://www.jasondavies.com/wordcloud/
[5] https://d3js.org/
[6] GitHub - zlzforever/DotnetSpider. https://github.com/zlzforever/DotnetSpider.
[7] Web scraping - Wikipedia. https://en.wikipedia.org/wiki/Web_scraping.
[8] GitHub - code4craft/webmagic: A scalable web crawler framework for . https://github.com/code4craft/webmagic.
[9] Scrapy. https://scrapy.org/.
[10] Plagiarism - Wikipedia. https://en.wikipedia.org/wiki/Plagiarism.
[11] Programming style - Wikipedia. https://en.wikipedia.org/wiki/Programming_style.
[12] Viewing the history of your project - GitHub Help. https://help.github.com/desktop/guides/contributing/viewing-the-history-of-your-project/.
[13] Webscraping with C# - CodeProject. 20 Oct. 2015, https://www.codeproject.com/Articles/1041115/Webscraping-with-Csharp.