CSE205 Introduction to Networking Project 2
CSE205 Introduction to Networking
Project 2 Imagecrawler
How should the work be submitted?
SOFT COPY ONLY!
You must submit your work through ICE so that markers can run your code during marking.
Make sure your name (Last name SURNAME, e.g. San ZHANG) and your student ID are on the cover page of your report.
Project requirements
The goal of this project is to build an Imagecrawler application that can download images from websites and save them on your local computer. The program should take two parameters (input from keyboard): a URL that is the starting point of the crawl, and a depth, which is how many pages deep your crawler should go. The depth parameter is optional and defaults to 5 if it is not specified.
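As an illustration only, the two keyboard inputs could be read and checked as in the sketch below. The names parse_crawl_input, main and DEFAULT_DEPTH are hypothetical, and the checks on the URL scheme and on negative depths are assumptions that the project description does not require.

```python
DEFAULT_DEPTH = 5  # default depth stated in the project description


def parse_crawl_input(url_text, depth_text=""):
    """Return (url, depth) from the two raw input strings.

    Assumption: the start URL must begin with http:// or https://,
    and the depth must not be negative.
    """
    url = url_text.strip()
    if not url.startswith(("http://", "https://")):
        raise ValueError("URL must start with http:// or https://")
    depth_text = depth_text.strip()
    # An empty answer means "use the default depth".
    depth = int(depth_text) if depth_text else DEFAULT_DEPTH
    if depth < 0:
        raise ValueError("depth must not be negative")
    return url, depth


def main():
    """Ask for both parameters on the keyboard and echo what was understood."""
    url, depth = parse_crawl_input(input("Start URL: "), input("Depth [5]: "))
    print(f"Crawling {url} to depth {depth}")
```

Calling main() runs the prompt interactively. Keeping the parsing in its own function means it can be tested without typing anything.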
Contribution to Overall Marks: 15%
Submission Deadline: Fri. 22nd Nov. 2019, 23:59
The way this should work is:
1. (10 points) Connect to the supplied URL and request the web page. Because we are studying networking, you are only allowed to use the requests module in Python or plain TCP sockets (using pure sockets earns a 5-point bonus).
a) You should create a folder named after the website. For example, if the URL is http://www.cnn.com, your folder should be named www.cnn.com.
b) URLs can include paths. For example: http://www.xjtlu.edu.cn/en/departments/academic-departments/computer-science-and-software-engineering/. In this case, you should make a series of directories with the same structure as the URL.
2. (20 points) Download all images in the page (.gif, .jpg, .jpeg, .png, .webp, case insensitive).
a) The names of the images should be the same as they are on the remote server.
b) Images may be on the same server or on different servers. You should store each image based on the page it appears on, not the server it exists on (for example, http://www.cnn.com has many images that exist on different servers, but when you download them, store them all in the www.cnn.com folder). This also applies to images that exist on the same server: regardless of the folder the image exists in, it should be stored in the folder for the current URL.
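The folder and file naming rules in steps 1 and 2 can be sketched as follows. This is one possible reading of the rules, not a reference solution. The function names are hypothetical, and the sketch assumes that every path segment of the page URL becomes a directory and that any query string is ignored.

```python
import os
from urllib.parse import urlsplit

# Extensions listed in step 2; matching is case insensitive.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".webp")


def is_image_url(url):
    """True if the URL path ends in one of the listed image extensions."""
    return urlsplit(url).path.lower().endswith(IMAGE_EXTENSIONS)


def page_folder(page_url):
    """Local folder for a page: the host name followed by the URL path segments."""
    parts = urlsplit(page_url)
    # Drop empty segments and "." / ".." so the path stays inside the site folder.
    segments = [s for s in parts.path.split("/") if s and s not in (".", "..")]
    return os.path.join(parts.netloc, *segments)


def image_save_path(page_url, image_url):
    """Keep the remote file name, but store it in the folder of the page."""
    name = os.path.basename(urlsplit(image_url).path)
    return os.path.join(page_folder(page_url), name)
```

The folder itself can then be created with os.makedirs(folder, exist_ok=True), which builds the whole series of directories in one call.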
3. (20 points) For all href links in the page, repeat steps 1 and 2, up to the depth specified by the user.
a) Remember that links can be absolute or relative to the current server (e.g. http://www.cnn.com is absolute, but en/departments/academic-departments/computer-science-and-software-engineering/about/learning-and-teaching is relative to the current server).
b) The depth of a page is the number of links you followed to get to it. The original URL is depth 0, any links on that page have depth 1, any links on any depth 1 pages have depth 2, etc.
c) Remember that links may be circular. I could link to you, and you could link back to me.
d) Ignore style sheets, Javascript, etc.
e) To find links in the HTML text, I suggest you look at the Python regular expression module (re) or HTMLParser (found in html.parser in Python 3). These are both part of the Python standard library.
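A minimal sketch of step 3 is given below, assuming a breadth-first crawl. LinkParser and crawl are hypothetical names. The fetch argument stands for whatever function you write to request a page (with requests or sockets) and return its HTML text. Only a tags are read, so style sheets referenced through link tags are ignored.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.links.append(href)


def crawl(start_url, max_depth, fetch):
    """Return (url, depth) pairs in the order pages are visited.

    fetch(url) is supplied by the caller and returns the page HTML as text.
    """
    visited = {start_url}  # guards against circular links
    queue = deque([(start_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))  # this is where images would be downloaded
        if depth == max_depth:
            continue  # page is visited, but its links are not followed
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            # urljoin resolves relative links against the current page URL.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in visited:
                visited.add(absolute)
                queue.append((absolute, depth + 1))
    return order
```

Passing fetch in as a parameter keeps the depth and loop handling separate from the networking code, so the crawl logic can be tested with a small dictionary of fake pages.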
4. (10 points) To speed up your application, use threads so that it downloads in parallel.
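One possible way to run downloads in parallel is a thread pool from the standard concurrent.futures module, sketched below. download_all is a hypothetical name, the download argument stands for your own function that fetches and saves one URL, and the default of 8 worker threads is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, download, max_workers=8):
    """Run download(url) for every URL on a pool of worker threads.

    Returns a dict mapping each URL to its result, or to the exception
    it raised, so one failed download does not stop the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which future belongs to which URL.
        futures = {pool.submit(download, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc
    return results
```

Threads suit this task because each download spends most of its time waiting on the network. Using threading.Thread directly with a queue would be an equally valid design.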
5. (40 points) A development report:
a) Introduction: project requirements (in your own words), background, literature review (try to find some papers or development reports of similar apps), and what your contribution is
b) Methodology: proposed ideas, program flow charts
c) Implementation: steps of implementation, what difficulties you met and how you solved them
d) Testing and results: testing plan and testing results (screenshots, curves)
e) Conclusion: what you did, and whether you have a future plan to improve it
f) References [IEEE format]
What should be submitted?
The submission should include:
A development report (no more than 8 pages, single column, PDF format);
Python code.
Please compress the report and code into a ZIP file (not RAR, not Kuaiya, and not any other format, please). The file name should be:
CSE205_P2_Last name_SURNAME.zip (e.g. CSE205_P2_San_ZHANG.zip).
You are encouraged to use LaTeX to write your report. The template is at https://github.com/feimax/latex_template_for_xjtlu_eee. If you are still using MS Word, please refer to the PDF file in the LaTeX template.