Regular expressions and web scraping
In lectures we studied regular expressions and used <regex101.com> to test regexs interatively. Now lets practice using them in Python.
Figure 1: On https://nationalservice.ichec.ie/login/login.php, there is a list of all the ICHEC projects, Classes
A, B and C.
We can use Ctrl-A, Ctrl-C, Ctrl-V to put this data in a text file: data/ichec_projects_scrape.txt.
However, it is now unstructured plain text. Lets use regular expressions to extract the project codes. Each code is like ngcom018c or ulphy033a.
- import re
- Read the data: s = open(../data/ichec_projects_scrape.txt).read()
- Write a pattern p to match codes (maybe test on regex101.com)
- Call a Python re function to find all the project codes.
- Notice that the codes seem to have a specific encoding: ngcom018c is NUI Galway, Computer Science, 18, Class-C. ulphy033a is University of Limerick, Physics, 033, Class-A. Use grouping ( ) to extract the four individual parts in each code. Using this, how many Class-C Computer Science projects are there across all universities?
- Write a new pattern to match only NUI Galway projects, and test it.
(Solutions: code/count_ichec_projects.py.)
Generative art using grammars
We already have the following code which will generate an image given a string (the string representing an arithmetic expression). Notice here we are using x[0] and x[1] to represent the two axes (not x and y as in the notebook).
import numpy as npimport matplotlib.pyplot as plt import matplotlib.cm as cmn = 200xs = np.linspace(0, 1, n) ys = np.linspace(0, 1, n)x = np.meshgrid(xs, ys) # x contains x[0] and x[1]ps = np.sin(40 * x[0]) * np.sin(30 * (x[1]+0.5)) * x[0] * x[1] p = eval(lambda x: + ps)plt.imshow(p(x)) plt.axis(off) plt.show() |
- Change ps to make cooler/more complex images.
We also have the following code which will derive a new string we can use instead of ps:
from grammar import Grammar # assume we are in code/ directory fname = arithmetic.bnf g = Grammar(file_name=fname) ps = g.derive_string() print(ps)
- Use this to generate several images. If you sometimes see the error TypeError: Invalid shape () for image data, thats probably because the grammar generated a string like 0, i.e. a constant. There are ways to work around this, but we can just ignore it and generate a new one.
- If you like, put everything in a convenient function or in a loop to make the process of trying new ones quicker.
- Change arithmetic.bnf to allow some cooler/more complex images. Post your best images on the Discussion Board.
Optional ideas: try different colour maps (see matplotlib.cm), or create polar coordinate variables
(r,).
- Take a look at derive_string(), defined in grammar.py, to see the implementation of the simple algorithm that we defined in lectures for deriving a string from a grammar.
Reviews
There are no reviews yet.