Data Formats (1)
School of Computing and Information Systems
@University of Melbourne 2022
Data formats
COMP20008 Elements of Data Processing
Categories of data formats
Unstructured, Semi-Structured, Structured
Examples: text files/documents, spreadsheets, social media data, CSV, NoSQL, ...
Unstructured formats are more human readable; structured formats are more machine readable.
Structured data
1. Relational databases
2. Spreadsheets
Relational database
https://clockwise.software/blog/relational-vs-non-relational-databases-advantages-and-disadvantages/
Relational database cont.
[Figure: example tables with columns such as Supervisor, SupervisorName and SupervisorID; "Linguistics" appears as a sample value.]
SELECT StudentID, Grade
FROM grade_table, supervisor_table
WHERE grade_table.SupervisorID = supervisor_table.ID
AND SupervisorName = 'Prof. ...';
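As a rough illustration of how a query like the one above behaves, the sketch below runs it against an in-memory SQLite database using Python's sqlite3 module. The table contents, the column types and the name 'Prof. Example' are invented for this example; the slides do not give the actual rows or the supervisor's name.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supervisor_table (ID INTEGER PRIMARY KEY, SupervisorName TEXT);
    CREATE TABLE grade_table (StudentID INTEGER, Grade TEXT, SupervisorID INTEGER);
    INSERT INTO supervisor_table VALUES (1, 'Prof. Example'), (2, 'Prof. Other');
    INSERT INTO grade_table VALUES (101, 'H1', 1), (102, 'H2A', 2), (103, 'H3', 1);
""")

# Same shape as the query on the slide: join the two tables on the
# supervisor key, then filter by supervisor name.
query = """
    SELECT StudentID, Grade
    FROM grade_table, supervisor_table
    WHERE grade_table.SupervisorID = supervisor_table.ID
      AND SupervisorName = ?
    ORDER BY StudentID
"""
rows = conn.execute(query, ("Prof. Example",)).fetchall()
conn.close()

print(rows)  # students supervised by the hypothetical 'Prof. Example'
```

The ORDER BY clause is an addition that makes the output order deterministic; the name is passed as a parameter rather than pasted into the SQL string.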
Database Systems (INFO20003)
INFO20003 covers related topics, including:
SQL
Specification of integrity constraints
Data modelling and relational database management systems
Transactions and concurrency control
Storage management
Web-based databases
Highly relevant to data wrangling!
Useful to do INFO20003 as part of a data science specialisation.
Joins: relational algebra
Joins in Pandas
INNER JOIN, LEFT JOIN, RIGHT JOIN, OUTER JOIN
https://medium.com/swlh/merging-dataframes-with-pandas-pd-merge-7764c7e2d46d
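The four join types can be compared side by side with pandas.merge and its how parameter. This is a minimal sketch: the two DataFrames and their values are made up for illustration, chosen so that each table has one key the other lacks.

```python
import pandas as pd

# Hypothetical tables: SupervisorID 3 has no match on the right,
# and SupervisorID 4 has no match on the left.
grades = pd.DataFrame({"StudentID": [101, 102, 103],
                       "SupervisorID": [1, 2, 3]})
supervisors = pd.DataFrame({"SupervisorID": [1, 2, 4],
                            "SupervisorName": ["A", "B", "D"]})

# Run the same merge with each join type.
joined = {
    how: pd.merge(grades, supervisors, on="SupervisorID", how=how)
    for how in ("inner", "left", "right", "outer")
}

for how, frame in joined.items():
    print(how, len(frame))
```

An inner join keeps only keys present in both tables, left and right joins keep every row of one side and fill the gaps with missing values, and an outer join keeps every key from either side.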
Challenges
Once data is in a relational database, it is easier to wrangle, but it may be difficult to load it there in the first place.
Further reading
https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html
https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html
https://pandas.pydata.org/pandas-docs/version/0.22.0/merging.html
Pages 403-409 of http://i.stanford.edu/~ullman/focs/ch08.pdf
Spreadsheets
Huge amounts of data are in spreadsheets
Businesses
Hospitals
Tabular format (transactional, simple but many entries) with validation capability
Microsoft (Excel), OpenOffice (Calc), Google Sheets
What you should know
Why do we have different data formats and why do we wish to transform between different formats?
Motivation for using relational databases to manage information
What is a csv, what is a spreadsheet, what is the difference?
Difference between HTML and XML and when to use each
Be able to read and write data in XML (elements, attributes)
Be able to read and write data in JSON
Difference between XML and JSON; applications where each can be used.
Data Formats cont. (2)
Semi-structured data
1. CSV
2. HTML
3. XML
4. JSON
CSV: comma-separated values
Tabular information, with extension .csv
Structured, but not like Excel or a relational DB
Just a delimited text file, human readable.
Lacks formatting information
Does not contain formulas or macros for data verification and transformation
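To make the "just a delimited text file" point concrete, the sketch below reads a small CSV string with Python's standard csv module. The column names and values are hypothetical. Note that every field comes back as a string, since the format carries no type or formatting information.

```python
import csv
import io

# A hypothetical two-row CSV file, held in memory for the example.
csv_text = "StudentID,Grade\n101,85\n102,72\n"

# DictReader uses the first line as the header row.
reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)

# All values are plain strings; conversion is the reader's job.
grades = [int(record["Grade"]) for record in records]

print(records)
print(grades)
```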
HTML: HyperText Markup Language
Marked up with elements, which correspond to logical units such as a heading, paragraph or itemised list.
Defines how a web browser will format and display the content.
Elements are marked by tags.
Tags: keywords contained in pairs of angle brackets, not case sensitive
closed tags:
unclosed tag:
Elements can have attributes; ordering of attributes is not significant
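The points about tags and attributes can be observed with Python's standard html.parser module. This is a sketch only: the HTML snippet and the TagCollector class are invented for illustration. The parser reports tag names in lower case whatever their original case, reports the br tag with no matching end tag, and exposes attributes as name/value pairs.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper that records the start and end tags it sees."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a dict ignores their order.
        self.start_tags.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


html_text = '<H1 id="top" class="title">Data formats</H1><p>First line<br>second line</p>'
collector = TagCollector()
collector.feed(html_text)
collector.close()

print(collector.start_tags)
print(collector.end_tags)
```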
HTML example
Try it yourself: https://www.w3schools.com/html/tryit.asp?filename=tryhtml5_browsers_myhero
HTML examples: https://www.w3schools.com/html/html_lists.asp
Limitations of HTML
HTML was designed for pure presentation
HTML is concerned with formatting not meaning
it doesn't matter what it is about, HTML will format it
HTML is not extensible
cannot be modified to capture specific domain knowledge
browsers have developed their own tags (
HTML can be inconsistently applied
almost everything is rendered somehow
e.g., is this acceptable?
XML: eXtensible Markup Language
Extensible: user-defined tags
Facilitates better encoding of semantics
A meta markup language (self-describing)
Mathematical Markup Language (MathML)
ChemML (Chemical Markup Language)
FHIR (health/medical data: http://hl7.org/fhir)
RSS, SOAP, SVG, ...
XML syntax: well-formed
Begins with a declaration, the XML prolog.
Elements
One root element
Properly nested
Attribute values must be quoted
Must have a closing tag:
Comments
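The well-formedness rules above can be checked mechanically, because an XML parser rejects any document that breaks them. The sketch below uses the standard library's xml.etree.ElementTree; the is_well_formed helper and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical documents: one well-formed, four that each break one rule.
good = ('<?xml version="1.0"?>'
        '<course code="COMP20008"><!-- a comment -->'
        '<title>Elements of Data Processing</title></course>')
bad_nesting = "<course><title>Data</course></title>"   # not properly nested
bad_unquoted = "<course code=COMP20008></course>"      # attribute value not quoted
bad_two_roots = "<course></course><course></course>"   # more than one root element
bad_unclosed = "<course><title>Data</course>"          # title is never closed

for name, text in [("good", good), ("bad_nesting", bad_nesting),
                   ("bad_unquoted", bad_unquoted),
                   ("bad_two_roots", bad_two_roots),
                   ("bad_unclosed", bad_unclosed)]:
    print(name, is_well_formed(text))
```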
XML syntax cont.
Preserves whitespace.
Some characters have special meaning.
< and & are strictly illegal as literal text inside an element
A CDATA (character data) section may be used inside an XML element to include large blocks of text that may contain special characters such as & and >
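A short sketch of the two usual ways to carry special characters in element text, using only the Python standard library: escaping them as entities with xml.sax.saxutils.escape, or wrapping the text in a CDATA section. The sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "x < y & y > z"

# Option 1: replace &, < and > with entity references.
escaped = escape(raw)
doc_escaped = ET.fromstring(f"<note>{escaped}</note>")

# Option 2: wrap the raw text in a CDATA section.
doc_cdata = ET.fromstring(f"<note><![CDATA[{raw}]]></note>")

# Pasting the raw text straight into the element is not well-formed.
try:
    ET.fromstring(f"<note>{raw}</note>")
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False

# Whitespace inside element text is kept as written.
padded = ET.fromstring("<note>  two  spaces  </note>").text

print(escaped)
print(doc_escaped.text == raw, doc_cdata.text == raw, unescaped_ok)
```

Both options give back the same text after parsing; CDATA is mainly a convenience for large blocks that would otherwise need many entities.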
XML: valid
Well-formed ≠ valid; valid = well-formed + formal validation
We do not cover DTD and XML schema in this subject
Document Type Definition (DTD)
https://www.w3schools.com/xml/
XML Schema
https://www.w3schools.com/xml/
MathML example: (a + b)^2
Presentation markup
Content markup
XML vs HTML
XML: extensible. HTML: non-extensible.
XML: case sensitive. HTML: not case sensitive.
XML: focus on semantics. HTML: focus on display.
Further reading
XML http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SG.html
Data Formats cont. (3)
Semi-structured data cont.
JSON: JavaScript Object Notation
JSON (www.json.org)
Specified by Douglas Crockford (pretty much alone)
cf. the development of XML by committee
JavaScript: The Good Parts (O'Reilly, Yahoo Press)
JSON syntax rules
JSON data is in key/value pairs: "firstName":"John"
JSON values
A number (integer or floating point)
A string (in double quotes)
A Boolean (true or false)
An array (in square brackets)
An object (in curly braces)
null
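Each JSON value type above has a natural counterpart when parsed with Python's standard json module. The sketch below uses a made-up object with one key per value type and records the Python type that json.loads produces for each.

```python
import json

# Hypothetical object containing one value of each JSON type.
text = ('{"count": 3, "ratio": 0.5, "name": "John", "active": true, '
        '"tags": ["a", "b"], "address": {"city": "Melbourne"}, "nickname": null}')

data = json.loads(text)

# Map each key to the name of the Python type it was parsed into.
types = {key: type(value).__name__ for key, value in data.items()}

print(types)
```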
JSON syntax rules cont.
JSON objects
{"firstName":"John", "lastName":"Doe"}
JSON arrays
[
{"firstName":"John", "lastName":"Doe"},
{"firstName":"Anna", "lastName":"Smith"},
{"firstName":"Peter", "lastName":"Jones"}
]
These objects repeat recursively down a hierarchy as needed.
In terms of syntax that's pretty much it!
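A minimal sketch of reading and writing the array of objects above with Python's json module. The enclosing "team" object is a hypothetical addition to show nesting, and the final check illustrates that keys and strings must be in double quotes for the text to be valid JSON.

```python
import json

people_text = ('[{"firstName": "John", "lastName": "Doe"}, '
               '{"firstName": "Anna", "lastName": "Smith"}, '
               '{"firstName": "Peter", "lastName": "Jones"}]')

# Reading: the array becomes a list, each object a dict.
people = json.loads(people_text)
last_names = [person["lastName"] for person in people]

# Writing: nest the array inside another object and serialise it.
team = {"team": "example", "members": people}
round_trip = json.loads(json.dumps(team))

# Unquoted keys and strings are not valid JSON.
try:
    json.loads("{firstName:John}")
    unquoted_ok = True
except json.JSONDecodeError:
    unquoted_ok = False

print(last_names)
print(round_trip == team, unquoted_ok)
```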
JSON format (from json.org)
JSON compared to XML
https://www.w3schools.com/js/js_json_xml.asp
JSON is simpler and more compact/lightweight than XML and easy to parse, which appeals to programmers looking for speed and efficiency.
Widely used for storing data in NoSQL databases
Common JSON application: read and display data from a web server using JavaScript. https://www.w3schools.com/js/js_json.asp
XML comes with a large family of other standards for querying and transforming (XQuery, XML Schema, XPath, XSLT, namespaces, ...)
allows formal validation
makes you consider the data design more closely
Python modules for JSON and XML
json
lxml
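As a rough comparison, the sketch below extracts the same records from a JSON document and from an equivalent XML document. To stay self-contained it uses the standard library's xml.etree.ElementTree in place of lxml (lxml.etree offers a largely compatible interface); the employee data and element names are invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical records, once as JSON and once as XML.
json_text = ('{"employees": [{"firstName": "John", "lastName": "Doe"}, '
             '{"firstName": "Anna", "lastName": "Smith"}]}')
xml_text = """<employees>
  <employee><firstName>John</firstName><lastName>Doe</lastName></employee>
  <employee><firstName>Anna</firstName><lastName>Smith</lastName></employee>
</employees>"""

# JSON parses directly into dicts and lists.
from_json = [(e["firstName"], e["lastName"])
             for e in json.loads(json_text)["employees"]]

# XML parses into a tree of elements that is then navigated by tag name.
root = ET.fromstring(xml_text)
from_xml = [(e.findtext("firstName"), e.findtext("lastName"))
            for e in root.findall("employee")]

print(from_json)
print(from_xml)
```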
JSON: Summary
JavaScript Object Notation
Lightweight, streamlined, standard method of data exchange
Originally designed to speed up client/server interactions by running in the client browser
Can be used to represent any kind of semi-structured data
Lacks context and schema definitions
Unstructured Data Intro
Unstructured data: text
Text files
No structure.
Lacks regularity and decomposable internal structure
Hard to index
Hard to organise
How can we process and search for textual information?
More on text data later.
What you should know
Categorising data formats based on their structural regularity
Why do we have different data formats and why do we wish to transform between different formats?
Motivation for using relational databases to manage information
What is a csv, what is a spreadsheet, what is the difference?
Difference between HTML and XML and when to use each
Be able to read and write data in XML (elements, attributes)
Be able to read and write data in JSON
Difference between XML and JSON; applications where each can be used.
Questions:
Twitter: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json
Q: In which object type can we find hashtags in Twitter's object model?