Data Formats – (1)
School of Computing and Information Systems
@University of Melbourne 2022

Data formats
COMP20008 Elements of Data Processing

Categories of data formats
[Figure: data formats arranged from more machine readable to more human readable; labels include text files/documents and social media data]

Structured data
1. Relational databases
2. Spreadsheets

Relational database

Relational database – cont.
SELECT StudentID, Grade
FROM grade_table, supervisor_table
WHERE grade_table.SupervisorID = supervisor_table.ID
  AND "Supervisor name" = 'Prof.'
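
The query above can be tried out with Python's built-in sqlite3 module. This is only a minimal sketch: the table and column names follow the slide, but the column types, the rows and the supervisor names ('Prof. A', 'Dr. B') are made-up assumptions for illustration.

```python
import sqlite3

# In-memory database. Table and column names follow the slide's query;
# column types and all rows are hypothetical illustration data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supervisor_table (
        ID INTEGER PRIMARY KEY,
        "Supervisor name" TEXT
    );
    CREATE TABLE grade_table (
        StudentID INTEGER,
        Grade TEXT,
        SupervisorID INTEGER REFERENCES supervisor_table(ID)
    );
    INSERT INTO supervisor_table VALUES (1, 'Prof. A'), (2, 'Dr. B');
    INSERT INTO grade_table VALUES (101, 'H1', 1), (102, 'H2A', 2), (103, 'H3', 1);
""")

# Join the two tables on the supervisor ID, then filter by supervisor name.
rows = conn.execute("""
    SELECT StudentID, Grade
    FROM grade_table, supervisor_table
    WHERE grade_table.SupervisorID = supervisor_table.ID
      AND "Supervisor name" = ?
    ORDER BY StudentID
""", ("Prof. A",)).fetchall()

print(rows)
conn.close()
```

The supervisor name is passed as a query parameter (`?`) rather than pasted into the SQL string, which is the usual way to avoid quoting problems.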

Database Systems – (INFO20003)
• INFO20003 covers related topics including
• SQL
• Specification of integrity constraints
• Data modelling and relational database management systems
• Transactions and concurrency control
• Storage management
• Web-based databases
• Highly relevant to data wrangling! Useful to do INFO20003 as part of a data science specialisation

Joins – relational algebra
https://medium.com/swlh/merging-dataframes-with-pandas-pd-merge-7764c7e2d46d

Joins in Pandas
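
A minimal sketch of the main join types using `DataFrame.merge`. The two small tables, their column names and their values are hypothetical; they are chosen so that one row on each side has no match.

```python
import pandas as pd

# Hypothetical tables: supervisor 3 is missing from `supervisors`,
# and supervisor 4 has no students in `grades`.
grades = pd.DataFrame({
    "StudentID": [101, 102, 103],
    "SupervisorID": [1, 2, 3],
})
supervisors = pd.DataFrame({
    "ID": [1, 2, 4],
    "SupervisorName": ["Prof. A", "Dr. B", "Dr. C"],
})

keys = dict(left_on="SupervisorID", right_on="ID")

inner = grades.merge(supervisors, how="inner", **keys)  # only matching rows
left = grades.merge(supervisors, how="left", **keys)    # every row of `grades`
outer = grades.merge(supervisors, how="outer", **keys)  # every row of both tables

print(len(inner), len(left), len(outer))
```

Rows without a match are kept by the left and outer joins, with missing values (NaN) filled in for the columns that come from the other table.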

• Once data is in a relational database, it is easier to wrangle.
• But it may be difficult to load it there in the first place …

• https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html
• https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html
• https://pandas.pydata.org/pandas-docs/version/0.22.0/merging.html
• Further reading
• Pages 403-409 of http://i.stanford.edu/~ullman/focs/ch08.pdf

Spreadsheets
• Huge amounts of data are in spreadsheets
• Businesses
• Hospitals
• Tabular format (transactional, simple but many entries) with validation capability
• Microsoft (Excel), OpenOffice (Calc), Google Sheets

What you should know
• Why do we have different data formats and why do we wish to transform between different formats?
• Motivation for using relational databases to manage information
• What is a csv, what is a spreadsheet, what is the difference?
• Difference between HTML and XML and when to use each
• Be able to read and write data in XML (elements, attributes)
• Be able to read and write data in JSON
• Difference between XML and JSON; applications where each can be used.

Data Formats – cont. (2)

Semi-structured data
1. CSV
2. HTML
3. XML
4. JSON

CSV – comma separated values
• Tabular information, with extension .csv
• Structured, but not like Excel or a relational DB
• Just a delimited text file, human readable
• Lacks formatting information
• Does not contain formulas or macros for data verification or transformation
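
A minimal sketch of reading CSV with Python's built-in csv module. The file content is hypothetical, and `io.StringIO` stands in for an open .csv file so the example is self-contained.

```python
import csv
import io

# Hypothetical CSV content: a header row followed by two records.
csv_text = "StudentID,Grade\n101,85\n102,72\n"

# DictReader uses the first row as column names.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0])

# CSV carries no type information: every value arrives as a string,
# so numeric columns must be converted explicitly.
grades = [int(row["Grade"]) for row in rows]
print(grades)
```

Because the file is just delimited text, any typing or validation that a spreadsheet would provide has to be done by the program that reads it.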

HTML – HyperText Markup Language
• Marked up with elements, which correspond to logical units such as a heading, paragraph or itemised list.
• Defines how a web browser will format and display the content
• Elements are marked by tags.
• Tags: keywords contained in pairs of angle brackets, not case sensitive
• closed tags: <tag>content</tag>
• unclosed tag: e.g. <br>
• can have attributes; ordering of attributes is not significant
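
The points about tags can be illustrated with Python's built-in html.parser module. This is a sketch only: the HTML snippet and the `TagCollector` class are made up for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every start tag (with its attributes) and every end tag."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


# Hypothetical snippet: upper-case tag and attribute names, and a <br>
# that is never closed.
html = '<H1 id="top" CLASS="title">Data formats</H1><p>First line<br>second line</p>'

collector = TagCollector()
collector.feed(html)
collector.close()

print(collector.start_tags)
print(collector.end_tags)
```

The parser reports tag and attribute names in lower case whatever case the source used, and `br` shows up as a start tag with no matching end tag.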

HTML example
Try it yourself: https://www.w3schools.com/html/tryit.asp?filename=tryhtml5_browsers_myhero
HTML examples: https://www.w3schools.com/html/html_lists.asp

Limitations of HTML
• HTML was designed purely for presentation
• HTML is concerned with formatting, not meaning
• it doesn’t matter what the content is about, HTML will format it
• HTML is not extensible
• can’t be modified to meet specific domain knowledge
• browsers have developed their own tags
• HTML can be inconsistently applied
• almost everything is rendered somehow
• e.g., is this acceptable?

XML: eXtensible Markup Language
• Extensible: user defined tags
• Facilitate better encoding of semantics
• A ‘meta’ markup language (self-describing)
• Mathematical Markup Language (MathML)
• ChemML (Chemical Markup Language)
• FHIR (Health/Medical data: http://hl7.org/fhir)
• RSS, SOAP, SVG, …

XML syntax – well formed
• Begins with a declaration, the XML prolog
• Elements
• One root element
• Properly nested
• Attribute values must be quoted
• Must have a closing tag, or be a self-closing tag (which may have an attribute)
• Case sensitive
• Comments
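
The well-formedness rules above can be checked with a parser. The sketch below uses the standard library's xml.etree.ElementTree; the documents and the helper `is_well_formed` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed document: prolog, comment, one root element,
# proper nesting, quoted attributes, and a self-closing tag with an attribute.
well_formed = """<?xml version="1.0"?>
<!-- hypothetical document for illustration -->
<students>
  <student id="101">
    <name>Anna</name>
    <enrolled year="2022"/>
  </student>
</students>"""


def is_well_formed(text):
    """Return True if `text` parses as XML, False on a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


broken = {
    "two root elements": "<a/><b/>",
    "not properly nested": "<a><b></a></b>",
    "unquoted attribute": "<a id=1></a>",
    "case mismatch": "<Name>Anna</name>",
}

print(is_well_formed(well_formed))
for reason, text in broken.items():
    print(reason, is_well_formed(text))
```

Unlike an HTML browser, an XML parser rejects each of the broken documents outright instead of trying to render them somehow.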


XML syntax – cont.
• Preserves white spaces.
I think … therefore I am
• some characters have special meaning
• ‘<’ and ‘&’ are strictly illegal inside an element
• illegal: all books & videos are now < AUD 10
• escaped: all books &amp; videos are now &lt; AUD 10
• A CDATA (character data) section may be used inside an XML element to include large blocks of text, which may contain special characters such as &, >
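
A minimal sketch of the two ways to carry special characters, using the standard library. The `<message>` element name is an assumption chosen for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "all books & videos are now < AUD 10"

# Option 1: escape the special characters as entity references.
escaped = escape(raw)
print(escaped)
text_from_escaped = ET.fromstring(f"<message>{escaped}</message>").text

# Option 2: wrap the raw text in a CDATA section.
text_from_cdata = ET.fromstring(f"<message><![CDATA[{raw}]]></message>").text

# Putting the raw text straight into the element is not well-formed.
try:
    ET.fromstring(f"<message>{raw}</message>")
    raw_is_accepted = True
except ET.ParseError:
    raw_is_accepted = False

print(text_from_escaped == raw, text_from_cdata == raw, raw_is_accepted)
```

Both options give back the original text after parsing; they differ only in how the text is written in the file.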

XML – valid
Well-formed ≠ valid; valid = well-formed + formal validation
Formal validation is done against a Document Type Definition (DTD) or an XML Schema
We do not cover DTD and XML Schema in this subject

MathML example: (a + b)^2
Presentation markup

Content markup

XML compared to HTML
• XML is extensible; HTML is not extensible
• XML is case sensitive; HTML is not case sensitive
• XML focuses on semantics; HTML focuses on display

Further reading
• XML http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SG.html

Data Formats – cont. (3)

Semi-structured data – cont.

JSON – JavaScript Object Notation
• JSON (www.json.org)
• Developed by Douglas Crockford (pretty much alone)
• c.f. the development of XML by committee
• Author of “JavaScript: The Good Parts” (O’Reilly, Yahoo Press)

JSON syntax rules
• JSON data is in key/value pairs: "firstName":"John"
• JSON values
• A number (integer or floating point)
• A string (in double quotes)
• A Boolean (true or false)
• An array (in square brackets)
• An object (in curly braces)
• null

JSON syntax rules – cont.
• JSON Objects
{"firstName":"John", "lastName":"Doe"}
• JSON Arrays
[
{"firstName":"John", "lastName":"Doe"},
{"firstName":"Anna", "lastName":"Smith"},
{"firstName":"Peter", "lastName":"Jones"}
]
• These objects repeat recursively down a hierarchy as needed.
• In terms of syntax that’s pretty much it!
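
A minimal sketch of reading and writing JSON with Python's built-in json module. The array reuses the names from the slide; the extra keys in `record` (age, gpa, enrolled, supervisor, subjects) are hypothetical and only there to show one value of each JSON type.

```python
import json

text = """
[
  {"firstName": "John", "lastName": "Doe"},
  {"firstName": "Anna", "lastName": "Smith"},
  {"firstName": "Peter", "lastName": "Jones"}
]
"""

# Reading: a JSON array becomes a list, a JSON object becomes a dict.
people = json.loads(text)
last_names = [person["lastName"] for person in people]
print(last_names)

# Writing: one value of each JSON type (hypothetical record).
record = {
    "firstName": "John",        # string
    "age": 30,                  # number (integer)
    "gpa": 3.5,                 # number (floating point)
    "enrolled": True,           # Boolean, written as true
    "supervisor": None,         # written as null
    "subjects": ["COMP20008"],  # array
}
encoded = json.dumps(record)
print(encoded)
```

Python's `True`, `False` and `None` are written as JSON's `true`, `false` and `null`, and are converted back when the text is loaded again.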

JSON format (from json.org)

JSON compared to XML
• JSON is simpler and more compact/lightweight than XML; easy to parse.
• This appeals to programmers looking for speed and efficiency
• Widely used for storing data in NoSQL databases
• Common JSON application: read and display data from a web server using JavaScript. https://www.w3schools.com/js/js_json.asp
• XML comes with a large family of other standards for querying and transforming (XQuery, XML Schema, XPath, XSLT, namespaces, …)
• allows formal validation
• makes you consider the data design more closely

Python modules for JSON and XML
• json
• lxml
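
A minimal sketch that reads the same record from JSON and from XML. To stay runnable without installing anything, it uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption of this example, not the slide's choice). The `<person>` element name is also made up.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in both formats.
json_text = '{"firstName": "Anna", "lastName": "Smith"}'
xml_text = "<person><firstName>Anna</firstName><lastName>Smith</lastName></person>"

# JSON maps directly onto Python data structures.
from_json = json.loads(json_text)

# XML gives a tree of elements; here each child element becomes a key.
root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}

print(from_json)
print(from_xml)
```

With JSON the mapping to a dictionary is built in, while with XML the program decides how elements and attributes map to data.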

JSON: Summary
• JavaScript Object Notation
• Lightweight, streamlined, standard method of data exchange
• Originally designed to speed up client/server interactions:
• By running in the client browser
• Can be used to represent any kind of semi-structured data
• Lacks context and schema definitions

Unstructured Data – Intro

Unstructured data – Text
Text files…
• No structure.
• Lacks regularity and decomposable internal structure
• Hard to index
• Hard to organise
• How can we process and search for textual information?
More on text data later.

What you should know
• Categorising data formats based on their structural regularity
• Why do we have different data formats and why do we wish to transform between different formats?
• Motivation for using relational databases to manage information
• What is a csv, what is a spreadsheet, what is the difference?
• Difference between HTML and XML and when to use each
• Be able to read and write data in XML (elements, attributes)
• Be able to read and write data in JSON
• Difference between XML and JSON; applications where each can be used.

• Twitter: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json
• Q: which object type can we find hashtags in Twitter’s object model?

