Lab: Semantic Lifting - XML

From info216
Revision as of 23:39, 18 March 2020 by Say004 (talk | contribs)

Lab 10: Semantic Lifting - XML

Link to Discord server

https://discord.gg/t5dgPrK

Topics

Today's topic involves lifting data in XML format into RDF. XML stands for Extensible Markup Language and is commonly used for data transfer, especially for websites. XML has a tree structure similar to HTML, consisting of a root element, parent and child elements, attributes and so on. The goal is for you to learn, through an example, how we can convert non-semantic data into RDF.


Relevant Libraries/Functions

import requests

import xml.etree.ElementTree as ET

  • ET.parse('xmlfile.xml')

All parts of the XML tree are considered Elements.

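  • ElementTree.getroot() (called on the tree returned by ET.parse)
  • Element.findall("path_in_tree")
  • Element.find("name_of_tag")
  • Element.text
  • Element.attrib["name_of_attribute"] (attrib is a dictionary; Element.get("name_of_attribute") also works)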
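The functions listed above can be tried out on a small document first. The following is a minimal sketch; the `library`/`book` sample data is hypothetical and only there for illustration. It uses ET.fromstring, which parses a string and returns the root Element directly, so no file is needed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
SAMPLE = """<library>
    <book id="b1"><title>First</title></book>
    <book id="b2"><title>Second</title></book>
</library>"""

# ET.fromstring returns the root Element of the parsed string.
# With a file, ET.parse('xmlfile.xml').getroot() gives the same kind of object.
root = ET.fromstring(SAMPLE)

books = root.findall("book")             # all direct children named <book>
first_title = root.find("book/title")    # first element matching the path
ids = [b.attrib["id"] for b in books]    # attrib is a plain dictionary
missing = books[0].get("isbn")           # get() returns None if the attribute is absent

print(root.tag, ids, first_title.text, missing)
```

Note that find returns None when nothing matches, so it is worth checking the result before reading .text from it.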


Tasks

Task 1

Lift the XML data from http://feeds.bbci.co.uk/news/rss.xml about news articles by BBC_News into RDF triples.

You can look at the actual XML structure of the data by pressing Ctrl + U after you have opened the link in your browser.

For instance, a triple should have the form: news_paper_id - hasTitle - titleValue

Do this by parsing the XML using ElementTree (see import above).
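A minimal sketch of one possible approach is given here, under some assumptions: the inline SAMPLE_RSS string stands in for the real feed, the subject naming scheme (news_paper_0, news_paper_1, ...) and the predicates hasLink and hasPubDate are hypothetical, and triples are collected as plain tuples rather than added to an rdflib Graph.

```python
import xml.etree.ElementTree as ET

EX = "http://example.org/"

# Stand-in for the downloaded feed; the real feed has many <item> elements.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>BBC News</title>
    <item>
      <title>Example headline</title>
      <link>https://www.bbc.co.uk/news/example-1</link>
      <pubDate>Tue, 17 Mar 2020 10:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

# Maps an RSS tag to the (hypothetical) predicate name used for it.
TAG_TO_PREDICATE = {"title": "hasTitle", "link": "hasLink", "pubDate": "hasPubDate"}


def lift_items(xml_text):
    """Return (subject, predicate, object) tuples, one per item field."""
    root = ET.fromstring(xml_text)
    triples = []
    for index, item in enumerate(root.findall("./channel/item")):
        subject = f"{EX}news_paper_{index}"
        for tag, predicate in TAG_TO_PREDICATE.items():
            value = item.findtext(tag)  # None if the item has no such child
            if value is not None:
                triples.append((subject, EX + predicate, value.strip()))
    return triples


triples = lift_items(SAMPLE_RSS)
for triple in triples:
    print(triple)
```

With rdflib, each tuple could then be added to the graph as g.add((URIRef(s), URIRef(p), Literal(o))).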


Task 2

Parse through the fictional XML data below and add the correct journalist as the writer of each of the news articles from earlier. For example, if a news article was written on a Tuesday, Thomas Smith is the one who wrote it. One way to do this is by checking whether any of the days in the "whenWriting" attribute is contained in the news article's "pubDate". I recommend starting with the code at the bottom of the page and building on it.

<data>
    <news_publisher name="BBC News">
        <journalist whenWriting="Mon, Tue, Wed" >
            <firstname>Thomas</firstname>
            <lastname>Smith</lastname>
        </journalist>
        <journalist whenWriting="Thu, Fri" >
            <firstname>Joseph</firstname>
            <lastname>Olson</lastname>
        </journalist>
        <journalist whenWriting="Sat, Sun" >
             <firstname>Sophia</firstname>
             <lastname>Cruise</lastname>
        </journalist>
    </news_publisher>
</data>
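The matching between weekdays and journalists could be sketched as follows. This is only one possible approach: the helper names journalists_by_day and writer_for are hypothetical, and the code assumes that each pubDate starts with a three-letter weekday followed by a comma, as in "Tue, 17 Mar 2020 10:00:00 GMT".

```python
import xml.etree.ElementTree as ET

JOURNALISTS_XML = """<data>
    <news_publisher name="BBC News">
        <journalist whenWriting="Mon, Tue, Wed" >
            <firstname>Thomas</firstname>
            <lastname>Smith</lastname>
        </journalist>
        <journalist whenWriting="Thu, Fri" >
            <firstname>Joseph</firstname>
            <lastname>Olson</lastname>
        </journalist>
        <journalist whenWriting="Sat, Sun" >
            <firstname>Sophia</firstname>
            <lastname>Cruise</lastname>
        </journalist>
    </news_publisher>
</data>"""


def journalists_by_day(xml_text):
    """Build a dictionary from weekday abbreviation to journalist name."""
    root = ET.fromstring(xml_text)
    by_day = {}
    for journalist in root.findall("./news_publisher/journalist"):
        name = f"{journalist.findtext('firstname')} {journalist.findtext('lastname')}"
        # "Mon, Tue, Wed" -> ["Mon", "Tue", "Wed"]
        for day in journalist.attrib["whenWriting"].split(","):
            by_day[day.strip()] = name
    return by_day


def writer_for(pub_date, by_day):
    """Look up the journalist from the weekday at the start of pub_date."""
    weekday = pub_date.split(",")[0].strip()
    return by_day.get(weekday)  # None if the weekday is not recognised


by_day = journalists_by_day(JOURNALISTS_XML)
print(writer_for("Tue, 17 Mar 2020 10:00:00 GMT", by_day))
```

The returned name can then be used as the object of a triple linking the article to its writer.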


Task 3


If You have more Time

Code to Get Started

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD
import xml.etree.ElementTree as ET
import requests
import re

g = Graph()
ex = Namespace("http://example.org/")
prov = Namespace("http://www.w3.org/ns/prov#")
g.bind("ex", ex)
g.bind("prov", prov)


# url of rss feed
url = 'http://feeds.bbci.co.uk/news/rss.xml'

# creating HTTP response object from given url
resp = requests.get(url)

# saving the xml file
with open('test.xml', 'wb') as f:
    f.write(resp.content)
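Once the file is saved, a natural next step is to parse it and walk through the items. The sketch below assumes a helper named read_items (hypothetical) and, so that it runs without a network connection, writes a small stand-in feed to a temporary file instead of using the downloaded test.xml.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def read_items(path):
    """Parse an RSS file and return one dictionary per <item> element."""
    tree = ET.parse(path)    # ElementTree for the whole file
    root = tree.getroot()    # the <rss> root Element
    items = []
    for item in root.iter("item"):
        # Map each child tag (title, link, pubDate, ...) to its text.
        items.append({child.tag: (child.text or "").strip() for child in item})
    return items


# Stand-in for the downloaded feed, written to a temporary file.
sample = (
    b'<rss version="2.0"><channel><item>'
    b"<title>Example headline</title>"
    b"<pubDate>Tue, 17 Mar 2020 10:00:00 GMT</pubDate>"
    b"</item></channel></rss>"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.xml")
    with open(path, "wb") as f:
        f.write(sample)
    items = read_items(path)

print(items)
```

With the real feed, calling read_items('test.xml') after the download should work the same way.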



Useful Reading

XML-parsing-python by geeksforgeeks.org