Lab: Semantic Lifting - XML
Lab 10: Semantic Lifting - XML
Today's topic is lifting data in XML format into RDF. XML stands for Extensible Markup Language and is commonly used for data transfer, especially on the web. XML has a tree structure similar to HTML, consisting of a root element, parent and child elements, attributes and so on. The goal is for you to learn, through an example, how we can convert non-semantic data into RDF.
import xml.etree.ElementTree as ET
All parts of the XML tree are considered Elements.
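As a quick illustration of what "everything is an Element" means in practice, here is a minimal sketch. The XML string, its tag names and its attributes are made up for this example and are not part of the lab data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented only for illustration.
sample = """
<library name="demo">
    <book id="b1"><title>First</title></book>
    <book id="b2"><title>Second</title></book>
</library>
"""

# fromstring() parses a string and returns the root Element.
root = ET.fromstring(sample)

# Every Element has a tag and a dict of attributes.
print(root.tag, root.attrib)

# Iterating over an Element yields its direct children,
# which are Elements themselves.
for book in root:
    title = book.find("title")  # first child Element with this tag
    print(book.tag, book.get("id"), title.text)

# iter() walks the whole subtree and yields every matching Element.
titles = [t.text for t in root.iter("title")]
print(titles)
```

The same calls (`find`, `get`, `iter`, `.text`) should be enough for most of the tasks below.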
Lift the XML data from http://feeds.bbci.co.uk/news/rss.xml about news articles by BBC_News into RDF triples.
You can look at the actual XML structure of the data by pressing Ctrl + U when you have opened the link in a browser.
The actual data about the news articles is stored under the <item></item> tags.
For instance a triple should be something of the form: news_paper_id - hasTitle - titleValue
Do this by parsing the XML using ElementTree (see import above).
I recommend starting with the code at the bottom of the page and continuing from there. This code retrieves the XML using an HTTP request and saves it to an XML file, so that you can view and parse it easily.
You can use this regex (string matcher) to get only the IDs from the full URL in the <guid> data. Note that re.findall returns a list of matches.
news_id = re.findall(r'\d+$', news_id)
Parse through the fictional XML data below and add the correct journalist as the writer of each of the news articles from earlier. This means that, e.g., if a news article was published on a Tuesday, Thomas Smith is the one who wrote it. One way to do this is by checking whether any of the days in the "whenWriting" attribute is contained in the news article's "pubDate".
<data>
    <news_publisher name="BBC News">
        <journalist whenWriting="Mon, Tue, Wed">
            <firstname>Thomas</firstname>
            <lastname>Smith</lastname>
        </journalist>
        <journalist whenWriting="Thu, Fri">
            <firstname>Joseph</firstname>
            <lastname>Olson</lastname>
        </journalist>
        <journalist whenWriting="Sat, Sun">
            <firstname>Sophia</firstname>
            <lastname>Cruise</lastname>
        </journalist>
    </news_publisher>
</data>
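One possible way to do the day matching is sketched below. It assumes that "pubDate" starts with a three-letter weekday abbreviation such as "Tue", which you should verify against the feed; the helper name `find_journalist` is made up for this example.

```python
import xml.etree.ElementTree as ET

journalists_xml = """<data>
  <news_publisher name="BBC News">
    <journalist whenWriting="Mon, Tue, Wed">
      <firstname>Thomas</firstname><lastname>Smith</lastname>
    </journalist>
    <journalist whenWriting="Thu, Fri">
      <firstname>Joseph</firstname><lastname>Olson</lastname>
    </journalist>
    <journalist whenWriting="Sat, Sun">
      <firstname>Sophia</firstname><lastname>Cruise</lastname>
    </journalist>
  </news_publisher>
</data>"""

journalists_root = ET.fromstring(journalists_xml)


def find_journalist(pub_date, root):
    """Return the full name of the journalist writing on the weekday
    of pub_date, or None if no journalist matches (hypothetical helper)."""
    weekday = pub_date[:3]  # assumes pubDate begins with e.g. "Tue"
    for journalist in root.iter("journalist"):
        when = journalist.get("whenWriting", "")
        days = [day.strip() for day in when.split(",")]
        if weekday in days:
            first = journalist.findtext("firstname")
            last = journalist.findtext("lastname")
            return first + " " + last
    return None


print(find_journalist("Tue, 05 Mar 2024 10:00:00 GMT", journalists_root))
```

The returned name can then be attached to the article with a property of your choice, for example a triple of the form news_paper_id - hasWriter - journalistName.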
If You Have More Time
Extend the graph using the PROV vocabulary to say that our articles are attributed to BBC_News (the PROV property for this is prov:wasAttributedTo).
Also use the prov.Entity and prov.Agent classes where relevant.
Code to Get Started
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD
import xml.etree.ElementTree as ET
import requests
import re

g = Graph()
ex = Namespace("http://example.org/")
prov = Namespace("http://www.w3.org/ns/prov#")
g.bind("ex", ex)
g.bind("prov", prov)

# url of rss feed
url = 'http://feeds.bbci.co.uk/news/rss.xml'
# creating HTTP response object from given url
resp = requests.get(url)
# saving the xml file
with open('test.xml', 'wb') as f:
    f.write(resp.content)