=Lab 11: Semantic Lifting - HTML=


==Topics==
Today's topic involves lifting data in HTML format into RDF.
HTML stands for HyperText Markup Language and is used to describe the structure and content of websites.

HTML has a tree structure, consisting of a root element, child and parent elements, attributes and so on.
The goal is for you to learn an example of how we can convert unsemantic data into RDF.
To parse the HTML, we will use the Python library BeautifulSoup.
==Relevant Libraries/Functions==

*from bs4 import BeautifulSoup as bs
*import requests
*import re

*beautifulsoup.find()
*beautifulsoup.find_all()

*string.replace(), string.split()
*re.findall()


==Tasks==


'''Task 1'''

'''pip install beautifulsoup4'''

'''Lift the HTML information about research articles found on this link into triples: "https://www.semanticscholar.org/topic/Knowledge-Graph/159858" '''

The papers will be represented by their Corpus ID (the subject of the triples).
For example, a paper has a title, a year, authors and so on.


For parsing the HTML, we will use BeautifulSoup.


I recommend right-clicking on the web page itself and choosing 'Inspect' to get a readable version of the HTML.


Now you can hover over the HTML tags on the right side to easily find information like the ID of the paper.


For example, we can see that the main topic of the page, "Knowledge Graph", is under an 'h1' tag with the attribute class: "entity-name".


Knowing this, we can use BeautifulSoup to find it in Python code, e.g.:
<syntaxhighlight>
topic = html.find('h1', attrs={'class': 'entity-name'}).text
</syntaxhighlight>


Similarly, to find multiple values at once, we use find_all instead. E.g. here I am selecting all the papers, which I can then iterate through:


<syntaxhighlight>
papers = html.find_all('div', attrs={'class': 'flex-container'})
for paper in papers:
    # e.g. selecting the title.
    title = paper.find('div', attrs={'class': 'timeline-paper-title'})
    print(title.text)
</syntaxhighlight>


You can use this regex to extract the numeric id from the Corpus ID, or from the topic ID (which is in the URL):
<syntaxhighlight>
id = re.findall(r'\d+$', id)[0]
</syntaxhighlight>
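
Putting these pieces together, the lifting step for one paper could look roughly like the sketch below, reusing the graph, namespace and imports from the starter code further down. This is only an illustration: the 'corpus-id' class used to locate the Corpus ID element is a placeholder, so check the inspector for the tag and class the page actually uses.

<syntaxhighlight>
# Rough sketch of lifting one paper (inside the for-loop over papers).
# NOTE: 'corpus-id' is a placeholder class name - inspect the page for the real one.
corpus_elem = paper.find('li', attrs={'class': 'corpus-id'})
if corpus_elem is not None:
    corpus_id = re.findall(r'\d+$', corpus_elem.text)[0]
    paper_uri = URIRef(ex + corpus_id)

    title = paper.find('div', attrs={'class': 'timeline-paper-title'}).text
    g.add((paper_uri, ex.title, Literal(title)))
</syntaxhighlight>
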
==Task 2==
Create triples for the Topic of the page ("Knowledge Graph").


For example, a topic has related topics (on the top right of the page). It also has "known as" values and a description.

This is a good opportunity to use the SKOS vocabulary to describe concepts.
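
As a rough sketch, and assuming the related topics, "known as" values and description have already been scraped into Python variables (the literal values below are just placeholders), the SKOS description could look like this:

<syntaxhighlight>
# Placeholder values standing in for scraped data.
topic_id = re.findall(r'\d+$', url)[0]      # topic ID taken from the URL
topic_uri = URIRef(ex + topic_id)

g.add((topic_uri, RDF.type, SKOS.Concept))
g.add((topic_uri, SKOS.prefLabel, Literal("Knowledge Graph", lang="en")))
g.add((topic_uri, SKOS.altLabel, Literal("a known-as value", lang="en")))
g.add((topic_uri, SKOS.definition, Literal("the scraped description", lang="en")))
g.add((topic_uri, SKOS.related, URIRef(ex + "A_Related_Topic")))
</syntaxhighlight>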




==If You have more Time==


If you look at the web page, you can see that there are buttons for expanding the description, related topics and more.


This is a problem, as BeautifulSoup won't find this additional information until these buttons have been pressed.


Use the Python library '''selenium''' to simulate a user pressing the 'expand' buttons, so that you get all the triples you should get.
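
Very roughly, and assuming selenium is installed together with a matching browser driver, the idea looks like the sketch below (the XPath for the buttons is a guess and will likely need adjusting to the real page):

<syntaxhighlight>
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()      # needs ChromeDriver available on your PATH
driver.get(url)
time.sleep(2)                    # crude wait for the page to finish loading

# Click every button whose text contains 'Expand' (placeholder selector).
for button in driver.find_elements(By.XPATH, "//button[contains(., 'Expand')]"):
    button.click()

# Hand the now-expanded page over to BeautifulSoup as before.
html = bs(driver.page_source, features="html.parser")
driver.quit()
</syntaxhighlight>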


==Code to Get Started (Make sure you understand it)==
 
 
<syntaxhighlight>
from bs4 import BeautifulSoup as bs
from rdflib import Graph, Literal, URIRef, Namespace
from rdflib.namespace import RDF, OWL, SKOS, RDFS, XSD
import requests
import re

g = Graph()
ex = Namespace("http://example.org/")
g.bind("ex", ex)

# Download the html from the URL and parse it with BeautifulSoup.
url = "https://www.semanticscholar.org/topic/Knowledge-Graph/159858"
page = requests.get(url)
html = bs(page.content, features="html.parser")
# print(html.prettify())

# Find the html that surrounds all the papers.
papers = html.find_all('div', attrs={'class': 'flex-container'})

# Iterate through each paper to make triples:
for paper in papers:
    # e.g. selecting the title.
    title = paper.find('div', attrs={'class': 'timeline-paper-title'}).text
    print(title)
</syntaxhighlight>
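
Once you start adding triples inside the loop, you can inspect the result by serializing the graph, for example:

<syntaxhighlight>
print(g.serialize(format="turtle"))
</syntaxhighlight>

(With older rdflib versions, serialize() returns bytes, so you may need to add .decode() at the end.)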




==Useful Reading==
* [https://www.dataquest.io/blog/web-scraping-tutorial-python/ Dataquest.io - Web-scraping with Python]
