Difference between revisions of "Lab: Semantic Lifting - CSV"

 

Latest revision as of 16:16, 1 March 2022

Lab 6: Semantic Lifting - CSV

Topic

Today's topic is lifting data in CSV format into RDF. The goal is for you to learn how to convert non-semantic data into RDF and to get familiar with some common vocabularies.

Fortunately, CSV is already structured in a way that makes the creation of triples relatively easy.

We will also use pandas DataFrames to hold the CSV data in Python, and we will do some basic data manipulation to improve the output data.

Relevant Libraries - Classes, Functions and Methods and Vocabularies

Libraries

  • RDFlib concepts from earlier (Graph, Namespace, URIRef, Literal, BNode)
  • Pandas: DataFrame, apply, iterrows, astype
  • DBpedia Spotlight

Semantic Vocabularies

You do not have to use the same ones, but these should be well suited.

  • RDF: type
  • RDFS: label
  • Simple Event Ontology (sem): Event, eventType, Actor, hasActor, actorType, hasBeginTimeStamp, hasEndTimeStamp, hasTime, hasSubEvent
  • TimeLine Ontology (tl): durationInt
  • An example-namespace to represent terms not found elsewhere (ex): IndictmentDays, Overturned, Pardoned
  • DBpedia

Tasks

Today we will be working with FiveThirtyEight's russia-investigation dataset. It contains special investigations conducted by the United States since the Watergate investigation, with information about them up to May 2017. If you found last week's exercise doable, I recommend trying to write this with an object-oriented programming (OOP) structure, as this tends to make for cleaner code.

It contains the following columns:

  • investigation
  • investigation-start
  • investigation-end
  • investigation-days
  • name
  • indictment-days
  • type
  • cp-date
  • cp-days
  • overturned
  • pardoned
  • american
  • president

More information about the columns and the dataset is available here: https://github.com/fivethirtyeight/data/tree/master/russia-investigation

Our goal is to convert this non-semantic dataset into a semantic one. To do this we will go through the dataset row by row and extract the content of each column. An investigation may have multiple rows in the dataset if it investigates multiple people. You can choose to represent these rows as one or as multiple entities in the graph. Each investigation may also have a sub-event representing the result of the investigation, for instance an indictment or a guilty plea.
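To get a feeling for how rows relate to investigations, it can help to group the DataFrame before lifting it. The sketch below uses a tiny made-up CSV (the values are placeholders, not taken from the real dataset) and only three of the columns; it assumes that a missing name means the row has no investigated person.

```python
import io

import pandas as pd

# Made-up sample rows; the real file has more columns and real values.
sample_csv = io.StringIO(
    "investigation,name,type\n"
    "sample-a,Person One,indictment\n"
    "sample-a,Person Two,guilty-plea\n"
    "sample-b,,\n"
)
df = pd.read_csv(sample_csv)

# Number of rows per investigation
rows_per_investigation = df.groupby("investigation").size()

# Names of the people mentioned for each investigation, ignoring missing names
people_by_investigation = {
    investigation: group["name"].dropna().tolist()
    for investigation, group in df.groupby("investigation")
}

print(rows_per_investigation)
print(people_by_investigation)
```

If you treat all rows with the same investigation name as one entity, each group becomes a single resource with one sub-event per person.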

For each row we start by creating a resource representing the investigation. In this example we treat all investigations with the same name as the same entity, and therefore use the name of the investigation (the "investigation" column) to create the URI:

name = row["investigation"]

investigation = URIRef(ex + name)
g.add((investigation, RDF.type, sem.Event))
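Values taken straight from a CSV cell may contain spaces or other characters that are not valid in a URI. The following is a minimal sketch of a helper that cleans a name before it is appended to a namespace; the function name to_local_name is hypothetical and the choice of underscores is only one possible convention.

```python
from urllib.parse import quote


def to_local_name(name: str) -> str:
    """Turn a free-text value into a string that is safe to append to a namespace."""
    # Replace any run of whitespace with a single underscore
    cleaned = "_".join(str(name).split())
    # Percent-encode everything that is not an unreserved URI character
    return quote(cleaned, safe="_-.~")


print(to_local_name("iran contra"))
print(to_local_name("a/b c"))
```

It could then be used as investigation = URIRef(ex + to_local_name(row["investigation"])).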

Next we relate the investigation to the values in its associated columns. For the start of the investigation we use the "investigation-start" column together with the property sem:hasBeginTimeStamp:

investigation_start = row["investigation-start"]

g.add((investigation, sem.hasBeginTimeStamp, Literal(investigation_start, datatype=XSD.date)))

To represent the result of the investigation, if it has one, we can create another entity and connect it to the investigation using sem:hasSubEvent. The following columns can then be attributed to the sub-event:

  • type
  • indictment-days
  • overturned
  • pardoned
  • cp-date
  • cp-days
  • name (the name of the investigatee, not the name of the investigation)

Code to get you started

import pandas as pd
import rdflib

from rdflib import Graph, Namespace, URIRef, Literal, BNode
from rdflib.namespace import RDF, RDFS, XSD

ex = Namespace("http://example.org/")
dbr = Namespace("http://dbpedia.org/resource/")
sem = Namespace("http://semanticweb.cs.vu.nl/2009/11/sem/")
tl = Namespace("http://purl.org/NET/c4dm/timeline.owl#")

g = Graph()
g.bind("ex", ex)
g.bind("dbr", dbr)
g.bind("sem", sem)
g.bind("tl", tl)

df = pd.read_csv("data/investigations.csv")
# pandas may assign an unexpected type to some columns when it reads the file, so we use .astype("str") to convert the content of these columns to strings.
df["name"] = df["name"].astype("str")
df["type"] = df["type"].astype("str")

# iterrows yields (index, row) pairs, where each row is a pandas Series
for index, row in df.iterrows():
	# Do something here to add the content of the row to the graph 
	pass

g.serialize("output.ttl", format="ttl")

If you have more time

If you have not already done so, you should include some checks to ensure that you do not add empty values to your graph.
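A minimal sketch of such a check is shown below. The function name is_missing is hypothetical. It assumes that missing cells show up either as None, as a float NaN (which is how pandas usually represents them), as an empty string, or as the string "nan", which can appear after converting a column with .astype("str").

```python
import math


def is_missing(value) -> bool:
    """Return True if a cell value should not be turned into a triple."""
    if value is None:
        return True
    # pandas represents missing cells as float NaN
    if isinstance(value, float) and math.isnan(value):
        return True
    # Empty strings, and "nan" strings produced by string conversion
    return str(value).strip() in {"", "nan"}


for candidate in [float("nan"), "nan", "", "indictment", 0]:
    print(repr(candidate), is_missing(candidate))
```

Inside the loop you would then wrap each g.add call in a check such as "if not is_missing(row["investigation-end"]):".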

You can also use DBpedia Spotlight to link the people mentioned in the dataset to DBpedia resources. You can reuse the code example from the last lab, but you will need some error handling for when Spotlight is unable to find a match. For instance:

import spotlight
from spotlight import SpotlightException

# Assumption: the public Spotlight endpoint; use the same SERVER value as in the last lab
SERVER = "https://api.dbpedia-spotlight.org/en/annotate"
# Parameter given to Spotlight to filter out results with confidence lower than this value
CONFIDENCE = 0.5

def annotate_entity(entity, filters={"types": "DBpedia:Person"}):
    annotations = []
    try:
        annotations = spotlight.annotate(SERVER, entity, confidence=CONFIDENCE, filters=filters)
    # This catches errors thrown from Spotlight, including when no resource is found in DBpedia
    except SpotlightException as e:
        print(e)
        # Implement some error handling here
    return annotations

Here we use the types filter with DBpedia:Person, as we only want matches that are people. You can choose to use only the URIs in the response, or the types as well. An issue here is that
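Once annotations have been returned, you need to decide which URI to use for a person. The sketch below assumes that each annotation is a dictionary with a "URI" key, and falls back to a URI in the example namespace when the list is empty. The function name resolve_person_uri and the sample annotation are made up for illustration.

```python
def resolve_person_uri(name, annotations, fallback_base="http://example.org/"):
    """Pick the first DBpedia URI from the annotations, or build a local URI instead."""
    for annotation in annotations:
        uri = annotation.get("URI")
        if uri:
            return uri
    # No match: fall back to a local URI built from the name
    return fallback_base + "_".join(name.split())


# Hand-written stand-in for a Spotlight response
sample_annotations = [{"URI": "http://dbpedia.org/resource/Example_Person"}]

print(resolve_person_uri("Example Person", sample_annotations))
print(resolve_person_uri("Unknown Person", []))
```

The returned string can be wrapped in URIRef before it is added to the graph, so that people without a DBpedia match still get a resource.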

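Useful readings

  • Information about the dataset: https://github.com/fivethirtyeight/data/tree/master/russia-investigation
  • Article about working with pandas DataFrames and CSV: https://towardsdatascience.com/pandas-dataframe-playing-with-csv-files-944225d19ff
  • Pandas DataFrame documentation: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
  • Simple Event Ontology description: https://semanticweb.cs.vu.nl/2009/11/sem/#sem:eventType
  • The TimeLine Ontology description: http://motools.sourceforge.net/timeline/timeline.html
  • Spotlight documentation: https://www.dbpedia-spotlight.org/api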