=Lab: Semantic Lifting - CSV=

== Topic ==
* Reading non-semantic data tables into semantic knowledge graphs
* Specifically, reading data in CSV format via Pandas dataframes into RDF graphs
Today's topic involves lifting data in CSV format into RDF. The goal is for you to learn an example of how we can convert non-semantic data into RDF. CSV stands for Comma-Separated Values: the data points in each row are separated by commas. Fortunately, CSV is already structured in a way that makes the creation of triples relatively easy. We will use Pandas dataframes to hold the CSV data in Python code, and we will do some basic data manipulation to improve the output data.

== Useful materials ==
The textbook (Allemang, Hendler & Gandon):
* chapter 3 on RDF (from section 3.1 on ''Distributing Data across the Web'')
* [https://github.com/fivethirtyeight/data/tree/master/russia-investigation Information about the dataset]

Pandas:
* [https://towardsdatascience.com/pandas-dataframe-playing-with-csv-files-944225d19ff Article about working with pandas.DataFrames and CSV]
* [https://pandas.pydata.org/docs/user_guide/io.html#parsing-options User Guide for Pandas]
* class: DataFrame (methods: read_csv, set_index, apply, iterrows, astype)

rdflib:
* classes/interfaces from earlier (such as Graph, Namespace, URIRef, Literal, perhaps BNode)
* also vocabulary classes like RDF (e.g., type), RDFS (e.g., label) and XSD (for various datatypes)
== Tasks ==
We will be working with the same dataset as in the SPARQL exercise: [https://projects.fivethirtyeight.com/russia-investigation/ FiveThirtyEight's Russia Investigation]. It contains data about special investigations conducted by the United States from the Watergate-investigation until May 2017. [[Russian investigation KG | This page explains]] the Russia Investigation dataset a bit more.
  
'''Task:'''
In the SPARQL exercise, you downloaded the data as a Turtle file ([[File:russia_investigation_kg.txt]], which you renamed to ''.ttl''). This time you will [https://github.com/fivethirtyeight/data/tree/master/russia-investigation download the data as a CSV file from GitHub].

''(Unfortunately, the earlier exercises used a wrong and very unfinished version of the russia_investigation_kg.txt file. It has now been updated with the correct version.)''
  
'''Task:'''
Install Pandas in your virtual environment, for example:
 pip install pandas
Write a Python program that imports the ''pandas'' API and uses Pandas' ''read_csv'' function to load the ''russia-investigation.csv'' dataset into a Pandas ''dataframe''.
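
A minimal sketch of this step could look like the one below. The file name ''russia-investigation.csv'' is an assumption; use the name and path of the file you actually downloaded.

<syntaxhighlight>
import pandas as pd

# Load the downloaded CSV file into a Pandas dataframe.
df = pd.read_csv("russia-investigation.csv")

# Quick sanity check: number of rows and columns.
print(df.shape)
</syntaxhighlight>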
  
'''Task:'''
''(Pandas basics)'' Inspect the Pandas dataframe. If you have called your dataframe ''df'', you can check out the expressions below. Use the [https://pandas.pydata.org/docs/user_guide/io.html#parsing-options documentation] to understand what each of them does.
 df.shape
 df.index  # ...and list(df.index)
 df.columns
 df['name']
 df.name
 df.loc[3]
 df.loc[3]['president']
(Pandas offers many ways of picking out rows, columns, and values. These are just examples to get started.)
  
'''Task:'''
''(Pandas basics)'' Pandas' ''apply'' method offers a compact way to process all the rows in a dataframe. This line prints each row in your dataframe as a Python dict:
 df.apply(lambda row: print(dict(row)), axis=1)
What happens if you drop the ''axis'' argument, or set ''axis=0''?
  
'''Task:'''
Instead of the ''lambda'' function, you can use a named function. Write a function that prints out only the ''name'' and ''indictment-days'' in a row, and use it to print the ''name'' and ''indictment-days'' for all rows in the dataframe.
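
A minimal sketch, assuming ''df'' is the dataframe loaded above and that the column names ''name'' and ''indictment-days'' match the CSV header:

<syntaxhighlight>
def print_name_and_days(row):
    # Print only the two columns we are interested in.
    print(row['name'], row['indictment-days'])

df.apply(print_name_and_days, axis=1)
</syntaxhighlight>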
  
''Alternative to df.apply():''
Pandas offers several ways to iterate through data. You can also use the ''itertuples'' method in a simple ''for''-loop to iterate through the rows, as sketched below.
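
A rough sketch of the same loop with ''itertuples''. Note that ''itertuples'' returns each row as a named tuple and renames column names that are not valid Python identifiers (such as ''indictment-days'') to positional fields, so this sketch simply prints the whole tuple:

<syntaxhighlight>
# Iterate through the rows; each row is returned as a namedtuple.
for row in df.itertuples():
    print(row)
</syntaxhighlight>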

'''Task:'''
Modify your function so it adds ''name'' and ''indictment-days'' triples to a global rdflib ''Graph'' for each row in the dataframe. The subject in each triple could be the numeric index of the row.

You can use standard terms from RDF, RDFS, XSD, and other vocabularies when you see fit. Otherwise, just use an example prefix.

Things may be easier if you copy ''df.index'' into an ordinary column of the dataframe:
 df['id'] = df.index
You can use this index, along with a prefix, as the subject in your triples.
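
A minimal sketch of this step, assuming the dataframe from before and an example namespace (the namespace URI, the prefix and the exact column names are assumptions; adapt them to your own code):

<syntaxhighlight>
import pandas as pd
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")

g = Graph()
g.bind("ex", EX)

df = pd.read_csv("russia-investigation.csv")
df['id'] = df.index

def row_to_triples(row):
    # Use the numeric row index, with a prefix, as the subject.
    subject = EX['investigation_' + str(row['id'])]
    # Skip missing values; a later task asks you to add such checks if you have not already.
    if pd.notna(row['name']):
        g.add((subject, EX.name, Literal(row['name'])))
    if pd.notna(row['indictment-days']):
        g.add((subject, EX.indictment_days,
               Literal(int(row['indictment-days']), datatype=XSD.integer)))

df.apply(row_to_triples, axis=1)

# Print the graph in Turtle (older rdflib versions return bytes and may need .decode()).
print(g.serialize(format="turtle"))
</syntaxhighlight>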
  
'''Task:'''
Continue to extend your function to convert the non-semantic CSV dataset into a semantic RDF one. When you create resource URIs from strings such as names, replace spaces with underscores so that they form valid URIs (for example, ''James W. McCord'' becomes ''James_W._McCord''). Here is an example of how the data for one investigation could look in the end:
 muellerkg:investigation_0 a muellerkg:Indictment ;
     muellerkg:american true ;
     muellerkg:cp_date "1973-01-30"^^xsd:date ;
     muellerkg:cp_days -109 ;
     muellerkg:indictment_days -246 ;
     muellerkg:investigation muellerkg:watergate ;
     muellerkg:investigation_days 1492 ;
     muellerkg:investigation_end "1977-06-19"^^xsd:date ;
     muellerkg:investigation_start "1973-05-19"^^xsd:date ;
     muellerkg:name muellerkg:James_W._McCord ;
     muellerkg:outcome muellerkg:conviction ;
     muellerkg:overturned false ;
     muellerkg:pardoned false ;
     muellerkg:president muellerkg:Richard_Nixon .
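
As a rough illustration of how further columns can be handled (the variables ''g'', ''EX'' and ''df'' are assumed to be the ones from the sketch above, and the hyphenated column names are assumptions based on the CSV header, so double-check them against your download):

<syntaxhighlight>
import pandas as pd
from rdflib import Literal
from rdflib.namespace import RDF, XSD

def extend_row_triples(row):
    subject = EX['investigation_' + str(row['id'])]
    # Type each row resource.
    g.add((subject, RDF.type, EX.Indictment))
    # Resources built from strings: replace spaces with underscores to form valid URIs.
    if pd.notna(row['president']):
        g.add((subject, EX.president, EX[row['president'].replace(' ', '_')]))
    # Dates can be typed with xsd:date.
    if pd.notna(row['investigation-start']):
        g.add((subject, EX.investigation_start,
               Literal(row['investigation-start'], datatype=XSD.date)))

df.apply(extend_row_triples, axis=1)
</syntaxhighlight>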
  
== If you have more time ==
  
'''Task:'''
If you have not done so already, include checks to ensure that you do not add triples for empty values to your graph. A row that is missing data in some column should not produce a triple for that column.
'''Task:'''
Extend or improve the graph with concepts you have learned about so far, for example ''rdf:type'' triples for your resources, or ''rdfs:domain'' and ''rdfs:range'' for your properties (a small sketch follows below).

'''Task:'''
Additionally, see if you can find fitting existing terms for the relevant predicates and classes on DBpedia, Schema.org, Wikidata or elsewhere, and replace your example terms with those.
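
For the schema-level extension, a minimal sketch (''g'' and ''EX'' as above; the class ''EX.Person'' is just an assumed example term):

<syntaxhighlight>
from rdflib import Literal
from rdflib.namespace import RDFS

# Schema-level triples for the example vocabulary.
g.add((EX.president, RDFS.domain, EX.Indictment))
g.add((EX.president, RDFS.range, EX.Person))
g.add((EX.Indictment, RDFS.label, Literal("Indictment")))
</syntaxhighlight>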
 
  
'''Task:'''
If you have more time, you can use DBpedia Spotlight to try to link the people (and other "named entities") mentioned in the dataset to DBpedia resources.
 pip install pyspotlight
You can start with the code example below, but you will need exception-handling when DBpedia is unable to find a match. For instance:

<syntaxhighlight>
import spotlight

ENDPOINT = 'https://api.dbpedia-spotlight.org/en/annotate'
CONFIDENCE = 0.5  # filter out results with lower confidence

def annotate_entity(entity_name, filters={'types': 'DBpedia:Person'}):
    annotations = []
    try:
        annotations = spotlight.annotate(ENDPOINT, entity_name, confidence=CONFIDENCE, filters=filters)
    except spotlight.SpotlightException as e:
        # catch exceptions thrown from Spotlight, for example when no DBpedia resource is found
        print(e)
        # handle exceptions here
    return annotations
</syntaxhighlight>
The example uses the types filter with ''DBpedia:Person'', because we only want it to match people. You can choose to use only the URIs in the response, or the types as well.
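
A hypothetical usage sketch, linking one of the people in the dataset to DBpedia with ''owl:sameAs'' (the graph ''g'', the namespace ''EX'' and the ''URI'' key in the returned annotation dicts are assumptions; check them against your own code and the pyspotlight documentation):

<syntaxhighlight>
from rdflib import URIRef
from rdflib.namespace import OWL

annotations = annotate_entity("Michael Flynn")
if annotations:
    # Each annotation is a dict; 'URI' should hold the matched DBpedia resource.
    dbpedia_uri = URIRef(annotations[0]['URI'])
    g.add((EX.Michael_Flynn, OWL.sameAs, dbpedia_uri))
</syntaxhighlight>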
  
 
Useful materials:
* [https://www.dbpedia-spotlight.org/api Spotlight Documentation]
* [https://pypi.org/project/pyspotlight/ pyspotlight 0.7.2 at PyPi.org]