Lab: Semantic Lifting - CSV: Difference between revisions

From info216
(20 intermediate revisions by the same user not shown)
Line 1: Line 1:
=Lab 9: Semantic Lifting - CSV=
=Lab 9: Semantic Lifting - CSV=
==Link to discord server==
https://discord.gg/t5dgPrK


==Topics==
==Topics==
Today's topic involves lifting the data in CSV format into RDF.
Today's topic involves lifting data in CSV format into RDF.
The goal is for you to learn an example of how we can convert unsemantic data into RDF.  
The goal is for you to learn an example of how we can convert unsemantic data into RDF.  


Line 9: Line 12:
Fortunately, CSV is already structured in a way that makes the creation of triples relatively easy.
Fortunately, CSV is already structured in a way that makes the creation of triples relatively easy.


We will also use Pandas Dataframes which will contain our CSV data in python code.
We will also use Pandas Dataframes which will contain our CSV data in python code. We will also do some basic data manipulation to improve our output data.


==Relevant Libraries/Functions==
==Relevant Libraries/Functions==
Line 24: Line 27:
==Tasks==
==Tasks==


Below are four lines of CSV that could have been saved from a spreadsheet. Copy them into a file in your project folder and write a program with a loop that reads each line from that file (except the initial header line) and adds it to your graph as triples:
Below are four lines of CSV that could have been saved from a spreadsheet. Copy them into a file in your project folder (e.g. task1.csv) and '''write a program with a loop that reads each line from that file and adds it to your graph as triples''':


  "Name","Gender","Country","Town","Expertises","Interests"
  "Name","Gender","Country","Town","Expertises","Interests"
  "Regina Catherine Hall","F","Great Britain","Manchester","Ecology, zoology","Football, music, travelling"
  "Regina Catherine Hall","F","Great Britain","Manchester","Ecology, zoology","Football, music, travelling"
  "Achille Blaise","M","France","Nancy","","Chess, computer games"
  "Achille Blaise","M","France","Nancy","","Chess, computer games"
  "Nyarai Awotwi Ihejirika","F","Kenya","Nairobi","Computers, semantic networks","Hiking, botany"
  "Nyarai Awotwi Ihejirika","F","Kenya","Nairobi","Computers, semantic networks",""
  "Xun He Zhang","M","China","Chengdu","Internet, mathematics, logistics","Dancing, music, trombone"
  "Xun He Zhang","M","China","Chengdu","Internet, mathematics, logistics","Dancing, music, trombone"
To get started you can use the code further down:


When solving the task take note of the following:
When solving the task take note of the following:


* The subject of the triples will be the names of the people. The header (first line) are the columns of data and should act as the predicates of the triples.
* The subject of the triples will be the names of the people. The header (first line) are the columns of data and should act as the predicates of the triples.
* Some columns like expertise have multiple values for one person. You should create unique triple for each of these expertises.  
 
* Some columns like expertise have multiple values for one person. You should create unique triples for each of these expertises/interests.  


* Spaces should be replaced with underscores to form a valid URI. E.g. Regina Catherine should be Regina_Catherine.
* Spaces should be replaced with underscores to form a valid URI. E.g. Regina Catherine should be Regina_Catherine.
Line 54: Line 60:
<syntaxhighlight>
<syntaxhighlight>
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib import Graph, Literal, Namespace, URIRef
import pandas as pd
import pandas as pd


# Load the CSv data as a pandas Dataframe.
# Load the CSV data as a pandas Dataframe.
csv_data = pd.read_csv("task1.csv")
csv_data = pd.read_csv("task1.csv")


Line 72: Line 77:


for index, row in csv_data.iterrows():
for index, row in csv_data.iterrows():
    # row['Name'] selects the name value of the current row.
     subject = row['Name']
     subject = row['Name']


     #Continue the Code here:
     #Continue the loop here:




Line 81: Line 87:
</syntaxhighlight>
</syntaxhighlight>


==Hints==


Replacing characters with Dataframe:
{| role="presentation" class="wikitable mw-collapsible mw-collapsed"
| <strong>Hints</strong>
|-
| Replacing characters with Dataframe:


<syntaxhighlight>
<syntaxhighlight>
Line 90: Line 98:
</syntaxhighlight>
</syntaxhighlight>


Fill missing data of Dataframe with parameter value.
Fill missing/empty data of Dataframe with parameter value.


<syntaxhighlight>
<syntaxhighlight>
Line 101: Line 109:
name = "cade".title()
name = "cade".title()
</syntaxhighlight>
</syntaxhighlight>
After creating the graph, you can easily remove all triples that contain unknown data, provided you marked them as above.
<syntaxhighlight>
g.remove((None, None, URIRef("http://example.org/missing")))
</syntaxhighlight>
|}
==Useful Reading==
* [https://towardsdatascience.com/pandas-dataframe-playing-with-csv-files-944225d19ff Useful Resource for working with Dataframes and CSV]

Revision as of 13:06, 18 March 2020

Lab 9: Semantic Lifting - CSV

Link to discord server

https://discord.gg/t5dgPrK

Topics

Today's topic is lifting data in CSV format into RDF. The goal is to work through an example of how non-semantic data can be converted into RDF.

CSV stands for Comma-Separated Values, meaning that each data value is separated from the next by a comma.

Fortunately, CSV is already structured in a way that makes the creation of triples relatively easy.

We will also use pandas DataFrames to hold our CSV data in Python code, and do some basic data manipulation to improve our output data.

Relevant Libraries/Functions

import pandas

pandas.read_csv

dataframe.iterrows(), dataframe.fillna(), dataframe.replace()

string.split(), string.title(), string.replace()

RDF concepts we have used earlier.
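The functions listed above can be tried out on a small sample before touching the real file. The following is a minimal sketch, not part of the lab's starter code: the two-row sample, the variable names and the marker value "missing" are assumptions chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row sample; in the lab the data comes from a file such as task1.csv.
sample = io.StringIO(
    '"Name","Town","Interests"\n'
    '"Achille Blaise","Nancy","Chess, computer games"\n'
    '"Nyarai Awotwi Ihejirika","Nairobi",""\n'
)

df = pd.read_csv(sample)

# Empty cells are read as NaN; fillna() swaps them for a marker value.
df = df.fillna("missing")

for index, row in df.iterrows():
    # split() turns a multi-valued cell into a list of separate values.
    interests = [value.strip() for value in row["Interests"].split(",")]
    # replace() makes the name usable as the last part of a URI.
    print(row["Name"].replace(" ", "_"), interests)
```

Reading from io.StringIO behaves like reading from a file, so the same calls should carry over once pd.read_csv is pointed at the CSV file itself.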

Tasks

Below are four rows of CSV data (plus a header line) that could have been saved from a spreadsheet. Copy them into a file in your project folder (e.g. task1.csv) and write a program with a loop that reads each line from that file and adds it to your graph as triples:

"Name","Gender","Country","Town","Expertises","Interests"
"Regina Catherine Hall","F","Great Britain","Manchester","Ecology, zoology","Football, music, travelling"
"Achille Blaise","M","France","Nancy","","Chess, computer games"
"Nyarai Awotwi Ihejirika","F","Kenya","Nairobi","Computers, semantic networks",""
"Xun He Zhang","M","China","Chengdu","Internet, mathematics, logistics","Dancing, music, trombone"

To get started you can use the code further down.

When solving the task take note of the following:

  • The subjects of the triples will be the names of the people. The header (first line) holds the column names, which should act as the predicates of the triples.
  • Some columns, like Expertises, have multiple values for one person. You should create a separate triple for each of these expertises/interests.
  • Spaces should be replaced with underscores to form a valid URI, e.g. Regina Catherine should be Regina_Catherine.
  • Any case with missing data should not form a triple.
  • For consistency, make sure all resources start with a capital letter.
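The string handling behind these points can be sketched with two small helpers. Both function names (to_local_name and split_values) are hypothetical and only illustrate one possible approach; note that title() capitalises every word, which is one way of reading the capital-letter rule.

```python
def to_local_name(text):
    """Turn free text into a capitalised, underscore-separated local name."""
    # title() capitalises each word; replace() swaps spaces for underscores.
    return text.strip().title().replace(" ", "_")


def split_values(cell):
    """Split a comma-separated cell into clean values, skipping empty ones."""
    return [to_local_name(part) for part in cell.split(",") if part.strip()]


print(to_local_name("Regina Catherine Hall"))  # Regina_Catherine_Hall
print(split_values("Ecology, zoology"))        # ['Ecology', 'Zoology']
print(split_values(""))                        # []
```

Because split_values returns an empty list for an empty cell, a loop over its result creates no triple for missing data.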


If You have more Time

  • Extend/improve the graph with concepts you have learned about so far, e.g. RDF.type, or RDFS domain and range.
  • Additionally, see if you can find fitting existing terms for the relevant predicates and classes on DBpedia, Schema.org, Wikidata or elsewhere. Then replace the old ones with those.


Code to Get Started (You could also use your own approach if you want to)

from rdflib import Graph, Literal, Namespace, URIRef
import pandas as pd

# Load the CSV data as a pandas Dataframe.
csv_data = pd.read_csv("task1.csv")

g = Graph()
ex = Namespace("http://example.org/")
g.bind("ex", ex)


# You should probably deal with replacing of characters or missing data here:



# Iterate through each row in order to create triples. First, select the subject of the triples, which will be the name.

for index, row in csv_data.iterrows():
    # row['Name'] selects the name value of the current row.
    subject = row['Name']

    # Continue the loop here:


# Clean printing of end-results.
# rdflib 6 and later return a string here; on older versions append .decode().
print(g.serialize(format="turtle"))



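Useful Reading

  • Useful Resource for working with Dataframes and CSV: https://towardsdatascience.com/pandas-dataframe-playing-with-csv-files-944225d19ff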