Lab: Semantic Lifting - CSV: Difference between revisions

From info216
{| role="presentation" class="wikitable mw-collapsible mw-collapsed"
| <strong>Hints</strong>
|-
| Replacing characters in a DataFrame:

<syntaxhighlight>
csv_data = csv_data.replace(to_replace="banana", value="apple", regex=True)
</syntaxhighlight>

Filling missing/empty cells of a DataFrame with a given value:

<syntaxhighlight>
csv_data = csv_data.fillna("missing")
</syntaxhighlight>

Capitalizing the first letter of each word:

<syntaxhighlight>
name = "cade".title()
</syntaxhighlight>

After creating the graph, you can easily remove all triples that contain unknown data, provided you marked the missing values as above:

<syntaxhighlight>
g.remove((None, None, URIRef("http://example.org/missing")))
</syntaxhighlight>
|}

==Useful Reading==

* [https://towardsdatascience.com/pandas-dataframe-playing-with-csv-files-944225d19ff Useful Resource for working with Dataframes and CSV]

Revision as of 13:06, 18 March 2020

Lab 9: Semantic Lifting - CSV

Link to discord server

https://discord.gg/t5dgPrK

Topics

Today's topic is lifting data in CSV format into RDF. The goal is to work through one example of how non-semantic data can be converted into RDF.

CSV stands for Comma-Separated Values, meaning that the values in each row are separated by commas.

Fortunately, CSV is already structured in a way that makes the creation of triples relatively easy.
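
To illustrate why this structure helps, here is a minimal sketch using only Python's standard csv module (not the pandas approach used in the tasks). It pairs every cell with its column name, so each cell can be read as a (row identifier, column name, value) statement. The two-line sample string is made up for the illustration.

```python
import csv
import io

# Hypothetical sample: a header line and one data row, read from a string.
sample = io.StringIO('"Name","Country","Town"\n"Xun He Zhang","China","Chengdu"\n')

statements = []
for row in csv.DictReader(sample):
    # DictReader maps each header name to the cell in that column.
    for column, value in row.items():
        if column != "Name":
            # The name identifies the row, the header names the relation.
            statements.append((row["Name"], column, value))

print(statements)
```

Each tuple already has the subject, predicate, object shape of a triple; what remains is turning the strings into proper RDF terms.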

We will also use pandas DataFrames to hold the CSV data in Python, and do some basic data manipulation to improve the output data.

Relevant Libraries/Functions

import pandas

pandas.read_csv

dataframe.iterrows(), dataframe.fillna(), dataframe.replace()

string.split(), string.title(), string.replace()

RDF concepts we have used earlier.
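
As a quick orientation, the sketch below tries the listed functions on a tiny in-memory sample. The sample rows and the variable names are assumptions made for this illustration; in the lab you would read task1.csv instead.

```python
import io
import pandas as pd

# Hypothetical two-row sample, read from a string instead of a file.
sample = io.StringIO(
    '"Name","Town","Interests"\n'
    '"Achille Blaise","Nancy","Chess, computer games"\n'
    '"Nyarai Awotwi Ihejirika","Nairobi",""\n'
)
df = pd.read_csv(sample)

# Empty cells are read as NaN; fillna() swaps them for a marker string.
df = df.fillna("missing")

# iterrows() yields (index, row) pairs; row["Town"] picks one cell.
towns = [row["Town"] for _, row in df.iterrows()]

# split() breaks a multi-valued cell apart; strip(), title() and replace()
# tidy each part so it can later be used in a URI.
interests = [part.strip().title().replace(" ", "_")
             for part in df.loc[0, "Interests"].split(",")]

print(towns)
print(interests)
```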

Tasks

Below are five lines of CSV (a header line and four data rows) that could have been saved from a spreadsheet. Copy them into a file in your project folder (e.g. task1.csv) and write a program with a loop that reads each row from that file and adds it to your graph as triples:

"Name","Gender","Country","Town","Expertises","Interests"
"Regina Catherine Hall","F","Great Britain","Manchester","Ecology, zoology","Football, music, travelling"
"Achille Blaise","M","France","Nancy","","Chess, computer games"
"Nyarai Awotwi Ihejirika","F","Kenya","Nairobi","Computers, semantic networks",""
"Xun He Zhang","M","China","Chengdu","Internet, mathematics, logistics","Dancing, music, trombone"

To get started, you can use the code further down.

When solving the task, take note of the following:

  • The subject of each triple will be the name of a person. The header (first line) contains the column names, which should act as the predicates of the triples.
  • Some columns, like Expertises, have multiple values for one person. You should create a separate triple for each of these expertises/interests.
  • Spaces should be replaced with underscores to form a valid URI. E.g. Regina Catherine should be Regina_Catherine.
  • Any case with missing data should not form a triple.
  • For consistency, make sure all resources start with a capital letter.


If You Have More Time

  • Extend/improve the graph with concepts you have learned about so far, e.g. RDF.type, or RDFS domain and range.
  • Additionally, see if you can find fitting existing terms for the relevant predicates and classes on DBpedia, Schema.org, Wikidata or elsewhere. Then replace the old ones with those.


Code to Get Started (You could also use your own approach if you want to)

from rdflib import Graph, Literal, Namespace, URIRef
import pandas as pd

# Load the CSV data as a pandas Dataframe.
csv_data = pd.read_csv("task1.csv")

g = Graph()
ex = Namespace("http://example.org/")
g.bind("ex", ex)


# You should probably deal with replacing of characters or missing data here:



# Iterate through each row in order to create triples. First, select the subjects of the triples, which will be the names.

for index, row in csv_data.iterrows():
    # row['Name'] selects the name value of the current row.
    subject = row['Name']

    # Continue the loop here:


# Clean printing of end results.
# With rdflib 6 or later, serialize() already returns a str, so drop .decode() there.
print(g.serialize(format="turtle").decode())



Useful Reading