=Lab 9: Semantic Lifting - CSV=

==Link to Live-stream==
<syntaxhighlight>
https://teams.microsoft.com/dl/launcher/launcher.html?url=%2f_%23%2fl%2fmeetup-join%2f19%3ameeting_MGI1ZjcxNTUtODBjNy00ZjkxLWJlNGUtOTQ2Y2M3NjEwYzkx%40thread.v2%2f0%3fcontext%3d%257b%2522Tid%2522%253a%2522648a24bc-a98d-4025-9c60-48c19a142069%2522%252c%2522Oid%2522%253a%252252d6ac23-7c70-43f5-bc41-95254a3ac7f1%2522%252c%2522IsBroadcastMeeting%2522%253atrue%257d%26anon%3dtrue&type=meetup-join&deeplinkId=3b3b5d40-c010-45ca-a620-7108967fe3e3&directDl=true&msLaunch=true&enableMobilePage=true&suppressPrompt=true
</syntaxhighlight>
==Topics==
Today's topic involves lifting data in CSV format into RDF. The goal is for you to learn an example of how we can convert unsemantic data into RDF.

CSV stands for Comma-Separated Values, meaning that each value in a row of data is separated by a comma. Fortunately, CSV is already structured in a way that makes the creation of triples relatively easy.

We will use pandas DataFrames, which will contain our CSV data in Python code, and we will do some basic data manipulation to improve our output data.
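To make the DataFrame workflow concrete, here is a minimal, self-contained sketch. The inline CSV snippet and its columns are illustrative; in the lab you would pass a file name such as task1.csv to pd.read_csv instead:

<syntaxhighlight>
import io
import pandas as pd

# A tiny CSV snippet; in the lab you would read a file with pd.read_csv("task1.csv").
csv_text = '"Name","Country"\n"Achille Blaise","France"\n"Xun He Zhang","China"\n'
df = pd.read_csv(io.StringIO(csv_text))

# iterrows() yields (index, row) pairs; columns are accessed by header name.
for index, row in df.iterrows():
    print(row["Name"], "-", row["Country"])
</syntaxhighlight>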
==Relevant Libraries/Functions==
import pandas

pandas.read_csv

dataframe.iterrows(), dataframe.fillna(), dataframe.replace()

string.split(), string.title(), string.replace()

RDF concepts we have used earlier.
==Tasks==

Below are four lines of CSV that could have been saved from a spreadsheet. Copy them into a file in your project folder (e.g. task1.csv) and '''write a program with a loop that reads each line from that file and adds it to your graph as triples''':

<syntaxhighlight>
"Name","Gender","Country","Town","Expertises","Interests"
"Regina Catherine Hall","F","Great Britain","Manchester","Ecology, zoology","Football, music, travelling"
"Achille Blaise","M","France","Nancy","","Chess, computer games"
"Nyarai Awotwi Ihejirika","F","Kenya","Nairobi","Computers, semantic networks",""
"Xun He Zhang","M","China","Chengdu","Internet, mathematics, logistics","Dancing, music, trombone"
</syntaxhighlight>

To get started you can use the code further down.
When solving the task, take note of the following:

* The subjects of the triples will be the names of the people. The header (first line) contains the columns of data, and these should act as the predicates of the triples.
* Some columns, like Expertises, have multiple values for one person. You should create a unique triple for each of these expertises/interests.
* Spaces should be replaced with underscores to form a valid URI, e.g. Regina Catherine should become Regina_Catherine.
* Any case with missing data should not form a triple.
* For consistency, make sure all resources start with a capital letter.
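For example, a person's name can be cleaned and turned into a URI along these lines (the make_uri helper is only an illustration; the http://example.org/ namespace matches the starter code):

<syntaxhighlight>
from rdflib import URIRef

def make_uri(value):
    # Capitalise each word and replace spaces with underscores,
    # so the result forms a valid URI.
    return URIRef("http://example.org/" + value.title().replace(" ", "_"))

print(make_uri("Regina Catherine Hall"))
# → http://example.org/Regina_Catherine_Hall
</syntaxhighlight>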
==If You have more Time==
* Extend/improve the graph with concepts you have learned about so far, e.g. RDF.type, or RDFS domain and range.
* Additionally, see if you can find fitting existing terms for the relevant predicates and classes on DBpedia, Schema.org, Wikidata or elsewhere. Then replace the old ones with those.
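A brief sketch of both suggestions, assuming schema:Person and schema:nationality from Schema.org as the reused terms (the ex: resources are the lab's own example names):

<syntaxhighlight>
from rdflib import Graph, Namespace, RDF

ex = Namespace("http://example.org/")
schema = Namespace("https://schema.org/")

g = Graph()
g.bind("ex", ex)
g.bind("schema", schema)

# Type the resource, and reuse an existing Schema.org property.
g.add((ex.Achille_Blaise, RDF.type, schema.Person))
g.add((ex.Achille_Blaise, schema.nationality, ex.France))

print(g.serialize(format="turtle"))
</syntaxhighlight>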
==Code to Get Started (You could also use your own approach if you want to)==

<syntaxhighlight>
from rdflib import Graph, Literal, Namespace, URIRef
import pandas as pd

# Load the CSV data as a pandas DataFrame.
csv_data = pd.read_csv("task1.csv")

g = Graph()
ex = Namespace("http://example.org/")
g.bind("ex", ex)

# You should probably deal with replacing of characters or missing data here:


# Iterate through each row in order to create triples. First I select the
# subjects of the triples, which will be the names.
for index, row in csv_data.iterrows():
    # row['Name'] selects the name value of the current row.
    subject = row['Name']

    # Continue the loop here:


# Clean printing of the end result. (With rdflib versions older than 6,
# serialize() returned bytes and needed an extra .decode().)
print(g.serialize(format="turtle"))
</syntaxhighlight>
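For illustration, the loop could be continued roughly as follows. This is only a sketch, not the required solution: the to_uri helper, the choice of predicate names, and the "missing" marker are assumptions, and the inline CSV (one row of the task data) keeps the example self-contained:

<syntaxhighlight>
import io
import pandas as pd
from rdflib import Graph, Namespace

csv_text = '''"Name","Gender","Country","Town","Expertises","Interests"
"Achille Blaise","M","France","Nancy","","Chess, computer games"
'''
csv_data = pd.read_csv(io.StringIO(csv_text))
# Mark empty fields so the triples they produce can be removed afterwards.
csv_data = csv_data.fillna("missing")

ex = Namespace("http://example.org/")
g = Graph()
g.bind("ex", ex)

def to_uri(value):
    # Illustrative cleaning: capitalise and replace spaces with underscores.
    return ex[value.title().replace(" ", "_")]

for index, row in csv_data.iterrows():
    subject = to_uri(row["Name"])
    for column in ["Gender", "Country", "Town"]:
        g.add((subject, ex[column.lower()], to_uri(row[column])))
    # Multi-valued columns: split on the comma and make one triple per value.
    for column in ["Expertises", "Interests"]:
        for value in row[column].split(","):
            g.add((subject, ex[column.lower()], to_uri(value.strip())))

# Drop the triples that came from missing fields.
g.remove((None, None, ex.Missing))

print(g.serialize(format="turtle"))
</syntaxhighlight>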
Replacing characters with a DataFrame:

<syntaxhighlight>
csv_data = csv_data.replace(to_replace="banana",
                            value="apple", regex=True)
</syntaxhighlight>

Filling missing/empty data of a DataFrame with a parameter value:

<syntaxhighlight>
csv_data = csv_data.fillna("missing")
</syntaxhighlight>

Making the first letter of a word capital:

<syntaxhighlight>
name = "cade".title()
</syntaxhighlight>

After creating the graph, you can easily remove all triples that contained unknown data, if you marked them like above:

<syntaxhighlight>
g.remove((None, None, URIRef("http://example.org/missing")))
</syntaxhighlight>
==Useful Reading==

* [https://towardsdatascience.com/pandas-dataframe-playing-with-csv-files-944225d19ff Useful resource for working with DataFrames and CSV]