Lab: Using Graph Embeddings

Topics

Using knowledge graph embeddings with TorchKGE.

Useful readings

Welcome to TorchKGE's documentation!
The following TorchKGE classes are central:
- KnowledgeGraph - contains the knowledge graph (KG)
- Model - contains the embeddings (entity and relation vectors) for some KG
PyTorch Tensor Documentation

Tasks

Task: knowledge graph:

Use a dataset loader to load a KG you want to work with. Freebase FB15k237 is a good choice. (You will need a pre-trained model for your KG later, so choose one of FB15k, FB15k237, WDV5, WN18RR, or Yago3-10. This lab has mostly been tested on FB15k.)
Use the methods provided by the KnowledgeGraph class to inspect the KG.
- Print out the numbers of entities, relations, and facts in the training, validation, and testing sets.
- Print the identifiers for the first 10 entities and relations (tip: ent2ix and rel2ix).
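
A minimal sketch of the inspection step. It assumes the KnowledgeGraph object exposes the attributes n_ent, n_rel, n_facts, ent2ix, and rel2ix; the helper name summarize_kg and the toy stand-in object are hypothetical and only there so the example runs without downloading a dataset. In the lab you would pass kg_train, kg_val, and kg_test instead.

```python
from types import SimpleNamespace


def summarize_kg(name, kg, n_ids=10):
    """Return a one-line size summary and the first few entity/relation ids."""
    summary = f"{name}: {kg.n_ent} entities, {kg.n_rel} relations, {kg.n_facts} facts"
    # Dicts preserve insertion order, so slicing the keys gives the first ids.
    first_entities = list(kg.ent2ix)[:n_ids]
    first_relations = list(kg.rel2ix)[:n_ids]
    return summary, first_entities, first_relations


# Hypothetical stand-in with the attribute names assumed above.
toy_kg = SimpleNamespace(
    n_ent=3,
    n_rel=2,
    n_facts=4,
    ent2ix={"/m/a": 0, "/m/b": 1, "/m/c": 2},
    rel2ix={"/r/one": 0, "/r/two": 1},
)

summary, ents, rels = summarize_kg("toy", toy_kg, n_ids=2)
print(summary)
print(ents, rels)
```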

Task: external identifiers:

Download a dataset that provides more understandable labels for the entities (and perhaps relations) in your KnowledgeGraph.
- If you use FB15k, the relation names are not so bad, but the entity identifiers do not give much meaning. Same with WordNet. This repository contains mappings for the Freebase and WordNet datasets.
- If you use a Wikidata graph, the entities and relations are all P- and Q-codes. To get labels, you can try a combination of SPARQL queries and this API.
Create mappings from external labels to entity ids (and perhaps relation ids) in the KnowledgeGraph. Also create the inverse mappings.
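
One possible way to build the mappings, sketched on toy data. Labels are not guaranteed to be unique and some entities may have no label at all, so this version keeps a list of KG indexes per label and skips unlabelled entities. The function name build_label_maps and the toy identifiers are hypothetical; in the lab the inputs would be kg_train.ent2ix and your own entity-to-label dictionary.

```python
from collections import defaultdict


def build_label_maps(ent2ix, ent2lbl):
    """Build index-to-label and label-to-indexes mappings.

    Entities without a label are skipped; a label shared by several
    entities maps to all of their indexes.
    """
    ix2lbl = {}
    lbl2ixs = defaultdict(list)
    for ent, ix in ent2ix.items():
        label = ent2lbl.get(ent)
        if label is None:  # no external label available for this entity
            continue
        ix2lbl[ix] = label
        lbl2ixs[label].append(ix)
    return ix2lbl, dict(lbl2ixs)


# Hypothetical toy data: two entities share a label, one has none.
toy_ent2ix = {"/m/a": 0, "/m/b": 1, "/m/c": 2, "/m/d": 3}
toy_ent2lbl = {"/m/a": "Paris", "/m/b": "Paris", "/m/c": "London"}

ix2lbl, lbl2ixs = build_label_maps(toy_ent2ix, toy_ent2lbl)
print(ix2lbl)
print(lbl2ixs)
```

A plain dictionary inversion would silently keep only one of the two entities that share a label, which is why the inverse mapping here stores lists.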

Task: test entities and relations:

Get the KG indexes for a few entities and relations. If you use the Freebase or Wikidata graphs, you can try 'J. K. Rowling' and 'WALL·E' as entities (note that the dot in 'WALL·E' is neither a hyphen nor an ordinary period). For relations you can try 'influenced by' and 'genre'. (Tip: to check the names of entities and relations, open the train.txt file you cloned.)
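
Exact label lookups fail easily on characters like the dot in 'WALL·E'. The sketch below searches labels by case-insensitive substring instead, so you can find the exact spelling before looking up the index. It assumes the dot is U+00B7 (MIDDLE DOT); the helper find_labels and the toy label list are hypothetical.

```python
import unicodedata


def find_labels(query, labels):
    """Return the labels that contain `query`, ignoring case and Unicode form."""

    def norm(text):
        return unicodedata.normalize("NFKC", text).casefold()

    needle = norm(query)
    return sorted(label for label in labels if needle in norm(label))


# Hypothetical label list; in the lab this would be the keys of your label mapping.
toy_labels = ["WALL\u00b7E", "J. K. Rowling", "J. R. R. Tolkien"]

print(find_labels("wall", toy_labels))
print(find_labels("j. ", toy_labels))

# The middle dot is a different character from a hyphen or a period.
print("WALL\u00b7E" == "WALL-E", "WALL\u00b7E" == "WALL.E")
```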

Task: model:

Load a pre-trained TransE model that matches your KnowledgeGraph.
- Print out the numbers of entities, relations, and the dimensions of the entity and relation vectors. Do they match your KnowledgeGraph?
Get the vectors for your test entities and relations (for example, 'J. K. Rowling' and 'influenced by').
Find vectors for a few more entities (both unrelated and related ones, e.g., 'J. R. R. Tolkien', 'C. S. Lewis', ...). Use the model.dissimilarity() method to estimate how semantically close your entities are. Do the distances make sense?
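
To compare several entities at once, it helps to tabulate all pairwise distances. The sketch below is a dependency-free stand-in that works on plain lists and uses the Euclidean distance; the model's own dissimilarity function operates on tensors and may differ (for instance by using a squared distance), so treat the numbers as illustrative. The function name and the 2-d toy vectors are hypothetical.

```python
import math
from itertools import combinations


def pairwise_distances(vectors):
    """Map each unordered pair of names to the Euclidean distance of their vectors."""
    return {
        (a, b): math.dist(vectors[a], vectors[b])
        for a, b in combinations(vectors, 2)
    }


# Hypothetical 2-d vectors; real embeddings have many more dimensions.
toy_vectors = {
    "author_a": [0.0, 0.0],
    "author_b": [3.0, 4.0],
    "film_c": [6.0, 8.0],
}

distances = pairwise_distances(toy_vectors)
# Print the closest pairs first.
for pair, dist in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, dist)
```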

Task: K-nearest neighbours:

Find the indexes of the 10 entity vectors that are nearest neighbours to your entity of choice. You can use scikit-learn's sklearn.neighbors.NearestNeighbors.kneighbors() method for this.
Map the indexes of the 10 nearest neighbouring entities back into human-understandable labels. Do the results make sense? Try the same thing with another entity (e.g., 'WALL·E').
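
The following dependency-free sketch shows what a k-nearest-neighbour query computes, using a brute-force search over plain lists rather than scikit-learn. The function name, the toy vectors, and the toy label mapping are hypothetical.

```python
import heapq
import math


def nearest_neighbours(query, vectors, k=10):
    """Return (index, distance) pairs for the k rows of `vectors` closest to `query`."""
    scored = ((ix, math.dist(query, vec)) for ix, vec in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Hypothetical entity vectors and index-to-label mapping.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
toy_ix2lbl = {0: "origin", 1: "east", 2: "north", 3: "far"}

neighbours = nearest_neighbours(toy_vectors[0], toy_vectors, k=3)
print([(toy_ix2lbl[ix], dist) for ix, dist in neighbours])
```

Note that the query entity is its own nearest neighbour at distance 0, so you may want to ask for one extra neighbour and drop the first hit. With scikit-learn, fitting NearestNeighbors(n_neighbors=10) on the entity vectors and calling kneighbors() on the query vector plays the same role and returns the distances and the indexes.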

Task: translation:

Add together the vectors for an entity and a relation that is meaningful for that entity (e.g., 'J. K. Rowling' + 'influenced by', 'WALL·E' + 'genre'). Find the 10 nearest neighbouring entities for the vector sum. Do the results make sense? Try more entities and relations. Try to find examples that work well and examples that do not.
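
TransE is trained so that, for a true fact, the head vector plus the relation vector lands close to the tail vector. The sketch below illustrates the resulting query on plain lists: add head and relation, then rank all entities by distance to the sum. The function names and the toy vectors are hypothetical; with the real model you would do the same arithmetic on the embedding tensors.

```python
import math


def translate(head, relation):
    """Return the element-wise sum head + relation."""
    return [h + r for h, r in zip(head, relation)]


def rank_tails(head, relation, entity_vectors, k=10):
    """Return the indexes of the k entities closest to head + relation."""
    target = translate(head, relation)
    order = sorted(
        range(len(entity_vectors)),
        key=lambda ix: math.dist(target, entity_vectors[ix]),
    )
    return order[:k]


# Hypothetical 2-d entity vectors and one relation vector.
toy_entities = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0], [1.0, 0.5]]
toy_relation = [1.0, 1.0]

# Which entities are the most plausible tails for (entity 0, relation, ?)
candidates = rank_tails(toy_entities[0], toy_relation, toy_entities, k=2)
print(candidates)
```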

Code to get started

With graph embeddings, it is most convenient to work in notebooks (ipynb files). The code below is also available as a Colab notebook: https://colab.research.google.com/drive/1gS2D1XYSviAmhkS8moJIpY0N8ltJFM3C

!pip install torchkge
!pip install scikit-learn
!git clone https://github.com/villmow/datasets_knowledge_embedding.git

from torchkge.utils.datasets import load_fb15k237

kg_train, kg_val, kg_test = load_fb15k237()

# Show the last 10 entity and relation identifiers (use [:10] for the first 10).
print(list(kg_train.ent2ix.keys())[-10:])
print(list(kg_train.rel2ix.keys())[-10:])


"""Download files with human-readable labels for (most) Freebase entities used in the dataset. 
Labels seem to be missing for some entities used in FB15k-237."""

import json

TEXT_TRIPLES_DIR = 'datasets_knowledge_embedding/FB15k-237/'
with open(TEXT_TRIPLES_DIR+'entity2wikidata.json') as file:
    _entity2wikidata = json.load(file)

ent2lbl = {
    ent: wd['label']
    for ent, wd in _entity2wikidata.items()
}
lbl2ent = {lbl: ent for ent, lbl in ent2lbl.items()}

print([
    ent2lbl[ent] 
    for ent in kg_train.ent2ix.keys()
    if ent in ent2lbl][-10:])

If You Have More Time

Try it out with different datasets, for example one you create yourself using SPARQL queries on an open KG.