I am working on an NLP pipeline that takes a collection of textual records as input and extracts entities and relationships from the text of each record. The pipeline uses the spaCy library for named entity recognition and BLINK for linking entities to an external data source (Wikidata). The pipeline currently outputs a TTL file in the following format:

@prefix : <http://cna.outwebsite.ac.uk/our_Text_Collection/> .
@prefix cna: <http://cna.outwebsite.ac.uk/> .
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix DBpedia: <http://dbpedia.org/ontology/> .
@prefix Schema: <http://schema.org/> .
@prefix Wikidata: <https://www.wikidata.org/wiki/> .
@prefix DUL: <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#> .

:15606 cna:text "Floods Newlyn Coombe St Peters Church visible centre rear built in 1866"^^xsd:string .
<<:15606 cna:mentions <https://en.wikipedia.org/wiki/Newlands_Church>>> cna:similarity 79.5187759399414;
         cna:start 7 ;
         cna:end 37 ;
         cna:support 1 .
<<:15606 cna:mentions <https://en.wikipedia.org/wiki/1866_in_architecture>>> cna:similarity 78.223876953125;
         cna:start 67 ;
         cna:end 71 ;
         cna:support 1 .
:15608 cna:text "View of beach and foreshore near the bowling green pavilion"^^xsd:string .
@prefix : <http://cna.outwebsite.ac.uk/our_Text_Collection/> .
@prefix cna: <http://cna.outwebsite.ac.uk/> .
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix DBpedia: <http://dbpedia.org/ontology/> .
@prefix Schema: <http://schema.org/> .
@prefix Wikidata: <https://www.wikidata.org/wiki/> .
@prefix DUL: <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#> .

:15620 cna:text "Location bottom of Morrab Road at the junction with the promenade London"^^xsd:string .
<<:15620 cna:mentions <https://en.wikipedia.org/wiki/Morchard_Road>>> cna:similarity 82.10499572753906;
         cna:start 19 ;
         cna:end 30 ;
         cna:support 1 .
<<:15620 cna:mentions <https://en.wikipedia.org/wiki/London>>> cna:similarity 83.20065307617188;
         cna:start 66 ;
         cna:end 74 ;
         cna:support 1 .
:15640 cna:text "Damage to the Bolitho Gardens at Wherrytown Bijou House in view"^^xsd:string .
<<:15640 cna:mentions <https://en.wikipedia.org/wiki/Bolitho,_Cornwall>>> cna:similarity 79.88461303710938;
         cna:start 14 ;
         cna:end 29 ;
         cna:support 1 .
<<:15640 cna:mentions <https://en.wikipedia.org/wiki/Merriville_House_and_Gardens>>> cna:similarity 79.99214935302734;
         cna:start 33 ;
         cna:end 55 ;
         cna:support 1 .
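
For clarity, the `cna:start` and `cna:end` values are the character offsets of each mention within the record's text, i.e. spaCy's `ent.start_char` and `ent.end_char`. A minimal sketch of where they come from:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Floods Newlyn Coombe St Peters Church visible centre rear built in 1866")

# Each recognized entity span carries the character offsets that end up
# as cna:start / cna:end in the TTL output
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)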

I need to extract Subject-Predicate-Object triples in RDF format from this TTL file so they can be uploaded to Blazegraph. I initially attempted this through string manipulation, but I faced challenges because the file content varies across different collections.
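
For illustration, a stripped-down version of that string-manipulation attempt looks like the sketch below. It hard-codes the `<<:id cna:mentions <url>>>` shape of the annotation lines, which is exactly what breaks when the layout varies (the base and predicate URIs are the ones from the desired output below):

import re

# Hard-coded URIs taken from the target output format
BASE = 'http://cna.outwebsite.ac.uk/our_Text_Collection/'
MENTIONS = 'http://cna.outwebsite.ac.uk/mentions'

# Matches annotation lines of the form <<:id cna:mentions <url>>>
pattern = re.compile(r'<<:(\S+)\s+cna:mentions\s+<([^>]+)>')

with open('our_collection.ttl') as ff:
    content = ff.read()

# Emit one plain subject-predicate-object triple per annotation
for record_id, url in pattern.findall(content):
    print(f'<{BASE}{record_id}> <{MENTIONS}> <{url}> .')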

I was advised to use displaCy to extract the desired triples. However, the code I am currently using does not produce the exact relationships I need. I want the triples in the simple Subject, Predicate, Object format, like this example:

<http://cna.outwebsite.ac.uk/our_Text_Collection/15606> 
<http://cna.outwebsite.ac.uk/mentions>  
<https://en.wikipedia.org/wiki/Newlands_Church>
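
Equivalently, here is a minimal rdflib sketch of building and serializing such triples (the URIs are the ones from the example above; N-Triples is a format Blazegraph can load directly):

from rdflib import Graph, Namespace, URIRef

coll = Namespace('http://cna.outwebsite.ac.uk/our_Text_Collection/')
cna = Namespace('http://cna.outwebsite.ac.uk/')

g = Graph()
# One plain subject-predicate-object triple, as in the example above
g.add((coll['15606'], cna.mentions,
       URIRef('https://en.wikipedia.org/wiki/Newlands_Church')))

# Serialize as N-Triples for upload to Blazegraph
g.serialize(destination='mentions.nt', format='nt')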

Here is the displaCy code I am currently using:

import spacy
from spacy import displacy
from rdflib import Graph, Literal, Namespace, RDF, URIRef

# Load English model
nlp = spacy.load('en_core_web_sm')

# Create RDF graph
graph = Graph()

# Define namespaces (note: these are defined but never bound to the graph below)
tanc = Namespace('http://cna.outwebsite.ac.uk/')
dbr = Namespace('http://dbpedia.org/resource/')
DBpedia = Namespace('http://dbpedia.org/ontology/')
Schema = Namespace('http://schema.org/')
Wikidata = Namespace('https://www.wikidata.org/wiki/')
DUL = Namespace('http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#')

# Process the content
with open("our_collection.ttl","r") as ff:
    content = ff.read()

# Split the content into statements
statements = content.split('\n\n')

# Process each statement
triples = []
for statement in statements:
    if statement.strip():
        # Parse the statement using spaCy
        doc = nlp(statement)
        
        # Extract the subject, predicate, and object (naively: the first three tokens)
        subject = doc[0].text.strip(':')
        predicate = doc[1].text.strip()
        obj = doc[2].text.strip('"')
        
        # Create RDF triples
        triples.append((subject, predicate, obj))

# Create a new Doc object from the triples
text = ' '.join([f'{subj} {pred} {obj}' for subj, pred, obj in triples])
doc = nlp(text)

# # Generate the displacy visualization
# displacy.serve(doc, style='dep')


# Generate the displacy visualization
html = displacy.render(doc, style='dep', options={'compact': True, 'bg': '#ffffff'})

# Save the visualization to a file
with open('visualization.html', 'w', encoding='utf-8') as file:
    file.write(html)

# Serve the displacy visualization (pass the spaCy `doc`, not the rdflib `graph`)
displacy.serve(doc, style='dep', port=8000, auto_select_port=True)

I would appreciate any guidance on a more effective approach to extracting the triples from the TTL file, as the code above doesn't achieve what I want. Is there a better way to do this?

Youcef
  • Looks like RDF*. Try to load into Blazegraph in RDR mode, then process there. – Stanislav Kralin Jun 01 '23 at 16:17
  • you could also load it into an RDF-star capable triple store, or use e.g. the Jena or RDF4J CLI, and then dump just the embedded triples as an RDF graph via a `CONSTRUCT` query using the `subject`, `predicate` and `object` functions: – UninformedUser Jun 02 '23 at 08:16
  • `CONSTRUCT { ?s ?p ?o } WHERE { ?_s ?_p ?_o FILTER(ISTriple(?_s)) BIND(subject(?_s) as ?s) BIND(predicate(?_s) as ?p) BIND(object(?_s) as ?o) }` using Apache Jena CLI with `sparql --data --query ` – UninformedUser Jun 02 '23 at 08:16

0 Answers