-1

To import XML data into a neo4j DB I first parse the XML to a python dictionary and then use CYPHER queries:

WITH $pubmed_dict as pubmed_article
UNWIND pubmed_article as particle
...
FOREACH (author IN particle.MedlineCitation.Article.AuthorList.Author |
  MERGE (a:Author {last_name: COALESCE(author.LastName, 'LAST NAME MISSING!')})
  ON CREATE SET a.first_name = author.ForeName, a.affiliation = author.AffiliationInfo.Affiliation
  ON MATCH SET a.first_name = author.ForeName, a.affiliation = author.AffiliationInfo.Affiliation
  MERGE (p)<-[:WROTE]-(a)      
)

Unfortunately, Authors don't have unique IDs in the database, so it might be that different authors have the same last names but different initials or affiliations.

...
                <Author ValidYN="Y">
                    <LastName>Smith</LastName>
                    <ForeName>A L</ForeName>
                    <Initials>AL</Initials>
                    <AffiliationInfo>
                        <Affiliation>University X</Affiliation>
                    </AffiliationInfo>
                </Author>
...
                <Author ValidYN="Y">
                    <LastName>Smith</LastName>
                    <ForeName>A L</ForeName>
                    <Initials>AL</Initials>
                    <AffiliationInfo>
                        <Affiliation>University BUMBABU</Affiliation>
                    </AffiliationInfo>
                </Author>

My intention was to MERGE on author.LastName but ON MATCH check if the author has the same ForeName OR the same Affiliation and if not create a new node instead.

How would I do that using CYPHER queries?

EDIT 1

Node Key constraints are the solution, which is an Enterprise Edition feature, though. Looking for a workaround for that.

EDIT 2

This code is working almost perfectly:

WITH $pubmed_dict as pubmed_article
    UNWIND pubmed_article as particle
        MERGE (p:Publication {pmid: particle.MedlineCitation.PMID.text})
        ON CREATE SET p.title = COALESCE (particle.MedlineCitation.Article.Journal.Title, particle.MedlineCitation.Article.ArticleTitle)
        ON MATCH SET p.title = COALESCE (particle.MedlineCitation.Article.Journal.Title, particle.MedlineCitation.Article.ArticleTitle)

    FOREACH (author IN particle.MedlineCitation.Article.AuthorList.Author |
      MERGE (a:Author {last_name: COALESCE(author.LastName, 'LAST NAME MISSING!'), first_name: COALESCE(author.ForeName, 'FIRST NAME MISSING!')})
      MERGE (p)<-[:WROTE]-(a)      
    )

To sum it up: For every author I want to create a new author IF LastName OR ForeName OR Affiliation are different. I also need NEW Nodes for authors where LAST NAME MISSING! and FIRST NAME MISSING!

Is it possible to achieve this result WITHOUT Key Node Constraints? (because this is an Enterprise Edition feature...)

Rich Steinmetz
  • 1,020
  • 13
  • 28
  • 1
    Wouldn't it be nice to have a unique identifier for the authors. NLM has tried to tackle this, but with limited success: https://www.nlm.nih.gov/pubs/techbull/nd10/nd10_pm_author_id.html. In the mean time, your strategy seems best. – David A Stumpf Jun 17 '18 at 14:52
  • Thanks a lot for your article! Indeed, the ID issue for pubmed is quite a pain and that a lot of ForeNames where added as just initials, especially at the beginning, makes it even worse. – Rich Steinmetz Jun 19 '18 at 04:43

2 Answers2

1

You can use constraints, then neo4j will check uniqueness for you.

From documentation:

To create a Node Key ensuring that all nodes with a particular label have a set of defined properties whose combined value is unique, and where all properties in the set are present

CREATE CONSTRAINT ON (author:Author)  ASSERT (author.first_name, author.last_name, author.affiliation) IS NODE KEY
mastisa
  • 1,875
  • 3
  • 21
  • 39
  • The problem with just using constraints is that author can -1.- Have equal first name but different last name. With constraints like in your example there would be created a node for each equal ForeName, right? -2.- have equal ForeName AND LastName but different affiliation, in this case I also want to create a new node – Rich Steinmetz Jun 17 '18 at 05:37
  • While this approach works fine with a database started by neo4j Desktop, it fails on our Debian server, because this is a Enterprise Edition feature. – Rich Steinmetz Jun 23 '18 at 08:29
  • Unfortunately it is. Just search for "Enterprise" on the page you just referenced. – Rich Steinmetz Jun 23 '18 at 09:02
  • but yeah, it seems almost crazy that such a seemingly basic feature is Enterprise – Rich Steinmetz Jun 23 '18 at 12:49
  • I also don't why it works with my neo4j Desktop database version, although I obviously don't have a license.. – Rich Steinmetz Jun 23 '18 at 13:07
1

The authors do have a unique ID in Neo4j, the node ID. That can be used to identify the node and then the set the properties. Maybe something like this:

Match (a:Author{LastName:'xxx',ForeName:'yyy'}) 
with a, id(a) as ID
where ID > -1
match (b) where id(b)=ID set b.first_name = author.ForeName, b.affiliation = author.AffiliationInfo.Affiliation

The node's ID is not necessarily stable or predictable, so you have to access it directly before using it.

Because you are using python code, you might to better with a global query to pull down the author node data:

match (a:Author{LastName:'xxx',ForeName:'yyy'})  return a.LastName,a.ForeName,id(a) as ID

then, you can write a csv file to bulk upload the desired info. The csv could look like this:

> "ID","ForeName","LastName","Affiliation" 
"26","David","Smith","Johns Hopkins" 
etc.

The python code could do the filtering of nodes that do not need processing.

Then load the file:

LOAD CVS with HEADER file:///'xxx.csv' as line 
match (a) where id(a)=toInteger(line.ID) 
set a.Affiliation=line.toString(line.Affiliation")
David A Stumpf
  • 753
  • 5
  • 13
  • Interesting approach. What would you say which advantage does it provide against using constraints? – Rich Steinmetz Jun 19 '18 at 06:13
  • It's not avoiding constraints that concerned me. The FOR EACH may be slower; but verifying this would require testing. The LOAD CSV is intended to be fast in bulk uploads or updates. If your application is not changing the nodes in a manner than alters their ID (that is you assure their stability), then you can also use these ids as surrogates for a PubMed author id. The other thought I had was whether the schema is ideal. Authors can have more than one or a changing affiliation. So might the affiliation be a separate set of nodes? That would have other analytic value too. – David A Stumpf Jun 20 '18 at 14:59
  • You are right, using affiliation as an identifier is far from ideal, but it gives a hint for authors who might have the same name. I thought, an author is "John Doe" with affiliation A can be identified as another entity against author "John Doe" with affiliation B. But it's a great idea as well to have affiliations as separate nodes, thanks! – Rich Steinmetz Jun 23 '18 at 08:36