
I'm using straightforward Cypher to load data from a CSV and just create nodes.

Code is as follows:


    :auto LOAD CSV WITH HEADERS FROM 'file:///registrants.csv' AS row
    CALL {
        WITH row 
        MERGE (r:Registrant {row_wid: toInteger(row.ROW_WID)})
        ON CREATE SET
            r.row_wid = toInteger(row.ROW_WID), 
            r.w_insert_dt = row.W_INSERT_DT,
            r.w_update_dt = row.W_UPDATE_DT,
            r.email_address = row.EMAIL_ADDRESS,
            r.attendee_contact_wid = toInteger(row.ATTENDEE_CONTACT_WID),
            r.attendee_account_wid = toInteger(row.ATTENDEE_ACCOUNT_WID),
            r.reg_contact_wid = toInteger(row.REG_CONTACT_WID),
            r.reg_account_wid = toInteger(row.REG_ACCOUNT_WID),
            r.event_wid = toInteger(row.EVENT_WID),
            r.tkt1_wid = toInteger(row.TKT1_WID),
            r.tkt2_wid = toInteger(row.TKT2_WID),
            r.tkt3_wid = toInteger(row.TKT3_WID),
            r.tkt4_wid = toInteger(row.TKT4_WID),
            r.tkt5_wid = toInteger(row.TKT5_WID),
            r.tkt6_wid = toInteger(row.TKT6_WID),
            r.current_flg = row.CURRENT_FLG,
            r.delete_flg = row.DELETE_FLG,
            r.created_on_dt = row.CREATED_ON_DT,
            r.updated_on_dt = row.UPDATED_ON_DT,
            r.reg_dt = row.REG_DT,
            r.attend_dt = row.ATTEND_DT,
            r.cancel_dt = row.CANCEL_DT,
            r.alumni = row.ALUMNI,
            r.reg_channel = row.REG_CHANNEL
            
    } IN TRANSACTIONS OF 1000 ROWS

I tested this with 100 rows and it worked seamlessly. With the full 700K rows, the query has been running for over 12 hours.

I also have an index in the DB to support creating these nodes.
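
For reference, here is one way to confirm which indexes and constraints actually exist (the SHOW commands are available in recent Neo4j versions):

    SHOW INDEXES;
    SHOW CONSTRAINTS;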

I'm a newbie, so please excuse me if I'm doing something wrong.

From my research, this looks right.

I'm not getting any errors.

Any insights appreciated.

Thank you.

Windstorm1981
  • Can you do a simple "CREATE" statement without the "SET" clause? It is much faster. – jose_bacoy Mar 20 '23 at 17:24
  • I could. Once I have the properties loaded in the Nodes is it faster to change their type separately in a subsequent query? – Windstorm1981 Mar 20 '23 at 17:35
  • `CSV LOAD` is not recommended for large CSV files. You may want to consider using [neo4j-admin import](https://neo4j.com/docs/operations-manual/current/tools/neo4j-admin/neo4j-admin-import/). – cybersam Mar 20 '23 at 18:42
  • @cybersam thanks for this - unfortunately per the documentation admin import only good for a completely new database and can only be used once – Windstorm1981 Mar 20 '23 at 20:24
  • Actually [neo4j-admin import](https://neo4j.com/docs/operations-manual/current/tools/neo4j-admin/neo4j-admin-import/) supports "incremental" mode for existing DBs. – cybersam Mar 20 '23 at 20:45
  • One small insight, in `ON CREATE SET` clause you are setting `r.row_wid = toInteger(row.ROW_WID)`, this is redundant. `MERGE` will automatically set it, as you have specified it there. Setting it again here might again traverse the index, which is not required. – Charchit Kapoor Mar 23 '23 at 07:18
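
For reference, a minimal sketch of the two-pass approach discussed in the comments above: create the nodes first with minimal type coercion, then convert property types in a separate batched query. The label, file, and column names are taken from the question; the batch size is just an example.

    // pass 1: create nodes, keeping most properties as strings
    :auto LOAD CSV WITH HEADERS FROM 'file:///registrants.csv' AS row
    CALL {
        WITH row
        CREATE (r:Registrant {row_wid: toInteger(row.ROW_WID)})
        SET r.email_address = row.EMAIL_ADDRESS,
            r.event_wid = row.EVENT_WID
            // ...remaining columns set the same way...
    } IN TRANSACTIONS OF 10000 ROWS

    // pass 2: convert selected properties to integers, in batches
    :auto MATCH (r:Registrant)
    CALL {
        WITH r
        SET r.event_wid = toInteger(r.event_wid)
    } IN TRANSACTIONS OF 10000 ROWS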

1 Answer


Make sure you have a uniqueness constraint on Registrant.row_wid:

    CREATE CONSTRAINT FOR (r:Registrant) REQUIRE r.row_wid IS UNIQUE;

Examine the query plan by prepending EXPLAIN to the query: confirm that the index is being used and that there are no "eager" operations that would prevent batching (given this query, I wouldn't expect there to be any).
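
As a rough sketch, reusing the file and label from the question (EXPLAIN only produces the plan; it does not execute the query):

    EXPLAIN
    LOAD CSV WITH HEADERS FROM 'file:///registrants.csv' AS row
    CALL {
        WITH row
        MERGE (r:Registrant {row_wid: toInteger(row.ROW_WID)})
        // ON CREATE SET ... (trimmed here; include your full SET when checking the real plan)
    } IN TRANSACTIONS OF 1000 ROWS

In the plan, the MERGE should be resolved with an index/constraint seek on :Registrant(row_wid) rather than a NodeByLabelScan, and there should be no Eager operator.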

Increase the number of rows per transaction. The right value depends on how much memory is allocated for Neo4j transactions, but I typically use around 100k rows.
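
For example, only the final line of the original query needs to change (assuming the transaction memory settings can accommodate the larger batches):

    } IN TRANSACTIONS OF 100000 ROWS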

William Lyon
  • Yes I have the uniqueness constraint. I just stopped the query after 16 hours and in fact had done about 360k nodes of the 700k. I also did the explain - tells me it is returning 3.8 mm items on the 'row' which i don't understand. Is it best to load everything in default string type and then subsequently do 'set' to change to integer, datetime, etc? – Windstorm1981 Mar 20 '23 at 18:20
  • I would try increasing the number of rows per transaction next. Each transaction has some overhead associated with it. With 700k rows and only 1000 rows per transaction that's 700 transactions. Try something like 50k or 100k rows per transaction. – William Lyon Mar 20 '23 at 19:57
  • Any chance this issue has to do with allocated memory? Tried to alter it but not seeing much difference – Windstorm1981 Mar 20 '23 at 20:54