Home » Java » Enterprise Java » Neo4j: Cypher – Deleting duplicate nodes

About Mark Needham

Neo4j: Cypher – Deleting duplicate nodes

I had a problem on a graph I was working on recently where I’d managed to create duplicate nodes because I hadn’t applied any unique constraints.

I wanted to remove the duplicates, and came across Jimmy Ruts’ excellent post which shows some ways to do this.

Let’s first create a graph with some duplicate nodes to play with:

UNWIND range(0, 100) AS id
CREATE (p1:Person {id: toInteger(rand() * id)})
MERGE (p2:Person {id: toInteger(rand() * id)})
MERGE (p3:Person {id: toInteger(rand() * id)})
MERGE (p4:Person {id: toInteger(rand() * id)})
CREATE (p1)-[:KNOWS]->(p2)
CREATE (p1)-[:KNOWS]->(p3)
CREATE (p1)-[:KNOWS]->(p4)
 
Added 173 labels, created 173 nodes, set 173 properties, created 5829 relationships, completed after 408 ms.

How do we find the duplicate nodes?

MATCH (p:Person)
WITH p.id as id, collect(p) AS nodes 
WHERE size(nodes) >  1
RETURN [ n in nodes | n.id] AS ids, size(nodes)
ORDER BY size(nodes) DESC
LIMIT 10
 
╒══════════════════════╤═════════════╕
│"ids"                 │"size(nodes)"│
╞══════════════════════╪═════════════╡
│[1,1,1,1,1,1,1,1]     │8            │
├──────────────────────┼─────────────┤
│[0,0,0,0,0,0,0,0]     │8            │
├──────────────────────┼─────────────┤
│[17,17,17,17,17,17,17]│7            │
├──────────────────────┼─────────────┤
│[4,4,4,4,4,4,4]       │7            │
├──────────────────────┼─────────────┤
│[2,2,2,2,2,2]         │6            │
├──────────────────────┼─────────────┤
│[5,5,5,5,5,5]         │6            │
├──────────────────────┼─────────────┤
│[19,19,19,19,19,19]   │6            │
├──────────────────────┼─────────────┤
│[11,11,11,11,11]      │5            │
├──────────────────────┼─────────────┤
│[25,25,25,25,25]      │5            │
├──────────────────────┼─────────────┤
│[43,43,43,43,43]      │5            │
└──────────────────────┴─────────────┘

Let’s zoom in on all the people with ‘id:1’ and work out how many relationships they have. Our plan is keep the node that has the most connections and get rid of the others.

MATCH (p:Person)
WITH p.id as id, collect(p) AS nodes 
WHERE size(nodes) >  1
WITH nodes ORDER BY size(nodes) DESC
LIMIT 1
UNWIND nodes AS n 
RETURN n.id, id(n) AS internalId, size((n)--()) AS rels
ORDER BY rels DESC
 
╒══════╤════════════╤══════╕
│"n.id"│"internalId"│"rels"│
╞══════╪════════════╪══════╡
│1     │175         │1284  │
├──────┼────────────┼──────┤
│1     │184         │721   │
├──────┼────────────┼──────┤
│1     │180         │580   │
├──────┼────────────┼──────┤
│1     │2           │391   │
├──────┼────────────┼──────┤
│1     │195         │361   │
├──────┼────────────┼──────┤
│1     │199         │352   │
├──────┼────────────┼──────┤
│1     │302         │5     │
├──────┼────────────┼──────┤
│1     │306         │1     │
└──────┴────────────┴──────┘

So in this example we want to keep the node that has 210 relationships and delete the rest.

To make things easy we need the node with the highest cardinality to be first or last in our list. We can ensure that’s the case by ordering the nodes before we group them.

MATCH (p:Person)
WITH p 
ORDER BY p.id, size((p)--()) DESC
WITH p.id as id, collect(p) AS nodes 
WHERE size(nodes) >  1
RETURN [ n in nodes | {id: n.id,rels: size((n)--()) } ] AS ids, size(nodes)
ORDER BY size(nodes) DESC
LIMIT 10
 
╒══════════════════════════════════════════════════════════════════════╤═════════════╕
│"ids"                                                                 │"size(nodes)"│
╞══════════════════════════════════════════════════════════════════════╪═════════════╡
│[{"id":1,"rels":1284},{"id":1,"rels":721},{"id":1,"rels":580},{"id":1,│8            │
│"rels":391},{"id":1,"rels":361},{"id":1,"rels":352},{"id":1,"rels":5},│             │
│{"id":1,"rels":1}]                                                    │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":0,"rels":2064},{"id":0,"rels":2059},{"id":0,"rels":1297},{"id":│8            │
│0,"rels":1124},{"id":0,"rels":995},{"id":0,"rels":928},{"id":0,"rels":│             │
│730},{"id":0,"rels":702}]                                             │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":17,"rels":153},{"id":17,"rels":105},{"id":17,"rels":81},{"id":1│7            │
│7,"rels":31},{"id":17,"rels":15},{"id":17,"rels":14},{"id":17,"rels":1│             │
│}]                                                                    │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":4,"rels":394},{"id":4,"rels":320},{"id":4,"rels":250},{"id":4,"│7            │
│rels":201},{"id":4,"rels":162},{"id":4,"rels":162},{"id":4,"rels":14}]│             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":2,"rels":514},{"id":2,"rels":329},{"id":2,"rels":318},{"id":2,"│6            │
│rels":241},{"id":2,"rels":240},{"id":2,"rels":2}]                     │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":5,"rels":487},{"id":5,"rels":378},{"id":5,"rels":242},{"id":5,"│6            │
│rels":181},{"id":5,"rels":158},{"id":5,"rels":8}]                     │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":19,"rels":153},{"id":19,"rels":120},{"id":19,"rels":84},{"id":1│6            │
│9,"rels":53},{"id":19,"rels":45},{"id":19,"rels":1}]                  │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":11,"rels":222},{"id":11,"rels":192},{"id":11,"rels":172},{"id":│5            │
│11,"rels":152},{"id":11,"rels":89}]                                   │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":25,"rels":133},{"id":25,"rels":107},{"id":25,"rels":98},{"id":2│5            │
│5,"rels":15},{"id":25,"rels":2}]                                      │             │
├──────────────────────────────────────────────────────────────────────┼─────────────┤
│[{"id":43,"rels":92},{"id":43,"rels":85},{"id":43,"rels":9},{"id":43,"│5            │
│rels":5},{"id":43,"rels":1}]                                          │             │
└──────────────────────────────────────────────────────────────────────┴─────────────┘

Now it’s time to delete the duplicates:

MATCH (p:Person)
WITH p 
ORDER BY p.id, size((p)--()) DESC
WITH p.id as id, collect(p) AS nodes 
WHERE size(nodes) >  1
UNWIND nodes[1..] AS n
DETACH DELETE n
 
Deleted 143 nodes, deleted 13806 relationships, completed after 29 ms.

Now if we run our duplicate query:

MATCH (p:Person)
WITH p.id as id, collect(p) AS nodes 
WHERE size(nodes) >  1
RETURN [ n in nodes | n.id] AS ids, size(nodes)
ORDER BY size(nodes) DESC
LIMIT 10
 
(no changes, no records)

What about if we remove the WHERE clause?

MATCH (p:Person)
WITH p.id as id, collect(p) AS nodes 
RETURN [ n in nodes | n.id] AS ids, size(nodes)
ORDER BY size(nodes) DESC
LIMIT 10
 
╒═════╤═════════════╕
│"ids"│"size(nodes)"│
╞═════╪═════════════╡
│[23] │1            │
├─────┼─────────────┤
│[86] │1            │
├─────┼─────────────┤
│[77] │1            │
├─────┼─────────────┤
│[59] │1            │
├─────┼─────────────┤
│[50] │1            │
├─────┼─────────────┤
│[32] │1            │
├─────┼─────────────┤
│[41] │1            │
├─────┼─────────────┤
│[53] │1            │
├─────┼─────────────┤
│[44] │1            │
├─────┼─────────────┤
│[8]  │1            │
└─────┴─────────────┘

Hoorah, no more duplicates! Finally, let’s check that we kept the node we expected to keep. We expect it to have an ‘internalId’ of 175:

MATCH (p:Person {id: 1})
RETURN size((p)--()), id(p) AS internalId
 
╒═══════════════╤════════════╕
│"size((p)--())"│"internalId"│
╞═══════════════╪════════════╡
│242            │175         │
└───────────────┴────────────┘

Which it does! There are many fewer relationships than we had before because a lot of those relationships were to duplicate nodes that we’ve now deleted.

If we want to go a step further we could ‘merge’ the duplicate node’s relationships onto the nodes that we did keep, but that’s for another post!

Published on Java Code Geeks with permission by Mark Needham, partner at our JCG program. See the original article here: Neo4j: Cypher – Deleting duplicate nodes

Opinions expressed by Java Code Geeks contributors are their own.

Do you want to know how to develop your skillset to become a Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you our best selling eBooks for FREE!

 

1. JPA Mini Book

2. JVM Troubleshooting Guide

3. JUnit Tutorial for Unit Testing

4. Java Annotations Tutorial

5. Java Interview Questions

6. Spring Interview Questions

7. Android UI Design

 

and many more ....

 

Receive Java & Developer job alerts in your Area

I have read and agree to the terms & conditions

 

Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

0 Comments
Inline Feedbacks
View all comments