How to speed up the creation of nodes in a Neo4j database with py2neo from Python

  • @Tsartsaris

    Posted on 2015-01-17

    Let's see how we can speed up the creation of nodes in a Neo4j database with py2neo from Python. For the purposes of the test we will use the standard creation techniques derived from the documentation of the py2neo library.

    In our case we have eight lists of 10,000 records each (80,000 records in total) that we want to write to the database. For the creation we use a single transaction. Our code will look like this:

    from py2neo import Graph
    import time

    time1 = time.time()
    graph = Graph()
    statement = "CREATE (n:Person {name:{N}}) RETURN n"
    tx = graph.cypher.begin()


    def add_names(names):
        for name in names:
            tx.append(statement, {"N": name})
        tx.process()


    # Start at 0 so every list really holds 10,000 names
    my_list1 = list(range(0, 10000))
    my_list2 = list(range(10000, 20000))
    my_list3 = list(range(20000, 30000))
    my_list4 = list(range(30000, 40000))
    my_list5 = list(range(40000, 50000))
    my_list6 = list(range(50000, 60000))
    my_list7 = list(range(60000, 70000))
    my_list8 = list(range(70000, 80000))

    add_names(my_list1)
    add_names(my_list2)
    add_names(my_list3)
    add_names(my_list4)
    add_names(my_list5)
    add_names(my_list6)
    add_names(my_list7)
    add_names(my_list8)

    tx.commit()
    time2 = time.time()
    print time2 - time1

    After running the above code the time printed was 85.8 seconds.
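As an aside, the eight hand-written lists can be generated in a loop. This is a hypothetical refactor, not from the original post; starting at 0 makes every batch hold exactly 10,000 values:

```python
# Build the eight batches of 10,000 names programmatically instead of
# writing out my_list1 .. my_list8 by hand.
batches = [list(range(i * 10000, (i + 1) * 10000)) for i in range(8)]

# Each batch can then be fed to add_names in turn:
# for batch in batches:
#     add_names(batch)
```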

    Let's try to make things a bit faster now. We will use a Pool from multiprocessing: after combining all the lists into one list of lists, we map each list to the function. The commit now moves into the function, since each worker process handles its own batch there. Our code will look like this:

    from py2neo import Graph
    from multiprocessing import Pool
    import time

    time1 = time.time()
    graph = Graph()
    statement = "CREATE (n:Person {name:{N}}) RETURN n"
    tx = graph.cypher.begin()


    def add_names(names):
        for name in names:
            tx.append(statement, {"N": name})
        tx.process()
        tx.commit()


    # Start at 0 so every list really holds 10,000 names
    my_list1 = list(range(0, 10000))
    my_list2 = list(range(10000, 20000))
    my_list3 = list(range(20000, 30000))
    my_list4 = list(range(30000, 40000))
    my_list5 = list(range(40000, 50000))
    my_list6 = list(range(50000, 60000))
    my_list7 = list(range(60000, 70000))
    my_list8 = list(range(70000, 80000))

    # Combine all the lists into one list of lists
    newlis = [my_list1, my_list2, my_list3, my_list4,
              my_list5, my_list6, my_list7, my_list8]

    p = Pool(4)
    p.map(add_names, newlis)
    time2 = time.time()
    print time2 - time1

    At this point, if we run the code, we get an error from httplib:


    httplib.IncompleteRead: IncompleteRead(189503 bytes read)


    To tackle this we import httplib and force the requests to use HTTP/1.0, which does not use chunked transfer encoding, so our requests process normally. At the top we add these three lines:


    import httplib
    httplib.HTTPConnection._http_vsn = 10
    httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'


    After that we run the script again, and the time printed is now 42.8 seconds. That is about half the time needed without multiprocessing the job. All tests took place on a Sony Vaio laptop with 2 cores and 4 GB of RAM, running Ubuntu 14.04.
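As a footnote, another way to cut down round trips is to send each batch as a single Cypher query with UNWIND (available since Neo4j 2.1). This is a sketch, not from the original post: the `chunked` and `insert_in_batches` names are hypothetical, and the final call assumes py2neo 2.x's `graph.cypher.execute` and a running server:

```python
# Hypothetical sketch: one UNWIND query per 10,000-name chunk instead of
# one appended statement per name. Assumes py2neo 2.x and Neo4j >= 2.1.

UNWIND_STATEMENT = "UNWIND {names} AS name CREATE (:Person {name: name})"

def chunked(seq, size):
    """Split seq into consecutive chunks of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def insert_in_batches(graph, names, size=10000):
    # One HTTP round trip per chunk rather than one per name.
    for chunk in chunked(names, size):
        graph.cypher.execute(UNWIND_STATEMENT, names=chunk)
```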
