
I am working with a server whose configuration is:

RAM: 56 GB
Processor: 2.6 GHz x 16 cores

How can I do parallel processing from the shell? How can I use all of the processor's cores?

I have to load data from text files that contain millions of entries; for example, one file contains half a million lines. I am using a Django Python script to load the data into a PostgreSQL database, but it takes a very long time even though the server is well specified. I don't know how to use the server's resources in parallel so that the data is processed faster. Yesterday I loaded only 15000 lines from a text file into PostgreSQL, and it took nearly 12 hours. My Django Python script is below:

import re
import collections

def SystemType():
    # Read the input file and write the cleaned second column to SystemType.txt.
    filename = raw_input("Enter file Name:")
    in_file = file(filename,"r")
    out_file = file("SystemType.txt","w+")
    for line in in_file:
        line = line.decode("unicode_escape")
        line = line.encode("ascii","ignore")
        values = line.split("\t")
        if values[1]:
            for list in values[1].strip("wordnetyagowikicategory"):
                out_file.write(re.sub("[^\ a-zA-Z()<>\n""]"," ",list))

# Eliminate Duplicate Entries from extracted data using regular expression

def FSystemType():
    lines_seen = set()
    outfile = open("Output.txt","w+")
    infile = open("SystemType.txt","r+")
    for line in infile:
        if line not in lines_seen:
            l = line.lstrip()
            # Below reg exp is used to handle Camel Case.
            outfile.write(re.sub(r'((?<=[a-z])[A-Z]|(?<!\A)[A-Z](?=[a-z]))', r' \1', l).lower())
            lines_seen.add(line)
    infile.close()
    outfile.close()




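# Note: imports for Systemtype, slugify and Site are not shown in this snippet.
sylist=[]

def create_system_type(stname):
    # Fetch every existing title, then save stname if it is not among them.
    syslist=Systemtype.objects.all()
    for i in syslist:
        sylist.append(str(i.title))
    if not stname in sylist:
        slu=slugify(stname)
        st=Systemtype()
        st.title=stname
        st.slug=slu
        # st.sites=Site.objects.all()[0]
        st.save()
    print "one ST added."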
user2423706

2 Answers


If you could express your requirements without the code (not every shell programmer can really read Python), we could possibly help here.

For example, your report of 12 hours for 15000 lines suggests you have a too-busy "for" loop somewhere, and I suspect the nested one:

for list in values[1]....

What are you trying to strip: individual characters or whole words?

Then I'd suggest "awk".
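
To make the concern about that inner loop concrete, here is a minimal Python 3 sketch. The sample value and the prefix list are hypothetical, since the real input format is not shown in the question, and `remove_prefix` is an illustrative helper rather than part of the original script.

```python
# str.strip(chars) treats its argument as a set of characters to remove from
# both ends of the string; it does not remove a prefix string.
CHARS = "wordnetyagowikicategory"

# Hypothetical input value; the real file format is not shown in the question.
sample = "wikicategory_Greek_tragedy"

stripped = sample.strip(CHARS)

# Iterating over a string yields single characters, so a loop like the one in
# the question runs its body (re.sub plus write) once per character.
pieces = [ch for ch in stripped]

# Assumed prefixes, for illustration only.
PREFIXES = ("wordnet_", "wikicategory_")


def remove_prefix(value, prefixes=PREFIXES):
    """Return value without the first matching prefix, if any."""
    for prefix in prefixes:
        if value.startswith(prefix):
            return value[len(prefix):]
    return value


if __name__ == "__main__":
    print(repr(stripped))
    print(len(pieces))
    print(remove_prefix(sample))
```

With this sample, `strip` also eats the trailing word "tragedy", because every one of its letters appears in the character set. If whole words are what should be removed, a prefix check along the lines of `remove_prefix` is closer to the intent.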

Marty McGowan
  • I am only using the "SystemType.txt" file as input. I have already created that file; I just read data from it and add it to the PostgreSQL database. In FSystemType(), instead of the line outfile.write(re.sub(r'((?<=[a-z])[A-Z]|(?<!\A)[A-Z](?=[a-z]))', r' \1', l).lower()) I call create_system_type(re.sub(r'((?<=[a-z])[A-Z]|(?<!\A)[A-Z](?=[a-z]))', r' \1', l).lower()), so that each line is added to the database one entry at a time. – user2423706 Jun 20 '13 at 15:59
  • So it is more of a PostgreSQL question, and I'm out of my depth here. Have you tried a two-step process: parse the file into a sequence of commands in a separate file, then feed that file to the PostgreSQL process? That way you could at least verify that the input commands are correct. – Marty McGowan Jun 22 '13 at 14:19

If you can work out the precise data structure required by Django, you can load the database tables directly using the psql "copy" command (\copy in psql, or COPY in SQL). You could do this by preparing a CSV file to load into the database.
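
As a rough illustration of preparing such a CSV file, here is a minimal Python 3 sketch. It assumes the target table only needs a title and a slug; `simple_slug` is a hypothetical stand-in for Django's slugify, and the table name, column names and file name in the closing comment are assumptions, not taken from the question.

```python
import csv
import io
import re


def simple_slug(title):
    """Hypothetical stand-in for Django's slugify: lowercase, hyphen-separated."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def write_copy_csv(titles, out):
    """Write unique (title, slug) rows to the file-like object `out`.

    Returns the number of rows written.
    """
    writer = csv.writer(out, lineterminator="\n")
    seen = set()
    for raw in titles:
        title = raw.strip()
        if not title or title in seen:
            continue  # skip blank lines and duplicates
        seen.add(title)
        writer.writerow([title, simple_slug(title)])
    return len(seen)


if __name__ == "__main__":
    buf = io.StringIO()
    count = write_copy_csv(["actor\n", "greek tragedy\n", "actor\n"], buf)
    print(count)
    print(buf.getvalue())

# The resulting file could then be loaded from psql with something like
# (table, column and file names are assumptions):
#   \copy myapp_systemtype (title, slug) FROM 'systemtype.csv' WITH (FORMAT csv)
```

Deduplicating while writing the file means the database does not have to be queried for every line, which is where the script in the question spends much of its time.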

There are any number of reasons why loading is slow with your approach. First, Django has a lot of transactional overhead. Second, it is not clear how you are running the Django code: is it via the internal testing server? If so, you may have to deal with the slowness of that. Finally, what makes a database fast is not normally CPU, but rather fast I/O and plenty of memory.
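
To illustrate the point about per-row overhead, here is a minimal Python 3 sketch that reads the existing titles once and inserts all new rows in a single transaction. It uses the standard-library sqlite3 module purely so that it runs without a database server; it is not the Django or PostgreSQL API, and the table layout and the `load_titles` helper are assumptions. By contrast, create_system_type in the question re-reads every existing row on each call and saves one row at a time.

```python
import sqlite3


def load_titles(conn, titles):
    """Insert titles that are not already stored, in one transaction.

    Returns the number of rows inserted.
    """
    # Read the existing titles once, instead of once per input line.
    existing = {row[0] for row in conn.execute("SELECT title FROM systemtype")}
    new_rows = []
    for raw in titles:
        title = raw.strip()
        if title and title not in existing:
            existing.add(title)
            new_rows.append((title,))
    # The connection context manager commits once on success
    # and rolls back if an exception is raised.
    with conn:
        conn.executemany("INSERT INTO systemtype (title) VALUES (?)", new_rows)
    return len(new_rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE systemtype (id INTEGER PRIMARY KEY, title TEXT UNIQUE)"
    )
    print(load_titles(conn, ["actor", "writer", "actor"]))
    print(load_titles(conn, ["writer", "painter"]))
```

The same idea (one membership set, one batched write) should carry over to Django, though the exact calls depend on the Django version in use.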

rorycl