
I have a script that extracts data from some CSV files and splits the data into separate Excel files. I am running it in IPython, and I am fairly sure it uses CPython as the default interpreter.

The script is taking too long to finish. Can someone please explain how to run it under PyPy, since I have heard PyPy is much faster than CPython? The script is something like this:

import pandas as pd
import xlsxwriter as xw
import csv
import pymsgbox as py

file1 = "vDashOpExel_Change_20150109.csv"
file2 = "vDashOpExel_T3Opened_20150109.csv"

path = r"C:\Users\Abhishek\Desktop\Pandas Anlaysis"   # raw string so the backslashes are not treated as escape sequences

def uniq(words):
    seen = set()
    for word in words:
        l = word.lower()
        if l in seen:
            continue
        seen.add(l)
        yield word
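For reference, the `uniq` generator does a case-insensitive de-duplication that keeps the first spelling it sees. A minimal, self-contained example (the helper is repeated here and the sample list is made up):

```python
def uniq(words):
    # Case-insensitive de-dup: keep the first spelling seen for each name
    seen = set()
    for word in words:
        l = word.lower()
        if l in seen:
            continue
        seen.add(l)
        yield word

clients = ["Acme", "ACME", "Beta", "beta", "Acme"]
print(list(uniq(clients)))  # ['Acme', 'Beta']
```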

def files(file_name):
    df = pd.read_csv( path + '\\' + file_name, sep=',', encoding = 'utf-16')

    final_frame = df.dropna(how='all')

    file_list = list(uniq(list(final_frame['DOEClient'])))

    return file_list, final_frame

def fill_data(f_list, frame1=None, frame2=None):
    if f_list is not None:
        for client in f_list:
            writer = pd.ExcelWriter(path + '\\' + 'Accounts' + '\\' + client + '.xlsx', engine='xlsxwriter')
            if frame1 is not None:
                data1 = frame1[frame1.DOEClient == client]                   # Filter the data
                data1.to_excel(writer, 'Change', index=False, header=True)   # Write it to the 'Change' sheet
            if frame2 is not None:
                data2 = frame2[frame2.DOEClient == client]                   # Filter the data
                data2.to_excel(writer, 'Opened', index=False, header=True)   # Write it to the 'Opened' sheet
            writer.save()   # Without this the workbook is never written to disk
    else:
        py.alert('Please enter the First Parameter !!!', 'Error')
list1, frame1 = files(file1)
list2, frame2 = files(file2)

final_list = set(list1 + list2)
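The per-client filtering inside `fill_data` is a plain boolean mask, which rescans the whole frame once per client. A standard pandas alternative (not part of the original post; the sample DataFrame below is made up) is `groupby`, which partitions the frame in a single pass:

```python
import pandas as pd

frame = pd.DataFrame({'DOEClient': ['Acme', 'Beta', 'Acme'],
                      'Value': [1, 2, 3]})

# Boolean mask, as fill_data() does it: one full scan per client
acme = frame[frame.DOEClient == 'Acme']
print(len(acme))  # 2

# groupby: one pass over the frame, yielding (client, sub-frame) pairs
for client, data in frame.groupby('DOEClient'):
    print(client, len(data))
```

With many clients and a large frame, replacing the repeated masks with a single `groupby` loop can cut the filtering cost noticeably, though the Excel writing itself is likely the dominant expense here.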
  • Abishek, Have you profiled the code at all? https://github.com/rkern/line_profiler – Michael WS Feb 09 '15 at 07:52
  • 1
    Note: this line: `file_list = list(uniq(list(final_frame['DOEClient'])))` can be replaced by `list(final_frame['DOEClient'].str.lower().unique())` this will be much faster than what you are doing – EdChum Feb 09 '15 at 10:04
  • Thanks Ed for the Suggestion :) – Abhishek Jain Feb 09 '15 at 13:53
  • Ed : I Guess the maximum time which is taken is in exporting the dataframe data to different excel sheet. Is there any faster way to do that because `to_excel()` is taking too much time I suppose. – Abhishek Jain Feb 09 '15 at 13:54
  • I Have 2.5 GB of data which I have to process using above code and export to the excel sheets. And it s taking nearly 2 hours for the process to finish. Please tell some faster way to do that. – Abhishek Jain Feb 09 '15 at 14:07
  • @EdChum : Hi Ed Can you please suggest me something on this please ? – Abhishek Jain Feb 10 '15 at 06:42
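EdChum's suggestion from the comments can be sketched as below. One caveat worth noting (the sample data is made up): `.str.lower().unique()` returns the lower-cased names, whereas the original `uniq()` keeps the first spelling it encounters.

```python
import pandas as pd

frame = pd.DataFrame({'DOEClient': ['Acme', 'ACME', 'Beta', None]})
final_frame = frame.dropna(how='all')

# Vectorized case-insensitive de-dup, per EdChum's comment.
# Note: this yields lower-cased names, unlike uniq().
clients = list(final_frame['DOEClient'].dropna().str.lower().unique())
print(clients)  # ['acme', 'beta']
```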

0 Answers