0

The code works properly upto entering for loop and the date values is fetched. after that it returns an empty list of values for rest of the variables like time, ref1, seriel and all.

import pandas as pd
import re

# Create a Dataframe from CSV
my_dataframe = pd.read_csv('C:/Users/WI/Desktop/file.csv')

# Drop rows with any empty cells
my_dataframe.dropna(axis=0, how='any', thresh=None, subset=['date'], inplace=False)

with open("C:/Users/WDSI/Desktop/OutputFile.txt", "w") as F:
    F.write("%s" %my_dataframe)  

fin = open("C:/Users/WDSI/Desktop/OutputFile.txt", "r")
# print("Input file is taken")
fout = open("C:/Users/WDSI/Desktop/OutputFile1.txt", "w")
# print("Output file is taken")
for line in fin:
    date = re.findall(r'(\d{4}-\d{2}-\d{2})', fin.read())
    time = re.findall(r'(\s\d{2}:\d{2}:\d{2})',fin.read())
    seriel=re.findall(r'(\s[A-Z][A-Z][A-Z][0-9])',fin.read())
    part=re.findall(r'(\s[0-9][0-9][0-9][A-Z][0-9][0-9][0-9][0-9][0-9])',fin.read())
    ref1=re.findall(r'(\s\d{16})',fin.read())
    ref3=re.findall(r'(\d{9})+$',fin.read())
    #print(date)
    #print(time)
    #print(seriel)
    #print(part)
    #print(ref1)
    #print(ref3)
    fout.write("%10s,%8s" %((date,time)))
    fout.close()

when we run this code only date variable gets the value other variables like time, ref1 and all goes empty. also please help me to write date,time,serial,part,ref1,ref3 from each row of csv file. in this format the output file should be written.

NiKS
  • 377
  • 3
  • 15
Sharvaa
  • 41
  • 9
  • As you use 'w' when write, your for loop is writing the first line and then immediately closed it. So what will happen to other lines? Your closing file method should sit out of the loop also you should use 'w+' or append after first loop.. – DataPsycho Nov 15 '19 at 09:28
  • Do you absolutely need to get data in this manner? or is it acceptable for you to get it from the dataframe? – Petru Tanas Nov 15 '19 at 09:31
  • dataframe is written into a file. from that i am trying to get it. Is there any other way to get it directly from dataframe? – Sharvaa Nov 15 '19 at 09:42

2 Answers2

1

You are reading line by line with the for line in fin but the first all your findall read the whole file content with fin.read().

You either process line by line (replace those fin.read() with line):

for line in fin:
    date = re.findall(r'(\d{4}-\d{2}-\d{2})', line)
    ...

Or read the whole file and remove the for:

content = f.read()
date = re.findall(r'(\d{4}-\d{2}-\d{2})', content)
...
luis.parravicini
  • 1,214
  • 11
  • 19
  • Thanks luis. Even after removing the for loop line from the code i get the same problem. – Sharvaa Nov 15 '19 at 09:39
  • or part works luis, thank you. and please help me to get only 9 digit number from the file. where my ref3 line fetches another 9digit values too from 16 digits – Sharvaa Nov 15 '19 at 09:46
  • Now i fetched all the values at once. Now i just wanna write them into a file , i tried with #fout.write("%10s,%8s" %((date,time,seriel,part,ref1,ref2))) #fout.write("'{0}','{1}','{2}','{3}','{4}','{5}'" .format((date,time,seriel,part,ref1,ref2))) but both doesnt work – Sharvaa Nov 15 '19 at 09:48
0

It is not exact replica of your solution but that how you can open a file and take whatever you need from each line then write the new data to a new file.

I have catered a csv file with the following lines:

This is a date 2019-08-05, 2019-09-03
This is a email asdfasdf@abc.com

Solution 1:

with open("./Datalake/output.txt", "w+") as wf:
    with open("./Datalake/test.csv") as f:
        for line in f:
            dates = re.findall(r"\d{4}-\d{1,2}-\d{1,2}", line)
            dates = "|".join(dates)
            emails = re.findall(r'[\w\.-]+@[\w\.-]+', line)
            emails = "|".join(emails)
            extracted_line = "{}, {}\n".format(dates, emails)
            wf.write(extracted_line)
            print(extracted_line)

Solution 2:

You can extract directly from data frame. Apply the same search using a lambda function which will execute for each row. But be careful you might need some error handling lambda function will through error if there is None value in the column. Convert the column to str before applying lambda.

df = pd.read_csv("./Datalake/test.csv", sep='\n', header=None, names=["string_col"])
df['dates'] = df["string_col"].apply(lambda x: re.findall(r"\d{4}-\d{1,2}-\d{1,2}", x))
df['emails'] = df["string_col"].apply(lambda x: re.findall(r"[\w\.-]+@[\w\.-]+", x))

In that case the calculated column will a python list so you might consider to use ''.join() in the lambda to make them text.

DataPsycho
  • 958
  • 1
  • 8
  • 28
  • what if we want to check in the first value of each row should contain something and 2nd value of row shouldnt empty? before fetching by using regex – Sharvaa Nov 15 '19 at 11:38
  • Can you give an example with data. I do not understand what you are trying to achieve. – DataPsycho Nov 15 '19 at 15:16
  • 1st ROW---> Name Date seriel ref1 ref2 2nd ROW--->ABC 2019-10-14 aaa 161651 151111 3rd ROW--->BBG EMPTY hju 656565 565626 4th ROW--->ABC 2019-12-09 jjj 656566 232323 5th ROW--->ABC EMPTY kjjfk 986547 321455 in this just want to fetch only ABC row , but if the date is missing in the relevant row we have to omit that row. hope u have got it now. – Sharvaa Nov 16 '19 at 04:22
  • i just wanna add the condition with your solution1 – Sharvaa Nov 16 '19 at 04:30
  • @sharvaa after extracting date you can put a if else statement before to use the join function with date: `if len(date) > 0 and 'abc' in line: ...` and put rest of the code under the if `else pass` . The `if` will go the the second line in the for loop. – DataPsycho Nov 26 '19 at 16:49