2

I'm confronted with a problem as below and hoping some body could give some advice.
I need to convert a lot of excel tables in different shapes into constructed data, the excel tables are as below.

|--------------------|----|----|
|user:Sam            |    |    |
|--------------------|----|----|  
|mail:sam@example.com|    |    |  


|-------|----------------|-----|
|user   |Jack            |     |
|-------|----------------|-----|
|mail   |jack@example.com|     | 
|-------|----------------|-----|


|-------|-----|---------------|---------|
|user   |May  |               |         |
|-------|-----|---------------|---------|
|       |mail |may@example.com|         |
|-------|-----|---------------|---------|


|user   | Alex  |mail  |alex@example.com|

The target result would be like the following format.

|-------|-------------------|
|user   | email             |
|-------|-------------------|
|Jack   | jack@example.com  |
|-------|-------------------|
|Sam    | sam@example.com   |
|-------|-------------------|
|Alex   | alex@example.com  |
|-------|-------------------|
|May    | may@example.com   |
|-------|-------------------|

My current solution is to define a function for each type of excel table. But there would be thousands of different excel files so I would have to repeat write similar code. So my question is whether there is common solution for it.

I found one similar question about this but there is no more information.I think machine learning may help to solve the problem, but I know little about that. Is there any one who could share some thoughts? Thanks very much!

LuFFy
  • 8,799
  • 10
  • 41
  • 59
zhangjpn
  • 355
  • 1
  • 3
  • 6

2 Answers2

1

Looking at the patterns you have provided in your question we see that the data is sometimes in a separate cell, other times encoded in the text with a ':' separator. I'd flatten it out and parse the assembled text for a linear pattern.

I suggest you read the excel file using something like xlrd. Then step through the cells pulling out the text and parse out the fields you are interested in.

<cell>'user'<cell|':'>user_name<cell>'mail'<cell|':'>email_address<cell>

where <cell> is one or more cell boundaries, possibly spread over rows.

Once you have the user email pairs you can write them out using xlwt.

Mike Robins
  • 1,733
  • 10
  • 14
-1

You have 4 types of files.

If that is all you can write 1 function with 4 if statements.

def table_sort(file):

    If file == condition:
         extract_data_this_way 

   elif file == other_condition:
        extract_data_this_way

   elif file == other_condition:
        extract_data_this_way

   else:
        extract_data_this_way

If you use pandas to do this it will make it a lot easier to code.

I'd you have a lot of files. You can pass in a list and use a for loop to iterate. Or use glob to load all excel files in a directory and loop that way .

johnashu
  • 2,167
  • 4
  • 19
  • 44
  • Thist would only fit the 4 example formats. The OP stated that he would have to deal with many more different formats ... – Peter Pesch Dec 11 '17 at 08:43
  • He said there would be thousands of files and presented what the 4 types were. My solution was adequate for the question. – johnashu Dec 11 '17 at 10:10
  • 1
    Thanks for your advice and sorry for my poor description of the question. The condition is that there are many more formats that I don't know for sure. And what's more complicated is that there are other fields in the same table to extract. – zhangjpn Dec 12 '17 at 01:19
  • It's no problem. It only makes the question more interesting. – johnashu Dec 12 '17 at 08:18