0

I have a data set with various Region map variables(around 1000). Sample data looks like:

Userid   regionmap1 regionmap2 regionmap3      and so on.
 78       7           na         na
 45       na          na         na
 67       1           na         na

Here the number in regionmap variables represent the number of views. Now I have an external file with only 10 region map entries. The file contains 10 entries/rows with 10 different region map variables:

Regionmap1
Regionmap3
Regionmap7
.....
.....
Regionmap856.

So my task is to keep only these regionmap variables as columns in the original file and delete all the other 990 columns. So the final data should look like:

Userid  Regionmap1 regionmap3 regionmap7   ........    regionmap856
78       7          na          na                       na
45       na         na          na                       na
67       1          na          na                       na

It would be great if anyone can provide me help in this regard in Python.

LeonH
  • 1,039
  • 2
  • 22
  • 45
user2007506
  • 79
  • 2
  • 8

3 Answers3

0

This is pretty easy to do. What have you tried?

Here's a general procedure to help you get started: 1 - open the smaller file w/ the regionmaps you want to keep and readline those into a list. 2 - open the larger file and create a dictionary of lists to contain the data. You can think of the dict's keys as basically column headers. The values are lists that represent the column values for all your records. 3 - now, remove kvps from your dict where the key is not in your list from step 1 or is not userid. 4 - use resulting dict to write out a new file.

Definitely not the only approach, buts it's a simple one that you should be able to start with. Hope that helps :)

Sean Azlin
  • 886
  • 7
  • 21
  • Thanks for the detailed instructions. Can you provide the code for the dictionary part? I mean how we can assign the values as lists that represent the column values for all the records? – user2007506 Aug 27 '14 at 09:45
  • MyDict['regionmap4'] = [2645, 8593, 3964, 3927]. Then, to get record one's regionmap4 value, you'd use MyDict['regionmap4'][0]. For the second record, use MyDict['regionmap4'][1], etc. – Sean Azlin Aug 27 '14 at 19:12
0

I have a solution adapted for your problem. You can perform to make the file look better.

import StringIO
import numpy as np

# Preparing an object that simulates a file (f is the file)
f = StringIO.StringIO()
f.write("""Userid   regionmap1 regionmap2 regionmap3
  78       7           na         na
  45       na          na         na
  67       1           na         na""")
f.seek(0)

# Reading file and getting the header (1st line)
head = f.readline().strip("\n").split()
data = []
for a in f:
    data.append([float(e) for e in a.replace('na', 'NaN').split()])
#
data = np.array(data)

# Columns to keep
s = ("Regionmap1", "Regionmap3")
s = map(lambda e: e.lower(), s)
s = ["Userid",] + s

# Index of the columns to keep
idx, = np.where([e in s for e in head])

# Saving the new data in a file (simulated with StringIO)
ff = StringIO.StringIO()
ff.write(' '.join(tuple(s)) + '\n')
np.savetxt(ff, data[:, idx])

The rendered file looks like:

Userid regionmap1 regionmap3
7.800000000000000000e+01 7.000000000000000000e+00 nan
4.500000000000000000e+01 nan nan
6.700000000000000000e+01 1.000000000000000000e+00 nan
Taha
  • 709
  • 5
  • 10
  • Doesn't this read the whole file into memory? I'm sure you could do the read and write in the same loop. – Holloway Aug 27 '14 at 09:56
  • As I pointed out in my question. I have about 1000 columns. So in the f.write function I would not supply all the 1000 columns. I am new to Python. So there can we pass the entire data set? – user2007506 Aug 27 '14 at 10:03
  • A table of 1000x1000 can be easily handled with an average computer. Indeed, the reading and writing can be done in the same loop. The given code contains all the elements needed to do so. I can add such a solution if needed. – Taha Aug 27 '14 at 11:16
0

Try dis! Dis code is to form the dictionary with headers as key and the list of column values as values

    f = open('2.txt', 'r') #opening the large file
    data = f.readlines()
    f.close()

    hdrs = data[0].split('\t') #assuming that large file is tab separated, and the first line is header line
    data_dict = {} #main data
    for each_line in data[1:]: #starting from second line as the first line is header line
        splitdata = each_line.split('\t') #splitting the line with tab
        for i, d in enumerate(splitdata): 
            tmpval = data_dict.get(hdrs[i], []) 
            tmpval.append(d)
            data_dict[hdrs[i]] = tmpval        #appending the column value for its respective header

    for k, v in data_dict.items(): #printing the final data dict
         print k, v