
I am a biologist and very, very new to Python; before this, I learnt a bit of R.

So I have a very big text file (3 GB, too big to handle in R). All values are comma separated, but the extension is .txt (I don't know if that information is necessary). What I want to do is:

1. read it into Python as an object equivalent to a data frame in R,
2. get rid of the columns in the middle to reduce the size of the object,
3. write it out as a txt file,
4. take the rest to R.

If you can help me I would be very happy. Thank you.

  • I recommend the [CSV module](http://docs.python.org/2/library/csv.html). – GreenMatt Feb 20 '13 at 15:40
  • To me this looks more like a job for `perl` or even `sed`... hard to tell without seeing at least one line and understanding exactly what the rules are for removing internal columns... – 6502 Feb 20 '13 at 15:43
  • Perhaps `read.csv.sql` from the `sqldf` package in R might be useful: http://code.google.com/p/sqldf/. You can pull out only the required fields from a csv using SQL. I've had some luck with large files, but not as large as you have. – James Feb 20 '13 at 15:48
  • Or the unix command line: `cut -f 1-3,8-12 -d, < bigfile.txt > smallerfile.txt` (probably fails if you have commas in quotes though) – Spacedman Feb 20 '13 at 16:05

5 Answers

3

There is no real need to go into Python first. Your question looks a lot like this question. The answer marked as correct iteratively reads the large file and creates a new, smaller file. Other good alternatives are using sqlite with the sqldf package, or using the ff package. This last approach works particularly well if the number of columns is small compared to the number of rows.

Paul Hiemstra
2

This will take minimal memory as it does not load the whole file at once.

import csv
with open('in.txt', 'rb') as f_in, open('out.csv', 'wb') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    for row in reader:
        # keep first two columns and last three columns
        writer.writerow(row[:2] + row[-3:])

Note: If using Python 3 change the file modes to 'r' and 'w', respectively.
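
If the unwanted columns sit in the middle rather than at the ends, the same pattern works with a filtering list comprehension. A minimal sketch, assuming (hypothetically) that the zero-based column indices 2-4 are the ones to drop:

import csv

# hypothetical zero-based indices of the middle columns to drop
drop = set([2, 3, 4])
with open('in.txt', 'rb') as f_in, open('out.csv', 'wb') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    for row in reader:
        # keep every column whose index is not in the drop set
        writer.writerow([value for i, value in enumerate(row) if i not in drop])

The same Python 3 note about file modes (and `newline=''`, see the comment below) applies here.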

Steven Rumbalski
  • For Python 3 you also have to add `newline=''` for the output file. – Voo Feb 20 '13 at 17:18
  • Thank you for your answer. I am using Python 2.7 with the EPD Free distribution on Mac OS X 10.6. This gave me a syntax error on the last line; I don't know what is wrong with it, though. Here is what it says: File "", line 5 writer.writerow(row[:2] + [-3:]) ^ – user2091290 Feb 21 '13 at 10:31
  • @user2091290: Whoops. That should have been `writer.writerow(row[:2] + row[-3:])`. I forgot the second reference to the row. – Steven Rumbalski Feb 21 '13 at 18:28
1

I am not familiar with the R data frame, but pandas provides helpers to read a csv into a pandas DataFrame:

from pandas import read_csv

df = read_csv('yourfile.txt')
print df            # the whole DataFrame
print df['Line']    # a single column, assuming a column named 'Line'

If that is not what you need, you can use the csv module to iterate through each line of your csv as a Python list and put it into whatever data structure you want.
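
Given the 3 GB file, note that read_csv with default options will try to hold everything in memory at once (see the comment below). Two read_csv parameters help here: usecols to load only the columns you keep, and chunksize to process the file in pieces. A minimal sketch; the column names 'id' and 'value' are made-up placeholders:

from pandas import read_csv

# load only the columns you intend to keep ('id' and 'value' are hypothetical names)
df = read_csv('yourfile.txt', usecols=['id', 'value'])

# or stream the file in pieces so the whole thing is never in memory at once
for chunk in read_csv('yourfile.txt', chunksize=100000):
    # append each piece to the output file (header handling omitted for brevity)
    chunk[['id', 'value']].to_csv('out.txt', mode='a', header=False, index=False)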

dm03514
  • Reading the whole file without iterating will probably also use too much memory. – Paul Hiemstra Feb 20 '13 at 15:46
  • Thank you for your answer. I use Mac OS 10.6, and I recently solved my compatibility problems between the OS, Python version, and modules by installing the EPD Free distribution of several packages + Python 2.7. I don't know if pandas is compatible with what I have; I will check now. – user2091290 Feb 21 '13 at 10:38
0

Per CRAN (new features and bug fixes re: development), the new development build 3.0.0 should allow R to use the pagefile/swap. On Windows you will need to set R_MAX_MEM_SIZE to a suitably large value.

russellpierce
  • This general CRAN link is not really helpful, could you provide a more concrete link? – Paul Hiemstra Feb 20 '13 at 16:31
  • My OS is Mac OS X 10.6.8. I tried to load a simplified version of that text file (~400 MB) and R froze. – user2091290 Feb 21 '13 at 09:31
  • What I was recommending was the development build of R. It isn't a 'stable' release, so some packages might not be available for it yet. However, it seems like it should be able to load your initial file, you can modify it, then save it back out as a CSV, then load it back into a stable version of R. Then you don't need to learn new skills. Of course, this is all speculation on my part based on what they claim 3.0.0 can do. The link for OS X is here: http://r.research.att.com/R-devel-leopard.pkg – russellpierce Feb 21 '13 at 11:50
  • Er... but I don't know how the Mac releases work. What is above is listed as a framework... they also list separate downloads for the GUI. 64 bit http://r.research.att.com/R-GUI-6443-3.0-leopard-Leopard64.dmg and 32 bit http://r.research.att.com/R-GUI-6443-3.0-leopard-Leopard.dmg. All of this is available at http://r.research.att.com/. Perhaps a Mac native can decipher. – russellpierce Feb 21 '13 at 11:53
0

If you insist on using a preprocessing step, the linux command line tools are a really good and fast option. If you use Linux, these tools are already installed; under Windows you'll need to first install MinGW or Cygwin. This SO question already provides some nice pointers. In essence you use the awk tool to iteratively process the text file, creating an output text file as you go. Copying from the accepted answer of the SO question I linked:

awk -F "," '{ split ($8,array," "); sub ("\"","",array[1]); sub (NR,"",$0); sub (",","",$0); print $0 > array[1] }' file.txt 

This reads the file, grabs the eighth column, and uses it to determine which file each line gets dumped to. See the linked answer for more details.

Paul Hiemstra