0

I have a large file (3*10^7 rows) of call detail records (CDRs) with 9 columns ("|" as delimiter). Each row is a communication instance with the following attributes:

Date|Time|Duration|Caller|Receiver|serviceType|junk|cellReceiver|cellCaller|CallerLAC

I need to split this file into smaller files, one per user. Each file should contain all the communication involving that user, regardless of whether the user is the caller or the receiver (i.e., if A called B, the row should appear in two files: the file of user A and the file of user B).

What would be the best way to do this efficiently? (I am using OS X Yosemite.)

amaatouq

2 Answers

2

bash and awk: I know you asked for Python in the title, but unless this is homework, the shell will suffice.

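awk -F '|' '{
               u1 = $4                      # caller
               u2 = $5                      # receiver
               arr[u1] = arr[u1] $0 "\n"
               if (u2 == u1) next           # self-call: store the row only once
               arr[u2] = arr[u2] $0 "\n"
           }
           END {
               for (i in arr) {
                   fname = i                # one output file per user
                   printf "%s", arr[i] > fname   # rows already end in "\n"
                   close(fname)
               }
           }' inputfile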

Named variables are used to make it more readable. Because every call between two different users is written twice, the output files can hold up to about 60 million lines in total, well over the 30 million in the input. I agree with the database suggestion. The script keeps every row in memory until the END block, so it will use a lot of memory; check the limits reported by ulimit first. Also watch the inode limit on your filesystem, since every user gets a separate file.
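Since the title asked for Python and memory is the main concern with the awk script above, here is a minimal Python sketch of the same split that flushes buffered rows to disk in batches instead of holding everything until the end. The function names, the `max_buffered_rows` threshold, and the output directory are illustrative assumptions, not part of the original answer. It also assumes that user IDs are safe to use as file names, that the output directory starts out empty (files are opened in append mode), and that the input has no header row.

```python
import os
from collections import defaultdict

CALLER_COL = 3    # 4th field, the same column as $4 in the awk script
RECEIVER_COL = 4  # 5th field, the same column as $5


def flush(buffers, out_dir):
    """Append each user's buffered rows to that user's file, then empty the buffers."""
    for user, rows in buffers.items():
        # Append mode, so later flushes keep extending the same file.
        with open(os.path.join(out_dir, user), "a") as out:
            out.writelines(rows)
    buffers.clear()


def split_by_user(input_path, out_dir, max_buffered_rows=1_000_000):
    """Write one file per user with every row where the user is caller or receiver."""
    os.makedirs(out_dir, exist_ok=True)
    buffers = defaultdict(list)
    buffered = 0
    with open(input_path) as src:
        for line in src:
            if not line.endswith("\n"):
                line += "\n"  # make sure the last row is terminated too
            fields = line.rstrip("\n").split("|")
            if len(fields) <= RECEIVER_COL:
                continue  # skip blank or malformed rows
            # A set collapses caller == receiver, so a self-call is stored once.
            for user in {fields[CALLER_COL], fields[RECEIVER_COL]}:
                buffers[user].append(line)
                buffered += 1
            if buffered >= max_buffered_rows:
                flush(buffers, out_dir)
                buffered = 0
    flush(buffers, out_dir)  # write whatever is left after the last row
```

A call such as `split_by_user("inputfile", "per_user")` would create one file per user under `per_user/`. Only one output file is open at a time, so the open-file limit is not an issue; the trade-off is that a smaller `max_buffered_rows` means more open/close cycles.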

jim mcnamara
1

Does it absolutely have to be separate files?

Since you did not tag a specific language: personally, I'd import it into an SQL database as pipe-delimited ('|') ASCII (assuming ASCII, since the question does not specify an encoding).
Advantages:

  1. Parsing is not your problem
  2. You can output it however you want
  3. Query the data in any way you want
  4. Complex queries are possible without writing code more complex than simple SQL SELECT statements
  5. Approach supported across almost any database or platform
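As a rough illustration of the database route, the sketch below loads the pipe-delimited file into SQLite using Python's standard `sqlite3` and `csv` modules and then pulls out every row that involves a given user. The table name `cdr`, the index names, and the helper functions are assumptions made for this example; the column names are taken from the header line in the question, and rows with a different number of fields are simply skipped.

```python
import csv
import sqlite3

# Column names copied from the header line in the question.
COLUMNS = ["Date", "Time", "Duration", "Caller", "Receiver", "serviceType",
           "junk", "cellReceiver", "cellCaller", "CallerLAC"]


def load_cdrs(db_path, input_path):
    """Import the pipe-delimited file into a table named cdr and index both user columns."""
    conn = sqlite3.connect(db_path)
    col_defs = ", ".join(f'"{name}" TEXT' for name in COLUMNS)
    conn.execute(f"CREATE TABLE IF NOT EXISTS cdr ({col_defs})")
    placeholders = ", ".join("?" for _ in COLUMNS)
    with open(input_path, newline="") as src:
        reader = csv.reader(src, delimiter="|", quoting=csv.QUOTE_NONE)
        # Keep only rows with the expected number of fields.
        rows = (row for row in reader if len(row) == len(COLUMNS))
        with conn:  # commits the whole import as one transaction
            conn.executemany(f"INSERT INTO cdr VALUES ({placeholders})", rows)
    # Indexes make the per-user lookups below fast.
    conn.execute('CREATE INDEX IF NOT EXISTS idx_caller ON cdr ("Caller")')
    conn.execute('CREATE INDEX IF NOT EXISTS idx_receiver ON cdr ("Receiver")')
    conn.commit()
    return conn


def rows_for_user(conn, user):
    """Return every row in which the user appears as caller or receiver."""
    query = 'SELECT * FROM cdr WHERE "Caller" = ? OR "Receiver" = ?'
    return conn.execute(query, (user, user)).fetchall()
```

For example, `rows_for_user(load_cdrs("cdr.db", "inputfile"), "A")` would return all of user A's records, which could then be written out in whatever format is needed.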
frasnian
  • Unfortunately, it absolutely has to be separate files (the system already in place expects one file per user). – amaatouq Dec 26 '14 at 22:45
  • ah, well, cancel that idea then! (leaving the answer, though, in case anyone else has a similar problem) – frasnian Dec 26 '14 at 22:46