process a text file using various delimiters

Question

My text file (unfortunately) looks like this...

<amar>[amar-1000#Fem$$$_Y](1){india|1000#Fem$$$,mumbai|1000#Mas$$$}
<akbar>[akbar-1000#Fem$$$_Y](1){}
<john>[-0000#$$$_N](0){USA|0100#$avi$$,NJ|0100#$avi$$}

It contains the customer name followed by some information. The sequence is...

a text string followed by a list, a set and then a dictionary

<> [] () {}

The file is not valid Python, so the data cannot be read in as-is. I want to process the file and extract some information.

amar 1000 | 1000 | 1000
akbar 1000
john 0000 | 0100 | 0100

1) The name between <>

2) The number between - and # in the list

3 & 4) Split the dictionary on commas and take the numbers between | and # (there can be more than 2 entries here)

I am open to using any tool best suited for this task.

Martin Evans · Accepted Answer · 2015-08-14T09:21:43.430

3

The following Python script will read your text file and give you the desired results:

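import re, itertools

with open("input.txt", "r") as f_input:
    for line in f_input:
        # Group 1: name in <>, group 2: contents of [], group 3: contents of {}
        reLine = re.match(r"<(\w+)>\[(.*?)\].*?\{(.*?)\}", line)
        if reLine is None:
            continue  # skip blank or malformed lines
        # Collect every run of digits from the [] and {} groups
        lNumbers = [re.findall(r".*?(\d+).*?", entry) for entry in reLine.groups()[1:]]
        lNumbers = list(itertools.chain.from_iterable(lNumbers))
        print(reLine.group(1), " | ".join(lNumbers))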

This prints the following output:

amar 1000 | 1000 | 1000
akbar 1000
john 0000 | 0100 | 0100
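
To make it clearer what the pattern captures, here is a minimal sketch that applies the same regular expression to a single line from the question and prints the three groups. The variable names (`sample`, `match`, `list_part`, `dict_part`) are only for illustration, and the opening brace is escaped as a precaution.

```python
import re

# One line copied from the question
sample = "<amar>[amar-1000#Fem$$$_Y](1){india|1000#Fem$$$,mumbai|1000#Mas$$$}"

# Group 1: name inside <>, group 2: contents of [], group 3: contents of {}
match = re.match(r"<(\w+)>\[(.*?)\].*?\{(.*?)\}", sample)
name, list_part, dict_part = match.groups()

print(name)       # amar
print(list_part)  # amar-1000#Fem$$$_Y
print(dict_part)  # india|1000#Fem$$$,mumbai|1000#Mas$$$

# Every run of digits found in groups 2 and 3
numbers = re.findall(r"\d+", list_part) + re.findall(r"\d+", dict_part)
print(name, " | ".join(numbers))  # amar 1000 | 1000 | 1000
```

Because the script collects every run of digits in the two groups, a name or place containing digits would also be picked up; that is an assumption about the data worth checking.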

edited Aug 14 '15 at 09:21

answered Aug 14 '15 at 09:11

sed/ shell is good. awk is awesome. But I am accepting this solution because the actual data is more complicated and I can manage it only after splitting data using re shown here. – shantanuo Aug 14 '15 at 13:02

score 3 · Answer 2 · answered Aug 14 '15 at 09:42

As the grammar is quite complex, you might find a proper parser the best solution.

#!/usr/bin/env python

import fileinput
from pyparsing import Word, Regex, Optional, Suppress, ZeroOrMore, alphas, nums

# <name>
name = Suppress('<') + Word(alphas) + Suppress('>')
# [name-NNNN#...]: keep only the number between - and #
reclist = Suppress('[' + Optional(Word(alphas)) + '-') + Word(nums) + Suppress(Regex("[^]]+]"))
# (N): parsed but discarded
digit = Suppress('(' + Word(nums) + ')')
dictStart = Suppress('{')
# key|NNNN#...: keep only the number between | and #
dictVals = Suppress(Word(alphas) + '|') + Word(nums) + Suppress('#' + Regex('[^,}]+') + Optional(','))
dictEnd = Suppress('}')

parser = name + reclist + digit + dictStart + ZeroOrMore(dictVals) + dictEnd

for line in fileinput.input():
    print(' | '.join(parser.parseString(line)))

This solution uses the pyparsing library, and running it produces:

$ python parse.py file
amar | 1000 | 1000 | 1000
akbar | 1000
john | 0000 | 0100 | 0100

score 2 · Answer 3 · answered Aug 14 '15 at 08:51

You can add all delimiters to the FS variable in awk and count fields, like:

awk -F'[<>#|-]' '{ print $2, $4, $6, $8 }' infile

In case you have more than two entries between curly braces, you could use a loop to traverse all fields until the last one, like:

awk -F'[<>#|-]' '{ 
    printf "%s %s ", $2, $4
    for (i = 6; i <= NF; i += 2) { 
        printf "%s ", $i 
    }
    printf "\n" 
}' infile

Both commands yield the same results:

amar 1000 1000 1000 
akbar 1000 
john 0000 0100 0100
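
For comparison, the same field-splitting idea can be sketched in Python with `re.split`, using the same delimiter set as the awk `FS` above. The helper name `split_fields` is hypothetical; note that awk fields are 1-based while Python lists are 0-based.

```python
import re

def split_fields(line):
    """Hypothetical helper: mimic awk -F'[<>#|-]' on one line."""
    fields = re.split(r"[<>#|-]", line.rstrip("\n"))
    # awk's $2 and $4 are fields[1] and fields[3]; $6, $8, ... are fields[5::2]
    return [fields[1], fields[3]] + fields[5::2]

lines = [
    "<amar>[amar-1000#Fem$$$_Y](1){india|1000#Fem$$$,mumbai|1000#Mas$$$}",
    "<akbar>[akbar-1000#Fem$$$_Y](1){}",
    "<john>[-0000#$$$_N](0){USA|0100#$avi$$,NJ|0100#$avi$$}",
]
for line in lines:
    print(" ".join(split_fields(line)))
```

As with the awk version, this relies on field positions, so a stray delimiter (for example a hyphen inside a name) would shift the fields.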

The6thSense · Answer 4 · 2015-08-14T09:14:13.487

You could use a regex to capture the fields.

Sample:

import re

a = "<john>[-0000#$$$_N](0){USA|0100#$avi$$,NJ|0100#$avi$$}"
name = " ".join(re.findall(r"<(\w+)>[\s\S]+?-(\d+)#", a)[0])
others = re.findall(r"\|(\d+)#", a)
# join everything at once so a line with no | entries still prints the name
print(" | ".join([name] + others))

Output:

john 0000 | 0100 | 0100

Full code:

import re

with open("input.txt", "r") as inp:
    for line in inp:
        found = re.findall(r"<(\w+)>[\s\S]+?-(\d+)#", line)
        if not found:
            continue  # skip lines that do not match the format
        name = " ".join(found[0])
        others = re.findall(r"\|(\d+)#", line)
        print(" | ".join([name] + others))
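
One case worth checking is a line whose braces are empty, such as the akbar line in the question: the second `findall` then returns an empty list, and the name and first number should still be printed. A minimal sketch of that check, with a hypothetical helper name `summarize`:

```python
import re

def summarize(line):
    """Hypothetical helper: return 'name number | n1 | n2 ...' for one line."""
    found = re.findall(r"<(\w+)>[\s\S]+?-(\d+)#", line)
    if not found:
        return None  # line does not follow the expected layout
    name = " ".join(found[0])
    others = re.findall(r"\|(\d+)#", line)
    # Joining a single list avoids a dangling separator when others is empty
    return " | ".join([name] + others)

print(summarize("<akbar>[akbar-1000#Fem$$$_Y](1){}"))
print(summarize("<john>[-0000#$$$_N](0){USA|0100#$avi$$,NJ|0100#$avi$$}"))
```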

Bertrand Martel · Answer 5 · 2015-08-14T09:27:01.653

For one line of your file:

test='<amar>[amar-1000#Fem$$$_Y](1){india|1000#Fem$$$,mumbai|1000#Mas$$$}'

Remove the < character and everything from > onward to get the name:

echo "$test" | sed -e 's/<//g' | sed -e 's/>.*//g'

Get all four-digit sequences:

echo "$test" | grep -o '[0-9]\{4\}'

Replace the spaces with your favorite separator:

sed -e 's/ /|/g'

Put together:

echo $(echo "$test" | sed -e 's/<//g' | sed -e 's/>.*//g') $(echo "$test" | grep -o '[0-9]\{4\}') | sed -e 's/ /|/g'

This outputs:

amar|1000|1000|1000
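
For readers who prefer to stay in one language, the same two steps (strip the name, collect the four-digit runs) can be sketched in Python. This is only an illustration of the pipeline above, not a replacement for it; the variable names are assumptions.

```python
import re

test = "<amar>[amar-1000#Fem$$$_Y](1){india|1000#Fem$$$,mumbai|1000#Mas$$$}"

# Like the two sed calls: drop "<", then drop everything from ">" onward
name = test.replace("<", "").split(">", 1)[0]

# Like grep -o '[0-9]\{4\}': each non-overlapping match of four consecutive digits
numbers = re.findall(r"[0-9]{4}", test)

result = "|".join([name] + numbers)
print(result)  # amar|1000|1000|1000
```

Like the grep version, this picks up any four consecutive digits anywhere on the line, so it assumes the numbers of interest are the only ones of that length.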

A quick script does this for the whole file: your_script.sh input_file output_file

#!/bin/bash

IFS=$'\n' # split on newlines only

# empty the output file
cp /dev/null "$2"

for i in $(cat "$1"); do
    newline=$(echo $(echo "$i" | sed -e 's/<//g' | sed -e 's/>.*//g') $(echo "$i" | grep -o '[0-9]\{4\}') | sed -e 's/ /|/g')
    echo "$newline" >> "$2"
done

cat "$2"
