2

I am having difficulty formatting the output of some Python code. Here is my code:

keys = ['(Lag)=(\d+\.?\d*)','\t','(Autocorrelation Index): (\d+\.?\d*)',       '(Autocorrelation Index): (\d+\.?\d*)',     '(Semivariance): (\d+\.?\d*)']

import re
string1 = ''.join(open("dummy.txt").readlines())
found = []
for key in keys:
    found.extend(re.findall(key, string1))
for result in found:
    print '%s  =  %s' % (result[0],result[1])
raw_input()

So far, I am getting this output:

Lag = 1

Lag = 2

Lag = 3

Autocorrelation Index = #value

......

......

Semivariance = #value

But the output I want is:

 Lag        AutoCorrelation Index   AutoCorrelation Index   Semivariance
  1              #value                   #value               #value
  2              #value                   #value               #value
  3              #value                   #value               #value

If this output could be written to a CSV file or a txt file, that would be great!

I think this comes down to how the output is produced inside the loops, but I am not that good with loops.

My updated code (OLD version)

based on @mutzmatron's answer

keys = ['(Lag)=(\d+\.?\d*)',
    '(Autocorrelation Index): (\d+\.?\d*)',
    '(Semivariance): (\d+\.?\d*)']

import re
string1 = open("dummy.txt").readlines().join()
found = []
for key in keys:
    found.extend(re.findall(key, string1))
raw_input()
for result in found:
    print '%s  =  %s' % (result[0], result[1])

raw_input()

It is not running yet! I am using IDLE with Python 2.6, and I don't know what the error messages are, since I don't know how to pause the command prompt before it closes.

Original Question

I am totally new to Python and have a question. I am trying to process a large text file. Here is just a snippet of it:

Band: WDRVI20((0.2*b4-b3)/((0.2*b4)+b3))
Basic Statistics:
  Min: -0.963805
  Max: 0.658219
  Mean: 0.094306
  Standard Deviation: 0.131797
Spatial Statistics, ***Lag=1***:
  Total Number of Observations (Pixels): 769995
  Number of Neighboring Pairs: 1538146
  Moran's I:
    ***Autocorrelation Index: 0.8482564597***
    Expected Value, if band is uncorrelated: -0.000001
    Standard Deviation of Expected Value (Normalized): 0.000806
    Standard Deviation of Expected Value (Randomized): 0.000806
    Z Significance Test (Normalized): 1052.029088
    Z Significance Test (Randomized): 1052.034915
  Geary's C:
    ***Autocorrelation Index: 0.1517324729***
    Expected Value, if band is uncorrelated: 1.000000
    Standard Deviation of Expected Value (Normalized): 0.000807
    Standard Deviation of Expected Value (Randomized): 0.000809
    Z Significance Test (Normalized): 1051.414163
    Z Significance Test (Randomized): 1048.752451
  ***Semivariance: 0.0026356529***
Spatial Statistics, Lag=2:
  Total Number of Observations (Pixels): 769995
  Number of Neighboring Pairs: 3068924
  Moran's I:
    Autocorrelation Index: 0.6230691635
    Expected Value, if band is uncorrelated: -0.000001
    Standard Deviation of Expected Value (Normalized): 0.000571
    Standard Deviation of Expected Value (Randomized): 0.000571
    Z Significance Test (Normalized): 1091.521976
    Z Significance Test (Randomized): 1091.528022
  Geary's C:
    Autocorrelation Index: 0.3769372504
    Expected Value, if band is uncorrelated: 1.000000
    Standard Deviation of Expected Value (Normalized): 0.000574
    Standard Deviation of Expected Value (Randomized): 0.000587
    Z Significance Test (Normalized): 1085.700399
    Z Significance Test (Randomized): 1061.931158
  Semivariance: 0.0065475488

I need to extract the values marked with the stars (***), e.g. the Autocorrelation Index and Semivariance values, and process them, perhaps writing them to a different text file or an Excel file. Can I do that? Help would be really appreciated.

dsinha
  • Sorry! I edited the information again; it's the values between the three stars (***). – dsinha Jul 31 '12 at 15:20
  • So far I have only figured out how to separate the text into chunks, each word separately, using split. But I cannot use that to get any information, as I don't know how to search for and keep a whole value, e.g. "Autocorrelation Index: 0.23423". – dsinha Jul 31 '12 at 15:23
  • I have cleaned up the question and added the code at the top. I am sure that this will work... Maybe it is an issue of indentation. The code pasted by OP had two missing indents. – daedalus Jul 31 '12 at 16:26
  • Thanks @gauden, so did you fix my indentation? Can that be a problem in Python? – dsinha Jul 31 '12 at 16:28
  • @dsinha: Indentation is **everything** in python - it's how python determines scopes for function definitions, loops, etc etc. See http://docs.python.org/reference/lexical_analysis.html#indentation – jmetz Jul 31 '12 at 16:33

2 Answers

1

Populate a list of keys (regular expressions) you want to find. For example,

keys = ['(Lag)=(\d+\.?\d*)',
        '(Autocorrelation Index): (\d+\.?\d*)',
        '(Semivariance): (\d+\.?\d*)']

Then search the file contents for each key using `re.findall`:

import re
string1 = ''.join(open(FILE).readlines())
found = []
for key in keys:
    found.extend(re.findall(key, string1))

for result in found:
    print '%s  =  %s' % (result[0], result[1])

You should then have a list of the entries you want, with which you can do what you need to next!

Result:

Lag  =  1
Autocorrelation Index  =  0.8482564597
Autocorrelation Index  =  0.1517324729
Semivariance  =  0.0026356529
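
To make the shape of `found` concrete, here is a minimal self-contained sketch. It assumes an inline `sample` string (hypothetical, cut down from the question's data) in place of the file, and it uses the `print()` function so it runs on Python 3; the thread itself uses Python 2.6. Because each pattern has two capture groups, `re.findall` returns a list of `(name, value)` tuples.

```python
import re

# Inline sample standing in for the input file (assumption: same layout as the question's data).
sample = """Spatial Statistics, Lag=1:
  Moran's I:
    Autocorrelation Index: 0.8482564597
  Geary's C:
    Autocorrelation Index: 0.1517324729
  Semivariance: 0.0026356529
"""

keys = [r'(Lag)=(\d+\.?\d*)',
        r'(Autocorrelation Index): (\d+\.?\d*)',
        r'(Semivariance): (\d+\.?\d*)']

found = []
for key in keys:
    # Two capture groups per pattern, so every match is a (name, value) tuple.
    found.extend(re.findall(key, sample))

for name, value in found:
    print('%s  =  %s' % (name, value))
```

Note that `\d+\.?\d*` does not match a leading minus sign, so a negative value would be captured without its sign; a pattern such as `-?\d+\.?\d*` is one way to allow for that.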

CSV

To output to CSV, use the csv module:

import csv
# Python 2: open in binary mode so the csv module does not write blank rows on Windows
outfile = open('fileout.csv', 'wb')
wrt = csv.writer(outfile)
wrt.writerows(found)
outfile.close()
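
For readers on Python 3 (an assumption; this thread uses Python 2.6), the file is opened in text mode with `newline=''` rather than in binary mode. The sketch below uses a hypothetical hard-coded `found` list in the shape that `re.findall` returns for the keys above, so it can run on its own.

```python
import csv

# Hypothetical sample of (name, value) pairs, in the shape re.findall returns for the keys above.
found = [('Lag', '1'),
         ('Autocorrelation Index', '0.8482564597'),
         ('Autocorrelation Index', '0.1517324729'),
         ('Semivariance', '0.0026356529')]

# newline='' lets the csv module control line endings (avoids blank rows on Windows).
with open('fileout.csv', 'w', newline='') as outfile:
    wrt = csv.writer(outfile)
    wrt.writerows(found)  # one "name,value" row per match
```

The `with` block closes the file even if writing fails, so no explicit `close()` call is needed.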
Peter Mortensen
jmetz
  • i copied what you had to see if it worked. but it's giving invalid syntax error !! – dsinha Jul 31 '12 at 15:31
  • this is what i had in the entire file !!! raw_input() keys = ['Lag=\d+.\d+', 'Autocorrelation Index: \d+.\d+', 'Semivariance: \d+.\d+'] string1 = open("dummy.txt").readlines().join() found[] for key in keys: found.extend(re.findall(key,string1) – dsinha Jul 31 '12 at 15:32
  • @gauden: cheers and +1 for separating the variable names from the values! – jmetz Jul 31 '12 at 15:37
  • wow thanx a lot guys , this is such an active community, really appreciate the help. both @mutzmatron and gauden .. *thumbs up!!** one quick question , i should be able to do this for a large no. of information ryt ?? eg : I have around 100 of these blocks of data that I pasted and I would be needing to do the same with all of them ! – dsinha Jul 31 '12 at 15:42
  • Remove the `raw_input()` line and see what you get. Edit the original question with the code you are using and with the error. I will delete my extra comments here to remove clutter... – daedalus Jul 31 '12 at 15:52
  • i managed to catch a glimpse of the error , i coded again properly with the automatic indentations of the shell. And now the error i barely got time to see is " no attribute join" , thats what i managed , again is there somehow i can use any pause command on the prompt?? is there any specific one? – dsinha Jul 31 '12 at 16:34
  • thanx a lot for the continuous help gauden and @mutzmatron . Worked nicely , tweaked it a bit , found what I was looking for pretty much , I had one more question --> Is there anyway I can write these values to an excel file or csv ??? in the order Row --------------> AutoIndex AutoIndex Semivariance Columns---------> Lag1 Lag2 Lag3 Lag4 – dsinha Jul 31 '12 at 16:59
  • never mind found a typing mistake ... can i get the desired format with coding ??? @mutzmatron – dsinha Jul 31 '12 at 18:48
  • @dsinha - see my new answer below (this one was getting too long and included too many sub-answers) – jmetz Jul 31 '12 at 21:18
1

In order to format the data by section, it is perhaps easiest to split the text into segments and work on each one, as follows:

import csv
import re

keys = ['(Lag)=(\d+\.?\d*)',
        '(Autocorrelation Index): (\d+\.?\d*)',
        '(Semivariance): (\d+\.?\d*)']

string1 = ''.join(open("dummy.txt").readlines())

# Each "Spatial Statistics" block becomes one row of the output
sections = string1.split('Spatial Statistics')

output = []
heads = []

for sec in sections:
    found = []
    for key in keys:
        found.extend(re.findall(key, sec))
    if len(found) == 0:
        continue  # e.g. the text before the first block
    output.append([])
    for result in found:
        print '%s  =  %s' % (result[0], result[1])
        output[-1].append(result[1])
    # Take the column headings from the first block that has matches
    if len(heads) == 0:
        heads = [result[0] for result in found]

# Python 2: binary mode avoids blank rows on Windows
fout = open('output.csv', 'wb')
wrt = csv.writer(fout)
wrt.writerow(heads)
wrt.writerows(output)
fout.close()
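
For the plain-text table layout asked for in the question, a rough sketch along the same lines follows. It is written for Python 3 (an assumption; the thread uses Python 2.6), reads from an inline `sample` string instead of `dummy.txt`, and uses a hypothetical helper `format_table` with an arbitrary column width of 24 characters.

```python
import re

# Inline sample standing in for dummy.txt (assumption: same layout as the question's data).
sample = """Band: WDRVI20
Spatial Statistics, Lag=1:
  Moran's I:
    Autocorrelation Index: 0.8482564597
  Geary's C:
    Autocorrelation Index: 0.1517324729
  Semivariance: 0.0026356529
Spatial Statistics, Lag=2:
  Moran's I:
    Autocorrelation Index: 0.6230691635
  Geary's C:
    Autocorrelation Index: 0.3769372504
  Semivariance: 0.0065475488
"""

keys = [r'(Lag)=(\d+\.?\d*)',
        r'(Autocorrelation Index): (\d+\.?\d*)',
        r'(Semivariance): (\d+\.?\d*)']

heads = []
rows = []
for sec in sample.split('Spatial Statistics'):
    found = []
    for key in keys:
        found.extend(re.findall(key, sec))
    if not found:
        continue  # the text before the first block has no matches
    if not heads:
        heads = [name for name, _ in found]  # headings from the first block
    rows.append([value for _, value in found])


def format_table(heads, rows, width=24):
    """Hypothetical helper: return the table as left-aligned, fixed-width text."""
    lines = [''.join(h.ljust(width) for h in heads)]
    for row in rows:
        lines.append(''.join(v.ljust(width) for v in row))
    return '\n'.join(line.rstrip() for line in lines)


table = format_table(heads, rows)
print(table)

# Write the same table to a text file.
with open('output.txt', 'w') as fout:
    fout.write(table + '\n')
```

Each "Spatial Statistics" block contributes one row, so 100 blocks would give 100 rows under the same four headings.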
jmetz
  • there's no difference , when we try to do for more than 1 set of data ..It doesnt seem to work when more than one set is being considered . – dsinha Jul 31 '12 at 21:40
  • sir , i edited the question where I have included Lag=2 set of datas in the datablock section ( see old part ) . Can you make sure it somehow works in including that data ? – dsinha Jul 31 '12 at 21:55
  • You're right I made a typo - I forgot to change `stringq` to `sec` in the `for` loop - I just corrected it and can confirm it works - maybe you can accept my answers now ;) – jmetz Aug 01 '12 at 14:21
  • yes it seems to work. but can i get it to output to a txt file like i wanted sir ? @mutzmatron – dsinha Aug 01 '12 at 16:22
  • @dsinha - while I was tempted to say "you have to do _some_ work yourself..." I just added the few lines of code needed. – jmetz Aug 01 '12 at 16:53
  • thanx a ton sir , really appreciate your help. i am 16 and a total newbie to programming itself , had difficult time to understand even some tutorials. lots of respect for u sir @mutzmatron – dsinha Aug 01 '12 at 18:44