0

I'm dealing with a csv table converted from a pdf with Adobe acrobat pro. For some reason the software creates recurrent error each 117 rows. It "duplicates & concatenates the numbers" e.g. a row

7307 1 87.1

is transformed into something of this sort:

73077307 11 87187.1

how I can "correct" these rows with python? I would need to split the float in the middle and erase the first half.

I have read several threads about truncation but most of them need to split floats by the decimal point or deal just with integers. The data type would be float64 because I'm using the pandas read_csv function to read the csv.

df = pd.read_csv('path/file.csv',sep=';',index_col='Rang', na_values=['NA'])
df.dropna(how="all", inplace=True) # drop empty rows (an additional issue)
df[(df.index >10000)]

EDIT1: Code added, I thought I could identify the wrong ones because I have 1 row per hour of the year. Any line with index bigger than 365*24=8760 is wrong. But I see now that is not sufficient. One could loop the dataframe and if the index of row(i+1)-index of row (i) is greater than one then it needs correction. But I'm a begginer in python.. I'm not sure how to write that, but that's kind of a different problem.

I'm using python version 2.7.8 pandas v. 0.14.1

many thanks!

ivan_pozdeev
  • 33,874
  • 19
  • 107
  • 152
Nabla
  • 1,509
  • 3
  • 20
  • 35
  • You must treat it as a String. – merlin2011 Feb 11 '15 at 00:38
  • Please [edit] your question to include the code you're using for this at present. Thanks for improving the question's reference value! – Nathan Tuggy Feb 11 '15 at 00:39
  • 2
    The first and foremost question - how do you **differentiate** such data from valid numbers? – ivan_pozdeev Feb 11 '15 at 00:56
  • @ivan_pozdeev, it can be identified because to hourly production during the year, for every hour of the year. the index is a number that should be between 0 and 8760. for the "wrong ones" is not necesarely the case. – Nabla Feb 12 '15 at 03:12
  • @merlin2011 I will give a try with the string aproach, I'll give an update with the progress – Nabla Feb 12 '15 at 03:13
  • Please post a sample of the data, and indicate the problematic lines. – unutbu Feb 12 '15 at 12:28

1 Answers1

1

Grab each space-delimited word into a list as a string. For each item in that list, check if len of word is even or odd. If even, replace word with right half of word. If odd (because of '.' on right side) grab right half rounding up (so 5 right-most chars in 9 char word, for example). Add conversion to float64 when replacing each word.

Sean Azlin
  • 886
  • 7
  • 21