1

I found an awesome way to parse html with pandas. My data is in kind of a weird format (attached below). I want to split this data into 2 separate dataframes.

Notice how each cell is separated by a , ... is there any really efficient method to split all of these cells and create 2 dataframes, one for the labels and one for the ( value ) in parenthesis?

NumPy has all those ufuncs, is there a way I can use them on string dtypes since they can be converted to np.array with DF.as_matrix()? I'm trying to steer clear of for loops, I could iterate through all the indices and populate an empty array but that's pretty barbaric.

I'm using Beaker Notebook btw, it's really cool (HIGHLY RECOMMENDED)

enter image description here

#Set URL Destination
url = "http://www.reef.org/print/db/stats"

#Process raw table
DF_raw = pd.pandas.read_html(url)[0]

#Get start/end indices of table
start_label = "10 Most Frequent Species"; start_idx = (DF_raw.iloc[:,0] == start_label).argmax()
end_label = "Top 10 Sites for Species Richness"; end_idx = (DF_raw.iloc[:,0] == end_label).argmax()

#Process table
DF_freqSpecies = pd.DataFrame(
                              DF_raw.as_matrix()[(start_idx + 1):end_idx,:],
                              columns = DF_raw.iloc[0,:]
)
DF_freqSpecies

#Split these into 2 separate DataFrames

Here's my naive way of doing such:

import re
DF_species = pd.DataFrame(np.zeros_like(DF_freqSpecies),columns=DF_freqSpecies.columns)
DF_freq = pd.DataFrame(np.zeros_like(DF_freqSpecies).astype(str),columns=DF_freqSpecies.columns)

dims = DF_freqSpecies.shape
for i in range(dims[0]):
    for j in range(dims[1]):
        #Parse current dataframe
        species, freq = re.split("\s\(\d",DF_freqSpecies.iloc[i,j])
        freq = float(freq[:-1])
        #Populate split DataFrames
        DF_species.iloc[i,j] = species
        DF_freq.iloc[i,j] = freq

I want these 2 dataframes as my output:

(1) Species; enter image description here and (2) Frequencies enter image description here

O.rka
  • 29,847
  • 68
  • 194
  • 309

1 Answers1

2

you can do it this way:

DF1:

In [182]: df1 = DF_freqSpecies.replace(r'\s*\(\d+\.*\d*\)', '', regex=True)

In [183]: df1.head()
Out[183]:
0 Tropical Western Atlantic California, Pacific Northwest and Alaska  \
0                  Bluehead                          Copper Rockfish
1                 Blue Tang                                  Lingcod
2      Stoplight Parrotfish                        Painted Greenling
3        Bicolor Damselfish                           Sunflower Star
4              French Grunt                          Plumose Anemone

0                      Hawaii Tropical Eastern Pacific  \
0               Saddle Wrasse           King Angelfish
1  Hawaiian Whitespotted Toby          Mexican Hogfish
2       Raccoon Butterflyfish               Barberfish
3            Manybar Goatfish            Flag Cabrilla
4                Moorish Idol   Panamic Sergeant Major

0              South Pacific Northeast US and Eastern Canada  \
0            Regal Angelfish                          Cunner
1  Bluestreak Cleaner Wrasse                 Winter Flounder
2           Manybar Goatfish                     Rock Gunnel
3             Brushtail Tang                         Pollock
4       Two-spined Angelfish                  Grubby Sculpin

0 South Atlantic States       Central Indo-Pacific
0         Slippery Dick               Moorish Idol
1       Belted Sandfish       Three-spot Dascyllus
2        Black Sea Bass  Bluestreak Cleaner Wrasse
3               Tomtate     Blacklip Butterflyfish
4                Cubbyu        Clark's Anemonefish

and DF2

In [193]: df2 = DF_freqSpecies.replace(r'.*\((\d+\.*\d*)\).*', r'\1', regex=True)

In [194]: df2.head()
Out[194]:
0 Tropical Western Atlantic California, Pacific Northwest and Alaska Hawaii  \
0                        85                                     54.6     92
1                      84.8                                     53.2   85.8
2                        81                                     50.8   85.7
3                      79.9                                     50.2   85.7
4                      74.8                                     49.7   82.9

0 Tropical Eastern Pacific South Pacific Northeast US and Eastern Canada  \
0                     85.7            79                            67.4
1                     82.5          77.3                            46.6
2                     75.2          73.9                            26.2
3                     68.9          73.3                            25.2
4                     67.9          72.8                            23.7

0 South Atlantic States Central Indo-Pacific
0                  79.7                 80.1
1                  78.5                 75.6
2                  78.5                 73.5
3                  72.7                 71.4
4                  65.7                 70.2

RegEx debugging and explanation:

we basically want to remove everything, except number in parentheses:

(\d+\.*\d*) - group(1) - it's our number

\((\d+\.*\d*)\) - our number in parentheses

.*\((\d+\.*\d*)\).* - the whole thing - anything before '(', '(', our number, ')', anything till the end of the cell

it will be replaced with the group(1) - our number

MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • 2 lines! Awesome. Can you explain what's going on in `r'.*\((\d+\.*\d*)\).*', r'\1'` I understand the other one but this one is confusing. – O.rka Apr 19 '16 at 23:27
  • @O.rka, i've added explanation for the RegEx part – MaxU - stand with Ukraine Apr 19 '16 at 23:38
  • 1
    @MaxU @O.rka Be careful, the first replace eats up everything what's in parentheses, for example `DF_freqSpecies.loc[9,'Hawaii']`. – ptrj Apr 19 '16 at 23:39
  • 1
    @MaxU And one more tiny thing :-) In the first replace you could add a space to the regex: `r' \(\d+\.*\d*\)'`. It's trailing in the output. – ptrj Apr 19 '16 at 23:49