I found an awesome way to parse
html
with pandas
. My data is in kind of a weird format (attached below). I want to split this data into 2 separate dataframes
.
Notice how each cell
is separated by a ,
... is there any really efficient method to split all of these cells and create 2 dataframes, one for the labels and one for the ( value )
in parenthesis?
NumPy
has all those ufuncs
, is there a way I can use them on string
dtypes
since they can be converted to np.array
with DF.as_matrix()
? I'm trying to steer clear of for loops
, I could iterate through all the indices and populate an empty array but that's pretty barbaric.
I'm using Beaker Notebook
btw, it's really cool (HIGHLY RECOMMENDED)
#Set URL Destination
url = "http://www.reef.org/print/db/stats"
#Process raw table
DF_raw = pd.pandas.read_html(url)[0]
#Get start/end indices of table
start_label = "10 Most Frequent Species"; start_idx = (DF_raw.iloc[:,0] == start_label).argmax()
end_label = "Top 10 Sites for Species Richness"; end_idx = (DF_raw.iloc[:,0] == end_label).argmax()
#Process table
DF_freqSpecies = pd.DataFrame(
DF_raw.as_matrix()[(start_idx + 1):end_idx,:],
columns = DF_raw.iloc[0,:]
)
DF_freqSpecies
#Split these into 2 separate DataFrames
Here's my naive way of doing such:
import re
DF_species = pd.DataFrame(np.zeros_like(DF_freqSpecies),columns=DF_freqSpecies.columns)
DF_freq = pd.DataFrame(np.zeros_like(DF_freqSpecies).astype(str),columns=DF_freqSpecies.columns)
dims = DF_freqSpecies.shape
for i in range(dims[0]):
for j in range(dims[1]):
#Parse current dataframe
species, freq = re.split("\s\(\d",DF_freqSpecies.iloc[i,j])
freq = float(freq[:-1])
#Populate split DataFrames
DF_species.iloc[i,j] = species
DF_freq.iloc[i,j] = freq
I want these 2 dataframes as my output: