I would like to add a new column to a pandas DataFrame that contains the number of hydrogens in each chemical formula. For example, the formula C18H36P1S1 should give 36. The formulas vary from row to row and there are thousands of them in the column, so I can't hard-code a specific formula.
2 Answers
import re
REGEX = re.compile(r'H(?P<hydrogens>\d+)')
REGEX.search('C18H36P1S1').group('hydrogens')
returns:
'36'
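To apply the same idea to a whole column, one option is to wrap the regex in a small function and map it over the Series. The following is only a sketch: the helper name `count_hydrogens`, the column name `hydrogens`, and the sample formulas are made up for illustration. It assumes flat formulas in which hydrogen appears at most once, and it treats a bare `H` with no number as a count of 1, which goes beyond what the question specifies.

```python
import re

import pandas as pd

# "H" not followed by a lowercase letter (so He, Hg, Ho, ... are skipped),
# then an optional count. A missing count is treated as 1 (assumption).
HYDROGEN = re.compile(r'H(?![a-z])(?P<hydrogens>\d*)')


def count_hydrogens(formula: str) -> int:
    """Return the hydrogen count of a flat formula such as 'C18H36P1S1'."""
    match = HYDROGEN.search(formula)
    if match is None:
        return 0  # the formula contains no hydrogen
    digits = match.group('hydrogens')
    return int(digits) if digits else 1


# Hypothetical sample data; replace with your own dataframe and column name.
df = pd.DataFrame({'formula': ['C18H36P1S1', 'C2H6', 'CHCl3', 'HgCl2']})
df['hydrogens'] = df['formula'].map(count_hydrogens)
```

Because only the first match is used, a structural formula that repeats hydrogen (for example CH3CH2OH) would need `finditer` and a sum instead.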

vurmux
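You can use str.extract:
import pandas as pd
df = pd.DataFrame({'formula': ['C18H36P1S1']})
# Capture the digits that directly follow "H"; this also matches when H is the last element
df['No Hydrogens'] = df['formula'].str.extract(r'H(\d+)', expand=False)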
      formula No Hydrogens
0  C18H36P1S1           36
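Note that `str.extract` returns strings, so the new column holds `'36'` rather than the number 36. If integers are needed, a possible follow-up is sketched below. The sample formulas are invented, and filling rows without hydrogen with 0 is an assumption about what you want for those rows.

```python
import pandas as pd

# Hypothetical sample data; the last formula contains no hydrogen.
df = pd.DataFrame({'formula': ['C18H36P1S1', 'C2H6', 'C1O2']})

# expand=False returns a Series when the pattern has a single capture group.
counts = df['formula'].str.extract(r'H(\d+)', expand=False)

# Rows without a match are NaN; convert to numbers and fill them with 0.
df['No Hydrogens'] = pd.to_numeric(counts).fillna(0).astype(int)
```

If a missing value should stay missing instead of becoming 0, skip the `fillna` step and keep the column as floats or use a nullable integer dtype.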
-
It gives the answer in a new column. – glhr May 07 '19 at 16:24
-
What if you wanted to feed in arbitrary formulas for thousands of rows? – Rusty Denton May 07 '19 at 16:26
-
This implementation works for an arbitrary number of rows, not just 1 – glhr May 07 '19 at 16:52
-
How is it general if you feed in 'formula':['C18H36P1S1']? Wouldn't you need to write something different? Or is it just that, when using an actual dataframe, the first line isn't needed? Thanks for the help, BTW – Rusty Denton May 07 '19 at 17:04
-
You can fill the dataframe with your original data, for example `df = pd.DataFrame({'formula':['C18H36P1S1', 'C18H36P1S2', 'C18H36P1S3']})` (assuming you don't have a dataframe already) – glhr May 07 '19 at 17:32
-
@RustyDenton, what is the structure of your DataFrame? If you can post the first few lines, it will be easier to understand. This code will work on a column containing similar formulas, as long as the hydrogen count directly follows `H` – Vaishali May 07 '19 at 17:37
-
Or just use `df = ` your own dataframe. It'll work as long as you adjust the column name `formula` to match your dataframe. – glhr May 07 '19 at 17:37