I need to work out information about the structure of branched alkanes using the IUPAC (International Union of Pure and Applied Chemistry) name in Excel. I have broken the problem into steps, and I have included examples for each step. Help with any of the steps would be greatly appreciated. I need to automate the method because I potentially have thousands of these names to process.
Step 1
I need work out the number of carbon atoms in the main chain. By convention, the name of the main chain appears last in the overall name. For example, the main chain in 2,2,3-trimethyloctane is octane. Could regular expressions be used to look for the name of main chain and then be used to look up the carbon atoms in the main chain? Further examples are provided in the table below.
Branched alkane name | Main chain | Carbon atoms in the main chain |
---|---|---|
2-methylpropane | Propane | 3 |
2-methylbutane | Butane | 4 |
3-ethylpentane | Pentane | 5 |
3-methylhexane | Hexane | 6 |
4-propulheptane | Heptane | 7 |
2-methyloctane | Octane | 8 |
3-ethylnonane | Nonane | 9 |
5-methyldecane | Decane | 10 |
6-butylundecane | Undecane | 11 |
3-ethyldodecane | Dodecane | 12 |
The custom UDF in Excel could be named MainChainRegex and =MainChainRegex(2,2,3-trimethyloctane) would return a value of 8.
Step 2
I also need work out the number of side-chain carbon atoms attached to each of the main chain carbon atoms. For example, 3-ethylpentane has two sidechain carbon atoms attached to the third carbon atom in the main chain, see image below. The “ethyl” term refers to a sidechain containing two carbon atoms. The “3-” in front of “ethyl” tells us that the sidechain is attached to carbon atom 3 in the main chain.
The custom UDF in Excel could be named SideChainRegex and =SideChainRegex(3-ethylpentane) would return an array of 0, 0, 2, 0, 0 in five cells. The number of carbon atoms in the main chain could be used to size the array, and the first and last values in the array will always be zero.
Step 3
The problem becomes more complicated when there are multiple sidechains. For example, 2,2-dimethylbutane has two sidechains containing a combined total of two carbon atoms, see image below. The “methyl” term refers to a sidechain containing one carbon atom. The “di” refers to the fact that there are two of these sidechains and means that we will need to look for two numbers proceeding the “dimethyl” term. The “2,2-” in front of “dimethyl” tells us that both sidechains are attached to carbon atom 2 in the main chain. If the custom UDF in Excel is named SideChainRegex then =SideChainRegex(2,2-dimethylbutane) would need to return an array of 0, 2, 0, 0 in four cells – again the array is sized by the number of carbon atoms in the main chain.
Step 4
A further complication is when there is more than one type of side chain. For example, 5-ethyl-2,2-dimethylheptane has two different types of sidechains, see image below. If the custom UDF in Excel is named SideChainRegex then =SideChainRegex(5-ethyl-2,2-dimethylheptane) would need to return an array of 0, 2, 0, 0, 2, 0, 0 in seven cells to total up the number of side-chain carbon atoms attached to each of the main chain carbon atoms.
The tables below provide further information that might be helpful.
Term | How many numbers to look for before term |
---|---|
Di | 2 |
Tri | 3 |
Tetra | 4 |
Penta | 5 |
Hexa | 6 |
Hepta | 7 |
Octa | 8 |
Term | Number of carbon atoms in sidechain |
---|---|
Methyl | 1 |
Ethyl | 2 |
Propyl | 3 |
Butyl | 4 |
Pentyl | 5 |
Hexyl | 6 |
Heptyl | 7 |
Octyl | 8 |