I have pandas dataframe with loosely formatted data which I have managed to clean quite well. Dataframe has some duplicate columns which dont all have the same values. Columns can be identified by looking at the three first rows which contain the header names.
Columns which have the least amount of non numerical cells should be dropped.
I have so far created the following table which I would like to combine as highlited.
Sample data is below
df = pd.read_html('<table style="border-collapse:collapse;display:inline-table;margin-bottom:5pt;vertical-align:text-bottom;width:100.000%"><tr><td style="width:1.0%"></td><td style="width:37.788%"></td><td style="width:0.1%"></td><td style="width:1.0%"></td><td style="width:13.761%"></td><td style="width:0.1%"></td><td style="width:0.1%"></td><td style="width:0.355%"></td><td style="width:0.1%"></td><td style="width:1.0%"></td><td style="width:13.761%"></td><td style="width:0.1%"></td><td style="width:0.1%"></td><td style="width:0.355%"></td><td style="width:0.1%"></td><td style="width:1.0%"></td><td style="width:13.761%"></td><td style="width:0.1%"></td><td style="width:0.1%"></td><td style="width:0.355%"></td><td style="width:0.1%"></td><td style="width:1.0%"></td><td style="width:13.764%"></td><td style="width:0.1%"></td></tr><tr><td colspan="3" style="padding:0 1pt"></td><td colspan="9" style="padding:2px 1pt;text-align:center;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:700;line-height:100%">Three Months Ended<br/>September 30,</span></td><td colspan="3" style="padding:0 1pt"></td><td colspan="9" style="padding:2px 1pt;text-align:center;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:700;line-height:100%">Nine Months Ended<br/>September 30,</span></td></tr><tr><td colspan="3" style="padding:0 1pt"></td><td colspan="3" style="border-top:1pt solid #000;padding:2px 1pt;text-align:center;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:700;line-height:100%">2022</span></td><td colspan="3" style="border-top:1pt solid #000;padding:0 1pt"></td><td colspan="3" style="border-top:1pt solid #000;padding:2px 1pt;text-align:center;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:700;line-height:100%">2021</span></td><td colspan="3" style="padding:0 1pt"></td><td colspan="3" style="border-top:1pt solid #000;padding:2px 1pt;text-align:center;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:700;line-height:100%">2022</span></td><td colspan="3" style="border-top:1pt solid #000;padding:0 1pt"></td><td colspan="3" style="border-top:1pt solid #000;padding:2px 1pt;text-align:center;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:700;line-height:100%">2021</span></td></tr><tr><td colspan="3" style="background-color:#cceeff;padding:2px 1pt;text-align:left;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%">Net income (loss)</span></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 0 2px 1pt;text-align:left;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%">$</span></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 0;text-align:right;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%"><ix:nonfraction contextref="i95d88e296a034449aefc3c865b10cb6b_D20220701-20220930" decimals="-3" format="ixt:num-dot-decimal" id="id3VybDovL2RvY3MudjEvZG9jOjMzNjc1ZGM5YjBmMTQ5NThhNDgxODFkNzhhNTRiNDFhL3NlYzozMzY3NWRjOWIwZjE0OTU4YTQ4MTgxZDc4YTU0YjQxYV8yMi9mcmFnOjJmZTQ1ZWY0NjlhYjRjYjlhYzAyY2QzY2VjNzhhNTIzL3RhYmxlOjJkMWE5NmQ1YWZjNDQ4MTA4MjEwM2I2MTczYzI4ZTMwL3RhYmxlcmFuZ2U6MmQxYTk2ZDVhZmM0NDgxMDgyMTAzYjYxNzNjMjhlMzBfMi0xLTEtMS04NTQ3Nw_57ac491e-0454-41ac-9ff0-10ead2e6e248" name="us-gaap:NetIncomeLoss" scale="3" unitref="usd">40,822</ix:nonfraction>\xa0</span></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 1pt 2px 0;text-align:right;vertical-align:bottom"></td><td colspan="3" style="background-color:#cceeff;padding:0 1pt"></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 0 2px 1pt;text-align:left;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%">$</span></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 0;text-align:right;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%">(<ix:nonfraction contextref="i3af3f33d4677405e83d2066ac1c28161_D20210701-20210930" decimals="-3" format="ixt:num-dot-decimal" id="id3VybDovL2RvY3MudjEvZG9jOjMzNjc1ZGM5YjBmMTQ5NThhNDgxODFkNzhhNTRiNDFhL3NlYzozMzY3NWRjOWIwZjE0OTU4YTQ4MTgxZDc4YTU0YjQxYV8yMi9mcmFnOjJmZTQ1ZWY0NjlhYjRjYjlhYzAyY2QzY2VjNzhhNTIzL3RhYmxlOjJkMWE5NmQ1YWZjNDQ4MTA4MjEwM2I2MTczYzI4ZTMwL3RhYmxlcmFuZ2U6MmQxYTk2ZDVhZmM0NDgxMDgyMTAzYjYxNzNjMjhlMzBfMi0zLTEtMS04NTQ3Nw_2d67a9ab-6c52-4f5f-b100-be70be483fc8" name="us-gaap:NetIncomeLoss" scale="3" sign="-" unitref="usd">27,561</ix:nonfraction>)</span></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 1pt 2px 0;text-align:right;vertical-align:bottom"></td><td colspan="3" style="background-color:#cceeff;padding:0 1pt"></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 0 2px 1pt;text-align:left;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%">$</span></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 0;text-align:right;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%"><ix:nonfraction contextref="id256e164cdd84b09bd8f3ae9a5c3ee11_D20220101-20220930" decimals="-3" format="ixt:num-dot-decimal" id="id3VybDovL2RvY3MudjEvZG9jOjMzNjc1ZGM5YjBmMTQ5NThhNDgxODFkNzhhNTRiNDFhL3NlYzozMzY3NWRjOWIwZjE0OTU4YTQ4MTgxZDc4YTU0YjQxYV8yMi9mcmFnOjJmZTQ1ZWY0NjlhYjRjYjlhYzAyY2QzY2VjNzhhNTIzL3RhYmxlOjJkMWE5NmQ1YWZjNDQ4MTA4MjEwM2I2MTczYzI4ZTMwL3RhYmxlcmFuZ2U6MmQxYTk2ZDVhZmM0NDgxMDgyMTAzYjYxNzNjMjhlMzBfMi01LTEtMS04NTQ3Nw_7e9e91ec-da49-4317-ae1a-2583e193fd82" name="us-gaap:NetIncomeLoss" scale="3" unitref="usd">13,542</ix:nonfraction>\xa0</span></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 1pt 2px 0;text-align:right;vertical-align:bottom"></td><td colspan="3" style="background-color:#cceeff;padding:0 1pt"></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 0 2px 1pt;text-align:left;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%">$</span></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 0;text-align:right;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%">(<ix:nonfraction contextref="i8ba80972764f40478145735ed1a45cf6_D20210101-20210930" decimals="-3" format="ixt:num-dot-decimal" id="id3VybDovL2RvY3MudjEvZG9jOjMzNjc1ZGM5YjBmMTQ5NThhNDgxODFkNzhhNTRiNDFhL3NlYzozMzY3NWRjOWIwZjE0OTU4YTQ4MTgxZDc4YTU0YjQxYV8yMi9mcmFnOjJmZTQ1ZWY0NjlhYjRjYjlhYzAyY2QzY2VjNzhhNTIzL3RhYmxlOjJkMWE5NmQ1YWZjNDQ4MTA4MjEwM2I2MTczYzI4ZTMwL3RhYmxlcmFuZ2U6MmQxYTk2ZDVhZmM0NDgxMDgyMTAzYjYxNzNjMjhlMzBfMi03LTEtMS04NTQ3Nw_726ed6d6-93ef-4381-85fd-800faac4afd8" name="us-gaap:NetIncomeLoss" scale="3" sign="-" unitref="usd">28,502</ix:nonfraction>)</span></td><td style="background-color:#cceeff;border-top:1pt solid #000;padding:2px 1pt 2px 0;text-align:right;vertical-align:bottom"></td></tr><tr><td colspan="3" style="background-color:#ffffff;padding:2px 1pt 2px 7pt;text-align:left;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%">Change in foreign currency translation adjustments</span></td><td colspan="2" style="background-color:#ffffff;padding:2px 0 2px 1pt;text-align:right;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%">(<ix:nonfraction contextref="i95d88e296a034449aefc3c865b10cb6b_D20220701-20220930" decimals="-3" format="ixt:num-dot-decimal" id="id3VybDovL2RvY3MudjEvZG9jOjMzNjc1ZGM5YjBmMTQ5NThhNDgxODFkNzhhNTRiNDFhL3NlYzozMzY3NWRjOWIwZjE0OTU4YTQ4MTgxZDc4YTU0YjQxYV8yMi9mcmFnOjJmZTQ1ZWY0NjlhYjRjYjlhYzAyY2QzY2VjNzhhNTIzL3RhYmxlOjJkMWE5NmQ1YWZjNDQ4MTA4MjEwM2I2MTczYzI4ZTMwL3RhYmxlcmFuZ2U6MmQxYTk2ZDVhZmM0NDgxMDgyMTAzYjYxNzNjMjhlMzBfMy0xLTEtMS04NTQ3Nw_afb0fde3-4030-416b-b247-fa7a2545dcb8" name="us-gaap:OtherComprehensiveIncomeForeignCurrencyTransactionAndTranslationAdjustmentNetOfTaxPortionAttributableToParent" scale="3" sign="-" unitref="usd">34,106</ix:nonfraction>)</span></td><td style="background-color:#ffffff;padding:2px 1pt 2px 0;text-align:right;vertical-align:bottom"></td><td colspan="3" style="background-color:#ffffff;padding:0 1pt"></td><td colspan="2" style="background-color:#ffffff;padding:2px 0 2px 1pt;text-align:right;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%"><ix:nonfraction contextref="i3af3f33d4677405e83d2066ac1c28161_D20210701-20210930" decimals="-3" format="ixt:fixed-zero" id="id3VybDovL2RvY3MudjEvZG9jOjMzNjc1ZGM5YjBmMTQ5NThhNDgxODFkNzhhNTRiNDFhL3NlYzozMzY3NWRjOWIwZjE0OTU4YTQ4MTgxZDc4YTU0YjQxYV8yMi9mcmFnOjJmZTQ1ZWY0NjlhYjRjYjlhYzAyY2QzY2VjNzhhNTIzL3RhYmxlOjJkMWE5NmQ1YWZjNDQ4MTA4MjEwM2I2MTczYzI4ZTMwL3RhYmxlcmFuZ2U6MmQxYTk2ZDVhZmM0NDgxMDgyMTAzYjYxNzNjMjhlMzBfMy0zLTEtMS04NTQ3Nw_b06e1ff9-48c3-4bef-bb8a-32c4a82d67bf" name="us-gaap:OtherComprehensiveIncomeForeignCurrencyTransactionAndTranslationAdjustmentNetOfTaxPortionAttributableToParent" scale="3" unitref="usd">—</ix:nonfraction>\xa0</span></td><td style="background-color:#ffffff;padding:2px 1pt 2px 0;text-align:right;vertical-align:bottom"></td><td colspan="3" style="background-color:#ffffff;padding:0 1pt"></td><td colspan="2" style="background-color:#ffffff;padding:2px 0 2px 1pt;text-align:right;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%">(<ix:nonfraction contextref="id256e164cdd84b09bd8f3ae9a5c3ee11_D20220101-20220930" decimals="-3" format="ixt:num-dot-decimal" id="id3VybDovL2RvY3MudjEvZG9jOjMzNjc1ZGM5YjBmMTQ5NThhNDgxODFkNzhhNTRiNDFhL3NlYzozMzY3NWRjOWIwZjE0OTU4YTQ4MTgxZDc4YTU0YjQxYV8yMi9mcmFnOjJmZTQ1ZWY0NjlhYjRjYjlhYzAyY2QzY2VjNzhhNTIzL3RhYmxlOjJkMWE5NmQ1YWZjNDQ4MTA4MjEwM2I2MTczYzI4ZTMwL3RhYmxlcmFuZ2U6MmQxYTk2ZDVhZmM0NDgxMDgyMTAzYjYxNzNjMjhlMzBfMy01LTEtMS04NTQ3Nw_60ae3027-c6d3-4463-90c5-aa9c1d70db6c" name="us-gaap:OtherComprehensiveIncomeForeignCurrencyTransactionAndTranslationAdjustmentNetOfTaxPortionAttributableToParent" scale="3" sign="-" unitref="usd">72,835</ix:nonfraction>)</span></td><td style="background-color:#ffffff;padding:2px 1pt 2px 0;text-align:right;vertical-align:bottom"></td><td colspan="3" style="background-color:#ffffff;padding:0 1pt"></td><td colspan="2" style="background-color:#ffffff;padding:2px 0 2px 1pt;text-align:right;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%"><ix:nonfraction contextref="i8ba80972764f40478145735ed1a45cf6_D20210101-20210930" decimals="-3" format="ixt:fixed-zero" id="id3VybDovL2RvY3MudjEvZG9jOjMzNjc1ZGM5YjBmMTQ5NThhNDgxODFkNzhhNTRiNDFhL3NlYzozMzY3NWRjOWIwZjE0OTU4YTQ4MTgxZDc4YTU0YjQxYV8yMi9mcmFnOjJmZTQ1ZWY0NjlhYjRjYjlhYzAyY2QzY2VjNzhhNTIzL3RhYmxlOjJkMWE5NmQ1YWZjNDQ4MTA4MjEwM2I2MTczYzI4ZTMwL3RhYmxlcmFuZ2U6MmQxYTk2ZDVhZmM0NDgxMDgyMTAzYjYxNzNjMjhlMzBfMy03LTEtMS04NTQ3Nw_8b1eda62-1f4a-4dbb-8164-968d522380af" name="us-gaap:OtherComprehensiveIncomeForeignCurrencyTransactionAndTranslationAdjustmentNetOfTaxPortionAttributableToParent" scale="3" unitref="usd">—</ix:nonfraction>\xa0</span></td><td style="background-color:#ffffff;padding:2px 1pt 2px 0;text-align:right;vertical-align:bottom"></td></tr><tr><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td><td colspan="3" style="display:none"></td></tr><tr><td colspan="3" style="background-color:#cceeff;padding:2px 1pt;text-align:left;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%">Comprehensive income (loss)</span></td><td style="background-color:#cceeff;border-bottom:3pt double #000;border-top:1pt solid #000;padding:2px 0 2px 1pt;text-align:left;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%">$</span></td><td style="background-color:#cceeff;border-bottom:3pt double #000;border-top:1pt solid #000;padding:2px 0;text-align:right;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%"><ix:nonfraction contextref="i95d88e296a034449aefc3c865b10cb6b_D20220701-20220930" decimals="-3" format="ixt:num-dot-decimal" id="id3VybDovL2RvY3MudjEvZG9jOjMzNjc1ZGM5YjBmMTQ5NThhNDgxODFkNzhhNTRiNDFhL3NlYzozMzY3NWRjOWIwZjE0OTU4YTQ4MTgxZDc4YTU0YjQxYV8yMi9mcmFnOjJmZTQ1ZWY0NjlhYjRjYjlhYzAyY2QzY2VjNzhhNTIzL3RhYmxlOjJkMWE5NmQ1YWZjNDQ4MTA4MjEwM2I2MTczYzI4ZTMwL3RhYmxlcmFuZ2U6MmQxYTk2ZDVhZmM0NDgxMDgyMTAzYjYxNzNjMjhlMzBfNS0xLTEtMS04NTQ3Nw_1a548c06-aecd-4a52-b997-fe65c181419a" name="us-gaap:ComprehensiveIncomeNetOfTax" scale="3" unitref="usd">6,716</ix:nonfraction>\xa0</span></td><td style="background-color:#cceeff;border-bottom:3pt double #000;border-top:1pt solid #000;padding:2px 1pt 2px 0;text-align:right;vertical-align:bottom"></td><td colspan="3" style="background-color:#cceeff;padding:0 1pt"></td><td style="background-color:#cceeff;border-bottom:3pt double #000;border-top:1pt solid #000;padding:2px 0 2px 1pt;text-align:left;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%">$</span></td><td style="background-color:#cceeff;border-bottom:3pt double #000;border-top:1pt solid #000;padding:2px 0;text-align:right;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%">(<ix:nonfraction contextref="i3af3f33d4677405e83d2066ac1c28161_D20210701-20210930" decimals="-3" format="ixt:num-dot-decimal" id="id3VybDovL2RvY3MudjEvZG9jOjMzNjc1ZGM5YjBmMTQ5NThhNDgxODFkNzhhNTRiNDFhL3NlYzozMzY3NWRjOWIwZjE0OTU4YTQ4MTgxZDc4YTU0YjQxYV8yMi9mcmFnOjJmZTQ1ZWY0NjlhYjRjYjlhYzAyY2QzY2VjNzhhNTIzL3RhYmxlOjJkMWE5NmQ1YWZjNDQ4MTA4MjEwM2I2MTczYzI4ZTMwL3RhYmxlcmFuZ2U6MmQxYTk2ZDVhZmM0NDgxMDgyMTAzYjYxNzNjMjhlMzBfNS0zLTEtMS04NTQ3Nw_0e25d9eb-e063-4948-aeae-f370edc05d78" name="us-gaap:ComprehensiveIncomeNetOfTax" scale="3" sign="-" unitref="usd">27,561</ix:nonfraction>)</span></td><td style="background-color:#cceeff;border-bottom:3pt double #000;border-top:1pt solid #000;padding:2px 1pt 2px 0;text-align:right;vertical-align:bottom"></td><td colspan="3" style="background-color:#cceeff;padding:0 1pt"></td><td style="background-color:#cceeff;border-bottom:3pt double #000;border-top:1pt solid #000;padding:2px 0 2px 1pt;text-align:left;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%">$</span></td><td style="background-color:#cceeff;border-bottom:3pt double #000;border-top:1pt solid #000;padding:2px 0;text-align:right;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%">(<ix:nonfraction contextref="id256e164cdd84b09bd8f3ae9a5c3ee11_D20220101-20220930" decimals="-3" format="ixt:num-dot-decimal" id="id3VybDovL2RvY3MudjEvZG9jOjMzNjc1ZGM5YjBmMTQ5NThhNDgxODFkNzhhNTRiNDFhL3NlYzozMzY3NWRjOWIwZjE0OTU4YTQ4MTgxZDc4YTU0YjQxYV8yMi9mcmFnOjJmZTQ1ZWY0NjlhYjRjYjlhYzAyY2QzY2VjNzhhNTIzL3RhYmxlOjJkMWE5NmQ1YWZjNDQ4MTA4MjEwM2I2MTczYzI4ZTMwL3RhYmxlcmFuZ2U6MmQxYTk2ZDVhZmM0NDgxMDgyMTAzYjYxNzNjMjhlMzBfNS01LTEtMS04NTQ3Nw_fd318eae-dbad-4a88-bfca-216579dd9288" name="us-gaap:ComprehensiveIncomeNetOfTax" scale="3" sign="-" unitref="usd">59,293</ix:nonfraction>)</span></td><td style="background-color:#cceeff;border-bottom:3pt double #000;border-top:1pt solid #000;padding:2px 1pt 2px 0;text-align:right;vertical-align:bottom"></td><td colspan="3" style="background-color:#cceeff;padding:0 1pt"></td><td style="background-color:#cceeff;border-bottom:3pt double #000;border-top:1pt solid #000;padding:2px 0 2px 1pt;text-align:left;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%">$</span></td><td style="background-color:#cceeff;border-bottom:3pt double #000;border-top:1pt solid #000;padding:2px 0;text-align:right;vertical-align:bottom"><span style="color:#000000;font-family:\'Arial\',sans-serif;font-size:11pt;font-weight:400;line-height:100%">(<ix:nonfraction contextref="i8ba80972764f40478145735ed1a45cf6_D20210101-20210930" decimals="-3" format="ixt:num-dot-decimal" id="id3VybDovL2RvY3MudjEvZG9jOjMzNjc1ZGM5YjBmMTQ5NThhNDgxODFkNzhhNTRiNDFhL3NlYzozMzY3NWRjOWIwZjE0OTU4YTQ4MTgxZDc4YTU0YjQxYV8yMi9mcmFnOjJmZTQ1ZWY0NjlhYjRjYjlhYzAyY2QzY2VjNzhhNTIzL3RhYmxlOjJkMWE5NmQ1YWZjNDQ4MTA4MjEwM2I2MTczYzI4ZTMwL3RhYmxlcmFuZ2U6MmQxYTk2ZDVhZmM0NDgxMDgyMTAzYjYxNzNjMjhlMzBfNS03LTEtMS04NTQ3Nw_a21491d4-f0cf-48bf-9921-027e7e04efdf" name="us-gaap:ComprehensiveIncomeNetOfTax" scale="3" sign="-" unitref="usd">28,502</ix:nonfraction>)</span></td><td style="background-color:#cceeff;border-bottom:3pt double #000;border-top:1pt solid #000;padding:2px 1pt 2px 0;text-align:right;vertical-align:bottom"></td></tr></table>', header=0)
Data has three level headers and duplicate index values. With the following code I was able to check which are the first column and header indexes. Note these do not match actual iloc references directly.
test_df = pd.DataFrame(df[0])
header_rows = test_df.isna().cummin().sum()[0]
index_rows = test_df.isna().cummin(axis=1).sum()[header_rows-1]
This results in following dataset which I will clean next
#Then some cleaning of data to get rid of non numeric values
test_df.iloc[header_rows:, index_rows:] = test_df.iloc[header_rows:, index_rows:].replace(r'^\s*$', np.nan, regex=True)
test_df.iloc[header_rows:, index_rows:] = test_df.iloc[header_rows:, index_rows:].replace("\(",'-', regex=True)
#test_df.iloc[header_rows:, index_rows:] = test_df.iloc[header_rows:, index_rows:].replace("/[^0-9.]/g",'', regex=True)
test_df.iloc[header_rows:, index_rows:] = test_df.iloc[header_rows:, index_rows:].replace('[^\w\s-]','',regex=True) #### Minus should be now fidex by adding it to regex.
test_df.iloc[header_rows:, index_rows:] = test_df.iloc[header_rows:, index_rows:].replace(r'^\s*$', np.nan, regex=True)
At last step I calculate how many values in column are non numeric
all_a = []
for (columnName, columnData) in test_df.iteritems():
len(test_df.loc[test_df[columnName].astype(str).str.isnumeric()])
mask = pd.to_numeric(test_df[columnName][header_rows:], errors='coerce').isna()
a = mask.sum()
all_a.append(a)
#print('Column Name : ', columnName)
#print('Column Contents : ', columnData.values)
test_df = test_df.append(pd.DataFrame([all_a], columns=list(test_df.columns)), ignore_index=True)
This results in following dataset with calculated non numeric values per column. From the results I should be able to leave such columns to dataframe which have least number of non numeric characters.
- Non numeric characters are on last row in pink color
- Groups in this data are first three rows. These can be concatenated to one row
- Groups of which rows which have all NaN values beside the headers can be dropped from dataframe. Example is highlited in red.
- Duplicates in first three columns could be solved by using index_rows variable but forgot to do it for this example.
I tried concatenating the first three rows for label names but had no ideas on how to proceed after that.