Parsing a tag in HTML

Question

I know similar questions have been asked, but I don't think this specific situation has been covered. If it has, feel free to point me to it.

I have an HTML file structured like this (you can view the original here):

<h5 id="foo1">Title 1</h5>
               <table class="foo2">
                  <tbody>
                     <tr>
                        <td>
                           <h3 class="foo3">SomeName1</h3>
                           <img src="Somesource" alt="SomeName2" title="SomeTitle"><br>
                              <p class="textcode">
                                    Some precious text here
                              </p>
                        </td>
                        ...
               </table>

For each table cell under each h5, I would like to extract the name, the image, and the text contained in the <p>, and save each of these items in a separate folder named after the corresponding h5.

I tried this:

# coding: utf-8
import os
import re
from bs4 import BeautifulSoup as bs

os.chdir("WorkingDirectory")
# Open the HTML file and load its content into the variable of the same name
with open("TheGoodPath.htm","r") as html:
    html = bs(html,'html.parser')
    # Select the headers, limit the results to the first six, and create the folders
    h5 = html.find_all("h5",limit=6)
    for h in h5:
        # Create the folders named after the headers
        chemin = u"../Résulat/"
        nom = str(h.contents[0].string)
        os.makedirs(chemin + nom,exist_ok=True)
        # Select the sibling table located right after the header
        table = h.find_next_sibling(name = 'table')
        for t in table:
            # Select the headers containing the document titles
            h3 = t.find_all("h3")
            for k in h3:
                titre = str(k.string)
                # Create the directories named after the figures
                os.makedirs(chemin + nom + titre,exist_ok=True)
                os.fdopen(titre.tex)
                # Get the image from the sibling tag located right after the previous header
                img = k.find_next_sibling("img")
                chimg = img.img['src']
                os.fdopen(img.img['title'])
                # Get the TikZ code from the sibling tag located right after the previous header
                tikz = k.find_next_sibling('p')
                # Extract the TikZ code contained in the tag retrieved above
                code = tikz.get_text()
                # Define, then write, the preamble and the code needed to produce the previously saved image
                preambule = r"%PREAMBULE \n  \usepackage{pgfplots} \n  \usepackage{tikz} \n  \usepackage[european resistor, european voltage, european current]{circuitikz} \n  \usetikzlibrary{arrows,shapes,positioning} \n  \usetikzlibrary{decorations.markings,decorations.pathmorphing, decorations.pathreplacing} \n  \usetikzlibrary{calc,patterns,shapes.geometric} \n  %FIN PREAMBULE"
                with open(chemin + nom + titre,'w') as result:
                    result.write(preambule + code)

But it raises AttributeError: 'NavigableString' object has no attribute 'find_next_element' for h3 = t.find_all("h3"), line 21

Run dir(table). I think you will find that table is not what you think it is, given that it seems to be defined as a single element – PyNEwbie, Mar 24 '16 at 20:08
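To illustrate what the comment above hints at: iterating over a BeautifulSoup tag walks its direct children, and the whitespace between child tags shows up as NavigableString objects, which is where the AttributeError comes from. A minimal sketch, assuming a made-up snippet modelled on the question's markup rather than the real page:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Made-up sample modelled on the question's markup (not the real page).
sample = """
<h5 id="foo1">Title 1</h5>
<table class="foo2">
  <tr><td><h3 class="foo3">SomeName1</h3></td></tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")
table = soup.find("h5").find_next_sibling("table")

# Iterating over a Tag yields its direct children, text nodes included.
child_types = [type(child).__name__ for child in table]
print(child_types)

# Separate the whitespace strings from the real child tags.
strings = [c for c in table if isinstance(c, NavigableString)]
tags = [c for c in table if isinstance(c, Tag)]

# Calling find_all on the table itself avoids the problem entirely.
titles = [h3.get_text() for h3 in table.find_all("h3")]
print(titles)  # ['SomeName1']
```

Searching from the table directly, instead of looping over it, means the string children are never touched.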

alecxe · Answer 1 · 2016-03-24T20:27:27.667


It looks like (judging by the for t in table loop) you meant to find multiple "table" elements. Use find_next_siblings() instead of find_next_sibling():

table = h.find_next_siblings(name='table') 
for t in table:

edited Mar 24 '16 at 20:27

answered Mar 24 '16 at 20:19


There is one table between each h5 tag. If you use `find_next_siblings(name='table')` you will find the same tables more than once: `print(len(h.find_next_siblings("table")))` will output 9, 8, 7, 6, 5, 4. The OP should just use the table returned from `find_next_sibling` – Padraic Cunningham Mar 24 '16 at 20:43
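The decreasing counts mentioned in this comment can be reproduced offline. The sketch below assumes a made-up document with three headers, each followed by one table, so the numbers are 3, 2, 1 rather than the 9 to 4 seen on the real page:

```python
from bs4 import BeautifulSoup

# Made-up document: three h5 headers, each followed by exactly one table.
sample = "".join(
    "<h5>Title {0}</h5><table><tr><td><h3>Name {0}</h3></td></tr></table>".format(i)
    for i in range(1, 4)
)
soup = BeautifulSoup(sample, "html.parser")

# find_next_siblings returns every later table, so tables repeat across headers.
sibling_counts = [len(h5.find_next_siblings("table")) for h5 in soup.find_all("h5")]
print(sibling_counts)  # [3, 2, 1]

# find_next_sibling returns only the first table after each header.
first_names = [
    h5.find_next_sibling("table").h3.get_text() for h5 in soup.find_all("h5")
]
print(first_names)  # ['Name 1', 'Name 2', 'Name 3']
```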

Padraic Cunningham · Accepted Answer · 2016-03-27T17:47:49.873


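This seems to be what you want. There is only one table after each h5, so don't iterate over it: iterating over a tag walks its children, including the whitespace strings between them, which is where the NavigableString error comes from. Just use find_next and work with the table it returns: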

from bs4 import BeautifulSoup
import requests

cont = requests.get("http://www.physagreg.fr/schemas-figures-physique-svg-tikz.php").text

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(cont, "html.parser")

h5s = soup.find_all("h5", limit=6)
for h5 in h5s:
    # find the first table after the header
    table = h5.find_next("table")
    # find all h3 elements in that table
    for h3 in table.select("h3"):
        print(h3.text)
        img = h3.find_next("img")
        print(img["src"])
        print(img["title"])
        print(img.find_next("p").text)
    print()

Which gives you output like:

repere-plan.svg

\begin{tikzpicture}[scale=1]
\draw (0,0) --++ (1,1) --++ (3,0) --++ (-1,-1) --++ (-3,0);
\draw [thick] [->] (2,0.5) --++(0,2) node [right] {z};
%thick : gras ; very thick : très gras ; ultra thick : hyper gras
\draw (2,0.5) node [left] {O};
\draw [thick] [->] (2,0.5) --++(-1,-1) node [left] {x};
\draw [thick] [->] (2,0.5) --++(2,0) node [below] {y};
\end{tikzpicture}

Lignes de champ et équipotentielles
images/cours-licence/em3/ligne-champ-equipot.svg

ligne-champ-equipot.svg

\begin{tikzpicture}[scale=0.8]
\draw[->] (-2,0) -- (2,0);
\draw[->] (0,-2) -- (0,2);
\draw node [red] at (-2,1.25) {\scriptsize{Lignes de champ}};
\draw node [blue] at (2,-1.25) {\scriptsize{Equipotentielles}};
\draw[color=red,domain=-3.14:3.14,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={3*sin(\x r)*3*sin(\x r)*5});
%r = angle en radian
%domain permet de définir le domaine dans lequel la fonction sera tracée
%samples=200 permet d'augmenter le nombre de points pour le tracé
%smooth améliore également la qualité de la trace
\draw[color=red,domain=-3.14:3.14,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={2*sin(\x r)*2*sin(\x r)*5});
\draw[color=blue,domain=-pi:pi,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={3*sqrt(abs(cos(\x r)))*15});
\draw[color=blue,domain=-pi:pi,samples=200,smooth] plot (canvas polar cs:angle=\x r,radius={2*sqrt(abs(cos(\x r)))*15});
\end{tikzpicture}

Fonction arctangente
images/schemas/math/arctan.svg

arctan.svg

\begin{tikzpicture}[scale=0.8]
\draw[very thin,color=gray] (-pi,pi) grid (-pi,pi);
\draw[->] (-pi,0) -- (pi,0) node[right] {$x$};
\draw[->] (0,-2) -- (0,2);
\draw[color=red,domain=-pi:pi,samples=150] plot ({\x},{rad(atan(\x))} )node[right,red] {$\arctan(x)$};
\draw[color=blue,domain=-pi:pi] plot ({\x},{rad(-atan(\x))} )node[right,blue] {$-\arctan(x)$};
%Le rad() est une autre façon de dire que l'argument est en radian
\end{tikzpicture}

To write all the .svg's to disk:

from bs4 import BeautifulSoup
import requests
# Python 3 location of urljoin (in Python 2 it was `from urlparse import urljoin`)
from urllib.parse import urljoin

cont = requests.get("http://www.physagreg.fr/schemas-figures-physique-svg-tikz.php").text

soup = BeautifulSoup(cont, "html.parser")
base_url = "http://www.physagreg.fr/"

h5s = soup.find_all("h5", limit=6)
for h5 in h5s:
    # find the first table after the header
    table = h5.find_next("table")
    # find all h3 elements in that table
    for h3 in table.select("h3"):
        print(h3.text)
        img = h3.find_next("img")
        src, title = img["src"], img["title"]
        # join base url and image url
        img_url = urljoin(base_url, src)
        # open the file in binary mode, using the title as the file name
        with open(title, "wb") as f:
            # request the image url and write the raw bytes
            f.write(requests.get(img_url).content)

This writes arctan.svg, courbe-Epeff.svg, and every other SVG on the page to disk.
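The question also asks for one folder per h5. A minimal sketch of that part, assuming a made-up snippet modelled on the question's markup, a hypothetical helper named `save_tikz`, and a temporary directory standing in for the real output location:

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Made-up sample modelled on the question's markup (not the real page).
sample = """
<h5>Title 1</h5>
<table><tr><td>
  <h3>SomeName1</h3>
  <img src="a.svg" alt="SomeName1" title="a.svg"><br>
  <p class="textcode">\\begin{tikzpicture}\\end{tikzpicture}</p>
</td></tr></table>
"""

def save_tikz(html, out_dir):
    """Write each <p> text to out_dir/<h5 text>/<h3 text>.tex; return the paths."""
    soup = BeautifulSoup(html, "html.parser")
    written = []
    for h5 in soup.find_all("h5"):
        # One folder per h5, named after its text.
        folder = os.path.join(out_dir, h5.get_text(strip=True))
        os.makedirs(folder, exist_ok=True)
        table = h5.find_next("table")
        for h3 in table.find_all("h3"):
            # The TikZ code sits in the first <p> after the h3.
            code = h3.find_next("p").get_text().strip()
            path = os.path.join(folder, h3.get_text(strip=True) + ".tex")
            with open(path, "w", encoding="utf-8") as f:
                f.write(code)
            written.append(path)
    return written

out_dir = tempfile.mkdtemp()  # hypothetical output location
paths = save_tikz(sample, out_dir)
print(paths)
```

Titles on a real page may contain characters that are not valid in file names, so they would likely need sanitizing first.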

edited Mar 27 '16 at 17:47

answered Mar 24 '16 at 20:29


Best answer so far! Thanks a lot for your time and your answer. Another question: how would you save the image from the path given by `img["src"]`? I tried `fsdopen` and `open`, but neither creates a file. – Le Duc Banal Mar 27 '16 at 10:57
No worries. Are you trying to save the actual image itself? – Padraic Cunningham Mar 27 '16 at 10:59
Yes, I am, using the name found by `img["title"]`. – Le Duc Banal Mar 27 '16 at 11:07
@SirC, I added how to download and write the SVGs. – Padraic Cunningham Mar 27 '16 at 11:24
Thanks a lot! That's perfect. Now I have to figure out why I get `ImportError: No module named 'urlparse'`, but that's outside the scope of this question. Thank you for your attention and help! – Le Duc Banal Mar 27 '16 at 11:57
@SirC, you're welcome. For Python 3, just change the import to `from urllib.parse import urljoin` and you will be fine. – Padraic Cunningham Mar 27 '16 at 11:59
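For reference, a small sketch of the import fix together with what `urljoin` does to a relative `src`. The relative path is taken from the output shown in the answer, and the try/except fallback is only one possible way to support both Python versions:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

base_url = "http://www.physagreg.fr/"
src = "images/schemas/math/arctan.svg"  # relative src from the answer's output

# A relative src is resolved against the base URL.
img_url = urljoin(base_url, src)
print(img_url)  # http://www.physagreg.fr/images/schemas/math/arctan.svg

# An absolute src is returned unchanged.
absolute = urljoin(base_url, "http://example.com/x.svg")
print(absolute)  # http://example.com/x.svg
```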
It worked! Thanks! Just one last thing: your way of downloading the SVG doesn't work. It does create a file with the right name, but the file can't be opened (the SVG can't be viewed). Firefox prints `Erreur d'analyse XML : erreur de syntaxe` (XML parsing error: syntax error). – Le Duc Banal Mar 27 '16 at 13:10
@SirC, what OS are you using? – Padraic Cunningham Mar 27 '16 at 13:16
I'm using Windows 10 x64. – Le Duc Banal Mar 27 '16 at 13:20
Just use `with open(title, "w") as f:` and remove the `b`; I did not actually mean to add it for the SVG. – Padraic Cunningham Mar 27 '16 at 13:26
Already tried; it outputs `TypeError: must be str, not bytes`. – Le Duc Banal Mar 27 '16 at 13:28
Duh, sorry, yes: use `.text` instead of `.content` for Python 3. – Padraic Cunningham Mar 27 '16 at 13:30
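The `TypeError` in this exchange comes from pairing bytes with a text-mode file on Python 3. A minimal offline sketch, assuming a tiny hand-written SVG as a stand-in for what `response.content` and `response.text` would hold:

```python
import os
import tempfile

# Stand-ins for response.content (bytes) and response.text (str).
svg_bytes = b'<svg xmlns="http://www.w3.org/2000/svg"></svg>'
svg_text = svg_bytes.decode("utf-8")

folder = tempfile.mkdtemp()
bin_path = os.path.join(folder, "binary.svg")
txt_path = os.path.join(folder, "text.svg")

# Bytes go with binary mode.
with open(bin_path, "wb") as f:
    f.write(svg_bytes)

# Text goes with text mode; the encoding is stated explicitly.
with open(txt_path, "w", encoding="utf-8") as f:
    f.write(svg_text)

# Mixing the two is what raises the TypeError on Python 3.
try:
    with open(os.path.join(folder, "mixed.svg"), "w") as f:
        f.write(svg_bytes)
    mixed_error = None
except TypeError as exc:
    mixed_error = exc
print(type(mixed_error).__name__)  # TypeError
```

Either pairing produces the same file here; binary mode with bytes avoids any guess about the text encoding.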
Still not working, same error. The SVG still doesn't display correctly, but thanks anyway! – Le Duc Banal Mar 27 '16 at 13:35
All the examples so far have worked on my Ubuntu machine, so it is definitely something Windows-specific. When you look inside the .svg files, what do you see? arctan.svg should look like http://pastebin.com/bprzwBYU – Padraic Cunningham Mar 27 '16 at 13:47
Indeed... `404 Not Found. The requested URL /Physagreg Schémas_figures tikz_svg pour la physique exemples avec codes_fichiers/ressort1.svg was not found on this server.` – Le Duc Banal Mar 27 '16 at 13:58
Leave it with me, I will suss it out when I get back to my computer. Just verify you are using the correct base URL. – Padraic Cunningham Mar 27 '16 at 14:05
@SirC, my fault, I had the wrong base_url; no idea how it worked for me. – Padraic Cunningham Mar 27 '16 at 17:48
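Since the broken files in this exchange turned out to contain a 404 page, one option is to check a response before writing it to disk. The sketch below uses a hypothetical helper, `looks_like_svg`, which is a cheap heuristic and not an XML parser. With `requests`, the two arguments would come from `response.status_code` and `response.content`, and `response.raise_for_status()` is an alternative for the status check alone:

```python
def looks_like_svg(status_code, body):
    """Return True for a 200 response whose body contains an <svg tag.

    Hypothetical helper for illustration only; it does not validate the XML.
    """
    return status_code == 200 and b"<svg" in body.lower()

# A well-formed SVG served with status 200 passes the check.
good = looks_like_svg(
    200, b'<?xml version="1.0"?><svg xmlns="http://www.w3.org/2000/svg"/>'
)

# An HTML error page served with status 404 does not.
bad = looks_like_svg(404, b"<html><head><title>404 Not Found</title></head></html>")

print(good, bad)  # True False
```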
