I'm trying to scrape a Wikipedia table with Python from within RStudio (in an R Markdown document) via reticulate. I couldn't manage it in R: I tried rvest, but the columns end up misaligned and I can't figure out why. That is why I'm using Python: I have an r-reticulate Conda env with BeautifulSoup and requests installed.

The code I've written runs flawlessly in my Jupyter notebook running the r-reticulate kernel.

However, when I run it in RStudio, I get an ImportError saying lxml was not found. That can't be right, because lxml is installed, as you can see in the conda list output at the bottom (and as evidenced by my working notebook).

Here is my full code:

```{r libraries, include=FALSE}
library(reticulate)
use_condaenv("r-reticulate", required = TRUE)
```

```{python results="hide"}
import pandas as pd
import requests
from bs4 import BeautifulSoup
```

```{python}
url = "https://en.wikipedia.org/wiki/COVID-19_lockdowns"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")
table = soup.find("table", {"class": "wikitable"})
dfs = pd.read_html(str(table))               # this is the line that generates the error
df = dfs[0]
df.head(20)
```

This is the error output from the last chunk:

ImportError: lxml not found, please install it

Detailed traceback: 
  File "<string>", line 1, in <module>
  File "C:\PROGRA~3\ANACON~1\envs\R-RETI~1\lib\site-packages\pandas\util\_decorators.py", line 299, in wrapper
    return func(*args, **kwargs)
  File "C:\PROGRA~3\ANACON~1\envs\R-RETI~1\lib\site-packages\pandas\io\html.py", line 1100, in read_html
    displayed_only=displayed_only,
  File "C:\PROGRA~3\ANACON~1\envs\R-RETI~1\lib\site-packages\pandas\io\html.py", line 889, in _parse
    parser = _parser_dispatch(flav)
  File "C:\PROGRA~3\ANACON~1\envs\R-RETI~1\lib\site-packages\pandas\io\html.py", line 846, in _parser_dispatch
    raise ImportError("lxml not found, please install it")

The env name is truncated (R-RETI~1), but I don't have any other env whose name starts this way, so I'm sure it is the correct env. py_config() also shows that the correct env is being used. I don't understand what is going on or which component is misbehaving (is it reticulate?).

python:         C:/ProgramData/Anaconda3/envs/r-reticulate/python.exe
libpython:      C:/ProgramData/Anaconda3/envs/r-reticulate/python37.dll
pythonhome:     C:/ProgramData/Anaconda3/envs/r-reticulate
version:        3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 15:37:01) [MSC v.1916 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:/ProgramData/Anaconda3/envs/r-reticulate/Lib/site-packages/numpy
numpy_version:  1.20.1

NOTE: Python version was forced by use_python function
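
A minimal diagnostic sketch, assuming the problem is a failed import inside the embedded interpreter rather than a missing package. As far as I can tell, pandas reports "lxml not found" whenever its own attempt to import lxml fails, so importing `lxml.etree` directly in a Python chunk may surface the underlying error (for example a DLL that cannot be loaded). `diagnose_import` is a hypothetical helper name, not part of any library.

```python
import importlib
import sys


def diagnose_import(module_name):
    """Try to import a module and report what actually happened."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # keep the real exception type and message
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}", "file": None}
    return {"ok": True, "error": None, "file": getattr(module, "__file__", None)}


# Which interpreter and environment is this chunk really running in?
print("interpreter:", sys.executable)
print("prefix:", sys.prefix)

# Import the module pandas needs, without pandas in between.
print(diagnose_import("lxml.etree"))
```

If the reported interpreter and prefix match the py_config() output and the import still fails, the error string should say more than the generic pandas message does.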

Output of conda list:

(r-reticulate) C:\[...]>conda list

# packages in environment at C:\ProgramData\Anaconda3\envs\r-reticulate:
#
# Name                    Version                   Build  Channel
backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
backports                 1.0                        py_2    conda-forge
backports.functools_lru_cache 1.6.1                      py_0    conda-forge
beautifulsoup4            4.9.3              pyhb0f4dca_0    conda-forge
brotlipy                  0.7.0           py37hcc03f2d_1001    conda-forge
bs4                       4.9.3                         0    conda-forge
ca-certificates           2020.12.5            h5b45459_0    conda-forge
certifi                   2020.12.5        py37h03978a9_1    conda-forge
cffi                      1.14.5           py37hd8e9650_0    conda-forge
chardet                   4.0.0            py37h03978a9_1    conda-forge
colorama                  0.4.4              pyh9f0ad1d_0    conda-forge
cryptography              3.4.6            py37h20c650d_0    conda-forge
cycler                    0.10.0                     py_2    conda-forge
decorator                 4.4.2                      py_0    conda-forge
freetype                  2.10.4               h546665d_1    conda-forge
icu                       68.1                 h0e60522_0    conda-forge
idna                      2.10               pyh9f0ad1d_0    conda-forge
intel-openmp              2020.3             h57928b3_311    conda-forge
ipykernel                 5.5.0            py37heaed05f_1    conda-forge
ipython                   7.21.0           py37heaed05f_0    conda-forge
ipython_genutils          0.2.0                      py_1    conda-forge
jedi                      0.18.0           py37h03978a9_2    conda-forge
jpeg                      9d                   h8ffe710_0    conda-forge
jupyter_client            6.1.12             pyhd8ed1ab_0    conda-forge
jupyter_core              4.7.1            py37h03978a9_0    conda-forge
kiwisolver                1.3.1            py37h8c56517_1    conda-forge
lcms2                     2.12                 h2a16943_0    conda-forge
libblas                   3.9.0                     8_mkl    conda-forge
libcblas                  3.9.0                     8_mkl    conda-forge
libclang                  11.1.0          default_h5c34c98_0    conda-forge
libiconv                  1.16                 he774522_0    conda-forge
liblapack                 3.9.0                     8_mkl    conda-forge
libpng                    1.6.37               h1d00b33_2    conda-forge
libsodium                 1.0.18               h8d14728_1    conda-forge
libtiff                   4.2.0                hc10be44_0    conda-forge
libxml2                   2.9.10               hf5bbc77_3    conda-forge
libxslt                   1.1.33               h65864e5_2    conda-forge
lxml                      4.6.2            py37hd07aab1_1    conda-forge
lz4-c                     1.9.3                h8ffe710_0    conda-forge
m2w64-gcc-libgfortran     5.3.0                         6    conda-forge
m2w64-gcc-libs            5.3.0                         7    conda-forge
m2w64-gcc-libs-core       5.3.0                         7    conda-forge
m2w64-gmp                 6.1.0                         2    conda-forge
m2w64-libwinpthread-git   5.0.0.4634.697f757               2    conda-forge
matplotlib                3.3.4            py37h03978a9_0    conda-forge
matplotlib-base           3.3.4            py37h3379fd5_0    conda-forge
mkl                       2020.4             hb70f87d_311    conda-forge
msys2-conda-epoch         20160418                      1    conda-forge
numpy                     1.20.1           py37hd20adf4_0    conda-forge
olefile                   0.46               pyh9f0ad1d_1    conda-forge
openssl                   1.1.1j               h8ffe710_0    conda-forge
pandas                    1.2.3            py37h08fd248_0    conda-forge
parso                     0.8.1              pyhd8ed1ab_0    conda-forge
patsy                     0.5.1                      py_0    conda-forge
pickleshare               0.7.5                   py_1003    conda-forge
pillow                    8.1.2            py37h96663a1_0    conda-forge
pip                       21.0.1             pyhd8ed1ab_0    conda-forge
prompt-toolkit            3.0.17             pyha770c72_0    conda-forge
pycparser                 2.20               pyh9f0ad1d_2    conda-forge
pygments                  2.8.1              pyhd8ed1ab_0    conda-forge
pyopenssl                 20.0.1             pyhd8ed1ab_0    conda-forge
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pyqt                      5.12.3           py37h03978a9_7    conda-forge
pyqt-impl                 5.12.3           py37hf2a7229_7    conda-forge
pyqt5-sip                 4.19.18          py37hf2a7229_7    conda-forge
pyqtchart                 5.12             py37hf2a7229_7    conda-forge
pyqtwebengine             5.12.1           py37hf2a7229_7    conda-forge
pysocks                   1.7.1            py37h03978a9_3    conda-forge
python                    3.7.10          h7840368_100_cpython    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.7                     1_cp37m    conda-forge
pytz                      2021.1             pyhd8ed1ab_0    conda-forge
pywin32                   300              py37hcc03f2d_0    conda-forge
pyzmq                     22.0.3           py37hcce574b_1    conda-forge
qt                        5.12.9               h5909a2a_4    conda-forge
requests                  2.25.1             pyhd3deb0d_0    conda-forge
scipy                     1.6.0            py37h6db1a17_0    conda-forge
seaborn                   0.11.1               hd8ed1ab_1    conda-forge
seaborn-base              0.11.1             pyhd8ed1ab_1    conda-forge
setuptools                49.6.0           py37h03978a9_3    conda-forge
six                       1.15.0             pyh9f0ad1d_0    conda-forge
soupsieve                 2.0.1                      py_1    conda-forge
sqlite                    3.34.0               h8ffe710_0    conda-forge
statsmodels               0.12.2           py37hda49f71_0    conda-forge
tk                        8.6.10               h8ffe710_1    conda-forge
tornado                   6.1              py37hcc03f2d_1    conda-forge
traitlets                 5.0.5                      py_0    conda-forge
urllib3                   1.26.3             pyhd8ed1ab_0    conda-forge
vc                        14.2                 hb210afc_4    conda-forge
vs2015_runtime            14.28.29325          h5e1d092_4    conda-forge
wcwidth                   0.2.5              pyh9f0ad1d_2    conda-forge
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
win_inet_pton             1.1.0            py37h03978a9_2    conda-forge
wincertstore              0.2             py37h03978a9_1006    conda-forge
xz                        5.2.5                h62dcd97_1    conda-forge
zeromq                    4.3.4                h0e60522_0    conda-forge
zlib                      1.2.11            h62dcd97_1010    conda-forge
zstd                      1.4.9                h6255e5f_0    conda-forge

EDIT: For reasons unknown, and without my changing anything, it now works. I guess the system needed one more reboot.

MonkeyBack
  • Unrelated, but would you mind explaining why you are writing *and* running Python code in RStudio? – baduker Mar 15 '21 at 17:22
  • Do you have more than one Python version installed? The problem is not BeautifulSoup but that pandas uses lxml under the hood. – QHarr Mar 15 '21 at 17:58
  • @QHarr You're right, I found that out while looking for the specific line that produced the error but didn't think that I could edit the title. Turns out I can :) As for python versions, yes I have multiple Conda environments. I thought I checked that it was the specified one that is being used, but there's obviously a problem somewhere, so if you have other ways of checking, I'm all ears. – MonkeyBack Mar 15 '21 at 21:23
  • @baduker I already have a full R project. Up until now I was "hand-scraping" some of the data (lockdown dates) via a web tool to generate a csv, that I then import and clean with R. But this is not practical, as every time I knit the Rmarkdown, I need to manually update this data. As stated in my post, I failed at correctly importing the data with rvest, and since RStudio now allows python and R to cohabitate, I thought this would be the ideal solution (because my python solution works). But I'm stuck... – MonkeyBack Mar 15 '21 at 21:29
  • Side note: rvest failed due to "merged cells". You would need to loop all table rows and columns and account for the rowspan/colspan values to reconstitute the table as it appears on the page. It's a faff and I can understand why you prefer to use pandas here. I'm honestly surprised there isn't a popular R package to handle this situation given how common it is. – QHarr Mar 15 '21 at 21:46
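
The rowspan/colspan bookkeeping described in the last comment can be sketched roughly as follows. This is an illustration under assumptions, not what rvest or pandas actually do internally: each cell is assumed to be already extracted as a `(text, rowspan, colspan)` tuple, and `expand_spans` is a hypothetical helper name.

```python
def expand_spans(rows):
    """Expand merged cells into a rectangular grid.

    rows: list of table rows, each a list of (text, rowspan, colspan) tuples
    in the order the cells appear in the HTML.
    """
    grid = {}  # (row index, column index) -> cell text
    for r, cells in enumerate(rows):
        c = 0
        for text, rowspan, colspan in cells:
            # Skip positions already occupied by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the text into every position the merged cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Made-up example data: "A" spans two rows, the last cell spans two columns.
example = [
    [("Country", 1, 1), ("Start", 1, 1), ("End", 1, 1)],
    [("A", 2, 1), ("2020-03-01", 1, 1), ("2020-04-01", 1, 1)],
    [("2020-10-01", 1, 2)],
]
for row in expand_spans(example):
    print(row)
```

Without this step, the third row would have a single cell and every value in it would slide into the wrong column, which matches the misalignment described in the question.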

0 Answers