URL unknown encoding: mbcs in Azure Databricks

Question

I download https://www.philadelphiafed.org/-/media/frbp/assets/surveys-and-data/ads/ads_index_most_current_vintage.xlsx?la=en&hash=6DF4E54DFAE3EDC347F80A80142338E7 .

Download Source：https://www.philadelphiafed.org/surveys-and-data/real-time-data-research/ads - Click "Most Current ADS Index Vintage" to download the file.

Notepad++ shows the download file is ANSI type.

Therefore, I use encoding = 'mbcs' to decode it. However, there is an error unknown encoding: mbcs

Orginal Code:

ads_url ="https://www.philadelphiafed.org/-/media/frbp/assets/surveys-and-data/ads/ads_index_most_current_vintage.xlsx?la=en&hash=6DF4E54DFAE3EDC347F80A80142338E7"
df_ads = pd.read_csv(ads_url, encoding = 'mbcs')
df_ads_spark = spark.createDataFrame(df_ads)

Does this answer your question? [how to read csv files with mbcs codec in Python on Linux?](https://stackoverflow.com/questions/61481806/how-to-read-csv-files-with-mbcs-codec-in-python-on-linux) — Ecstasy, Feb 04 '22 at 05:54
[MBCS encoding unknown](https://stackoverflow.com/questions/51245983/mbcs-encoding-unknown) — Ecstasy, Feb 04 '22 at 05:55
@DeepDave-MT when I type `df_ads = pd.read_csv(ads_url, encoding='mbcs')`, there is an error: 'utf-8' codec can't decode bytes in position 15-16: invalid continuation byte — Dicer, Feb 04 '22 at 17:02

score 0 · Answer 1 · answered Feb 04 '22 at 12:47

0

I’m not sure how you concluded you needed MBCS encoding to makes sense of the file but I believe the file is an Excel file. The .xlsx extension indicates it is a zip file (which explains why it’s unreadable in Notepad++) with parts representing the spreadsheet. You can read an .xlsx file in Databricks. No need to extract the zip file parts.

answered Feb 04 '22 at 12:47

GregGalloway

11,355
3
16
47

but when I just directly write down `df_ads = pd.read_csv(ads_url)`, there is an error: `'utf-8' codec can't decode bytes in position 15-16: invalid continuation byte`. I think the problem should be the encoding part. – Dicer Feb 04 '22 at 16:43
It’s an XLSX not a CSV. Use the XLSX example in the link I provided. – GregGalloway Feb 04 '22 at 17:58
but `spark.read.format("com.crealytics.spark.excel").option(“useHeader”, “true”).option(“inferSchema”, “true”).load(ads_url) ` does not work. Does it support reading URL? – Dicer Feb 04 '22 at 18:59
I'm not aware it supports a URL. The docs are at https://github.com/crealytics/spark-excel I expect you should download the file to mounted storage: https://peter.lalovsky.com/2021/07/azure/azure-databricks-extract-from-rest-api-and-save-json-file-in-azure-data-lake/ Then load from storage. – GregGalloway Feb 04 '22 at 19:25

Dicer · Accepted Answer · 2022-02-05T03:44:33.003

0

After google searching, I find this one works. If you are required to install packages, please install them.

import urllib.request
import chardet
from urllib.parse import unquote
import requests

ads_url = "https://www.philadelphiafed.org/-/media/frbp/assets/surveys-and-data/ads/ads_index_most_current_vintage.xlsx?la=en&hash=6DF4E54DFAE3EDC347F80A80142338E7"

r = requests.get(ads_url)

open('ads_index_most_current_vintage.xlsx?la=en&hash=6DF4E54DFAE3EDC347F80A80142338E7', 'wb').write(r.content)

df_ads = pd.read_excel('ads_index_most_current_vintage.xlsx?la=en&hash=6DF4E54DFAE3EDC347F80A80142338E7')

df_ads_spark = spark.createDataFrame(df_ads)

display(df_ads_spark)

edited Feb 05 '22 at 03:44

answered Feb 04 '22 at 20:50

Dicer

63
1
9

Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Feb 04 '22 at 22:15

URL unknown encoding: mbcs in Azure Databricks

2 Answers2