
I am working on a problem where I need to plot output on a map.
In the past I was able to do this using GeoPandas, but it does not work in a Databricks notebook. I tried to look for alternatives but couldn't find any on the web.

Pages I looked into:
GeoPandas Notebook
Processing Geospatial Data at Scale With Databricks

The second link mentions that we can read .shp files through Scala, but it does not explain what sc in ShapefileReader.readToGeometryRDD stands for.

%scala
var spatialRDD = new SpatialRDD[Geometry]
spatialRDD = ShapefileReader.readToGeometryRDD(sc, "/ml/blogs/geospatial/shapefiles/nyc")

var rawSpatialDf = Adapter.toDf(spatialRDD,spark)
rawSpatialDf.createOrReplaceTempView("rawSpatialDf") //DataFrame now available to SQL, Python, and R 
Ajay Verma

3 Answers


In the Databricks notebook:

  • Install geopandas: %pip install geopandas
  • Import geopandas: import geopandas as gpd
  • The path has to use the File API format: /dbfs/databricks/folderName
  • Create the GeoDataFrame: gdf = gpd.read_file('/dbfs/databricks/folderName')
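
Putting the bullets together, a minimal sketch; the %pip command goes in its own cell, and the DBFS folder and shapefile name below are placeholders for your own path:

%pip install geopandas

import geopandas as gpd

# File API format: prefix the DBFS path with /dbfs/
# (folder and file name below are placeholders)
gdf = gpd.read_file("/dbfs/databricks/folderName/my_shapefile.shp")
gdf.plot()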
tdy
  • Not sure why, but I am facing the following error with the solution above: DriverError: /FileStore/myfile.shp: No such file or directory – imguessing Oct 21 '22 at 09:09

Below are the steps to read .shp files in a Databricks notebook.

  1. Drag the .zip file to the Databricks data section and, once the notebook opens up, put the code below in the Python notebook.

  2. Inside the .zip archive, the shapefile is named tl_2020_us_state.shp. Replace this name with your own shapefile.

    from __future__ import print_function  # must be the first statement in the cell
    import geopandas
    from snowflake.connector.pandas_tools import write_pandas
    import os
    from dotenv import load_dotenv
    import re
    import io
    import fiona
    import fsspec
    from zipfile import ZipFile
    import requests
    import zipfile
    from io import StringIO
    import shapefile

    # open the shapefile components and load them into a GeoDataFrame
    myshp = open("tl_2020_us_state.shp", "rb")
    mydbf = open("tl_2020_us_state.dbf", "rb")
    shpfile_rd = shapefile.Reader(shp=myshp, shx=None, dbf=mydbf)
    shpfile_df = geopandas.GeoDataFrame.from_file('tl_2020_us_state.shp')
    shpfile_df.head()

    # for plotting the map
    shpfile_df.plot()
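
To render the map inline in the notebook output, a small sketch building on the shpfile_df GeoDataFrame above (the figure size and title are arbitrary choices):

    import matplotlib.pyplot as plt

    # draw the GeoDataFrame and show the figure in the notebook output
    ax = shpfile_df.plot(figsize=(12, 7))
    ax.set_title("tl_2020_us_state")
    plt.show()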

# Download the TIGER/Line 2020 state shapefile from the Census Bureau,
# unzip it locally, and read it with geopandas
import os
import zipfile

import earthpy as et
import geopandas
import requests

# Define the variables and the values
zipfile_name = 'tl_2020_us_state.zip'
shapefile_name = 'tl_2020_us_state.shp'
shapefile_url = 'https://www2.census.gov/geo/tiger/TIGER2020/STATE/' + zipfile_name

# Define the absolute path where the downloaded .zip file will be saved locally
shapefile_path = os.path.abspath(zipfile_name)

## Download the zipped shapefile and save it to the local path defined above
with requests.get(shapefile_url) as r:
    with open(shapefile_path, 'wb') as f:
        f.write(r.content)

# unzip the file next to the .zip and remember that directory,
# because the working directory changes below
extract_dir = os.path.dirname(shapefile_path)
with zipfile.ZipFile(shapefile_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)
    #print(extract_dir)

# download the earthpy sample data and set the home directory
# (note: this changes the current working directory)
et.data.get_data("spatial-vector-lidar")  ## for earthpy.io.DATA_URLS
os.chdir(os.path.join(et.io.HOME, 'earth-analytics'))  ## works on Mac or Linux

## import the shapefile using geopandas, from the directory it was extracted to
plot_locations_df = geopandas.read_file(
    os.path.join(extract_dir, shapefile_name))

## view the first 20 rows of the attribute table
plot_locations_df.head(20)

#print(plot_locations_df)
#plot_locations_df.head()
#plot_locations_df.plot()
#plot_locations_df
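
If the data should also be reachable through the /dbfs File API path used in the first answer, one option is to copy the extracted files into DBFS with dbutils. A sketch; the /FileStore/shapefiles target folder is just an example:

# copy the shapefile components from the driver's local disk into DBFS
# (dbutils is available by default in Databricks notebooks;
#  the target folder /FileStore/shapefiles is an arbitrary example)
for ext in ('.shp', '.shx', '.dbf', '.prj'):
    local_file = os.path.join(extract_dir, 'tl_2020_us_state' + ext)
    dbutils.fs.cp('file:' + local_file, 'dbfs:/FileStore/shapefiles/tl_2020_us_state' + ext)

# geopandas can then read the data through the File API path
gdf = geopandas.read_file('/dbfs/FileStore/shapefiles/tl_2020_us_state.shp')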