
I am working on a problem where I need to plot output on a map.
In the past I was able to do this using GeoPandas, but it does not work in a Databricks notebook. I tried to look for alternatives but couldn't find any on the web.

Pages I looked into:
GeoPandas Notebook
Processing Geospatial Data at Scale With Databricks

The second link mentions that we can read .shp files through Scala, but it does not explain what sc in ShapefileReader.readToGeometryRDD stands for.

%scala
var spatialRDD = new SpatialRDD[Geometry]
spatialRDD = ShapefileReader.readToGeometryRDD(sc, "/ml/blogs/geospatial/shapefiles/nyc")

var rawSpatialDf = Adapter.toDf(spatialRDD,spark)
rawSpatialDf.createOrReplaceTempView("rawSpatialDf") //DataFrame now available to SQL, Python, and R 
Ajay Verma

3 Answers


In the Databricks notebook:

  • Install geopandas: %pip install geopandas
  • Import geopandas: import geopandas as gpd
  • The path has to use the File API format: /dbfs/databricks/folderName
  • Create the GeoDataFrame: gdf = gpd.read_file('/dbfs/databricks/folderName')
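
Putting the bullets together, a minimal sketch; the %pip command goes in its own cell, and the DBFS folder and shapefile name below are placeholders for your own path:

%pip install geopandas

import geopandas as gpd

# File API format: prefix the DBFS path with /dbfs/
# (folder and file name below are placeholders)
gdf = gpd.read_file("/dbfs/databricks/folderName/my_shapefile.shp")
gdf.plot()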
tdy
  • Not sure why, but I am facing the following error with the solution above: DriverError: /FileStore/myfile.shp: No such file or directory – imguessing Oct 21 '22 at 09:09

Below are the steps to read .shp files in a Databricks notebook.

  1. Drag the .zip file to the Databricks data section and, once the notebook opens up, put the code below in the Python notebook.

  2. Inside the .zip archive, the shapefile is named tl_2020_us_state.shp. Replace this name with your own shapefile.

    from __future__ import print_function  # must be the first statement in the cell
    import geopandas
    from snowflake.connector.pandas_tools import write_pandas
    import os
    from dotenv import load_dotenv
    import re
    import io
    import fiona
    import fsspec
    from zipfile import ZipFile
    import requests
    import zipfile
    from io import StringIO
    import shapefile

    # open the shapefile components and load them into a GeoDataFrame
    myshp = open("tl_2020_us_state.shp", "rb")
    mydbf = open("tl_2020_us_state.dbf", "rb")
    shpfile_rd = shapefile.Reader(shp=myshp, shx=None, dbf=mydbf)
    shpfile_df = geopandas.GeoDataFrame.from_file('tl_2020_us_state.shp')
    shpfile_df.head()

    # for plotting the map
    shpfile_df.plot()
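
To render the map inline in the notebook output, a small sketch building on the shpfile_df GeoDataFrame above (the figure size and title are arbitrary choices):

    import matplotlib.pyplot as plt

    # draw the GeoDataFrame and show the figure in the notebook output
    ax = shpfile_df.plot(figsize=(12, 7))
    ax.set_title("tl_2020_us_state")
    plt.show()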

# Download the TIGER/Line 2020 state shapefile from the Census Bureau,
# unzip it locally, and read it with geopandas
import os
import zipfile

import earthpy as et
import geopandas
import requests

# Define the variables and the values
zipfile_name = 'tl_2020_us_state.zip'
shapefile_name = 'tl_2020_us_state.shp'
shapefile_url = 'https://www2.census.gov/geo/tiger/TIGER2020/STATE/' + zipfile_name

# Define the absolute path where the downloaded .zip file will be saved locally
shapefile_path = os.path.abspath(zipfile_name)

## Download the zipped shapefile and save it to the local path defined above
with requests.get(shapefile_url) as r:
    with open(shapefile_path, 'wb') as f:
        f.write(r.content)

# unzip the file next to the .zip and remember that directory,
# because the working directory changes below
extract_dir = os.path.dirname(shapefile_path)
with zipfile.ZipFile(shapefile_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)
    #print(extract_dir)

# download the earthpy sample data and set the home directory
# (note: this changes the current working directory)
et.data.get_data("spatial-vector-lidar")  ## for earthpy.io.DATA_URLS
os.chdir(os.path.join(et.io.HOME, 'earth-analytics'))  ## works on Mac or Linux

## import the shapefile using geopandas, from the directory it was extracted to
plot_locations_df = geopandas.read_file(
    os.path.join(extract_dir, shapefile_name))

## view the first 20 rows of the attribute table
plot_locations_df.head(20)

#print(plot_locations_df)
#plot_locations_df.head()
#plot_locations_df.plot()
#plot_locations_df
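
If the data should also be reachable through the /dbfs File API path used in the first answer, one option is to copy the extracted files into DBFS with dbutils. A sketch; the /FileStore/shapefiles target folder is just an example:

# copy the shapefile components from the driver's local disk into DBFS
# (dbutils is available by default in Databricks notebooks;
#  the target folder /FileStore/shapefiles is an arbitrary example)
for ext in ('.shp', '.shx', '.dbf', '.prj'):
    local_file = os.path.join(extract_dir, 'tl_2020_us_state' + ext)
    dbutils.fs.cp('file:' + local_file, 'dbfs:/FileStore/shapefiles/tl_2020_us_state' + ext)

# geopandas can then read the data through the File API path
gdf = geopandas.read_file('/dbfs/FileStore/shapefiles/tl_2020_us_state.shp')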