How to calculate the Pearson correlation coefficient (r) between population and percentage of land cover type?

Question

My following code outputs what appears to be a list of dictionaries for each census tract, which is basically like a designated area of land. I was able to calculate the population and the percentage of a few different land cover types. Now I want to calculate the Pearson correlation coefficient between the population and percentage of each land cover type.

What I am trying to do is extract/filter the list of dictionaries so that I can compare the population with each and every land cover type. Hence the Pearson correlations of:

population and developed land
population and barren land
population and forest land
...

Here is the code:

import geopandas as gpd
from rasterstats import zonal_stats
from rasterio.mask import mask
from rasterio.plot import show
import matplotlib.pyplot as plt
import numpy as np
import fiona
import rasterio
from scipy import stats
from rasterio.warp import calculate_default_transform, reproject, Resampling

mass_fp = r"New_Massachusetts.tif"

mass_tracts = gpd.read_file("Massachusetts/Massachusetts.shp");
dst_crs = 'EPSG:4269';


with rasterio.open('Massachusetts.tif') as src:
    transform, width, height = calculate_default_transform(
        src.crs, mass_tracts.crs, src.width, src.height, *src.bounds)
    kwargs = src.meta.copy()
    kwargs.update({
        'crs': mass_tracts.crs,
        'transform': transform,
        'width': width,
        'height': height
    })

    with rasterio.open('New_Mass.tif', 'w', **kwargs) as dst:
        for i in range(1, src.count + 1):
            reproject(
                source=rasterio.band(src, i),
                destination=rasterio.band(dst, i),
                src_transform=src.transform,
                src_crs=src.crs,
                dst_transform=transform,
                dst_crs=dst_crs,
                resampling=Resampling.nearest)



#Getting zonal stats
stats = zonal_stats("Massachusetts/Massachusetts.shp", "New_Mass.tif",stats="count",geojson_out=True, copy_properties=True,nodata_value=0,categorical=True);

#Variables for our loop below
total_pop=0.0;
total_pixel_count=0.0;
total_developed = 0.0;
total_water_ice = 0.0;
total_barren_land = 0.0;
total_forest = 0.0;

#Array to store our census track
census_tract_land_percentages = [];

#Looping through each tract in the stats data and getting the data we need and then storing it in a array with dictionaries
#[11,12], [21, 22, 23,24], 31, [41,42,43] 5 

for x in stats:
    total_pixel_count=x["properties"]["count"];
    total_census_population = x["properties"]["DP0010001"]
    total_developed= (float(x["properties"].get(21,0)+x.get(22,0)+x["properties"].get(23,0) + x["properties"].get(24,0))/total_pixel_count)*100;
    total_water_ice = (float(x["properties"].get(11,0)+x["properties"].get(12,0))/total_pixel_count)*100;
    total_barren_land=float(x["properties"].get(31,0)/total_pixel_count)*100;
    total_forest = (float(x["properties"].get(41,0)+x["properties"].get(42,0)+x["properties"].get(43,0))/total_pixel_count)*100;

    census_tract_land_percentages.append({"Total Population:":total_census_population,"Total Water Ice Cover":total_water_ice,"Total Developed":total_developed,
                                         "Total Barren Land":total_barren_land,"Total Forest":total_forest});

print(census_tract_land_percentages);

#Getting the total population for all census tracts
for x in mass_tracts["DP0010001"]:
    total_pop+=x

np_census_arr = np.asarray(census_tract_land_percentages);

Once this code runs I get the following list of dictionaries which I am wondering how I can extract/filter out the population and compare it with each and every land cover percentage and ultimately calculate the Pearson r correlation.

[{'Total Population:': 4585, 'Total Water Ice Cover': 2.848142234497044, 'Total Developed': 17.205368316575324, 'Total Barren Land': 0.22439908514219134, 'Total Forest': 34.40642126612868},

 {'Total Population:': 4751, 'Total Water Ice Cover': 1.047783534830167, 'Total Developed': 37.27115716753022, 'Total Barren Land': 0.11514104778353484, 'Total Forest': 19.11341393206678},

 {'Total Population:': 3214, 'Total Water Ice Cover': 0.09166603009701321, 'Total Developed': 23.50469788404247, 'Total Barren Land': 0.2597204186082041, 'Total Forest': 20.418608204109695},

 {'Total Population:': 5005, 'Total Water Ice Cover': 0.0, 'Total Developed': 66.37545713124746, 'Total Barren Land': 0.0, 'Total Forest': 10.68671271840715},

...
]

Any ideas how I can loop through this and then calculate the Pearson r for the total population variable against the percentage of each land cover type?

Thank you

If wanting to use `pandas` then just `pd.DataFrame(l).corr()` where `l` is your list of dictionaries. This will give you the entire symmetric correlation matrix of every pairwise column combination — ALollz, Dec 11 '18 at 19:23
you get a list of x and y ... and run `scipy.stats.pearson(x,y)` x would be your population y is the list of percentages of a land type it might be `pearsonr` ... i forget — Joran Beasley, Dec 11 '18 at 19:27
Would there be any way of extract out each each individual population for the census tract with each land cover type in each census tract. All the data is being stored at the end in the variable: np_census_arr = np.asarray(census_tract_land_percentages) — Yuen_, Dec 11 '18 at 19:31
@JoranBeasley Thanks! How would I exact out a list of the x's being the population and a list of y's being for the land cover types from what appears to be a list of dictionaries. — Yuen_, Dec 11 '18 at 19:32
If I just want to population first in 'x' list, then I guess at first while looping that would be the 0'th index of the list, and then the 0'th index of the dictionary from np_census_arr to get out each and every population from the census tract, so then I'd pool them together into one list. Then I'd have to have a list with multiple categories for each land cover type to be compared against the population to output a correlation value. — Yuen_, Dec 11 '18 at 19:34
We don't need any of the file-reading or transform code; you can delete everything except the last two paragraphs of the question; questions are supposed to give Minimal complete examples. If you do want a review of the entire (working) code, CodeReview.SE site is a better place to post. — smci, Dec 11 '18 at 21:18

score 0 · Answer 1 · answered Dec 11 '18 at 19:51

xs = []
y1s = []
y2s = []
y3s = []
y4s = []

for entry in entries:
    xs.append(entry['population'])
    ice = entry['Total Water Ice Cover']
    dev = entry['Total Developed']
    bar = entry['Total Barren Land']
    forest = entry['Total Forest']

    total_land = ice+dev+bar+forest
    y1s.append(ice/total_land)
    y2s.append(dev/total_land)
    y3s.append(bar/total_land)
    y4s.append(forest/total_land)



print(scipy.stats.pearsonr(xs,y1s)," = ICE") 
...

theres probably some tricks you can do with pandas however to simplify it

How to calculate the Pearson correlation coefficient (r) between population and percentage of land cover type?

1 Answers1