I am attempting to replicate the following paper, using year 2000 decennial Census data to create an index known as the Neighborhood Deprivation Index (NDI): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3261293/#CR73

I am particularly struggling with the very last step outlined in the "Component extraction and index construction" section of the paper. The final steps are:

Performing Principal Component Analysis, retaining the 1st principal component, on 8 variables: 1) % of males in management and professional occupations, 2) % of crowded housing, 3) % of households in poverty, 4) % of female-headed households with dependents, 5) % of households on public assistance, 6) % of households earning <$30,000 per year, 7) % with less than a high school education, 8) % unemployed.

Standardizing the index to have a mean of 0 and standard deviation (SD) of 1 by dividing the index by the square of the eigenvalue.

I am currently using the prcomp() function to perform the Principal Component Analysis. I am aware that I can obtain the eigenvalues by squaring the $sdev object from the prcomp() function.
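
That relationship can be checked on simulated data. The sketch below is illustrative only: the matrix `X` is made up (it is not the census data), and it relies on `prcomp()`'s default centering, under which the squared `sdev` values should match the eigenvalues of the sample covariance matrix.

```r
set.seed(1)
# Hypothetical stand-in for the census variables: 200 rows, 8 columns
X <- matrix(rnorm(200 * 8), nrow = 200, ncol = 8)

pca <- prcomp(X)                      # defaults: center = TRUE, scale. = FALSE
eig_from_pca <- pca$sdev^2            # variances of the principal components
eig_from_cov <- eigen(cov(X))$values  # eigenvalues of the covariance matrix

# Compare the two sets of eigenvalues (up to floating-point error)
all.equal(eig_from_pca, eig_from_cov)
```

With `center = FALSE`, `prcomp()` works on the raw values instead, so `sdev^2` then corresponds to the eigenvalues of `crossprod(X) / (nrow(X) - 1)` rather than of `cov(X)`.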

To follow this last step, should I be manually calculating the linear combination to apply to my census data, using the code below?

pca_2000 = prcomp(census_2000_vars, rank. = 1, center = FALSE, scale. = FALSE)

eigenvalues = pca_2000$sdev^2

loadings = pca_2000$rotation[1:8]

lin_comb = loadings / (eigenvalues^2)

  • You have a typo in your first line. The `rank.=` argument is misspelled as `.rank=`. If I understand the parts of the paper you quote, the index is `scale(pca_2000$x)`. – dcarlson Feb 07 '22 at 21:34
  • Thank you for catching that mistake. I've fixed it. Wouldn't putting the principal components in the `scale()` function standardize them by converting them into z-scores? Do you think that is effectively the same as the method the researchers used to standardize the output? – Barayjeemqaf Feb 07 '22 at 21:39
  • No. The principal component (you only have one) has a loading for each column/variable. The index should have a value for each row/observation so we need the principal component score, not the loading. – dcarlson Feb 07 '22 at 21:52
  • I think I understand what you're getting at. I was initially thinking the researchers meant that they divided the loadings by the square of the eigenvalues, altering the linear combination used to compute the principal components. I tried to carry out "my" way by applying `lin_comb` to the original data and calculating the principal components manually, but the result did not have mean 0 and SD 1. But given that there are 8 eigenvalues, one for each variable, how exactly am I to divide the principal components by the square of the eigenvalues? – Barayjeemqaf Feb 07 '22 at 22:10
  • I don't think the description in the paper is sufficient to know exactly what they did. Without at least the code they used, it is just guessing. You assumed the variables were not standardized before running principal components, but I didn't see a clear statement that they used raw, un-centered values. – dcarlson Feb 07 '22 at 23:53
  • You are right. I am making a lot of assumptions here. Unfortunately, I could not find the researchers' code attached to the paper or available anywhere else. – Barayjeemqaf Feb 08 '22 at 13:24
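
To make the loading-versus-score distinction from the comments concrete, here is a minimal sketch on simulated data. It is not the paper's method: it assumes the variables are centered before the PCA (the paper does not say, as noted above) and that the divisor is the square root of the first eigenvalue, i.e. the standard deviation of the first component. Under those assumptions the result should match `scale()` applied to the first-component scores. The matrix `X` and all object names are made up for illustration.

```r
set.seed(42)
# Hypothetical stand-in for the 8 census variables (500 tracts)
X <- matrix(rnorm(500 * 8), nrow = 500, ncol = 8)
X[, 2:8] <- X[, 2:8] + X[, 1]      # induce correlation between the columns

pca <- prcomp(X, center = TRUE, scale. = FALSE)

loadings_pc1 <- pca$rotation[, 1]  # one weight per variable (length 8)
scores_pc1   <- pca$x[, 1]         # one score per row (length 500)
eigenvalue1  <- pca$sdev[1]^2      # variance of the PC1 scores

# Assumption: divide the score by sqrt(eigenvalue), i.e. by sdev[1]
index_manual <- scores_pc1 / sqrt(eigenvalue1)
index_scale  <- as.numeric(scale(scores_pc1))

c(mean = mean(index_manual), sd = sd(index_manual))
all.equal(index_manual, index_scale)
```

With `center = FALSE` the scores generally do not have mean 0, so dividing them by any function of the eigenvalue would not by itself produce an index with mean 0.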
