2

I have a homework assignment where I need to take a CSV file based around population data around the United States and do some data analysis on the data inside. I need to find the data that exists for my state and for starters run a Linear Regression Analysis to predict the size of the population.

I've been studying R for a few weeks now, went through a LinkedIn Learning training, as well as 2 different trainings on pluralsight about R. I have also tried searching for how to do a Linear Regression Analysis in R and I find plenty of examples for how to do it when the data is perfectly laid out in a table in just the right way to Analyze.

The CSV file is laid out so that each state is defined on a single line/row so I used the filter function to grab just the data for my State and put it into a variable.

Within that dataset the population data is defined across several columns with the most important data being the Population Estimates for each year from 2010 to 2018.

library(tidyverse)
population.data <- read_csv("nst-est2018-alldata.csv")
mn.state.data <- filter(population.data, NAME == "Minnesota")

I'm looking for some help to get headed in the right direction my thought is that I will need to create to containers of data 1 having each year from 2010 to 2018 and one that contains the population data for each of those years. And then use the xyplot function with those two containers? If you have some experience in this area please help me think this through I'm not looking for anybody to do the assignment for me just want some help trying to think it through.

Edit: Here is the results of the

dput(head(population.data))

command:

structure(list(SUMLEV = c("010", "020", "020", "020", "020", 
"040"), REGION = c("0", "1", "2", "3", "4", "3"), DIVISION = c("0", 
"0", "0", "0", "0", "6"), STATE = c("00", "00", "00", "00", "00", 
"01"), NAME = c("United States", "Northeast Region", "Midwest Region", 
"South Region", "West Region", "Alabama"), CENSUS2010POP = c(308745538L, 
55317240L, 66927001L, 114555744L, 71945553L, 4779736L), ESTIMATESBASE2010 
= c(308758105L, 
55318430L, 66929743L, 114563045L, 71946887L, 4780138L), POPESTIMATE2010 =
c(309326085L, 
55380645L, 66974749L, 114867066L, 72103625L, 4785448L), POPESTIMATE2011 = 
c(311580009L, 
55600532L, 67152631L, 116039399L, 72787447L, 4798834L), POPESTIMATE2012 =
c(313874218L, 
55776729L, 67336937L, 117271075L, 73489477L, 4815564L), POPESTIMATE2013 = 
c(316057727L, 
55907823L, 67564135L, 118393244L, 74192525L, 4830460L), POPESTIMATE2014 = 
c(318386421L, 
56015864L, 67752238L, 119657737L, 74960582L, 4842481L), POPESTIMATE2015 = 
c(320742673L, 
56047587L, 67869139L, 121037542L, 75788405L, 4853160L), POPESTIMATE2016 = 
c(323071342L, 
56058789L, 67996917L, 122401186L, 76614450L, 4864745L), POPESTIMATE2017 = 
c(325147121L, 
56072676L, 68156035L, 123598424L, 77319986L, 4875120L), POPESTIMATE2018 = 
c(327167434L, 
56111079L, 68308744L, 124753948L, 77993663L, 4887871L), NPOPCHG_2010 = 
c(567980L, 
62215L, 45006L, 304021L, 156738L, 5310L), NPOPCHG_2011 = c(2253924L, 
219887L, 177882L, 1172333L, 683822L, 13386L), NPOPCHG_2012 = c(2294209L, 
176197L, 184306L, 1231676L, 702030L, 16730L), NPOPCHG_2013 = c(2183509L, 
131094L, 227198L, 1122169L, 703048L, 14896L), NPOPCHG_2014 = c(2328694L, 
108041L, 188103L, 1264493L, 768057L, 12021L), NPOPCHG_2015 = c(2356252L, 
31723L, 116901L, 1379805L, 827823L, 10679L), NPOPCHG_2016 = c(2328669L, 
11202L, 127778L, 1363644L, 826045L, 11585L), NPOPCHG_2017 = c(2075779L, 
13887L, 159118L, 1197238L, 705536L, 10375L), NPOPCHG_2018 = c(2020313L, 
38403L, 152709L, 1155524L, 673677L, 12751L), BIRTHS2010 = c(987836L, 
163454L, 212614L, 368752L, 243016L, 14227L), BIRTHS2011 = c(3973485L, 
646265L, 834909L, 1509597L, 982714L, 59689L), BIRTHS2012 = c(3936976L, 
637904L, 830701L, 1504936L, 963435L, 59070L), BIRTHS2013 = c(3940576L, 
635741L, 830869L, 1504799L, 969167L, 57936L), BIRTHS2014 = c(3963195L, 
632433L, 836505L, 1525280L, 968977L, 58907L), BIRTHS2015 = c(3992376L, 
634515L, 837968L, 1545722L, 974171L, 59637L), BIRTHS2016 = c(3962654L, 
628039L, 831667L, 1541342L, 961606L, 59388L), BIRTHS2017 = c(3901982L, 
616552L, 816177L, 1519944L, 949309L, 58259L), BIRTHS2018 = c(3855500L, 
609336L, 804431L, 1499838L, 941895L, 57216L), DEATHS2010 = c(598691L, 
110848L, 140785L, 228706L, 118352L, 11073L), DEATHS2011 = c(2512442L, 
470816L, 586840L, 962751L, 492035L, 48818L), DEATHS2012 = c(2501531L, 
460985L, 584817L, 960575L, 495154L, 48364L), DEATHS2013 = c(2608019L, 
480032L, 605188L, 1011093L, 511706L, 50847L), DEATHS2014 = c(2582448L, 
470196L, 597078L, 1006057L, 509117L, 49692L), DEATHS2015 = c(2699826L, 
488881L, 626494L, 1052360L, 532091L, 51820L), DEATHS2016 = c(2703215L, 
480331L, 619471L, 1058173L, 545240L, 51662L), DEATHS2017 = c(2779436L, 
501022L, 620556L, 1092949L, 564909L, 53033L), DEATHS2018 = c(2814013L, 
506909L, 621030L, 1109152L, 576922L, 53425L), NATURALINC2010 = c(389145L, 
52606L, 71829L, 140046L, 124664L, 3154L), NATURALINC2011 = c(1461043L, 
175449L, 248069L, 546846L, 490679L, 10871L), NATURALINC2012 = c(1435445L, 
176919L, 245884L, 544361L, 468281L, 10706L), NATURALINC2013 = c(1332557L, 
155709L, 225681L, 493706L, 457461L, 7089L), NATURALINC2014 = c(1380747L, 
162237L, 239427L, 519223L, 459860L, 9215L), NATURALINC2015 = c(1292550L, 
145634L, 211474L, 493362L, 442080L, 7817L), NATURALINC2016 = c(1259439L, 
147708L, 212196L, 483169L, 416366L, 7726L), NATURALINC2017 = c(1122546L, 
115530L, 195621L, 426995L, 384400L, 5226L), NATURALINC2018 = c(1041487L, 
102427L, 183401L, 390686L, 364973L, 3791L), INTERNATIONALMIG2010 = 
c(178835L, 
45723L, 25158L, 68742L, 39212L, 928L), INTERNATIONALMIG2011 = c(792881L, 
206686L, 116948L, 285343L, 183904L, 4716L), INTERNATIONALMIG2012 = 
c(858764L, 
207584L, 120995L, 344198L, 185987L, 5874L), INTERNATIONALMIG2013 = 
c(850952L, 
194103L, 126681L, 329897L, 200271L, 5111L), INTERNATIONALMIG2014 = 
c(947947L, 
222685L, 134310L, 365281L, 225671L, 3753L), INTERNATIONALMIG2015 = 
c(1063702L, 
227275L, 142759L, 429088L, 264580L, 4685L), INTERNATIONALMIG2016 = 
c(1069230L, 
236718L, 144859L, 436795L, 250858L, 5950L), INTERNATIONALMIG2017 = 
c(953233L, 
215872L, 126013L, 404582L, 206766L, 3190L), INTERNATIONALMIG2018 = 
c(978826L, 
229700L, 127583L, 418418L, 203125L, 3344L), DOMESTICMIG2010 = c(0L, 
-32918L, -50873L, 90679L, -6888L, 1238L), DOMESTICMIG2011 = c(0L, 
-159789L, -186896L, 335757L, 10928L, -2239L), DOMESTICMIG2012 = c(0L, 
-205314L, -181285L, 336615L, 49984L, 59L), DOMESTICMIG2013 = c(0L, 
-216273L, -123814L, 293443L, 46644L, 2641L), DOMESTICMIG2014 = c(0L, 
-274391L, -182730L, 373439L, 83682L, -755L), DOMESTICMIG2015 = c(0L, 
-339996L, -234823L, 452879L, 121940L, -1553L), DOMESTICMIG2016 = c(0L, 
-372953L, -228200L, 442633L, 158520L, -1977L), DOMESTICMIG2017 = c(0L, 
-316879L, -161387L, 364465L, 113801L, 2065L), DOMESTICMIG2018 = c(0L, 
-292928L, -157048L, 345132L, 104844L, 5718L), NETMIG2010 = c(178835L, 
12805L, -25715L, 159421L, 32324L, 2166L), NETMIG2011 = c(792881L, 
46897L, -69948L, 621100L, 194832L, 2477L), NETMIG2012 = c(858764L, 
2270L, -60290L, 680813L, 235971L, 5933L), NETMIG2013 = c(850952L, 
-22170L, 2867L, 623340L, 246915L, 7752L), NETMIG2014 = c(947947L, 
-51706L, -48420L, 738720L, 309353L, 2998L), NETMIG2015 = c(1063702L, 
-112721L, -92064L, 881967L, 386520L, 3132L), NETMIG2016 = c(1069230L, 
-136235L, -83341L, 879428L, 409378L, 3973L), NETMIG2017 = c(953233L, 
-101007L, -35374L, 769047L, 320567L, 5255L), NETMIG2018 = c(978826L, 
-63228L, -29465L, 763550L, 307969L, 9062L), RESIDUAL2010 = c(0L, 
-3196L, -1108L, 4554L, -250L, -10L), RESIDUAL2011 = c(0L, -2459L, 
-239L, 4387L, -1689L, 38L), RESIDUAL2012 = c(0L, -2992L, -1288L, 
6502L, -2222L, 91L), RESIDUAL2013 = c(0L, -2445L, -1350L, 5123L, 
-1328L, 55L), RESIDUAL2014 = c(0L, -2490L, -2904L, 6550L, -1156L, 
-192L), RESIDUAL2015 = c(0L, -1190L, -2509L, 4476L, -777L, -270L
), RESIDUAL2016 = c(0L, -271L, -1077L, 1047L, 301L, -114L), RESIDUAL2017 = 
c(0L, 
-636L, -1129L, 1196L, 569L, -106L), RESIDUAL2018 = c(0L, -796L, 
-1227L, 1288L, 735L, -102L), RBIRTH2011 = c(12.79898857, 11.646389369, 
12.449493906, 13.0753983, 13.564866164, 12.455601786), RBIRTH2012 = 
c(12.589173852, 
11.454833676, 12.353389372, 12.900715293, 13.172754439, 12.287820829
), RBIRTH2013 = c(12.511116578, 11.384582534, 12.318197145, 12.770698648, 
13.1250523, 12.012410502), RBIRTH2014 = c(12.493440163, 11.301146646, 
12.363692308, 12.814734, 12.993051496, 12.179749675), RBIRTH2015 = 
c(12.493175596, 
11.324209532, 12.357461907, 12.843808208, 12.92441189, 12.301816868
), RBIRTH2016 = c(12.309933949, 11.20434042, 12.242454436, 12.663079639, 
12.619264908, 12.222387438), RBIRTH2017 = c(12.039095529, 10.996948983, 
11.989119413, 12.357287884, 12.333939366, 11.962999487), RBIRTH2018 = 
c(11.820984126, 
10.863177115, 11.789576855, 12.078306222, 12.128940451, 11.720998206
), RDEATH2011 = c(8.0928244199, 8.4846099623, 8.7504877826, 8.3388830191, 
6.7917918366, 10.187095914), RDEATH2012 = c(7.9990857588, 8.2779015368, 
8.6968381072, 8.2343067033, 6.7700904074, 10.060744313), RDEATH2013 = 
c(8.2803198685, 
8.5962112289, 8.9723230665, 8.5807898649, 6.9298356343, 10.542582104
), RDEATH2014 = c(8.1408206164, 8.4020820365, 8.8249187702, 8.4524499397, 
6.8267702932, 10.274434632), RDEATH2015 = c(8.4484528254, 8.7250748685, 
9.2388679994, 8.7443343664, 7.0592978512, 10.689339673), RDEATH2016 = 
c(8.3975028099, 
8.5692003816, 9.1188486402, 8.6935469035, 7.1552465339, 10.632332792
), RDEATH2017 = c(8.5756150392, 8.9363320099, 9.1155717285, 8.8857783149, 
7.3396052849, 10.889883997), RDEATH2018 = c(8.6277792774, 9.0371195009, 
9.1016891619, 8.9320830002, 7.4291216994, 10.944391939), RNATURALINC2011 = 
c(4.7061641498, 
3.161779407, 3.6990061239, 4.7365152812, 6.7730743272, 2.2685058724
), RNATURALINC2012 = c(4.5900880929, 3.1769321388, 3.656551265, 
4.66640859, 6.402664032, 2.2270765159), RNATURALINC2013 = c(4.2307967093, 
2.7883713049, 3.3458740787, 4.1899087829, 6.1952166656, 1.4698283977
), RNATURALINC2014 = c(4.3526195469, 2.89906461, 3.5387735378, 
4.3622840605, 6.1662812026, 1.9053150433), RNATURALINC2015 = 
c(4.0447227708, 
2.5991346635, 3.1185939072, 4.0994738414, 5.8651140389, 1.6124771946
), RNATURALINC2016 = c(3.912431139, 2.6351400388, 3.123605796, 
3.969532736, 5.4640183742, 1.5900546466), RNATURALINC2017 = 
c(3.4634804902, 
2.0606169731, 2.8735476848, 3.4715095687, 4.9943340813, 1.0731154898
), RNATURALINC2018 = c(3.1932048488, 1.8260576141, 2.687887693, 
3.1462232219, 4.6998187519, 0.7766062675), RINTERNATIONALMIG2011 = 
c(2.5539481982, 
3.7247036946, 1.7438348531, 2.4715029092, 2.5385138982, 0.9841112772
), RINTERNATIONALMIG2012 = c(2.7460490726, 3.7275831375, 1.7993217139, 
2.9505576333, 2.5429438207, 1.2219173785), RINTERNATIONALMIG2013 = 
c(2.7017267715, 
3.4759149144, 1.8781318506, 2.7997195452, 2.7121923767, 1.0597112344
), RINTERNATIONALMIG2014 = c(2.988275652, 3.9792291689, 1.9851256285, 
3.0689308523, 3.0260314993, 0.7759790947), RINTERNATIONALMIG2015 = 
c(3.3285982753, 
4.0561842059, 2.1052580818, 3.5654043717, 3.5102060089, 0.9664136698
), RINTERNATIONALMIG2016 = c(3.3215493142, 4.2230961065, 2.1323795548, 
3.5885415898, 3.2920380658, 1.2245437674), RINTERNATIONALMIG2017 = 
c(2.9410856198, 
3.8503376372, 1.8510505744, 3.2892897676, 2.6864164429, 0.6550398799
), RINTERNATIONALMIG2018 = c(3.0010858795, 4.0950670621, 1.8698304564, 
3.3695510667, 2.6156748143, 0.685035969), RDOMESTICMIG2011 = c(0, 
-2.879569389, -2.786843372, 2.9081645678, 0.1508443529, -0.467223314
), RDOMESTICMIG2012 = c(0, -3.686820778, -2.69589683, 2.8855541222, 
0.6834160664, 0.0122732593), RDOMESTICMIG2013 = c(0, -3.872925953, 
-1.835626629, 2.4903472978, 0.6316815776, 0.5475831286), RDOMESTICMIG2014 
= c(0, 
-4.903180146, -2.700781819, 3.1374707924, 1.1220952977, -0.156105573
), RDOMESTICMIG2015 = c(0, -6.067919504, -3.462920156, 3.7630900106, 
1.6177886489, -0.320350145), RDOMESTICMIG2016 = c(0, -6.653555548, 
-3.359190761, 3.6365043774, 2.0802759896, -0.40687782), RDOMESTICMIG2017 = 
c(0, 
-5.651919379, -2.370672066, 2.963134779, 1.4785645494, 0.4240305179
), RDOMESTICMIG2018 = c(0, -5.222289092, -2.301663494, 2.7793734944, 
1.350093835, 1.1713623417), RNETMIG2011 = c(2.5539481982, 0.845134306, 
-1.043008519, 5.379667477, 2.6893582511, 0.516887963), RNETMIG2012 = 
c(2.7460490726, 
0.0407623599, -0.896575116, 5.8361117555, 3.2263598871, 1.2341906378
), RNETMIG2013 = c(2.7017267715, -0.397011039, 0.0425052219, 
5.2900668429, 3.3438739543, 1.6072943629), RNETMIG2014 = c(2.988275652, 
-0.923950977, -0.71565619, 6.2064016447, 4.148126797, 0.6198735214
), RNETMIG2015 = c(3.3285982753, -2.011735298, -1.357662074, 
7.3284943823, 5.1279946578, 0.6460635248), RNETMIG2016 = c(3.3215493142, 
-2.430459441, -1.226811206, 7.2250459672, 5.3723140554, 0.8176659475
), RNETMIG2017 = c(2.9410856198, -1.801581742, -0.519621492, 
6.2524245465, 4.1649809923, 1.0790703978), RNETMIG2018 = c(3.0010858795, 
-1.12722203, -0.431833037, 6.1489245611, 3.9657686492, 1.8563983107
)), .Names = c("SUMLEV", "REGION", "DIVISION", "STATE", "NAME", 
"CENSUS2010POP", "ESTIMATESBASE2010", "POPESTIMATE2010", 
"POPESTIMATE2011", 
"POPESTIMATE2012", "POPESTIMATE2013", "POPESTIMATE2014", 
"POPESTIMATE2015", 
"POPESTIMATE2016", "POPESTIMATE2017", "POPESTIMATE2018", "NPOPCHG_2010", 
"NPOPCHG_2011", "NPOPCHG_2012", "NPOPCHG_2013", "NPOPCHG_2014", 
"NPOPCHG_2015", "NPOPCHG_2016", "NPOPCHG_2017", "NPOPCHG_2018", 
"BIRTHS2010", "BIRTHS2011", "BIRTHS2012", "BIRTHS2013", "BIRTHS2014", 
"BIRTHS2015", "BIRTHS2016", "BIRTHS2017", "BIRTHS2018", "DEATHS2010", 
"DEATHS2011", "DEATHS2012", "DEATHS2013", "DEATHS2014", "DEATHS2015", 
"DEATHS2016", "DEATHS2017", "DEATHS2018", "NATURALINC2010", 
"NATURALINC2011", 
"NATURALINC2012", "NATURALINC2013", "NATURALINC2014", "NATURALINC2015", 
"NATURALINC2016", "NATURALINC2017", "NATURALINC2018", 
"INTERNATIONALMIG2010", 
"INTERNATIONALMIG2011", "INTERNATIONALMIG2012", "INTERNATIONALMIG2013", 
"INTERNATIONALMIG2014", "INTERNATIONALMIG2015", "INTERNATIONALMIG2016", 
"INTERNATIONALMIG2017", "INTERNATIONALMIG2018", "DOMESTICMIG2010", 
"DOMESTICMIG2011", "DOMESTICMIG2012", "DOMESTICMIG2013", 
"DOMESTICMIG2014", 
"DOMESTICMIG2015", "DOMESTICMIG2016", "DOMESTICMIG2017", 
"DOMESTICMIG2018", 
"NETMIG2010", "NETMIG2011", "NETMIG2012", "NETMIG2013", "NETMIG2014", 
"NETMIG2015", "NETMIG2016", "NETMIG2017", "NETMIG2018", "RESIDUAL2010", 
"RESIDUAL2011", "RESIDUAL2012", "RESIDUAL2013", "RESIDUAL2014", 
"RESIDUAL2015", "RESIDUAL2016", "RESIDUAL2017", "RESIDUAL2018", 
"RBIRTH2011", "RBIRTH2012", "RBIRTH2013", "RBIRTH2014", "RBIRTH2015", 
"RBIRTH2016", "RBIRTH2017", "RBIRTH2018", "RDEATH2011", "RDEATH2012", 
"RDEATH2013", "RDEATH2014", "RDEATH2015", "RDEATH2016", "RDEATH2017", 
"RDEATH2018", "RNATURALINC2011", "RNATURALINC2012", "RNATURALINC2013", 
"RNATURALINC2014", "RNATURALINC2015", "RNATURALINC2016", 
"RNATURALINC2017", 
"RNATURALINC2018", "RINTERNATIONALMIG2011", "RINTERNATIONALMIG2012", 
"RINTERNATIONALMIG2013", "RINTERNATIONALMIG2014", "RINTERNATIONALMIG2015", 
"RINTERNATIONALMIG2016", "RINTERNATIONALMIG2017", "RINTERNATIONALMIG2018", 
"RDOMESTICMIG2011", "RDOMESTICMIG2012", "RDOMESTICMIG2013", 
"RDOMESTICMIG2014", 
"RDOMESTICMIG2015", "RDOMESTICMIG2016", "RDOMESTICMIG2017", 
"RDOMESTICMIG2018", 
"RNETMIG2011", "RNETMIG2012", "RNETMIG2013", "RNETMIG2014", "RNETMIG2015", 
"RNETMIG2016", "RNETMIG2017", "RNETMIG2018"), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))
darcsider
  • 33
  • 4
  • Well formulated, i thank you for asking not for solutions but directions. Could you provide an output example using `dput(head(population.data))`. From your description it is unclear to me, whether you mean that states are written in wide format (1 line has many years), and that you want to have it in the normal long format (1 column contains states, 1 contains years and others contain the remaining info ). If this is the case, an alternative is to look at the `melt` function from `reshape2` or `data.table` (The examples of data.table's help page are easier to follow in my experience). – Oliver Apr 22 '19 at 07:42
  • So the way the CSV is setup when I open it in Excel to look at the data each "row" in excel is one specific states data. It has 38 columns like Name which equals the state and then columns like 2010 Population Estimates, 2011 Population Estimates, etc. I'm just trying to figure out the best way to take the data from this CSV file and be able to for starters run the Linear Regression Analysis, from there I can probably figure out the rest of the assignment but just wrapping my head around how to do this is tripping me up. – darcsider Apr 22 '19 at 13:59
  • 1
    running `dput(head(population.data))` (at the end of your code example) will output an example of your data. I will brew up a short answer. – Oliver Apr 22 '19 at 14:21

1 Answers1

0

In order to help you out, an example data using dput(head(population.data)) would be helpful. Based on your comments, your data is in what is called 'wide' format, meaning each observation is contained in a column, rather than a row (pupulation 2010, population 2011 etc.).

As i hinted in my comment, a sub-goal within statistical modelling is always to clean and reshape data to a proper format, that will work for running models. In this case the problem is that your format is in an incorrect shape. The most common is likely melting to long format via the reshape2 or data.table package as explained in this link. I personally prefer the data.table package, as it seems to have better large scale performance. Their usage however is identical.

Lets say you have a column 'NAME' for states and 9 columns for population estimates (2010 population estimates, 2011 population estimates and so on), we could then convert these columns into a long format, using melt from either of the two suggested packages (They are identical in use)

require(data.table)
value_columns <- paste(2010:2018, "Population Estimates")
population.data_long <- melt(population.data, id.vars = "NAME", 
                             measure.vars = value_columns, #Columns containing values we (that are grouped by their column names) 
                             variable.name = 'Year (Population Estimate)', #Name of the column which tells us [(Year) Population Estimate]
                             value.name = 'Population Estimate') #Name of the column with values
population.data_long$year <- as.integer(substr(population.data_long$`Year (Population Estimate)`, 1, 4)) #Create a year column in a bit of a hacky way

Note i have ignored any additional columns, and these should be included in your melt statement. From here on a linear regression should follow any standard example that you have found.

Oliver
  • 8,169
  • 3
  • 15
  • 37
  • 1
    Thank you for your help I'm sorry I didn't post any data yet I was at work all day and didn't have a chance to even get back on here until now. Your information that you have given me is a great start for me. I ran the command you suggested and it came back with a LARGE amount of data that StackOverflow says is to long for a comment. I will take what you have given me for info and continue working on this project. – darcsider Apr 23 '19 at 01:37