Normalizing data in R - dataframe construction

Question

I'm using R to connect to a database I have stored in PGAdmin so that I can add data into the database through R. Before adding the data into the database, it has to be normalized.

The dataset consists of employee data. If an employee is a manager, the managing column lists the employee ids of the employees that they manage.

I want to create a new dataframe in R that would represent my Managers table that has two columns: employee_id and manager_id (where manager_id is just the employee_id of the manager) that are both pulled from the original dataframe

How would I do this in R? Here's reproducible data:

> dput(test)
structure(list(first_name = c("Carrol", "Scott", "Michael", "Mary", 
"Jane", "Alex"), last_name = c("Dhin", "Peters", "Scott", "Smith", 
"Johnson", "Barter"), employee_id = c(412153L, 534253L, 643645L, 
765453L, 627234L, 174543L), email = c("carrol.dhin@company.com", 
"scott.peters@company.com", "michael.scott@company.com", "mary.smith@company.com", 
"jane.johnson@company.com", "alex.barter@company.com"), managing = c("174543", 
"", "", "", "534253, 643645", ""), department = c("Accounting", 
"Sales", "Sales", "Marketing", "Sales", "Accounting"), department_budget = c(500000L, 
1100000L, 1100000L, 750000L, 1100000L, 500000L), serial_number = c("KX6234", 
"FS5235", "LP5242", "GK6246", "KX6902", "BN7451"), manufacturer = c("Lenovo", 
"Lenovo", "Lenovo", "Lenovo", "Lenovo", "Lenovo"), model = c("X1 Gen 10", 
"T14s", "P1", "X1 Gen 10", "T15", "T14s"), date_assigned = c(44576L, 
44471L, 44341L, 44681L, 44606L, 44378L), installed_software = c("MS Office, Adobe Acrobat, Slack", 
"MS Office", "Mathcad, Adobe Acrobat", "", "MS Office, Slack", 
"Google Chrome")), class = "data.frame", row.names = c(NA, -6L
))

Could you please add reproducible data for the manager_id so we can understand how to connect with the data you `dput` already? Also, could please give an example of your desired output? — SEAnalyst, Nov 16 '22 at 00:29
@SEAnalyst the manager_id is just the employee_id of the manager, the managing column lists all of the ids that that employee manages, all the data is contained in the test dataframe which is what I'm trying to breakdown and normalize — amatof, Nov 16 '22 at 00:36

score 1 · Accepted Answer · answered Nov 16 '22 at 23:54

1

Given your sample data as df, you could use select() and filter() from dplyr to create a subset of managers.

library(dplyr)
managers <- df |> select("manager_id" =employee_id, 
                   "employee_id" =managing) %>%
  filter(.$employee_id!= "")

Result:

> managers
  manager_id    employee_id
1     412153         174543
2     627234 534253, 643645

answered Nov 16 '22 at 23:54

SEAnalyst

1,077
8
15

is there a way to separate that second row into two rows? so that each employee_id has their own observation, it's okay if manager_id repeats – amatof Nov 18 '22 at 17:01
1

Yes, `tidyr` is a good option for this. With given output answer as `managers`, you can use `managers |> tidyr::separate_rows(employee_id)` to separate row number 2 into rows 2 and 3. – SEAnalyst Nov 18 '22 at 18:49

Normalizing data in R - dataframe construction

1 Answers1