I hope one of you can help me - I have been trying loads of different ways of doing this and can't seem to find the right answer. I am fairly new to R, but have been writing a script to format some data that I have. Ultimately, I will want to run this script weekly as the data comes in.

I have a list of breed codes (1 - 80), many of which (but not all) have a corresponding 3-character country code (e.g. GBR or NLD). What I want to do is create a new column in my data frame that holds the country code corresponding to the breed code.

One of the problems I'm having is that not all of the numbers (1 - 80) have a corresponding country code. So I can't create a vector with them all in as they are not of the same type.

If there is no associated country code, I would like the country code to be the number of the breed code. For example, breed code 6 has no associated country, so I would want "6" to populate the relevant field in my new sire_country column.

In case it helps, I have added the script I have been trying to use, to no avail!

#denoting country codes for breed codes 1-80
breed_country <- c(
  "GBR", "GBR", "GBR", "GBR", "GBR", "6", "GBR", "8", "9", "10",
  "11", "GBR", "NZL", "GBR", "GBR", "16", "DNK", "18", "19", "GBR",
  "21", "GBR", "23", "24", "25", "26", "CHE", "28", "29", "30",
  "31", "32", "33", "34", "35", "36", "37", "38", "39", "40",
  "41", "42", "CZE", "44", "45", "IRL", "AUS", "POL", "DEU", "50",
  "51", "SWE", "DEU", "ESP", "55", "56", "57", "58", "SWE", "DEU",
  "DNK", "NZL", "NLD", "CAN", "USA", "66", "67", "68", "USA", "70",
  "FRA", "ITA", "FIN", "JEY", "GGY", "76", "NOR", "78", "79", "80"
)

breed_id<-c("Sire.Breed")

sire_country<-breed_country[breed_id]

sire_country[is.na("Sire.ID")]<-""


#the output looks like
    sire_country
 [1] NA


#when I add sire_country to my data frame, I get


sire_country
1                       <NA>
2                       <NA>
3                       <NA>
4                       <NA>  
5                       <NA>
6                       <NA> 
7                       <NA>
8                       <NA>
9                       <NA>
10                      <NA>
11                      <NA>
12                      <NA>
13                      <NA>
14                      <NA>
15                      <NA>

# "Sire.Breed" is a column containing numerical breed codes in the data frame: df
# sire_country is what I want the new column with the country codes in to be called
# if there is no "Sire.ID" present, I want the field to remain blank - I have used this function elsewhere and it works fine

My data is read from a .csv file. Unfortunately I can not post it, as it is confidential. But a fictional example would be:

animal  name    breed   Mother  Father  ID              Company DOB
1       Alice   2       Vera    Tom     123456789012    Heinz   12/05/2017
2       Kate    63      Lucy    Jack    123456987147    Google  03/06/2017

(I can't format the table better, sorry)

Then I would want the country code that relates to the breed (2 or 63 in this case) to be added at the end, like so:

animal  name    breed   Mother  Father  ID              Company DOB          Country
1       Alice   2       Vera    Tom     123456789012    Heinz   12/05/2017   GBR
2       Kate    63      Lucy    Jack    123456987147    Google  03/06/2017   NLD

Apologies if I have used the wrong language throughout this, I'm still learning! Any help you can give me would be very much appreciated.

Thank you!

Djork
KatySpi
  • can you please provide your data as well, not the country code alone – Hardik Gupta Oct 16 '17 at 08:36
  • Start off with data structure that actually has columns. I.e. `data.frame(code = 1:80, country = breed_country)`. Could you please provide us visual representation of expected output (create it manually). And 80 rows is overkill, 10 is more than enough to get the point across. – statespace Oct 16 '17 at 08:40
  • I am struggling to understand the difference between the `breed` column and the new column you wish to create. it would be helpful if your example showed the different scenarios with respect to the `breed` columns and the desired column (*e.g.,* what kind of value of `breed` maps onto what value of outcome col). – Milan Valášek Oct 16 '17 at 08:57

1 Answer

You should learn different ways to index vectors, matrices and data frames, e.g. http://www.cookbook-r.com/Basics/Indexing_into_a_data_structure/

As an exercise you can see the output of:

breed_country[2]
breed_country[c(2, 65, 10, 80)] 

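As you will notice, the order of the elements in breed_country corresponds to the breed codes 1:80: element 6 is "6", element 63 is "NLD", and so on. Every element is a character string (the numbers are quoted), so the vector has a single type. You can therefore index breed_country directly by breed code, as in the exercise.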

Now you will use df$breed, which is the column of your data frame corresponding to breed codes, to index your breed_country vector.

As you can see df$breed gives you a vector of breed codes in the order seen in your data frame:

df$breed # breed codes of df
breed_country[df$breed] # index breed_country by breed codes in df
df$Country <- breed_country[df$breed] # assign to new column "Country"
head(df) # print first 6 rows of df
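
If you want to try this without your confidential data, here is a minimal sketch. It assumes a made-up data frame with only a few columns (the third row, "Beth", is invented to show a breed with no country), and it builds a shortened lookup vector in which only some of the country codes are filled in, so substitute your full breed_country vector in practice.

```r
# Start with every breed code mapping to itself, stored as text.
breed_country <- as.character(1:80)
# Overwrite the codes that have a country (only a few are filled in here).
breed_country[c(1:5, 7, 12, 14, 15, 20, 22)] <- "GBR"
breed_country[63] <- "NLD"

# Made-up data, loosely based on the fictional example in the question.
df <- data.frame(
  animal = 1:3,
  name   = c("Alice", "Kate", "Beth"),
  breed  = c(2, 63, 6),
  stringsAsFactors = FALSE
)

# Index the lookup vector by position, using the breed codes.
df$Country <- breed_country[df$breed]
print(df)
```

Building the vector from as.character(1:80) and then overwriting the known codes is one way to avoid typing the fallback numbers by hand. Rows 1 and 2 should come out as GBR and NLD, and the breed 6 row should get "6".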

Here is where you went wrong:

breed_id<-c("Sire.Breed")
breed_country[breed_id]

This is equivalent to:

breed_country["Sire.Breed"]

Yet breed_country has no names at all, let alone an element named "Sire.Breed", so your output sire_country is NA.
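
To see the difference between indexing by name and by position, here is a small sketch on a made-up three-element vector (x and its names are purely illustrative):

```r
x <- c("GBR", "GBR", "NZL")

# A character index looks for an element with that NAME.
# x has no names, so nothing matches and the result is NA.
by_missing_name <- x["Sire.Breed"]
print(by_missing_name)

# Once names are set, the same kind of lookup finds an element.
names(x) <- c("1", "2", "3")
by_name <- x["3"]
print(by_name)

# A numeric index picks elements by POSITION instead.
by_position <- x[3]
print(by_position)
```

In your script, breed_id holds the text "Sire.Breed" rather than the numbers stored in that column, which is why you get a single NA back.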

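Further, is.na("Sire.ID") asks whether the literal string "Sire.ID" is NA. It is not, so the result is FALSE and the assignment changes nothing. Assuming Sire.ID is a column of df, you need to test the column itself, e.g. is.na(df$Sire.ID). You should step through your code and look at the output of each call.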
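
Putting both fixes together, a sketch of what the corrected script could look like. It assumes df has columns named Sire.Breed and Sire.ID, as your comments describe; the data values and the shortened lookup vector are made up for illustration.

```r
# Shortened lookup: default to the code itself, then fill in a few countries.
breed_country <- as.character(1:80)
breed_country[c(1:5, 7, 12, 14, 15, 20, 22)] <- "GBR"
breed_country[63] <- "NLD"

# Made-up data: the second animal has no sire ID.
df <- data.frame(
  Sire.ID    = c("A1", NA, "C3"),
  Sire.Breed = c(2, 63, 6),
  stringsAsFactors = FALSE
)

# Index by the values in the column, not by the column's name.
df$sire_country <- breed_country[df$Sire.Breed]

# Tests the string "Sire.ID" itself: always a single FALSE.
print(is.na("Sire.ID"))
# Tests each value in the column: one TRUE/FALSE per row.
print(is.na(df$Sire.ID))

# Blank out the country wherever the sire ID is missing.
df$sire_country[is.na(df$Sire.ID)] <- ""
print(df)
```

Here the second row ends up with an empty sire_country because its Sire.ID is NA, while the other rows keep their looked-up values.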

Djork