
I'm trying to read a text file into R with read.table. My data (a .txt file) looks like this:

a b c d e
Australia 1 2 4 3 2
United States 1 2 4 2 2

The problems with reading this table are:

1) Line 1 has only 5 elements (a~e), whereas every row below it has 6. There should be a column name such as "Country" for the first column, so that a corresponds to the first number (1), b to 2, ..., and e to 2 (in the case of Australia). How do I add a name for the first column so that R doesn't raise the error "line 1 did not have 6 elements"?

2) "United States" is two words rather than one, so when R reads the data it puts "States" into the second column instead of treating "United States" as a single value.

(A friend advised me to use rownames. Does anyone know how to go about using rownames?)

How can I fix these issues and correctly read my data?

Thank you very much!!

smci
Betty
  • What is the source or the data? A plain text file? A spreadsheet? – A5C1D2H2I1M1N2O1R2T1 Nov 15 '14 at 01:58
  • plain text file! @AnandaMahto – Betty Nov 15 '14 at 02:42
  • What is the output supposed to look like? What code have you tried? – ben rudgers Nov 15 '14 at 03:04
  • @benrudgers I tried read.table(file="filename", header=FALSE, fill=TRUE), but it filled the wrong columns by pushing a~e to the left (so a ended up over the country column). I also tried header=TRUE, but that didn't work because of problem #2 above. I'm going to run a regression with this data, so it needs to be cleaned beforehand. – Betty Nov 15 '14 at 04:00

2 Answers


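Here's another possibility. This one wraps the first pair of adjacent words in each line (each word at least two letters long) in quotes, so that read.table treats the pair as a single field: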

x <- readLines("your.txt")
x[1] <- paste("Country", x[1])
read.table(text=sub("([A-Za-z]{2,}\\s[A-Za-z]{2,})", "'\\1'", x), header=TRUE)
#         Country a b c d e
# 1     Australia 1 2 4 3 2
# 2 United States 1 2 4 2 2
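To see what the substitution does before read.table gets involved, here is a minimal sketch that runs the same steps on an in-memory copy of the sample data. The vector x stands in for readLines("your.txt"), and the name quoted is introduced here only for illustration.

```r
# Stand-in for readLines("your.txt"): the sample lines from the question.
x <- c("a b c d e",
       "Australia 1 2 4 3 2",
       "United States 1 2 4 2 2")

# Prepend a name for the first column to the header line.
x[1] <- paste("Country", x[1])

# Quote the first pair of adjacent words with two or more letters each.
# The header is left alone because "a" is only one letter long.
quoted <- sub("([A-Za-z]{2,}\\s[A-Za-z]{2,})", "'\\1'", x)
quoted[3]
# [1] "'United States' 1 2 4 2 2"

# read.table treats single quotes as quote characters by default,
# so the quoted pair is read as one field.
df <- read.table(text = quoted, header = TRUE)
dim(df)
```

Single-word names such as Australia contain no matching pair, so those lines pass through unchanged.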

With regard to @akrun's comment about countries containing more than two words, I think this will work:

x[4] <- 'Papua New Guinea 3 4 3 2 5'
xx <- sub("([A-Za-z]{2,}(\\s[A-Za-z]{2,})+)", "'\\1'", x)
read.table(text = xx, header = TRUE)
#            Country a b c d e
# 1        Australia 1 2 4 3 2
# 2    United States 1 2 4 2 2
# 3 Papua New Guinea 3 4 3 2 5

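It also occurred to me that the country names might be the row names for the data frame. If that's the case, you can leave the header line as it is: because it has one fewer field than the data rows, read.table uses the first column as row names.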

x <- readLines("your.txt")
read.table(text = sub("([A-Za-z]{2,}\\s[A-Za-z]{2,})", "'\\1'", x))
#               a b c d e
# Australia     1 2 4 3 2
# United States 1 2 4 2 2
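Since the question asks how to work with rownames, the following is a rough sketch of what can be done once the country names are stored that way. It uses an in-memory copy of the sample data instead of the file, and the names df and df2 are illustrative only.

```r
# Stand-in for readLines("your.txt"): the sample lines from the question.
x <- c("a b c d e",
       "Australia 1 2 4 3 2",
       "United States 1 2 4 2 2")

# The header has one fewer field than the data rows, so read.table
# uses the first column as row names.
df <- read.table(text = sub("([A-Za-z]{2,}\\s[A-Za-z]{2,})", "'\\1'", x))

# Inspect the row names and look up a value by country and column.
rownames(df)
df["United States", "c"]

# If an ordinary column is preferred (for example as a model variable),
# copy the row names into a column and reset the row names.
df2 <- data.frame(Country = rownames(df), df, stringsAsFactors = FALSE)
rownames(df2) <- NULL
df2
```

Keeping the names as row names leaves only numeric columns in the data frame, whereas df2 matches the layout with an explicit Country column.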
Rich Scriven

Assuming that the example data mimics the content in the file, we could read it using readLines and then use regex to separate the country names from the rest. The separated country names can be added as a new column.

lines <- readLines('Betty2.txt')
lines
#[1] "a b c d e"               "Australia 1 2 4 3 2"    
#[3] "United States 1 2 4 2 2"

dat <- read.table(text = c(lines[1], gsub('[A-Za-z]+\\s+', '', lines[-1])),
                  header = TRUE)

In the above code, gsub replaces every run of letters followed by whitespace (that is, each word of the country name) with '', so only the numbers remain:

 gsub('[A-Za-z]+\\s+', '',  lines[-1])
 #[1] "1 2 4 3 2" "1 2 4 2 2"

 dat1 <- data.frame(Country= gsub(" \\d+.*", '', lines[-1]),
                               dat, stringsAsFactors=FALSE)

Similarly, here we replace everything from the first space that is followed by a number (\\d+) to the end of the line (.* matches zero or more characters) with '', so only the country name remains:

 gsub(" \\d+.*", '', lines[-1])
 #[1] "Australia"     "United States"


dat1
#        Country a b c d e
#1     Australia 1 2 4 3 2
#2 United States 1 2 4 2 2
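As a variation on the same idea, the sketch below splits each data line at the first digit instead of matching letters. It assumes that country names never contain digits, which holds for the sample data but is an assumption about the real file. The Papua New Guinea row is borrowed from the other answer, and the names body, country and numbers are illustrative.

```r
# Stand-in for readLines('Betty2.txt'), plus one three-word country.
lines <- c("a b c d e",
           "Australia 1 2 4 3 2",
           "United States 1 2 4 2 2",
           "Papua New Guinea 3 4 3 2 5")

body <- lines[-1]  # data rows without the header

# Country name: drop everything from the whitespace before the first digit.
country <- sub("\\s+\\d.*$", "", body)

# Numbers: drop the leading run of non-digit characters.
numbers <- sub("^\\D+", "", body)

# Read the numeric part with the original header, then attach the names.
dat <- read.table(text = c(lines[1], numbers), header = TRUE)
dat1 <- data.frame(Country = country, dat, stringsAsFactors = FALSE)
dat1
```

Because nothing here depends on which characters the name contains, names with hyphens or apostrophes should also survive, as long as the no-digits assumption holds.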
akrun
  • It works! Can you explain your third and fourth line of code? I don't really understand it. Thanks! – Betty Nov 15 '14 at 05:51
  • I see. It makes sense now!! What does "lines[-1]" do here? – Betty Nov 15 '14 at 06:04
  • @Betty `lines[-1]` is every line except the first. I left out the `1st line`, i.e. the line with the `headers`, because it would cause problems with the `regex` I was using. So I concatenated the untouched first line, `c(lines[1],`, with the modified `lines[-1]`. – akrun Nov 15 '14 at 06:06
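The negative index discussed in the comments can be tried on its own. This is a small sketch using the sample lines; toupper is just a placeholder for whatever modification is applied to the data rows, and rebuilt is an illustrative name.

```r
lines <- c("a b c d e",
           "Australia 1 2 4 3 2",
           "United States 1 2 4 2 2")

lines[1]   # only the header line
lines[-1]  # everything except the first element, i.e. the data rows

# Modify only the data rows, then put the untouched header back in front.
rebuilt <- c(lines[1], toupper(lines[-1]))
length(rebuilt)
# [1] 3
```

A negative index in R drops the element at that position; it does not count from the end as in some other languages.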