I am trying to import a large text file into R. Many records in the file are broken across two lines, as shown below. How can I restore the original format, with one record per line? Note that the delimiter here is ~~.

This is how it looks

PSU~~WEST BENGAL~~SOUTH 24 PARGANAS~~1~~21~~53~~66014~~WB19~~2011~~Q3~~13~~3~~0~~61965~~0~~1098~~323~~775~~~~~~~~18428.79781420765~~43536.202185792346~~~~~~~~0~~0~~~~~~
PSU~~WEST BENGAL~~SOUTH 24 PARGANAS~~1~~21~~54
~~11018~~WB19~~2011~~Q1~~5~~1~~0~~6045~~0~~366~~315~~51~~~~~~~~5202.6639344262294~~842.3360655737705~~~~~~~~0~~0~~~~~~
PSU~~WEST BENGAL~~SOUTH 24 PARGANAS~~1~~21~~54
~~11018~~WB19~~2011~~Q3~~4~~1~~0~~6195~~0~~366~~167~~199~~~~~~~~2826.6803278688526~~3368.3196721311474~~~~~~~~0~~0~~~~~~
PSU~~WEST BENGAL~~SOUTH 24 PARGANAS~~1~~21~~54
~~6027~~WB19~~2011~~Q2~~14~~1~~0~~6195~~0~~366~~184~~182~~~~~~~~3114.4262295081967~~3080.5737704918033~~~~~~~~0~~0~~~~~~
PSU~~WEST BENGAL~~SOUTH 24 PARGANAS~~1~~21~~54
~~6027~~WB19~~2011~~Q3~~7~~1~~0~~6195~~0~~366~~183~~183~~~~~~~~3097.5~~3097.5~~~~~~~~0~~0~~~~~~
PSU~~WEST BENGAL~~SOUTH 24 PARGANAS~~1~~21~~54
~~6027~~WB19~~2011~~Q4~~14~~1~~0~~6195~~0~~366~~87~~279~~~~~~~~1472.5819672131147~~4722.4180327868853~~~~~~~~0~~0~~~~~~
PSU~~WEST BENGAL~~SOUTH 24 PARGANAS~~1~~21~~54
~~66014~~WB19~~2011~~Q1~~14~~1~~0~~6045~~0~~366~~287~~79~~~~~~~~4740.2049180327867~~1304.795081967213~~~~~~~~0~~0~~~~~~
PSU~~WEST BENGAL~~SOUTH 24 PARGANAS~~1~~21~~54
~~66014~~WB19~~2011~~Q1~~9~~2~~0~~9800~~0~~732~~629~~103~~~~~~~~8198.920765027322~~1601.0792349726776~~~~~~~~0~~0~~~~~~
PSU~~WEST BENGAL~~SOUTH 24 PARGANAS~~1~~21~~54~~10016~~WB19~~2011~~Q4~~11~~1~~0~~8285~~0~~366~~74~~292~~~~~~~~1675.1092896174864~~6609.890710382514~~~~~~~~0~~0~~~~~~
Darren Tsai
Kush
  • What is the original format, exactly? Do you want to remove the line breaks so that every line starts with "PSU"? – gdevaux Nov 27 '19 at 08:27
  • Perhaps this answer to a different question might be helpful: https://stackoverflow.com/a/30783059/7439717 – crlwbm Nov 27 '19 at 08:47
  • Your description is a little unclear. Does my answer solve your question? – Darren Tsai Nov 27 '19 at 12:17
  • Dear Darren, I am currently trying the method you have suggested. The total size of text file is 1.2 GB. It is taking its time. The code has been running for last 30 minutes. Will update you once it is done. Thanks! – Kush Nov 28 '19 at 08:57
  • It's working. Thanks! – Kush Nov 28 '19 at 15:12

2 Answers

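Some data rows in the file are split across two lines, and every continuation line starts with "~~". So the first step is to join each line that starts with "~~" to the line before it. I use Reduce() to walk through the lines one by one: a continuation line is pasted directly onto the accumulated text, while any other line is appended after a "\n". The result is a single string (a character vector of length 1), which I assign to text.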

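# Rebuild the text line by line:
# a line starting with "~~" continues the previous record, so append it directly;
# any other line starts a new record, so separate it with a newline.
text <- Reduce(function(x, y)
  if (grepl("^~~", y)) paste0(x, y) else paste(x, y, sep = "\n"),
  readLines("test.txt"))

# read.table() needs a single-character separator, so replace "~~" with ","
data <- read.table(text = gsub("~~", ",", text), sep = ",")
data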

#    V1          V2                V3 V4 V5 V6    V7   V8   V9 V10 V11 V12 V13   V14 V15  V16 V17 V18 V19 V20 V21       V22        V23 V24 V25 V26 V27 V28 V29 V30 V31
# 1 PSU WEST BENGAL SOUTH 24 PARGANAS  1 21 53 66014 WB19 2011  Q3  13   3   0 61965   0 1098 323 775  NA  NA  NA 18428.798 43536.2022  NA  NA  NA   0   0  NA  NA  NA
# 2 PSU WEST BENGAL SOUTH 24 PARGANAS  1 21 54 11018 WB19 2011  Q1   5   1   0  6045   0  366 315  51  NA  NA  NA  5202.664   842.3361  NA  NA  NA   0   0  NA  NA  NA
# 3 PSU WEST BENGAL SOUTH 24 PARGANAS  1 21 54 11018 WB19 2011  Q3   4   1   0  6195   0  366 167 199  NA  NA  NA  2826.680  3368.3197  NA  NA  NA   0   0  NA  NA  NA
# 4 PSU WEST BENGAL SOUTH 24 PARGANAS  1 21 54  6027 WB19 2011  Q2  14   1   0  6195   0  366 184 182  NA  NA  NA  3114.426  3080.5738  NA  NA  NA   0   0  NA  NA  NA
# 5 PSU WEST BENGAL SOUTH 24 PARGANAS  1 21 54  6027 WB19 2011  Q3   7   1   0  6195   0  366 183 183  NA  NA  NA  3097.500  3097.5000  NA  NA  NA   0   0  NA  NA  NA
# 6 PSU WEST BENGAL SOUTH 24 PARGANAS  1 21 54  6027 WB19 2011  Q4  14   1   0  6195   0  366  87 279  NA  NA  NA  1472.582  4722.4180  NA  NA  NA   0   0  NA  NA  NA
# 7 PSU WEST BENGAL SOUTH 24 PARGANAS  1 21 54 66014 WB19 2011  Q1  14   1   0  6045   0  366 287  79  NA  NA  NA  4740.205  1304.7951  NA  NA  NA   0   0  NA  NA  NA
# 8 PSU WEST BENGAL SOUTH 24 PARGANAS  1 21 54 66014 WB19 2011  Q1   9   2   0  9800   0  732 629 103  NA  NA  NA  8198.921  1601.0792  NA  NA  NA   0   0  NA  NA  NA
# 9 PSU WEST BENGAL SOUTH 24 PARGANAS  1 21 54 10016 WB19 2011  Q4  11   1   0  8285   0  366  74 292  NA  NA  NA  1675.109  6609.8907  NA  NA  NA   0   0  NA  NA  NA
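
For a file as large as the 1.2 GB one mentioned in the comments, Reduce() copies the growing string on every step, which may explain the long run time. The following is only a sketch of a vectorised variant of the same idea, not a tested replacement: read_split_records is a hypothetical helper name, and it assumes that no field contains a tab character and that the data never contain a literal "~~" inside a field.

```r
# Hypothetical helper: rejoin records that were broken before a "~~"
# and parse them into a data frame.
read_split_records <- function(lines, delim = "~~") {
  lines <- lines[nzchar(lines)]                 # ignore completely empty lines
  # A line that starts with the delimiter continues the previous record.
  is_continuation <- startsWith(lines, delim)
  # Every other line opens a new record, so cumsum() yields a record id per line.
  record_id <- cumsum(!is_continuation)
  # Glue the pieces of each record back together, in their original order.
  records <- vapply(split(lines, record_id), paste0, character(1),
                    collapse = "", USE.NAMES = FALSE)
  # read.table() needs a single-character separator; a tab is used here
  # (assumption: no field contains a tab). Quoting and comment handling are
  # switched off so that quotes or "#" in the data are read literally.
  read.table(text = gsub(delim, "\t", records, fixed = TRUE),
             sep = "\t", quote = "", comment.char = "",
             stringsAsFactors = FALSE)
}

# Small in-memory demonstration; for the real file one would pass
# readLines("test.txt") instead.
demo_lines <- c(
  "PSU~~WEST BENGAL~~SOUTH 24 PARGANAS~~1~~21~~53~~66014~~WB19~~2011~~Q3",
  "PSU~~WEST BENGAL~~SOUTH 24 PARGANAS~~1~~21~~54",
  "~~11018~~WB19~~2011~~Q1"
)
demo_data <- read_split_records(demo_lines)
print(demo_data)
```

Using a tab instead of a comma as the stand-in separator avoids breaking rows if a field ever contains a comma, although that is a guess about the data rather than something the sample shows.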
Darren Tsai

I am not sure whether this is what you want, but here is one option. It assumes the file with the example contents is named dat.txt.

fileName <- 'dat.txt'
writeLines(gsub("~~","\n",readChar(fileName, file.info(fileName)$size)))

which prints the following to the console, one field per line:

PSU
WEST BENGAL
SOUTH 24 PARGANAS
1
21
53
66014
WB19
2011
Q3
13
3
0
61965
0
1098
323
775



18428.79781420765
43536.202185792346



0
0



PSU
WEST BENGAL
SOUTH 24 PARGANAS
1
21
54

11018
WB19
2011
Q1
5
1
0
6045
0
366
315
51



5202.6639344262294
842.3360655737705



0
0



PSU
WEST BENGAL
SOUTH 24 PARGANAS
1
21
54

11018
WB19
2011
Q3
4
1
0
6195
0
366
167
199



2826.6803278688526
3368.3196721311474



0
0



PSU
WEST BENGAL
SOUTH 24 PARGANAS
1
21
54

6027
WB19
2011
Q2
14
1
0
6195
0
366
184
182



3114.4262295081967
3080.5737704918033



0
0



PSU
WEST BENGAL
SOUTH 24 PARGANAS
1
21
54

6027
WB19
2011
Q3
7
1
0
6195
0
366
183
183



3097.5
3097.5



0
0



PSU
WEST BENGAL
SOUTH 24 PARGANAS
1
21
54

6027
WB19
2011
Q4
14
1
0
6195
0
366
87
279



1472.5819672131147
4722.4180327868853



0
0



PSU
WEST BENGAL
SOUTH 24 PARGANAS
1
21
54

66014
WB19
2011
Q1
14
1
0
6045
0
366
287
79



4740.2049180327867
1304.795081967213



0
0



PSU
WEST BENGAL
SOUTH 24 PARGANAS
1
21
54

66014
WB19
2011
Q1
9
2
0
9800
0
732
629
103



8198.920765027322
1601.0792349726776



0
0



PSU
WEST BENGAL
SOUTH 24 PARGANAS
1
21
54
10016
WB19
2011
Q4
11
1
0
8285
0
366
74
292



1675.1092896174864
6609.890710382514



0
0




ThomasIsCoding