3

I am importing a flat .txt file with both row and column delimiters. The problem is that the row delimiter is used to reduce file size and so often the rest of the columns are skipped. Another problem is that the length of the longest character is unknown and so, if this char string is truncated then we lose the delimiter and the whole structure falls apart.

An explicit example of the problems I am facing include

.txt file

Var1'~'Var2'~'Var3'~'Var4'~'Var5'~'Var6'#@#@'
1'~''#@#@'
This is going to be a really long string as an example of a situation where the long string is very large and so the truncated string does not indicate a delimiter and we lose data '#@#@'
1'~' 2'~' 3'~' 4'~' 5'~' 6'#@#@'
1'~' 2'~' 3'~''#@#@'

I am having a lot of problems trying to import this data due to a number of reasons:

  1. putting a very large length for the character variables makes the import process very time consuming and we do not know the length of the longest character var so each iteration takes more time

  2. I have not yet found a way to deal with both the column and row delimiter when the structure means that the next row can be signalled before all the columns have been filled in i.e. can't just make an extra column for the row delimiter and drop it.

SAS code for which I have tried:

data want;
infile "file-location" dlmstr = "#@#@" dsd recfm = F lrecl=10000000000;
informat var $200.
input var $ @@;
run;

Any experience and insight is greatly appreciated.

Jonathan Hall
  • 75,165
  • 16
  • 143
  • 189
78282219
  • 593
  • 5
  • 21
  • Your posted example makes it look like your end of line markers (I hate the thinking of these as delimiters) include an actual linefeed. – Tom Oct 16 '18 at 15:19
  • Are your long strings longer than 32K max for a dataset variable? – Tom Oct 16 '18 at 15:24
  • I have not yet checked, the flat file comes with no documentation surrounding the maximum length of variables contained, i'll include the first two lines from a file – 78282219 Oct 16 '18 at 15:49
  • I recommend using `LIST` command to look at the file. So to read the full file and see what the maximum line length is and also list the first 5 lines run code like this: `data _null_; infile "file" lrecl=1M; input; if _n_<=5 then list; run;` – Tom Oct 16 '18 at 16:28
  • Does the valid data ever contain any of those individual characters? `~'#@`? If not then just use ALL of them as delimiters. – Tom Oct 16 '18 at 16:30
  • Hi, I hope that the data provider would avoid using these characters in free text but I think that they will be present as some names are given in other languages like Japanese. In this case, SAS does not recognise the character and it is imported with special characters – 78282219 Oct 16 '18 at 16:57
  • I am currently working on testing out your proposed solutions so will need a bit of time to respond with results – 78282219 Oct 16 '18 at 17:04
  • I am getting the error: the lrecl exceeds the maximum allowable length for an infile / infile = variable statement (32,767*) – 78282219 Oct 16 '18 at 18:08
  • 32767 is limit for a VARIABLE. Limit for LRECL depends on OS, but should be much larger (unless you are using a really old version of SAS). Try `LRECL=1M` to set to one million or just type out the zeros. – Tom Oct 16 '18 at 19:41

1 Answers1

1

If you have one record per line, then just use missover or truncover option on infile; that will tell SAS to stop reading after it hits EOL. You'll have to deal with the weird end delimiter only insofar as it will cause an error if you try to read it into a numeric; you could possibly try to remove the string first in preprocessing.

data want;
  infile "yourfile-location" dlmstr="'~'" dsd lrecl=32767 truncover;
  input @;
  _infile_ = tranwrd(_infile_,"'#@#@'"," ");
  input var1 var2 var3 var4 var5 var6;
run;

The way you wrote it there could also work, if the above doesn't for some reason; basically read in the string twice with two different delimiter options, once with "'#@#@'" dlmstr, then with "'~'" dlmstr. Or rather than read it in twice, read it in once with the first, then parse it with the second.

data want;
  infile "yourfile-location" dlmstr="'#@#@'" dsd lrecl=32767;
  input @;
  array var[6] var1-var6;
  do _i = 1 to countc(_infile_,"~")+1;
    var[_i] = scan(_infile_,_i,"~");
  end;
run;

The above isn't perfect as it doesn't deal with those quotation marks around the delimiter, but you can figure that out depending on details - is that quotation mark safe to just compress out entirely pre-input, or do you need to do some fancy footwork with SUBSTR?

As far as the string variable length goes, most likely what's taking time is writing out the file. Use options compress=char; to turn on dataset compression, assuming your ultimate use for these files is compatible with that (if you're just running SAS code on them, it should be). Then it won't try to write out the full variable length. If that fails, you may need to reconsider your dataset structure to avoid having this issue - you'd need to ask a separate question with a lot more details though to find out the best solution there.

Joe
  • 62,789
  • 6
  • 49
  • 67
  • Hi Joe, Thanks for the information, I'm currently reading and exploring the truncover option. I wasn't explicit with the structure of my .txt file, I have received it from a data provider and the long charvar columns are not the only one present in the row, my typical problem is that the char variable is so long that it truncates and therefore the delimiter is truncated and lost. I'm hoping the truncover command will fix this? – 78282219 Oct 16 '18 at 15:21
  • I am currently working on testing out your proposed solutions so will need a bit of time to respond with results – 78282219 Oct 16 '18 at 17:05
  • I am getting the following error message the lrecl exceeds the maximum allowable length for an infile / infile = variable statement (32,767*) – 78282219 Oct 16 '18 at 18:07
  • You will have to 2-pass the data file, 1st pass to determine the column sizes needed, and for the case of really long texts, the number of 32K length character variables needed to contain the data (this would be preparations for code generation (aka macro)). 2nd pass would read the data file using your codegen. – Richard Oct 16 '18 at 18:18
  • Thanks, I'll tackle this – 78282219 Oct 16 '18 at 18:21
  • You cannot use the `_INFILE_` automatic variable if your LRECL is longer than 32K. – Tom Oct 16 '18 at 19:44
  • @Tom Good point - you'll have to parse it in 32k chunks if it's legitimately that long. If it's not, then set LRECL=32767 (or shorter). – Joe Oct 16 '18 at 19:45