0

I'm having an issue with reading in data from a pipe delimited txt file. There are 3 records with these values for address_1 "'S example,". All of these records are followed by a few blank fields. address_1 ends up being "S example |||" for some reason SAS eats the "'" and then ends up ignoring the pipe delimiter. Does anyone know how to properly read in values which begin with a single quotation mark?

This is what I use for reading in the file

 data test;
attrib
        ID     length= $16
        ADDRESS_1        length= $50
        ADDRESS_2        length= $50
        ADDRESS_3        length= $50
        EMAIL_ADDRESS    length= $60;
      INFILE "&directory./Data/Example_Data.txt"
        dlm="|"   RECFM=v lrecl=32767 dsd firstobs=2 missover end=eof;


    input 
        ID     $
        ADDRESS_1        : $CHAR50.
        ADDRESS_2        : $CHAR50.
        ADDRESS_3        : $CHAR50.
        EMAIL_ADDRESS    $
;
run;

Here's some example data just copy and paste into a notepad

ID|Address_1|Address_2|Address_3|Email
1234|'S Road,|||email@GMAIL.COM
2534|'S Road,|||email2@GMAIL.COM

Annoyingly when i try this with example data it works but when I use my actual data the ' disappears and the fields get concatenated into 1.

Here's the code generated by SAS EG which reads my original data file in just fine

data example;
Length 
     Address_1 $40;
Format 
    ADDRESS_1        $CHAR40.;
Informat 
        ADDRESS_1        $CHAR40.;

infile "test_data.txt"
        LRECL=226
        ENCODING="LATIN1"
        TERMSTR=CRLF
        DLM='7F'x
        MISSOVER
        DSD ;

Input
        ADDRESS_1        : $CHAR40.;
run;
Mobix
  • 103
  • 10
  • someone can easily help you if you give couple lines of your data in datalines – Kiran Apr 08 '18 at 13:07
  • Try using the proc import first to see what is the length of the variable SAS is expecting from the file. If your `address_1` variable is less than 50, and you recursively give 50 as length, it will read past the delimiter! If you're using SAS EG, better use the GUI import to see the code it generates (same can be done via a proc import). Then change your snippet accordingly. – samkart Apr 08 '18 at 13:58
  • Show some example data lines. It is supposed to remove the quotes around quoted values. That is how you can load data that contains the delimiter. This means that values that contain quotes need to also be quoted. SAS will remove the quotes from values that are quoted with either single quote or double characters. – Tom Apr 08 '18 at 14:35
  • Make an example that has another bare single quote later in the line. – Tom Apr 08 '18 at 14:49
  • Still works fine, I'm looking into the link you sent me. Sadly, I can't send you the actual data for obvious reasons. – Mobix Apr 08 '18 at 14:53
  • Did EG generate the code to read a CSV file or did it convert an Excel spreadsheet into a delimited file? I know for Excel files that EG converts the input and uploads the converted input and reads that converted file. Perhaps it does the same thing when you import a delimited file? So SAS is actually reading a different file than what you see on your PC. – Tom Apr 08 '18 at 15:10
  • Exactly! SAS EG uploads the delimited file to SAS server after creating an intermediate delimited file which is then read back to the SAS dataset. But, while using GUI import of SAS EG, check the box for 'Generalize the step to run out of SAS EG environment'. That comes just before finish (the last step). Check that generated code with yours. That should clear some doubts. – samkart Apr 08 '18 at 15:16
  • Samkart What does the generalise the step to run out of SAS EG do? How is it different to the code automatically generated by SAS when I run the import. Yeah since SAS copies the file to a SAS server and then imports it I'm not sure if that's not a factor. The issues is affecting just 3 records and they all have the same value for address_1 – Mobix Apr 08 '18 at 15:22
  • SAS EG generalizes the step to run on Base SAS or other machines. As you know, SAS EG runs on a server, and does some behind-the-scenes cleansing of your spreadsheets to optimize the usage, it will be there in your generated code! sort of – samkart Apr 08 '18 at 15:24
  • ERROR: The file "#LN00306" could not be opened. A byte-order mark indicates that the data is encoded in "utf-16le". This conflicts with the "latin1" encoding that was specified for the fileref "#LN00306". Should I change the encoding to Latin1...but it's not on the encoding list... – Mobix Apr 08 '18 at 15:33
  • Keep the `ENCODING=LATIN1` option, and use your file path in the `infile` statement and your dlm. You can remove the `termstr` option. That should do it. – samkart Apr 08 '18 at 15:40
  • Oh sorry I should have specified. I get this error when trying to import using the EG wizard. I think the generalise step caused that. – Mobix Apr 08 '18 at 15:42
  • That shouldn't have happened with the wizard. Because it auto-detects the encoding for your file, even if you run with generalize step feature. – samkart Apr 08 '18 at 15:47

1 Answers1

0

Show your data lines.

Values containing delimiters or quotes should be quoted. So if the value of address_1 is 'S example, then in the delimited file it should be represented as

"'S example,"

or

'''S example,'

The DSD option on the INFILE statement that recognizes adjacent delimiters as indicating a missing value will also remove the quotes from around quoted values.

You might be hitting a similar bug as in this report at SAS Communities. https://communities.sas.com/t5/Base-SAS-Programming/Importing-delimited-file-with-text-qualifiers/m-p/449215#M113034

In that case the values could be parsed from the data line using the scan() function with the m modifier without the q modifier.

You could also try pre-processing the _INFILE_ variable to convert bare single quotes to "'". Here is a simple example.

data test2 ;
  length var1-var5 $50 ;
  infile cards dsd dlm='|' truncover firstobs=2;
  input @;
  _infile_=tranwrd(cats('|',_infile_,'|'),"|'|","|""'""|");
  _infile_=substrn(_infile_,2,length(_infile_)-2);
  input var1-var5 ;
cards4;
ID|Address_1|Address_2|Address_3|Email
2534|'S Road,||'|email2@GMAIL.COM
;;;;
Tom
  • 47,574
  • 2
  • 16
  • 29
  • The annoying thing is that the code generated by SAS EG reads these data rows just fine – Mobix Apr 08 '18 at 14:55
  • Could I use the ~ modifier to put quotation marks around address_1 and then delete them after the data has been read in? – Mobix Apr 08 '18 at 15:02
  • I don't think the issue is in that value. Check if your infile ever has a bare single quote. Search for `|'|`. – Tom Apr 08 '18 at 15:04
  • Why don't you just use the code generated by SAS EG. The one that works. Or reduce your length to 40 (same as SAS EG generated code) to check for the flaw. – samkart Apr 08 '18 at 15:05
  • Because I don't trust that code completely. It sometimes reads character data in as a number and messes up the dates. My approach is to read in the file using the import and then modify the values according to the file spec. – Mobix Apr 08 '18 at 15:12
  • Create a simple file with just one or two examples of lines it is reading wrong so you can easily test different methods. You could even modify that small file so that you could post it. – Tom Apr 08 '18 at 15:15
  • @Mobix You can do the same with SAS EG GUI import. It's not entirely out of your hands. Try clicking 'NEXT' a couple of times. That should bring you to the Variable names and data types window, where you can change the data types of your desired column. And, if you have no idea about the file you are importing ... that IS a problem! – samkart Apr 08 '18 at 15:21
  • Oh wow I had no idea I could do that :). I guess i'm too used to using base SAS and importing things manually. – Mobix Apr 08 '18 at 15:25
  • Been there! I get that – samkart Apr 08 '18 at 15:28