Skipping alternate lines while reading .tsv file

Question

I have a .tsv file with 39 columns. The last-but-one column holds strings that are more than 100,000 characters long. Line 1 contains the headers and the data follows.

When I read the file, the reader goes from line 1 to line 3, then to line 5, then to line 7, and so on, even though all the rows contain the same data. This is the log I am getting:

lineNo=3, rowNo=2, customer=503837-100 , last but one cell length=111275
lineNo=5, rowNo=3, customer=503837-100 , last but one cell length=111275
lineNo=7, rowNo=4, customer=503837-100 , last but one cell length=111275
lineNo=9, rowNo=5, customer=503837-100 , last but one cell length=111275
lineNo=11, rowNo=6, customer=503837-100 , last but one cell length=111275
lineNo=13, rowNo=7, customer=503837-100 , last but one cell length=111275
lineNo=15, rowNo=8, customer=503837-100 , last but one cell length=111275
lineNo=17, rowNo=9, customer=503837-100 , last but one cell length=111275
lineNo=19, rowNo=10, customer=503837-100 , last but one cell length=111275

Following is my code:

import java.io.FileReader;
import org.supercsv.cellprocessor.Optional;
import org.supercsv.cellprocessor.constraint.NotNull;
import org.supercsv.cellprocessor.ift.CellProcessor;
import org.supercsv.io.CsvBeanReader;
import org.supercsv.io.ICsvBeanReader;
import org.supercsv.prefs.CsvPreference;

public class readWithCsvBeanReader {
    public static void main(String[] args) throws Exception{
        readWithCsvBeanReader();
    }


private static void readWithCsvBeanReader() throws Exception {

    ICsvBeanReader beanReader = null;

    try {

        beanReader = new CsvBeanReader(new FileReader("C:\\MAP TSV\\abc.tsv"), CsvPreference.TAB_PREFERENCE);
        // the header elements are used to map the values to the bean (names must match)
        final String[] header = beanReader.getHeader(true);
        final CellProcessor[] processors = getProcessors();
        TSVReaderBrandDTO tsvReaderBrandDTO = new TSVReaderBrandDTO();

        int i = 0;
        int last = 0;

        while( (tsvReaderBrandDTO = beanReader.read(TSVReaderBrandDTO.class, header, processors)) != null ) {
            if(null == tsvReaderBrandDTO.getPage_cache()){
                last = 0;
            }
            else{
                last = tsvReaderBrandDTO.getPage_cache().length();
            }
            System.out.println(String.format("lineNo=%s, rowNo=%s, customer=%s , last but one cell length=%s", beanReader.getLineNumber(),
                beanReader.getRowNumber(), tsvReaderBrandDTO.getUnique_ID(), last));
            i++;
        }

        System.out.println("Number of rows : "+i);

    }
    finally {
        if( beanReader != null ) {
            beanReader.close();
        }
    }
}

private static CellProcessor[] getProcessors() {

    final CellProcessor[] processors = new CellProcessor[] { 
         new Optional(), new NotNull(), new NotNull(), new NotNull(), new NotNull(),
         new NotNull(), new NotNull(), new NotNull(), new NotNull(), new NotNull(),
         new NotNull(), new NotNull(), new NotNull(), new NotNull(), new NotNull(),
         new NotNull(), new NotNull(), new NotNull(), new NotNull(), new NotNull(),
         new NotNull(), new NotNull(), new NotNull(), new NotNull(), new NotNull(),
         new NotNull(), new NotNull(), new NotNull(), new NotNull(), new NotNull(),
         new NotNull(), new NotNull(), new NotNull(), new NotNull(), new NotNull(),
         new NotNull(), new NotNull(), new NotNull(), new Optional()};

        return processors;
    }
}

Please let me know where I am going wrong.

score 1 · Answer 1 · edited Dec 10 '15 at 15:34

If you use a CSV parser to parse a TSV input, you're gonna have a bad time. Use a proper TSV parser. uniVocity-parsers comes with a TSV parser/writer. You can use annotated java beans as well to parse your file directly into instances of a class.

Examples:

This code parses a TSV as rows.

TsvParserSettings settings = new TsvParserSettings();

// creates a TSV parser
TsvParser parser = new TsvParser(settings);

// parses all rows in one go.
List<String[]> allRows = parser.parseAll(new FileReader(yourFile));

Use a BeanListProcessor to parse into java beans:

BeanListProcessor<TestBean> rowProcessor = new BeanListProcessor<TestBean>(TestBean.class);

TsvParserSettings parserSettings = new TsvParserSettings();
parserSettings.setRowProcessor(rowProcessor);

TsvParser parser = new TsvParser(parserSettings);
parser.parse(new FileReader(yourFile));

// The BeanListProcessor provides a list of objects extracted from the input.
List<TestBean> beans = rowProcessor.getBeans();

This is what the TestBean class looks like:

class TestBean {

    // if the value parsed in the quantity column is "?" or "-", it will be replaced by null.
    @NullString(nulls = { "?", "-" })
    // if a value resolves to null, it will be converted to the String "0".
    @Parsed(defaultNullRead = "0")
    private Integer quantity;

    @Trim
    @LowerCase
    @Parsed(index = 4)
    private String comments;

    // you can also explicitly give the name of a column in the file.
    @Parsed(field = "amount")
    private BigDecimal amount;

    @Trim
    @LowerCase
    // values "no", "n" and "null" will be converted to false; values "yes" and "y" will be converted to true
    @BooleanString(falseStrings = { "no", "n", "null" }, trueStrings = { "yes", "y" })
    @Parsed
    private Boolean pending;
}

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

Honza Zidek · Accepted Answer · 2014-02-26T09:00:15.540

0

I checked http://supercsv.sourceforge.net/examples_reading.html. Have a close look at the example CSV file and the output shown there. Could it be that your lines contain a non-escaped " (double quote) character, so the parser thinks that one data record spans two physical lines?
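This would explain the lineNo=3, 5, 7 pattern in the log: if each data line carries a single stray ", the quote on one physical line opens a quoted cell, the line break is swallowed into that cell, and the quote on the next line closes it, so two physical lines come back as one row. A quick way to test the hypothesis is to flag physical lines that contain an odd number of " characters. The sketch below is plain Java and not part of Super CSV; QuoteCheck and its methods are hypothetical names, and the sample lines are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class QuoteCheck {

    /** Counts the double-quote characters in one physical line. */
    static int countQuotes(String line) {
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '"') {
                count++;
            }
        }
        return count;
    }

    /** Returns the 1-based numbers of the lines holding an odd number of quotes. */
    static List<Integer> linesWithOddQuotes(Iterator<String> lines) {
        List<Integer> suspicious = new ArrayList<>();
        int lineNo = 0;
        while (lines.hasNext()) {
            lineNo++;
            if (countQuotes(lines.next()) % 2 != 0) {
                suspicious.add(lineNo);
            }
        }
        return suspicious;
    }

    public static void main(String[] args) {
        // Made-up sample: a header, two rows with a stray quote, one clean row.
        List<String> sample = Arrays.asList(
                "id\tpage_cache",
                "1\tscreen 5\" wide",
                "2\tscreen 6\" wide",
                "3\tplain text");
        System.out.println(linesWithOddQuotes(sample.iterator()));
    }
}
```

For a real file, the iterator could come from Files.newBufferedReader(path).lines().iterator(), which holds only one line in memory at a time.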

If you do not use the double-quote character as a quote character, you can change the CsvPreference (see http://supercsv.sourceforge.net/apidocs/org/supercsv/prefs/CsvPreference.html) so that the double quote is not treated as one:

CsvPreference MY_PREFERENCES = new CsvPreference.Builder(
    SOME_NEVER_USED_CHARACTER, ',', "\r\n").build();

Of course for tab-delimited CSV use something like this:

CsvPreference MY_PREFERENCES = new CsvPreference.Builder(
    SOME_NEVER_USED_CHARACTER, '\t', "\r\n").build();

Refer to the CsvPreference javadoc for the signature of the Builder and amend the actual values accordingly.
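For intuition only, this is what "no quote character" amounts to, written in plain Java rather than with Super CSV: every tab ends a cell and a " is just data. TabSplitDemo and cells are hypothetical names, and the sketch assumes that no cell ever contains a tab or a line break.

```java
import java.util.Arrays;

final class TabSplitDemo {

    /** Splits one physical line on tabs; quotes stay ordinary characters. */
    static String[] cells(String line) {
        // A limit of -1 keeps trailing empty cells instead of dropping them.
        return line.split("\t", -1);
    }

    public static void main(String[] args) {
        // Three cells, each ending in a stray quote.
        String row = "1\"\t2\"\t3\"";
        System.out.println(Arrays.toString(cells(row))); // prints [1", 2", 3"]
    }
}
```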

edited Feb 26 '14 at 09:00

answered Jan 17 '14 at 09:29


Ya Honza you are right it has a non-escaped " ... can you suggest any method to handle it during runtime ? – Subhrajyoti Das Jan 20 '14 at 08:45
I'm afraid you should either handle it on the side of the application which creates the data, or pass your input file through a "preprocessor": your own code which reads it and replaces every double quote with two double quotes. Maybe the library provides such an option. Otherwise your source file is not a valid CSV/TSV. Btw, if I answered your question and the presence of non-escaped double quotes was the root cause of your problem, could you please mark my answer as accepted? :) Thanks. – Honza Zidek Jan 20 '14 at 11:15
The data files i have can have these kind of discrepancies and problem is data size is huge and can go upto 100gb of data in a single file ... so i cant design a processor as it will run out of memory and i am unable to find a way to handle this while reading – Subhrajyoti Das Jan 20 '14 at 12:32
I am not getting this. How can a processor run out of memory? Just create a simple reader and writer, read the rows from one file, fix the quotes and write them to the other file. The maximum memory consumption is the data needed for a single row. Of course the garbage collector will be busy :) By processor I do not mean one of the CellProcessors, I mean it in general. – Honza Zidek Jan 20 '14 at 13:24
You have one MORE option. If you do not use the double-quote character as a quote character, you can change the CsvPreference - see http://supercsv.sourceforge.net/apidocs/org/supercsv/prefs/CsvPreference.html - so that the double quote is not considered a quote character. – Honza Zidek Jan 20 '14 at 13:28
hey Honza ... I am unable to fix it Below is a sample data A B C 1" 2" 3" 1 2 3 if i try to process the above data then it says that unexpected end of file while reading quoted column beginning on line 3 and ending on line 3 context=null – Subhrajyoti Das Feb 25 '14 at 11:16
Hi Subhrajyoti, due to the removed line ends in the comment, I am not sure what your input data is... – Honza Zidek Feb 25 '14 at 13:25
Hi Honza this is one row of sample tab spaced sample data A B C 1" 2" 3" 1 2 3 with each being one cell ... now when the supercsv tries to read it will take 1" 2" as one cell as anything between "" is considered as in the same cell and hence it falls short of data and throws the error : abrupt end of file – Subhrajyoti Das Feb 26 '14 at 08:27
How did you set CsvPreference? The constructor is public CsvPreference.Builder(char quoteChar, int delimiterChar, String endOfLineSymbols). You should set quoteChar to something which is NOT the double quote, and delimiterChar to whatever you use for delimiting fields. Then pass your instance of CsvPreference to CsvBeanReader. Post your code here or give me your email address so we can communicate about the code. – Honza Zidek Feb 26 '14 at 08:50
I do not know if you used my example in the solution literally (which you should not have) - you will probably have to replace ',' with '\t' :) and do not forget to pass it to the constructor CsvBeanReader(..., MY_PREFERENCES); – Honza Zidek Feb 26 '14 at 08:54
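The streaming preprocessor suggested in the comments could look roughly like the sketch below. It is plain Java and rests on several assumptions: every record sits on exactly one physical line, cells are separated by tabs, the file never uses " as a real quote character, and the encoding is UTF-8. TsvQuoteFixer and its methods are hypothetical names. Instead of only doubling each ", it wraps any cell that contains one in quotes and doubles the quotes inside (the RFC 4180 convention), because a doubled quote is conventionally only meaningful inside a quoted cell. Whether Super CSV then reads the output as intended should be checked on a small sample first.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class TsvQuoteFixer {

    /** Wraps a cell in quotes and doubles embedded quotes, if it contains any. */
    static String quoteField(String field) {
        if (field.indexOf('"') < 0) {
            return field; // nothing to escape
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Rewrites one physical line cell by cell, keeping empty trailing cells. */
    static String fixLine(String line) {
        String[] fields = line.split("\t", -1);
        StringBuilder out = new StringBuilder(line.length() + 16);
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                out.append('\t');
            }
            out.append(quoteField(fields[i]));
        }
        return out.toString();
    }

    /** Streams the input to the output; only one line is held in memory. */
    static void fixFile(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(fixLine(line));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java TsvQuoteFixer <input.tsv> <output.tsv>
        fixFile(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Because the file is processed one line at a time, memory use is bounded by the longest line, so a 100 GB input is a matter of run time and disk space rather than heap size.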
