
I have recently scraped TripAdvisor for some review data and currently have a dataset with the following structure.

Organization,Address,Reviewer,Review Title,Review,Review Count,Help Count,Attraction Count,Restaurant Count,Hotel Count,Location,Rating Date,Rating

Temple of the Tooth (Sri Dalada Maligawa),Address: Sri Dalada Veediya Kandy 20000 Sri Lanka,WowLao,Temple tour,Visits to places of worship always bring home to me the power of superstition. The Temple of the Tooth was no exception. But I couldn't help but marvel at the fervor with which some devotees were praying. One tip though: the shrine that houses the Tooth  is open only twice a day and so it's best to check these timings ...   More,89,48,7,0,0,Vientiane,2 days ago,3

Temple of the Tooth (Sri Dalada Maligawa),Address: Sri Dalada Veediya Kandy 20000 Sri Lanka,WowLao,Temple tour,Visits to places of worship always bring home to me the power of superstition. The Temple of the Tooth was no exception. But I couldn't help but marvel at the fervor with which some devotees were praying. One tip though: the shrine that houses the Tooth  is open only twice a day and so it's best to check these timings  though I would imagine that the crowds would be at a peak.,89,48,7,0,0,Vientiane,2 days ago,3

As you can see, the first row has a partial review, whereas the second row has the full review.

What I want to achieve is to detect duplicates like this, remove the row that has the partial review, and keep the row that has the full review.

I see that every partial review ends with "More"; can this somehow be used to filter out partial reviews?

How can I go about this using OpenCSV?

Mahesh De Silva

4 Answers


How about the following:

 HashMap<String, String[]> preferredReviews = new HashMap<>();
 int indexOfReview = 4;
 CSVReader reader = new CSVReader(new FileReader("reviews.csv"));
 String [] nextLine;
 while ((nextLine = reader.readNext()) != null) {
     String reviewId = nextLine[0];
     String[] prevReview = preferredReviews.get(reviewId);
     if (prevReview == null || prevReview[indexOfReview].length() < nextLine[indexOfReview].length()) {
         preferredReviews.put(reviewId, nextLine);
     }
 }

The second clause of the if statement compares review lengths to decide which row to keep. What I like about this approach is that if, for some reason, the full review is not present, you will at least get the short one.

But it can be changed to check for the "More" marker instead of comparing review lengths.

 HashMap<String, String[]> preferredReviews = new HashMap<>();
 int indexOfReview = 4;
 CSVReader reader = new CSVReader(new FileReader("reviews.csv"));
 String [] nextLine;
 while ((nextLine = reader.readNext()) != null) {
     String reviewId = nextLine[0];
     // keep the row unless it is a partial ("More") and a row is already stored
     if (!nextLine[indexOfReview].trim().endsWith("More")
             || !preferredReviews.containsKey(reviewId)) {
         preferredReviews.put(reviewId, nextLine);
     }
 }
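To see the length-based selection in action without a file on disk, here is a self-contained sketch running the same logic over in-memory rows (the `ReviewDedup` class name and sample rows are made up for illustration):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReviewDedup {
    static final int REVIEW_COL = 4; // index of the Review column

    // Keep, per organization (column 0), the row whose review text is longest.
    static Map<String, String[]> dedupe(List<String[]> rows) {
        Map<String, String[]> preferred = new HashMap<>();
        for (String[] row : rows) {
            String[] prev = preferred.get(row[0]);
            if (prev == null || prev[REVIEW_COL].length() < row[REVIEW_COL].length()) {
                preferred.put(row[0], row);
            }
        }
        return preferred;
    }

    public static void main(String[] args) {
        String[] partial = {"Temple", "addr", "WowLao", "Tour", "Visits ...   More"};
        String[] full    = {"Temple", "addr", "WowLao", "Tour", "Visits to places of worship were great."};
        // The longer (full) review wins regardless of row order.
        System.out.println(dedupe(List.of(partial, full)).get("Temple")[REVIEW_COL]);
    }
}
```

Because the comparison only looks at length, the order in which the partial and full rows appear in the CSV does not matter.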
Jose Martinez

Note: it is not okay to use another web service's data commercially without explicit permission.

Having said that: Basically, openCSV will give you an enumeration of arrays. The arrays are your lines.

You need to copy your lines into another, more semantic data structure. Judging from your header row, I would create a bean like this.

public class TravelRow {
   String organization;
   String address;
   String reviewer;
   String reviewTitle;
   String review; // you get it... 

   public TravelRow(String[] row) {
       // assign row-index to property
       this.organization = row[0];
       // you get it ...
   }
}

You may want to generate getXXX and setXXX functions for it.

Now you need to find a primary key for the row; I suggest the organization. Iterate over the rows, create a bean for each, and add it to a HashMap keyed by organization.

If the organization is already in the map, compare the current review with the stored one. If the new review is longer, or the stored one ends with "More", replace the object in the map.

After iterating over all lines, you have a Map with the reviews you want.

Map<String, TravelRow> result = new HashMap<>();
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
String [] nextLine;
while ((nextLine = reader.readNext()) != null) {
   // nextLine[] is an array of values from the line
   if( result.containsKey(nextLine[0]) ) {
       // compare the review
       if( reviewNeedsUpdate(result.get(nextLine[0]), nextLine[4]) ) {
           result.get(nextLine[0]).setReview(nextLine[4]); // update only the review, create a new object, if you like
       }
   }
   else {
       // create TravelRow with array using the constructor eating the line
       result.put(nextLine[0], new TravelRow(nextLine));
   }
}

reviewNeedsUpdate(TravelRow row, String review) compares review with row.review and returns true if the new review is better. You can extend this function until it matches your needs.

private boolean reviewNeedsUpdate( TravelRow row, String review ) {
    // the stored review is truncated ("More" marker) and the new one is not
    return ( row.review.trim().endsWith("More") && !review.trim().endsWith("More") );
}
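As a standalone sanity check, that predicate can be exercised on plain strings (the `ReviewCheck` class and method signature here are illustrative; note the sample data capitalizes "More"):

```java
public class ReviewCheck {
    // True when the stored review is a truncated "More" version and the new one is not.
    // trim() guards against trailing whitespace before the marker.
    static boolean reviewNeedsUpdate(String stored, String candidate) {
        return stored.trim().endsWith("More") && !candidate.trim().endsWith("More");
    }

    public static void main(String[] args) {
        System.out.println(reviewNeedsUpdate("Visits ...   More", "Visits to places of worship.")); // true
        System.out.println(reviewNeedsUpdate("Visits to places of worship.", "Visits ...   More")); // false
    }
}
```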
thst

Say you define a class Rating to store the related data.

class Rating {
  public String review;  // consider using getters/setters instead of public fields

  Rating(String review) {
    this.review = review;
  }
}

Read the content of the CSV.

Set<Rating> readCSV() {
  List<String[]> csv = new CSVReader(new FileReader("reviews.csv")).readAll();
  List<Rating> ratings = csv.stream()
      .map(row -> new Rating(row[4])) // add the other attributes
      .collect(Collectors.toList());
  return mergeRatings(ratings);
}

We will use a TreeSet to sort out the duplicates. That requires a custom comparator that discards items that are already in the set.

class RatingMergerComparator implements Comparator<Rating> {

  @Override
  public int compare(Rating rating1, Rating rating2) {
    if (rating1.review.startsWith(rating2.review) ||
        rating2.review.startsWith(rating1.review)) { 
      return 0;
    }
    return rating1.review.compareTo(rating2.review);
  }
}
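To confirm the merging behavior, here is a minimal standalone demo with a stripped-down Rating class. Because the comparator returns 0 for prefix pairs, whichever rating is added first wins, which is exactly why the merge step sorts longest-first:

```java
import java.util.Comparator;
import java.util.TreeSet;

public class MergeDemo {
    static class Rating {
        String review;
        Rating(String review) { this.review = review; }
    }

    // Treats two reviews as equal when one is a prefix of the other,
    // so the TreeSet silently discards the later-added duplicate.
    static Comparator<Rating> merger = (a, b) -> {
        if (a.review.startsWith(b.review) || b.review.startsWith(a.review)) {
            return 0;
        }
        return a.review.compareTo(b.review);
    };

    public static void main(String[] args) {
        TreeSet<Rating> set = new TreeSet<>(merger);
        set.add(new Rating("Visits to places of worship always bring home superstition."));
        set.add(new Rating("Visits to places of worship")); // prefix -> discarded
        System.out.println(set.size()); // prints 1
    }
}
```

Be aware that such a comparator is not transitive in general (a classic caveat with "fuzzy equality" comparators), which is acceptable here only because duplicates are near-identical prefixes.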

Create the mergeRatings method, plus a helper that strips the "...   More" endings:

void removeMoreEndings(List<Rating> ratings) {
  String suffix = "...   More"; // marker appended to truncated reviews
  for (Rating rating : ratings) {
    if (rating.review.endsWith(suffix)) {
      rating.review = rating.review.substring(0, rating.review.length() - suffix.length());
    }
  }
}

Set<Rating> mergeRatings(List<Rating> ratings) {
  removeMoreEndings(ratings); // strip all "...   More" endings
  // sort ratings by length in a descending order, since the set will discard certain items,
  // it is important to keep the longer ones, so they come first
  ratings.sort(Comparator.comparing((Rating rating) -> rating.review.length()).reversed());
  TreeSet<Rating> mergedRatings = new TreeSet<>(new RatingMergerComparator());
  mergedRatings.addAll(ratings);
  return mergedRatings;
}

UPDATE

I may have misread the OP. The above solution performs well even if the records that have to be merged are far apart in the CSV. If you are sure the partial and full reviews are always consecutive, the above may be overkill.

David Frank

It depends on how you are reading the data.

If you are reading the data as beans using a MappingStrategy, you can create your own filter using the CsvToBeanFilter interface and inject it into the CsvToBean class. This causes a line to be kept (allowed) or skipped based on the criteria in its allowLine method. The javadocs for CsvToBeanFilter give an excellent example; for your case you would allow all lines whose Review column does not end with "More".

If you are just using the CSVReader/CSVParser, it will be a little trickier. You will need to read the header to see which column is the Review. Then, when reading each line, look at the element at that index, and if it ends in "More", do not process it.
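That second approach can be sketched without touching opencsv itself, assuming the rows have already been read into memory (for example via CSVReader.readAll()); the `ReviewFilter` class and helper name are illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReviewFilter {
    // Given rows already read into memory (the first row being the header),
    // drop every data row whose Review column ends with "More".
    static List<String[]> dropPartials(List<String[]> rows) {
        int reviewIdx = Arrays.asList(rows.get(0)).indexOf("Review");
        List<String[]> kept = new ArrayList<>();
        kept.add(rows.get(0)); // keep the header row
        for (String[] row : rows.subList(1, rows.size())) {
            if (!row[reviewIdx].trim().endsWith("More")) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[]{"Organization", "Address", "Reviewer", "Review Title", "Review"});
        rows.add(new String[]{"Temple", "addr", "WowLao", "Tour", "Visits ...   More"});
        rows.add(new String[]{"Temple", "addr", "WowLao", "Tour", "Visits to places of worship."});
        System.out.println(dropPartials(rows).size()); // prints 2 (header + full review)
    }
}
```

Looking the Review column up by name in the header keeps the filter working even if the column order changes between scrapes.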

Scott Conway