How can I optimize this code?

Question

My current project has us using TreeSet and TreeMap in Java, with an input array of 10514 Song elements read in from a text file. Each Song contains a Artist, Title and Lyric fields. The aim of this project is to conduct fast searches on the lyrics using sets and maps.

First, I iterate over the input Song array, accessing the lyrics field and creating a Scanner object to iterate over the lyric words using this code: commonWords is a TreeSet of words that should not be keys, and lyricWords is the overall map of words to Songs.

public void buildSongMap() {
    for (Song song:songs) {
        //method variables
        String currentLyrics= song.getLyrics().toLowerCase(); 
        TreeSet<Song> addToSet=null;
        Scanner readIn= new Scanner(currentLyrics);
        String word= readIn.next();

        while (readIn.hasNext()) {

            if (!commonWords.contains(word) && !word.equals("") && word.length()>1) {
                if (lyricWords.containsKey(word)) {
                    addToSet= lyricWords.get(word);
                    addToSet.add(song);
                    word=readIn.next();
                } else 
                    buildSongSet(word);

            } else 
                word= readIn.next();
        }

    }

In order to build the songSet, I use this code:

public void buildSongSet(String word) {     
    TreeSet<Song> songSet= new TreeSet<Song>();
    for (Song song:songs) {
        //adds song to set 
        if (song.getLyrics().contains(word)) {
            songSet.add(song);
        }
    }
    lyricWords.put(word, songSet);
    System.out.println("Word added "+word);
}

Now, since buildSongSet is called from inside a loop, creating the map executes in N^2 time. When the input array is 4 songs, searches run very fast, but when using the full array of 10514 elements, it can take over 15+ min to build the map on a 2.4GHz machine with 6 GiB RAM. What can I do to make this code more efficient? Unfortunately, reducing the input data is not an option.

Could you include the definitions of commonWords and lyricWords — mR_fr0g, Nov 03 '10 at 16:39
added definitions. commonWords is a set of words that should not be keys and lyricWords is the overall structure that searches are run on. — Jason, Nov 03 '10 at 16:53

score 6 · Accepted Answer · edited May 23 '17 at 12:11

It looks like your buildSongSet is doing redundant work. Your block:

if (lyricWords.containsKey(word)) {
    addToSet= lyricWords.get(word);
    addToSet.add(song);
    word=readIn.next();
}

adds a song to an existing set. So, when you find a word you don't know about, just add one song to it. Change buildSongSet to:

public void buildSongSet(String word, Song firstSongWithWord) {     
    TreeSet<Song> songSet= new TreeSet<Song>();
    songSet.add(firstSongWithWord);
    lyricWords.put(word, songSet);
    System.out.println("Word added "+word);
}

the remaining songs left to be iterated will then be added to that songset from the first block of code if they contain that word. I think that should work.

EDIT just saw this was homework... so removed the HashSet recommendations..

Ok.. so suppose you have these Songs in order with lyrics:

Song 1 - foo
Song 2 - foo bar
Song 3 - foo bar baz

Song 1 will see that foo does not contain lyricWords, so it will call buildSongSet and create a set for foo. It will add itself into the set containing foo.

Song 2 will see that foo is in lyricWords, and add itself to the set. It will see bar is not in the set, and create a set and add itself. It doesn't need to traverse previous songs since the first time the word was seen was in Song 2.

Song 3 follows the same logic.

Another thing you can try doing to optimize your code is to figure out a way to not process duplicate words in the lyrics. if your lyrics are foo foo foo foo bar bar bar foo bar then you're going to be doing a lot of unnecessary checks.

EDIT also see rsp's answer - additional speedups there, but the big speedup is getting rid of the inner loop - glad it's down to 15 secs now.

My thinking was when a new word is encountered, iterate over the song list and add the matches to the set. If I follow your code, only the songs after that point will be added to the set, but there is the possiblity that the previous songs will also be matches, thus the nested loop. — Jason, Nov 03 '10 at 16:50
If the previous songs contained that word, wouldn't it have triggered buildSongSet anyway? If you follow the logic, previous songs won't contain any words that you haven't already called buildSongSet for. I'll try to explain in the answer.. — NG., Nov 03 '10 at 17:12
Alright, makes sense. I altered the code, it executes in ~15 seconds now. Eliminating that second loop did the trick. Unfortunately, using HashSet is not an option, project requires using TreeSet — Jason, Nov 03 '10 at 17:24

score 4 · Answer 2 · answered Nov 03 '10 at 17:11

4

The whole buildSongSet() method is not needed imho, as your main loop already adds songs to the collection by word. The only thing you are missing is the addition of a set for a new word, something like:

if (lyricWords.containsKey(word)) {
    addToSet= lyricWords.get(word);
} else {
    addToSet = new TreeSet();
    lyricWords.put(word, addToSet);
}
addToSet.add(song);

One issue that you did not tackle is that songs end up being added to the set multiple times, for every occurence of the word in the song.

Another issue is that in the case that a song contains just 1 word, you do not add it at all! It is always better to check the condition first:

String word = null;
while (readIn.hasNext()) {
    word = readIn.next();

Your condition is doing one check too many (the empty string has length < 1), and swapping the checks can speed up things too:

if (word.length() > 1 && !commonWords.contains(word)) {

answered Nov 03 '10 at 17:11

rsp

23,135
6
55
69

1

"One issue that you did not tackle is that songs end up being added to the set multiple times, for every occurence of the word in the song." This is a `Set`, not a `List`. If you try to [`add`](http://download.oracle.com/javase/6/docs/api/java/util/Set.html#add%28E%29) a duplicate element, it just returns `false`. – Powerlord Nov 03 '10 at 17:35
thats why I didn't use a conditional to check. both `contains()` and `add()` are O(logN). If the add were O(N), then I would use the check. – Jason Nov 03 '10 at 19:28
I marked the other answer the accepted answer because it helped me the most, but if possible I would have accepted yours as well. Thanks for showing me the reordering of the if statement – Jason Nov 03 '10 at 19:45

score 3 · Answer 3 · answered Nov 03 '10 at 16:30

3

Please, try change TreeSet to HashSet. I can't see where you obtain the benefits of TreeSet.

answered Nov 03 '10 at 16:30

Vladimir Ivanov

42,730
18
77
103

except this is a requirement of the project. From the objectives: **Using Java TreeSet and TreeMap and set operations** – Jason Nov 03 '10 at 16:40

score 0 · Answer 4 · edited May 23 '17 at 12:04

0

if you want a very extensible, easy way of solving this with performance in the order of a few millisecons. Consider lucene http://lucene.apache.org/

refer to my answer here for example of how to index and search How do I index and search text files in Lucene 3.0.2?

edited May 23 '17 at 12:04

Community

1
1

answered Nov 03 '10 at 22:29

smartnut007

6,324
6
45
52

For a real-world situation, I might consider your solution, but this is a data structures class project designed to give us hands-on experience with all aspects of structures. Using Lucene would short-circuit this purpose. – Jason Nov 03 '10 at 22:47

How can I optimize this code?

4 Answers4