
I am working on a personal database project. My input file comes from: http://ir.dcs.gla.ac.uk/resources/test_collections/cran/

After splitting it into 1400 separate files (named 00001.txt, ..., 01400.txt) and applying stemming to them, I store the results in a dedicated folder, let's call it StemmedFolder, in the following format:

in StemmedFolder: 00001.txt includes:

investig
aerodynam
wing
slipstream
brenckman
experiment
investig
aerodynam
wing

in StemmedFolder: 00756.txt includes:

remark
eddi
viscos
compress
mix
flow
lu
ting

And so on....

I wrote code that does the following:

  1. Read each file in StemmedFolder and count its unique words
  2. Sort the words alphabetically
  3. Add the ID of the document
  4. Save each result to a new file (00001.txt to 01400.txt), as described below

{I can provide my code for these 4 steps if anybody needs to see the implementation or wants to suggest changes}


The output for each document is written to a separate file (1400 files, named 00001.txt, 00002.txt, ...) in a dedicated folder, let's call it FrequenceyFolder, in the following format:

in FrequenceyFolder: 00001.txt includes:

00001,aerodynam,2
00001,agre,3
00001,angl,1
00001,attack,7
00001,basi,4
....

in FrequenceyFolder: 00999.txt includes:

00999,aerodynam,5
00999,evalu,1
00999,lift,3
00999,ratio,2
00999,result,9
....

in FrequenceyFolder: 01400.txt includes:

01400,subtract,1
01400,support,1
01400,theoret,1
01400,theori,1
01400,.....
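
For illustration only (this is not the code mentioned above), the four steps could be sketched roughly as follows. The class name `FrequencyLines` and the helper `toFrequencyLines` are hypothetical, and the sketch assumes the stemmed words of one document have already been read into a list:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FrequencyLines {

    // Hypothetical helper: counts each stemmed word of one document and
    // returns "docId,word,count" lines, sorted alphabetically by word
    // (a TreeMap iterates its keys in sorted order).
    static List<String> toFrequencyLines(String docId, List<String> stemmedWords) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : stemmedWords) {
            counts.merge(word, 1, Integer::sum); // start at 1, otherwise add 1
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            lines.add(docId + "," + entry.getKey() + "," + entry.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        // A few words taken from the 00001.txt sample above.
        List<String> words = Arrays.asList(
                "investig", "aerodynam", "wing", "slipstream",
                "investig", "aerodynam", "wing");
        toFrequencyLines("00001", words).forEach(System.out::println);
    }
}
```

Writing the returned lines to a file in FrequenceyFolder is left out to keep the sketch short.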

______________

Now my question:

I need to combine these 1400 files again into a single txt file in the following format, which requires some calculation:

'aerodynam' totalFrequency=3docs: [[Doc_00001,5],[Doc_01344,4],[Doc_00123,3]]
'book' totalFrequency=2docs: [[Doc_00562,6],[Doc_01111,1]]
....
....
'result' totalFrequency=1doc: [[Doc_00010,5]]
....
....

'zzzz' totalFrequency=1doc: [[Doc_01235,1]]

Thanks for spending time reading this long post

Rebin

1 Answer

You can use a Map of Lists.

Map<String, List<FileInformation>> statistics = new HashMap<>();

In the above map, the key will be the word and the value will be a List<FileInformation> object describing the statistics of individual files containing the word. The FileInformation class can be declared as follows :

class FileInformation {
    int occurrenceCount;
    String fileName;

    //getters and setters
}

To populate the above Map, use the following steps :

  1. Read each file in the FrequencyFolder
  2. When you come across a word for the first time, put it as a key in the Map.
  3. Create a FileInformation object and set the occurrenceCount to the number of occurrences found and set the fileName to the name of the file it was found in. Add this object in the List<FileInformation> corresponding to the key created in step 2.
  4. The next time you come across the same word in another file, create a new FileInformation object and add it to the List<FileInformation> corresponding to the entry in the map for the word.
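
A minimal sketch of these steps, assuming every line in the frequency files has the form docId,word,count as shown in the question. The class name `StatisticsBuilder` and the methods `addLine` and `readFolder` are made up for illustration, the document ID from each line stands in for the file name, and a TreeMap replaces the HashMap only so that the words come out in alphabetical order:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StatisticsBuilder {

    static class FileInformation {
        final String fileName;      // here: the document ID, e.g. "00001"
        final int occurrenceCount;

        FileInformation(String fileName, int occurrenceCount) {
            this.fileName = fileName;
            this.occurrenceCount = occurrenceCount;
        }
    }

    // Parses one "docId,word,count" line and records it under its word.
    // Lines that do not match that shape are skipped.
    static void addLine(Map<String, List<FileInformation>> statistics, String line) {
        String[] parts = line.trim().split(",");
        if (parts.length != 3) {
            return;
        }
        int count;
        try {
            count = Integer.parseInt(parts[2].trim());
        } catch (NumberFormatException e) {
            return;
        }
        // Steps 2 to 4: create the list on first sight of the word, then append.
        statistics.computeIfAbsent(parts[1], k -> new ArrayList<>())
                  .add(new FileInformation(parts[0], count));
    }

    // Step 1: read every .txt file in the given folder.
    static Map<String, List<FileInformation>> readFolder(Path folder) throws IOException {
        Map<String, List<FileInformation>> statistics = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(folder, "*.txt")) {
            for (Path file : files) {
                for (String line : Files.readAllLines(file)) {
                    addLine(statistics, line);
                }
            }
        }
        return statistics;
    }

    public static void main(String[] args) {
        // In-memory demonstration using two lines from the question's samples.
        Map<String, List<FileInformation>> statistics = new TreeMap<>();
        addLine(statistics, "00001,aerodynam,2");
        addLine(statistics, "00999,aerodynam,5");
        System.out.println(statistics.get("aerodynam").size() + " documents contain 'aerodynam'");
    }
}
```

With 1400 small files, holding the whole map in memory should not be a problem, which is why the sketch does no streaming or batching.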

Once you have the Map populated, printing the statistics should be a piece of cake.

for (String word : statistics.keySet()) {
  List<FileInformation> fileInfos = statistics.get(word);
  int totalFrequency = 0;
  for (FileInformation fileInfo : fileInfos) {
      // sum up the occurrenceCount for the word to get the total frequency
      totalFrequency += fileInfo.occurrenceCount;
  }
}
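
As a rough sketch, one entry could be rendered in the format requested in the question like this. Judging from the sample lines there, the number after totalFrequency looks like the count of documents containing the word, and the documents appear ordered by descending count; both are assumptions read off the sample. The names `StatisticsPrinter`, `formatEntry` and `totalOccurrences` are hypothetical, and fileName is assumed to hold the document ID:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class StatisticsPrinter {

    static class FileInformation {
        final String fileName;      // assumed to hold the document ID, e.g. "00001"
        final int occurrenceCount;

        FileInformation(String fileName, int occurrenceCount) {
            this.fileName = fileName;
            this.occurrenceCount = occurrenceCount;
        }
    }

    // Sum of occurrences over all documents, in case that is the number wanted.
    static int totalOccurrences(List<FileInformation> infos) {
        int total = 0;
        for (FileInformation info : infos) {
            total += info.occurrenceCount;
        }
        return total;
    }

    // Renders one word as: 'word' totalFrequency=Ndocs: [[Doc_id,count],...]
    static String formatEntry(String word, List<FileInformation> infos) {
        // Copy before sorting so the caller's list is left untouched.
        List<FileInformation> sorted = new ArrayList<>(infos);
        sorted.sort(Comparator.comparingInt((FileInformation f) -> f.occurrenceCount).reversed());

        StringBuilder sb = new StringBuilder();
        sb.append("'").append(word).append("' totalFrequency=")
          .append(sorted.size())
          .append(sorted.size() == 1 ? "doc" : "docs")
          .append(": [");
        for (int i = 0; i < sorted.size(); i++) {
            if (i > 0) {
                sb.append(",");
            }
            sb.append("[Doc_").append(sorted.get(i).fileName)
              .append(",").append(sorted.get(i).occurrenceCount).append("]");
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        List<FileInformation> infos = Arrays.asList(
                new FileInformation("00123", 3),
                new FileInformation("00001", 5),
                new FileInformation("01344", 4));
        System.out.println(formatEntry("aerodynam", infos));
    }
}
```

If the summed count is what is needed instead, `totalOccurrences(infos)` can replace `sorted.size()` in `formatEntry`.
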
Chetan Kinger
  • Thanks for the time you spent thinking about my problem. Let me implement your idea and post my code here. Meanwhile, I have added the code that generates the files; please take a look at it. – Rebin May 30 '15 at 02:10
  • Sorry, I did not know that I should not post my code here. I will delete my original code and also the code related to your suggestion. Thank you. – Rebin May 30 '15 at 08:09
  • @Rebin You are allowed to post code here. However, you should not take someone's answer, convert it to code and post it as part of your question. You should post a new question on the site instead of updating the existing question, since Stack Overflow does not work like a forum. If you feel that this answer was helpful, you can upvote it and accept the answer. – Chetan Kinger May 30 '15 at 08:11
  • I think there is a misunderstanding. As I mentioned in my original post, I implemented the code myself up to **(Now My Question is ....)** and said I could provide it if somebody needed it. I thought it would be helpful to show you my original code, and I was thinking about your suggestion and implemented it partially, so I put that there too to see whether I had understood it correctly. Anyway, thanks for your time and effort. – Rebin May 30 '15 at 08:18
  • @Rebin Yes. The code that you wrote and posted is not a problem. The code that you wrote after my suggestion is also not a problem. But you should not edit the question and post the code as part of the question. Instead, you should post a new question on the site with the current question as the background. You should upvote and accept an answer if it was helpful. – Chetan Kinger May 30 '15 at 08:20
  • Thanks for spending time on this post. I am very new to asking questions and I didn't know that I should post another question. I will do that. Thank you. I also voted on your answer. – Rebin May 30 '15 at 08:23
  • @Rebin That's not a problem. There is a first time for everything. Don't forget to click on the tick and accept the answer if you feel that this answer is what you were looking for. Questions with accepted answers are still open to new answers. Read : [What does it mean when an answer is accepted](http://stackoverflow.com/help/accepted-answer) – Chetan Kinger May 30 '15 at 08:27