I am attempting to read in all of the files in all subdirectories of a directory. I have the logic written, but I am doing something slightly wrong because it is reading in each file twice.
To test my implementation, I created a directory containing three subdirectories, each holding 10 documents. That should be 30 documents in total.
Here is my code for testing that I am reading in the documents correctly:
String basePath = "src/test/resources/20NG";
Driver driver = new Driver();
List<Document> documents = driver.readInCorpus(basePath);
assertEquals(3 * 10, documents.size());
Where Driver#readInCorpus has the following code:
public List<Document> readInCorpus(String directory)
{
    try (Stream<Path> paths = Files.walk(Paths.get(directory)))
    {
        return paths
                .filter(Files::isDirectory)
                .map(this::readAllDocumentsInDirectory)
                .flatMap(Collection::stream)
                .collect(Collectors.toList());
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    return Collections.emptyList();
}

private List<Document> readAllDocumentsInDirectory(Path path)
{
    try (Stream<Path> paths = Files.walk(path))
    {
        return paths
                .filter(Files::isRegularFile)
                .map(this::readInDocumentFromFile)
                .collect(Collectors.toList());
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    return Collections.emptyList();
}

private Document readInDocumentFromFile(Path path)
{
    String fileName = path.getFileName().toString();
    String outputClass = path.getParent().getFileName().toString();
    List<String> words = EmailProcessor.readEmail(path);
    return new Document(fileName, outputClass, words);
}
When I run the test case, the assertEquals fails because 60 documents were retrieved instead of 30. When I debugged, I saw that all of the documents were inserted into the list once and then inserted again in exactly the same order.
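To rule out Document and EmailProcessor as the cause, the traversal can be reduced to counting paths. The sketch below is a hypothetical stand-alone reproduction: the class name WalkRepro, the temporary directory layout, and the file names are assumptions made for illustration and are not part of my project. It mirrors the same two nested Files.walk calls and prints how many regular files each visited directory contributes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WalkRepro
{
    // Hypothetical fixture: <tmp>/group0 .. group{dirs-1}, each with filesPerDir empty files.
    static Path createFixture(int dirs, int filesPerDir) throws IOException
    {
        Path base = Files.createTempDirectory("walk-repro");
        for (int d = 0; d < dirs; d++)
        {
            Path sub = Files.createDirectory(base.resolve("group" + d));
            for (int f = 0; f < filesPerDir; f++)
            {
                Files.createFile(sub.resolve("doc" + f + ".txt"));
            }
        }
        return base;
    }

    // Same shape as readAllDocumentsInDirectory: recursive walk, keep regular files.
    static List<Path> filesUnder(Path dir)
    {
        try (Stream<Path> paths = Files.walk(dir))
        {
            return paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }

    // Same shape as readInCorpus: recursive walk, keep directories, then walk each one again.
    static long countNested(Path base) throws IOException
    {
        try (Stream<Path> dirs = Files.walk(base))
        {
            return dirs
                    .filter(Files::isDirectory)
                    .mapToLong(dir ->
                    {
                        int n = filesUnder(dir).size();
                        // Show which directory the outer walk handed to the inner walk.
                        System.out.println(dir.getFileName() + " -> " + n + " files");
                        return n;
                    })
                    .sum();
        }
    }

    public static void main(String[] args) throws IOException
    {
        Path base = createFixture(3, 10);
        System.out.println("total = " + countNested(base));
    }
}
```

With 3 subdirectories of 10 files each, the total comes out as 60, matching the failing assertion. The per-directory lines show which directories the outer walk passes to the inner one.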
Where in my code am I reading in the documents twice?