
The aim is to get the latest 100 files. Currently this is done by scanning all files, building a file list, and then applying sort + limit.

This is very slow when the directory is very large. Is there any way or API that does this without loading the full file list?

The following three approaches do not give satisfactory performance when the number of files is in the range of a few thousand:

  • `File.listFiles` - Java 1.2
  • `DirectoryStream` - Java 1.7
  • `Files.walk` - Java 1.8
RaviSam
  • You want to get the n files with the latest updated timestamp? – dan1st Oct 20 '20 at 11:43
  • @dan1st I want to get latest 100 files from say 20,000 files. – RaviSam Oct 20 '20 at 11:49
  • A java directory watch service perhaps (see the sketch after these comments). Though large directories are inherently slow. A ProcessBuilder process with a Linux filter for the last files? – Joop Eggen Oct 20 '20 at 11:50
  • If you have any control over the creation of the files then sharding them into subdirectories, by time of creation, may help. – tgdavies Oct 20 '20 at 12:00
  • @tgdavies good idea of sharding, but not applicable to my case, since the file system is controlled by the end user. My case is similar to G.Drive or Dropbox. – RaviSam Oct 20 '20 at 12:03
  • As Java is inherently slow with file operations (due to the cross platform implementation), it might be quicker using a native command line list function and import the n lines of that output. Also parallelism isn't helping because you end up running into the filesystem limitations at some point (unless it's high end PCI 4 SSD). – Fullslack Oct 20 '20 at 12:03
  • @RaviSam of course try a WatchService first. Every boot requires a full directory scan first (Files.list) but then it should be faster. – Joop Eggen Oct 20 '20 at 12:03
  • Are the latest files newly created, or may they be files which were already present which have been updated? – tgdavies Oct 20 '20 at 12:10
  • @tgdavies already present. – RaviSam Oct 20 '20 at 12:20
  • Iterate through all files and store them in a `PriorityQueue`; whenever the size reaches 101, remove the oldest. – Holger Nov 06 '20 at 08:46
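
A minimal sketch of the WatchService idea from the comments above, assuming a placeholder directory path: after one full scan at startup, the newest-100 list can be kept current by reacting to create/modify events instead of rescanning the whole directory. The class name and the bookkeeping stub are illustrative, not part of any answer here.

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch of the WatchService approach suggested in the comments.
// "/some/large/dir" is a placeholder; the newest-100 bookkeeping is
// left as a stub since it depends on the surrounding application.
public class DirWatcher {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path dir = Paths.get("/some/large/dir"); // placeholder path
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY);
        while (true) {
            WatchKey key = watcher.take(); // blocks until events arrive
            for (WatchEvent<?> event : key.pollEvents()) {
                if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                    continue; // events were lost; a full rescan is needed
                }
                Path changed = dir.resolve((Path) event.context());
                System.out.println(event.kind() + ": " + changed);
                // here: update the in-memory newest-100 structure
            }
            if (!key.reset()) {
                break; // directory no longer accessible
            }
        }
    }
}
```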

1 Answer


You have to look at the attributes of each file to find its age, and you have to look at them all to find the N newest.

Your only freedom of choice is in how you do the looking. There's no need to read the file contents, for example.

I'd consider using `Files.find()`. From its documentation, it appears to do the minimum work required: the matcher receives each entry's `BasicFileAttributes`, so the timestamp is available without a separate attribute read per file.
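
A minimal sketch of that idea, assuming a placeholder directory path; the one-day cutoff is purely illustrative:

```java
import java.io.IOException;
import java.nio.file.*;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.stream.Stream;

// Sketch: Files.find() passes each entry's BasicFileAttributes straight
// to the matcher, so the last-modified time is available without an
// extra stat call per file.
public class FindRecent {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("/some/large/dir"); // placeholder path
        Instant cutoff = Instant.now().minus(1, ChronoUnit.DAYS);
        try (Stream<Path> recent = Files.find(dir, 1,
                (path, attrs) -> attrs.isRegularFile()
                        && attrs.lastModifiedTime().toInstant().isAfter(cutoff))) {
            recent.forEach(System.out::println);
        }
    }
}
```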

You don't need to save all files. Track the oldest of the newest 100 seen. If the 'next' file is older than that, you don't need to keep it. Otherwise you have to figure out which of the 100 to discard. This trades off overhead of keeping an entire list for overhead of deciding what to discard. It could work in your favour if the number of files is much larger than 100.
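
One way to implement that bookkeeping is a bounded min-heap, along the lines of Holger's comment above. A sketch, assuming a placeholder directory and using `Files.walkFileTree()` so the visitor receives each file's attributes directly; the class and method names are made up for illustration:

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;
import java.util.*;

public class Newest100 {
    public static List<Map.Entry<Path, FileTime>> newest(Path dir, int n) throws IOException {
        // Min-heap by modification time: the head is always the oldest of
        // the newest n seen so far, so evicting it is O(log n).
        PriorityQueue<Map.Entry<Path, FileTime>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        Files.walkFileTree(dir, EnumSet.noneOf(FileVisitOption.class), 1,
                new SimpleFileVisitor<Path>() {
                    @Override
                    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                        if (attrs.isRegularFile()) {
                            heap.offer(new AbstractMap.SimpleImmutableEntry<>(
                                    file, attrs.lastModifiedTime()));
                            if (heap.size() > n) {
                                heap.poll(); // older than all n kept: discard
                            }
                        }
                        return FileVisitResult.CONTINUE;
                    }
                });
        List<Map.Entry<Path, FileTime>> result = new ArrayList<>(heap);
        result.sort(Map.Entry.<Path, FileTime>comparingByValue().reversed());
        return result; // newest first
    }
}
```

The heap never holds more than n + 1 entries, so memory stays constant no matter how many files the directory contains, and the final sort only touches the n survivors.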

To some extent the overhead is file-system dependent. If the last-modified time is stored in the directory entry then there's no need to look at the inode to get it. That's not under your control, of course.

user14387228
  • well so in our case we've seen files in the multiples of 10k - meaning some folders having 20,000 files, 70,000 even! LARGE directories! So to find the latest 100, the only way is to iterate through all and then figure it out - right? – RaviSam Oct 20 '20 at 12:36
  • Yes. The design that puts that many files in one directory is defective, so you have to program round that defective design. How about moving the 69,900 files that are not the newest into a separate directory? That's a one-time hit that will bring future benefit. – user14387228 Oct 20 '20 at 17:06
  • @user14387228 unfortunately, from the OP's comments above it sounds as though any file may be modified and become a newest file. – tgdavies Oct 20 '20 at 20:22