How to tag and store files, by metadata, in Python?

Question

I want to build a manual file tagging system like this. Given that a folder contains these files:

data/
    budget.xls
    world_building_budget.txt
a.txt
b.exe
hello_world.dat
world_builder.spec

I want to write a tagging system where executing

py -3 tag_tool.py -filter=world -tag="World-Building Tool"

will output

These files were tagged with "World-Building Tool":
    data/world_building_budget.txt
    hello_world.dat
    world_builder.spec

Another example. If I execute:

py -3 tag_tool.py -filter="\.txt" -add_tag="Human Readable"

It will output

These files were tagged with "Human Readable":
    data/world_building_budget.txt
    a.txt

I am not asking "Do my homework for me". I want to know what approach I can take to build something like this. What data structure should I use? How should I tag contents in a directory?

score 2 · Accepted Answer · edited May 09 '22 at 12:01

First, I am not clear if this is actually homework, but my first recommendation is always to see if it's already done (and it seems to be): https://pypi.org/project/pytaggit/

If I were to ignore that and build it myself, I would consider what a tagging system's structure is. Long story (skip ahead if not interested): consider a simple file system. It has exactly one path to every file. You can do a string search by file name or even by properties, but the organization is such that a file can only exist in one place. This is much like a physical filing system.

In virtual file systems, we also have file links. There are two types: soft (shortcuts in Windows) and hard. Soft links make a file appear as though it were in multiple locations. This is like being able to file Soccer under "S" and creating another file called Football under "F" that just says "see Soccer". By contrast, hard links actually make it so that the file effectively exists in multiple locations. This would be like being able to pull the exact same file "Soccer" from both "F" and "S". If someone makes a change to one, the change is made to both.

This is still a very limited organization, restricted to file location. If you wanted to be nimble and apply arbitrary organizations, hard links become heavy to maintain. Tagging is another way to accomplish this without too much overhead.

...... Past the skipped part ......

There is more than one way to accomplish this, but here is a generic look at what is needed. Files and tags have a many-to-many relationship: you should be able to look at a file and see all tags associated with it, AND you should be able to look at a tag and see all files associated with it. If you want to store the data once, you have to choose which direction to optimize, because the data is organized only one way. Lookup in that direction will be natural, and lookup in the other direction will require processing. If you are willing to maintain two data sets (or indexes), you can store both forward and reverse lookups. If you know that your data won't grow past a certain size and/or the usage will typically only require one direction, then one index should be fine. Otherwise, I would choose two. The trade-off is the overhead of keeping them in sync.
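
As a rough illustration of the two-index option, here is a minimal sketch. The class name `TagIndex` and its method names are made up for this example, and sets are used so that tagging a file twice has no effect. Both indexes are updated in the same method, which is one simple way to keep them in sync.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical two-way index: file -> tags and tag -> files."""

    def __init__(self):
        self.tags_by_file = defaultdict(set)  # forward lookup
        self.files_by_tag = defaultdict(set)  # reverse lookup

    def add(self, filename, tag):
        # Update both indexes together so they cannot drift apart.
        self.tags_by_file[filename].add(tag)
        self.files_by_tag[tag].add(filename)

    def remove(self, filename, tag):
        # discard() does nothing if the pair was never recorded.
        self.tags_by_file[filename].discard(tag)
        self.files_by_tag[tag].discard(filename)

    def tags(self, filename):
        return sorted(self.tags_by_file.get(filename, ()))

    def files(self, tag):
        return sorted(self.files_by_tag.get(tag, ()))


index = TagIndex()
index.add("a.txt", "Human Readable")
index.add("data/world_building_budget.txt", "Human Readable")
index.add("data/world_building_budget.txt", "World-Building Tool")

print(index.files("Human Readable"))
print(index.tags("data/world_building_budget.txt"))
```

Both lookups are then a single dictionary access, at the cost of storing every file/tag pair twice.
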
If you want to optimize for tags(filename), then you would probably use a dict with something like

filenameTags = {'myFileName': ['tag1', 'tag2', ...]}

Getting filenames from tags with this structure requires searching every embedded list and returning the associated key whenever there is a match. You can reverse this structure (filenames(tag)) if you want to optimize the other way. You can also keep both structures, but then you have the overhead of keeping them in sync.
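
A minimal sketch of that reverse lookup, assuming the single forward dict described above. The helper names `files_with_tag` and `invert` are invented for illustration, and the sample data is borrowed from the question.

```python
# Forward index only: filename -> list of tags (sample data from the question).
filename_tags = {
    "data/world_building_budget.txt": ["World-Building Tool", "Human Readable"],
    "hello_world.dat": ["World-Building Tool"],
    "a.txt": ["Human Readable"],
}


def files_with_tag(index, tag):
    """Reverse lookup by scanning every tag list in the forward index."""
    return [name for name, tags in index.items() if tag in tags]


def invert(index):
    """Build the reversed structure (tag -> filenames) in one pass."""
    reverse = {}
    for name, tags in index.items():
        for tag in tags:
            reverse.setdefault(tag, []).append(name)
    return reverse


print(files_with_tag(filename_tags, "Human Readable"))
print(invert(filename_tags)["World-Building Tool"])
```

The scan touches every stored tag on each query, which is fine for a small folder. `invert` could be called once to derive the reversed structure when lookups by tag become frequent.
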
Lastly, to keep this persistent, save it to a file or a DB. Redis is a good fit, since its native set type maps directly onto tag-to-files lookups.
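
For the plain-file option, one possible sketch uses a JSON file from the standard library (this is not Redis, just the simplest thing that works). The file name `tags.json` and the helper names are assumptions for the example.

```python
import json
import os
import tempfile


def save_tags(index, path):
    """Write the filename -> tags mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(index, fh, indent=2, sort_keys=True)


def load_tags(path):
    """Read the mapping back; start empty if nothing has been saved yet."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Demo in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    store = os.path.join(tmp, "tags.json")  # hypothetical file name
    tags = load_tags(store)                 # empty dict on the first run
    tags.setdefault("a.txt", []).append("Human Readable")
    save_tags(tags, store)
    restored = load_tags(store)
    print(restored)
```

JSON has no set type, so tags are stored as lists here. A tool using sets in memory would convert on load and save.
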

Note that `pytaggit` does not support Windows, nor Ubuntu. Also, for Ubuntu it does not seem (easily) possible: https://askubuntu.com/questions/827701/how-can-i-tag-files-and-search-them-later-based-on-the-tag — a.t., Jul 09 '22 at 18:35
