I'm new to programming, but I know the basics of Python 3 and have TreeTagger installed, and I can POS-tag files from my command shell.

However, I have 427 files in a folder that I want to open and run through TreeTagger automatically, and I can't quite figure out how to make this happen. My current code is as follows:

import os
import sys
import subprocess
import re

rootdir = r"/Spanish_(ACTIV-es)_corpus/plain"

I want to automatically go through a folder with over 427 files and have the appropriate files POS-tagged.

I think this requires some combination of the code below and subprocess calls to get TreeTagger to do its work. I do not understand how to use subprocess at all, but I tried to implement it based on feedback from another question here on Stack Overflow.

sample code

How do I get the movie_pos to be the file that treetagger will POS tag as it walks through the files in the folder?

Then there is the output. Do I need to create 427 separate output files in advance, or is there a way to automatically name each output file after a modified version of the input title, so the outputs don't get confused? (The titles of the files are where the metadata is currently stored.)

gmuraleekrishna

1 Answer

Is using Python a strict requirement for tagging the files? If not, you can easily do it with just the shell: loop over the files in your folder, run TreeTagger on each, and save the output (as you correctly assume is possible) to a file with a different name.

As an example, here's a directory with 3 files:

$ ls mydir/
1.txt 2.txt 3.txt

Each contains some Spanish text:

$ cat mydir/1.txt
Esto es una prueba.

You can then use

  1. the shell's find command to list all the files you care about (e.g. all the files that end in ".txt")

    find mydir/ -name "*.txt"

  2. the for command to loop over find's results (using backquotes ` `), and run TreeTagger over each

    $ for i in `find ....`; do tag_command_using_$i; done

(the variable $i holding the path to each file)

  3. the shell's redirect feature (>) to send TreeTagger's output (which you'd normally see on the screen) to a file that you can name appropriately, using the name of your original file

    tag_command $i > $i.tagged

In one line, it looks like this:

$ for i in `find mydir/ -name "*.txt"`; do cat "$i" | cmd/tree-tagger-spanish > "$i.tagged"; done
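
Since the question asks about Python, the same loop could be sketched with subprocess. This is only a minimal sketch under some assumptions: the function name tag_folder and its parameters are made up for illustration, and cmd/tree-tagger-spanish is simply the command used above, so replace it with whatever command works in your own shell.

```python
import subprocess
from pathlib import Path


def tag_folder(folder, tagger_cmd, pattern="*.txt", suffix=".tagged"):
    """Run tagger_cmd on every file matching pattern under folder.

    tagger_cmd is a list such as ["cmd/tree-tagger-spanish"] (an assumption:
    use the command that works in your shell). Each file is fed to the
    command on stdin and the command's stdout is saved next to the input,
    e.g. 1.txt -> 1.txt.tagged.
    """
    written = []
    # rglob also descends into sub-folders, like `find`; sorted() builds the
    # full list first, so files created during the loop are not revisited.
    for src in sorted(Path(folder).rglob(pattern)):
        if not src.is_file():
            continue
        dst = src.with_name(src.name + suffix)
        # Equivalent of: cat "$i" | tagger > "$i.tagged"
        with src.open("rb") as fin, dst.open("wb") as fout:
            subprocess.run(tagger_cmd, stdin=fin, stdout=fout, check=True)
        written.append(dst)
    return written
```

Calling tag_folder("mydir", ["cmd/tree-tagger-spanish"]) should then behave like the shell one-liner. Passing check=True makes Python raise an error if the tagger exits with a failure, rather than silently leaving an empty output file.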

After it's finished you will have the newly created files in the same folder:

$ ls mydir/
1.txt        1.txt.tagged 2.txt        2.txt.tagged 3.txt        3.txt.tagged

$ cat mydir/1.txt.tagged
Esto es  ADV  esto~es
una      ART  un
prueba   NC   prueba
.        FS   .
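
On the question of output names: nothing has to be created in advance, since each output file is created the moment it is first written. If you would rather keep the tagged files apart from the originals, the output name can still be derived from the input name, so the metadata in the title survives. A minimal sketch in Python, where tagged_path and the output folder are hypothetical choices and not anything TreeTagger requires:

```python
from pathlib import Path


def tagged_path(src, out_dir, suffix=".tagged"):
    """Return the output path for src inside out_dir, keeping the input name.

    Example: plain/1.txt -> tagged/1.txt.tagged, so the metadata stored in
    the file name is still visible in the output name.
    """
    out_dir = Path(out_dir)
    # Create the output folder on first use; no files need to exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / (Path(src).name + suffix)
```

Note that this puts every output into one flat folder, so two inputs with the same file name in different sub-folders would overwrite each other.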
Alex Constantin