I'm currently working on a project that uses a CsvIterator from the MALLET API to create an InstanceList. However, I'm not sure how the data field of a MALLET Instance object is supposed to be formatted. I'm trying to write the data parsed from a line of text to a file.

I understand that the data field is typically a FeatureVector object once the instance is in an InstanceList, but I'm not sure what format the CsvIterator expects.

Thanks.

1 Answer

For classification or topic modeling, the "data" field in the input file should be the original document on a single line, with newline characters replaced by spaces.
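As a rough illustration of that layout, here is a minimal sketch that uses only java.util.regex, not MALLET itself. It assumes the common one-instance-per-line layout of name, label, then document text, and the regular expression is the one often passed to CsvIterator in MALLET examples (with groups 3, 2 and 1 as data, label and name). The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InstanceLineDemo {
    // Assumed line layout: <name> <label> <document text on one line>.
    // Group 1 = name, group 2 = label, group 3 = data.
    static final Pattern LINE =
        Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$");

    /** Replaces each line break (and surrounding whitespace) with one space. */
    static String flatten(String document) {
        return document.replaceAll("\\s*\\R\\s*", " ").trim();
    }

    /** Returns {name, label, data}, or null if the line does not match. */
    static String[] split(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String document = "The quick brown fox\njumps over the lazy dog.";

        // What one line of the input file would look like.
        String line = "doc1 animals " + flatten(document);
        System.out.println(line);

        // What a regex-based reader would extract from that line.
        String[] fields = split(line);
        System.out.println("name  = " + fields[0]);
        System.out.println("label = " + fields[1]);
        System.out.println("data  = " + fields[2]);
    }
}
```

The point is that the data group is plain text at this stage. Nothing in the file needs to look like a FeatureVector.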

How MALLET interprets the "data" field is determined by the pipes you use. Pipe classes define the rules that convert the string input into a FeatureVector.

The default behavior implemented in the Csv2Vectors class, for example, divides the string into tokens based on a regular expression, and then converts each token string into a feature from a data alphabet. There are pipe objects for many common transformations such as lower-casing and stopword removal.
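To make those steps concrete, here is a plain-Java sketch of the same idea: tokenize with a regular expression, lower-case, drop stopwords, then look each token up in a data alphabet and count it. This is not MALLET code. The token regex, the tiny stopword list and all names are assumptions chosen for illustration. In MALLET itself the corresponding steps are handled by pipes such as CharSequence2TokenSequence, TokenSequenceLowercase, TokenSequenceRemoveStopwords and TokenSequence2FeatureSequence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PipeSketch {
    // Assumed token pattern: runs of letters. MALLET's default may differ.
    static final Pattern TOKEN = Pattern.compile("\\p{L}+");

    // Hypothetical stopword list, far shorter than a real one.
    static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("the", "a", "an", "and", "of"));

    /** Step 1: split the raw data string into token strings. */
    static List<String> tokenize(String data) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(data);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /**
     * Steps 2-4: lower-case, remove stopwords, then map each token to an
     * index in the alphabet (growing it as needed) and count occurrences.
     * Returns featureIndex -> count.
     */
    static Map<Integer, Integer> featureVector(String data,
                                               Map<String, Integer> alphabet) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String token : tokenize(data)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (STOPWORDS.contains(lower)) {
                continue;
            }
            Integer index = alphabet.get(lower);
            if (index == null) {
                index = alphabet.size();
                alphabet.put(lower, index);
            }
            counts.merge(index, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> alphabet = new LinkedHashMap<>();
        System.out.println(featureVector("The cat and the other cat", alphabet));
        System.out.println(alphabet);
    }
}
```

The alphabet is passed in rather than created inside the method because every instance in a collection has to share one alphabet, otherwise the same index would mean different words in different documents.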

David Mimno