I am using Java to read and process some datasets from the UCI Machine Learning Repository. I started out by writing a class for each dataset and working with that particular class file. Every attribute in the dataset was represented by a data member of the appropriate type in the class. This approach worked fine as long as the number of attributes stayed below 10-15. I just added or removed data members and changed their types to model new datasets. I also made the required changes to the functions.

The problem: I now have to work with much larger datasets. Ones with more than 20-30 attributes are very tedious to handle in this manner. I don't need to query the data. My data discretization algorithm just needs 4 scans of the data to discretize it, and my work ends right after the discretization. What would be an effective strategy here?

I hope I have been able to state my problem clearly.

The Mitra Boy
  • Some questions: 1) How do you plan on using the data? If you want to query it or do something similar, a database is probably your best bet. 2) How do you get the data from the repository? – javydreamercsw May 07 '12 at 16:53
  • What do you mean when you say large datasets? What exactly is the problem with the data? Could you provide an example? – Behe May 07 '12 at 18:28
  • I am testing out a new algorithm for data discretization. For that, I need to read the data and process it in Java. – The Mitra Boy May 14 '12 at 17:48

3 Answers

3

Some options:

  1. Write a code generator to read the meta-data of the file and generate the equivalent class file.
  2. Don't bother with classes; keep the data in arrays of Object or String and cast them as needed.
  3. Create a class that contains a collection of DataElements and subclass DataElements for all the types you need and use the meta-data to create the right class at runtime.
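As a rough illustration of option 1, a generator can be little more than string building. The sketch below is only one possible shape: `ClassGenerator`, `generate`, and the idea of passing attribute names and Java types in a map are assumptions made for this example, and reading the actual meta-data file is left out.

```java
import java.util.Map;

// Hypothetical helper: builds the source text of a plain data class.
final class ClassGenerator {

    // attributes maps each field name to its Java type, in column order
    // (use a LinkedHashMap so the order of the columns is preserved).
    static String generate(String className, Map<String, String> attributes) {
        StringBuilder src = new StringBuilder();
        src.append("public class ").append(className).append(" {\n");
        for (Map.Entry<String, String> attribute : attributes.entrySet()) {
            src.append("    private ")
               .append(attribute.getValue())   // field type, e.g. "double"
               .append(' ')
               .append(attribute.getKey())     // field name, e.g. "sepalLength"
               .append(";\n");
        }
        src.append("}\n");
        return src.toString();
    }
}
```

The returned string would then be written to a `.java` file and compiled like any hand-written class. Getters, setters, and parsing code can be emitted the same way.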
Pooven
dfb
  • Thanks. This opens up new avenues of learning for me. I have never written anything like the code generator you are talking about. Could you possibly provide some pointers to where I can start learning about it? – The Mitra Boy May 14 '12 at 17:53
  • In this case, you would simply write a program that outputs Java source files. There are lots of ways to do this, but you'd basically just be printing out the class skeleton and the member variables based on the metadata, just as you would if you were doing it manually. – dfb May 14 '12 at 20:29
1

Create a simple DataSet class that contains a member like the following:

 import java.io.File;
 import java.util.ArrayList;
 import java.util.List;

 public class DataSet {
     private List<Column> columns = new ArrayList<Column>();
     private List<Row> rows = new ArrayList<Row>();

     public void parse( File file ) {
         // routines to read CSV data into this class
     }
 }

 public class Row {
     private Object[] data;

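     public void parse( String row, List<Column> columns ) {
         // Split the raw CSV line into one string per column.
         String[] cols = row.split(",");
         data = new Object[cols.length];

         // Convert each field using the type declared for its column.
         int i = 0;
         for( Column column : columns ) {
             data[i] = column.convert(cols[i]);
             i++;
         }
     }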
 }

 public class Column {
     private String name;
     private int index;
     private DataType type;

     public Object convert( String data ) {
         if( type == DataType.NUMERIC ) {
            return Double.parseDouble( data );
         } else {
            return data;
         }
     }
 }

 public enum DataType {
     CATEGORICAL, NUMERIC
 }

That'll handle any data set you wish to use. The only catch is that the user must describe the dataset by telling the DataSet which columns it has and what their data types are. You can do that in code or read it from a file, whichever you think is easier. You might be able to default much of the configuration (say, to CATEGORICAL), or attempt to parse each field: if parsing fails the column must be CATEGORICAL, otherwise it is NUMERIC. Normally the file contains a header you can parse to find the column names; then you just need to work out each data type by looking at the data in that column. A simple algorithm to guess the data type goes a long way. This is essentially the same data structure that other packages use for data like this (e.g. R, Weka).
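The "try to parse, fall back to CATEGORICAL" idea could be sketched roughly as follows. This is only an illustration under some assumptions: `ColumnTypeGuesser` and `guess` are made-up names, the enum repeats the one above so the snippet compiles on its own, and treating `?` as a missing value is an assumption that should be checked against each dataset's description.

```java
import java.util.List;

// Same two types as the DataType enum above, repeated so this compiles alone.
enum DataType { CATEGORICAL, NUMERIC }

// Hypothetical helper that guesses a column's type from its raw values.
final class ColumnTypeGuesser {

    // Assumed missing-value marker; adjust per dataset.
    private static final String MISSING = "?";

    static DataType guess(List<String> values) {
        boolean sawNumber = false;
        for (String raw : values) {
            String value = raw.trim();
            if (value.isEmpty() || value.equals(MISSING)) {
                continue; // missing values say nothing about the type
            }
            try {
                Double.parseDouble(value);
                sawNumber = true;
            } catch (NumberFormatException e) {
                // One non-numeric value is enough to call the column categorical.
                return DataType.CATEGORICAL;
            }
        }
        // A column with no usable values defaults to CATEGORICAL.
        return sawNumber ? DataType.NUMERIC : DataType.CATEGORICAL;
    }
}
```

Note that `Double.parseDouble` also accepts strings such as `NaN` or `1d`, and that integer class codes (0/1/2) will be reported as NUMERIC, so letting the user override the guess per column is still worthwhile.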

chubbsondubs
  • Thanks a lot. This is the closest to the implementation I was thinking about. It seems not all files from the UCI Repo contain the information in the header. I'm feeding my discretized datasets to Weka. This is a great help! – The Mitra Boy May 14 '12 at 17:58
  • Not all data sets in the UCI Repo have a header, but that can be a configurable parameter you give the parser: whether a header is present just tells the parser to look for it or not. In the end a header is simply a set of user-friendly labels your user can use to refer to columns and configure the dataset. If it's there, parse the human-friendly labels. If not, F1, F2, F3, etc. could be used. Your user will have to provide information such as which column is the prediction, and possibly the data types (string, float), anyway. – chubbsondubs May 14 '12 at 19:53
  • Thanks. just some minor corrections to the above code 'public void parse( String row, List columns ) { String[] cols = row.split(","); data = new Object[cols.length]; int i = 0; for( Column col : columns ) { data[i] = col.convert(cols[i]); i++; } }' – The Mitra Boy May 17 '12 at 19:57
  • Sorry renamed cols -> row for more clarity – chubbsondubs May 17 '12 at 20:11
0

I did something like that in one of my projects: lots of variable data, which in my case I obtained from the Internet. Since I needed to query, sort, etc., I spent some time designing a database to accommodate all the variations of the data (not all entries had the same number of properties). It did take a while, but in the end I used the same code to get the data for any entry (using JPA in my case). My IDE (NetBeans) generated most of the code straight from the database schema.

From your question, it is not clear how you plan to use the data, so I'm answering based on personal experience.

Pooven
javydreamercsw