
I'm trying to work with a data set that has no header and uses :: as the field delimiter:

! wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip
! unzip ml-1m.zip
! mv ml-1m/ratings.dat .
! head ratings.dat

The output:

1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968

I have loaded the file into my DSX pipeline, but I am unclear how to get DSX to split this file using the :: delimiters.

  • How do I do this?

  • If it is not possible to get DSX to reshape this file using the DSX ML pipeline functionality, does DSX have any prerequisites in terms of input file format?

Update:

The ML pipeline functionality I'm trying to use can be seen in the screenshot below:

[screenshot: ML pipeline UI]

I have added a data set, but can't figure out how to get DSX to recognise the field delimiters:

[screenshot: data set added to the pipeline]

  • DSX provides a bunch of APIs. Could you be a bit more specific about which one you'd like to use for processing the file? I assume you're using notebooks here, not R Studio. But do you want to work with Python, R, or Scala? Would it be acceptable to read the file into memory using a Python lib or Scala function, and feed it from there into the ML pipeline? Would it be acceptable, as a last resort, to convert the file format with some bash commands from a Python notebook, and then process the converted file? – Roland Weber Feb 21 '17 at 09:34
  • I've updated the question with more info. I was expecting to see functionality in the pipeline ui to help with this preprocessing, maybe that is a misunderstanding on my part. I was also working under the assumption that you work with notebooks *or* pipelines, but not both. If i need to work in the notebook as well as the pipeline, I would probably do everything in the notebook? If we need to process data before uploading it to the pipeline, what format should we convert it to? – Chris Snow Feb 21 '17 at 09:49

1 Answer


As of Feb-2017...

When you create a new pipeline and select a dataset, I believe DSX loads the file you select using a Spark DataFrameReader. The DataFrameReader defaults to a single comma (,) as the delimiter, and DSX does not provide a way to change that default in the UI.

I think preprocessing the data is your best option, and you can do this in a notebook. Be aware that the Spark DataFrameReader only supports a single-character delimiter, so you can't use it with this particular dataset. You can use pandas, however.

import pandas as pd

# '::' is a multi-character delimiter, which requires the slower
# 'python' parsing engine in pandas.
pdf = pd.read_csv('ml-1m/ratings.dat', sep='::',
                  header=None,
                  names=['UserID', 'MovieID', 'Rating', 'Timestamp'],
                  engine='python')

# Write out a standard comma-separated file with a header row.
pdf.to_csv('ratings.csv', index=False)

!head ratings.csv
UserID,MovieID,Rating,Timestamp
1,1193,5,978300760
1,661,3,978302109
1,914,3,978301968
1,3408,4,978300275
1,2355,5,978824291
1,1197,3,978302268
1,1287,5,978302039
1,2804,5,978300719
1,594,4,978302268

Now the data will be in a format that DSX will be able to parse properly.
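If pandas is not available in your notebook, the same conversion can be done with the standard library alone. This is a minimal sketch (the sample lines and column names are taken from the question; it assumes :: never appears inside a field value):

```python
import io

# Sample lines in the ml-1m ratings format (UserID::MovieID::Rating::Timestamp)
raw = """1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
"""

header = "UserID,MovieID,Rating,Timestamp"
converted = [header]
for line in io.StringIO(raw):
    # Split on the two-character delimiter and re-join with commas
    converted.append(",".join(line.strip().split("::")))

print("\n".join(converted))
```

In a real notebook you would iterate over `open('ml-1m/ratings.dat')` instead of the inline sample and write the result to `ratings.csv`.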

jtyberg