
I am working with some legacy systems in the investment banking domain which are very unfriendly, in the sense that the only way to extract data from them is through a file export/import. Lots of trading takes place, and a large number of transactions are stored on these systems. The question is how to read a large number of large files on NFS and dump them onto a system where analytics can be done by something like Spark or Samza.

Back to the issue. Due to the nature of these legacy systems, we are extracting data and dumping it into files. Each file is hundreds of gigabytes in size.
I feel the next step is to read these files and dump them to Kafka or HDFS, or maybe even Cassandra or HBase, because I need to run some financial analytics on this data. I have two questions:

  • How to efficiently read a large number of large files that are located on one or several machines
Apurva Singh
  • With mainframes there are very limited options to read and dump data to Hadoop. Apache Sqoop has `import-from-mainframe`, which spawns multiple SQL-type queries, but it has its own limitations (EBCDIC-to-ASCII conversion issues and no support for packed decimals); you can explore JDBC connector options too, or there is specialty software like IBM's System z Connector for Hadoop or Syncsort's DMX-h (see the sketch after this comment thread). – Pushkr Apr 29 '17 at 01:46
  • Spark can't directly read off a mainframe, by the way. – Pushkr Apr 29 '17 at 01:46
  • @Pushkr, like I said, we dump files from mainframes to NFS! The question is how to read a large number of large files, basically. – Apurva Singh Apr 29 '17 at 01:49
  • OK. There was no mention of NFS in your post, so I assumed you wanted to read large files off the mainframe first. – Pushkr Apr 29 '17 at 01:52
  • Bill, the question is how to read a large number of large files in multiple folders on NFS. – Apurva Singh Apr 29 '17 at 15:41
  • You appear to have painted yourself into a corner by making decisions about your implementation before evaluating available options. As @jedijs points out, [zconnector](https://www.google.com/search?&rls=en&q=zconnector&ie=UTF-8&oe=UTF-8) is available, as is Spark on the z System platform (pointed out by others), as are various roll-your-own event-driven solutions. Alienating some of the target audience for your question is counterproductive. Sometimes the correct answer is "don't do it that way." – cschneid Apr 29 '17 at 16:45
  • cschneid, this is not z/OS!! Only files are possible. Why are you assuming it is IBM? Maybe I should remove the mainframe reference. – Apurva Singh Apr 30 '17 at 00:32
  • You were the one who tagged the question with "mainframe" and went on about that being the source of your files. Of course that led us to believe the files resided on a z/OS machine to which you had an NFS mount. – cschneid Apr 30 '17 at 03:19
  • cschneid, weird: there are all kinds of mainframes. Where did z/OS come from? Do you know that all these mainframes are like chalk and cheese? How can you assume it is z/OS? – Apurva Singh Apr 30 '17 at 04:12
  • @ApurvaSingh you read the tag description. – Bill Woodger Apr 30 '17 at 07:34
  • Bill Woodger, the tags look good: apache-spark, apache-kafka, bigdata, data-migration. Spark and Kafka are there because I think the solution could lie somewhere there. – Apurva Singh May 01 '17 at 16:02
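For reference, here is a minimal sketch of the Sqoop `import-mainframe` tool mentioned above, driven from Python. The host name, dataset, credentials and target directory are all placeholders, and the caveats Pushkr raises (EBCDIC-to-ASCII conversion, no packed-decimal support) still apply:

```python
# Hedged sketch of a Sqoop import-mainframe run, driven from Python.
# Host, dataset, credentials and target directory are placeholders.
import subprocess

subprocess.run(
    [
        "sqoop", "import-mainframe",
        "--connect", "mainframe.example.com",   # hypothetical mainframe host
        "--dataset", "TRADES.EXPORT",           # hypothetical partitioned data set to pull
        "--username", "sqoopuser",
        "--password", "changeit",               # placeholder; prefer --password-file in practice
        "--target-dir", "/data/raw/trades",     # HDFS directory for the imported records
    ],
    check=True,
)
```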

3 Answers


Apparently you've discovered already that mainframes are good at writing large numbers of large files. They're good at reading them too. But that aside...

IBM has been pushing hard on Spark on z/OS recently. It's available for free, although if you want support, you have to pay for that. See https://www-03.ibm.com/systems/z/os/zos/apache-spark.html. My understanding is that z/OS can be a peer with other machines in a Spark cluster.

The z/OS Spark implementation comes with a piece that can read data directly from all sorts of mainframe sources: sequential, VSAM, DB2, etc. It might allow you to bypass the whole dump process and read the data directly from the source.
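As an illustration of that direct-read idea (not the z/OS data service component itself, whose driver details I won't guess at), here is a hedged PySpark sketch that pulls a DB2 for z/OS table over plain JDBC; the host, port, table, credentials and partitioning bounds are placeholders:

```python
# Hedged sketch: read a DB2 for z/OS table straight into Spark over JDBC,
# bypassing the file dump. Host, table, credentials and bounds are placeholders;
# the IBM JDBC driver jar (db2jcc4.jar) must be on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("db2-direct-read").getOrCreate()

trades = (
    spark.read.format("jdbc")
    .option("url", "jdbc:db2://zhost.example.com:446/TRADEDB")
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .option("dbtable", "TRADING.TRANSACTIONS")
    .option("user", "analyst")
    .option("password", "secret")
    .option("fetchsize", "10000")           # pull rows in larger batches
    .option("partitionColumn", "TRADE_ID")  # numeric key to split parallel reads on
    .option("numPartitions", "8")
    .option("lowerBound", "0")
    .option("upperBound", "100000000")
    .load()
)

# Land the data where the analytics will run.
trades.write.mode("overwrite").parquet("hdfs:///data/trades/")
```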

Apparently Hadoop is written in Java, so one would expect that it should be able to run on z/OS with little problem. However, watch out for ASCII vs. EBCDIC issues.

On the topic of using Hadoop with z/OS, there are a number of references out there, including a Redpaper: http://www.redbooks.ibm.com/redpapers/pdfs/redp5142.pdf

You'll note that in there they make mention of using the Co:Z toolkit, which I believe is available for free.

However, you mention "unfriendly". I'm not sure if that means "I don't understand this environment as it doesn't look like anything I've used before" or it means "the people I'm working with don't want to help me". I'll assume something like the latter, since the former is simply a learning opportunity. Unfortunately, you're probably going to have a tough time getting the unfriendly people to get anything new up and running on z/OS.

But in the end, it may be best to try to make friends with those unfriendly z/OS admins as they likely can make your life easier.

Finally, I'm not sure what analytics you're planning on doing with the data. But in some cases it may be easier/better to move the analytics process to the data instead of moving the data to the analytics.

randomScott

The simplest way to do it is zconnector, an IBM product for data ingestion from the mainframe into a Hadoop cluster.

jedijs

I managed to find an answer. The biggest bottleneck is that reading a file is essentially a serial operation; that is the most efficient way to read from a disk. So for one file I am stuck with a single thread reading it from NFS and sending it to HDFS or Kafka via their APIs.
So it appears the best approach is to make sure that the source dumps files into multiple NFS folders. From that point onward I can run multiple processes to load the data into HDFS or Kafka, since those systems are highly parallelized.
How to load? One good way is to mount the NFS into the Hadoop infrastructure and use distcp. Other possibilities also open up once the files are spread across a large number of NFS folders. Otherwise, remember: reading a file is a serial operation. Thanks.
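To make the one-reader-per-folder point concrete, here is a minimal sketch assuming the NFS exports are already mounted locally and the `hdfs` CLI is on the path; the folder and target paths are hypothetical:

```python
# Minimal sketch: one uploader process per NFS folder, each streaming its files
# into HDFS with the standard `hdfs dfs -put` CLI. Paths are hypothetical and
# assume the NFS exports are already mounted on this box.
import subprocess
from multiprocessing import Pool
from pathlib import Path

NFS_FOLDERS = ["/mnt/nfs/dump01", "/mnt/nfs/dump02", "/mnt/nfs/dump03"]
HDFS_TARGET = "/data/raw/legacy"

def upload_folder(folder: str) -> None:
    # Reading within one folder stays serial; the parallelism comes from
    # running one uploader process per folder.
    for f in sorted(Path(folder).glob("*.dat")):
        subprocess.run(
            ["hdfs", "dfs", "-put", "-f", str(f), f"{HDFS_TARGET}/{f.name}"],
            check=True,
        )

if __name__ == "__main__":
    with Pool(len(NFS_FOLDERS)) as pool:
        pool.map(upload_folder, NFS_FOLDERS)
```

The distcp route works along the same lines, e.g. `hadoop distcp file:///mnt/nfs/dump01 hdfs:///data/raw/legacy`, provided the mounted path is visible from the cluster's worker nodes so the copy mappers can read it.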

Apurva Singh
  • You can FTP the file to your local Unix box and then convert it to ASCII using a Python codec; you also need to break the file into lines according to the LRECL of the source mainframe file. Then you can parse this file, delimit it at specific positions, and move it from local Unix to HDFS. All of this you can do with Python. In case you have COMP fields in the input file, use SORT to convert them to numeric at the source itself. – vikrant rana Oct 31 '18 at 17:53
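A minimal sketch of that conversion step, assuming fixed-length records, an LRECL of 200 and the cp037 (US EBCDIC) code page; the real code page and record layout may differ, and packed-decimal (COMP-3) fields cannot be recovered by a plain codec decode, which is why the comment suggests converting them with SORT at the source:

```python
# Hedged sketch: turn a fixed-length EBCDIC dump into newline-delimited ASCII text.
# LRECL, code page and file names are assumptions; COMP-3/packed fields must
# already have been converted to display numerics on the mainframe side.
import codecs

LRECL = 200            # record length of the source file (assumption)
CODEPAGE = "cp037"     # US EBCDIC; adjust to the actual code page of the dump

with open("trades.ebcdic", "rb") as src, \
     open("trades.ascii.txt", "w", encoding="utf-8") as dst:
    while True:
        record = src.read(LRECL)
        if not record:
            break
        # Decode one fixed-length EBCDIC record and emit it as a text line.
        dst.write(codecs.decode(record, CODEPAGE).rstrip() + "\n")
```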