Questions tagged [cascalog]

Cascalog is a fully-featured data processing and querying library for Clojure. The main use cases for Cascalog are processing "Big Data" on top of Hadoop or doing analysis on your local computer from the Clojure REPL. Cascalog is a replacement for tools like Pig, Hive, and Cascading.

Cascalog operates at a significantly higher level of abstraction than a tool like SQL. More importantly, its tight integration with Clojure gives you the power to use abstraction and composition techniques with your data processing code just like you would with any other code. It's this latter point that sets Cascalog far above any other tool in terms of expressive power.

Cascalog is easy to install, with a five-minute setup.

Cascalog is hosted on GitHub.

Source

26 questions
1
vote
0 answers

How much do Cascalog Traps thread into other functions?

I was wondering how far down cascalog traps boiled down using the following example. (defn -main "Good ole main boilerplate" [people-path trap-path output-path] (?- (hfs-textline output-path) (people-query (hfs-textline trap-path) …
mcgeep
  • 53
  • 6
1
vote
1 answer

reading XML with Cascalog/Cascading

There is some info on the web indicating that Mahout's XMLInputFormat can be used to efficiently process XML on hadoop, but I've been unable to find an example of how to get this working. Can someone point me in the right direction? I'm using…
Kevin
  • 24,871
  • 19
  • 102
  • 158
1
vote
1 answer

Clojure + Lemur

I am trying to run a multi-step job using lemur+clojure. I have an issue with passing multiple inputs as arguments to clojure+lemur. As the first step of my job I am trying to run an EMR streaming job: lemur run ${CONF_DIR}/run-pipeline.clj…
PulsAm
  • 71
  • 5
1
vote
1 answer

Cascalog first-n - unable to join predicates

I'm working through the following example in a lein repl in a clone from the cascalog project. I've run: (def src [[1] [3] [2]]) (def queryx (<- [?x ?y] (src ?x) (inc ?x :> ?y))) (?<- (stdout) [?x ?y] (queryx ?x ?y)) -- works (?- (stdout)…
hawkeye
  • 34,745
  • 30
  • 150
  • 304
0
votes
0 answers

Is it possible (and if so, how) to kill a running cascalog or cascading job?

Title should be pretty self-explanatory. I'm specifically interested in Cascalog, but I might accept an answer tuned more broadly to Cascading if it seems clear how that might apply towards Cascalog. Occasionally, I'll create a Cascalog query that…
metasoarous
  • 2,854
  • 1
  • 23
  • 24
0
votes
1 answer

Cascalog process multi-line json?

I have a directory of JSON files that I want to process using Cascalog. The solution I have right now requires me to remove all newline characters from my JSON files using a bash script. I am looking for a better solution because I sync these files…
john
  • 709
  • 3
  • 13
  • 25
0
votes
0 answers

Clojure failing to compile jackknife

As a clojure noob, I am trying to use cascalog to parse a large CSV file. Here is my minimal project.clj: (defproject org.example/sample "1.0.0-SNAPSHOT" :description "extract fields from a certain csv file." :dependencies [ [cascalog…
user1158559
  • 1,954
  • 1
  • 18
  • 23
0
votes
1 answer

Cascalog: start uberjar and main on hadoop

I have compiled an uberjar from a file like: (defmain HadoopTest (:use 'cascalog.api) (defn bla ("alot of code")) I run that uberjar on Hadoop like: $ hadoop jar myStandalone.jar clojure.main and I get a REPL, but nothing from that file is…
0
votes
1 answer

How to disable the echo for cascalog queries

This is a how-to question. When I execute simple queries in the cascalog.playground area, there is too much information. How can I display only the results to (stdout)? What setting(s) do I need to update/change/add? Thank you!
dag
  • 288
  • 2
  • 10
0
votes
2 answers

Can Cascalog link to external Hadoop Cluster?

I am using Cascalog in Eclipse. It looks like the dependency on Hadoop is provided in the project's project.clj file, like below: :profiles { :dev {:dependencies [[org.apache.hadoop/hadoop-core "1.1.2"]]}} If I have to include a dependency on locally…
Sindhu
  • 11
  • 1
0
votes
1 answer

How to merge the small files on S3 generated by EMR with thousands of reducers

My Cascalog EMR job generated thousands of small files in S3 buckets. It generates the same number of files as the number of reducers I used. Dumping all these tiny files takes minutes. I wonder if there is a way to concatenate them on S3 so that I can…
rninja
  • 540
  • 1
  • 4
  • 12