Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volumes. Questions tend to relate to infrastructure, algorithms, statistics, and data structures.

Big data is not only about huge volume; it is also characterized by velocity, veracity, and variety.

Several features distinguish big data from conventional data processing:

Data

  • The data set is too large to be processed on a single computer.
  • Relationships between data elements are extremely complex.

Algorithms

  • Single-machine algorithms with worse-than-O(N) running time would likely take years to finish.
  • Fast distributed algorithms are used instead.

Storage

  • The underlying storage must be fault-tolerant and keep the data in a consistent state regardless of device failures.
  • No single storage device can hold the entire data set.

Eco-system

  • Big data also refers to the set of tools used to process huge amounts of data, known as the big data ecosystem. Popular tools include HDFS, Spark, and MapReduce.
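The division of work these tools implement can be sketched at toy scale as the classic MapReduce word count; here the "cluster" is a single process, purely for illustration:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # map: emit a (word, 1) pair for every word in a line
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # shuffle: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts for each word
    return {word: sum(values) for word, values in groups.items()}

lines = ["big data tools", "big data ecosystem"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(l) for l in lines)))
print(counts["big"], counts["ecosystem"])  # → 2 1
```

In a real cluster, the map and reduce phases run on many machines and the shuffle moves data over the network; the structure of the computation is the same.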
7919 questions
2 votes, 1 answer

Correct way of writing two floats into a regular txt

I am running a big job in cluster mode. However, I am only interested in two float numbers, which I want to read somehow when the job succeeds. Here is what I am trying: from pyspark.context import SparkContext if __name__ == "__main__": sc =…
gsamaras (71,951)
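The answers aren't shown in this listing, but the non-Spark part of the problem is just plain file I/O on the driver: bring the two values to the driver first (e.g. with rdd.collect() or an accumulator), then write them. A minimal sketch of the write/read round trip (file name hypothetical):

```python
import os
import tempfile

# Stand-ins for the two floats the cluster job would produce
result_a, result_b = 3.14, 2.72

path = os.path.join(tempfile.gettempdir(), "two_floats.txt")

# Write both floats to a regular text file on one line
with open(path, "w") as f:
    f.write("%r %r\n" % (result_a, result_b))

# Read them back
with open(path) as f:
    a, b = map(float, f.read().split())
print(a, b)  # → 3.14 2.72
```

In cluster mode the driver may run on a remote node, so the file should go somewhere accessible afterwards (e.g. HDFS or a shared mount) rather than a node-local temp directory.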
2 votes, 1 answer

spark unix_timestamp data type mismatch

Could someone help me figure out what data type or format I need to pass for Spark's from_unixtime() function to work? When I try the following it works, but responds not with…
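The thread's answers aren't shown here, but the usual cause of a from_unixtime() type mismatch is passing milliseconds or an already-formatted string: the function expects seconds since the Unix epoch as a numeric (bigint) value. The same conversion in plain Python shows the expected input shape:

```python
from datetime import datetime, timezone

# from_unixtime() wants whole seconds since the epoch, not milliseconds
seconds = 1470009600  # 2016-08-01 00:00:00 UTC
formatted = datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
print(formatted)  # → 2016-08-01 00:00:00
```

If the source column holds milliseconds, divide by 1000 (and cast to bigint) before handing it to from_unixtime().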
2 votes, 3 answers

Create Partition table in Big Query

Can anyone please suggest how to create a partitioned table in BigQuery? Example: suppose I have log data in Google Storage for the year 2016, stored in one bucket partitioned by year, month, and date. Here I want to create a table…
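The accepted answer isn't shown in this listing, but with BigQuery's current standard-SQL DDL a date-partitioned table can be declared directly; dataset, table, and column names below are hypothetical:

```sql
-- Partition the table by the calendar date of a TIMESTAMP column
CREATE TABLE mydataset.logs_2016 (
  event_time TIMESTAMP,
  message    STRING
)
PARTITION BY DATE(event_time);
```

Loads from Google Storage then land in the partition matching each row's event_time, so a separate table per day is not needed.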
2 votes, 0 answers

How to compute this huge Correlation Matrix?

I have a huge matrix with nrow=144 and ncol=156267 containing numbers, and I would like to compute the correlation between all the columns. This can be done using the bigcor function described here:…
NKGon (55)
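No answers are shown here, but the blockwise idea behind bigcor is straightforward: standardize the columns once, then fill the ncol-by-ncol result one block pair at a time so only a small slab is in memory at once. A sketch in Python with NumPy (sizes are tiny stand-ins for 144 x 156267; block size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(144, 12))          # toy stand-in for the real matrix

# Standardize columns once: zero mean, unit (sample) standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
n, p = Z.shape

block = 5
corr = np.empty((p, p))
for i in range(0, p, block):
    for j in range(0, p, block):
        # Pearson correlation of standardized columns is Z_i^T Z_j / (n-1)
        corr[i:i+block, j:j+block] = Z[:, i:i+block].T @ Z[:, j:j+block] / (n - 1)

ok = np.allclose(corr, np.corrcoef(X, rowvar=False))
print(ok)  # → True
```

For 156267 columns the result is about 24 billion entries, so the output itself (~195 GB in float64) needs to live on disk, e.g. in a memory-mapped array, which is exactly what bigcor's ff backend provides.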
2 votes, 2 answers

Spark starting more executors than specified

I'm running Spark 1.5.1 in standalone (client) mode using PySpark. I'm trying to start a job that seems to be memory-heavy (on the Python side, that is, so it should not be covered by the executor-memory setting). I'm testing on a machine with 96 cores and 128…
2 votes, 1 answer

What is the difference between FAILED AND ERROR in spark application states

I am trying to create a state diagram of a submitted Spark application, and I am kind of lost on when an application is considered FAILED. States are from here:…
Aravind Yarram (78,777)
2 votes, 0 answers

Architecture: How to use Spark ML predictions as HTTP service

I have a Spark Streaming application which trains a model and periodically stores it to HDFS. In an HTTP-based web service, I would like to POST some values and retrieve a prediction for them. The service should also reload the model on demand…
marquies (1,066)
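This question has no answers in the listing, but one common shape for such an architecture is a small HTTP front end that holds the latest model in memory and serves predictions, reloading the model file when asked. A stdlib-only sketch, with a dummy function standing in for the Spark ML model (endpoint, payload format, and model are all hypothetical):

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Dummy stand-in for model.predict() on a model loaded from HDFS
    return sum(features)

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON feature vector from the request body
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet

server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Exercise the service once with a POST request
url = "http://127.0.0.1:%d/" % server.server_port
req = urllib.request.Request(url, data=json.dumps([1.0, 2.0, 3.5]).encode())
resp = json.loads(urllib.request.urlopen(req).read())
print(resp["prediction"])  # → 6.5
server.shutdown()
```

Reload-on-demand could be a second endpoint that swaps the in-memory model atomically; in production this is usually a proper web framework or model server rather than http.server.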
2 votes, 3 answers

NameError: global name 'NoneType' is not defined in Spark

I have written a UDF to replace a few specific date values in a column named "latest_travel_date" with 'NA'. However, this column also contains many null values, so I have handled these in the UDF as well (please see below). Query: def…
Preyas (773)
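The answers aren't reproduced here, but the NameError itself has a simple cause: NoneType is not a builtin name in Python (it is only reachable as type(None)), so referencing it bare inside a UDF fails at call time. Comparing with `is None` avoids the name entirely; the sentinel dates below are hypothetical:

```python
def clean_date(value):
    # Correct: compare to None directly (not: type(value) == NoneType)
    if value is None:
        return "NA"
    # Replace a few specific sentinel dates (hypothetical values)
    if value in ("1900-01-01", "9999-12-31"):
        return "NA"
    return value

print(clean_date(None), clean_date("2016-08-01"))  # → NA 2016-08-01
```

The same function body works unchanged when wrapped as a Spark UDF, since null column values arrive in Python as None.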
2 votes, 1 answer

Query on the last element of an array in MongoDB when the array size is stored in a variable

I have a dataset in MongoDB and this is an example of a line of my data: { "conversionDate": "2016-08-01", "timeLagInDaysHistogram": 0, "pathLengthInInteractionsHistogram": 4, "campaignPath": [ {"campaignName": "name1", "source": "sr1",…
Martin Mas (23)
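The accepted answer isn't shown, but since MongoDB 3.2 the aggregation operator $arrayElemAt accepts a negative index, so the stored array size is not needed to reach the last element. A sketch of the pipeline as it would be passed to pymongo's collection.aggregate() (the campaignPath field is from the sample document; the projected name is hypothetical):

```python
# Project the last element of campaignPath without knowing its length
pipeline = [
    {"$project": {
        "conversionDate": 1,
        "lastCampaign": {"$arrayElemAt": ["$campaignPath", -1]},
    }}
]
print(pipeline[0]["$project"]["lastCampaign"])
```

Index -1 is resolved per document, so arrays of different lengths all yield their own final element.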
2 votes, 1 answer

Spark::KMeans calls takeSample() twice?

I have a lot of data and have experimented with partitions of cardinality [20k, 200k+]. I call it like this: from pyspark.mllib.clustering import KMeans, KMeansModel C0 = KMeans.train(first, 8192, initializationMode='random', maxIterations=10,…
gsamaras (71,951)
2 votes, 2 answers

Skip hyphen in hive

I have executed a query in the Hive CLI that should generate an external table. "create EXTERNAL TABLE IF NOT EXISTS hassan( code int, area_name string, male_60_64 STRUCT, male_above_65 STRUCT) ROW FORMAT DELIMITED FIELDS TERMINATED BY…
Hassan (43)
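The thread's answers aren't shown, but a bare STRUCT in Hive DDL is invalid: each STRUCT column must spell out its field names and types, and the struct's inner delimiter must be declared. A sketch of working DDL, where the struct field names, delimiters, and location are all hypothetical since the original statement is truncated:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS hassan (
  code int,
  area_name string,
  male_60_64 STRUCT<cnt:INT, ratio:DOUBLE>,
  male_above_65 STRUCT<cnt:INT, ratio:DOUBLE>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','            -- top-level column separator
  COLLECTION ITEMS TERMINATED BY ':'  -- separator inside each struct
LOCATION '/user/hassan/data';
```

Struct fields are then addressed with dot notation, e.g. male_60_64.cnt.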
2 votes, 1 answer

Scala - How to Return this Kind of RDD type

I am trying to make a method that returns an RDD, referring to this, but it failed because the return type needs type parameters. According to the API (Java), here is my code: def HBaseToRDD(_HBaseConfiguration:HBaseConfiguration, _sc:SparkContext) : RDD[(K, V)] = { val HBaseRDD =…
questionasker (2,536)
2 votes, 1 answer

HBase Scan TimeRange Does not Work in Scala

I wrote Scala code to retrieve data based on its time range. Here is my code: object Hbase_Scan_TimeRange { def main(args: Array[String]): Unit = { //===Basic Hbase (Non Deprecated)===Start Logger.getLogger(this.getClass) …
questionasker (2,536)
2 votes, 3 answers

Load data into Hive with custom delimiter

I'm trying to create an internal (managed) table in hive that can store my incremental log data. The table goes like this: CREATE TABLE logs (foo INT, bar STRING, created_date TIMESTAMP) ROW FORMAT DELIMITED FIELDS TERMINATED BY '<=>' STORED AS…
shriyog (938)
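The answers aren't shown in this listing, but the multi-character delimiter '<=>' is the catch: Hive's default SerDe only honors single-character field delimiters. A common fix is MultiDelimitSerDe, which accepts arbitrary delimiter strings; column names below follow the question, everything else is a sketch:

```sql
-- MultiDelimitSerDe handles multi-character field delimiters like '<=>'
CREATE TABLE logs (foo INT, bar STRING, created_date TIMESTAMP)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim" = "<=>")
STORED AS TEXTFILE;
```

With ROW FORMAT DELIMITED, only the first character of '<=>' would be used as the delimiter, which silently corrupts the parsed columns.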
2 votes, 4 answers

Spark-Submit through command line does not enforce UTF-8 encoding

When I run my Spark job from an IDE using Spark's Java APIs, I get the output in the desired encoding (UTF-8). But if I run 'spark-submit' from the command line, the output loses the encoding. Is there a way I can enforce…
KJ Sudarshan (2,694)
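The answers aren't reproduced here, but the IDE-versus-CLI difference usually comes down to the JVM default charset inherited from the environment. A fix that commonly works is forcing file.encoding on both driver and executor JVMs (class and jar names below are placeholders):

```shell
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dfile.encoding=UTF-8" \
  --conf "spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8" \
  --class com.example.MyJob myjob.jar
```

The same two properties can instead be set in spark-defaults.conf so every submission inherits them.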