Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volumes. Questions tend to relate to infrastructure, algorithms, statistics, and data structures.

Big data is not only about huge volume; it is also characterized by velocity, veracity, and variety.

Several features distinguish big data from conventional data processing:

Data

  • The data set is too large to be processed on a single computer.
  • Relationships between data elements are extremely complex.

Algorithms

  • Single-machine algorithms with worse-than-O(N) running time would likely take years to finish.
  • Fast distributed algorithms are used instead.

Storage

  • The underlying storage must be fault-tolerant and keep the data in a consistent state regardless of device failures.
  • No single storage device can hold the entire data set.

Eco-system

  • Big data also refers to the set of tools used to process huge amounts of data, known as the big data ecosystem. Popular tools include HDFS, Spark, and MapReduce.
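The division of work these tools implement can be sketched at toy scale as the classic MapReduce word count; here the "cluster" is a single process, purely for illustration:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # map: emit a (word, 1) pair for every word in a line
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # shuffle: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts for each word
    return {word: sum(values) for word, values in groups.items()}

lines = ["big data tools", "big data ecosystem"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(l) for l in lines)))
print(counts["big"], counts["ecosystem"])  # → 2 1
```

In a real cluster, the map and reduce phases run on many machines and the shuffle moves data over the network; the structure of the computation is the same.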
7919 questions
2 votes, 1 answer

Correct way of writing two floats into a regular txt

I am running a big job in cluster mode. However, I am only interested in two float numbers, which I want to read somehow when the job succeeds. Here is what I am trying: from pyspark.context import SparkContext if __name__ == "__main__": sc =…
gsamaras (71,951)
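The answers aren't shown in this listing, but the non-Spark part of the problem is just plain file I/O on the driver: bring the two values to the driver first (e.g. with rdd.collect() or an accumulator), then write them. A minimal sketch of the write/read round trip (file name hypothetical):

```python
import os
import tempfile

# Stand-ins for the two floats the cluster job would produce
result_a, result_b = 3.14, 2.72

path = os.path.join(tempfile.gettempdir(), "two_floats.txt")

# Write both floats to a regular text file on one line
with open(path, "w") as f:
    f.write("%r %r\n" % (result_a, result_b))

# Read them back
with open(path) as f:
    a, b = map(float, f.read().split())
print(a, b)  # → 3.14 2.72
```

In cluster mode the driver may run on a remote node, so the file should go somewhere accessible afterwards (e.g. HDFS or a shared mount) rather than a node-local temp directory.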
2 votes, 1 answer

spark unix_timestamp data type mismatch

Could someone help me figure out what data type or format I need to pass for Spark's from_unixtime() function to work? When I try the following it works, but responds not with…
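The thread's answers aren't shown here, but the usual cause of a from_unixtime() type mismatch is passing milliseconds or an already-formatted string: the function expects seconds since the Unix epoch as a numeric (bigint) value. The same conversion in plain Python shows the expected input shape:

```python
from datetime import datetime, timezone

# from_unixtime() wants whole seconds since the epoch, not milliseconds
seconds = 1470009600  # 2016-08-01 00:00:00 UTC
formatted = datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
print(formatted)  # → 2016-08-01 00:00:00
```

If the source column holds milliseconds, divide by 1000 (and cast to bigint) before handing it to from_unixtime().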
2 votes, 3 answers

Create Partition table in Big Query

Can anyone please suggest how to create a partitioned table in BigQuery? Example: suppose I have log data in Google Storage for the year 2016, stored in one bucket partitioned by year, month, and date. Here I want to create a table…
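The accepted answer isn't shown in this listing, but with BigQuery's current standard-SQL DDL a date-partitioned table can be declared directly; dataset, table, and column names below are hypothetical:

```sql
-- Partition the table by the calendar date of a TIMESTAMP column
CREATE TABLE mydataset.logs_2016 (
  event_time TIMESTAMP,
  message    STRING
)
PARTITION BY DATE(event_time);
```

Loads from Google Storage then land in the partition matching each row's event_time, so a separate table per day is not needed.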
2 votes, 0 answers

How to compute this huge Correlation Matrix?

I have a huge matrix with nrow=144 and ncol=156267 containing numbers, and I would like to compute the correlation between all the columns. This can be done using the bigcor function described here:…
NKGon (55)
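No answers are shown here, but the blockwise idea behind bigcor is straightforward: standardize the columns once, then fill the ncol-by-ncol result one block pair at a time so only a small slab is in memory at once. A sketch in Python with NumPy (sizes are tiny stand-ins for 144 x 156267; block size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(144, 12))          # toy stand-in for the real matrix

# Standardize columns once: zero mean, unit (sample) standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
n, p = Z.shape

block = 5
corr = np.empty((p, p))
for i in range(0, p, block):
    for j in range(0, p, block):
        # Pearson correlation of standardized columns is Z_i^T Z_j / (n-1)
        corr[i:i+block, j:j+block] = Z[:, i:i+block].T @ Z[:, j:j+block] / (n - 1)

ok = np.allclose(corr, np.corrcoef(X, rowvar=False))
print(ok)  # → True
```

For 156267 columns the result is about 24 billion entries, so the output itself (~195 GB in float64) needs to live on disk, e.g. in a memory-mapped array, which is exactly what bigcor's ff backend provides.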
2 votes, 2 answers

Spark starting more executors than specified

I'm running Spark 1.5.1 in standalone (client) mode using PySpark. I'm trying to start a job that seems to be memory-heavy (on the Python side, that is, so it should not be covered by the executor-memory setting). I'm testing on a machine with 96 cores and 128…
2 votes, 1 answer

What is the difference between FAILED AND ERROR in spark application states

I am trying to create a state diagram of a submitted Spark application, and I am kind of lost on when an application is considered FAILED. States are from here:…
Aravind Yarram (78,777)
2 votes, 0 answers

Architecture: How to use Spark ML predictions as HTTP service

I have a Spark Streaming application which trains a model and periodically stores it to HDFS. In an HTTP-based web service, I would like to POST some values and retrieve a prediction for them. The service should also reload the model on demand…
marquies (1,066)
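This question has no answers in the listing, but one common shape for such an architecture is a small HTTP front end that holds the latest model in memory and serves predictions, reloading the model file when asked. A stdlib-only sketch, with a dummy function standing in for the Spark ML model (endpoint, payload format, and model are all hypothetical):

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Dummy stand-in for model.predict() on a model loaded from HDFS
    return sum(features)

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON feature vector from the request body
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet

server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Exercise the service once with a POST request
url = "http://127.0.0.1:%d/" % server.server_port
req = urllib.request.Request(url, data=json.dumps([1.0, 2.0, 3.5]).encode())
resp = json.loads(urllib.request.urlopen(req).read())
print(resp["prediction"])  # → 6.5
server.shutdown()
```

Reload-on-demand could be a second endpoint that swaps the in-memory model atomically; in production this is usually a proper web framework or model server rather than http.server.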
2 votes, 3 answers

NameError: global name 'NoneType' is not defined in Spark

I have written a UDF to replace a few specific date values in a column named "latest_travel_date" with 'NA'. However, this column also contains many null values, so I have handled these in the UDF as well (please see below). Query: def…
Preyas (773)
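The answers aren't reproduced here, but the NameError itself has a simple cause: NoneType is not a builtin name in Python (it is only reachable as type(None)), so referencing it bare inside a UDF fails at call time. Comparing with `is None` avoids the name entirely; the sentinel dates below are hypothetical:

```python
def clean_date(value):
    # Correct: compare to None directly (not: type(value) == NoneType)
    if value is None:
        return "NA"
    # Replace a few specific sentinel dates (hypothetical values)
    if value in ("1900-01-01", "9999-12-31"):
        return "NA"
    return value

print(clean_date(None), clean_date("2016-08-01"))  # → NA 2016-08-01
```

The same function body works unchanged when wrapped as a Spark UDF, since null column values arrive in Python as None.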
2 votes, 1 answer

Query on the last element of an array in MongoDB when the array size is stored in a variable

I have a dataset in MongoDB and this is an example of a line of my data: { "conversionDate": "2016-08-01", "timeLagInDaysHistogram": 0, "pathLengthInInteractionsHistogram": 4, "campaignPath": [ {"campaignName": "name1", "source": "sr1",…
Martin Mas (23)
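The accepted answer isn't shown, but since MongoDB 3.2 the aggregation operator $arrayElemAt accepts a negative index, so the stored array size is not needed to reach the last element. A sketch of the pipeline as it would be passed to pymongo's collection.aggregate() (the campaignPath field is from the sample document; the projected name is hypothetical):

```python
# Project the last element of campaignPath without knowing its length
pipeline = [
    {"$project": {
        "conversionDate": 1,
        "lastCampaign": {"$arrayElemAt": ["$campaignPath", -1]},
    }}
]
print(pipeline[0]["$project"]["lastCampaign"])
```

Index -1 is resolved per document, so arrays of different lengths all yield their own final element.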
2 votes, 1 answer

Spark::KMeans calls takeSample() twice?

I have a lot of data and have experimented with partitions of cardinality [20k, 200k+]. I call it like this: from pyspark.mllib.clustering import KMeans, KMeansModel C0 = KMeans.train(first, 8192, initializationMode='random', maxIterations=10,…
gsamaras (71,951)
2 votes, 2 answers

Skip hyphen in hive

I have executed a query in the Hive CLI that should generate an external table. "create EXTERNAL TABLE IF NOT EXISTS hassan( code int, area_name string, male_60_64 STRUCT, male_above_65 STRUCT) ROW FORMAT DELIMITED FIELDS TERMINATED BY…
Hassan (43)
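The thread's answers aren't shown, but a bare STRUCT in Hive DDL is invalid: each STRUCT column must spell out its field names and types, and the struct's inner delimiter must be declared. A sketch of working DDL, where the struct field names, delimiters, and location are all hypothetical since the original statement is truncated:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS hassan (
  code int,
  area_name string,
  male_60_64 STRUCT<cnt:INT, ratio:DOUBLE>,
  male_above_65 STRUCT<cnt:INT, ratio:DOUBLE>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','            -- top-level column separator
  COLLECTION ITEMS TERMINATED BY ':'  -- separator inside each struct
LOCATION '/user/hassan/data';
```

Struct fields are then addressed with dot notation, e.g. male_60_64.cnt.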
2 votes, 1 answer

Scala - How to Return this Kind of RDD type

I am trying to make a method that returns an RDD, referring to this, but it failed because the return type needs type parameters. According to the API (Java), here is my code: def HBaseToRDD(_HBaseConfiguration:HBaseConfiguration, _sc:SparkContext) : RDD[(K, V)] = { val HBaseRDD =…
questionasker (2,536)
2 votes, 1 answer

HBase Scan TimeRange Does not Work in Scala

I wrote Scala code to retrieve data based on its time range. Here is my code: object Hbase_Scan_TimeRange { def main(args: Array[String]): Unit = { //===Basic Hbase (Non Deprecated)===Start Logger.getLogger(this.getClass) …
questionasker (2,536)
2 votes, 3 answers

Load data into Hive with custom delimiter

I'm trying to create an internal (managed) table in hive that can store my incremental log data. The table goes like this: CREATE TABLE logs (foo INT, bar STRING, created_date TIMESTAMP) ROW FORMAT DELIMITED FIELDS TERMINATED BY '<=>' STORED AS…
shriyog (938)
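The answers aren't shown in this listing, but the multi-character delimiter '<=>' is the catch: Hive's default SerDe only honors single-character field delimiters. A common fix is MultiDelimitSerDe, which accepts arbitrary delimiter strings; column names below follow the question, everything else is a sketch:

```sql
-- MultiDelimitSerDe handles multi-character field delimiters like '<=>'
CREATE TABLE logs (foo INT, bar STRING, created_date TIMESTAMP)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim" = "<=>")
STORED AS TEXTFILE;
```

With ROW FORMAT DELIMITED, only the first character of '<=>' would be used as the delimiter, which silently corrupts the parsed columns.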
2 votes, 4 answers

Spark-Submit through command line does not enforce UTF-8 encoding

When I run my Spark job from an IDE using Spark's Java APIs, I get the output in the desired encoding (UTF-8). But if I run 'spark-submit' from the command line, the output loses the encoding. Is there a way I can enforce…
KJ Sudarshan (2,694)
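The answers aren't reproduced here, but the IDE-versus-CLI difference usually comes down to the JVM default charset inherited from the environment. A fix that commonly works is forcing file.encoding on both driver and executor JVMs (class and jar names below are placeholders):

```shell
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dfile.encoding=UTF-8" \
  --conf "spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8" \
  --class com.example.MyJob myjob.jar
```

The same two properties can instead be set in spark-defaults.conf so every submission inherits them.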