Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volumes. Questions may tend to be related to infrastructure, algorithms, statistics, and data structures.

Big data is not only data of huge volume; it is also characterized by other properties such as velocity, veracity, and variety.

Several characteristics distinguish big data from conventional data processing:

Data

  • Data is so large it cannot be processed on a single computer.
  • Relationship between data elements is extremely complex.

Algorithms

  • Single-machine algorithms with worse than O(N) complexity become impractical at this scale and would likely take years to finish.
  • Fast distributed algorithms are used instead.

Storage

  • The underlying data storage must be fault-tolerant and keep data in a consistent state regardless of device failures.
  • No single storage device can hold the entire data set.

Eco-system

  • Big data is also associated with the set of tools used to process huge amounts of data, known as the big data ecosystem. Popular tools include HDFS, Spark, and MapReduce.
7919 questions
2
votes
2 answers

JSON to dataset in Spark

I am facing an issue for which I am seeking your help. I have a task to convert a JSON file to a Dataset so that it can be loaded into Hive. Code 1 SparkSession spark1 = SparkSession .builder() …
manshul goel
2
votes
1 answer

Loading a very large table without a numeric ID from MySQL to S3

I'm trying to pump (with Sqoop) a large table (500 GB in size, around 200M rows) from MySQL to S3. However, this table doesn't have a numeric key column; it has a composite primary key spanning 3 columns. I observed that Sqoop cannot chunk the…
Malinga
2
votes
1 answer

Counting latest state of stateful entities in streaming with Flink

I tried to create my first real-time analytics job in Flink. The approach is kappa-architecture-like, so I have my raw data on Kafka where we receive a message for every change of state of any entity. So the messages are of the form: (id,newStatus,…
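Setting Flink aside, the core logic the question describes — keep only the newest status per entity, then count entities per status — can be sketched in plain Python (the tuple shape follows the question's (id, newStatus, …) messages; a streaming job would maintain the same state incrementally rather than in one batch):

```python
from collections import Counter

def count_latest_states(messages):
    """Reduce a stream of (entity_id, new_status) change events to the
    latest status per entity, then count entities per status."""
    latest = {}  # entity_id -> most recent status seen
    for entity_id, new_status in messages:
        latest[entity_id] = new_status
    return Counter(latest.values())

# Entity 1 moves from "open" to "closed"; only its latest state is counted.
counts = count_latest_states([(1, "open"), (2, "open"), (1, "closed")])
# counts == Counter({"open": 1, "closed": 1})
```

In a keyed stream this corresponds to per-key state holding the latest status, with the count recomputed on each update.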
Chobeat
2
votes
3 answers

Processing a .txt file of 10 GB

I have a text file of around 10 GB and need to do some processing on the textual data in it. What is the best way to read, access, and process such a huge file? I am thinking of breaking the file into chunks and then processing it by handling…
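For the streaming approach the question describes, a minimal Python sketch: iterating a file object yields one line at a time, so memory use stays flat regardless of file size (`handle_line` is a placeholder for whatever processing is actually needed):

```python
def process_lines(path, handle_line):
    """Stream a huge text file line by line; only one line is held in
    memory at a time, so a 10 GB file poses no memory problem."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            handle_line(line)
```

If the processing is CPU-bound, the same pattern extends to reading fixed-size chunks and handing them to worker processes.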
Danyal Sandeelo
2
votes
1 answer

Modified Structure of RDD in Spark

I am new to Spark/Scala. val First: RDD[((Short, String), (Int, Double, Int))] This is the structure of my RDD. I want to modify this structure to something like below: val First: RDD[(Short, String, Int, Double, Int)] Because I have another RDD…
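In Spark this reshaping is a single `map` over the RDD with a pattern match on the nested tuple. The reshaping itself, sketched in plain Python (per-record, which is exactly what the map would apply):

```python
def flatten_pair(record):
    """Flatten ((a, b), (c, d, e)) into (a, b, c, d, e) -- the same
    reshaping a map with a tuple pattern match performs per record."""
    (a, b), (c, d, e) = record
    return (a, b, c, d, e)

flat = flatten_pair(((1, "x"), (2, 3.0, 4)))
# flat == (1, "x", 2, 3.0, 4)
```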
Darshan
2
votes
3 answers

How to get tables registered as spark table into data frame

I have imported tables from a PostgreSQL database into spark-sql using the Spark Thrift Server JDBC connection, and from beeline I can now see these tables. Is there any way I can convert these tables into Spark DataFrames?
nat
2
votes
3 answers

Exporting big data to CSV file - crashes

I'm trying to export big data to CSV files (above 20,000 lines; it can easily exceed 100,000 lines). After I try to download the file, it crashes - download failed due to network failures. (I managed to download an 18,000-line file that weighs 1.7 MB,…
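Failures on large exports are often caused by building the entire file in memory before sending it. A sketch of writing the CSV incrementally instead (the `rows` iterable is a stand-in for whatever produces the data; a web app would stream to the response rather than to a local path):

```python
import csv

def export_rows(rows, path):
    """Write rows to a CSV file incrementally; memory use stays flat no
    matter how many rows, since each row is written as it arrives."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)
```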
badbuda
2
votes
2 answers

Can we have null value for BIGINT column in HIVE

I have a question, probably a basic one. I would like to give a null value to a column whose data type is BIGINT. Is that possible?
user3752667
2
votes
3 answers

First steps for OLAP within BigData world

First of all, I may be misinformed about big data capabilities nowadays, so don't hesitate to correct me if I'm too optimistic. I usually work with regular KPIs, like: show me the count of new clients where they meet certain complex conditions (joining…
user1464922
2
votes
1 answer

Java heap utilization in spark job

I am running a Spark Streaming job through Java. I have a 4-node cluster on AWS with the Cloudera distribution, of which 3 are compute nodes. I need to record how much Java heap is utilized on each executor/node of the cluster when my job runs. I am…
Anup
2
votes
0 answers

How to optimize treeAggregate of LBFGS on spark

I'm running LBFGS on Spark, with 5 features and 100,000 records, and found that treeAggregate is time-consuming. I have 100 cores, and every 'treeAggregate at LBFGS.scala:218' job has 10,000+ tasks.
2
votes
1 answer

Storing deltas in the database instead of the whole object

I would like to store/update an object with a long list of fields in a database. I am planning to use SQL Server (not 2016), and I have no predefined data format to store this object, which means I can store it as JSON/BSON, as a binary blob, and…
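The core of the delta approach the title asks about: persist only the fields that changed between versions, and rebuild the current object by replaying deltas over a base. A minimal sketch with plain dicts (a simplification: it does not encode deleted fields, which a real delta format would need to):

```python
def diff(old, new):
    """Return only the fields that changed between two versions of an
    object -- the delta to store instead of the whole object."""
    return {k: v for k, v in new.items() if old.get(k) != v}

def apply_delta(base, delta):
    """Reconstruct the newer version by overlaying a delta on a base."""
    merged = dict(base)
    merged.update(delta)
    return merged
```

The trade-off is classic: deltas shrink writes when few fields change per update, but reads must replay the delta chain (or consult a periodic snapshot).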
Deniz
2
votes
1 answer

Convert a DStream to a Data Frame

Hi, I am trying to read tweets from Twitter using Apache Spark Streaming and to convert them to a DataFrame. I have pasted my approach below; however, I am not able to get the correct approach. Some pointers would be welcome. As…
Ayon
2
votes
1 answer

BigQuery : is it possible to execute another query inside an UDF?

I have a table that records a row for each unique user per day with some aggregated stats for that user on that day, and I need to produce a report that tells me, for each day, the number of unique users in the last 30 days including that day. e.g.…
2
votes
1 answer

Pick m points per cluster

I have 100M pairs of the form: (point_index, cluster_index). The goal is to select (the first? It doesn't matter) m points for every cluster. There are at most 16k clusters. How can I do this efficiently? m will be small, <=100. My first…
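A single pass over the pairs suffices, since "the first m" per cluster is acceptable. Sketched in plain Python (a distributed version would apply the same cap per partition and merge, again capping at m):

```python
from collections import defaultdict

def first_m_per_cluster(pairs, m):
    """One pass over (point_index, cluster_index) pairs, keeping the
    first m points seen for each cluster. O(number of pairs) time; the
    result holds at most m points per cluster (<= 16k * m entries)."""
    picked = defaultdict(list)
    for point, cluster in pairs:
        bucket = picked[cluster]
        if len(bucket) < m:
            bucket.append(point)
    return dict(picked)

sel = first_m_per_cluster([(10, 0), (11, 0), (12, 1), (13, 0)], m=2)
# sel == {0: [10, 11], 1: [12]}
```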
gsamaras