Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volumes. Questions tend to relate to infrastructure, algorithms, statistics, and data structures.

Big data is not only data with a huge volume; it also has other characteristics such as velocity, veracity, and variety.

Several characteristics set big data apart as a distinct concept:

Data

  • The data is too large to be processed on a single computer.
  • Relationships between data elements are extremely complex.

Algorithms

  • Algorithms that run on a single machine and take longer than O(N) would likely need years to finish at this scale.
  • Fast distributed algorithms are used instead.

Storage

  • The underlying data storage must be fault-tolerant and keep the data consistent despite device failures.
  • No single storage device can hold the entire data set.

Ecosystem

  • Big data is also associated with the set of tools used to process huge amounts of data, known as the big data ecosystem. Popular tools include HDFS, Spark, and MapReduce; a minimal sketch of how they fit together follows.
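
As a rough illustration of how these pieces fit together, here is a minimal Spark word-count sketch in Scala; the HDFS paths are hypothetical, and the same job could equally be written as a classic MapReduce program.

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Spark does the distributed computation; HDFS provides fault-tolerant storage.
        val spark = SparkSession.builder().appName("WordCount").getOrCreate()

        // Hypothetical input path; each HDFS block becomes one input partition,
        // so the work is spread over the cluster instead of one machine.
        val lines = spark.sparkContext.textFile("hdfs:///data/input")

        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          .reduceByKey(_ + _) // distributed aggregation, MapReduce-style

        counts.saveAsTextFile("hdfs:///data/output") // hypothetical output path
        spark.stop()
      }
    }
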
7919 questions
2
votes
0 answers

Add Column to Hive External Table Error

Trying to add a column to an external table in Hive but getting the error below. This table currently has a thousand partitions registered and I want to avoid re-creating the table and then running MSCK REPAIR, which would take a very long time to…
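
The actual error message is elided above, so only the general shape of the operation can be sketched here: adding a column changes the table metadata only and does not require re-creating the table or re-running MSCK REPAIR. The statement below is issued through Spark SQL with Hive support; the database, table, and column names are invented, and in Hive itself a CASCADE clause is typically added so existing partitions pick up the new column.

    import org.apache.spark.sql.SparkSession

    // Hypothetical names throughout; this only sketches the DDL involved.
    val spark = SparkSession.builder()
      .appName("AddHiveColumn")
      .enableHiveSupport() // talk to the Hive metastore instead of re-creating the table
      .getOrCreate()

    // Metadata-only change: the data files and registered partitions are untouched.
    spark.sql("ALTER TABLE mydb.events ADD COLUMNS (new_col STRING)")
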
user3250672
  • 192
  • 1
  • 5
2
votes
2 answers

Using Kinesis Analytics to construct real time sessions

Is there an example somewhere, or can someone explain how to use Kinesis Analytics to construct real-time sessions (i.e. sessionization)? It is mentioned that this is possible here:…
Jesse Hull
  • 150
  • 1
  • 8
2
votes
1 answer

Storing a large file in Hadoop HDFS?

I need to store a large file of about 10 TB on HDFS. What I need to understand is how HDFS will store this file. Say the replication factor for the cluster is 3 and I have a 10-node cluster with over 10 TB of disk space on each node, i.e. total…
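
With the default 128 MB block size and replication factor 3, HDFS splits a 10 TB file into roughly 80,000 blocks and stores each block on three different DataNodes, consuming about 30 TB of raw disk spread over the 10 nodes. A small sketch (the path is hypothetical) that uses the Hadoop FileSystem API to show where the blocks of a file actually live:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object BlockLocations {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration() // picks up core-site.xml / hdfs-site.xml
        val fs = FileSystem.get(conf)

        val status = fs.getFileStatus(new Path("/data/bigfile")) // hypothetical 10 TB file

        // One entry per block (128 MB by default), listing the DataNodes holding its replicas.
        fs.getFileBlockLocations(status, 0, status.getLen).foreach { block =>
          println(s"offset=${block.getOffset} length=${block.getLength} hosts=${block.getHosts.mkString(",")}")
        }
      }
    }
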
samshers
  • 1
  • 6
  • 37
  • 84
2
votes
3 answers

How can I increase big data performance?

I am new to this concept and still learning. I have 10 TB of JSON files in total in AWS S3 and 4 instances (m3.xlarge) in AWS EC2 (1 master, 3 workers). I am currently using Spark with Python on Apache Zeppelin. I am reading files with the following…
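
The read code itself is truncated above. As a hedged sketch of two common levers for this setup (the question uses PySpark; Scala is used here for consistency with the other examples, and the bucket, prefix, and fields are invented): supplying an explicit schema avoids scanning 10 TB of JSON just to infer types, and keeping the partition count a small multiple of the 12 worker cores (3 x m3.xlarge, 4 vCPUs each) keeps all executors busy.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("ReadS3Json").getOrCreate()

    // Invented schema; adjust to the real structure of the JSON records.
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("event", StringType),
      StructField("ts", TimestampType)
    ))

    val df = spark.read
      .schema(schema)                // skip schema inference over 10 TB of input
      .json("s3a://my-bucket/json/") // hypothetical bucket and prefix

    // 3 workers with 4 cores each = 12 executor cores; use a small multiple of that.
    val repartitioned = df.repartition(48)
    println(repartitioned.count())
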
Beril Boga
  • 97
  • 2
  • 9
2
votes
2 answers

How to select a specific column with sqldf if no column name is given

I have a large file (data.txt, 35 GB) which has 3 columns. An example part of the file looks like the following: ... ... ... 5 701565 8679.56 8 1.16201e+006 3193.18 1 1.16173e+006 4457.85 14 1.16173e+006 4457.85 9 …
Fabi
  • 71
  • 1
  • 8
2
votes
1 answer

Google CloudSQL or BigQuery for Big Data Actively Updated Every Second

I'm currently using Google CloudSQL for my needs, collecting data from user activities. Every day the number of rows in my table increases by around 9-15 million, and it is updated every second. The data includes several main…
hum_hum_pa
  • 121
  • 9
2
votes
1 answer

Process large text file using Zeppelin and Spark

I'm trying to analyze (visualize, actually) some data from a large text file (over 50 GB) using Zeppelin (Scala). Examples from the web use CSV files with a known header and datatypes for each column. In my case, I have lines of pure data with " "…
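
The excerpt is cut off, but the gist seems to be a headerless, space-delimited file. A minimal sketch of one way to handle that in Zeppelin: read it with the CSV reader, set the delimiter to a single space, and assign column names yourself (the path, names, and types below are placeholders).

    // In a Zeppelin notebook the `spark` session already exists.
    val df = spark.read
      .option("header", "false")
      .option("delimiter", " ")
      .option("inferSchema", "true") // or pass an explicit schema to avoid a full scan of 50 GB
      .csv("/data/large-file.txt")   // hypothetical path
      .toDF("col1", "col2", "col3")  // placeholder column names

    df.printSchema()
    df.show(5)
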
zelenov aleksey
  • 398
  • 4
  • 13
2
votes
3 answers

managing data in big data

I am reading the book Big Data For Dummies. Welcome to Big Data For Dummies. Big data is becoming one of the most important technology trends that has the potential for dramatically changing the way organizations use information to enhance the…
venkysmarty
  • 11,099
  • 25
  • 101
  • 184
2
votes
3 answers

NullPointerException while creating a DF inside foreach()

I have to read certain files from S3, so I created a CSV containing the paths of those files on S3. I am reading the created CSV file using the code below: val listofFilesRDD = sparkSession.read.textFile("s3://"+ file) This is working fine. Then I am trying to…
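
The rest of the code is elided, but this error pattern usually comes from calling sparkSession.read inside a foreach or map over listofFilesRDD: those closures run on executors, where the SparkSession is null, hence the NullPointerException. That diagnosis is an assumption here; under it, one common fix is to bring the (small) list of paths back to the driver and issue all reads there:

    // Same sparkSession and file variables as in the question.
    val listofFilesRDD = sparkSession.read.textFile("s3://" + file)

    // DataFrames can only be created on the driver, never inside foreach/map closures
    // that run on executors, so collect the paths first.
    val paths = listofFilesRDD.collect()

    // Either read each file on the driver...
    val perFile = paths.map(p => sparkSession.read.csv(p))

    // ...or, usually better, hand all paths to a single read so Spark parallelizes it.
    val all = sparkSession.read.csv(paths: _*)
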
2
votes
3 answers

Optimizing cohort analysis on Google BigQuery

I'm attempting to perform a cohort analysis on a very large table. I have a test table with ~30M rows (over double in production). The query fails in BigQuery stating "resources exceeded.." and it's a tier 18 query (tier 1 is $5, so it's a $90…
mnort9
  • 1,810
  • 3
  • 30
  • 54
2
votes
0 answers

process XML in spark without external utility

What I am trying to do: I have been asked to flatten an XML using Spark (Java) but without the com.databricks utility. I have copied the XMLInputClass Java code and am using it, so that a file split does not cause an issue while processing via RDD. public class…
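
With the copied XMLInputFormat-style class on the classpath, the usual pattern is to pass it to newAPIHadoopFile so that each record is one complete XML element and a file split can never cut an element in half. The class name, configuration keys, start/end tags, and path below are assumptions based on the commonly copied Mahout-style XmlInputFormat; adjust them to the class actually in use.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.sql.SparkSession
    // import mypackage.XmlInputFormat  // the copied input format class (assumed name/package)

    val spark = SparkSession.builder().appName("XmlWithoutDatabricks").getOrCreate()
    val sc = spark.sparkContext

    // Assumed configuration keys of the Mahout-style XmlInputFormat.
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("xmlinput.start", "<record>")
    conf.set("xmlinput.end", "</record>")

    // Each value holds the full text of one <record>...</record> element, regardless of splits.
    val records = sc.newAPIHadoopFile(
      "/data/input.xml",       // hypothetical path
      classOf[XmlInputFormat],
      classOf[LongWritable],
      classOf[Text],
      conf
    ).map { case (_, value) => value.toString }

    records.take(3).foreach(println)
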
manshul goel
  • 71
  • 2
  • 7
2
votes
1 answer

LeftOuterJoin in Flink (JAVA API)

I am trying to do a LeftOuterJoin in Flink. I am not trying to implement the leftOuterJoin myself as is done with the CoGroupFunction here: https://gist.github.com/mxm/c2e9c459a9d82c18d789 Instead, I am trying to use the FlatJoinFunction: public static…
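
Since Flink 0.10 the DataSet API has a built-in leftOuterJoin, so neither a CoGroupFunction nor a hand-rolled FlatJoinFunction is strictly required. The question targets the Java API; the sketch below uses the Scala DataSet API for consistency with the other examples here, with made-up tuple data, and the join function receives null on the right side when there is no match.

    import org.apache.flink.api.scala._

    object LeftOuterJoinExample {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // Made-up sample data: (id, name) on the left, (id, score) on the right.
        val left = env.fromElements((1, "a"), (2, "b"), (3, "c"))
        val right = env.fromElements((1, 10), (3, 30))

        // The right element is null for unmatched left rows; that is the "left outer" part.
        val joined = left.leftOuterJoin(right)
          .where(0)
          .equalTo(0) { (l, r) =>
            (l._1, l._2, if (r == null) -1 else r._2)
          }

        joined.print()
      }
    }
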
SevenOfNine
  • 630
  • 1
  • 6
  • 25
2
votes
2 answers

How to Optimize Sqoop import?

What techniques can be used to optimize Sqoop import? I have tried using a split-by column to enable parallelism and increased the number of mappers based on the table's data volume. Will changing from FIFO to the Fair Scheduler help? …
Holmes
  • 1,059
  • 2
  • 17
  • 25
2
votes
2 answers

REST API for processing data stored in HBase

I have a lot of records (millions) in an HBase store like this: key = user_id:service_id:usage_timestamp, value = some_int. That means a user used some service_id for some_int at usage_timestamp. Now I want to provide some REST API for…
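
Whatever framework serves the REST layer, the typical building block behind an endpoint such as /usage/{user_id} is a prefix scan, since every record for one user shares the user_id: prefix of the row key. A rough sketch with the HBase client API (the table, column family, qualifier, and value encoding are invented; only the key layout comes from the question):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
    import org.apache.hadoop.hbase.util.Bytes
    import scala.collection.JavaConverters._

    object UsageByUser {
      def main(args: Array[String]): Unit = {
        val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = connection.getTable(TableName.valueOf("usage")) // invented table name

        val userId = "42"
        // All keys for one user start with "user_id:", so a prefix scan returns every
        // service_id/usage_timestamp row for that user.
        val scan = new Scan().setRowPrefixFilter(Bytes.toBytes(userId + ":"))

        val scanner = table.getScanner(scan)
        val total = scanner.iterator().asScala.map { result =>
          // Assumes the counter is stored as an 8-byte long under cf:value (invented names).
          Bytes.toLong(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("value")))
        }.sum

        println(s"total usage for user $userId = $total")
        scanner.close()
        table.close()
        connection.close()
      }
    }
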
Normal
  • 1,347
  • 4
  • 17
  • 34
2
votes
2 answers

Spark partitioning/cluster enforcing

I will be using a large number of files structured as follows: /day/hour-min.txt.gz, with a total of 14 days. I will use a cluster of 90 nodes/workers. I am reading everything with wholeTextFiles() as it is the only way that allows me to split the…
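
Gzip files are not splittable, so wholeTextFiles (one record per file) could not spread a single file across partitions anyway; a common approach is to give it a minPartitions hint and then repartition the per-file records so all 90 workers get work before the per-line processing starts. A sketch with placeholder paths and partition counts:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("DayHourFiles").getOrCreate()
    val sc = spark.sparkContext

    // One record per file: (path such as ".../day/hour-min.txt.gz", whole file contents).
    val files = sc.wholeTextFiles("/data/*/*.txt.gz", minPartitions = 360) // placeholder glob

    // Spread the per-file records over the cluster before the expensive per-line work;
    // 360 is a placeholder, a small multiple of the 90 workers' total cores.
    val lines = files
      .repartition(360)
      .flatMap { case (path, content) =>
        val day = path.split("/").takeRight(2).head // recover the day directory from the path
        content.split("\n").map(line => (day, line))
      }

    println(lines.count())
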
Dimebag
  • 833
  • 2
  • 9
  • 29