Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volumes. Questions tend to relate to infrastructure, algorithms, statistics, and data structures.

Big data is not only data with a huge volume; it also has other characteristics such as velocity, veracity, and variety.

Several characteristics set big data apart as a distinct concept:

Data

  • The data is too large to be processed on a single computer.
  • Relationships between data elements are extremely complex.

Algorithms

  • Algorithms that run on a single machine and take longer than O(N) would likely need years to finish at this scale.
  • Fast distributed algorithms are used instead.

Storage

  • The underlying data storage must be fault-tolerant and keep the data consistent despite device failures.
  • No single storage device can hold the entire data set.

Ecosystem

  • Big data is also associated with the set of tools used to process huge amounts of data, known as the big data ecosystem. Popular tools include HDFS, Spark, and MapReduce; a minimal sketch of how they fit together follows.
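
As a rough illustration of how these pieces fit together, here is a minimal Spark word-count sketch in Scala; the HDFS paths are hypothetical, and the same job could equally be written as a classic MapReduce program.

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Spark does the distributed computation; HDFS provides fault-tolerant storage.
        val spark = SparkSession.builder().appName("WordCount").getOrCreate()

        // Hypothetical input path; each HDFS block becomes one input partition,
        // so the work is spread over the cluster instead of one machine.
        val lines = spark.sparkContext.textFile("hdfs:///data/input")

        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          .reduceByKey(_ + _) // distributed aggregation, MapReduce-style

        counts.saveAsTextFile("hdfs:///data/output") // hypothetical output path
        spark.stop()
      }
    }
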
7919 questions
2
votes
0 answers

Add Column to Hive External Table Error

Trying to add a column to an external table in Hive but getting the error below. This table currently has a thousand partitions registered and I want to avoid re-creating the table and then running MSCK REPAIR, which would take a very long time to…
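
The actual error message is elided above, so only the general shape of the operation can be sketched here: adding a column changes the table metadata only and does not require re-creating the table or re-running MSCK REPAIR. The statement below is issued through Spark SQL with Hive support; the database, table, and column names are invented, and in Hive itself a CASCADE clause is typically added so existing partitions pick up the new column.

    import org.apache.spark.sql.SparkSession

    // Hypothetical names throughout; this only sketches the DDL involved.
    val spark = SparkSession.builder()
      .appName("AddHiveColumn")
      .enableHiveSupport() // talk to the Hive metastore instead of re-creating the table
      .getOrCreate()

    // Metadata-only change: the data files and registered partitions are untouched.
    spark.sql("ALTER TABLE mydb.events ADD COLUMNS (new_col STRING)")
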
user3250672
  • 192
  • 1
  • 5
2
votes
2 answers

Using Kinesis Analytics to construct real time sessions

Is there an example somewhere, or can someone explain how to use Kinesis Analytics to construct real-time sessions (i.e. sessionization)? It is mentioned that this is possible here:…
Jesse Hull
  • 150
  • 1
  • 8
2
votes
1 answer

Storing a large file in Hadoop HDFS?

I need to store a large file of about 10 TB on HDFS. What I need to understand is how HDFS will store this file. Say the replication factor for the cluster is 3 and I have a 10-node cluster with over 10 TB of disk space on each node, i.e. total…
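
With the default 128 MB block size and replication factor 3, HDFS splits a 10 TB file into roughly 80,000 blocks and stores each block on three different DataNodes, consuming about 30 TB of raw disk spread over the 10 nodes. A small sketch (the path is hypothetical) that uses the Hadoop FileSystem API to show where the blocks of a file actually live:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object BlockLocations {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration() // picks up core-site.xml / hdfs-site.xml
        val fs = FileSystem.get(conf)

        val status = fs.getFileStatus(new Path("/data/bigfile")) // hypothetical 10 TB file

        // One entry per block (128 MB by default), listing the DataNodes holding its replicas.
        fs.getFileBlockLocations(status, 0, status.getLen).foreach { block =>
          println(s"offset=${block.getOffset} length=${block.getLength} hosts=${block.getHosts.mkString(",")}")
        }
      }
    }
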
samshers
  • 1
  • 6
  • 37
  • 84
2
votes
3 answers

How can I increase big data performance?

I am new to this concept and still learning. I have 10 TB of JSON files in total in AWS S3 and 4 instances (m3.xlarge) in AWS EC2 (1 master, 3 workers). I am currently using Spark with Python on Apache Zeppelin. I am reading files with the following…
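
The read code itself is truncated above. As a hedged sketch of two common levers for this setup (the question uses PySpark; Scala is used here for consistency with the other examples, and the bucket, prefix, and fields are invented): supplying an explicit schema avoids scanning 10 TB of JSON just to infer types, and keeping the partition count a small multiple of the 12 worker cores (3 x m3.xlarge, 4 vCPUs each) keeps all executors busy.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("ReadS3Json").getOrCreate()

    // Invented schema; adjust to the real structure of the JSON records.
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("event", StringType),
      StructField("ts", TimestampType)
    ))

    val df = spark.read
      .schema(schema)                // skip schema inference over 10 TB of input
      .json("s3a://my-bucket/json/") // hypothetical bucket and prefix

    // 3 workers with 4 cores each = 12 executor cores; use a small multiple of that.
    val repartitioned = df.repartition(48)
    println(repartitioned.count())
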
Beril Boga
  • 97
  • 2
  • 9
2
votes
2 answers

How to select a specific column with sqldf if no column name is given

I have a large file (data.txt, 35 GB) which has 3 columns. An example part of the file looks like the following: ... ... ... 5 701565 8679.56 8 1.16201e+006 3193.18 1 1.16173e+006 4457.85 14 1.16173e+006 4457.85 9 …
Fabi
  • 71
  • 1
  • 8
2
votes
1 answer

Google CloudSQL or BigQuery for Big Data Actively Updated Every Second

I'm currently using Google CloudSQL for my needs, collecting data from user activities. Every day the number of rows in my table increases by around 9-15 million, and it is updated every second. The data includes several main…
hum_hum_pa
  • 121
  • 9
2
votes
1 answer

Process large text file using Zeppelin and Spark

I'm trying to analyze (visualize, actually) some data from a large text file (over 50 GB) using Zeppelin (Scala). Examples from the web use CSV files with a known header and datatypes for each column. In my case, I have lines of pure data with " "…
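
The excerpt is cut off, but the gist seems to be a headerless, space-delimited file. A minimal sketch of one way to handle that in Zeppelin: read it with the CSV reader, set the delimiter to a single space, and assign column names yourself (the path, names, and types below are placeholders).

    // In a Zeppelin notebook the `spark` session already exists.
    val df = spark.read
      .option("header", "false")
      .option("delimiter", " ")
      .option("inferSchema", "true") // or pass an explicit schema to avoid a full scan of 50 GB
      .csv("/data/large-file.txt")   // hypothetical path
      .toDF("col1", "col2", "col3")  // placeholder column names

    df.printSchema()
    df.show(5)
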
zelenov aleksey
  • 398
  • 4
  • 13
2
votes
3 answers

managing data in big data

I am reading the book Big Data For Dummies. Welcome to Big Data For Dummies. Big data is becoming one of the most important technology trends that has the potential for dramatically changing the way organizations use information to enhance the…
venkysmarty
  • 11,099
  • 25
  • 101
  • 184
2
votes
3 answers

NullPointerException while creating a DF inside foreach()

I have to read certain files from S3, so I created a CSV containing the paths of those files on S3. I am reading the created CSV file using the code below: val listofFilesRDD = sparkSession.read.textFile("s3://"+ file) This is working fine. Then I am trying to…
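
The rest of the code is elided, but this error pattern usually comes from calling sparkSession.read inside a foreach or map over listofFilesRDD: those closures run on executors, where the SparkSession is null, hence the NullPointerException. That diagnosis is an assumption here; under it, one common fix is to bring the (small) list of paths back to the driver and issue all reads there:

    // Same sparkSession and file variables as in the question.
    val listofFilesRDD = sparkSession.read.textFile("s3://" + file)

    // DataFrames can only be created on the driver, never inside foreach/map closures
    // that run on executors, so collect the paths first.
    val paths = listofFilesRDD.collect()

    // Either read each file on the driver...
    val perFile = paths.map(p => sparkSession.read.csv(p))

    // ...or, usually better, hand all paths to a single read so Spark parallelizes it.
    val all = sparkSession.read.csv(paths: _*)
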
2
votes
3 answers

Optimizing cohort analysis on Google BigQuery

I'm attempting to perform a cohort analysis on a very large table. I have a test table with ~30M rows (over double in production). The query fails in BigQuery stating "resources exceeded.." and it's a tier 18 query (tier 1 is $5, so it's a $90…
mnort9
  • 1,810
  • 3
  • 30
  • 54
2
votes
0 answers

process XML in spark without external utility

What I am trying to do: I have been asked to flatten an XML using Spark (Java) but without the com.databricks utility. I have copied the XMLInputClass Java code and am using it, so that a file split does not cause an issue while processing via RDD. public class…
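
With the copied XMLInputFormat-style class on the classpath, the usual pattern is to pass it to newAPIHadoopFile so that each record is one complete XML element and a file split can never cut an element in half. The class name, configuration keys, start/end tags, and path below are assumptions based on the commonly copied Mahout-style XmlInputFormat; adjust them to the class actually in use.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.sql.SparkSession
    // import mypackage.XmlInputFormat  // the copied input format class (assumed name/package)

    val spark = SparkSession.builder().appName("XmlWithoutDatabricks").getOrCreate()
    val sc = spark.sparkContext

    // Assumed configuration keys of the Mahout-style XmlInputFormat.
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("xmlinput.start", "<record>")
    conf.set("xmlinput.end", "</record>")

    // Each value holds the full text of one <record>...</record> element, regardless of splits.
    val records = sc.newAPIHadoopFile(
      "/data/input.xml",       // hypothetical path
      classOf[XmlInputFormat],
      classOf[LongWritable],
      classOf[Text],
      conf
    ).map { case (_, value) => value.toString }

    records.take(3).foreach(println)
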
manshul goel
  • 71
  • 2
  • 7
2
votes
1 answer

LeftOuterJoin in Flink (JAVA API)

I am trying to do a LeftOuterJoin in Flink. I am not trying to implement the leftOuterJoin myself as is done with the CoGroupFunction here: https://gist.github.com/mxm/c2e9c459a9d82c18d789 Instead, I am trying to use the FlatJoinFunction: public static…
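
Since Flink 0.10 the DataSet API has a built-in leftOuterJoin, so neither a CoGroupFunction nor a hand-rolled FlatJoinFunction is strictly required. The question targets the Java API; the sketch below uses the Scala DataSet API for consistency with the other examples here, with made-up tuple data, and the join function receives null on the right side when there is no match.

    import org.apache.flink.api.scala._

    object LeftOuterJoinExample {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // Made-up sample data: (id, name) on the left, (id, score) on the right.
        val left = env.fromElements((1, "a"), (2, "b"), (3, "c"))
        val right = env.fromElements((1, 10), (3, 30))

        // The right element is null for unmatched left rows; that is the "left outer" part.
        val joined = left.leftOuterJoin(right)
          .where(0)
          .equalTo(0) { (l, r) =>
            (l._1, l._2, if (r == null) -1 else r._2)
          }

        joined.print()
      }
    }
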
SevenOfNine
  • 630
  • 1
  • 6
  • 25
2
votes
2 answers

How to Optimize Sqoop import?

What techniques can be used to optimize Sqoop import? I have tried using a split-by column to enable parallelism and increased the number of mappers based on the table's data volume. Will changing from FIFO to the Fair Scheduler help? …
Holmes
  • 1,059
  • 2
  • 17
  • 25
2
votes
2 answers

REST API for processing data stored in HBase

I have a lot of records (millions) in an HBase store like this: key = user_id:service_id:usage_timestamp, value = some_int. That means a user used some service_id for some_int at usage_timestamp. Now I want to provide some REST API for…
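
Whatever framework serves the REST layer, the typical building block behind an endpoint such as /usage/{user_id} is a prefix scan, since every record for one user shares the user_id: prefix of the row key. A rough sketch with the HBase client API (the table, column family, qualifier, and value encoding are invented; only the key layout comes from the question):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
    import org.apache.hadoop.hbase.util.Bytes
    import scala.collection.JavaConverters._

    object UsageByUser {
      def main(args: Array[String]): Unit = {
        val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = connection.getTable(TableName.valueOf("usage")) // invented table name

        val userId = "42"
        // All keys for one user start with "user_id:", so a prefix scan returns every
        // service_id/usage_timestamp row for that user.
        val scan = new Scan().setRowPrefixFilter(Bytes.toBytes(userId + ":"))

        val scanner = table.getScanner(scan)
        val total = scanner.iterator().asScala.map { result =>
          // Assumes the counter is stored as an 8-byte long under cf:value (invented names).
          Bytes.toLong(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("value")))
        }.sum

        println(s"total usage for user $userId = $total")
        scanner.close()
        table.close()
        connection.close()
      }
    }
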
Normal
  • 1,347
  • 4
  • 17
  • 34
2
votes
2 answers

Spark partitioning/cluster enforcing

I will be using a large number of files structured as follows: /day/hour-min.txt.gz, with a total of 14 days. I will use a cluster of 90 nodes/workers. I am reading everything with wholeTextFiles() as it is the only way that allows me to split the…
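
Gzip files are not splittable, so wholeTextFiles (one record per file) could not spread a single file across partitions anyway; a common approach is to give it a minPartitions hint and then repartition the per-file records so all 90 workers get work before the per-line processing starts. A sketch with placeholder paths and partition counts:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("DayHourFiles").getOrCreate()
    val sc = spark.sparkContext

    // One record per file: (path such as ".../day/hour-min.txt.gz", whole file contents).
    val files = sc.wholeTextFiles("/data/*/*.txt.gz", minPartitions = 360) // placeholder glob

    // Spread the per-file records over the cluster before the expensive per-line work;
    // 360 is a placeholder, a small multiple of the 90 workers' total cores.
    val lines = files
      .repartition(360)
      .flatMap { case (path, content) =>
        val day = path.split("/").takeRight(2).head // recover the day directory from the path
        content.split("\n").map(line => (day, line))
      }

    println(lines.count())
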
Dimebag
  • 833
  • 2
  • 9
  • 29