Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volumes. Questions may tend to be related to infrastructure, algorithms, statistics, and data structures.

Big data is not only data of huge volume; it is also characterized by other properties such as velocity, veracity, and variety.

Several characteristics distinguish big data from conventional data processing:

Data

  • Data is so large it cannot be processed on a single computer.
  • Relationship between data elements is extremely complex.

Algorithms

  • Single-machine algorithms with worse than O(N) complexity become impractical at this scale and would likely take years to finish.
  • Fast distributed algorithms are used instead.

Storage

  • The underlying data storage must be fault-tolerant and keep data in a consistent state regardless of device failures.
  • No single storage device can hold the entire data set.

Eco-system

  • Big data is also associated with the set of tools used to process huge amounts of data, known as the big data ecosystem. Popular tools include HDFS, Spark, and MapReduce.
7919 questions
2
votes
2 answers

JSON to dataset in Spark

I am facing an issue for which I am seeking your help. I have a task to convert a JSON file to a Dataset so that it can be loaded into Hive. Code 1 SparkSession spark1 = SparkSession .builder() …
manshul goel
2
votes
1 answer

Loading a very large table without a numeric ID from MySQL to S3

I'm trying to pump (with Sqoop) a large table (500 GB in size, around 200M rows) from MySQL to S3. However, this table doesn't have a numeric key column; it has a composite primary key spanning 3 columns. I observed that Sqoop cannot chunk the…
Malinga
2
votes
1 answer

Counting latest state of stateful entities in streaming with Flink

I tried to create my first real-time analytics job in Flink. The approach is kappa-architecture-like, so I have my raw data on Kafka where we receive a message for every change of state of any entity. So the messages are of the form: (id,newStatus,…
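Setting Flink aside, the core logic the question describes — keep only the newest status per entity, then count entities per status — can be sketched in plain Python (the tuple shape follows the question's (id, newStatus, …) messages; a streaming job would maintain the same state incrementally rather than in one batch):

```python
from collections import Counter

def count_latest_states(messages):
    """Reduce a stream of (entity_id, new_status) change events to the
    latest status per entity, then count entities per status."""
    latest = {}  # entity_id -> most recent status seen
    for entity_id, new_status in messages:
        latest[entity_id] = new_status
    return Counter(latest.values())

# Entity 1 moves from "open" to "closed"; only its latest state is counted.
counts = count_latest_states([(1, "open"), (2, "open"), (1, "closed")])
# counts == Counter({"open": 1, "closed": 1})
```

In a keyed stream this corresponds to per-key state holding the latest status, with the count recomputed on each update.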
Chobeat
2
votes
3 answers

Processing a .txt file of 10 GB

I have a text file of around 10 GB and need to do some processing on the textual data in it. What is the best way to read, access, and process such a huge file? I am thinking of breaking the file into chunks and then processing it by handling…
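For the streaming approach the question describes, a minimal Python sketch: iterating a file object yields one line at a time, so memory use stays flat regardless of file size (`handle_line` is a placeholder for whatever processing is actually needed):

```python
def process_lines(path, handle_line):
    """Stream a huge text file line by line; only one line is held in
    memory at a time, so a 10 GB file poses no memory problem."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            handle_line(line)
```

If the processing is CPU-bound, the same pattern extends to reading fixed-size chunks and handing them to worker processes.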
Danyal Sandeelo
2
votes
1 answer

Modified Structure of RDD in Spark

I am new to Spark/Scala. val First: RDD[((Short, String), (Int, Double, Int))] This is the structure of my RDD. I want to modify this structure to something like below: val First: RDD[(Short, String, Int, Double, Int)] Because I have another RDD…
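In Spark this reshaping is a single `map` over the RDD with a pattern match on the nested tuple. The reshaping itself, sketched in plain Python (per-record, which is exactly what the map would apply):

```python
def flatten_pair(record):
    """Flatten ((a, b), (c, d, e)) into (a, b, c, d, e) -- the same
    reshaping a map with a tuple pattern match performs per record."""
    (a, b), (c, d, e) = record
    return (a, b, c, d, e)

flat = flatten_pair(((1, "x"), (2, 3.0, 4)))
# flat == (1, "x", 2, 3.0, 4)
```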
Darshan
2
votes
3 answers

How to get tables registered as spark table into data frame

I have imported tables from a PostgreSQL database into spark-sql using the Spark Thrift Server JDBC connection, and from beeline I can now see these tables. Is there any way I can convert these tables into Spark DataFrames?
nat
2
votes
3 answers

Exporting big data to CSV file - crashes

I'm trying to export big data to CSV files (above 20,000 lines; it can easily exceed 100,000 lines). After I try to download the file, it crashes - download failed due to network failures. (I managed to download an 18,000-line file that weighs 1.7 MB,…
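Failures on large exports are often caused by building the entire file in memory before sending it. A sketch of writing the CSV incrementally instead (the `rows` iterable is a stand-in for whatever produces the data; a web app would stream to the response rather than to a local path):

```python
import csv

def export_rows(rows, path):
    """Write rows to a CSV file incrementally; memory use stays flat no
    matter how many rows, since each row is written as it arrives."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)
```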
badbuda
2
votes
2 answers

Can we have null value for BIGINT column in HIVE

I have a question, probably a basic one. I would like to give a null value to a column whose data type is BIGINT. Is that possible?
user3752667
2
votes
3 answers

First steps for OLAP within BigData world

First of all, I may be misinformed about big data capabilities nowadays, so don't hesitate to correct me if I'm too optimistic. I usually work with regular KPIs, like: show me the count of new clients where they meet certain complex conditions (joining…
user1464922
2
votes
1 answer

Java heap utilization in spark job

I am running a Spark Streaming job through Java. I have a 4-node cluster on AWS with the Cloudera distribution, of which 3 are compute nodes. I need to record how much Java heap is utilized on each executor/node of the cluster when my job runs. I am…
Anup
2
votes
0 answers

How to optimize treeAggregate of LBFGS on spark

I'm running LBFGS on Spark, with 5 features and 100,000 records, and found that treeAggregate is time-consuming. I have 100 cores, and every 'treeAggregate at LBFGS.scala:218' job has 10,000+ tasks.
2
votes
1 answer

Storing deltas in the database instead of the whole object

I would like to store/update an object with a long list of fields in a database. I am planning to use SQL Server (not 2016), and I have no predefined data format to store this object, which means I can store it as JSON/BSON, as a binary blob, and…
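The core of the delta approach the title asks about: persist only the fields that changed between versions, and rebuild the current object by replaying deltas over a base. A minimal sketch with plain dicts (a simplification: it does not encode deleted fields, which a real delta format would need to):

```python
def diff(old, new):
    """Return only the fields that changed between two versions of an
    object -- the delta to store instead of the whole object."""
    return {k: v for k, v in new.items() if old.get(k) != v}

def apply_delta(base, delta):
    """Reconstruct the newer version by overlaying a delta on a base."""
    merged = dict(base)
    merged.update(delta)
    return merged
```

The trade-off is classic: deltas shrink writes when few fields change per update, but reads must replay the delta chain (or consult a periodic snapshot).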
Deniz
2
votes
1 answer

Convert a DStream to a Data Frame

Hi, I am trying to read tweets from Twitter using Apache Spark Streaming and to convert them to a DataFrame. I have pasted my approach below; however, I am not able to get the correct approach. Some pointers would be welcome. As…
Ayon
2
votes
1 answer

BigQuery : is it possible to execute another query inside an UDF?

I have a table that records a row for each unique user per day with some aggregated stats for that user on that day, and I need to produce a report that tells me, for each day, the number of unique users in the last 30 days including that day. e.g.…
2
votes
1 answer

Pick m points per cluster

I have 100M pairs of the form: (point_index, cluster_index). The goal is to select (the first? It doesn't matter) m points for every cluster. There are at most 16k clusters. How can I do this efficiently? m will be small, <=100. My first…
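A single pass over the pairs suffices, since "the first m" per cluster is acceptable. Sketched in plain Python (a distributed version would apply the same cap per partition and merge, again capping at m):

```python
from collections import defaultdict

def first_m_per_cluster(pairs, m):
    """One pass over (point_index, cluster_index) pairs, keeping the
    first m points seen for each cluster. O(number of pairs) time; the
    result holds at most m points per cluster (<= 16k * m entries)."""
    picked = defaultdict(list)
    for point, cluster in pairs:
        bucket = picked[cluster]
        if len(bucket) < m:
            bucket.append(point)
    return dict(picked)

sel = first_m_per_cluster([(10, 0), (11, 0), (12, 1), (13, 0)], m=2)
# sel == {0: [10, 11], 1: [12]}
```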
gsamaras