Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volumes. Questions typically relate to infrastructure, algorithms, statistics, and data structures.

Big data is not only about huge volume; it is also characterized by velocity, veracity, and variety.

Several features distinguish big data as a concept in its own right:

Data

  • Data is so large it cannot be processed on a single computer.
  • Relationships between data elements are extremely complex.

Algorithms

  • Single-machine algorithms whose running time grows faster than O(N) will likely take many years to finish on data sets of this size.
  • Fast distributed algorithms are used instead.
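
To make the point about running time concrete, here is a rough back-of-envelope sketch. The figures are assumptions chosen for illustration only (one billion records, one billion simple operations per second on a single machine); real workloads will differ.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def estimated_seconds(operations: float, ops_per_second: float = 1e9) -> float:
    """Estimate wall-clock time for a given operation count.

    ops_per_second is an assumed single-machine throughput, not a measurement.
    """
    return operations / ops_per_second


n = 1e9  # assumed number of records in the data set

linear = estimated_seconds(n)                       # O(N)
linearithmic = estimated_seconds(n * math.log2(n))  # O(N log N)
quadratic = estimated_seconds(n ** 2)               # O(N^2)

print(f"O(N):       {linear:.0f} s")
print(f"O(N log N): {linearithmic:.0f} s")
print(f"O(N^2):     {quadratic / SECONDS_PER_YEAR:.1f} years")
```

Under these assumptions the linear and linearithmic passes finish in seconds, while the quadratic one needs roughly three decades, which is why work at this scale is usually spread across many machines.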

Storage

  • The underlying data storage must be fault-tolerant and keep data in a consistent state regardless of device failures.
  • A single storage device cannot hold the entire data set.

Ecosystem

  • Big data is also synonymous with the set of tools used to process huge amounts of data, known as the big data ecosystem. Popular tools include HDFS, Spark, and MapReduce.
7919 questions
2
votes
1 answer

Spark broadcast vs. Singleton wrapper

I'm new to Spark, and I'm trying to understand the benefit of using a broadcast variable over a singleton wrapper. I am aware that Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce…
apolak
2
votes
1 answer

How to access the static variable in the static inner class which extends reducer in mapreduce?

Here is the code of main method of my program: public class Path { public static void main(String[] args) throws Exception { ArrayList input = new ArrayList (); input.add(args[0]); String output0="/output/path2"; …
Yu Gu
2
votes
1 answer

How to apply a function to each row in SparkR?

I have a file in CSV format which contains a table with columns "id", "timestamp", "action", "value" and "location". I want to apply a function to each row of the table and I've already written the code in R as follows: user <-…
2
votes
2 answers

operating with big.matrix

I have to work with big.matrix objects and I can’t compute some functions. Let's consider the following big.matrix: # create big.matrix object x <- as.big.matrix( matrix( sample(1:10, 20, replace=TRUE), 5, 4, dimnames=list( NULL,…
Sara V.
2
votes
1 answer

Can Redis save 30 TB data?

Redis is a good solution for my work, but the problem is that Redis needs a lot of memory to store data, and my data is too big. Is there a solution that lets me store this much data? Can Redis compress the data before saving it? Thanks!
Cocoa3338
2
votes
4 answers

Reading large CSV files from nth line in Python (not from the beginning)

I have 3 huge CSV files containing climate data, each about 5 GB. The first cell in each line is the meteorological station's number (from 0 to about 100,000). Each station contains from 1 to 800 lines in each file, which is not necessarily equal in…
Mohammad ElNesr
2
votes
1 answer

Identifying similar values by iterating between two Pandas dataframes.

I have 2 Pandas dataframes of unequal length. I have quoted an example below. My code should run through the value of apples in the 1st dataframe and check whether it exists in the 2nd one (there will always be a value existing in the 2nd…
Arun Krishnan
2
votes
1 answer

Error while importing gobblin gradle project into IDE

I am getting this error when I try to import the Gobblin distribution into my IDE. I have tried both IntelliJ and Eclipse, without any luck. Below are the errors I get when I try to import. In Eclipse the error…
2
votes
1 answer

Read an ASCII file based on its headlines in MATLAB

I have a file like this: ID LHW dms 1 105.28 1 2 357.01 0 3 150.23 3 My question is if it is possible to get one column value based on the headline? I can of course get LHW by its column position, 2, but I would like to get it by just…
KGB91
2
votes
1 answer

How to migrate data between clusters?

I have to duplicate Hive tables to another cluster, keeping the schema and the hierarchy of my tables. My question is: what is the safest and most appropriate way to do it, in order to have exact copies of the tables (and databases) of Cluster1 into…
mttb12
2
votes
3 answers

Count number of character occurrences from input text file

How do I convert a flatMap of a text file to a flatMap of characters? I have to count the occurrences of each character in a text file. What approach should I take after the following code? val words = readme.flatMap(line => line.split(" ")).collect()
Govind Yadav
2
votes
1 answer

Error: while processing statement: FAILED: Hive Internal Error: hive.mapred.supports.subdirectories must be true

I stumbled on an error: Error while processing statement: FAILED: Hive Internal Error: hive.mapred.supports.subdirectories must be true if any one of following is true: hive.optimize.listbucketing, mapred.input.dir.recursive and…
galih
2
votes
0 answers

How to get elapsed time for a Hadoop task on local mode?

Hi, I am trying to run the WordCount program with Hadoop in local/standalone mode and I want to see the time needed for the task. I'm using the code from the Hadoop website. I tried adding this at the end of the code but it prints out…
h.ni
2
votes
1 answer

Is there a clever HBase Schema to Aid with Discovering Missing Value?

Let's assume I have billions of rows in my HBase table. The rows in this table change slowly, meaning there will be new rowkeys and some rowkeys get deleted. I receive lots of events per row. However, there will be very few rows that will not have…
hba
2
votes
1 answer

Effectively derive term co-occurrence matrix from Google Ngrams

I need to use the lexical data from Google Books N-grams to construct a (sparse!) matrix of term co-occurrences (where rows are words and columns are the same words, and the cells reflect how many times they appear in the same context window). The…
user3554004