Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volumes. Questions are typically related to infrastructure, algorithms, statistics, and data structures.

Big data is not defined by volume alone; other commonly cited characteristics include velocity, veracity, and variety.

Several features distinguish big data as a concept in its own right:

Data

The data is so large that it cannot be processed on a single computer.
Relationships between data elements are extremely complex.

Algorithms

Local (single-machine) algorithms whose running time grows faster than O(N) will likely take many years to finish.
Fast distributed algorithms are used instead.
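
To make the O(N) point concrete, here is a rough back-of-the-envelope sketch. The record count and the operations-per-second rate are assumptions chosen only for illustration, not measurements of any real system.

```python
def estimated_seconds(n_items, ops_per_second, exponent):
    """Rough runtime for an algorithm doing n_items ** exponent basic operations."""
    return n_items ** exponent / ops_per_second

SECONDS_PER_YEAR = 365 * 24 * 3600

n = 10 ** 10     # assumed: ten billion records
rate = 10 ** 9   # assumed: one billion simple operations per second

linear = estimated_seconds(n, rate, 1)      # O(N)
quadratic = estimated_seconds(n, rate, 2)   # O(N^2)

print(f"O(N):   {linear:.0f} seconds")
print(f"O(N^2): {quadratic / SECONDS_PER_YEAR:.0f} years")
```

Under these assumptions the linear pass finishes in about ten seconds, while the quadratic one needs on the order of three thousand years on a single machine.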

Storage

The underlying data storage must be fault-tolerant and keep data in a consistent state regardless of device failures.
A single storage device cannot hold the entire data set.

Ecosystem

Big data is also used as shorthand for the set of tools used to process huge amounts of data, also known as the big data ecosystem. Popular tools include HDFS, Spark, and MapReduce.
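
As an illustration of the programming model behind tools such as MapReduce, the following is a minimal single-process sketch of the map, shuffle, and reduce steps for a word count. It does not use Hadoop or Spark; the function names and the in-memory "partitions" are hypothetical stand-ins for data that would normally live on different machines.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def word_count(partitions):
    # Each partition stands in for a chunk of data held by a separate worker.
    mapped = [pair for part in partitions for line in part for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

partitions = [["big data is big"], ["data tools process data"]]
counts = word_count(partitions)
print(counts)
```

In a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data over the network; only the structure of the computation is shown here.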

7919 questions

1 answer

Spark broadcast vs. Singleton wrapper

I'm new to Spark, and I'm trying to understand the benefit of using a broadcast variable over a singleton wrapper. I am aware that Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce…

asked Feb 18 '17 at 15:32

apolak

1 answer

How to access the static variable in the static inner class which extends reducer in mapreduce?

Here is the code of main method of my program: public class Path { public static void main(String[] args) throws Exception { ArrayList input = new ArrayList (); input.add(args[0]); String output0="/output/path2"; …

java hadoop mapreduce bigdata

asked Feb 16 '17 at 13:26

Yu Gu

2,382
5
18
33

1 answer

How to apply a function to each row in SparkR?

I have a file in CSV format which contains a table with columns "id", "timestamp", "action", "value" and "location". I want to apply a function to each row of the table and I've already written the code in R as follows: user <-…

r apache-spark sparkr bigdata

asked Feb 13 '17 at 13:12

Scorpion775

2 answers

operating with big.matrix

I have to work with big.matrix objects and I can’t compute some functions. Let's consider the following big.matrix: # create big.matrix object x <- as.big.matrix( matrix( sample(1:10, 20, replace=TRUE), 5, 4, dimnames=list( NULL,…

r r-bigmemory bigdata

asked Feb 08 '17 at 11:28

Sara V.

1 answer

Can Redis save 30 TB data?

Redis is a good solution for my work, but the problem is that Redis needs a lot of memory to store data, and my data is too big. Is there a way to store data this large? Can Redis compress the data to save space? Thanks!

redis bigdata

asked Feb 08 '17 at 08:40

Cocoa3338

4 answers

Reading large CSV files from nth line in Python (not from the beginning)

I have 3 huge CSV files containing climate data, each about 5GB. The first cell in each line is the meteorological station's number (from 0 to about 100,000); each station contains from 1 to 800 lines in each file, which is not necessarily equal in…

python performance csv bigdata

asked Feb 06 '17 at 08:28

Mohammad ElNesr

2,477
4
27
44
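
One possible approach to the question above, sketched with the standard library only: `itertools.islice` skips the leading lines without parsing them as CSV. The helper name `read_rows_from` and the sample data are hypothetical. The sketch assumes no quoted field contains an embedded newline, and note that the skipped lines are still read sequentially from disk.

```python
import csv
import io
from itertools import islice

def read_rows_from(file_obj, start_line, max_rows=None):
    """Yield parsed CSV rows starting at 0-based line index start_line."""
    stop = None if max_rows is None else start_line + max_rows
    # islice discards the first start_line lines before csv.reader sees them.
    yield from csv.reader(islice(file_obj, start_line, stop))

# Hypothetical sample standing in for a large file opened with open(path, newline="").
sample = io.StringIO("0,1.5\n0,2.5\n1,3.0\n2,4.0\n")
rows = list(read_rows_from(sample, start_line=2, max_rows=1))
print(rows)
```

For true random access without scanning, one would need a separately built index of byte offsets per station, which is beyond this sketch.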

1 answer

Identifying similar values by iterating between two Pandas dataframes.

I have 2 Pandas dataframes which are of unequal length. I have quoted an example below. My code should run through the values of apples in the 1st data frame and check whether each exists in the 2nd one (there will always be a value existing in the 2nd…

python pandas dataframe data-analysis bigdata

asked Feb 02 '17 at 11:18

Arun Krishnan

1 answer

Error while importing gobblin gradle project into IDE

I am getting this error when I try to import the gobblin distribution into my IDE. I have tried both IntelliJ and Eclipse, without any luck. Below are the errors which I get when I try to import. In Eclipse the error…

eclipse intellij-idea bigdata gobblin

asked Feb 02 '17 at 05:10

Sayyad Ghazi

1 answer

Read an ASCII file based on its headlines in Matlab

I have a file like this: ID LHW dms 1 105.28 1 2 357.01 0 3 150.23 3 My question is if it is possible to get one column value based on the headline? I can of course get LHW by its column position, 2, but I would like to get it by just…

matlab bigdata

asked Feb 01 '17 at 22:14

KGB91

1 answer

How to migrate data between clusters?

I have to duplicate Hive tables to another cluster, keeping the schema and the hierarchy of my tables, so my question is: what is the safest and most appropriate way to do it, in order to have exact copies of the tables (and databases) of Cluster1 in…

hadoop hive hdfs data-migration bigdata

asked Feb 01 '17 at 16:36

mttb12

3 answers

Count number of character occurrences from input text file

How to convert flatMap of a text file to flatMap of characters? I have to count the occurrences of each character in a text file. What approach should I take after the following code? val words = readme.flatMap(line => line.split(" ")).collect()

scala apache-spark rdd flatmap bigdata

asked Feb 01 '17 at 10:42

Govind Yadav
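
The question above is about Scala and Spark; the following is only a plain-Python analogue of the same idea, not Spark code. Here `chain.from_iterable` plays the role of flatMap by flattening each line into its characters, and `Counter` does the per-character counting. Skipping whitespace is an assumption made for this sketch, and the function name `char_counts` is hypothetical.

```python
from collections import Counter
from itertools import chain

def char_counts(lines):
    """Count occurrences of each non-whitespace character across all lines."""
    # Flatten the lines into one stream of characters (the flatMap step).
    chars = chain.from_iterable(lines)
    return Counter(ch for ch in chars if not ch.isspace())

counts = char_counts(["ab a", "b c"])
print(counts)
```

The key point is that the flattening step should produce characters rather than words, so splitting on spaces is not needed.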

1 answer

Error: while processing statement: FAILED: Hive Internal Error: hive.mapred.supports.subdirectories must be true

I stumbled on an error: Error while processing statement: FAILED: Hive Internal Error: hive.mapred.supports.subdirectories must be true if any one of following is true: hive.optimize.listbucketing , mapred.input.dir.recursive and…

hadoop recursion optimization hive bigdata

asked Feb 01 '17 at 04:03

galih

0 answers

How to get elapsed time for a Hadoop task on local mode?

Hi, I am trying to run the WordCount program with Hadoop in local/standalone mode and I want to see the time needed for the task. I'm using the code from the Hadoop website. I tried adding this at the end of the code but it prints out…

java hadoop word-count bigdata

asked Jan 31 '17 at 20:51

h.ni

1 answer

Is there a clever HBase Schema to Aid with Discovering Missing Value?

Let's assume I have billions of rows in my HBase table. The rows in this table change slowly, meaning there will be new rowkeys and some rowkeys get deleted. I receive lots of events per row. However, there will be very few rows that will not have…

mapreduce hbase bigdata

asked Jan 25 '17 at 21:33

hba

7,406
10
63
105

1 answer

Effectively derive term co-occurrence matrix from Google Ngrams

I need to use the lexical data from Google Books N-grams to construct a (sparse!) matrix of term co-occurrences (where rows are words and columns are the same words, and the cells reflect how many times they appear in the same context window). The…

sparse-matrix n-gram google-books text2vec bigdata

asked Jan 25 '17 at 14:04

user3554004

1,044
9
24
