Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volumes. Questions typically concern infrastructure, algorithms, statistics, and data structures.

Big data is not only data of huge volume; it is also characterized by velocity, veracity, and variety.

Several characteristics set big data apart as a distinct concept:

Data

  • The data set is too large to be processed on a single computer.
  • Relationships between data elements are extremely complex.

Algorithms

  • Single-machine algorithms with worse than O(N) running time would take years to finish at this scale.
  • Fast distributed algorithms are used instead; the sketch below shows the map/reduce pattern they build on.
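
A minimal single-process sketch of the map/reduce pattern that distributed frameworks parallelize across machines (the partitions stand in for data shards on different nodes):

    from functools import reduce

    # Toy data, split into partitions the way a cluster shards data across nodes.
    partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

    # Map step: each node computes a partial result (sum, count) locally, in O(n).
    def map_partition(part):
        return (sum(part), len(part))

    # Reduce step: partial results are merged; this is cheap compared to the data size.
    def merge(a, b):
        return (a[0] + b[0], a[1] + b[1])

    total, count = reduce(merge, map(map_partition, partitions))
    print(total / count)  # global mean: 5.0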

Storage

  • The underlying storage must be fault-tolerant and keep data consistent despite device failures.
  • No single storage device can hold the entire data set.

Ecosystem

  • "Big data" also refers to the set of tools used to process huge amounts of data, known as the big data ecosystem. Popular tools include HDFS, Spark, and MapReduce (see the sketch below).
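
A minimal PySpark sketch of the kind of distributed job these tools run, assuming a local pyspark installation; the input path is hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

    # Each transformation is distributed across the cluster by Spark;
    # nothing is computed until an action (take/saveAsTextFile) runs.
    counts = (spark.sparkContext.textFile("hdfs:///data/input.txt")  # hypothetical path
              .flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))
    spark.stop()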
7919 questions
2
votes
1 answer

How do databases handle redundant values?

Suppose I have a database with several columns. In each column there are lots of values that are often similar. For example, I might have a column named "Description" whose value could be "This is the description for the measurement". This…
Ohumeronen
  • 1,769
  • 2
  • 14
  • 28
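
Two common answers to the question above are normalization (store the repeated value once in a lookup table) and columnar dictionary encoding. A minimal sketch of dictionary encoding in Python, with made-up values:

    # Dictionary encoding: store each distinct string once and keep only small
    # integer codes per row -- the idea behind columnar compression of
    # repetitive columns. Values here are made up.
    descriptions = [
        "This is the description for the measurement",
        "This is the description for the measurement",
        "Another description",
        "This is the description for the measurement",
    ]

    dictionary = {}   # distinct value -> small integer code
    codes = []        # one code per row instead of one string per row
    for value in descriptions:
        codes.append(dictionary.setdefault(value, len(dictionary)))

    decode = {code: value for value, code in dictionary.items()}
    assert [decode[c] for c in codes] == descriptions
    print(dictionary, codes)  # 2 distinct strings, 4 small codes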
2
votes
1 answer

Working with AWS S3 Large Public Data Set

AWS has several public "big data" data sets available. Some are hosted for free on EBS, and others, like the NASA NEX climate data, are hosted on S3. I have found more discussion on how to work with those that are hosted in EBS, but have been unable to…
csg2136
  • 235
  • 4
  • 10
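
Public S3 data sets can usually be read without AWS credentials by disabling request signing in boto3. A sketch; the bucket and prefix names are illustrative:

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # UNSIGNED skips request signing, which public data sets allow.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    # List a prefix, then stream one object instead of downloading the whole set.
    resp = s3.list_objects_v2(Bucket="nasanex", Prefix="NEX-DCP30/", MaxKeys=5)
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])

    if resp.get("Contents"):
        key = resp["Contents"][0]["Key"]
        body = s3.get_object(Bucket="nasanex", Key=key)["Body"]
        chunk = body.read(1024)  # read incrementally; objects can be very large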
2
votes
1 answer

Why do we need a coarse quantizer?

In Product Quantization for Nearest Neighbor Search, when it comes to section IV.A, it says they will use a coarse quantizer too (which, the way I see it, is just a much smaller product quantizer, smaller w.r.t. k, the number of…
gsamaras
  • 71,951
  • 46
  • 188
  • 305
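
A rough sketch of the idea behind that section (the IVFADC scheme): the coarse quantizer assigns each vector to one of a modest number of cells, so a query scans only a few cells instead of the whole collection, and product quantization then encodes the residual to the cell centroid. Sizes here are made up:

    import numpy as np
    from scipy.cluster.vq import kmeans2

    rng = np.random.default_rng(0)
    data = rng.normal(size=(10_000, 32))

    k_coarse = 64  # far smaller than the collection; tunable
    centroids, labels = kmeans2(data, k_coarse, minit="++")

    residuals = data - centroids[labels]   # what PQ would then encode

    query = rng.normal(size=32)
    # Probe only the nearest few coarse cells, not all 10,000 vectors.
    nearest_cells = np.argsort(((centroids - query) ** 2).sum(axis=1))[:4]
    candidates = np.flatnonzero(np.isin(labels, nearest_cells))
    print(f"scanning {candidates.size} of {data.shape[0]} vectors")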
2
votes
1 answer

Compress output of Hadoop Archive tool

I'm using Hadoop Archive to reduce the number of files in my Hadoop cluster, but for data retention I want to keep my data as long as possible. The problem is that Hadoop Archive does not reduce the folder size (my folder has multiple file types, both small…
dltu
  • 34
  • 8
2
votes
2 answers

Best practice for storing and indexing 1M+ XML documents?

I have an archive of several years' worth of XML documents. There are 1M+ unique document subjects, and each subject may have one or more documents for any given year. Each document contains hundreds of nodes and parameters. Total XML cache is about…
MarathonStudios
  • 2,849
  • 4
  • 20
  • 18
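
A usual pattern for an archive like this is to keep the raw XML files on disk (or in object storage) and index only the queried fields in a database. A rough sketch with SQLite and streaming parsing; the tag and attribute names ("document", "subject", "year") and the paths are hypothetical:

    import sqlite3
    import xml.etree.ElementTree as ET

    conn = sqlite3.connect("xml_index.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS docs
                    (subject TEXT, year INTEGER, path TEXT)""")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_subject ON docs(subject, year)")

    def index_file(path):
        # iterparse streams the file, so memory stays flat even for big documents
        for _, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "document":                      # hypothetical tag
                conn.execute("INSERT INTO docs VALUES (?, ?, ?)",
                             (elem.get("subject"), int(elem.get("year")), path))
                elem.clear()                                # free the parsed subtree

    index_file("archive/2016/batch_001.xml")                # hypothetical path
    conn.commit()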
2
votes
0 answers

dplyr left_join with similar, but not exactly the same, columns of strings (pmatch or str_detect)

I recently posted: dplyr, lapply, or Map to identify information from one data.frame and place it into another. My main issue involves using dplyr/lapply to combine two data.frames by a column of strings. The strings are first names, but they are not…
beemyfriend
  • 85
  • 1
  • 11
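
The question is about R's dplyr, but the underlying technique is a fuzzy key match computed before the join. A Python/pandas equivalent, shown only to illustrate the idea (the names and the 0.3 cutoff are made up):

    import difflib
    import pandas as pd

    left = pd.DataFrame({"name": ["Jonathan", "Liz", "Mike"], "score": [1, 2, 3]})
    right = pd.DataFrame({"first_name": ["Jon", "Elizabeth", "Michael"],
                          "age": [30, 40, 50]})

    def closest(name, candidates):
        # best approximate match, or None when nothing clears the cutoff
        match = difflib.get_close_matches(name, candidates, n=1, cutoff=0.3)
        return match[0] if match else None

    left["first_name"] = left["name"].apply(lambda n: closest(n, right["first_name"]))
    merged = left.merge(right, on="first_name", how="left")
    print(merged)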
2
votes
1 answer

Create Adjacency Matrix in Python for large Dataset

I have a problem representing website user behaviour in an adjacency matrix in Python. I want to analyze the user interaction between 43 different websites to see which websites are used together. The given data set has about 13,000,000 lines…
Duesentrieb
  • 492
  • 2
  • 7
  • 18
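
With only 43 sites the adjacency matrix itself is tiny; the trick is building it from 13,000,000 rows without a dense intermediate. One way: a sparse user-by-site incidence matrix B, where B^T B counts how often two sites share a user. The file and column names are hypothetical:

    import numpy as np
    import pandas as pd
    import scipy.sparse as sp

    df = pd.read_csv("visits.csv", usecols=["user_id", "site_id"])  # hypothetical file

    users, user_idx = np.unique(df["user_id"], return_inverse=True)
    sites, site_idx = np.unique(df["site_id"], return_inverse=True)

    # One nonzero per (user, site) row; duplicates are summed by CSR conversion.
    B = sp.coo_matrix((np.ones(len(df)), (user_idx, site_idx)),
                      shape=(len(users), len(sites))).tocsr()
    B.data[:] = 1                      # count each user-site pair once

    A = (B.T @ B).toarray()            # 43 x 43, tiny despite 13M input rows
    np.fill_diagonal(A, 0)             # drop self-adjacency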
2
votes
1 answer

Python: many-to-many comparison to find required set of data

This is my first question, so please forgive any mistakes. I have a large file (csv) with many (~10,000,000+) lines of information like the following example: date;box_id;box_length;box_width;box_height;weight;type --snip-- 1999-01-01…
A.I.
  • 25
  • 6
2
votes
2 answers

Need a method to filter data for ids having more than one record in Hive

Consider the table below in Hive: here I need to find the unique combinations of household, vehicle, and customer. But there is a condition: if for the same household and vehicle there are two different customers with roles DRIVER and OWNER, I have…
Vaishak
  • 607
  • 3
  • 8
  • 30
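
The usual Hive idiom for this is a window function: rank the customers within each (household, vehicle) group so that, say, OWNER wins when both a DRIVER and an OWNER exist. A sketch with the equivalent PySpark DataFrame API; the table and column names are guesses from the excerpt, and the same row_number() idiom works in HiveQL:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("households")          # hypothetical table

    # Within each (household, vehicle) group, sort OWNER before other roles.
    w = (Window.partitionBy("household", "vehicle")
               .orderBy(F.when(F.col("role") == "OWNER", 0).otherwise(1)))

    result = (df.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)   # keep one row per group
                .drop("rn"))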
2
votes
1 answer

Restricting a YARN container to execute only one task at a time

I am running a Spark program on a Hadoop cluster, which uses the YARN scheduler to run the tasks. However, I notice strange behavior: YARN sometimes kills a task with an out-of-memory complaint, whereas if I execute the tasks in rounds, that…
pythonic
  • 20,589
  • 43
  • 136
  • 219
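
On YARN, Spark runs one concurrent task per executor core, so giving each executor a single core effectively limits a container to one task at a time. A sketch; the memory figure is a placeholder to size against the cluster:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (SparkConf()
            .set("spark.executor.cores", "1")    # one concurrent task per executor
            .set("spark.task.cpus", "1")
            .set("spark.executor.memory", "4g")) # placeholder

    spark = SparkSession.builder.config(conf=conf).getOrCreate()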
2
votes
2 answers

Can we integrate Hadoop with Python?

I have a project requirement: I'm using a Python script to analyze the data. Initially I used txt files as input to that Python script, but as the data grows I have to switch my storage platform to Hadoop HDFS. How can I provide HDFS data to…
M_Gandhi
  • 108
  • 2
  • 10
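
One long-standing option is Hadoop Streaming, which runs any stdin/stdout program as a mapper or reducer, so an existing Python script needs little change. A word-count sketch; the jar and HDFS paths in the comment are placeholders:

    # Run with something like (paths are placeholders):
    #   hadoop jar hadoop-streaming.jar -input /data/in -output /data/out \
    #       -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py
    import sys

    def mapper():
        # emit "word<TAB>1" for every word on stdin
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")

    def reducer():
        # input arrives sorted by key, so counts for a word are contiguous
        current, count = None, 0
        for line in sys.stdin:
            word, n = line.rsplit("\t", 1)
            if word != current:
                if current is not None:
                    print(f"{current}\t{count}")
                current, count = word, 0
            count += int(n)
        if current is not None:
            print(f"{current}\t{count}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()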
2
votes
1 answer

Time-Efficient Wide to Long Conversion Pandas

I have a dataset of around 54 million rows that I need to read from a tab-delimited text file, convert from wide to long format, and write to a new text file. The data is too large to fit in memory, so I've been using iterators. There are three…
jesseWUT
  • 581
  • 4
  • 14
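
A common pattern is to read the file in chunks, melt each chunk, and append it to the output, so memory use stays bounded by the chunk size. A sketch; the column names are hypothetical stand-ins for the three value columns:

    import pandas as pd

    id_cols = ["record_id"]
    value_cols = ["measure_a", "measure_b", "measure_c"]

    first = True
    for chunk in pd.read_csv("wide.txt", sep="\t", chunksize=1_000_000):
        long_chunk = chunk.melt(id_vars=id_cols, value_vars=value_cols,
                                var_name="measure", value_name="value")
        # write the header once, then append
        long_chunk.to_csv("long.txt", sep="\t", index=False,
                          header=first, mode="w" if first else "a")
        first = False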
2
votes
1 answer

How does Locality-Sensitive Hashing (LSH) work?

I've already read this question, but unfortunately it didn't help. What I don't understand is what we do once we know which bucket to assign our high-dimensional query vector q to: suppose that using our set of locality sensitive family…
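
A minimal sketch of one LSH scheme (random hyperplanes for cosine similarity): every vector gets a bit-string signature, the signature picks a bucket, and only vectors sharing the query's bucket get an exact distance check. All sizes here are made up:

    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(0)
    dim, n_bits = 128, 16
    planes = rng.normal(size=(n_bits, dim))

    def bucket(v):
        bits = (planes @ v) > 0          # sign pattern against random hyperplanes
        return bits.tobytes()            # hashable bucket key

    index = defaultdict(list)
    data = rng.normal(size=(100_000, dim))
    for i, v in enumerate(data):
        index[bucket(v)].append(i)

    q = rng.normal(size=dim)
    candidates = index[bucket(q)]        # only these get an exact distance check
    best = min(candidates, key=lambda i: np.linalg.norm(data[i] - q), default=None)
    print(len(candidates), best)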
2
votes
5 answers

How to most efficiently increase values over a specified range in a large array and then find the largest value

So I just had a programming test for an interview, and I consider myself a decent programmer; however, I was unable to meet the time constraints on the online test (and no debugger was allowed). Essentially the question was: given a range of indices…
user3086956
  • 55
  • 1
  • 7
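
This is the classic difference-array trick: apply each range increment in O(1) at the range endpoints, then one prefix-sum pass reconstructs the array, so m updates on an array of size n cost O(n + m) instead of O(n * m). A sketch with made-up updates:

    def max_after_updates(n, updates):
        diff = [0] * (n + 1)
        for lo, hi, amount in updates:      # increment indices lo..hi inclusive
            diff[lo] += amount
            diff[hi + 1] -= amount
        best = running = 0
        for d in diff[:n]:
            running += d                    # prefix sum rebuilds the real values
            best = max(best, running)
        return best

    print(max_after_updates(5, [(0, 2, 100), (1, 4, 50), (2, 3, 25)]))  # 175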
2
votes
2 answers

Is there a way to catch an executor-killed exception in Spark?

During execution of my Spark program, sometimes (the reason is still a mystery to me) YARN kills containers (executors) with the message that the memory limit was exceeded. My program does recover, though, with Spark re-executing the task by…
pythonic
  • 20,589
  • 43
  • 136
  • 219