Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volumes. Questions typically concern infrastructure, algorithms, statistics, and data structures.

Big data is not only data of huge volume; it is also characterized by velocity, veracity, and variety.

Several characteristics set big data apart as a distinct concept:

Data

  • The data set is too large to be processed on a single computer.
  • Relationships between data elements are extremely complex.

Algorithms

  • Single-machine algorithms with worse than O(N) running time would take years to finish at this scale.
  • Fast distributed algorithms are used instead; the sketch below shows the map/reduce pattern they build on.
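
A minimal single-process sketch of the map/reduce pattern that distributed frameworks parallelize across machines (the partitions stand in for data shards on different nodes):

    from functools import reduce

    # Toy data, split into partitions the way a cluster shards data across nodes.
    partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

    # Map step: each node computes a partial result (sum, count) locally, in O(n).
    def map_partition(part):
        return (sum(part), len(part))

    # Reduce step: partial results are merged; this is cheap compared to the data size.
    def merge(a, b):
        return (a[0] + b[0], a[1] + b[1])

    total, count = reduce(merge, map(map_partition, partitions))
    print(total / count)  # global mean: 5.0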

Storage

  • The underlying storage must be fault-tolerant and keep data consistent despite device failures.
  • No single storage device can hold the entire data set.

Ecosystem

  • "Big data" also refers to the set of tools used to process huge amounts of data, known as the big data ecosystem. Popular tools include HDFS, Spark, and MapReduce (see the sketch below).
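
A minimal PySpark sketch of the kind of distributed job these tools run, assuming a local pyspark installation; the input path is hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

    # Each transformation is distributed across the cluster by Spark;
    # nothing is computed until an action (take/saveAsTextFile) runs.
    counts = (spark.sparkContext.textFile("hdfs:///data/input.txt")  # hypothetical path
              .flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))
    spark.stop()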
7919 questions
2
votes
1 answer

How do databases handle redundant values?

Suppose I have a database with several columns. In each column there are lots of values that are often similar. For example, I might have a column named "Description" whose value could be "This is the description for the measurement". This…
Ohumeronen
  • 1,769
  • 2
  • 14
  • 28
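
Two common answers to the question above are normalization (store the repeated value once in a lookup table) and columnar dictionary encoding. A minimal sketch of dictionary encoding in Python, with made-up values:

    # Dictionary encoding: store each distinct string once and keep only small
    # integer codes per row -- the idea behind columnar compression of
    # repetitive columns. Values here are made up.
    descriptions = [
        "This is the description for the measurement",
        "This is the description for the measurement",
        "Another description",
        "This is the description for the measurement",
    ]

    dictionary = {}   # distinct value -> small integer code
    codes = []        # one code per row instead of one string per row
    for value in descriptions:
        codes.append(dictionary.setdefault(value, len(dictionary)))

    decode = {code: value for value, code in dictionary.items()}
    assert [decode[c] for c in codes] == descriptions
    print(dictionary, codes)  # 2 distinct strings, 4 small codes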
2
votes
1 answer

Working with AWS S3 Large Public Data Set

AWS has several public "big data" data sets available. Some are hosted for free on EBS, and others, like the NASA NEX climate data, are hosted on S3. I have found more discussion on how to work with those that are hosted in EBS, but have been unable to…
csg2136
  • 235
  • 4
  • 10
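
Public S3 data sets can usually be read without AWS credentials by disabling request signing in boto3. A sketch; the bucket and prefix names are illustrative:

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # UNSIGNED skips request signing, which public data sets allow.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    # List a prefix, then stream one object instead of downloading the whole set.
    resp = s3.list_objects_v2(Bucket="nasanex", Prefix="NEX-DCP30/", MaxKeys=5)
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])

    if resp.get("Contents"):
        key = resp["Contents"][0]["Key"]
        body = s3.get_object(Bucket="nasanex", Key=key)["Body"]
        chunk = body.read(1024)  # read incrementally; objects can be very large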
2
votes
1 answer

Why do we need a coarse quantizer?

In Product Quantization for Nearest Neighbor Search, when it comes to section IV.A, it says they will use a coarse quantizer too (which, the way I see it, is just a much smaller product quantizer, smaller w.r.t. k, the number of…
gsamaras
  • 71,951
  • 46
  • 188
  • 305
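
A rough sketch of the idea behind that section (the IVFADC scheme): the coarse quantizer assigns each vector to one of a modest number of cells, so a query scans only a few cells instead of the whole collection, and product quantization then encodes the residual to the cell centroid. Sizes here are made up:

    import numpy as np
    from scipy.cluster.vq import kmeans2

    rng = np.random.default_rng(0)
    data = rng.normal(size=(10_000, 32))

    k_coarse = 64  # far smaller than the collection; tunable
    centroids, labels = kmeans2(data, k_coarse, minit="++")

    residuals = data - centroids[labels]   # what PQ would then encode

    query = rng.normal(size=32)
    # Probe only the nearest few coarse cells, not all 10,000 vectors.
    nearest_cells = np.argsort(((centroids - query) ** 2).sum(axis=1))[:4]
    candidates = np.flatnonzero(np.isin(labels, nearest_cells))
    print(f"scanning {candidates.size} of {data.shape[0]} vectors")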
2
votes
1 answer

Compress output of Hadoop Archive tool

I'm using Hadoop Archive to reduce the number of files in my Hadoop cluster, but for data retention I want to keep my data as long as possible. The problem is that Hadoop Archive does not reduce the folder size (my folder has multiple file types, both small…
dltu
  • 34
  • 8
2
votes
2 answers

Best practice for storing and indexing 1M+ XML documents?

I have an archive of several years' worth of XML documents. There are 1M+ unique document subjects, and each subject may have one or more documents for any given year. Each document contains hundreds of nodes and parameters. Total XML cache is about…
MarathonStudios
  • 2,849
  • 4
  • 20
  • 18
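
A usual pattern for an archive like this is to keep the raw XML files on disk (or in object storage) and index only the queried fields in a database. A rough sketch with SQLite and streaming parsing; the tag and attribute names ("document", "subject", "year") and the paths are hypothetical:

    import sqlite3
    import xml.etree.ElementTree as ET

    conn = sqlite3.connect("xml_index.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS docs
                    (subject TEXT, year INTEGER, path TEXT)""")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_subject ON docs(subject, year)")

    def index_file(path):
        # iterparse streams the file, so memory stays flat even for big documents
        for _, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "document":                      # hypothetical tag
                conn.execute("INSERT INTO docs VALUES (?, ?, ?)",
                             (elem.get("subject"), int(elem.get("year")), path))
                elem.clear()                                # free the parsed subtree

    index_file("archive/2016/batch_001.xml")                # hypothetical path
    conn.commit()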
2
votes
0 answers

dplyr left_join with similar, but not exactly the same, columns of strings (pmatch or str_detect)

I recently posted: dplyr, lapply, or Map to identify information from one data.frame and place it into another. My main issue involves using dplyr/lapply to combine two data.frames by a column of strings. The strings are first names, but they are not…
beemyfriend
  • 85
  • 1
  • 11
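
The question is about R's dplyr, but the underlying technique is a fuzzy key match computed before the join. A Python/pandas equivalent, shown only to illustrate the idea (the names and the 0.3 cutoff are made up):

    import difflib
    import pandas as pd

    left = pd.DataFrame({"name": ["Jonathan", "Liz", "Mike"], "score": [1, 2, 3]})
    right = pd.DataFrame({"first_name": ["Jon", "Elizabeth", "Michael"],
                          "age": [30, 40, 50]})

    def closest(name, candidates):
        # best approximate match, or None when nothing clears the cutoff
        match = difflib.get_close_matches(name, candidates, n=1, cutoff=0.3)
        return match[0] if match else None

    left["first_name"] = left["name"].apply(lambda n: closest(n, right["first_name"]))
    merged = left.merge(right, on="first_name", how="left")
    print(merged)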
2
votes
1 answer

Create Adjacency Matrix in Python for large Dataset

I have a problem representing website user behaviour in an adjacency matrix in Python. I want to analyze the user interaction between 43 different websites to see which websites are used together. The given data set has about 13,000,000 lines…
Duesentrieb
  • 492
  • 2
  • 7
  • 18
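
With only 43 sites the adjacency matrix itself is tiny; the trick is building it from 13,000,000 rows without a dense intermediate. One way: a sparse user-by-site incidence matrix B, where B^T B counts how often two sites share a user. The file and column names are hypothetical:

    import numpy as np
    import pandas as pd
    import scipy.sparse as sp

    df = pd.read_csv("visits.csv", usecols=["user_id", "site_id"])  # hypothetical file

    users, user_idx = np.unique(df["user_id"], return_inverse=True)
    sites, site_idx = np.unique(df["site_id"], return_inverse=True)

    # One nonzero per (user, site) row; duplicates are summed by CSR conversion.
    B = sp.coo_matrix((np.ones(len(df)), (user_idx, site_idx)),
                      shape=(len(users), len(sites))).tocsr()
    B.data[:] = 1                      # count each user-site pair once

    A = (B.T @ B).toarray()            # 43 x 43, tiny despite 13M input rows
    np.fill_diagonal(A, 0)             # drop self-adjacency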
2
votes
1 answer

Python: many-to-many comparison to find required set of data

This is my first question, so please forgive any mistakes. I have a large file (csv) with many (~10,000,000+) lines of information like the following example: date;box_id;box_length;box_width;box_height;weight;type --snip-- 1999-01-01…
A.I.
  • 25
  • 6
2
votes
2 answers

Need a method to filter data for ids having more than one record in Hive

Consider the table below in Hive: here I need to find the unique combinations of household, vehicle, and customer. But there is a condition: if for the same household and vehicle there are two different customers with roles DRIVER and OWNER, I have…
Vaishak
  • 607
  • 3
  • 8
  • 30
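
The usual Hive idiom for this is a window function: rank the customers within each (household, vehicle) group so that, say, OWNER wins when both a DRIVER and an OWNER exist. A sketch with the equivalent PySpark DataFrame API; the table and column names are guesses from the excerpt, and the same row_number() idiom works in HiveQL:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("households")          # hypothetical table

    # Within each (household, vehicle) group, sort OWNER before other roles.
    w = (Window.partitionBy("household", "vehicle")
               .orderBy(F.when(F.col("role") == "OWNER", 0).otherwise(1)))

    result = (df.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)   # keep one row per group
                .drop("rn"))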
2
votes
1 answer

Restricting a YARN container to execute only one task at a time

I am running a Spark program on a Hadoop cluster, which uses the YARN scheduler to run the tasks. However, I notice strange behavior: YARN sometimes kills a task with an out-of-memory complaint, whereas if I execute the tasks in rounds, that…
pythonic
  • 20,589
  • 43
  • 136
  • 219
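
On YARN, Spark runs one concurrent task per executor core, so giving each executor a single core effectively limits a container to one task at a time. A sketch; the memory figure is a placeholder to size against the cluster:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (SparkConf()
            .set("spark.executor.cores", "1")    # one concurrent task per executor
            .set("spark.task.cpus", "1")
            .set("spark.executor.memory", "4g")) # placeholder

    spark = SparkSession.builder.config(conf=conf).getOrCreate()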
2
votes
2 answers

Can we integrate Hadoop with Python?

I have a project requirement: I'm using a Python script to analyze the data. Initially I used txt files as input to that Python script, but as the data grows I have to switch my storage platform to Hadoop HDFS. How can I provide HDFS data to…
M_Gandhi
  • 108
  • 2
  • 10
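
One long-standing option is Hadoop Streaming, which runs any stdin/stdout program as a mapper or reducer, so an existing Python script needs little change. A word-count sketch; the jar and HDFS paths in the comment are placeholders:

    # Run with something like (paths are placeholders):
    #   hadoop jar hadoop-streaming.jar -input /data/in -output /data/out \
    #       -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py
    import sys

    def mapper():
        # emit "word<TAB>1" for every word on stdin
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")

    def reducer():
        # input arrives sorted by key, so counts for a word are contiguous
        current, count = None, 0
        for line in sys.stdin:
            word, n = line.rsplit("\t", 1)
            if word != current:
                if current is not None:
                    print(f"{current}\t{count}")
                current, count = word, 0
            count += int(n)
        if current is not None:
            print(f"{current}\t{count}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()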
2
votes
1 answer

Time-Efficient Wide to Long Conversion Pandas

I have a dataset of around 54 million rows that I need to read from a tab-delimited text file, convert from wide to long format, and write to a new text file. The data is too large to fit in memory, so I've been using iterators. There are three…
jesseWUT
  • 581
  • 4
  • 14
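
A common pattern is to read the file in chunks, melt each chunk, and append it to the output, so memory use stays bounded by the chunk size. A sketch; the column names are hypothetical stand-ins for the three value columns:

    import pandas as pd

    id_cols = ["record_id"]
    value_cols = ["measure_a", "measure_b", "measure_c"]

    first = True
    for chunk in pd.read_csv("wide.txt", sep="\t", chunksize=1_000_000):
        long_chunk = chunk.melt(id_vars=id_cols, value_vars=value_cols,
                                var_name="measure", value_name="value")
        # write the header once, then append
        long_chunk.to_csv("long.txt", sep="\t", index=False,
                          header=first, mode="w" if first else "a")
        first = False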
2
votes
1 answer

How does Locality-Sensitive Hashing (LSH) work?

I've already read this question, but unfortunately it didn't help. What I don't understand is what we do once we know which bucket to assign our high-dimensional query vector q to: suppose that using our set of locality sensitive family…
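
A minimal sketch of one LSH scheme (random hyperplanes for cosine similarity): every vector gets a bit-string signature, the signature picks a bucket, and only vectors sharing the query's bucket get an exact distance check. All sizes here are made up:

    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(0)
    dim, n_bits = 128, 16
    planes = rng.normal(size=(n_bits, dim))

    def bucket(v):
        bits = (planes @ v) > 0          # sign pattern against random hyperplanes
        return bits.tobytes()            # hashable bucket key

    index = defaultdict(list)
    data = rng.normal(size=(100_000, dim))
    for i, v in enumerate(data):
        index[bucket(v)].append(i)

    q = rng.normal(size=dim)
    candidates = index[bucket(q)]        # only these get an exact distance check
    best = min(candidates, key=lambda i: np.linalg.norm(data[i] - q), default=None)
    print(len(candidates), best)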
2
votes
5 answers

How to most efficiently increase values over a specified range in a large array and then find the largest value

So I just had a programming test for an interview, and I consider myself a decent programmer; however, I was unable to meet the time constraints on the online test (and no debugger was allowed). Essentially the question was: given a range of indices…
user3086956
  • 55
  • 1
  • 7
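
This is the classic difference-array trick: apply each range increment in O(1) at the range endpoints, then one prefix-sum pass reconstructs the array, so m updates on an array of size n cost O(n + m) instead of O(n * m). A sketch with made-up updates:

    def max_after_updates(n, updates):
        diff = [0] * (n + 1)
        for lo, hi, amount in updates:      # increment indices lo..hi inclusive
            diff[lo] += amount
            diff[hi + 1] -= amount
        best = running = 0
        for d in diff[:n]:
            running += d                    # prefix sum rebuilds the real values
            best = max(best, running)
        return best

    print(max_after_updates(5, [(0, 2, 100), (1, 4, 50), (2, 3, 25)]))  # 175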
2
votes
2 answers

Is there a way to catch an executor-killed exception in Spark?

During execution of my Spark program, sometimes (the reason is still a mystery to me) YARN kills containers (executors) with the message that the memory limit was exceeded. My program does recover, though, with Spark re-executing the task by…
pythonic
  • 20,589
  • 43
  • 136
  • 219