Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volumes. Questions are typically about infrastructure, algorithms, statistics, and data structures.

Big data is not defined by volume alone; other characteristics such as velocity, veracity, and variety also matter.

Several features distinguish big data as a concept in its own right:

Data

  • Data is so large it cannot be processed on a single computer.
  • Relationships between data elements are extremely complex.

Algorithms

  • Single-machine algorithms whose running time grows faster than O(N) can take many years to finish on data of this size.
  • Fast distributed algorithms are used instead.
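
To put rough numbers on the first point, here is a back-of-the-envelope sketch in Python. The data set size (10^9 records), the throughput (10^9 simple operations per second), and the function name `years_to_finish` are all illustrative assumptions, not measurements.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def years_to_finish(n, ops_per_second=1e9, exponent=2):
    """Estimate the years a single machine needs for an O(n**exponent) algorithm.

    Assumes one simple operation per step and a constant throughput,
    which ignores memory and I/O limits entirely.
    """
    return (n ** exponent) / ops_per_second / SECONDS_PER_YEAR

# Hypothetical data set of one billion records.
linear_years = years_to_finish(10**9, exponent=1)
quadratic_years = years_to_finish(10**9, exponent=2)

print(f"O(N):   {linear_years * SECONDS_PER_YEAR:.1f} seconds")
print(f"O(N^2): {quadratic_years:.1f} years")
```

Under these assumptions the linear pass takes about a second, while the quadratic one takes decades, which is why work is split across many machines instead.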

Storage

  • The underlying storage must be fault-tolerant and keep data in a consistent state regardless of device failures.
  • A single storage device cannot hold the entire data set.

Ecosystem

  • Big data also refers to the set of tools used to process huge amounts of data, known as the big data ecosystem. Popular tools include HDFS, Spark, and MapReduce.
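
As a rough illustration of the map-and-reduce style that tools such as MapReduce and Spark apply across many machines, here is a minimal single-process sketch in Python. It is not the API of either tool: the function names (`map_partition`, `merge_counts`, `word_count`) and the sample partitions are hypothetical, and the partitions are processed one after another rather than on separate workers.

```python
from collections import Counter
from functools import reduce

def map_partition(lines):
    """Map step: count words in one partition, independently of all others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right

def word_count(partitions):
    # In a real cluster each partition would be handled by a different worker;
    # here the map step simply runs sequentially.
    partials = [map_partition(lines) for lines in partitions]
    return reduce(merge_counts, partials, Counter())

# Hypothetical input, already split into two partitions.
partitions = [
    ["big data is big"],
    ["data is distributed"],
]
totals = word_count(partitions)
print(totals.most_common())
```

The property that matters is that each map call touches only its own partition and the merge is associative, so partial results can be combined in any order.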
7919 questions
38
votes
1 answer

How can I tell when my dataset in R is going to be too large?

I am going to be undertaking some logfile analyses in R (unless I can't do it in R), and I understand that my data needs to fit in RAM (unless I use some kind of fix like an interface to a keyval store, maybe?). So I am wondering how to tell ahead…
Heather Stark
  • 605
  • 7
  • 18
36
votes
1 answer

When do you start additional Elasticsearch nodes?

I'm in the middle of attempting to replace a Solr setup with Elasticsearch. This is a new setup, which has not yet seen production, so I have lots of room to fiddle with things and get them working well. I have very, very large amounts of data. I'm…
gdm
  • 905
  • 1
  • 15
  • 21
31
votes
3 answers

Dynamodb query error - Query key condition not supported

I am trying to query my dynamodb table to get feed_guid and status_id = 1. But it returns Query key condition not supported error. Please find my table schema and query. $result = $dynamodbClient->createTable(array( 'TableName' =>…
Arun SS
  • 1,791
  • 8
  • 29
  • 48
29
votes
3 answers

MongoDB as file storage

I'm trying to find the best solution to create scalable storage for big files. File sizes can vary from 1-2 megabytes up to 500-600 gigabytes. I have found some information about Hadoop and its HDFS, but it looks a little bit complicated,…
cmd
  • 515
  • 3
  • 9
  • 19
28
votes
2 answers

How to get array/bag of elements from Hive group by operator?

I want to group by a given field and get the output with grouped fields. Below is an example of what I am trying to achieve: imagine a table named 'sample_table' with two columns, F1 and F2, containing the rows (001, 111), (001, 222), (001, 123), (002, 222), (002, 333), (003, 555). I…
Anuroop
  • 993
  • 3
  • 13
  • 25
27
votes
6 answers

How to copy data from one HDFS to another HDFS?

I have two HDFS setups and want to copy (not migrate or move) some tables from HDFS1 to HDFS2. How can I copy data from one HDFS to another? Is it possible via Sqoop or another command-line tool?
sharp
  • 2,140
  • 9
  • 43
  • 80
27
votes
5 answers

Books to start learning big data

I would like to start learning about big data technologies. I want to work in this area in the future. Does anyone know good books to start learning about it? Hadoop, HBase. Beginner, intermediate, advanced. Thanks in advance
Gunter Amorim
  • 77
  • 1
  • 5
  • 14
27
votes
3 answers

Best solution for finding 1 x 1 million set intersection? Redis, Mongo, other

Hi all and thanks in advance. I am new to the NoSQL game but my current place of employment has tasked me with set comparisons of some big data. Our system has customer tag sets and targeted tag sets. A tag is an 8 digit number. A customer tag set…
MFD3000
  • 854
  • 1
  • 11
  • 26
25
votes
2 answers

AWS S3 Sync very slow when copying to large directories

When syncing data to an empty directory in S3 using AWS-CLI, it's almost instant. However, when syncing to a large directory (several million folders), it takes a very long time before even starting to upload / sync the files. Is there an…
King Dedede
  • 970
  • 1
  • 12
  • 28
25
votes
6 answers

What is the difference between Big Data and Data Mining?

As Wikipedia states: "The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use." How is this related to Big Data? Is it correct if I say that Hadoop…
DesirePRG
  • 6,122
  • 15
  • 69
  • 114
23
votes
3 answers

Export large amount of data from Cassandra to CSV

I'm using Cassandra 2.0.9 to store fairly large amounts of data, say 100 GB, in one column family. I would like to export this data to CSV quickly. I tried: sstable2json - it produces quite big JSON files which are hard to parse - because…
KrzysztofZalasa
  • 251
  • 1
  • 2
  • 7
23
votes
1 answer

Postgresql - performance of using array in big database

Let's say we have a table with 6 million records. There are 16 integer columns and a few text columns. It is a read-only table, so every integer column has an index. Every record is around 50-60 bytes. The table name is "Item". The server is: 12 GB RAM, 1,5…
user1573402
  • 233
  • 1
  • 2
  • 5
22
votes
6 answers

Find most repeated phrase on huge text

I have huge text data. My entire database is in text format, encoded as UTF-8. I need a list of the most repeated phrases across all of my text data. For example, my desired output is something like this: { 'a': 423412341, 'this': 423412341, 'is': 322472341, …
22
votes
4 answers

Django + Postgres + Large Time Series

I am scoping out a project with large, mostly-uncompressible time series data, and wondering if Django + Postgres with raw SQL is the right call. I have time series data that is ~2K objects/hour, every hour. This is about 2 million rows per year I…
Ben
  • 223
  • 1
  • 2
  • 4
22
votes
3 answers

Error Message: TOK_ALLCOLREF is not supported in current context - while Using DISTINCT in HIVE

I'm using the simple command: SELECT DISTINCT * FROM first_working_table; in HIVE 0.11, and I'm receiving the following error message: FAILED: SemanticException TOK_ALLCOLREF is not supported in current context. Does anyone know why this is…
user3107144
  • 231
  • 1
  • 2
  • 3