Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volumes. Questions typically relate to infrastructure, algorithms, statistics, and data structures.

Big data is not defined by huge volume alone; it is also characterized by velocity, veracity, and variety.

Several features distinguish big data as a concept in its own right:

Data

The data is so large that it cannot be processed on a single computer.
Relationships between data elements are extremely complex.

Algorithms

Local (single-machine) algorithms whose running time grows faster than O(N) will likely take many years to finish.
Fast distributed algorithms are used instead.
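
As a rough back-of-the-envelope illustration of the point above, the sketch below estimates single-machine running time for a linear and a quadratic algorithm. The figures are assumptions chosen for illustration only: a data set of 10^9 elements and a machine performing 10^9 simple operations per second.

```python
# Assumed, illustrative figures; not taken from any benchmark.
SECONDS_PER_YEAR = 365 * 24 * 3600


def estimated_years(operations, ops_per_second=1e9):
    """Convert an operation count into years of single-machine compute."""
    return operations / ops_per_second / SECONDS_PER_YEAR


n = 10**9  # hypothetical number of data elements

linear_years = estimated_years(n)          # O(N): about one second in total
quadratic_years = estimated_years(n ** 2)  # O(N^2): decades

print(f"O(N):   {linear_years:.2e} years")
print(f"O(N^2): {quadratic_years:.1f} years")
```

Under these assumptions the linear pass finishes in about a second, while the quadratic one needs roughly three decades, which is why the work is spread across many machines instead.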

Storage

The underlying data storage must be fault-tolerant and keep data in a consistent state regardless of device failures.
A single storage device cannot hold the entire data set.

Ecosystem

Big data is also used as a name for the set of tools that process huge amounts of data, known as the big data ecosystem. Popular tools include HDFS, Spark, and MapReduce.

7919 questions

1 answer

How can I tell when my dataset in R is going to be too large?

I am going to be undertaking some logfile analyses in R (unless I can't do it in R), and I understand that my data needs to fit in RAM (unless I use some kind of fix like an interface to a keyval store, maybe?). So I am wondering how to tell ahead…

r bigdata logfile-analysis

asked Oct 07 '12 at 08:57

Heather Stark

1 answer

When do you start additional Elasticsearch nodes?

I'm in the middle of attempting to replace a Solr setup with Elasticsearch. This is a new setup, which has not yet seen production, so I have lots of room to fiddle with things and get them working well. I have very, very large amounts of data. I'm…

elasticsearch sharding bigdata

asked Sep 13 '12 at 15:11

gdm

3 answers

DynamoDB query error - Query key condition not supported

I am trying to query my DynamoDB table to get feed_guid and status_id = 1, but it returns a "Query key condition not supported" error. Please find my table schema and query. $result = $dynamodbClient->createTable(array( 'TableName' =>…

amazon-web-services bigdata amazon-dynamodb

asked Aug 05 '15 at 11:13

Arun SS

3 answers

MongoDB as file storage

I'm trying to find the best solution to create scalable storage for big files. File size can vary from 1-2 megabytes up to 500-600 gigabytes. I have found some information about Hadoop and its HDFS, but it looks a little bit complicated,…

mongodb storage gridfs bigdata

asked Feb 22 '13 at 18:09

cmd

2 answers

How to get array/bag of elements from Hive group by operator?

I want to group by a given field and get the output with grouped fields. Below is an example of what I am trying to achieve. Imagine a table named 'sample_table' with two columns as below:

F1   F2
001  111
001  222
001  123
002  222
002  333
003  555

I…

sql hadoop hive apache-pig bigdata

asked May 08 '13 at 15:03

Anuroop

6 answers

How to copy data from one HDFS to another HDFS?

I have two HDFS setups and want to copy (not migrate or move) some tables from HDFS1 to HDFS2. How can I copy data from one HDFS to another? Is it possible via Sqoop or another command-line tool?

hadoop hdfs bigdata sqoop

asked Aug 06 '15 at 18:11

sharp

5 answers

Books to start learning big data

I would like to start learning about big data technologies, as I want to work in this area in the future. Does anyone know good books to start learning about it? Hadoop, HBase. Beginner - intermediate - advanced - Thanks in advance

hadoop hbase hive pentaho bigdata

asked Nov 08 '12 at 15:02

Gunter Amorim

3 answers

Best solution for finding 1 x 1 million set intersection? Redis, Mongo, other

Hi all and thanks in advance. I am new to the NoSQL game, but my current place of employment has tasked me with set comparisons of some big data. Our system has customer tag sets and targeted tag sets. A tag is an 8-digit number. A customer tag set…

mongodb redis bigdata nosql

asked Jun 19 '12 at 06:11

MFD3000

2 answers

AWS S3 Sync very slow when copying to large directories

When syncing data to an empty directory in S3 using AWS-CLI, it's almost instant. However, when syncing to a large directory (several million folders), it takes a very long time before even starting to upload / sync the files. Is there an…

amazon-web-services amazon-s3 aws-cli bigdata

asked Jan 24 '17 at 18:35

King Dedede

6 answers

What is the difference between Big Data and Data Mining?

As Wikipedia states: "The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use." How is this related to Big Data? Is it correct if I say that Hadoop…

hadoop machine-learning bigdata data-mining data-science

asked Mar 15 '14 at 05:25

DesirePRG

3 answers

Export large amount of data from Cassandra to CSV

I'm using Cassandra 2.0.9 to store quite large amounts of data, let's say 100 GB, in one column family. I would like to export this data to CSV quickly. I tried: sstable2json - it produces quite big JSON files which are hard to parse - because…

csv cassandra bigdata cassandra-2.0

asked Jul 22 '14 at 19:38

KrzysztofZalasa

1 answer

PostgreSQL - performance of using arrays in a big database

Let's say we have a table with 6 million records. There are 16 integer columns and a few text columns. It is a read-only table, so every integer column has an index. Every record is around 50-60 bytes. The table name is "Item". The server is: 12 GB RAM, 1,5…

arrays performance postgresql join bigdata

asked Aug 03 '12 at 08:03

user1573402

6 answers

Find the most repeated phrase in a huge text

I have a huge amount of text data. My entire database is in UTF-8 text format. I need a list of the most repeated phrases across all of my text data. For example, my desired output is something like this: { 'a': 423412341, 'this': 423412341, 'is': 322472341, …

search text full-text-search bigdata

asked Apr 20 '15 at 16:41

Mohammad Hossein Fattahizadeh
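
A minimal sketch of the kind of frequency count this question describes, assuming the text can be streamed line by line and that the distinct phrases fit in memory. The function name `count_phrases`, the `n` parameter, and the sample lines are hypothetical and only for illustration; truly huge inputs would likely need a distributed or approximate approach instead.

```python
import re
from collections import Counter


def count_phrases(lines, n=1):
    """Count n-word phrases across an iterable of text lines.

    Lines are processed one at a time, so only the counts (not the
    full text) are held in memory. Phrases do not span line breaks.
    """
    counts = Counter()
    for line in lines:
        tokens = re.findall(r"\w+", line.lower())
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical sample standing in for a large file or database cursor.
sample = ["this is a test", "this is a phrase", "a phrase is a phrase"]

word_counts = count_phrases(sample, n=1)
pair_counts = count_phrases(sample, n=2)

print(word_counts.most_common(3))
print(pair_counts.most_common(3))
```

Passing an open file object (for example `open(path, encoding="utf-8")`) instead of the sample list keeps memory use proportional to the number of distinct phrases rather than the size of the text.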

4 answers

Django + Postgres + Large Time Series

I am scoping out a project with large, mostly-uncompressible time series data, and wondering if Django + Postgres with raw SQL is the right call. I have time series data that is ~2K objects/hour, every hour. This is about 2 million rows per year I…

python django postgresql heroku bigdata

asked Aug 08 '14 at 20:48

Ben

3 answers

Error Message: TOK_ALLCOLREF is not supported in current context - while Using DISTINCT in HIVE

I'm using the simple command: SELECT DISTINCT * FROM first_working_table; in HIVE 0.11, and I'm receiving the following error message: FAILED: SemanticException TOK_ALLCOLREF is not supported in current context. Does anyone know why this is…

sql hadoop hive distinct bigdata

asked Jan 13 '14 at 10:22

user3107144
