Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volumes. Questions are typically related to infrastructure, algorithms, statistics, and data structures.

Big data is not only about huge volume; other defining characteristics include velocity, veracity, and variety.

Several features distinguish big data as a concept in its own right:

Data

Data is so large it cannot be processed on a single computer.
Relationships between data elements are extremely complex.

Algorithms

Single-machine algorithms with worse than O(N) time complexity are likely to take impractically long (potentially years) to finish on data of this size.
Fast distributed algorithms are used instead.
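
As a rough illustration of the idea, the sketch below mimics the map-and-reduce pattern behind many distributed algorithms: the data is split into partitions, each partition is processed independently, and the partial results are merged. It runs on a single machine, and the function names (split_into_partitions, map_partition, merge_counts, distributed_word_count) are hypothetical helpers chosen for this example, not part of any big data framework.

```python
from collections import Counter
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into lists that stand in for cluster nodes."""
    partitions = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        partitions[index % num_partitions].append(record)
    return partitions


def map_partition(partition):
    """Map step: count words using only the records in one partition."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial word counts into one."""
    return left + right


def distributed_word_count(records, num_partitions=3):
    partitions = split_into_partitions(records, num_partitions)
    # On a real cluster each partition would be processed on a different node;
    # here the map step simply runs in a loop.
    partial_counts = [map_partition(p) for p in partitions]
    return reduce(merge_counts, partial_counts, Counter())


if __name__ == "__main__":
    lines = ["big data tools", "big clusters", "data data"]
    print(distributed_word_count(lines))
```

Because each partition is handled independently and the merge step is associative, the result does not depend on how many partitions are used, which is what makes this pattern suitable for spreading work across machines.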

Storage

The underlying data storage must be fault-tolerant and keep data in a consistent state regardless of device failures.
A single storage device cannot hold the entire data set.

Ecosystem

Big data is also synonymous with the set of tools used to process huge amounts of data, known as the big data ecosystem. Popular tools include HDFS, Spark, and MapReduce.

7919 questions

4 answers

Log files vs. database: where to save user activity data for analysis?

I am currently working on a website that has login functionality. I need to track user activities such as login and logout times, total browsing duration, IP address, location, etc. All of this data will be used for analysis and security purposes.…

asked Jan 24 '17 at 09:30

WeAreRight

1 answer

PHP - processing big data

I am trying to process big data with PHP (100 000 000 records). I download every record from a different server, run some text checks, and insert the qualifying records (probably about 10%) into my DB (MySQL). My problem is: web browser just…

php bigdata

asked Jan 20 '17 at 10:17

Jarda K.

1 answer

HDFS HA possibilities

Recently, I managed to enable HA for HDFS and YARN. Now I have active and standby NameNodes, and automatic failover is working properly. I am using Cloudera Manager and CDH 5. I have the following question. For example, if my active NameNode…

hadoop hdfs high-availability cloudera-manager bigdata

asked Jan 19 '17 at 12:59

user7441074

2 answers

Filtering on multiple customDimensions and then aggregating

sql pivot google-bigquery analytics bigdata

asked Jan 14 '17 at 13:49

Pentium10

1 answer

Understanding shuffle and repartitioning in Spark

I would greatly appreciate it if someone could answer these few Spark shuffle-related questions in simplified terms. In Spark, when loading a data set, we specify the number of partitions, which tells how many blocks the input data (RDD) should be…

apache-spark bigdata

asked Jan 11 '17 at 17:36

Sal

0 answers

Logstash output Performance

I am using Elasticsearch 5.1.1 and Logstash 5.1.1. I imported 3 million rows from SQL Server into Elasticsearch via Logstash in 2 hours. I have a single Windows machine with 4 GB RAM and a Core i3 CPU. Are there any additional configurations I should add to speed…

elasticsearch logstash bigdata

asked Jan 11 '17 at 08:24

Elsayed

0 answers

Merits of JSON vs CSV file format while writing to HDFS for downstream applications

We are in the process of extracting source data (xls) and injecting it into HDFS. Is it better to write these files in CSV or JSON format? We are contemplating choosing one of them, but before making the call, we are wondering what the merits &…

json csv hadoop hdfs bigdata

asked Jan 10 '17 at 20:45

jb04

2 answers

Large scale pivot table in Python

I have 100-300 GB of data in CSV format (numerical + Unicode text) and need to run regular pivot table jobs on it. After googling and searching Stack Overflow, I could not find a satisfactory answer (only partial ones). I am wondering which solution is the fastest for…

pandas pivot-table bigdata

asked Jan 09 '17 at 02:52

tensor

2 answers

How does the Hive sentences function break each sentence?

Before posting, I tried the Hive sentences function and did some searching but couldn't get a clear understanding. My question is: based on what delimiter does the Hive sentences function break each sentence? The Hive manual says "appropriate boundary"; what does…

hive bigdata

asked Jan 04 '17 at 15:19

user7343922

2 answers

select multiple elements with group by in spark.sql

Is there any way to group by in Spark SQL while selecting multiple elements? The code I am using: val df = spark.read.json("//path") df.createOrReplaceTempView("GETBYID") Now doing the group by like: val sqlDF = spark.sql( "SELECT count(customerId)…

scala apache-spark apache-spark-sql bigdata

asked Jan 02 '17 at 05:17

rahul

3 answers

How to store millions of statistics records efficiently?

We have about 1.7 million products in our e-shop. We want to keep a record of how many views these products had over a one-year period, recording the views at least every 2 hours. The question is: what structure should we use for this task? Right now we…

sql postgresql bigdata

asked Dec 19 '16 at 17:18

SteveL

0 answers

Input/output error while copying from hadoop file system to local

hadoop fs -copyToLocal /paulp /abcd (I want to copy the folder paulp in the Hadoop file system to the abcd folder on the local machine.) But the output of that command looks like this: (copyToLocal: mkdir `/abcd': Input/output error). I use Ubuntu 14.04 and hadoop…

linux hadoop data-science bigdata

asked Dec 17 '16 at 10:44

paul vineeth

2 answers

Update newly added columns in Hive

I have been trying to make updates to an ORC table in Hive that is bucketed and has the transactional=true property set. The normal update works great, but as soon as I alter the table and add a new column, e.g. column_added_5, and try to update…

hadoop hive sql-update acid bigdata

asked Dec 13 '16 at 00:31

Varun Singh

0 answers

Distributed Yarn Application

When I have a job that is submitted to the ResourceManager via YarnClient, the RM will instantiate an AM on one of the NodeManagers. I suppose my job is a JAR file, so this JAR file will be distributed to all the NMs in the cluster, then all the…

java hadoop-yarn hadoop2 bigdata

asked Dec 07 '16 at 15:55

Sendi Zied

2 answers

Is there any concept of auto-commit in HBase?

I am new to HBase and want to learn more. I just want to know whether there is any auto-commit concept available in HBase.

apache hadoop hbase bigdata

asked Dec 06 '16 at 07:43

Sameer Bhand
