Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volumes. Questions are typically related to infrastructure, algorithms, statistics, and data structures.

Big data is more than sheer volume: it is also characterized by velocity, veracity, and variety.

Several features distinguish big data as a concept in its own right:

Data

  • Data is so large it cannot be processed on a single computer.
  • Relationships between data elements are extremely complex.

Algorithms

  • Local (single-machine) algorithms whose running time grows much faster than O(N), such as O(N²), can take years to finish.
  • Fast distributed algorithms are used instead.
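
To put rough numbers on the running-time claim above, here is a back-of-the-envelope sketch. The record count (10^9) and the throughput (10^9 simple operations per second on one machine) are assumptions chosen for illustration, not figures from the tag description, and the function name is hypothetical.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def estimated_seconds(n, ops_per_second=1e9):
    """Rough single-machine runtimes, in seconds, for three complexity classes.

    Both n and ops_per_second are illustrative assumptions; constant factors,
    memory limits, and I/O are ignored.
    """
    return {
        "O(N)": n / ops_per_second,
        "O(N log N)": n * math.log2(n) / ops_per_second,
        "O(N^2)": n ** 2 / ops_per_second,
    }


if __name__ == "__main__":
    for label, seconds in estimated_seconds(10**9).items():
        years = seconds / SECONDS_PER_YEAR
        print(f"{label}: {seconds:.3g} s (about {years:.3g} years)")
```

Under these assumptions the linear pass takes about a second, the O(N log N) pass takes on the order of half a minute, and the quadratic pass takes roughly three decades. The "many years" problem is therefore mostly about quadratic or worse algorithms, which is where distributing the work becomes necessary.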

Storage

  • The underlying data storage must be fault-tolerant and keep data in a consistent state regardless of device failures.
  • A single storage device cannot hold the entire data set.

Ecosystem

  • The term "big data" is also used for the set of tools that process huge amounts of data, known as the big data ecosystem. Popular tools include HDFS, Spark, and MapReduce.
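
As a rough illustration of the programming model behind tools such as MapReduce, the sketch below runs the map, shuffle, and reduce phases of a word count in a single Python process. It does not use the Hadoop or Spark APIs; all function names are hypothetical, and in a real cluster each chunk would be processed on a different machine.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of lines."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle_phase(mapped_chunks):
    """Group the emitted values by key across all chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    """Run map, shuffle, and reduce over a list of chunks (lists of lines)."""
    return reduce_phase(shuffle_phase(map_phase(chunk) for chunk in chunks))


if __name__ == "__main__":
    # Two chunks stand in for two input splits on separate workers.
    print(word_count([["big data", "data"], ["Big tools"]]))
```

The map step never looks at more than one chunk at a time, which is what lets the work be spread across machines; only the shuffle step needs to move data between them.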
7919 questions
2
votes
4 answers

Log files vs database where to save user activity data for analysis?

I am currently working on a website that has login functionality. I need to track user activities such as login and logout times, total browsing duration, IP address, location, etc. All of this data will be used for analysis and security purposes.…
WeAreRight
  • 885
  • 9
  • 24
2
votes
1 answer

PHP - processing big data

I am trying to process big data with PHP (100 000 000 records). I download every record from a different server, run some text checks, and insert the roughly 10% of records that qualify into my DB (MySQL). My problem is: web browser just…
Jarda K.
  • 161
  • 3
  • 10
2
votes
1 answer

HDFS HA possibilities

Recently, I managed to enable HA for HDFS and YARN. I now have active and standby NameNodes, and automatic failover is working properly. I am using Cloudera Manager and CDH 5. I have the following question. For example, if my active Namenode…
2
votes
2 answers

Filtering on multiple customDimensions and then aggregating

The data that comes out from the BigQuery implementation of GoogleAnalytics raw data looks like this: |-visitId |- date |- (....) +- hits |- time |- page |- pagePath |- eventInfo |- eventAction +- customDimensions |-…
Pentium10
  • 204,586
  • 122
  • 423
  • 502
2
votes
1 answer

Understanding Shuffle and rePartitioning in spark

I would greatly appreciate it if someone could answer these few Spark shuffle-related questions in simplified terms. In Spark, when loading a data set, we specify the number of partitions, which tells how many blocks the input data (RDD) should be…
Sal
  • 167
  • 2
  • 10
2
votes
0 answers

Logstash output Performance

I am using Elasticsearch 5.1.1 and Logstash 5.1.1. I imported 3 million rows from SQL Server into Elasticsearch via Logstash in 2 hours. I have a single Windows machine with 4 GB RAM and a Core i3. Are there any additional configurations I should add to speed…
Elsayed
  • 2,712
  • 7
  • 28
  • 41
2
votes
0 answers

Merits of JSON vs CSV file format while writing to HDFS for downstream applications

We are in the process of extracting source data (xls) and loading it into HDFS. Is it better to write these files in CSV or JSON format? We are contemplating choosing one of them, but before making the call, we are wondering what are the merits &…
jb04
  • 79
  • 9
2
votes
2 answers

Large scale pivot table in Python

I have 100-300 GB of data in CSV format (numerical + Unicode text) and need to run regular pivot table jobs on it. After googling and searching Stack Overflow, I could not find a satisfactory answer (only partial ones). Wondering which solution is the fastest for…
tensor
  • 3,088
  • 8
  • 37
  • 71
2
votes
2 answers

How hive sentences function breaks each sentence

Before posting, I tried the Hive sentences function and did some searching but couldn't get a clear understanding. My question is: based on what delimiter does the Hive sentences function break each sentence? The Hive manual says "appropriate boundary"; what does…
user7343922
  • 316
  • 4
  • 17
2
votes
2 answers

select multiple elements with group by in spark.sql

Is there any way to group by in Spark SQL while selecting multiple elements? The code I am using: val df = spark.read.json("//path") df.createOrReplaceTempView("GETBYID") Now doing the group by like: val sqlDF = spark.sql( "SELECT count(customerId)…
rahul
  • 880
  • 3
  • 14
  • 25
2
votes
3 answers

How to store millions of statistics records efficiently?

We have about 1.7 million products in our e-shop. We want to keep a record of how many views these products had over a 1-year period, recording the views at least every 2 hours. The question is what structure to use for this task. Right now we…
SteveL
  • 3,331
  • 4
  • 32
  • 57
2
votes
0 answers

Input/output error while copying from hadoop file system to local

hadoop fs -copyToLocal /paulp /abcd (I want to copy the folder paulp in the Hadoop file system to the local folder abcd.) But the output of that command looks like this: (copyToLocal: mkdir `/abcd': Input/output error) I use Ubuntu 14.04 and hadoop…
2
votes
2 answers

Update new added columns in hive

I have been trying to make updates to an ORC table in Hive that is bucketed and has the transactional=true property set. The normal update works great, but as soon as I alter the table and add a new column, e.g. column_added_5, and try to update…
Varun Singh
  • 101
  • 4
2
votes
0 answers

Distributed Yarn Application

When I have a job that is submitted to the ResourceManager via YarnClient, the RM will instantiate an AM on one of the NodeManagers. Suppose my job is a JAR file; this JAR file will then be distributed to all the NMs in the cluster, then all the…
Sendi Zied
  • 75
  • 1
  • 5
2
votes
2 answers

Is there any concept of auto commit in hbase?

I am new to HBase and want to learn more. I just want to know whether there is any auto-commit concept available in HBase.
Sameer Bhand
  • 43
  • 1
  • 9