Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volumes. Questions typically concern infrastructure, algorithms, statistics, and data structures.

Big data is not only data of huge volume; it is also characterized by velocity, veracity, and variety.

Several characteristics set big data apart as a distinct concept:

Data

  • Data is so large it cannot be processed on a single computer.
  • Relationships between data elements are extremely complex.

Algorithms

  • Single-machine algorithms with worse than O(N) running time would take impractically long, possibly years, to finish.
  • Fast distributed algorithms are used instead.

Storage

  • The underlying storage must be fault-tolerant, keeping data in a consistent state regardless of device failures.
  • No single storage device can hold the entire data set.

Ecosystem

  • "Big data" also refers to the set of tools used to process huge amounts of data, known as the big data ecosystem. Popular tools include HDFS, Spark, and MapReduce; a minimal Spark sketch follows.
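
To make the ecosystem point concrete, here is a minimal sketch of the classic distributed word count in PySpark. It assumes a local Spark installation; "input.txt" is a placeholder path that in practice would usually point at HDFS.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
    sc = spark.sparkContext

    counts = (
        sc.textFile("input.txt")                # distributed read, one partition per block
          .flatMap(lambda line: line.split())   # runs in parallel on each partition
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b)      # shuffled and combined across the cluster
    )
    print(counts.take(10))
    spark.stop()

Each transformation is applied per partition in parallel; only the reduceByKey step moves data between machines, which is what makes this pattern scale past a single node.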
7919 questions
21
votes
2 answers

Strategies for reading in CSV files in pieces?

I have a moderate-sized file (a 4 GB CSV) on a computer that doesn't have sufficient RAM to read it in (8 GB on 64-bit Windows). In the past I would just have loaded it up on a cluster node and read it in, but my new cluster seems to arbitrarily limit…
Ari B. Friedman
  • 71,271
  • 35
  • 175
  • 235
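
The question is about R, but the chunked-reading strategy it asks for looks like this in Python/pandas as a rough sketch; "big.csv" and the column name "value" are placeholders.

    import pandas as pd

    # Read and aggregate the file in bounded-size chunks so peak memory
    # stays far below the 4 GB file size.
    total = 0
    for chunk in pd.read_csv("big.csv", chunksize=100_000):
        total += chunk["value"].sum()   # hypothetical per-chunk aggregation
    print(total)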
21
votes
10 answers

Python Shared Memory Dictionary for Mapping Big Data

I've been having a hard time using a large dictionary (~86 GB, 1.75 billion keys) to process a big dataset (2 TB) using multiprocessing in Python. Context: a dictionary mapping strings to strings is loaded from pickled files into memory. Once loaded,…
Jon Deaton
  • 3,943
  • 6
  • 28
  • 41
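
One common approach for this situation (a sketch, not necessarily the accepted answer): on Linux, workers forked after the dictionary is built inherit it copy-on-write, so it is never re-pickled per process. Caveat: CPython reference counting gradually dirties pages, so memory use can still creep upward over time.

    import multiprocessing as mp

    big_map = {}  # stand-in for the ~86 GB mapping loaded from pickled files

    def lookup(key):
        # Forked workers see big_map without copying or re-pickling it up front.
        return big_map.get(key)

    if __name__ == "__main__":
        big_map.update({"a": "1", "b": "2"})          # placeholder load
        with mp.get_context("fork").Pool(4) as pool:  # fork = copy-on-write (Linux)
            print(pool.map(lookup, ["a", "b", "c"]))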
21
votes
2 answers

Spark RDDs - how do they work?

I have a small Scala program that runs fine on a single node. However, I am scaling it out so it runs on multiple nodes. This is my first such attempt. I am just trying to understand how RDDs work in Spark, so this question is based around theory…
monster
  • 1,762
  • 3
  • 20
  • 38
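
A small sketch of the core RDD mechanics the question asks about: data is split into partitions, transformations are lazy and run per partition, and an action triggers the distributed computation.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(10), numSlices=4)  # explicitly request 4 partitions
    print(rdd.getNumPartitions())                 # 4
    print(rdd.glom().collect())                   # elements grouped by partition
    print(rdd.map(lambda x: x * x)                # lazy transformation
             .reduce(lambda a, b: a + b))         # action: executes across partitions
    spark.stop()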
21
votes
3 answers

Reading big data with fixed width

How can I read big data formatted with fixed width? I read this question and tried some tips, but all the answers are for delimited data (such as .csv), which is not my case. The data is 558 MB, and I don't know how many lines it has. I'm using: dados <-…
Rcoster
  • 3,170
  • 2
  • 16
  • 35
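
This question is about R, but as an illustration of the same idea, pandas can stream fixed-width data in chunks. The column specs, names, and filter below are hypothetical.

    import pandas as pd

    colspecs = [(0, 10), (10, 18), (18, 30)]   # placeholder (start, end) offsets
    names = ["id", "date", "amount"]

    parts = []
    for chunk in pd.read_fwf("fixed_width.dat", colspecs=colspecs,
                             names=names, chunksize=50_000):
        parts.append(chunk[chunk["amount"] > 0])    # hypothetical per-chunk filter
    result = pd.concat(parts, ignore_index=True)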
20
votes
6 answers

PySpark DataFrames - way to enumerate without converting to Pandas?

I have a very big pyspark.sql.dataframe.DataFrame named df. I need some way of enumerating records, so that I can access a record by index (or select a group of records from an index range). In pandas, I could make just…
Maria Koroliuk
  • 297
  • 1
  • 2
  • 8
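
Two common patterns for this, as a sketch with the trade-offs noted in comments: zipWithIndex on the underlying RDD for dense 0-based indexes, or row_number for SQL-style numbering.

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.appName("index-sketch").getOrCreate()
    df = spark.createDataFrame([("a",), ("b",), ("c",)], ["col"])

    # Option 1: dense 0-based indexes without collecting to the driver.
    indexed = (df.rdd.zipWithIndex()
                 .map(lambda t: t[0] + (t[1],))
                 .toDF(df.columns + ["idx"]))

    # Option 2: row_number needs an ordering; a single global window like this
    # pulls everything through one partition, so use it with care.
    indexed2 = df.withColumn("idx", F.row_number().over(Window.orderBy("col")) - 1)

    indexed.filter("idx BETWEEN 1 AND 2").show()   # select a range of records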
20
votes
3 answers

Fastest way to compare row and previous row in pandas dataframe with millions of rows

I'm looking for solutions to speed up a function I have written to loop through a pandas dataframe and compare column values between the current row and the previous row. As an example, this is a simplified version of my problem: User Time …
AdO
  • 445
  • 1
  • 6
  • 17
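
The usual vectorized answer to this kind of problem is shift()/diff(), which compares each row with its predecessor without a Python-level loop; the frame below is a toy stand-in for the real data.

    import pandas as pd

    df = pd.DataFrame({"User": ["u1", "u1", "u2", "u2"],
                       "Time": [1, 5, 2, 9]})

    # shift(1) aligns each row with the previous one, so the comparison is
    # a single vectorized operation even over millions of rows.
    df["time_diff"] = df["Time"] - df["Time"].shift(1)

    # Per-user variant: diff() within each group avoids comparing across users.
    df["time_diff_user"] = df.groupby("User")["Time"].diff()
    print(df)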
20
votes
3 answers

How do I determine the size of my HBase tables? Is there any command to do so?

I have multiple tables in the HBase shell that I would like to copy onto my file system. Some tables exceed 100 GB. However, I only have 55 GB of free space left on my local file system. Therefore, I would like to know the size of my HBase tables so that…
gautham
  • 313
  • 1
  • 2
  • 6
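
One way to answer this, since HBase stores each table's files on HDFS, is to ask HDFS for the directory size. The sketch below shells out to the hdfs CLI; the path layout (/hbase/data/<namespace>/<table>) matches recent HBase releases, and "my_table" is a placeholder name.

    import subprocess

    # -du -s -h: summarized, human-readable size of the table directory.
    result = subprocess.run(
        ["hdfs", "dfs", "-du", "-s", "-h", "/hbase/data/default/my_table"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)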
20
votes
1 answer

Hive padding leading zeroes

I need the output of a string column in my table to be 13 characters long; irrespective of its original length, I need to pad the remaining characters with 0... I tried to use the following code in my Hive query, but failed to get the desired…
Muthu Palaniappan
  • 221
  • 1
  • 2
  • 5
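
Hive ships an lpad(str, len, pad) built-in for exactly this. Here is a sketch of the equivalent call through PySpark (to keep the example in Python) on toy data.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("lpad-sketch").getOrCreate()
    df = spark.createDataFrame([("42",), ("9876543",)], ["code"])

    # Left-pad with zeroes up to 13 characters; note that lpad truncates
    # values already longer than the target length.
    df.withColumn("code13", F.lpad("code", 13, "0")).show(truncate=False)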
20
votes
2 answers

How to quickly export data from R to SQL Server

The standard RODBC package's sqlSave function, even as a single INSERT statement (parameter fast = TRUE), is terribly slow for large amounts of data due to non-minimal logging. How would I write data to my SQL Server with minimal logging so it writes…
jpd527
  • 1,543
  • 1
  • 14
  • 30
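
The minimally logged route into SQL Server is a bulk load (bcp or BULK INSERT) rather than row-by-row INSERTs. The question is about R, but as a point of comparison, this is roughly what the batched approach looks like from Python with pyodbc; the DSN, table, and data are placeholders.

    import pyodbc

    conn = pyodbc.connect("DSN=mssql;UID=user;PWD=secret")  # placeholder DSN
    cur = conn.cursor()
    cur.fast_executemany = True   # send parameter arrays instead of row-by-row

    rows = [(i, f"name{i}") for i in range(100_000)]
    cur.executemany("INSERT INTO dbo.big_table (id, name) VALUES (?, ?)", rows)
    conn.commit()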
19
votes
2 answers

BigQuery replaced most of my Spark jobs - am I missing something?

I've been developing Spark jobs for some years using on-premise clusters and our team recently moved to the Google Cloud Platform allowing us to leverage the power of BigQuery and such. The thing is, I now often find myself writing processing steps…
19
votes
3 answers

How to efficiently save a Pandas Dataframe into one/more TFRecord file?

First I want to quickly give some background. What I want to achieve eventually is to train a fully connected neural network for a multi-class classification problem under the TensorFlow framework. The challenge of the problem is that the size of…
Ling Gu
  • 249
  • 2
  • 6
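
A minimal sketch of the standard recipe: serialize each row as a tf.train.Example and stream it through tf.io.TFRecordWriter. The column names and dtypes below are toy stand-ins for the real frame.

    import pandas as pd
    import tensorflow as tf

    df = pd.DataFrame({"feature": [0.1, 0.2], "label": [0, 1]})  # toy data

    def to_example(feature, label):
        # One Example per row; FloatList/Int64List must match the column dtypes.
        return tf.train.Example(features=tf.train.Features(feature={
            "feature": tf.train.Feature(float_list=tf.train.FloatList(value=[feature])),
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        }))

    with tf.io.TFRecordWriter("data.tfrecord") as writer:
        for row in df.itertuples(index=False):
            writer.write(to_example(row.feature, row.label).SerializeToString())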
19
votes
1 answer

Why does Spark's OneHotEncoder drop the last category by default?

I would like to understand the rationale behind Spark's OneHotEncoder dropping the last category by default. For example: >>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"]) >>> ss =…
Corey
  • 1,845
  • 1
  • 12
  • 23
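
The short answer is the dummy-variable trap: with every category encoded, the one-hot columns always sum to 1, making them linearly dependent with the intercept in linear models; dropping one category removes that redundancy. The behavior is controlled by the dropLast parameter, sketched below with the Spark 3.x estimator API.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import StringIndexer, OneHotEncoder

    spark = SparkSession.builder.appName("ohe-sketch").getOrCreate()
    df = spark.createDataFrame(
        [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])

    indexed = StringIndexer(inputCol="c", outputCol="c_idx").fit(df).transform(df)

    # dropLast=False keeps every category; the default (True) encodes the last
    # one implicitly as the all-zeros vector.
    enc = OneHotEncoder(inputCol="c_idx", outputCol="c_vec", dropLast=False)
    enc.fit(indexed).transform(indexed).show()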
19
votes
1 answer

Memory limits in data.table: negative length vectors are not allowed

I have a data table with several social media users and their followers. The original data table has the following format: X.USERID FOLLOWERS 1081 4053807021,2476584389,4713715543, ... So each row contains a user together with their ID and…
19
votes
2 answers

What is the basic difference between JobConf and Job?

Hi, I wanted to know the basic difference between JobConf and Job objects. Currently I am submitting my job like this: JobClient.runJob(jobconf); I saw another way of submitting jobs like this: Configuration conf = getConf(); Job job = new Job(conf,…
user1585111
  • 1,019
  • 6
  • 19
  • 35
19
votes
2 answers

Haskell: Can I perform several folds over the same lazy list without keeping list in memory?

My context is bioinformatics, next-generation sequencing in particular, but the problem is generic; so I will use a log file as an example. The file is very large (Gigabytes large, compressed, so it will not fit in memory), but is easy to parse…
luispedro
  • 6,934
  • 4
  • 35
  • 45
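
In Haskell the usual trick is to combine the folds into a single pass (e.g. with the foldl package's Applicative interface) so the list is consumed once and never retained. The same one-pass idea, written as a plain Python sketch with a hypothetical record format:

    import gzip

    count, total, maximum = 0, 0.0, float("-inf")

    # Stream the compressed log once, updating all accumulators per line,
    # so nothing beyond the current line is held in memory.
    with gzip.open("big.log.gz", "rt") as stream:     # placeholder file name
        for line in stream:
            value = float(line.split()[-1])           # hypothetical record format
            count += 1
            total += value
            maximum = max(maximum, value)

    print(count, total / max(count, 1), maximum)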