Questions tagged [data-partitioning]

Data partitioning is the division of a collection of data into smaller collections for faster processing, easier statistics gathering, and a smaller memory or persistence footprint.

337 questions
0
votes
0 answers

What causes the tasks to not evenly partition in Spark?

In my experience, sometimes when I apply transformation() to large data, it seems that the tasks are not evenly partitioned and are skewed to one side so that only a few tasks are working. As a result, it was confirmed that the efficiency of the…
S.Kang
  • 581
  • 2
  • 10
  • 28
0
votes
1 answer

Moving historical data from Google Cloud Storage to a date-partitioned BigQuery table using Python

I need to organize a large amount of historical data into date partitions in Google BigQuery. It will partition by the load date for you (current date only), but that doesn't really help with historical data. The only solutions I have seen so far are…
Ralph
  • 65
  • 1
  • 6
0
votes
2 answers

Why doesn't Google BigQuery use the partition date correctly when using views?

I have a date-partitioned table (call it sample_table) with 2 columns, one to store the dateTime in UTC and the other to store the timezone offset. I have a view on top of this table (call it sample_view). The view takes _partitiontime from the table and exposes…
opensourcegeek
  • 5,552
  • 7
  • 43
  • 64
0
votes
1 answer

Ignored duplicate property derby.module.dataDictionary in Hive

I have an EMPLOYEES table which is partitioned on the basis of COUNTRY and STATE. Below are the partitions.
hive (human_resources)> show partitions employees ;
OK
country=IN/state=PU
country=US/state=CA
country=US/state=IL
Time taken: 0.119 seconds,…
0
votes
1 answer

Vertical Partitioning in Scalding

I have a TypedPipe[(String, String, Long)] where the first String can assume only a limited (~10) number of values. I'd like to partition my output so that a folder is created for each type (i.e. 10 folders, each named after a value of the first String). This…
Marsellus Wallace
  • 17,991
  • 25
  • 90
  • 154
0
votes
1 answer

Partition Pruning using DATE and RANGE COLUMNS

I'm trying to partition a database using a DATE column to take advantage of partition pruning in MySQL 5.7. For internal reasons, I need to partition by RANGE COLUMNS because it is easy and fast to add/drop partitions. While the MySQL website…
csn
  • 39
  • 4
0
votes
1 answer

Can a database table partition name be used as a part of WHERE clause for IBM DB2 9.7 SELECT statement?

I am trying to select all data out of the same specific table partition for 100+ tables using the DB2 EXPORT utility. The partition name is constant across all of my partitioned tables, which makes this method more advantageous than using some…
0
votes
2 answers

Case Statement in Where Clause in SQL Server

Good day! I have a SQL query which gives a result set of sales per tenant. Now, I want to get a final result set that shows the top 5 and bottom 5 in terms of sales (the number may be flexible; 5 is just an example). I used the rank function to get the…
rickyProgrammer
  • 1,177
  • 4
  • 27
  • 63
0
votes
0 answers

awk Splitting huge file creates error "too many open files"

I have a bash script for splitting up a huge input file. At the moment it's 400MB; later the script should split a 4GB file. The core of this processing is the following awk script: INPUTFILE="FA.txt" awk -F $'\t' 'BEGIN{ count…
Friedrich
  • 29
  • 2
0
votes
2 answers

Partition on Existing Table with Millions of Records

I need your suggestion on creating a partition on a table having millions of records. Table definition: CompanyId Type_Of_Data Emp_id Destination Destination_id. Now, for a single company, the type of data and emp_id can be different. COMPANY_ID …
0
votes
3 answers

How to algorithmically partition a keyspace?

This is related to consistent hashing and while I conceptually understand what I need to do, I'm having a hard time translating this into code. I'm trying to divide a given keyspace (say, 128 bits) into equal sized partitions. I want the upper bound…
Patrick Hogan
  • 2,098
  • 4
  • 20
  • 28
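One possible reading of the keyspace question above, as a minimal sketch in Python: split the integer range [0, 2**bits) into contiguous ranges whose sizes differ by at most one key. The helper name `partition_bounds` is hypothetical, and the use of inclusive upper bounds is an assumption, since the excerpt is truncated before it says what the upper bound should be.

```python
def partition_bounds(bits: int, partitions: int) -> list[tuple[int, int]]:
    """Split the keyspace [0, 2**bits) into contiguous (lower, upper) ranges.

    Both bounds are inclusive (an assumption, not taken from the question).
    Range sizes differ by at most one key when the keyspace size is not
    evenly divisible by the number of partitions.
    """
    size = 1 << bits  # total number of keys; Python ints are arbitrary precision
    if partitions < 1 or partitions > size:
        raise ValueError("partitions must be between 1 and 2**bits")

    bounds = []
    for i in range(partitions):
        # Integer arithmetic avoids the rounding errors floats would
        # introduce for a 128-bit keyspace.
        lower = i * size // partitions
        upper = (i + 1) * size // partitions - 1
        bounds.append((lower, upper))
    return bounds


def partition_of(key: int, bits: int, partitions: int) -> int:
    """Return the index of the partition that owns `key`."""
    size = 1 << bits
    if not 0 <= key < size:
        raise ValueError("key outside keyspace")
    # Inverse of the `lower` formula above: the largest i with
    # i * size // partitions <= key.
    return ((key + 1) * partitions - 1) // size
```

For a 128-bit keyspace, `partition_bounds(128, n)` returns n ranges that cover every key exactly once, and `partition_of` maps a key back to its range without scanning the list.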
0
votes
0 answers

Divide dataset into training and testing dataset

I have two datasets of images: subjects 1-200, each having c (e.g. c=8) images per subject. Now I want to divide these two datasets into training and testing sets for my algorithm. I typically want to do it for the following cases: CASES…
roni
  • 1,443
  • 3
  • 28
  • 49
0
votes
1 answer

Updating Kafka Event Log

I am using Kafka as a pipeline to store analytics data before it gets flushed to S3 and ultimately to Redshift. I am thinking about the best architecture to store data in Kafka, so that it can easily be flushed to a data warehouse. The issue is…
Scott Switzer
  • 1,064
  • 1
  • 15
  • 25
0
votes
1 answer

Finding next first record for UserID using first registered row

I'm getting a bit tied down with this and hope to find a solution. Say I have a data set like this: PersonID RowID Reg_date Reg_Time Process_first_Date Process_first_time Process_Last_Date Process_Last_time End_date …
Andrew
  • 1,728
  • 8
  • 28
  • 39
0
votes
1 answer

Which approach is better for increasing performance of a stored procedure

I have a stored procedure (sp) which has to select data from 8 tables; each select query has a lot of WHERE clauses, and each table has thousands of rows of data. Now, the requirement is to increase the performance of this sp. The approaches mentioned below are suggested…
Onki
  • 1,879
  • 6
  • 38
  • 58