
I have a dataset of 100k unique records. To benchmark my code, I need to test on 5 million unique records, but I don't want to generate purely random data. I would like to use the 100k records I have as the base dataset and generate the remaining records so that they resemble it, with unique values in certain columns. How can I do that using Python or Scala?

Here's the sample data

latitude   longitude  step count
25.696395   -80.297496  1   1
25.699544   -80.297055  1   1
25.698612   -80.292015  1   1
25.939942   -80.341607  1   1
25.939221   -80.349899  1   1
25.944992   -80.346589  1   1
27.938951   -82.492018  1   1
27.944691   -82.48961   1   3
28.355484   -81.55574   1   1

Each pair of latitude and longitude should be unique across the generated data. I should also be able to set min and max values for these columns.

namrutha
  • You want to generate synthetic data, yes? Maybe if you gave an example record and how you want to permute it we might be able to help. – guidoism Apr 07 '18 at 01:47
  • @guidoism Thank you I will update the question with sample record – namrutha Apr 07 '18 at 02:00
  • Each latitude should be unique and each longitude should be unique, or do you mean each pair should be unique? How is 5mil records generated from the existing dataset different from 5mil records of random lat-longs? What are the properties that you're trying to preserve? – jwvh Apr 07 '18 at 04:24
  • @jwvh Each pair should be unique, I updated the question, Thank you – namrutha Apr 07 '18 at 04:30

2 Answers

You can easily generate data that follows a normal distribution using R. Follow these steps:

# Read only the latitude and longitude columns into a data frame
library(data.table)
data <- fread("data.csv", sep = ",", select = c("latitude", "longitude"))

# Remove duplicate pairs and rows with missing values
df <- data.frame(Lat = data$latitude, Lon = data$longitude)
df2 <- na.omit(unique(df))

# Determine the mean and standard deviation of the latitude and longitude values
meanLat <- mean(df2$Lat)
meanLon <- mean(df2$Lon)
sdLat <- sd(df2$Lat)
sdLon <- sd(df2$Lon)

# Draw 1 million new records from normal distributions with those parameters
# (rnorm replaces the manual sum-of-12-uniforms approximation)
n <- 1000000
newData <- data.frame(Lat = rnorm(n, meanLat, sdLat), Lon = rnorm(n, meanLon, sdLon))

# Combine old and new records, dropping any pair that happens to repeat
finalData <- unique(rbind(df2, newData))

finalData now contains both the old and the new records.

Write the finalData data frame to a CSV file; you can then read it from Scala or Python.
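
Since the question asks for Python or Scala, the same idea can be sketched in plain Python. This is a minimal sketch under stated assumptions, not a tuned implementation: the function name `generate_unique_points`, its parameters, and the choice to round to 6 decimal places are all hypothetical, and the min/max bounds are enforced by simply rejecting out-of-range draws.

```python
import random
import statistics


def generate_unique_points(base, n_total, lat_range, lon_range, decimals=6, seed=0):
    """Extend `base` (a list of (lat, lon) tuples) to `n_total` unique pairs.

    New points are drawn from normal distributions fitted to the base data.
    A draw is discarded if it falls outside the given (min, max) ranges, and
    the set drops any pair that already exists after rounding.
    """
    rng = random.Random(seed)
    lats = [p[0] for p in base]
    lons = [p[1] for p in base]
    mean_lat, sd_lat = statistics.mean(lats), statistics.stdev(lats)
    mean_lon, sd_lon = statistics.mean(lons), statistics.stdev(lons)

    # Start from the existing records so they are kept in the result.
    points = {(round(lat, decimals), round(lon, decimals)) for lat, lon in base}
    while len(points) < n_total:
        lat = round(rng.gauss(mean_lat, sd_lat), decimals)
        lon = round(rng.gauss(mean_lon, sd_lon), decimals)
        in_range = (lat_range[0] <= lat <= lat_range[1]
                    and lon_range[0] <= lon <= lon_range[1])
        if in_range:
            points.add((lat, lon))  # adding an existing pair is a no-op
    return sorted(points)


# Illustrative call using a few rows from the question's sample data.
base = [(25.696395, -80.297496), (25.699544, -80.297055),
        (27.938951, -82.492018), (28.355484, -81.55574)]
data = generate_unique_points(base, 1000, lat_range=(24.0, 30.0), lon_range=(-84.0, -79.0))
```

The loop only terminates if the chosen ranges leave room for `n_total` distinct pairs at the chosen precision, and it gets slow when the ranges sit far from the fitted mean, so the bounds should be picked with the base data in mind.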

arjunsv3691

If you want to generate the data in Scala only, try this:

val r = new scala.util.Random   // create a Scala random number generator
val new_val = r.nextFloat() // each call returns the next random float in [0.0, 1.0)

Add new_val to the maximum latitude value in your data. A unique latitude makes the whole (latitude, longitude) pair unique.

You can apply the same idea with Spark and Scala.

val latLongDF = ss.read.option("header", true).option("delimiter", ",").format("csv").load(mypath)   // load the sample data from the question as a DataFrame
+---------+----------+----+-----+
| latitude| longitude|step|count|
+---------+----------+----+-----+
|25.696395|-80.297496|   1|    1|
|25.699544|-80.297055|   1|    1|
|25.698612|-80.292015|   1|    1|


import org.apache.spark.sql.functions.{col, max, udf}

val max_lat = latLongDF.select(max(col("latitude").cast("double"))).first.getDouble(0) // maximum latitude in the existing data (cast first, because the CSV columns load as strings)

val r = new scala.util.Random // Scala random object used to draw random numbers

val nextLat = udf(() => (max_lat + 0.000001 + r.nextFloat()).toFloat) // UDF returning a random latitude above the existing maximum

In the line above, toFloat narrows the value to a 32-bit float, which can produce duplicate values. Remove it to keep the full precision if you are fine with more than 6 decimal places in your latitudes, or apply the same method to longitude as well for better uniqueness.
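
To illustrate why narrowing to a 32-bit float can create duplicates, here is a small Python sketch. The helper `to_float32` and the two sample latitudes are made up for the demonstration; the helper uses the standard `struct` module to round a 64-bit value to the nearest 32-bit float.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Two hypothetical latitudes that differ only in the seventh decimal place.
a = 28.4461291
b = 28.4461292

print(a == b)                          # False: distinct as 64-bit doubles
print(to_float32(a) == to_float32(b))  # True: both narrow to the same 32-bit float
```

Near 28, adjacent 32-bit floats are about 0.000002 apart, so a float column cannot reliably tell apart latitudes that differ by less than that.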

val new_df = latLongDF.withColumn("new_lat", nextLat()).select(col("new_lat").alias("latitude"),$"longitude",$"step",$"count").union(latLongDF) // build the new rows and union them with the existing DataFrame

Sample of the newly generated data:

+----------+----------+----+-----+
|  latitude| longitude|step|count|
+----------+----------+----+-----+
| 28.446129|-80.297496|   1|    1|
| 28.494934|-80.297055|   1|    1|
| 28.605234|-80.292015|   1|    1|
| 28.866316|-80.341607|   1|    1|
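
A variation on this idea, sketched in Python: instead of adding a random offset, step the new latitudes deterministically above the existing maximum, which guarantees uniqueness and makes it easy to respect an upper bound. This swaps the random draw for fixed increments of 0.000001, so it is not the method shown above; the function name `shift_latitudes` and the `lat_max` parameter are hypothetical.

```python
def shift_latitudes(records, n_new, lat_max):
    """Append n_new rows whose latitudes step upward from the current maximum.

    `records` is a list of (latitude, longitude, step, count) tuples. Each new
    row reuses the other columns of a base row and gets a latitude that is a
    whole number of microdegrees above the existing maximum, so no
    (latitude, longitude) pair can repeat.
    """
    base_units = round(max(r[0] for r in records) * 1e6)  # max latitude in microdegrees
    if (base_units + n_new) / 1e6 > lat_max:
        raise ValueError("not enough room below lat_max for n_new unique latitudes")

    new_records = []
    for i in range(n_new):
        _, lon, step, count = records[i % len(records)]  # cycle through base rows
        new_lat = (base_units + i + 1) / 1e6
        new_records.append((new_lat, lon, step, count))
    return records + new_records


# Illustrative call using rows from the question's sample data.
records = [(25.696395, -80.297496, 1, 1), (27.944691, -82.48961, 1, 3),
           (28.355484, -81.55574, 1, 1)]
extended = shift_latitudes(records, 10, lat_max=29.0)
```

The trade-off is that the new latitudes form a narrow band just above the old maximum rather than following the shape of the original data, which may or may not matter for a benchmark.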
Praveen L