
TLDR: I created a dask dataframe from a dask bag. The dask dataframe treats every observation (event) as a column. So, instead of having rows of data for each event, I have a column for each event. The goal is to transpose the columns to rows in the same way that pandas can transpose a dataframe using df.T.

Details: I have sample Twitter data from my timeline here. To get to my starting point, here is the code to read a JSON file from disk into a dask.bag and then convert that into a dask.dataframe:

import dask.bag as db
import dask.dataframe as dd
import json


b = db.read_text('./sampleTwitter.json').map(json.loads)
df = b.to_dataframe()
df.head()

The Problem: All my individual events (i.e. tweets) are recorded as columns rather than rows. In keeping with tidy data principles, I would like a row for each event. pandas has a transpose method for dataframes and dask.array has a transpose method for arrays. My goal is to do the same transpose operation, but on a dask dataframe. How would I do that? (A workaround sketch follows the list below.)

  1. Convert columns to rows (i.e. transpose the dask dataframe)
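As noted in the comments below, dask.dataframe does not implement a transpose. If the computed result fits in memory, one workaround is to collect to pandas, transpose there, and go back to dask if needed. A minimal sketch, continuing from the df built above (the npartitions value is just an example):

import dask.dataframe as dd

pdf = df.compute()                              # pandas DataFrame: one column per tweet
tidy = pdf.T                                    # transpose in pandas: one row per tweet
tidy_ddf = dd.from_pandas(tidy, npartitions=4)  # back to dask if needed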

Edit for solution

This code resolves the original transpose problem, cleans the Twitter JSON files by defining the columns you want to keep and dropping the rest, and creates a new column by applying a function to a Series. Then we write a MUCH smaller, cleaned file to disk.

import dask.dataframe as dd
from dask.delayed import delayed
import dask.bag as db
from dask.diagnostics import ProgressBar, Profiler, ResourceProfiler, CacheProfiler
import pandas as pd
import json
import glob
import os


# pull in all the files (glob does not expand '~', so expand it explicitly)
filenames = glob.glob(os.path.expanduser('~/sampleTwitter*.json'))


# build a dask.dataframe directly from the JSON files (one partition per file)
dfs = [delayed(pd.read_json)(fn, 'records') for fn in filenames]
df = dd.from_delayed(dfs)


# see all the fields of the dataframe 
fields = list(df.columns)

# identify the fields we want to keep
keepers = ['coordinates', 'id', 'user', 'created_at', 'lang']

# remove the keepers from the full column list, leaving only the fields to drop
for f in keepers:
    if f in fields:
        fields.remove(f)

# drop everything else, keeping only what's necessary
df = df.drop(fields, axis=1)

# build a (lon, lat) tuple from the nested GeoJSON 'coordinates' field,
# guarding against tweets that have no coordinates (NaN/None)
clean = df.coordinates.apply(
    lambda x: (x['coordinates'][0], x['coordinates'][1]) if isinstance(x, dict) else None,
    meta=('coords', object))
df['coords'] = clean

# derive new filenames from the old ones for the cleaned output
# (strip the directory and extension, then append 'cleaned.json')
newfilenames = []
for fn in filenames:
    base = os.path.splitext(os.path.basename(fn))[0]
    newfilenames.append(base + 'cleaned.json')

# custom saver function: write one partition to one of the new filenames
def saver(frame, filename):
    return frame.to_json('./' + filename)

# convert the dask dataframe back into a list of delayed partitions
dfs = df.to_delayed()

# build the write tasks lazily; delayed(saver)(...) defers the call instead of
# executing it immediately
writes = [delayed(saver)(frame, fn) for frame, fn in zip(dfs, newfilenames)]

# writing the cleaned, MUCH smaller objects back to disk
dd.compute(*writes)
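The diagnostics imported at the top are otherwise unused; as an optional touch, wrapping the final compute in ProgressBar gives feedback while the lazy writes run:

with ProgressBar():
    dd.compute(*writes)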
    Transpose is not implemented. It probably isn't possible without doing a large shuffle. I recommend raising an issue for discussion. – MRocklin Aug 04 '16 at 12:20
    Having transpose would be nice, but I'm not clear it would be the right tool for the job here. What will your real data look like? Many small json files that you're working with? If so, we can parse each of them sensibly then make a dask dataframe out of them. One big JSON file? Then your use of bag hasn't helped yet, since it remains serial. If you can afford serial at this stage, just use raw pandas for the parsing (possibly switching to dask after); if you can't, we might have to change tactics more dramatically. – Mike Graham Aug 04 '16 at 16:36
  • @Mike, the real data will be lots of JSON files merged into a single dask.dataframe using a glob string, similar to the read_csv examples in dask. I am trying to combine dask with datashader to build a map of languages... the lang column in each tweet is converted to a category. With 150k tweets, I have some continents and major cities showing up, but I need lots more to fill out the other continents. Then I'll have to go to other platforms to fill in countries where Twitter isn't popular, and merge it all into one big data viz. – Linwoodc3 Aug 04 '16 at 17:28
  • To get the structure of the data and the outlines of the countries, I need millions of tweets. I pull data routinely, write the JSON to disk (internet goes down, Twitter quota) and store it in a common directory, so it will be merged JSON files at the end of the day. To do the munging and cleaning, I'll need to do a .apply over the dataframe to create a lang column, change the projection from WGS84 to Web Mercator format (see the sketch after these comments), etc. – Linwoodc3 Aug 04 '16 at 17:28
  • I handle 150k tweets fine with pandas, but you see how my memory problem will bite me as I get more data. – Linwoodc3 Aug 04 '16 at 17:29
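As a rough illustration of the projection change mentioned in the comments (WGS84 longitude/latitude to Web Mercator), here is a minimal sketch; the web_mercator helper and the coords_mercator column name are illustrative, and it assumes the coords tuples built above are (lon, lat) as the Twitter API provides them:

import numpy as np

R = 6378137.0  # earth radius used by Web Mercator, in metres

def web_mercator(coords):
    # convert a (lon, lat) pair in WGS84 degrees to Web Mercator metres
    if coords is None:
        return None
    lon, lat = coords
    x = np.radians(lon) * R
    y = np.log(np.tan(np.pi / 4 + np.radians(lat) / 2)) * R
    return (x, y)

df['coords_mercator'] = df.coords.apply(web_mercator, meta=('coords_mercator', object))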

1 Answer


I think you can get the result you want by bypassing bag altogether, with code like

import glob

import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

filenames = glob.glob('sampleTwitter*.json')
dfs = [delayed(pd.read_json)(fn, 'records') for fn in filenames]
ddf = dd.from_delayed(dfs)
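A note on why this sidesteps the transpose: with orient='records' each JSON object is parsed as a row, so the resulting dask dataframe already has one row per tweet. If the files are line-delimited JSON (one tweet per line, which the bag-based snippet in the question suggests), pandas may also need lines=True, e.g.:

dfs = [delayed(pd.read_json)(fn, orient='records', lines=True) for fn in filenames]
ddf = dd.from_delayed(dfs)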
  • I am facing the same problem but with a dataframe of type `dask.dataframe.core.DataFrame` whose elements are `float64`s. I am hoping to use the above approach with 2 GPUs, but I can't work out how to adapt it to my case. Can you please elaborate on how the above code works? My data does not fit in the RAM of the GPU (dataframe size >> GPU memory). – Quiescent Aug 18 '22 at 20:06
  • Hm, that does not really answer the question of how to transpose a dataframe in dask. – Soerendip Dec 07 '22 at 00:31