We need to extract the events table from GA4 through BigQuery (rather than connecting to the Google API directly, which limits both the number of rows and the number of dimensions/metrics). However, since the table has several nested columns, ADF reads the data in the following format:
{
"v": [{
"v": {
"f": [{
"v": "firebase_conversion"
}, {
"v": {
"f": [{
"v": null
}, {
"v": "0"
}, {
"v": null
}, {
"v": null
}]
}
}]
}
}, {
"v": {
"f": [{
"v": "ga_session_id"
}, {
"v": {
"f": [{
"v": null
}, {
"v": "123"
}, {
"v": null
}, {
"v": null
}]
}
}]
}
}, {
"v": {
"f": [{
"v": "engaged_session_event"
}, {
"v": {
"f": [{
"v": null
}, {
"v": "1"
}, {
"v": null
}, {
"v": null
}]
}
}]
}
}, {
"v": {
"f": [{
"v": "ga_session_number"
}, {
"v": {
"f": [{
"v": null
}, {
"v": "9"
}, {
"v": null
}, {
"v": null
}]
}
}]
}
}, {
"v": {
"f": [{
"v": "page_referrer"
}, {
"v": {
"f": [{
"v": "ABC"
}, {
"v": null
}, {
"v": null
}, {
"v": null
}]
}
}]
}
}, {
"v": {
"f": [{
"v": "page_title"
}, {
"v": {
"f": [{
"v": "ABC"
}, {
"v": null
}, {
"v": null
}, {
"v": null
}]
}
}]
}
}, {
"v": {
"f": [{
"v": "page_location"
}, {
"v": {
"f": [{
"v": "xyz"
}, {
"v": null
}, {
"v": null
}, {
"v": null
}]
}
}]
}
}, {
"v": {
"f": [{
"v": "session_engaged"
}, {
"v": {
"f": [{
"v": null
}, {
"v": "1"
}, {
"v": null
}, {
"v": null
}]
}
}]
}
}]
}
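For context, this `f`/`v` nesting is how BigQuery's tabledata REST API serializes a repeated RECORD column (here `event_params`): each `f` is an ordered list of fields and each `v` a value, with the key in slot 0 and the four typed value slots (string, int, float, double) nested under slot 1, mostly null. A minimal Python sketch of a decoder for one such cell (function names are my own, not part of any library):

```python
def decode_param(param_v):
    """Decode one event_params entry from the f/v wire format.

    f[0] holds the param key; f[1] wraps the four typed value
    slots (string, int, float, double), of which at most one is set.
    """
    fields = param_v["v"]["f"]
    key = fields[0]["v"]
    value_slots = [slot["v"] for slot in fields[1]["v"]["f"]]
    # Keep the first non-null slot; the rest are the null placeholders.
    value = next((v for v in value_slots if v is not None), None)
    return key, value

def decode_event_params(cell):
    """Flatten the whole repeated column into a plain dict."""
    return dict(decode_param(p) for p in cell["v"])
```

Applied to the sample above, this would yield a flat dict such as `{"firebase_conversion": "0", "ga_session_id": "123", ...}`, with the null placeholders already stripped.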
Unnesting is a problem: there are several columns with this structure, and UNNEST multiplies the number of rows (3.5 mn records become ~40 mn). The current plan is to extract the data as-is and use an Azure Function in Python to flatten it into plain JSON, but the null values are causing trouble there as well.
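To illustrate the row multiplication (a toy sketch with hypothetical, already-decoded events): an UNNEST-style explode emits one row per (event, param) pair, whereas pivoting the known param keys into columns keeps one row per event, with absent keys becoming nulls instead of extra rows.

```python
# Hypothetical mini-batch of events, params already decoded to dicts.
events = [
    {"event_name": "page_view",
     "params": {"ga_session_id": "123", "page_title": "ABC"}},
    {"event_name": "click",
     "params": {"ga_session_id": "123", "session_engaged": "1"}},
]

# UNNEST-style explode: one row per (event, param) pair -> row count grows.
exploded = [(e["event_name"], k, v)
            for e in events for k, v in e["params"].items()]

# Pivot instead: known keys become columns, one row per event.
COLUMNS = ["ga_session_id", "page_title", "session_engaged"]
pivoted = [
    {"event_name": e["event_name"],
     **{c: e["params"].get(c) for c in COLUMNS}}
    for e in events
]

print(len(exploded))  # 4 rows after explode
print(len(pivoted))   # 2 rows after pivot, same as the input
```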
Can someone suggest the best way to land this data in the data lake on a daily basis, in the desired flat format, without the row multiplication?