1

I'm trying to programmatically delete all the rows in a Socrata dataset, using the SODA API. I don't want to delete the dataset itself, because then I'd have lots of work to reconstruct it. It's also very slow to do this for thousands of rows.

Previously, I iterated slowly over each row, deleting each, until someone on Twitter, suggested doing an Upsert with no rows. I believe that I implemented this and it worked, but now it doesn't.

Here's the relevant bit of code:

headers = {'X-App-Token': config.app_token}
headers['Content-Type'] = 'application/json'

r = requests.put(config.dataset + '.json', data='[ ]', headers=headers, auth=config.auth)
print r.status_code, r.text

The output is:

200 {
  "By RowIdentifier" : 0,
  "Rows Updated" : 0,
  "Rows Deleted" : 0,
  "Rows Created" : 0,
  "Errors" : 0,
  "By SID" : 0
}

(So I think it's safe to say that the problem is not to do with authentication, authorisation, etc? My other functions to upsert rows work fine, so it's also not a problem with a wrong dataset URL or anything.)

I also query the number of rows before and afterwards, and there is no change. Still thousands of rows.

As far as I can tell, I'm following the API document for Replacing rows in bulk.

The only thing I can think of is that due to a bug, I have multiple rows with the same rowid.

EDIT

Here are some duplicate row identifiers:

enter image description here

rowid is definitely set as a row identifier:

enter image description here

Now given that row identifiers are supposed to "essentially act the same way as primary keys", I'm starting to wonder if this is a bug, or something has gone horribly wrong? The publishing code looks like this:

def publishDataset(rows):
  r = requests.post(config.dataset, data=simplejson.dumps(rows), headers = headers, auth=config.auth)
  j = r.json()
  print
  if r.status_code != 200:
    raise RuntimeError ( "%d Socrata error: %s" % (r.status_code, j['message']))
  return j

Full code here: https://github.com/stevage/meshlium-soda

Steve Bennett
  • 114,604
  • 39
  • 168
  • 219

1 Answers1

1

Sorry for the delay in getting back to you. I think there's a bug in the documentation, and you should use a DELETE on the API endpoint instead:

%> curl --verbose -X DELETE --header "X-App-Token: [REDACTED]" --user [REDACTED] https://soda.demo.socrata.com/resource/ppbu-8a96.json
Enter host password for user '[REDACTED]':
* Hostname was NOT found in DNS cache
*   Trying 216.227.229.224...
* Connected to soda.demo.socrata.com (216.227.229.224) port 443 (#0)
* TLS 1.0 connection using TLS_DHE_RSA_WITH_AES_256_CBC_SHA
* Server certificate: *.demo.socrata.com
* Server certificate: AlphaSSL CA - SHA256 - G2
* Server certificate: GlobalSign Root CA
* Server auth using Basic with user '[REDACTED]'
> DELETE /resource/ppbu-8a96.json HTTP/1.1
> Authorization: Basic [REDACTED]
> User-Agent: curl/7.37.1
> Host: soda.demo.socrata.com
> Accept: */*
> X-App-Token: [REDACTED]
>
< HTTP/1.1 200 OK
* Server nginx is not blacklisted
< Server: nginx
< Date: Fri, 09 Jan 2015 06:31:16 GMT
< Content-Type: application/json; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Access-Control-Allow-Origin: *
< X-Socrata-Region: production
< Age: 14
<

That should do the trick. If it works for you too, I'll update my docs. Sorry for the confusion!

chrismetcalf
  • 1,556
  • 1
  • 8
  • 7
  • 1
    Huh, that does work. But the implications are confusing. 1. The [PUT documentation](http://dev.socrata.com/publishers/replace.html) is completely wrong, and there's no way to actually replace a dataset contents? 2. This use of DELETE without a row identifier is undocumented. 3. This use of DELETE to delete *contents* of a resource instead of *the resource itself* might not be standard. 4. Why the hell did it work before?? :) – Steve Bennett Jan 10 '15 at 00:05