13

I have an SQLite database (110 KB) in an S3 bucket. I want to connect to that database every time I run my Python application.

One option is to download the database every time I run the Python application and connect to it. But I want to know if there is a way to connect to that SQLite database in memory, using S3FileSystem and open.

I'm using the sqlite3 library in Python 3.6.

Joaquin
  • Related: https://stackoverflow.com/questions/3833162/can-i-use-the-sqlite-as-a-db-storage-for-cloud-based-websites/3833191#3833191 – C Perkins Jun 26 '19 at 18:38

5 Answers

21

As other answers indicate, you probably don't want to use SQLite as a primary database in the cloud.

However, as part of a fun side project I wrote an Amazon Athena data source connector that allows you to query SQLite databases in S3 from Athena. In order to do that, I wrote a read-only SQLite interface for S3.

SQLite has a concept of an OS Interface or VFS. Using a Python SQLite wrapper called APSW, you can write a VFS implementation for arbitrary filesystems. This is what I did in my project and I've included the implementation below.

In order to use this, you would first register the VFS and then create a new SQLite connection with this implementation as the driver.

I should note this isn't optimized at all, so it will likely still end up reading the full database from S3 depending on your queries. But that doesn't sound like an issue in this specific case.

S3FS = S3VFS()  # S3VFS defined below; instantiating it registers the VFS with SQLite

# This odd format is used due to SQLite URI requirements
sqlite_uri = "file:/{}/{}.sqlite?bucket={}&immutable=1".format(
    S3_PREFIX,
    DATABASE_NAME,
    S3_BUCKET,
)

connection = apsw.Connection(
    sqlite_uri,
    flags=apsw.SQLITE_OPEN_READONLY | apsw.SQLITE_OPEN_URI,
    vfs=S3FS.vfsname,
)
cursor = connection.cursor()

Once you have the cursor, you can execute standard SQL statements like so:

for x, y, z in cursor.execute("select x,y,z from foo"):
    print(cursor.getdescription())  # shows column names and declared types
    print(x, y, z)

VFS Implementation (requires APSW library and boto3 for S3 connectivity)

import apsw
import boto3

VFS_S3_CLIENT = boto3.client('s3')


class S3VFS(apsw.VFS):
    def __init__(self, vfsname="s3", basevfs=""):
        self.vfsname = vfsname
        self.basevfs = basevfs
        apsw.VFS.__init__(self, self.vfsname, self.basevfs)

    def xOpen(self, name, flags):
        return S3VFSFile(self.basevfs, name, flags)


class S3VFSFile():
    def __init__(self, inheritfromvfsname, filename, flags):
        self.bucket = filename.uri_parameter("bucket")
        self.key = filename.filename().lstrip("/")
        print("Initiated S3 VFS for file: {}".format(self._get_s3_url()))

    def xRead(self, amount, offset):
        # HTTP byte ranges are inclusive, so the last requested byte is offset + amount - 1
        response = VFS_S3_CLIENT.get_object(
            Bucket=self.bucket,
            Key=self.key,
            Range='bytes={}-{}'.format(offset, offset + amount - 1)
        )
        return response['Body'].read()

    def xFileSize(self):
        response = VFS_S3_CLIENT.head_object(Bucket=self.bucket, Key=self.key)
        return response['ContentLength']

    def xClose(self):
        pass

    def xFileControl(self, op, ptr):
        return False

    def _get_s3_url(self):
        return "s3://{}/{}".format(self.bucket, self.key)
dacort
  • Hello, how are you? I was trying your solution and I have an error: SQLError: SQL logic error. I followed your steps to perfection. The error is in connection = apsw.Connection(sqlite_uri, flags=apsw.SQLITE_OPEN_READONLY | apsw.SQLITE_OPEN_URI, vfs=S3FS.vfsname ) – Juan Almada Nov 01 '22 at 13:45
  • @JuanAlmada Likely an apsw version issue. I wrote my example on version 3.30.1-r1 but when tested with the latest version (3.39.4.0) it doesn't work. 3.38.5.post1 *does* work, though. Looks like xFullPathname needs to be implemented in S3VFS in versions >= 3.39. – dacort Nov 02 '22 at 22:02
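  • For APSW >= 3.39, a minimal sketch of that xFullPathname override (an untested assumption based on the APSW VFS interface; a pass-through should suffice since the "path" here is really an S3 key):

class S3VFS(apsw.VFS):
    ...  # __init__ and xOpen as above

    def xFullPathname(self, name):
        # There is no local filesystem to canonicalise against; return the name unchanged
        return name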
10

Other answers explain how the sqlite engine can register an alternative "file system" that can work with database files in various cloud storage scenarios. This includes both custom solutions and updated cloud platform solutions. If these systems satisfy your need, by all means go for it.

These possibilities forced me to rethink my previous definite "No" answer, but they ultimately do not change the underlying nature of the sqlite engine. It is still true that the engine has no built-in network/web API and is not designed for that; that is not my opinion, the official website says as much. It is still true that, even with an alternative file system, the engine works with database "files", so persisting and loading in-memory databases still necessitates a "file system". So for most applications and most developers who want an optimal and/or plug-and-play solution that matches the design goals of sqlite, my original answer stands.

Original answer:

No, it is not possible to connect directly to an sqlite database stored in the cloud. Even if you want to host the database in memory, it must still be downloaded completely before being loaded into memory: the database must first be read from a disc-based file, or created directly in memory with DDL commands. See Loading and Saving In-Memory Databases.

In such a scenario, once the database is disconnected, it would need to be re-uploaded to cloud storage. S3FileSystem.open just returns a data stream; all the stream allows you to do is download the file to a local store so that it can be opened and manipulated locally.
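
For reference, a minimal sketch of that download-then-connect approach using boto3 (the bucket, key, and table names are placeholders):

import sqlite3
import tempfile

import boto3

s3 = boto3.client('s3')

# Download the database to a local temporary file, then open it normally
with tempfile.NamedTemporaryFile(suffix='.sqlite') as tmp:
    s3.download_fileobj('my-bucket', 'my-db.sqlite', tmp)
    tmp.flush()
    connection = sqlite3.connect(tmp.name)
    for row in connection.execute('SELECT * FROM foo'):
        print(row)
    connection.close()
    # Any changes would need to be uploaded back afterwards:
    # s3.upload_file(tmp.name, 'my-bucket', 'my-db.sqlite')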

If you really need a cloud database, you need to look into another sort of hosted database.

C Perkins
  • Here's [another answer](https://www.quora.com/Can-we-have-cloud-SQLite-databases) that gives more insight into the purpose of sqlite. – C Perkins Jun 26 '19 at 18:24
  • I am not sure an SQLite database always needs to be downloaded entirely in order to be used. That may be the case for S3, if you can't fetch arbitrary portions of a file. There are SQLite implementations that can query a database stored remotely without downloading it and without a remote process loading the database into RAM and serving as some sort of proxy. – Naikrovek Jun 09 '22 at 23:40
  • @Naikrovek Hmm... how can any process query the data without actually downloading that data? I use "download" in a very general sense here. If you know of a cloud implementation that allows you to get a portion of a SQLite file AND then at the other end properly interpret that portion, then somehow query data from it... well that is some unnecessarily-complex magic, but I doubt that exists. What would be the benefit of sophistication and complexity for a DB engine and file format which are not designed for that? Others already revealed a cloud "file system" solution. – C Perkins Jun 10 '22 at 16:44
  • never heard of an index, huh? "unnecessarily-complex magic?" ok. you're very smart, thank you for correcting me. here is a working example of this unnecessarily complex magic that doesn't exist: https://phiresky.github.io/blog/2021/hosting-sqlite-databases-on-github-pages/ – Naikrovek Jun 11 '22 at 20:10
  • @Naikrovek Somebody made the magic happen! So go ahead and post a new answer with real detail. Starting a comment with "I am not sure..." and then describing some convoluted possibility isn't very helpful. Let's take a step back however and recognize that the SQLite engine still expects some DB file. From the link: "... empty filesystem except for a file called /wdi.sqlite3 that it can read from". Apparently various file systems can be rigged for SQLite, yet the SQLite engine itself still has inherent limitations. That is what my answer was addressing. – C Perkins Jun 11 '22 at 21:13
  • I *wasn't* sure when I typed the "I am not sure" comment, then I got the response I got, and out of spite, I searched around and found someone that did it, so I typed the second comment. I knew it was possible, I just didn't know if anyone had done it. And the arrogance of the reply to me was MORE than confidently wrong enough for me to find it myself and prove them wrong. – Naikrovek Jun 13 '22 at 14:32
9

Yes, it's possible with EFS:

https://www.lambrospetrou.com/articles/aws-lambda-and-sqlite-over-efs/

AWS recently released an integration between AWS Lambda and Amazon EFS. It supports NFSv4 lock upgrading/downgrading, which SQLite needs. This means the SQLite engine can have read/write access to files stored on an EFS filesystem.
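
Once EFS is mounted into the Lambda, the connection itself is plain sqlite3 against the mount path. A sketch, assuming the access point is mounted at /mnt/data and a table foo exists:

import sqlite3

DB_PATH = '/mnt/data/my-db.sqlite'  # assumed EFS mount point

def handler(event, context):
    connection = sqlite3.connect(DB_PATH)  # read/write works thanks to NFSv4 locking
    try:
        connection.execute('INSERT INTO foo (x) VALUES (?)', (event.get('x'),))
        connection.commit()
        return list(connection.execute('SELECT x FROM foo'))
    finally:
        connection.close()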

Alex B
2

(Inspired by dacort's answer)

If the database is only used for reads, there is https://github.com/michalc/sqlite-s3-query (full disclosure: written by me).

Modifying an example from its README, say you have my-db.sqlite in a bucket my-bucket in eu-west-2 (and credentials in environment variables):

from sqlite_s3_query import sqlite_s3_query

with \
        sqlite_s3_query(url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite') as query, \
        query('SELECT * FROM my_table WHERE my_column = ?', params=('my-value',)) as (columns, rows):

    for row in rows:
        print(row)
Michal Charemza
0

If all your actions are limited to reading from SQLite, I guess it is possible, but I have no idea whether writing is also possible. In my case, I'm using GDAL (requires libgdal), and GDAL's /vsis3 and /vsis3_streaming (based on /vsicurl) give you the ability to read SQLite and many other data sources from the cloud. If you want to use raw SQLite rather than going through GDAL's data source layer, you can write the data to your local database via GDAL's API; but if so, why not just download it and read it?

For me, since I'm working on spatial data and GDAL's DataSource provides a lot of APIs to manipulate spatial data, this method works fine. I'm still looking for a good way to write to a cloud-based SQLite.

FYI, here's the documentation for GDAL's virtual file systems: https://gdal.org/user/virtual_file_systems.html
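
For example, a sketch of reading through /vsis3 with GDAL's Python bindings (the bucket, key, region, and field index are placeholders; credentials come from the usual AWS environment variables):

from osgeo import gdal, ogr

gdal.SetConfigOption('AWS_REGION', 'eu-west-2')

# /vsis3 fetches byte ranges over HTTP instead of downloading the whole file
datasource = ogr.Open('/vsis3/my-bucket/my-db.sqlite')
layer = datasource.GetLayer(0)
for feature in layer:
    print(feature.GetField(0))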

  • You can perform object locks on S3 to implement consistent write operations to the SQLite file. Add an asynchronous API that can queue up writes in front of that with eventual consistency model and it seems like a workable system. – ricosrealm Jan 01 '21 at 02:33