
I'm experimenting with PyBuilder because I'm looking for a more organised and production-oriented way of developing data science projects.

So far, I've created a PyBuilder project with the following structure (folders are uppercased for readability):

PROJECT
 |   build.py
 |   setup.py
 +-- .ENV
 |   +-- ...
 +-- SRC
 |   +-- MAIN
 |   |   +-- FIXTURES
 |   |   |   +-- data.csv
 |   |   +-- PYTHON
 |   |   |   +-- code.py
 |   |   +-- SCRIPTS
 |   |       +-- run.py
 |   +---TEST
 |       +-- FIXTURES
 |       |   +-- values.csv
 |       +-- PYTHON
 |           +-- test_code.py
...

build.py and setup.py are PyBuilder-generated files. .env contains the virtual environment (Python 3.7). src\main and src\test have the usual structure, except that each contains an additional fixtures folder (much like resources in Java). In case you wonder, src\test looks like this because I set:

project.set_property("dir_source_unittest_python", "src/test/python")
project.set_property("unittest_module_glob", "test_*")

My intent is as follows:

  • run.py contains a script that calls the code in code.py, for instance to predict tomorrow's weather
  • code.py contains the code that loads the dataset in data.csv and builds a model that provides weather predictions for a given day
  • data.csv contains the historical data that code.py needs to train the weather forecasting model
  • test_code.py contains the unit tests that make sure the model and utility functions in code.py work as expected
  • values.csv contains the input values and expected results that test_code.py uses to test code.py.

My code in code.py accesses data.csv by defining the FIXTURES folder as follows:

FIXTURES = os.path.join(os.path.dirname(__file__), '..', 'fixtures')
...
with open(os.path.join(FIXTURES, 'data.csv'), 'r') as file:
    ...

And I can successfully run the script run.py from within my IDE to generate predictions.

When I try to generate a package to share the predictor with my colleagues, I see that the src\main\fixtures folder is not copied over. After some research (see this question), I managed to work around this by:

  1. Moving the fixtures folder into python

  2. Adding project.include_file("lib/python3.7/site-packages/fixtures", "fixtures/*.csv") to build.py.

However, I would like to keep fixtures where it was initially. In any case, I noticed that run.py fails to execute even though the installation (pyb install) completes successfully. The reason is that data.csv cannot be located:

... 
FileNotFoundError: [Errno 2] File b'/Users/stefano/Workspace/project/.env/lib/python3.7/site-packages/../fixtures/data.csv' does not exist: b'/Users/stefano/Workspace/project/.env/lib/python3.7/site-packages/../fixtures/data.csv'
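
The path in the error follows directly from how FIXTURES is computed relative to __file__. The sketch below reproduces that computation for two locations of code.py. The helper name fixtures_dir is hypothetical, and both module paths are assumptions inferred from the project tree and the error message above:

```python
import posixpath


def fixtures_dir(module_file):
    """Resolve '<module dir>/../fixtures', as code.py does, and normalise it."""
    module_dir = posixpath.dirname(module_file)
    return posixpath.normpath(posixpath.join(module_dir, "..", "fixtures"))


# Assumed location of code.py when run from the source tree.
source = "/Users/stefano/Workspace/project/src/main/python/code.py"
# Assumed location of code.py after `pyb install` (inferred from the error).
installed = "/Users/stefano/Workspace/project/.env/lib/python3.7/site-packages/code.py"

print(fixtures_dir(source))     # /Users/stefano/Workspace/project/src/main/fixtures
print(fixtures_dir(installed))  # /Users/stefano/Workspace/project/.env/lib/python3.7/fixtures
```

So, under these assumptions, the installed module looks for data.csv in lib/python3.7/fixtures inside the virtual environment, not in site-packages.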

Does anyone know how to keep the fixtures folder in src\main (rather than in src\main\python)?

Also, does anyone know how to make files like data.csv discoverable after package installation?

Thanks in advance for any help!

Note: Please be aware that a solution using this structure might not be the most convenient one if data.csv is quite big.

Stefano Bragaglia

1 Answer


I eventually found this part of the original documentation, which suggests the following solution; it works fine for me:

from pybuilder.core import init, use_plugin

use_plugin("copy_resources")
...
@init
def set_properties(project):
    project.get_property("copy_resources_glob").append("src/main/fixtures/*.csv")
    project.set_property("copy_resources_target", "$dir_dist")
    project.install_file("lib/python3.7/fixtures", "src/main/fixtures/data.csv")

Note: For some reason, the last command does not accept a wildcard, i.e. project.install_file("lib/python3.7/fixtures", "src/main/fixtures/*.csv") does not work.
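
One possible workaround for the wildcard limitation is to expand the pattern in build.py and register each match separately. This is only a sketch that I have not verified against a real PyBuilder build. The helper name install_fixtures and its parameters are hypothetical, and it assumes that install_file takes (destination, filename) exactly as in the call above:

```python
import glob
import os


def install_fixtures(project, basedir,
                     pattern="src/main/fixtures/*.csv",
                     destination="lib/python3.7/fixtures"):
    """Call project.install_file once per file matching `pattern` under `basedir`.

    Hypothetical helper: returns the registered paths, relative to `basedir`.
    """
    matches = sorted(glob.glob(os.path.join(basedir, pattern)))
    # Use forward slashes so the paths look the same on every platform.
    relative = [os.path.relpath(path, basedir).replace(os.sep, "/")
                for path in matches]
    for filename in relative:
        project.install_file(destination, filename)
    return relative
```

Inside set_properties, this could presumably be called as install_fixtures(project, project.basedir) in place of the single install_file line, assuming the build is run from the project root.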

Stefano Bragaglia