I have a simple Python program using pandas_profiling. Here is the source code, which I have stored as c:\temp\pandas_profiling_demo.py:

import pandas as pd
import pandas_profiling as pp
df = pd.DataFrame(data={'x': [1, 2, 3, 4, 5], 'y': [2, 2, 4, 6, 6], 'z': [4, 6, 1, 5, 2]})
print(df.head(10))
profile = pp.ProfileReport(df)
profile.to_file(outputfile="C:\\temp\\pandas_profiling_demo.html")
print("Done.")

I also have a Java program that starts the Python program. (This isn't the real program, which involves a GUI, but it reproduces the problem.) The project is in Eclipse, but I will copy it here:

package pandasprofilingbug;

import java.io.File;
import java.lang.ProcessBuilder.Redirect;
import java.nio.file.Files;
import java.util.Map;

public class PandasProfilingBug {

    public static void main(String[] args) 
    {
        String[] command = new String[] { "cmd", "/C", "python", "c:\\temp\\pandas_profiling_demo.py" } ;
        ProcessBuilder pb = new ProcessBuilder(command);
        Map<String, String> env = pb.environment();

        env.remove("PYSPARK_DRIVER_PYTHON");        // else attempts to open in notebook environment    

        Process p = null;

        try
        {
            File log = new File("c:\\temp\\pandas_profiling_demo.log");
            Files.deleteIfExists(log.toPath());

            pb.redirectErrorStream(true);
            pb.redirectOutput(Redirect.to(log));
            // System.out.print("Start...");
            p = pb.start();
            assert pb.redirectInput() == Redirect.PIPE;
            assert pb.redirectOutput().file() == log;
            assert p.getInputStream().read() == -1;

            // TODO: How to give user an option to cancel???
            // TODO: How to provide progress report?
            System.out.print("Waiting...");
            p.waitFor();
            System.out.print("waiting over...exitValue = " + p.exitValue());
        }
        catch (Exception ie)
        {
            System.err.println(ie);
            ie.printStackTrace();
        }

    }
}

When I run the Java program, it gets stuck in a loop. Here is the first block of repeated output:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\bill_\Anaconda3\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\bill_\Anaconda3\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
  File "C:\Users\bill_\Anaconda3\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\bill_\Anaconda3\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "C:\Users\bill_\Anaconda3\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users\bill_\Anaconda3\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users\bill_\Anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "c:\temp\pandas_profiling_demo.py", line 21, in <module>
    profile = pp.ProfileReport(df)
  File "C:\Users\bill_\Anaconda3\lib\site-packages\pandas_profiling\__init__.py", line 66, in __init__
    description_set = describe(df, **kwargs)
  File "C:\Users\bill_\Anaconda3\lib\site-packages\pandas_profiling\describe.py", line 349, in describe
    pool = multiprocessing.Pool(pool_size)
  File "C:\Users\bill_\Anaconda3\lib\multiprocessing\context.py", line 119, in Pool
    context=self.get_context())
  File "C:\Users\bill_\Anaconda3\lib\multiprocessing\pool.py", line 174, in __init__
    self._repopulate_pool()
  File "C:\Users\bill_\Anaconda3\lib\multiprocessing\pool.py", line 239, in _repopulate_pool
    w.start()
  File "C:\Users\bill_\Anaconda3\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Users\bill_\Anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\bill_\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 33, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Users\bill_\Anaconda3\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users\bill_\Anaconda3\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
    is not going to be frozen to produce an executable.''')
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
   x  y  z
0  1  2  4
1  2  2  6
2  3  4  1
3  4  6  5
4  5  6  2
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\bill_\Anaconda3\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\bill_\Anaconda3\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
  File "C:\Users\bill_\Anaconda3\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\bill_\Anaconda3\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")

# REPEATED AD NAUSEAM

When I run the Python program in a Jupyter notebook, it works fine, creating the desired html file.

When I comment out these lines, it works fine (displays data frame) when called from Java:

    #profile = pp.ProfileReport(df)
    #profile.to_file(outputfile="C:\\temp\\pandas_profiling_demo.html")

Since the program runs from Java when the profiling is not used, I suspect there is an issue with pandas_profiling (or at least there is one for me). Why would it cause the program to go into a loop?

Thanks in advance.

Simon
Bill Qualls
  • I don't know much about these libraries, but I read some of the docs. Are you using `head(10)` correctly? It appears to print the first 10 rows that way and I only see 5 rows in your `DataFrame`. Info found [here](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.html). Maybe it should be `head(5)`? Correct me if I'm wrong though, this is just based on reading the docs. – Nexevis Jun 18 '19 at 12:05
  • As I said in the original post, the program (including the head() function) works fine from Java if I omit Pandas profiling. It is definitely a Pandas profiling issue. I wonder if it is a directory change going on inside Pandas profiling. – Bill Qualls Jun 19 '19 at 07:20

1 Answer

You can find the solutions here and here.

In your case:

import pandas as pd
import pandas_profiling as pp

if __name__ == "__main__":
    df = pd.DataFrame(data={'x': [1, 2, 3, 4, 5], 'y': [2, 2, 4, 6, 6], 'z': [4, 6, 1, 5, 2]})
    print(df.head(10))
    profile = pp.ProfileReport(df)
    profile.to_file(outputfile="C:\\temp\\pandas_profiling_demo.html")
    print("Done.")
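
As for why the unguarded version loops: the traceback in the question shows `spawn.py` re-running the script through `runpy` with `run_name="__mp_main__"`. Every worker process therefore executes the top-level code again, reaches `pp.ProfileReport(df)`, and tries to start a pool of its own. The sketch below imitates only that re-import step, without pandas_profiling or multiprocessing; the script text and the helper `run_as` are made up for illustration.

```python
import os
import runpy
import tempfile
import textwrap

# A stand-in for the real script: one top-level statement, one guarded one.
SCRIPT = textwrap.dedent('''
    executed = ["top-level"]          # runs every time the file is executed
    if __name__ == "__main__":
        executed.append("guarded")    # runs only when started directly
''')


def run_as(run_name):
    """Execute SCRIPT from a temporary file under the given module name."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_script.py")
        with open(path, "w") as fh:
            fh.write(SCRIPT)
        # run_path returns the globals of the executed script
        return runpy.run_path(path, run_name=run_name)["executed"]


as_main = run_as("__main__")       # how "python script.py" runs it
as_child = run_as("__mp_main__")   # how a spawned worker re-runs it

print(as_main)
print(as_child)
```

Under the worker's module name only the top-level statement runs, which is why moving the profiling calls under the guard stops the recursion.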
Simon
  • Simon -- I employed your suggestion and it does indeed work. However, this is a snippet from a much larger program. The easy answer is to put the entire program in a function or within the if __name__... But the program was developed in a Jupyter notebook. I tried the "if" within the function which calls pandas profiling, but that did not work. Any suggestions on a fix using Jupyter? Thank you very much for taking the time to answer. – Bill Qualls Jul 15 '19 at 23:19
  • I suggest using the command line interface to pandas-profiling, or to write a small wrapper script. – Simon Jul 16 '19 at 12:44
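
One possible shape for such a wrapper script is sketched below. It leaves the large script untouched and puts the guard in the wrapper instead. The path `c:\temp\notebook_export.py`, the helper `run_wrapped`, and the label `"__wrapped__"` are assumptions for illustration, not anything from the original program. It should work as long as the functions handed to worker processes live in importable modules, which appears to be the case for the pandas_profiling version in the traceback.

```python
import os
import runpy

# Hypothetical path to the script exported from the notebook; adjust as needed.
WRAPPED_SCRIPT = r"c:\temp\notebook_export.py"


def run_wrapped(script_path):
    """Execute an unmodified script and return its resulting globals.

    The label "__wrapped__" is an arbitrary choice for this sketch.
    """
    return runpy.run_path(script_path, run_name="__wrapped__")


if __name__ == "__main__":
    # Only the directly started process gets here. A spawned worker
    # re-imports this wrapper as "__mp_main__" and skips the block.
    if os.path.isfile(WRAPPED_SCRIPT):
        run_wrapped(WRAPPED_SCRIPT)
    else:
        print("Script not found:", WRAPPED_SCRIPT)
```

The Java `command` array would then point at the wrapper file rather than at the exported script.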