
I'd like to write a function with the following header:

def split_csv(file, sep=";", output_path=".", nrows=None, chunksize=None, low_memory=True, usecols=None):

As you can see, I am using the same parameters as several found in pd.read_csv. What I would like to know (or do) is forward the docstring concerning these parameters from read_csv to my own function without having to copy/paste them.

EDIT: As I understand it, there is no out-of-the-box solution for this, so perhaps building one is in order. What I have in mind:

`some_new_fancy_library.get_doc(for_function=pandas.read_csv, for_parameters=['sep', 'nrows'])` would output:

`{'sep': 'doc as found in the docstring', 'nrows': 'doc as found in the docstring', ...}`

and then it'd just be a matter of inserting the dictionary's values into my own function's docstring.
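There is no such library that I know of, but for numpydoc-style docstrings (the convention pandas follows) the `get_doc` function imagined above could be sketched roughly as follows. The name and keyword arguments are the hypothetical ones from the question, and the parsing assumes each parameter entry starts on its own line as `<name> : <type>`:

```python
import re

def get_doc(for_function, for_parameters):
    """Extract per-parameter docs from a numpydoc-style docstring.

    Rough sketch: assumes each parameter entry begins on its own
    line as '<name> : <type>', as in pandas' docstrings.
    """
    doc = for_function.__doc__ or ""
    # Split on lines starting with 'name : '; because the name is a
    # capturing group, re.split keeps it, so the result alternates
    # [preamble, name, body, name, body, ...]
    parts = re.split(r'^\s*(\w+) : ', doc, flags=re.MULTILINE)
    found = dict(zip(parts[1::2], parts[2::2]))
    return {p: found[p].rstrip() for p in for_parameters if p in found}
```

Each returned body still starts with the type annotation (e.g. `'str, default ...'`) followed by the description; trimming that further is left as an exercise.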

Cheers

Imad
  • No, the function I wrote is completely different, but it uses read_csv arguments. I didn't post the whole code for better readability. – Imad Feb 08 '19 at 14:37
  • I basically want to have `pandas.read_csv` parameters' documentation available for some of my own function's parameters. – Imad Feb 08 '19 at 14:38
  • @aws_apprentice agreed, it's possible to parse the param information and pass into the function but the work required is probably more than just copying and pasting the actual docstrings. – r.ook Feb 08 '19 at 15:33

1 Answer


You could parse the docstrings with regex and return the matched arguments to your function:

import re
import pandas as pd

# capturing group for argument names like 'sep :' in the docstring
pat = re.compile(r'([\w_+]+ :)')

splitted = pat.split(pd.read_csv.__doc__)

# Compare the parsed docstring against your function's arguments and only
# extract the required docstrings
docstrings = '\n'.join(
    ''.join(splitted[i: i + 2])
    for i, s in enumerate(splitted)
    if s.rstrip(" :") in split_csv.__code__.co_varnames
)

split_csv.__doc__ = docstrings

help(split_csv)

# Help on function split_csv in module __main__:
# 
# split_csv(file, sep=';', output_path='.', nrows=None, chunksize=None, low_memory=True, usecols=None)
#   sep : str, default ','
#       Delimiter to use. If sep is None, the C engine cannot automatically detect
#       the separator, but the Python parsing engine can, meaning the latter will
#       be used and automatically detect the separator by Python's builtin sniffer
#       tool, ``csv.Sniffer``. In addition, separators longer than 1 character and
#       different from ``'\s+'`` will be interpreted as regular expressions and
#       will also force the use of the Python parsing engine. Note that regex
#       delimiters are prone to ignoring quoted data. Regex example: ``'\r\t'``
#   
#   usecols : list-like or callable, default None
#       Return a subset of the columns. If list-like, all elements must either
#       be positional (i.e. integer indices into the document columns) or strings
#       that correspond to column names provided either by the user in `names` or
#       inferred from the document header row(s). For example, a valid list-like
#       `usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element
#       order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
#       To instantiate a DataFrame from ``data`` with element order preserved use
#       ``pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns
#       in ``['foo', 'bar']`` order or
#       ``pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]``
#       for ``['bar', 'foo']`` order.
#   
#       If callable, the callable function will be evaluated against the column
#       names, returning names where the callable function evaluates to True. An
#       example of a valid callable argument would be ``lambda x: x.upper() in
#       ['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster
#       parsing time and lower memory usage.
#   
#   nrows : int, default None
#       Number of rows of file to read. Useful for reading pieces of large files
#   
#   chunksize : int, default None
#       Return TextFileReader object for iteration.
#       See the `IO Tools docs
#       <http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking>`_
#       for more information on ``iterator`` and ``chunksize``.
#   
#   low_memory : boolean, default True
#       Internally process the file in chunks, resulting in lower memory use
#       while parsing, but possibly mixed type inference.  To ensure no mixed
#       types either set False, or specify the type with the `dtype` parameter.
#       Note that the entire file is read into a single DataFrame regardless,
#       use the `chunksize` or `iterator` parameter to return the data in chunks.
#       (Only valid with C parser)

But of course this relies on you having the exact same argument names as the copied function. And as you can see, you will need to add the docstrings for the unmatched arguments (e.g. `file`, `output_path`) yourself.
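One caveat with `split_csv.__code__.co_varnames`: it contains local variables defined inside the function as well as its parameters, so a stray local that happens to share a name with a pandas argument would pull in an unrelated docstring. A variation using `inspect.signature` restricts the match to actual parameters; `forward_param_docs` below is a hypothetical helper built on the same regex, not an established API:

```python
import inspect
import re

def forward_param_docs(target, source):
    """Attach matching parameter docs from source's docstring to target.

    Sketch using the same regex split as above, but inspect.signature
    yields only target's real parameters, not its local variables.
    """
    params = set(inspect.signature(target).parameters)
    pat = re.compile(r'([\w_+]+ :)')
    parts = pat.split(source.__doc__ or "")
    target.__doc__ = '\n'.join(
        ''.join(parts[i:i + 2])
        for i, s in enumerate(parts)
        if s.rstrip(" :") in params
    )
    return target
```

Since it returns `target`, it can also be applied as a plain call after the function definition, e.g. `forward_param_docs(split_csv, pd.read_csv)`.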

r.ook