I'm currently learning Scala and looking for an elegant solution to a problem that is easily solved by the use of co-routines.
Since co-routines are not enabled by default in Scala, I assume they are, at the least, not a widely accepted best practice, and I therefore want to write my code without them.
That said, a convincing argument that co-routines/continuations are the right tool to employ here would also be an acceptable answer.
The generate_all_matching_paths function
I want to write a function that searches for files inside a base directory. The match and descend criteria should be provided by an instance of a class that has the "PathMatcher" trait. (At least I think this is the way to go in Scala.)
A PathMatcher can be used to determine whether an fs_item_path matches, AND it determines whether the search should descend into a directory (in case the fs_item_path is the path of a directory).
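To make this concrete, here is roughly the trait I picture. This is only a sketch with placeholder names of my own (and unrelated to java.nio.file.PathMatcher):

case class MatchResult(isMatch: Boolean, doDescend: Boolean)

trait PathMatcher {
  // decides whether relPath is a hit, and whether the search should
  // descend into it in case it is a directory
  def matches(relPath: String): MatchResult
}

// illustration only: matches paths ending in "json" and refuses to descend
// into anything whose path contains "python_portable"
object JsonSuffixMatcher extends PathMatcher {
  def matches(relPath: String): MatchResult =
    MatchResult(
      isMatch   = relPath.endsWith("json"),
      doDescend = !relPath.contains("python_portable"))
}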
The Python implementation that follows is only provided to indicate the functionality I have in mind.
I want to write this code the "Scala way".
I'm aiming for a solution with these characteristics:
- be readable
- use Scala idioms where they fit well
- return the matched paths during the search (and not after)
- work with very large file collections (thousands of files in a directory, many millions of files in total), though not extremely deeply nested
I assume that the solution will involve lazily evaluated streams, but I was not able to assemble the stream in a working way.
I have also read that, if used incorrectly, lazy streams can keep already-produced 'old values' alive in memory. The solution I'm after should not do this.
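By "lazy evaluating streams" I mean something along the lines of Scala's Stream, where elements are only produced on demand, e.g.:

val numbers = Stream.from(1)                      // conceptually endless
val multiplesOfThree = numbers.filter(_ % 3 == 0) // evaluated on demand
println(multiplesOfThree.take(5).toList)          // List(3, 6, 9, 12, 15)

As far as I understand, the "old values" problem comes from Stream memoizing its elements: as long as a reference to the head is kept, every element forced so far stays reachable. With millions of paths that would defeat the purpose.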
Arguments
- base_dir_abs_path: absolute path of the directory to start the search at
- rel_ancestor_dir_list: list of directory names that indicate how far we have descended into base_dir_abs_path's subdirectories
- rel_path_matcher: an instance of a class with the PathMatcher trait. In the example below I use a regular-expression implementation, but I don't want to limit the use to regular expressions.
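For reference, this is a rough and untested sketch of the Scala shape I imagine for this function, reusing the PathMatcher sketch from above. It uses plain java.io.File and Scala's Stream; I make no claim that it is idiomatic, and whether it avoids the 'old values' problem is exactly what I am unsure about:

import java.io.File

def generateAllMatchingPaths(
    baseDirAbsPath: File,
    relAncestorDirList: List[String],
    relPathMatcher: PathMatcher): Stream[File] = {
  val currentDir = relAncestorDirList.foldLeft(baseDirAbsPath)((dir, name) => new File(dir, name))
  // listFiles returns null on I/O problems, hence the Option wrapper
  val dirListing = Option(currentDir.listFiles).map(_.toStream).getOrElse(Stream.empty)
  dirListing.flatMap { fsItem =>
    val fsItemRelPath = (relAncestorDirList :+ fsItem.getName).mkString(File.separator)
    val result = relPathMatcher.matches(fsItemRelPath)
    val self = if (result.isMatch) Stream(fsItem) else Stream.empty
    // 'def' keeps the recursive call from happening before the tail is needed
    def children =
      if (result.doDescend && fsItem.isDirectory)
        generateAllMatchingPaths(baseDirAbsPath, relAncestorDirList :+ fsItem.getName, relPathMatcher)
      else
        Stream.empty
    self #::: children
  }
}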
Example in Python
Here is a complete working Python program (tested with Python 3.4) that includes a Python version of generate_all_matching_paths.
The program will search "d:\Projects" for file system paths that end with "json", analyze the files for what indentation they use, and then print out the results.
If a path includes the substring "python_portable", then the search will not descend into that directory.
import os
import re
import codecs


#
# this is the bespoke function I want to port to Scala
#
def generate_all_matching_paths(
        base_dir_abs_path,
        rel_ancestor_dir_list,
        rel_path_matcher
):
    rooted_ancestor_dir_list = [base_dir_abs_path] + rel_ancestor_dir_list
    current_dir_abs_path = os.path.join(*rooted_ancestor_dir_list)
    dir_listing = os.listdir(current_dir_abs_path)
    for fs_item_name in dir_listing:
        fs_item_abs_path = os.path.join(
            current_dir_abs_path,
            fs_item_name
        )
        fs_item_rel_ancestor_list = rel_ancestor_dir_list + [fs_item_name]
        fs_item_rel_path = os.path.join(
            *fs_item_rel_ancestor_list
        )
        result = rel_path_matcher.match(fs_item_rel_path)
        if result.is_match:
            yield fs_item_abs_path
        if result.do_descend and os.path.isdir(fs_item_abs_path):
            child_ancestor_dir_list = rel_ancestor_dir_list + [fs_item_name]
            for r in generate_all_matching_paths(
                    base_dir_abs_path,
                    child_ancestor_dir_list,
                    rel_path_matcher
            ):
                yield r

#
# all following code is only a context giving example of how generate_all_matching_paths might be used
#
class MyMatchResult:
    def __init__(
            self,
            is_match,
            do_descend
    ):
        self.is_match = is_match
        self.do_descend = do_descend


# in Scala this should implement the PathMatcher trait
class MyMatcher:
    def __init__(
            self,
            rel_path_regex,
            abort_dir_descend_regex_list
    ):
        self.rel_path_regex = rel_path_regex
        self.abort_dir_descend_regex_list = abort_dir_descend_regex_list

    def match(self, path):
        rel_path_match = self.rel_path_regex.match(path)
        is_match = rel_path_match is not None
        do_descend = True
        for abort_dir_descend_regex in self.abort_dir_descend_regex_list:
            abort_match = abort_dir_descend_regex.match(path)
            if abort_match:
                do_descend = False
                break
        r = MyMatchResult(is_match, do_descend)
        return r

def leading_whitespace(file_path):
    b_leading_spaces = False
    b_leading_tabs = False
    with codecs.open(file_path, "r", "utf-8") as f:
        for line in f:
            for c in line:
                if c == '\t':
                    b_leading_tabs = True
                elif c == ' ':
                    b_leading_spaces = True
                else:
                    break
            if b_leading_tabs and b_leading_spaces:
                break
    return b_leading_spaces, b_leading_tabs


def print_paths(path_list):
    for path in path_list:
        print(path)

def main():
    leading_spaces_file_path_list = []
    leading_tabs_file_path_list = []
    leading_mixed_file_path_list = []
    leading_none_file_path_list = []

    base_dir_abs_path = r'd:\Projects'
    rel_path_regex = re.compile('.*json$')
    abort_dir_descend_regex_list = [
        re.compile('^.*python_portable.*$')
    ]
    rel_path_matcher = MyMatcher(rel_path_regex, abort_dir_descend_regex_list)
    ancestor_dir_list = []
    for fs_item_path in generate_all_matching_paths(
            base_dir_abs_path,
            ancestor_dir_list,
            rel_path_matcher
    ):
        if os.path.isfile(fs_item_path):
            b_leading_spaces, b_leading_tabs = leading_whitespace(fs_item_path)
            if b_leading_spaces and b_leading_tabs:
                leading_mixed_file_path_list.append(fs_item_path)
            elif b_leading_spaces:
                leading_spaces_file_path_list.append(fs_item_path)
            elif b_leading_tabs:
                leading_tabs_file_path_list.append(fs_item_path)
            else:
                leading_none_file_path_list.append(fs_item_path)

    print('space indentation:')
    print_paths(leading_spaces_file_path_list)
    print('tab indentation:')
    print_paths(leading_tabs_file_path_list)
    print('mixed indentation:')
    print_paths(leading_mixed_file_path_list)
    print('no indentation:')
    print_paths(leading_none_file_path_list)

    print('space: {}'.format(len(leading_spaces_file_path_list)))
    print('tab: {}'.format(len(leading_tabs_file_path_list)))
    print('mixed: {}'.format(len(leading_mixed_file_path_list)))
    print('none: {}'.format(len(leading_none_file_path_list)))


if __name__ == '__main__':
    main()
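For comparison, this is how I imagine consuming the results in Scala so that matches are handled while the search is still running. Again, this is only a sketch built on the placeholder names above, and whether it really avoids keeping old values reachable is part of what I cannot judge yet:

generateAllMatchingPaths(new java.io.File("d:\\Projects"), Nil, JsonSuffixMatcher)
  .foreach { fsItem =>
    if (fsItem.isFile) println(fsItem.getAbsolutePath) // handled as soon as it is found
  }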