
Dear community members,

I'm working on a code analysis system and would like to replace calls to the Git CLI with the Dulwich module. As a first step, I need to replace the `git ls-files` command with a Dulwich equivalent. I did it in the following way:

import os
import stat
import subprocess
from tempfile import TemporaryDirectory

from dulwich import porcelain
from dulwich.repo import Repo
from dulwich.objects import Commit, Tree

def _flatten_git_tree(r, object_sha, prefix=b'', sep=b'/'):

    result=[]

    git_object=r.get_object(object_sha)

    if git_object.type_name==b'tree':

        for item in git_object.iteritems():
            if stat.S_ISREG(item.mode):
                result.append(sep.join([prefix, item.path]))
            if stat.S_ISDIR(item.mode):
                result.extend(_flatten_git_tree(r, item.sha, prefix+sep+item.path, sep))

    if git_object.type_name==b'commit':

        result.extend(_flatten_git_tree(r, git_object.tree, prefix, sep))

    return result

def _run_git_cmd(git_arguments):

    return subprocess.Popen(git_arguments, stdout=subprocess.PIPE).communicate()[0]

with TemporaryDirectory() as temp_dir:

    git_clone_url=r"https://github.com/dulwich/dulwich.git"
    repo=porcelain.clone(git_clone_url, temp_dir, checkout=True)
    dulwich_ls_files=_flatten_git_tree(repo, repo.head())

    git_ls_files=_run_git_cmd(['git', '-C', os.path.join(temp_dir, 'dulwich'), 'ls-files'])
    git_ls_files=git_ls_files.decode('utf-8').splitlines()

assert len(dulwich_ls_files)==len(git_ls_files)

A quick assert shows that the outputs differ. What could be the reason?

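Answering my own question with help from @jelmer. The problem was in the line I have commented out below: `porcelain.clone` clones directly into `temp_dir`, so there is no `dulwich` subdirectory for `git -C` to enter. I also switched to reading the index with `repo.open_index()` instead of walking the tree. Now the outputs match.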

import os
import subprocess
from tempfile import TemporaryDirectory

from dulwich import porcelain
from dulwich.repo import Repo

def _run_git_cmd(git_arguments):

    return subprocess.Popen(git_arguments, stdout=subprocess.PIPE).communicate()[0]

with TemporaryDirectory() as temp_dir:

    git_clone_url=r"https://github.com/dulwich/dulwich.git"
    repo=porcelain.clone(git_clone_url, temp_dir)
    dulwich_ls_files=[path.decode('utf-8') for path in sorted(repo.open_index())]

    #git_ls_files=_run_git_cmd(['git', '-C', os.path.join(temp_dir, 'dulwich'), 'ls-files'])
    git_ls_files=_run_git_cmd(['git', '-C', temp_dir, 'ls-files'])
    git_ls_files=git_ls_files.decode('utf-8').splitlines()

print(len(dulwich_ls_files), len(git_ls_files))
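A side note on the helper: `Popen(...).communicate()[0]` returns whatever was written to stdout even when the command fails, so a wrong `-C` path simply yields an empty list. A minimal sketch of a stricter helper, assuming Python 3.5+ for `subprocess.run`; the name `run_cmd` is made up for illustration, and the Python interpreter stands in for `git` so the snippet runs anywhere:

```python
import subprocess
import sys


def run_cmd(arguments):
    """Run a command and return its stdout as a list of text lines.

    check=True makes subprocess.run raise CalledProcessError when the
    command exits with a non-zero status, instead of silently returning
    empty output.
    """
    completed = subprocess.run(
        arguments,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return completed.stdout.decode('utf-8').splitlines()


# Stand-in command; in the script above the arguments would be
# ['git', '-C', temp_dir, 'ls-files'].
print(run_cmd([sys.executable, '-c', 'print("a"); print("b")']))
```

With a helper like this, pointing `git -C` at a directory that does not exist should raise an exception rather than produce a silent length mismatch.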
Wladd
  • Can you show `dulwich_ls_files` and `git_ls_files` so we can compare them? – phd Jun 18 '18 at 16:29
  • I'm not sure what difference you're seeing here, but the overall *strategy* is wrong: `git ls-files` reads the *index* while you're reading a *commit* or a *tree*. (`git ls-files` can also read the work-tree, depending on how you call it.) – torek Jun 18 '18 at 19:30
  • Hi @torek, thank you for pointing to my mistake. Could you please provide a Dulwich way to solve it? – Wladd Jun 19 '18 at 12:07
  • One more point: as far as I understand, since I did no changes to a cloned repo then an index matches a working tree. Am I wrong? – Wladd Jun 19 '18 at 12:31
  • @Wladd: I'm not familiar with Dulwich so I'm not sure how to have it read through the index contents. You are correct that the index and work-tree should match at this point. My comment was meant more as an important side note here. Incidentally, there are two more modes to check for: S_IFLNK for symbolic links, and one that isn't a stat mode at all, 160000, which indicates a submodule (though there's nothing useful to do for that here). – torek Jun 19 '18 at 15:42
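To illustrate the point about modes in the last comment: the tree walk in the question only tests `S_ISREG` and `S_ISDIR`, so symbolic links and submodule entries fall through both checks. A minimal sketch using only the standard `stat` module; `classify_mode` and `GITLINK_MODE` are hypothetical names for illustration, not part of Dulwich:

```python
import stat

# Git's mode for a submodule ("gitlink") entry. It is not a real
# filesystem mode, so the stat module has no predicate for it.
GITLINK_MODE = 0o160000


def classify_mode(mode):
    """Return a rough label for a Git tree entry mode."""
    if stat.S_IFMT(mode) == GITLINK_MODE:
        return 'submodule'
    if stat.S_ISLNK(mode):
        return 'symlink'
    if stat.S_ISDIR(mode):
        return 'tree'
    if stat.S_ISREG(mode):
        return 'file'
    return 'unknown'


for mode in (0o100644, 0o100755, 0o120000, 0o040000, 0o160000):
    print(oct(mode), classify_mode(mode))
```

Whether a tree walk should count symlinks and submodules depends on what is being compared; `git ls-files` lists both, since both are index entries.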

1 Answer


Read the index instead of walking a tree, something like this:

from dulwich.repo import Repo
r = Repo('.')
index = r.open_index()
for path in sorted(index):
    print(path)
jelmer