I want to know how big the files within my repository are, in terms of lines of code, to gauge the 'health' of the repository.

In order to answer this, I would like to see a distribution (visualised or not) of the number of files per range of line counts (the range width can be 1):

#lines of code   #files   
 1-10             1
11-20             23
etc...

(A histogram of this would be nice)

Is there a quick way to get this, with for example cloc or any other (command-line) tool?

Jonathan N

2 Answers


A combination of cloc and Pandas can handle this. First, capture the per-file line counts with cloc into a CSV file using the --by-file and --csv switches, for example

cloc --by-file --csv --out data.csv curl-7.80.0.tar.bz2

then use the Python program below to aggregate and bin the data by folders:

./aggregate_by_folder.py data.csv
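For reference, the data.csv captured this way is a per-file table whose columns the script relies on (language, filename, blank, comment, code; the real file also ends with a "github.com/AlDanial/cloc ..." version column). A minimal sketch with hypothetical rows:

```python
import io
import pandas as pd

# Hypothetical fragment shaped like cloc's --by-file --csv output
# (file names and counts are invented for illustration).
sample = io.StringIO(
    "language,filename,blank,comment,code\n"
    "C,curl-7.80.0/src/tool_main.c,40,40,200\n"
    "C,curl-7.80.0/lib/url.c,300,500,2500\n"
)
df = pd.read_csv(sample)
print(df[["filename", "code"]])
```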

The code for aggregate_by_folder.py is

#!/usr/bin/env python
import sys
import os.path
import pandas as pd
def add_folder(df):
    """
    Return a Pandas dataframe with an additional 'folder' column
    containing each file's parent directory
    """
    header = 'github.com/AlDanial/cloc'
    df = df.drop(df.columns[df.columns.str.contains(header)], axis=1)
    df['folder'] = df['filename'].dropna().apply(os.path.dirname)
    return df

def bin_by_folder(df):
    bins = list(range(0,1000,50))
    return df.groupby('folder')['code'].value_counts(bins=bins).sort_index()

def file_count_by_folder(df):
    df_files = pd.pivot_table(df, index=['folder'], aggfunc='count')
    file_counts = df_files.rename(columns={'blank':'file count'})
    return file_counts[['file count']]

def main():
    if len(sys.argv) != 2:
        print(f"Usage:  {sys.argv[0]} data.csv")
        print("     where the .csv file is created with")
        print("       cloc --by-file --csv --out data.csv my_code_base")
        raise SystemExit
    pd.set_option('display.max_rows', None)
    pd.set_option('display.width', None)
    pd.set_option('display.max_colwidth', None)  # -1 is no longer accepted by recent pandas
    df = add_folder(pd.read_csv(sys.argv[1]))
    print(pd.pivot_table(df, index=['folder'], aggfunc='sum'))
    print('-' * 50)
    print(file_count_by_folder(df))
    print('-' * 50)
    print(bin_by_folder(df))

if __name__ == "__main__":
    main()
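To see what the bin_by_folder step produces, here is a toy run of the same value_counts(bins=...) idea on a hand-made dataframe (folder names and counts are invented for illustration):

```python
import pandas as pd

# Two hypothetical folders with a few files each; 'code' is lines of code.
df = pd.DataFrame({
    "folder": ["src", "src", "src/lib", "src/lib"],
    "code":   [12, 48, 5, 120],
})
bins = list(range(0, 201, 50))  # bin edges: 0, 50, 100, 150, 200
binned = df.groupby("folder")["code"].value_counts(bins=bins).sort_index()
print(binned)  # a (folder, line-count interval) -> file-count series
```

Files whose line counts fall outside the bin edges are silently dropped, so widen the range if you have very large files.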
AlDanial

So the goal was to get a histogram of the sizes (in lines of code) for all the files in a directory. Since our project is a React Native project, we are concerned with .ts and .tsx files. All the test files (also .ts and .tsx files) can be skipped.

Also, show the 5 largest files, so we know where our attention is needed.

What we basically did was traverse the directory recursively and, for every file we're interested in, 1) calculate its size (in lines of code), 2) calculate which 'bin'/'bar' the file belongs in and 3) add it to that bin. Meanwhile, we keep track of all the sizes, to display the 5 largest files.

The following python script worked perfectly for my use case:

import os
import matplotlib.pyplot as plt
from heapq import nlargest


# Directory path containing your code files
directory = "./src"

# Extensions we're interested in
extensions = [".ts", ".tsx"]

# Initialize dictionary to store line counts for each bin
line_counts = {}

# Keep track of the largest files
largest_files = []

def count_lines(filepath):
    # Count lines without loading the whole file into memory;
    # tolerate the odd non-UTF-8 byte in source files.
    with open(filepath, "r", encoding="utf-8", errors="replace") as file:
        return sum(1 for _ in file)


for root, dirs, files in os.walk(directory):
    # skip jest test files
    if "__tests__" in root:
        continue

    for filename in files:
        _, file_extension = os.path.splitext(filename)
        if file_extension not in extensions:
            continue

        filepath = os.path.join(root, filename)
        line_count = count_lines(filepath)

        # Calculate bin index
        bin_index = (line_count // 10) * 10

        # Update line counts dictionary
        line_counts[bin_index] = line_counts.get(bin_index, 0) + 1

        # Add file and line count to the list of largest files
        largest_files.append((filepath, line_count))

# Extract x and y data for the histogram
x = list(line_counts.keys())
y = list(line_counts.values())

# Select the 5 largest files by line count
largest_files = nlargest(5, largest_files, key=lambda item: item[1])

# Print the largest files
print("Top 5 Largest Files:")
for file, line_count in largest_files:
    print(f"{file} - {line_count} lines")

# Plot the histogram
plt.bar(x, y, align="edge", width=10)
plt.xlabel("Number of Lines of Code")
plt.ylabel("Number of Files")
plt.title("Distribution of Lines of Code")
plt.show()
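The binning step in the script is plain integer arithmetic: a file's bin is its line count rounded down to the nearest multiple of 10. A quick sanity check of that mapping:

```python
# (line_count // 10) * 10 floors a count to the lower edge of its 10-wide bin.
for line_count in (1, 9, 10, 19, 123):
    print(line_count, "->", (line_count // 10) * 10)
```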
Jonathan N
  • Thank you for contributing to the Stack Overflow community. This may be a correct answer, but it’d be really useful to provide additional explanation of your code so developers can understand your reasoning. This is especially useful for new developers who aren’t as familiar with the syntax or struggling to understand the concepts. **Would you kindly [edit] your answer to include additional details for the benefit of the community?** – Jeremy Caney May 19 '23 at 00:53