Code optimisation for extracting cloud points from AutoCAD DXF Python

Question

I'm working on processing Lidar data with Python. The test data has about 150 000 data points but the actually data will contain hundreds of millions. Initially, it was exported as .dwg file, however, since I couldn't find a way to process it I decided to convert it to *.dxf and work from there. Then I'm trying to extract the point coordinates and layer and save it as a *.cvs file for further processing. Here is the code:

import pandas as pd 

PointCloud = pd.DataFrame(columns=['X', 'Y', 'Z','Layer']) 
filename="template"
# Using readlines()
with open(filename+".dxf", "r") as f2:
    input = list(f2.readlines())

###Strip the data only to datapoints to speed up (look up .dxf documentation)
i=input.index('ENTITIES\n') #find the begining of the entities section
length = input.index('OBJECTS\n') #find the begining of the entities section
while i<length:
    line=input[i]
    if i%1000==0: print ("Completed: "+str(round(i/length*100,2))+"%")
    if line.startswith("AcDbPoi"):
        x=float(input[i+2].strip())
        y=float(input[i+4].strip())
        z=float(input[i+6].strip())
        layer=input[i-2].strip() # Strips the newline character
        point = {'X':x,'Y':y,'Z':z,'Layer':layer}
        PointCloud.loc[PointCloud.shape[0]]=[x,y,z,layer]
        i+=14
    else:
        i+=1

PointCloud.to_csv(filename+'.csv', sep='\t', encoding='utf-8')

While it works, going line by line is not the most efficient way, hence I'm trying to find ways to optimize it. Here is the *.dxf point structure that I'm interested in extracting:

AcDbEntity
  8
SU-SU-Point cloud-Z
100
AcDbPoint
 10
4.0973
 20
2.1156
 30
-0.6154000000000001
  0
POINT
  5
3130F
330
2F8CD
100
AcDbEntity

Where: 10, 20, and 30 are the XYZ coordinates and 8 is the layer. Any ideas on how to improve it would be greatly appreciated.

Ideally yes since conversion to DXF will be done using python API and further processing is in python too. Unless I can call C function with python. — Mariusz, Jun 06 '22 at 15:10

mozman · Answer 1 · 2022-06-07T05:12:43.157

The slowest part is file IO and I don't think this can be sped up much.

But it could be more memory efficient by really reading the (very large) DXF file line by line. The code could also be more robust by just parsing the absolut minimum data from the POINT entities, this way the function can parse newer DXF versions and also DXF R12 and older.

import sys
from dataclasses import dataclass


@dataclass
class Point:
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
    layer: str = ""


def load_points(filename: str):
    def read_tag():
        """Read the next DXF tag (group code, value)."""
        code = fp.readline()
        if code == "":
            raise EOFError()
        value = fp.readline().strip()
        return int(code), value

    def next_entity():
        """Collect entity tags, starting with the first group code 0 tag like (0, POINT). 
        """
        tags = []
        while True:
            code, value = read_tag()
            if code == 0:
                if tags:
                    yield tags
                tags = [(code, value)]
            else:
                if tags:  # skip everything in front of the first entity
                    tags.append((code, value))

    def parse_point(tags):
        """Parse the DXF POINT entity."""
        point = Point()
        # The order of the DXF tags can differ from application to application.
        for code, value in tags:
            if code == 10:  # x-coordinate
                point.x = float(value)
            elif code == 20:  # y-coordinate
                point.y = float(value)
            elif code == 30:  # z-coordinate
                point.z = float(value)
            elif code == 8:  # layer name
                point.layer = value
        return point

    # DXF R2007 has always utf8 encoding, older DXF versions using
    # the encoding stored in the HEADER section, if only ASCII characters
    # are used for the layer names, the encoding can be ignored.
    fp = open(filename, mode="rt", encoding="utf8", errors="ignore")
    try:
        # find the ENTITIES section
        while read_tag() != (2, "ENTITIES"):
            pass
        # iterate over all DXF entities until tag (0, ENDSEC) appears
        for tags in next_entity():
            if tags[0] == (0, "POINT"):
                yield parse_point(tags)
            elif tags[0] == (0, "ENDSEC"):
                return
    except EOFError:
        pass
    finally:
        fp.close()


def main(files):
    for file in files:
        print(f"loading: {file}")
        csv = file.replace(".dxf", ".csv")
        with open(csv, "wt", encoding="utf8") as fp:
            fp.write("X, Y, Z, LAYER\n")
            for point in load_points(file):
                print(point)
                fp.write(f'{point.x}, {point.y}, {point.z}, "{point.layer}"\n')


if __name__ == "__main__":
    main(sys.argv[1:])

FYI: This is the simplest valid DXF R12 file containing only POINT entities:

0
SECTION
2
ENTITIES
0
POINT
8
layer name
10
1.0
20
2.0
30
3.0
0
ENDSEC
0
EOF

Code optimisation for extracting cloud points from AutoCAD DXF Python

1 Answers1