
I have the following polygon for a geographic area, which I fetch via a request in CAP/XML format from an API.

The raw data looks like this:

<polygon>22.3243,113.8659 22.3333,113.8691 22.4288,113.8691 22.4316,113.8742 22.4724,113.9478 22.5101,113.9951 22.5099,113.9985 22.508,114.0017 22.5046,114.0051 22.5018,114.0085 22.5007,114.0112 22.5007,114.0125 22.502,114.0166 22.5038,114.0204 22.5066,114.0245 22.5067,114.0281 22.5057,114.0371 22.5051,114.0409 22.5041,114.0453 22.5025,114.0494 22.5023,114.0511 22.5035,114.0549 22.5047,114.0564 22.5059,114.057 22.5104,114.0576 22.512,114.0584 22.5144,114.0608 22.5163,114.0637 22.517,114.0657 22.5172,114.0683 22.5181,114.0717 22.5173,114.0739</polygon>

I store the requested items in a dictionary and then work through them, transforming each into a GeoJSON list object suitable for ingestion into Elasticsearch according to the schema I'm working with. I've removed irrelevant code here for ease of reading.

# fetch and store data in a dictionary
import json

import requests
import xmltodict

r = requests.get("https://alerts.weather.gov/cap/ny.php?x=0")
xpars = xmltodict.parse(r.text)
json_entry = json.dumps(xpars['feed']['entry'])
dict_entry = json.loads(json_entry)

# transform items if necessary
for entry in dict_entry:

    if entry['cap:polygon']:
        polygon = entry['cap:polygon']
        polygon = polygon.split(" ")
        coordinates = []
        # split each point, swap its lat/lon order, and enclose the pair in its own list
        for p in polygon:
            p = p.split(",")
            p[0], p[1] = float(p[1]), float(p[0])  # GeoJSON order is [lon, lat]
            coordinates += [p]

        # more code adding fields to new dict object, not relevant to the question

The output of the `for p in polygon` loop looks like:

[ [113.8659, 22.3243], [113.8691, 22.3333], [113.8691, 22.4288], [113.8742, 22.4316], [113.9478, 22.4724], [113.9951, 22.5101], [113.9985, 22.5099], [114.0017, 22.508], [114.0051, 22.5046], [114.0085, 22.5018], [114.0112, 22.5007], [114.0125, 22.5007], [114.0166, 22.502], [114.0204, 22.5038], [114.0245, 22.5066], [114.0281, 22.5067], [114.0371, 22.5057], [114.0409, 22.5051], [114.0453, 22.5041], [114.0494, 22.5025], [114.0511, 22.5023], [114.0549, 22.5035], [114.0564, 22.5047], [114.057, 22.5059], [114.0576, 22.5104], [114.0584, 22.512], [114.0608, 22.5144], [114.0637, 22.5163], [114.0657, 22.517], [114.0683, 22.5172], [114.0717, 22.5181], [114.0739, 22.5173] ]
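For reference, the same swap can be written as a single comprehension. This is just a sketch over a shortened, hard-coded polygon string (the real data comes from the feed entry), and it is still linear in the number of characters:

```python
# Sketch: the same transform as the loop above, as one comprehension.
# The polygon string here is a shortened sample for illustration only.
polygon = "22.3243,113.8659 22.3333,113.8691 22.4288,113.8691"

coordinates = [
    [float(lon), float(lat)]  # GeoJSON order is [longitude, latitude]
    for lat, lon in (p.split(",") for p in polygon.split(" "))
]
print(coordinates)
```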

Is there a way to do this that is better than O(N^2)? Thank you for taking the time to read.

  • Why are you converting to and from JSON? – Barmar Nov 25 '21 at 00:57
  • Actually now that I'm looking at it with fresh eyes I think it might be O(n^3) due to the p.split()? – Isaac Keleher Nov 25 '21 at 00:58
  • This is not `O(N^2)` - this is `O(MxN)` because there are M entries and of those, there are N points in a polygon (if there is a polygon). – Larry the Llama Nov 25 '21 at 00:58
  • @Barmar the p in polygon loop transforms it to GeoJSON -> https://geojson.org/ – Isaac Keleher Nov 25 '21 at 00:59
  • I'm talking about `dict_entry = json.loads(json_entry)`. Why not just `dict_entry = xpars['feed']['entry']` – Barmar Nov 25 '21 at 00:59
  • @IsaacKeleher I am not sure you understand O(N^2), etc. It is only a power of N if it is actually dependent on N. It would not be O(N^3) but rather O(KxMxN) because those variables are unrelated – Larry the Llama Nov 25 '21 at 01:00
  • It's really O(N) where N is the total number of coordinates in the JSON. – Barmar Nov 25 '21 at 01:00
  • @Barmar exactly – Larry the Llama Nov 25 '21 at 01:01
  • The nested loops aren't multiplying the complexity because they're processing smaller pieces of the original data. – Barmar Nov 25 '21 at 01:01
  • @Barmar difference is that my way gets rid of additional nesting e.g. `[OrderedDict([('id', 'https://alerts.weather.gov/cap/ny.php?x=0'), ('updated', '2021-11-25T01:03:09+00:00'),` vs `[{'id': 'https://alerts.weather.gov/cap/ny.php?x=0', 'updated': '2021-11-25T01:03:09+00:00',` Just makes it easier to read when working with the raw data – Isaac Keleher Nov 25 '21 at 01:07
  • @LarrytheLlama could you please expand on O(KxMxN) or give me a link to learn more about it please? – Isaac Keleher Nov 25 '21 at 01:10
  • @Barmar I'm a bit confused now, why does processing smaller parts of it change the time complexity? :) – Isaac Keleher Nov 25 '21 at 01:10
  • Consider: `for i in range(0, 20, 5): for j in range(5): do something` versus `for i in range(0, 20):`. They both iterate 20 times, but the first one does it in 4 groups of 5. – Barmar Nov 25 '21 at 01:12
  • No matter how you organize the loops, you're just executing the inner calls to `float()` once for each number in the input data. – Barmar Nov 25 '21 at 01:13
  • Hmm, so say I have 5 entries from the API feed and each polygon point is 2 items. That would be `5 * M * 2`? Where `M` = the number of ordered pairs/coordinates? – Isaac Keleher Nov 25 '21 at 01:20
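Barmar's point can be sketched directly: partitioning the same data into nested loops does not multiply the total work.

```python
# Both variants touch 20 items in total; the nesting only groups the work.
data = list(range(20))

flat_ops = 0
for x in data:  # one flat loop: 20 iterations
    flat_ops += 1

nested_ops = 0
for i in range(0, 20, 5):  # 4 groups...
    for j in range(5):     # ...of 5 iterations each
        nested_ops += 1

print(flat_ops, nested_ops)  # both count 20
```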

1 Answer


O(KxNxM)

This process involves three obvious loops. These are:

  1. Checking each entry (K)
  2. Splitting valid entries into points (MxN) and iterating through those points (N)
  3. Splitting those points into respective coordinates (M)

The number of characters in a polygon string is ~MxN, because there are N points each roughly M characters long, so splitting the string iterates through ~MxN characters.

Now that we know all of this, let's pinpoint where each occurs.

ENTRIES (K):
    IF:
        SPLIT (MxN)
        POINTS (N):
            COORDS(M)

So, we can finally conclude that this is O(K(MxN + MxN)), which simplifies to O(KxNxM).
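As a rough sanity check (a sketch over a shortened, hypothetical polygon string), counting the work confirms it is proportional to the length of the string plus the number of coordinates:

```python
# Count the float() conversions done while parsing a short sample string.
# split() itself scans every character, so the total work is ~ len(polygon).
polygon = "22.3243,113.8659 22.3333,113.8691 22.4288,113.8691"

points = polygon.split(" ")  # one pass over the whole string
float_calls = 0
for p in points:
    lat, lon = p.split(",")  # one pass over this point's characters
    float(lat), float(lon)
    float_calls += 2

print(len(points), float_calls)  # 3 points, 6 coordinate conversions
```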

Larry the Llama
  • Thank you for taking the time to answer this Larry. – Isaac Keleher Nov 25 '21 at 01:36
  • Just a followup q: In the example data of my original post there are 32 points to be split, so 64 individual coordinates. So assuming 5 entries with polygons we have: `K * (M*N + M*N) = 5 * (32*2 + 32*2) = 640` operations. How could this be equivalent to `O(n)` where `N = 64` as discussed in the comments of the original question? – Isaac Keleher Nov 25 '21 at 02:04
  • @IsaacKeleher Yes, sort of. The M is the length of a point before it is split, so if each point is about ~15-20 characters, then M will be that, because the split iterates through _each_ character, the amount of coords in a point is negligible. – Larry the Llama Nov 25 '21 at 02:37