2

I am using CFSpreadsheet to read a .xlsx file.

The file has about 3000 duplicates which I can safely ignore so I thought I'd do a select distinct QoQ but once I do this, the results are ordered as if order by col_1, col_2 was added to the query which is a very bad thing.

<cfspreadsheet query = "qSheet" ...>
<cfquery dbtype="query" name = "qDistinctSheet">
    select distinct
          col_1
        , col_2
    from
        qSheet
</cfquery> 
<cfdump var = "#qDistinctSheet#">

If I remove distinct I get the expected results which SHOULD be:

  1. [empty string]
  2. Name
  3. John
  4. John
  5. Adam
  6. Steve
  7. Bob
  8. Bob

When I add distinct I get

  1. [Empty String]
  2. Adam
  3. Bob
  4. John
  5. Name
  6. Steve

Any idea how to prevent this unwanted ordering?

Edit

End solution is to apply a row number and use group by as suggested by Matt and Dan

<cfset ids = []>
<cfloop query="qSheet">
    <cfset ids[qSheet.currentRow] = qSheet.currentRow>
</cfloop>
<cfset queryAddColumn(qSheet,"id",ids)>
<cfquery dbtype="query" name="qDistinct">
    SELECT  
          col_1
        , col_2
        , min(ID) AS firstID
    FROM
        qSheet
    GROUP BY    
        col_1
        , col_2
    ORDER BY
        firstID
</cfquery>
Leigh
  • 28,765
  • 10
  • 55
  • 103
genericHCU
  • 4,394
  • 2
  • 22
  • 34
  • What is the use case for not having them ordered - not sure why it matters when using `distinct`. – Scott Stroz Mar 04 '15 at 13:30
  • Is there a row number in your cfspreadsheet query? If so, you can order be that. – Matt Busche Mar 04 '15 at 13:30
  • @MattBusche adding distinct ID (or row number) would give back all rows – genericHCU Mar 04 '15 at 13:41
  • 3
    taking the min ID based on the name wouldn't, use a GROUP BY instead – Matt Busche Mar 04 '15 at 13:46
  • @ScottStroz because the format of the spreadsheet can change so I need to be able to determine what order the columns are in. Eventually this will be used for more than 2 columns. Good question though, thanks. – genericHCU Mar 04 '15 at 13:46
  • @MattBusche the sheet doesn't have IDs and I do not see a way to use the row number as a field, `#qSheet.currentRow# as ID`only returns 1 for every row. – genericHCU Mar 04 '15 at 14:06
  • Use ColdFusion query functions to add and populate a new column. – Dan Bracuk Mar 04 '15 at 14:07
  • @DanBracuk that would require looping over the entire sheet which is what i'm trying to avoid. at that point it would be just as easy to check to see if the qSheet.col_1[currentRow] = qSheet.col_1[currentRow-1] and skip the processing. – genericHCU Mar 04 '15 at 14:26
  • Checking the value of one row to the previous one also requires looping. It is also based on the assumption that duplicate records only occur in consecutive rows. – Dan Bracuk Mar 04 '15 at 14:53
  • 1
    *because the format of the spreadsheet can change..* Not to beat a dead horse, but ... why does that matter? :) What it is you need to do that would break if the names were sorted differently? From what you have described so far, I do not think you can avoid looping. However, it does not seem we have the whole picture. – Leigh Mar 04 '15 at 16:34
  • @Leigh this is sample data of course so `name` was just a fictional column header. The actual data will be aircraft data from several sources so while the column headings are similar they will not be in the same order from every source and may not even be in the same order from the same source week to week. I'm open to ideas if I'm going about it wrong. Currently I'm looking for possible headers in the first 20 rows (because that can change too), determine what order the columns are in and process the data. – genericHCU Mar 04 '15 at 17:02
  • I've been testing Dan and Matt's idea and so far the performance hit isn't as bad as i thought it would be with larger record sets so I'll probably go with that. – genericHCU Mar 04 '15 at 17:03
  • When you are using `distinct`, you are getting a subset from the initial result set, so, why does the order of the second result set matter? – Scott Stroz Mar 04 '15 at 17:50
  • FWIW, in any SQL engine (even QoQ), not using an 'order by' clause states that the engine itself can determine the order of results. In Oracle, for example, you can get almost random orders. – Joe Rinehart Mar 04 '15 at 17:51
  • @ScottStroz because i'm expecting the headings to be within the first 20 rows, in my live data it is not. – genericHCU Mar 04 '15 at 17:51
  • Do you need the headings for any reason? Couldn't you simply alias the columns and strip out the 'headings'? Or have the headings be removed when you are importing with `cfspreadsheet`? – Scott Stroz Mar 04 '15 at 17:54
  • 1
    If you do not need the header rows for any reason, use `excludeHeaderRow=true` when importing the spreadsheet. Then you can use `distinct` as you outlined – Scott Stroz Mar 04 '15 at 17:57
  • @ScottStroz as I said above, the columns could be in various orders. `col_1` may contain `serial number` today but `repair facility` tomorrow. They will also never be at the top of the spreadsheet because each source likes to put self important garbage before the actual data. "provided by:" "here's a pic of our logo" "generated on...", etc. Everyone has their own info so it isn't always going to start on row 5 or 7 or 8... so to remove that nonsense it is easiest for me to find the row which contains the headings to start the data. – genericHCU Mar 04 '15 at 17:58
  • Trying to parse dynamic column names from a spreadsheet where the 'header row' is not actually the header row is quite daunting...as you are finding out. Sounds to me like it would be easiest to tell people providing the spreadsheets NOT to put nonsense like that at the top of the spreadsheet. Garbage in...garbage out... – Scott Stroz Mar 04 '15 at 18:03
  • @ScottStroz I would love to do that but unfortunately it's been tried by people with bigger paychecks than me so the requirement falls to me to make it as flexible as possible. Some of the sources aren't even required to provide the data by contract so they think they're being nice enough and not going to change their report to make our lives easier. – genericHCU Mar 04 '15 at 18:05
  • @MattBusche I took your approach to solve the problem if you'd like to make it an answer so I can close it out. Thanks everyone for your input! – genericHCU Mar 04 '15 at 18:16
  • *"where the 'header row' is not actually the header row"* @Travis - Ahh, that is was the missing piece. If that is the case, I agree with the general consensus. There are not a lot of options for a QoQ. Looping is probably your best bet. (If you were bulk importing into a staging table, that is a different story). – Leigh Mar 04 '15 at 18:42
  • No need to loop - CF is like MySQL in that group by will let you select non-grouped columns, returning the first value - @MattBusche's recommendation to add a column with a row number then essentially do "select column1, column2 from query group by column1, column2 where rowNumber > 1 order by rowNumber" should work – Joe Rinehart Mar 04 '15 at 18:52
  • @JoeRinehart how would you add a column with a row number in it without a loop of some kind? – genericHCU Mar 04 '15 at 18:57
  • @Leigh Sorry about that, I tried to be as accurate as possible while divulging as little real info as possible. I'll work on that. – genericHCU Mar 04 '15 at 18:58
  • @Travis I added an answer – Matt Busche Mar 04 '15 at 19:16

1 Answers1

4

You can use a GROUP BY option instead and use the ID row from the spreadsheet query

<cfquery dbtype="query" name="qDistinct">
SELECT  
      col_1
    , col_2
    , min(ID) AS firstID
FROM
    qSheet
GROUP BY    
    col_1
    , col_2
ORDER BY
    firstID

Matt Busche
  • 14,216
  • 5
  • 36
  • 61