I have a file in the following format:

Line 1 {"name": "Hotel Eiffel Petit Louvre", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "3.870967741935484", "stars": "2.5", "max_price": "324", "min_price": "117", "ref": "208100", "review": "Within walking distance of the Eiffel Tower."}
Line 2 {"name": "Novotel Paris Centre Tour Eiffel", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "3.1739130434782608", "stars": "4", "max_price": "271", "min_price": "149", "ref": "233528", "review": "Close to Seine river and Eiffel Tower."}
Line 3 {"name": "Hotel Tourisme Avenue", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.703125", "stars": "3", "max_price": "285", "min_price": "130", "ref": "558849", "review": "Close to the Eiffel Tower and metro station literally right outside the door."}
Line 4 {"name": "Hotel du Champ de Mars", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.714285714285714", "stars": "3", "max_price": "255", "min_price": "189", "ref": "570544", "review": "Very close to everything including the Eiffel Tower."}
Line 5 {"name": "Le Derby Alma", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.707865168539326", "stars": "4", "max_price": "418", "min_price": "210", "ref": "240927", "review": "Only a couple of blocks from the Eiffel Tower."}
Line 6 {"name": "Hotel Eiffel Seine", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.237288135593221", "stars": "0", "max_price": "297", "min_price": "141", "ref": "572984", "review": "Driectly next to 2 amazing cafes and literally only a 4 minute walk to the Eiffel Tower."}
Line 7 {"name": "Hotel Galileo", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.5396825396825395", "stars": "3", "max_price": "599", "min_price": "90", "ref": "197576", "review": "Within walking distance to the Eiffel Tower and many other attractions."}
Line 8 {"name": "Hotel Eiffel Seine", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.237288135593221", "stars": "0", "max_price": "297", "min_price": "141", "ref": "572984", "review": "Only a few blocks from Eiffel tower and about a short block from river Seine."}
Line 9 {"name": "Hotel Relais Bosquet Paris", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.8", "stars": "3", "max_price": "332", "min_price": "145", "ref": "229602", "review": "Very close to the metro station, restaurants and the Eiffel Tower!"}
Line 10 {"name": "Hotel Le Marquis", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.673333333333333", "stars": "4", "max_price": "368", "min_price": "155", "ref": "290384", "review": "Near a metro station, a few blocks from the Eiffel tower, and a grocery store across the street."}
Line 11 {"name": "Hotel Relais Bosquet Paris", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.8", "stars": "3", "max_price": "332", "min_price": "145", "ref": "229602", "review": "Located a 10 minute walk to Eiffel Tower."}
Line 12 {"name": "Hotel Eiffel Petit Louvre", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "3.870967741935484", "stars": "2.5", "max_price": "324", "min_price": "117", "ref": "208100", "review": "Metro station is literally across the road."}
Line 13 {"name": "Novotel Paris Centre Tour Eiffel", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "3.1739130434782608", "stars": "4", "max_price": "271", "min_price": "149", "ref": "233528", "review": "Its about 1.5 kms from Eiffel Tower and about 3 kms from Champ de ellesse."}
Line 14 {"name": "Hotel Tourisme Avenue", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.703125", "stars": "3", "max_price": "285", "min_price": "130", "ref": "558849", "review": "It is conveniently located a few steps (literally) from the Metro, about a 7 mins walk from the Eiffel Tower, there is a supermarket across the street, a bakery two stores down, and many cafes and restaurants close by."}
Line 15 {"name": "Hotel du Champ de Mars", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.714285714285714", "stars": "3", "max_price": "255", "min_price": "189", "ref": "570544", "review": "Location is absolutely brilliant, only a few mins to Ecole Militaire metro and 15min walk to the Eiffel Tower."}
Line 16 {"name": "Le Derby Alma", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.707865168539326", "stars": "4", "max_price": "418", "min_price": "210", "ref": "240927", "review": "Very nice small hotel right by the Eiffel tower."}
Line 17 {"name": "Hotel Galileo", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.5396825396825395", "stars": "3", "max_price": "599", "min_price": "90", "ref": "197576", "review": "It’s a small hotel near Champs-Elysées!!!"}
Line 18 {"name": "Hotel Le Marquis", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.673333333333333", "stars": "4", "max_price": "368", "min_price": "155", "ref": "290384", "review": "Fantastic  Boutique Hotel, Location only 5 mins walk to Eiffel Tower."}

For convenience, I have given an example of 18 lines, but the actual file has millions of lines. What would be the fastest way, with minimum latency, to group the lines by "name" while changing the original order as little as possible, like the following?

Line 1 {"name": "Hotel Eiffel Petit Louvre", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "3.870967741935484", "stars": "2.5", "max_price": "324", "min_price": "117", "ref": "208100", "review": "Within walking distance of the Eiffel Tower."}
Line 12 {"name": "Hotel Eiffel Petit Louvre", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "3.870967741935484", "stars": "2.5", "max_price": "324", "min_price": "117", "ref": "208100", "review": "Metro station is literally across the road."}
Line 2 {"name": "Novotel Paris Centre Tour Eiffel", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "3.1739130434782608", "stars": "4", "max_price": "271", "min_price": "149", "ref": "233528", "review": "Close to Seine river and Eiffel Tower."}
Line 13 {"name": "Novotel Paris Centre Tour Eiffel", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "3.1739130434782608", "stars": "4", "max_price": "271", "min_price": "149", "ref": "233528", "review": "Its about 1.5 kms from Eiffel Tower and about 3 kms from Champ de ellesse."}
Line 3 {"name": "Hotel Tourisme Avenue", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.703125", "stars": "3", "max_price": "285", "min_price": "130", "ref": "558849", "review": "Close to the Eiffel Tower and metro station literally right outside the door."}
Line 14 {"name": "Hotel Tourisme Avenue", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.703125", "stars": "3", "max_price": "285", "min_price": "130", "ref": "558849", "review": "It is conveniently located a few steps (literally) from the Metro, about a 7 mins walk from the Eiffel Tower, there is a supermarket across the street, a bakery two stores down, and many cafes and restaurants close by."}
Line 4 {"name": "Hotel du Champ de Mars", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.714285714285714", "stars": "3", "max_price": "255", "min_price": "189", "ref": "570544", "review": "Very close to everything including the Eiffel Tower."}
Line 15 {"name": "Hotel du Champ de Mars", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.714285714285714", "stars": "3", "max_price": "255", "min_price": "189", "ref": "570544", "review": "Location is absolutely brilliant, only a few mins to Ecole Militaire metro and 15min walk to the Eiffel Tower."}
Line 5 {"name": "Le Derby Alma", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.707865168539326", "stars": "4", "max_price": "418", "min_price": "210", "ref": "240927", "review": "Only a couple of blocks from the Eiffel Tower."}
Line 16 {"name": "Le Derby Alma", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.707865168539326", "stars": "4", "max_price": "418", "min_price": "210", "ref": "240927", "review": "Very nice small hotel right by the Eiffel tower."}
Line 6 {"name": "Hotel Eiffel Seine", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.237288135593221", "stars": "0", "max_price": "297", "min_price": "141", "ref": "572984", "review": "Driectly next to 2 amazing cafes and literally only a 4 minute walk to the Eiffel Tower."}
Line 8 {"name": "Hotel Eiffel Seine", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.237288135593221", "stars": "0", "max_price": "297", "min_price": "141", "ref": "572984", "review": "Only a few blocks from Eiffel tower and about a short block from river Seine."}
Line 7 {"name": "Hotel Galileo", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.5396825396825395", "stars": "3", "max_price": "599", "min_price": "90", "ref": "197576", "review": "Within walking distance to the Eiffel Tower and many other attractions."}
Line 17 {"name": "Hotel Galileo", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.5396825396825395", "stars": "3", "max_price": "599", "min_price": "90", "ref": "197576", "review": "It’s a small hotel near Champs-Elysées!!!"}
Line 9 {"name": "Hotel Relais Bosquet Paris", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.8", "stars": "3", "max_price": "332", "min_price": "145", "ref": "229602", "review": "Very close to the metro station, restaurants and the Eiffel Tower!"}
Line 11 {"name": "Hotel Relais Bosquet Paris", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.8", "stars": "3", "max_price": "332", "min_price": "145", "ref": "229602", "review": "Located a 10 minute walk to Eiffel Tower."}
Line 10 {"name": "Hotel Le Marquis", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.673333333333333", "stars": "4", "max_price": "368", "min_price": "155", "ref": "290384", "review": "Near a metro station, a few blocks from the Eiffel tower, and a grocery store across the street."}
Line 18 {"name": "Hotel Le Marquis", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.673333333333333", "stars": "4", "max_price": "368", "min_price": "155", "ref": "290384", "review": "Fantastic  Boutique Hotel, Location only 5 mins walk to Eiffel Tower."}

I heard that it is possible to do this with jq. If so, what would the command look like? If there are faster tools, I would love to know.

Note: The following must be the 3rd line!

Line 2 {"name": "Novotel Paris Centre Tour Eiffel", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "3.1739130434782608", "stars": "4", "max_price": "271", "min_price": "149", "ref": "233528", "review": "Close to Seine river and Eiffel Tower."}

Best,

yusuf
  • Clarify if the Line is part of the JSON or added above for the purposes of demonstration – Inian Aug 01 '22 at 10:24

2 Answers

What would be the fastest way with the minimum latency to group the lines by "name" with the minimum order change

In brief - use GROUP_BY/2, defined by:

def GROUP_BY(stream;f): reduce stream as $x ({}; .[$x|f] += [$x]);

In your case, you'd use this as follows:

GROUP_BY(inputs; .name)[][]

with invocation along the lines of: jq -cnf program.jq lines.json

(Notice: no slurping!)

Explanation

  1. "minimum order change" is accomplished because jq constructs objects incrementally, adding new keys after old ones.

  2. "fastest way" is accomplished because this solution does not involve the sorting of the input.

  3. "minimum latency" is accomplished because the input is not "slurped".
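
For readers without jq at hand, the same idea can be sketched in Python. This is only an illustration of the technique, not part of the jq solution: the function name `group_by_name` is hypothetical, and it assumes one JSON object per line with a string-valued "name" key. Like the `reduce` above, it keeps every line in memory while reading and relies on insertion-ordered mappings (guaranteed for Python `dict` since 3.7), so groups come out in order of first appearance.

```python
import json


def group_by_name(lines):
    """Yield lines grouped by "name", keeping groups in first-seen order."""
    groups = {}  # name -> list of raw lines; dict preserves insertion order
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name = json.loads(line)["name"]
        groups.setdefault(name, []).append(line)
    for members in groups.values():
        yield from members


# Tiny in-memory demo with made-up records (not the real data set).
sample = [
    '{"name": "A", "review": "first"}',
    '{"name": "B", "review": "second"}',
    '{"name": "A", "review": "third"}',
]
for out in group_by_name(sample):
    print(out)
```

In the demo, both "A" records are emitted before the "B" record, and no sorting takes place. For a real file, the function could be fed an open file object instead of a list.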

peak

If the JSON content is always structured the same way (.name up front), it'd suffice to use sort from GNU coreutils:

sort file.json
{"name": "Hotel du Champ de Mars", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.714285714285714", "stars": "3", "max_price": "255", "min_price": "189", "ref": "570544", "review": "Location is absolutely brilliant, only a few mins to Ecole Militaire metro and 15min walk to the Eiffel Tower."}
{"name": "Hotel du Champ de Mars", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.714285714285714", "stars": "3", "max_price": "255", "min_price": "189", "ref": "570544", "review": "Very close to everything including the Eiffel Tower."}
{"name": "Hotel Eiffel Petit Louvre", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "3.870967741935484", "stars": "2.5", "max_price": "324", "min_price": "117", "ref": "208100", "review": "Metro station is literally across the road."}
{"name": "Hotel Eiffel Petit Louvre", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "3.870967741935484", "stars": "2.5", "max_price": "324", "min_price": "117", "ref": "208100", "review": "Within walking distance of the Eiffel Tower."}
{"name": "Hotel Eiffel Seine", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.237288135593221", "stars": "0", "max_price": "297", "min_price": "141", "ref": "572984", "review": "Driectly next to 2 amazing cafes and literally only a 4 minute walk to the Eiffel Tower."}
{"name": "Hotel Eiffel Seine", "detailed_city": "Europe | France | Ile-de-France | Paris", "review_rating": "4.237288135593221", "stars": "0", "max_price": "297", "min_price": "141", "ref": "572984", "review": "Only a few blocks from Eiffel tower and about a short block from river Seine."}
:

If not, you can --slurp (or -s) the JSON stream, sort_by the .name field, and use the --compact-output (or -c) format.

jq -sc 'sort_by(.name)[]' file.json
{"name":"Hotel Eiffel Petit Louvre","detailed_city":"Europe | France | Ile-de-France | Paris","review_rating":"3.870967741935484","stars":"2.5","max_price":"324","min_price":"117","ref":"208100","review":"Within walking distance of the Eiffel Tower."}
{"name":"Hotel Eiffel Petit Louvre","detailed_city":"Europe | France | Ile-de-France | Paris","review_rating":"3.870967741935484","stars":"2.5","max_price":"324","min_price":"117","ref":"208100","review":"Metro station is literally across the road."}
{"name":"Hotel Eiffel Seine","detailed_city":"Europe | France | Ile-de-France | Paris","review_rating":"4.237288135593221","stars":"0","max_price":"297","min_price":"141","ref":"572984","review":"Driectly next to 2 amazing cafes and literally only a 4 minute walk to the Eiffel Tower."}
{"name":"Hotel Eiffel Seine","detailed_city":"Europe | France | Ile-de-France | Paris","review_rating":"4.237288135593221","stars":"0","max_price":"297","min_price":"141","ref":"572984","review":"Only a few blocks from Eiffel tower and about a short block from river Seine."}
{"name":"Hotel Galileo","detailed_city":"Europe | France | Ile-de-France | Paris","review_rating":"4.5396825396825395","stars":"3","max_price":"599","min_price":"90","ref":"197576","review":"Within walking distance to the Eiffel Tower and many other attractions."}
{"name":"Hotel Galileo","detailed_city":"Europe | France | Ile-de-France | Paris","review_rating":"4.5396825396825395","stars":"3","max_price":"599","min_price":"90","ref":"197576","review":"It’s a small hotel near Champs-Elysées!!!"}
:
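
To make the trade-off of the sorting approach concrete, here is a rough Python sketch of the same idea. It is an illustration under assumptions, not a jq equivalent: `sort_by_name` is a hypothetical helper, and the records are made up. Python's `sorted` is stable, so lines within one group keep their relative order, but the groups themselves come out in name order rather than in order of first appearance.

```python
import json


def sort_by_name(lines):
    """Return lines sorted by their "name" field (stable sort)."""
    return sorted(lines, key=lambda line: json.loads(line)["name"])


# Made-up records: "Novotel" appears first in the input.
records = [
    '{"name": "Novotel", "review": "a"}',
    '{"name": "Hotel Galileo", "review": "b"}',
    '{"name": "Novotel", "review": "c"}',
    '{"name": "Hotel Galileo", "review": "d"}',
]
for out in sort_by_name(records):
    print(out)
```

Here both "Hotel Galileo" lines are printed before the "Novotel" lines even though "Novotel" came first in the input, which is why a sort-based solution cannot by itself satisfy a "minimum order change" requirement.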

pmf
  • I don't think the line number is part of the JSON input ;) – Inian Aug 01 '22 at 10:15
  • And a million lines of input, with slurping? It is probably worthwhile to invoke the streaming parser. – Inian Aug 01 '22 at 10:16
  • @Inian Weirdly formatted files are all over the place, so I took it literally, unlike the "millions of lines", which I believed to be an exaggeration considering the actual content of the file. Hopefully, OP will comment on how to improve in their sense. :) – pmf Aug 01 '22 at 10:23
  • Yes, "Line" is not part of the input. I have demonstrated them for the sake of the convenience of representing my problem – yusuf Aug 01 '22 at 10:29
  • @yusuf Dismantled the answer to only accept input without the "Line" prefixes. – pmf Aug 01 '22 at 11:44
  • @pmf, actually you can keep the "Line" answer as well, because according to my expectations, Novotel Paris Centre Tour Eiffel must be the 3rd output. So, it seems like, in order to get the best ranking, we have to take "Line" into account. – yusuf Aug 01 '22 at 12:12
  • @yusuf The ordering should be identical regardless of "Line" being present and disregarded, or not present in the first place. The differences in the ordering between the GNU `sort` and the `jq` approach come from different underlying comparison methods. You can affect the former by using another collation locale, e.g. `LC_COLLATE=C`, while the latter is bound to Unicode code points (not sure if it can be changed). Regardless, if there are more constraints (e.g. "must be the 3rd output"), you should include them in the question (along with removing the "Line" prefixes from the input samples). – pmf Aug 01 '22 at 12:24
  • @pmf, could you help me to show how to do it? – yusuf Aug 01 '22 at 12:29
  • @yusuf Below the tags to your question there is an [Edit](https://stackoverflow.com/posts/73191532/edit) button, which should bring you back to the editor with the current version of your question preloaded. Edit and save it as you did the first time. – pmf Aug 01 '22 at 12:31
  • No, I mean the real sorting as I want to have. Okay, let me edit the question. – yusuf Aug 01 '22 at 12:32
  • @pmf, I have edited. – yusuf Aug 01 '22 at 12:46
  • @yusuf I meant some **general** constraints, e.g. how to order upper case letters with respect to their lower case counterparts, or how to treat diacritics with respect to their plain variants, and so on. Locking one item (or all items) by name into a specific position is a static measure, whereas you probably want a solution that can be fed with data that is confined in form but unknown or arbitrary in content. – pmf Aug 01 '22 at 13:16
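
The ordering difference discussed in the comments above can be reproduced with a small Python sketch. It is only an approximation: plain `sorted` compares by Unicode code point (upper case before lower case), while `str.casefold` is used here as a rough stand-in for a case-insensitive locale collation; it is not what GNU `sort` does internally.

```python
names = ["Hotel du Champ de Mars", "Hotel Eiffel Seine", "Hotel Galileo"]

# Code point order: "E" (U+0045) and "G" (U+0047) sort before "d" (U+0064).
by_codepoint = sorted(names)

# Case-insensitive order: "d" < "e" < "g" after case folding.
by_casefold = sorted(names, key=str.casefold)

print(by_codepoint)
print(by_casefold)
```

With these three names, the code point ordering places "Hotel du Champ de Mars" last, whereas the case-folded ordering places it first.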