
I have an array of arrays in my dataframe - split.(df2.name):

> 392-element Array{Array{SubString{String},1},1}:
 ["chevrolet", "chevelle", "malibu"]
 ["buick", "skylark", "320"]        
 ["plymouth", "satellite"]          
 ["amc", "rebel", "sst"]            
 ["ford", "torino"]                 
 ⋮                                  
 ["ford", "mustang", "gl"]          
 ["vw", "pickup"]                   
 ["dodge", "rampage"]               
 ["ford", "ranger"]                 
 ["chevy", "s-10"]   

I want to select all but the first element of each array and join them together, to get the model names of these cars.

My first attempt was model = join.(split.(df2.name)[2:end], " "), but instead of removing the first element of each inner array, this removes the first car (the first element of the outer array).

So I tried to broadcast the range [2:end] over all the elements by putting a dot just before it: model = join.(split.(df2.name).[2:end], " "). But this does not work either; it raises a syntax error:

syntax: missing last argument in "2:" range expression

So what is the Julian way to broadcast a range in such a case?

chefhose

2 Answers

It seems to be a little tricky to use a plain broadcast here since, as you found out, building the 2:end range explicitly results in a syntax error. I think this is because expressions like

a[2:end]

are specially parsed and lowered to something like

a[2:lastindex(a)]

as explained in the documentation for lastindex.
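
The lowering can be checked directly. This is a minimal sketch; the array a is a made-up example, not taken from the question's data:

```julia
# Hypothetical example array, not from the original data.
a = ["ford", "mustang", "gl"]

# Inside brackets, `end` stands for lastindex(a) ...
@assert a[2:end] == a[2:lastindex(a)]

# ... and the bracket syntax itself is sugar for a getindex call.
@assert a[2:end] == getindex(a, 2:lastindex(a))
```

Outside of brackets there is no array for end to refer to, which is why a free-standing 2:end cannot be parsed.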

You can however use an iterator like Iterators.drop in order to iterate through all elements but the first, an operation which can be broadcasted:

julia> cars = [["chevrolet", "chevelle", "malibu"],
               ["buick", "skylark", "320"],        
               ["plymouth", "satellite"],          
               ["amc", "rebel", "sst"],            
               ["ford", "torino"]];

julia> join.(Iterators.drop.(cars, 1), " ")
5-element Array{String,1}:
 "chevelle malibu"
 "skylark 320"
 "satellite"
 "rebel sst"
 "torino"

But I think that in this case, I would probably go for a comprehension, which I think would be more readable:

julia> [join(car[2:end], " ") for car in cars]
5-element Array{String,1}:
 "chevelle malibu"
 "skylark 320"
 "satellite"
 "rebel sst"
 "torino"


EDIT: looking back at your overall problem, it looks like you first split more than you want, then struggle to join back the parts you didn't want to split in the first place.

So you might want to take advantage of the limit keyword argument to split, so that you don't split too many words in the first place:

julia> cars2 = ["chevrolet chevelle malibu",
                "buick skylark 320",
                "plymouth satellite",
                "amc rebel sst",
                "ford torino"];

julia> split.(cars2, " ", limit=2)
5-element Array{Array{SubString{String},1},1}:
 ["chevrolet", "chevelle malibu"]
 ["buick", "skylark 320"]
 ["plymouth", "satellite"]
 ["amc", "rebel sst"]
 ["ford", "torino"]

julia> getindex.(split.(cars2, " ", limit=2), 2)
5-element Array{SubString{String},1}:
 "chevelle malibu"
 "skylark 320"
 "satellite"
 "rebel sst"
 "torino"

This last example also demonstrates how to broadcast indexing syntax via the explicit getindex(a, i) function call, which is the lowered form of syntactic sugar a[i].
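
One caveat with this approach: a name consisting of a single word splits into a one-element array, so indexing it with 2 throws a BoundsError. A possible guard is sketched below; the helper model_name and the sample data are hypothetical and not part of the original question:

```julia
# Hypothetical helper (not from the original answer): return everything after
# the first word, or an empty string when the name has only one word.
function model_name(name::AbstractString)
    parts = split(name, " ", limit=2)
    return length(parts) == 2 ? String(parts[2]) : ""
end

carnames = ["ford torino", "amc rebel sst", "volvo"]   # made-up sample data

# A user-defined function broadcasts with the dot syntax like any other.
models = model_name.(carnames)
```

Whether an empty string is the right fallback depends on the data; returning missing or nothing would work just as well.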

François Févotte

In general, broadcasting indexing syntax works by converting it to getindex and broadcasting that:

model = join.(getindex.(split.(df2.name), Ref(2:10)), " ")

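Ref is needed to treat the range as a scalar; you can also use a 1-tuple instead. Note that a fixed range like 2:10 only works if every inner array has at least 10 elements; otherwise getindex throws a BoundsError.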
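
To illustrate what the wrapper changes, here is a small sketch on made-up data (the two-element cars array is only an assumption for the example):

```julia
cars = [["chevrolet", "chevelle", "malibu"],
        ["buick", "skylark", "320"]]

# Ref(...) marks the range as a single value that is reused for every inner array.
with_ref = getindex.(cars, Ref(2:3))

# A 1-tuple holding the range is broadcast the same way.
with_tuple = getindex.(cars, (2:3,))

# Without the wrapper, the range itself is broadcast elementwise:
# cars[1] is indexed with 2 and cars[2] with 3, yielding single strings.
without_wrapper = getindex.(cars, 2:3)
```

The unwrapped version only runs here because the range happens to have the same length as cars; with mismatched lengths it would raise a DimensionMismatch.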

That's the easy part. However, this trick becomes ugly to use with end, since outside of the brackets it has no meaning, which is the reason for the error you got. One way to resolve the problem is to replace end with lastindex, applied to each inner array, but then you should probably cache the array calculation:

model = let nameparts = split.(df2.name)
    join.(getindex.(nameparts, UnitRange.(2, lastindex.(nameparts))), " ")
end

This loses the advantage of the broadcast fusion, though.

In this specific case, you could also use Iterators.rest, since we know how Array iteration works (the iteration state is the index of the next element):

join.(Iterators.rest.(split.(df2.name), 2), " ")
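
A small sketch on made-up data, comparing this with Iterators.drop. It assumes the current behaviour of Vector iteration, where the state is the index of the next element:

```julia
cars = [["chevrolet", "chevelle", "malibu"],
        ["plymouth", "satellite"]]

# For a Vector, the iteration state is the index of the next element to visit,
# so rest(v, 2) resumes iteration at v[2]. This relies on an implementation detail.
via_rest = join.(Iterators.rest.(cars, 2), " ")

# Iterators.drop(v, 1) expresses the same thing without referring to the state.
via_drop = join.(Iterators.drop.(cars, 1), " ")

@assert via_rest == via_drop
```

Because drop does not depend on what the state looks like, it also works for iterables other than arrays.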

But the easiest version in my opinion is just a comprehension:

model = [join(split(carname)[2:end], " ") for carname in df2.name]

(Unless you're very familiar with Iterators. Then I'd personally prefer the previous one.)

phipsgabler