I am trying to read all files from a folder and add a filename variable to each one, based on the name of the file it came from.

I am using the code below, but I am unable to add the variable that records the filename:

using DataFrames
using Queryverse
using VegaLite
using Statistics
using CSV
using Glob

path = "D:\\Udemy\\FInancial_Engineering_Lazy_Programmer\\Yfinance_Data"
files = glob("*.csv", path)

df_com = DataFrame()
for file in files
    df = CSV.File(file)
    df[:filename] = first(split(last(split(file, "\\")),"."))
    append!(df_com, df)
end

I get the following error:

ERROR: ArgumentError: invalid index: :filename of type Symbol
Stacktrace:
 [1] to_index(i::Symbol)
   @ Base .\indices.jl:300
 [2] to_index(A::CSV.File{false}, i::Symbol)
   @ Base .\indices.jl:277
 [3] to_indices
   @ .\indices.jl:333 [inlined]
 [4] to_indices
   @ .\indices.jl:325 [inlined]
 [5] setindex!(A::CSV.File{false}, v::Tuple{SubString{String}, Vector{Symbol}}, I::Symbol)
   @ Base .\abstractarray.jl:1267
 [6] top-level scope
   @ .\REPL[161]:3

Creating the filename itself is not a problem; the problem is adding it to the data frame. The code below works fine and prints the filename, but I am unable to add it as a variable:

for file in files
    println(first(split(last(split(file, "\\")),".")))
end

Can you please help?

1 Answer

This is the most terse way to do it:

reduce(vcat,
       CSV.read.(files, DataFrame),
       source=:filename => chop.(basename.(files), tail=4))

Now let me add some comments on your code; I hope they will be helpful:

  • split(file, "\\") is not recommended, as it works only on Windows; it is better to use basename, which works on all operating systems;
  • first(split(your_filename, ".")) is not correct, as it produces a wrong result if the file name contains more than one "."; it is cleaner to chop the last four characters, as you know they are .csv;
  • CSV.File(file) does not produce a DataFrame object, which is why df[:filename] = first(split(last(split(file, "\\")),".")) fails later; it is better to use CSV.read(file, DataFrame) to efficiently create a data frame, in which case you could e.g. add a column like this: df[!, :filename] .= first(split(last(split(file, "\\")),"."));
  • with that change your loop would work, but vcat is more efficient than repeated calls to append!, as vcat is optimized for merging multiple data frames; the reduce(vcat, ...) form lets you pass a vector of data frames (instead of having to list them);
  • finally, a benefit of vcat over append! is that you do not have to create the :filename column manually, as vcat supports the source keyword argument to handle your use case.
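
Putting these hints together, your original loop could be sketched roughly as follows. This is only an illustration under a few assumptions: the temporary directory and the two demo files (AAPL.csv and MSFT.v2.csv) are made up so that the snippet is self-contained, readdir is used instead of Glob to avoid an extra dependency, and splitext is used here as an alternative to chop for stripping the extension.

```julia
using CSV
using DataFrames

# Hypothetical demo data: two small CSV files in a temporary directory.
# In your case `files` would come from glob("*.csv", path) instead.
dir = mktempdir()
CSV.write(joinpath(dir, "AAPL.csv"), DataFrame(Close = [1.0, 2.0]))
CSV.write(joinpath(dir, "MSFT.v2.csv"), DataFrame(Close = [3.0]))

# Full paths of all .csv files in the directory (readdir returns them sorted)
files = filter(f -> endswith(f, ".csv"), readdir(dir, join = true))

df_com = DataFrame()
for file in files
    # CSV.read with DataFrame as the sink returns a mutable DataFrame
    df = CSV.read(file, DataFrame)
    # basename drops the directory part on any OS; splitext removes only the
    # final extension, so a name like "MSFT.v2.csv" keeps its inner dot
    df[!, :filename] .= first(splitext(basename(file)))
    append!(df_com, df)
end
```

The broadcasted assignment df[!, :filename] .= ... is what fills every row with the same file name; a plain = with a single string would not be accepted as a column.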

I hope these hints help you in using DataFrames.jl in general.

Bogumił Kamiński
  • Thanks a lot for your help. I was trying to combine multiple code snippets from Stack Overflow to implement the Python code I know. I will check the detailed documentation for the functions you mentioned. Thanks again for the prompt reply :) – Harneet.Lamba Aug 29 '21 at 17:28