
I am running my Spark application on EMR and have several println() statements. Other than the console, where do these statements get logged?

My S3 aws-logs directory structure for my cluster looks like:

node
├── i-0031cd7a536a42g1e
│   ├── applications
│   ├── bootstrap-actions
│   ├── daemons
│   ├── provision-node
│   └── setup-devices
containers/
├── application_12341331455631_0001
│   ├── container_12341331455631_0001_01_000001

B. Smith

2 Answers


You can find println output in a few places:

  • Resource Manager -> Your Application -> Logs -> stdout
  • Your S3 log directory -> containers/application_.../container_.../stdout (though this takes a few minutes to populate after the application finishes)
  • SSH into the EMR master node and run yarn logs -applicationId <Application ID> -log_files <log_file_type> (see the sketch below)
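
For example, a minimal sketch of the third option, assuming YARN log aggregation is enabled on the cluster; the application ID is just the one from the S3 layout above, used as a placeholder:

    # Run on the EMR master node; pulls the aggregated stdout for one application.
    # application_12341331455631_0001 is a placeholder - substitute your own application ID.
    yarn logs -applicationId application_12341331455631_0001 -log_files stdout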
ayplam
    None of these places is showing my output to stdout (and it's definitely in the Spark driver, not a lambda exported to the executors). I also can't seem to add anything to the logs at any log level using Log4j or Log4j2. Black hole. – nclark Nov 14 '19 at 18:37
  • Where is the Resource Manager for an EMR cluster? Like @nclark, I haven't found my output in any of these locations. – Cr4zyTun4 Jan 18 '22 at 10:10
  • https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html - you want the Resource Manager on port 8088; it may require some browser configuration. – ayplam Jan 19 '22 at 06:49

There is a very important thing that you need to consider when printing from Spark: are you running code that gets executed in the driver or is it code that runs in the executor?

For example, if you do the following, the output goes to the console because you are bringing the data back to the driver:

# collect() brings the data back to the driver, so these prints
# appear on the driver's console / stdout
for i in your_rdd.collect():
    print(i)

But the following will run within an executor, and thus it will be written to the Spark executor logs rather than the driver console:

def run_in_executor(value):
    # This print runs on an executor, so it ends up in that executor's
    # stdout (container) log, not on the driver console
    print(value)

# map() is lazy, so an action (count() here) is needed to actually run it
your_rdd.map(run_in_executor).count()

Now, coming back to your original question: the second case will write to the log location. Logs are usually written to the master node, under /mnt/var/log/hadoop/steps, but it is often better to configure logging to an S3 bucket with --log-uri when you create the cluster. That way the logs are much easier to find.
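
For example, a minimal sketch of creating a cluster with logs shipped to S3; the bucket name, release label, and instance settings below are placeholder assumptions, and --log-uri is the only part that matters here:

    # All names/values are hypothetical; the relevant flag is --log-uri
    aws emr create-cluster \
        --name "spark-cluster" \
        --release-label emr-6.2.0 \
        --applications Name=Spark \
        --instance-type m5.xlarge \
        --instance-count 3 \
        --use-default-roles \
        --log-uri s3://my-log-bucket/emr-logs/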

xmorera