dr.who

After submitting the Spark Streaming program, the Spark Streaming UI showed only a single driver running under Executors, and the Streaming page kept queueing batches as if it were stuck. Clicking into the driver's logs -> stdout page, I saw many log entries like these:

```
18/09/04 19:10:45 org.apache.spark.internal.Logging$class.logWarning(Logging.scala:66) WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
18/09/04 19:10:51 org.apache.spark.internal.Logging$class.logWarning(Logging.scala:66) WARN YarnAllocator: Container marked as failed: container_e13_1534402443030_0354_02_000125 on host: WMBigdata6. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_e13_1534402443030_0354_02_000125
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
    at org.apache.hadoop.util.Shell.run(Shell.java:507)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```
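The "Initial job has not accepted any resources" warning usually means YARN has no NodeManager capacity to hand to the application. One quick way to confirm this from a cluster gateway node is the YARN CLI; this is a minimal sketch, guarded so it degrades gracefully on a machine that has no Hadoop installed:

```shell
# Quick YARN-side checks when a job reports
# "Initial job has not accepted any resources".
if command -v yarn >/dev/null 2>&1; then
    # List all NodeManagers and their state -- they should be RUNNING.
    yarn node -list -all
    # List running applications that may be holding the queue's resources.
    yarn application -list -appStates RUNNING
else
    echo "yarn CLI not found; run this on a cluster gateway node"
fi
```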

[How to resolve the YARN scheduling error "Stack trace: ExitCodeException exitCode=1"](https://blog.csdn.net/wendingzhulu/article/details/53571529) — this post attributes the error to a permissions problem, so I opened the RESOURCEMANAGER log on the CDH cluster:

```
vi /var/log/hadoop-yarn/hadoop-cmf-yarn-RESOURCEMANAGER-WMBigdata0.log.out
```

It contained similar log entries:

```
2018-08-30 12:58:36,062 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:appattempt_1534402443030_0230_000001 (auth:TOKEN) cause:org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1534402443030_0230_000001 doesn't exist in ApplicationMasterService cache.
2018-08-30 12:58:36,062 INFO org.apache.hadoop.ipc.Server: IPC Server handler 40 on 8030, call org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 10.0.201.124:56400 Call#2491 Retry#0
org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1534402443030_0230_000001 doesn't exist in ApplicationMasterService cache.
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:442)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2226)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2222)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2220)
```


I also found a `dr.who` entry:

```
2018-08-31 10:52:34,563 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: dr.who is accessing unchecked http://WMBigdata3:42229/api/v1/applications/application_1534402443030_0277/allexecutors which is the app master GUI of application_1534402443030_0277 owned by root
```


Searching for all of this turned up nothing. At that point I also noticed another exception:

```
Slow ReadProcessor read fields took 30001ms (threshold=30000ms);
```


[Troubleshooting a reduce job stuck at 100%](http://xiaoyue26.github.io/2018/02/05/2018-02/reduce100-卡死故障排除/) suggested checking the DataNode logs:

```
vi /var/log/hadoop-hdfs/hadoop-cmf-hdfs-DATANODE-bigdata3.log.out
```

Searching for the `error` keyword turned up:

```
2018-08-31 16:14:45,695 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: bigdata3:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.0.166.172:45462 dst: /10.0.166.172:50010
java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:203)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:901)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:808)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
    at java.lang.Thread.run(Thread.java:748)
```

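This kind of keyword search over a service log is easy to script with `grep`. A minimal sketch — the sample file `/tmp/datanode-sample.log` is a stand-in created for demonstration, not the real CDH log path:

```shell
# Build a tiny stand-in for the DataNode log so the filter can run anywhere.
cat > /tmp/datanode-sample.log <<'EOF'
2018-08-31 16:14:45,695 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DataXceiver error processing WRITE_BLOCK operation
2018-08-31 16:14:46,100 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: heartbeat ok
EOF

# Case-insensitive match with line numbers; on a real node, point this at
# the actual log under /var/log/hadoop-hdfs/ instead.
grep -in "error" /tmp/datanode-sample.log
```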

Googling eventually led me to [this solution](https://blog.csdn.net/knowledgeaaa/article/details/21240247):

Append the following limits to the end of `/etc/security/limits.conf` (the `-` type sets both the soft and hard limit):

```
* - nofile 1000000
* - nproc 1000000
```

Then, in the CDH UI, change `dfs.datanode.max.transfer.threads` from 4096 to 8192 and restart the HDFS service. The resulting property, as it would appear in `hdfs-site.xml`:

```
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
  <description>
    Specifies the maximum number of threads to use for transferring data
    in and out of the DN.
  </description>
</property>
```

After a while, the Spark Streaming UI recovered and was no longer stuck. Good grief — I had simply forgotten to raise the maximum open file limit earlier.
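To confirm that the new limits actually took effect, the shell builtins below print the values a session received. Note that `limits.conf` only applies to new logins, and daemons started by a service manager may keep their old limits until restarted; the numbers printed depend on the machine:

```shell
# Show the per-process limits the current shell session actually received.
# Re-login (or restart the daemon) after editing /etc/security/limits.conf --
# already-running processes keep their old limits.
ulimit -n   # max open file descriptors (nofile)
ulimit -u   # max user processes (nproc)
```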