Wait the light to fall

dr.who

焉知非鱼

After I submitted a Spark Streaming job and opened the Spark UI, only the driver showed up under Executors, and the Streaming tab kept queueing batches as if the job were stuck. Drilling into the driver's logs -> stdout page, I found many entries like these:

18/09/04 19:10:45 org.apache.spark.internal.Logging$class.logWarning(Logging.scala:66) WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
18/09/04 19:10:51 org.apache.spark.internal.Logging$class.logWarning(Logging.scala:66) WARN YarnAllocator: Container marked as failed: container_e13_1534402443030_0354_02_000125 on host: WMBigdata6. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_e13_1534402443030_0354_02_000125
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
	at org.apache.hadoop.util.Shell.run(Shell.java:507)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
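The warnings alone do not say why the container exited with code 1. One way to dig further (assuming YARN log aggregation is enabled on the cluster, and using the application id from the warning above) is to pull the aggregated logs and scan them for the underlying error:

```shell
# Fetch the aggregated logs for the application whose containers kept failing.
yarn logs -applicationId application_1534402443030_0354 > app_0354.log

# Surface the first real errors buried in the container output.
grep -n -i 'error\|exception' app_0354.log | head -20
```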

A post on fixing the YARN scheduling error "Stack trace: ExitCodeException exitCode=1" said it was a permissions problem, so I opened the RESOURCEMANAGER log on the CDH cluster:

vi /var/log/hadoop-yarn/hadoop-cmf-yarn-RESOURCEMANAGER-WMBigdata0.log.out 

It contained entries like the following:

2018-08-30 12:58:36,062 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:appattempt_1534402443030_0230_000001 (auth:TOKEN) cause:org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1534402443030_0230_000001 doesn't exist in ApplicationMasterService cache.
2018-08-30 12:58:36,062 INFO org.apache.hadoop.ipc.Server: IPC Server handler 40 on 8030, call org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 10.0.201.124:56400 Call#2491 Retry#0
org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1534402443030_0230_000001 doesn't exist in ApplicationMasterService cache.
        at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:442)
        at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
        at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2226)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2222)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2220)

I also noticed a user called dr.who:

2018-08-31 10:52:34,563 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: dr.who is accessing unchecked http://WMBigdata3:42229/api/v1/applications/application_1534402443030_0277/allexecutors which is the app master GUI of application_1534402443030_0277 owned by root
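The dr.who here is a red herring: it is simply the default identity Hadoop assigns to unauthenticated web UI requests, controlled by the `hadoop.http.staticuser.user` property in core-site.xml (shown below with its default value):

```xml
<property>
    <name>hadoop.http.staticuser.user</name>
    <value>dr.who</value>
    <description>
        The user name to filter as, on static web filters
        while rendering content.
    </description>
</property>
```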

Searching around for all of this turned up nothing. At that point I spotted yet another warning:

Slow ReadProcessor read fields took 30001ms (threshold=30000ms);

A post on troubleshooting a reduce stuck at 100% pointed me to the DataNode log:

vi /var/log/hadoop-hdfs/hadoop-cmf-hdfs-DATANODE-bigdata3.log.out 
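Rather than paging through the whole file in vi, grepping the same log for ERROR entries gets to the point faster:

```shell
# List the most recent ERROR lines in the DataNode log (same file as above).
grep -n 'ERROR' /var/log/hadoop-hdfs/hadoop-cmf-hdfs-DATANODE-bigdata3.log.out | tail -20
```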

Searching it for the keyword "error" gave:

2018-08-31 16:14:45,695 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: bigdata3:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.0.166.172:45462 dst: /10.0.166.172:50010
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:203)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:901)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:808)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
        at java.lang.Thread.run(Thread.java:748)

Googling eventually turned up a fix: raise the per-user open-file (nofile) and process (nproc) limits:

/etc/security/limits.conf
# End of file
*               -      nofile          1000000
*               -      nproc           1000000
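After editing limits.conf you need to log out and back in (and restart the DataNode process) for the new limits to apply. A quick sanity check, which should reflect the limits.conf values above once a new login session picks them up:

```shell
# Current per-session limits for the logged-in user.
ulimit -n        # max open file descriptors (nofile)
ulimit -u        # max user processes (nproc)
```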

In the CDH UI I also changed dfs.datanode.max.transfer.threads from 4096 to 8192 and restarted the HDFS service:

<property> 
    <name>dfs.datanode.max.transfer.threads</name> 
    <value>8192</value> 
    <description> 
        Specifies the maximum number of threads to use for transferring data
        in and out of the DN. 
    </description>
</property>
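Whether the new value is actually live can be checked from a cluster node once HDFS has restarted (this reads the local client configuration, so run it on a node that received the updated config):

```shell
# Print the effective setting for the transfer-thread limit.
hdfs getconf -confKey dfs.datanode.max.transfer.threads
```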

A little while later the Spark Streaming UI came back to life and stopped getting stuck. Good grief: I had simply forgotten to raise the maximum open-file limit in the first place.