010-spark standalone模式Scala版本WordCount代码
标签: 010-spark standalone模式Scala版本WordCount代码 博客 51CTO博客
2023-07-20 18:24:17 190浏览
1、服务器部署架构

/home/hadoop/app/spark/conf/spark-env.sh 的配置文件:
#!/usr/bin/env bash
SPARK_MASTER_IP=mycluster
export JAVA_HOME=/home/hadoop/app/jdk1.7.0_76
export SCALA_HOME=/home/hadoop/app/scala
export HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0
export HIVE_HOME=/home/hadoop/app/hive
export SPARK_CLASSPATH=$HIVE_HOME/lib/mysql-connector-java-5.1.28.jar
slaves配置文件的内容
# A Spark Worker will be started on each of the machines listed below.
192.168.2.20
192.168.2.33
2、统计单词Scala程序
2.1 引入Spark对应的jar包,添加pom.xml 中
<?xml version= "1.0" encoding ="UTF-8"?>
<project xmlns= "http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" >
<modelVersion >4.0.0 </modelVersion>
<groupId >testSpark </groupId>
<artifactId >testSpark </artifactId>
<version >1.0-SNAPSHOT </version>
<repositories >
<repository>
<id> Akka repository </id>
<url> http://repo.akka.io/releases</url >
</repository>
<repository>
<id> cloudera</id >
<url> https://repository.cloudera.com/artifactory/cloudera-repos/. </url>
</repository>
<repository>
<id> jboss</id >
<url> http://repository.jboss.org/nexus/content/groups/public-jboss </url>
</repository>
<repository>
<id> Sonatype snapshots </id>
<url> http://oss.sonatype.org/content/repositories/snapshots/</url >
</repository>
</repositories >
<build >
<sourceDirectory> src/spark/</sourceDirectory >
<testSourceDirectory> src/test/</testSourceDirectory >
<!-- 添加pluginManagement:解决 Maven报Plugin execution not covered by lifecycle configuration
具体详见:
-->
<pluginManagement>
<plugins>
<plugin>
<groupId> org.scala-tools</groupId >
<artifactId> maven- scala-plugin</artifactId >
<executions>
<execution>
<goals>
<goal> compile</ goal>
<goal> testCompile</goal >
</goals>
</execution>
</executions>
<configuration>
<scalaVersion> 2.10.5</ scalaVersion>
</configuration>
</plugin>
<plugin>
<groupId> org.apache.maven.plugins</groupId >
<artifactId> maven-shade- plugin</artifactId>
<version> 2.2</ version>
<executions>
<execution>
<phase> package</ phase>
<goals>
<goal> shade</ goal>
</goals>
<configuration>
<filters>
<filter>
<artifact> *:*</ artifact>
<excludes>
<exclude> *</ exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer" >
<resource> reference.conf</resource >
</transformer>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer" >
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</pluginManagement>
</build >
<dependencies >
<!-- 解决 maven引用jdk中的tools.jar报Missing artifact的问题 ,指定maven去本地寻找 tools.jar -->
<dependency>
<groupId> jdk.tools</groupId >
<artifactId> jdk.tools</artifactId >
<version> 1.7</ version>
<scope> system</ scope>
<systemPath> ${JAVA_HOME}/lib/tools.jar </systemPath>
</dependency>
<dependency>
<groupId> org.apache.spark</groupId >
<artifactId> spark-core_2.10</artifactId >
<version> 1.2.0-cdh5.3.2</version >
</dependency>
<dependency>
<groupId> org.apache.hadoop</groupId >
<artifactId> hadoop-client </artifactId>
<version> 2.6.0-cdh5.3.6</version >
</dependency>
<dependency>
<groupId> org.apache.spark</groupId >
<artifactId> spark-streaming_2.10</artifactId >
<version> 0.9.0-cdh5.0.0</version >
</dependency>
<dependency>
<groupId> org.apache.spark</groupId >
<artifactId> spark-yarn_2.10</artifactId >
<version> 1.2.0-cdh5.3.3</version >
</dependency>
<dependency>
<groupId> org.apache.spark</groupId >
<artifactId> spark-tools_2.10</artifactId >
<version> 0.9.0-cdh5.0.0</version >
</dependency>
</dependencies >
</project>
注意:
(1)使用CDH集成的相关包,其中版本号位2.10 版本,故这里的scala使用2.10.x系列,若果使用高的版本,会出现无法编译。
(2)当网络很慢比较的情况下,会出现很多的错误,无论怎么样,请耐心等待
(3)org.apache.spark 下的包我们使用的是2.10,这里我们使用scala版本2.10 相对应。
(4)< pluginManagement > 解决插件报错问题
2.2 开发Spark的WordCount程序
Eclipse版本:4.4.2
scala版本:2.10.5。这里尝试使用过2.11.x,出现cdh集成的spar相关的包是无发编译
JDK版本:1.7.0
新建项目步骤: 新建scala project->拷贝pom.xml 到scala project中->转成maven项目

2.3 导出jar包
(1) 引入spark-assembly-1.4.0-hadoop2.6.0.jar

WordCountSA”,选择“Export”,并在弹出框中选择“Java” –> “JAR File”,进而将该程序编译成jar包,可以起名为“spark-wordcount-in-scala.jar”,我导出的jar包下载地址是 spark-wordcount-in-scala.jar
参考: http://dongxicheng.org/framework-on-yarn/spark-eclipse-ide/
3、spark standalone 模式执行wordCount
3.1 启动hdfs
start-dfs.sh
3.2 启动spark
cd $SPARK_HOME/sbin
start-all.sh
cd $SPARK_HOME/bin
spark-shell
3.3 执行wordCount
[hadoop@mycluster spark]$ ./bin/spark-submit --class spark.WordCountSA /home/hadoop/testSpark-1.0-SNAPSHOT.jar hdfs://mycluster:9000/wc.txt hdfs://mycluster:9000/output2
3.4 验证结果
[hadoop@mycluster ~]$ hdfs dfs -cat /output2/part*
(hello,4)
(you,2)
(china,1)
(me,1)
3.5 登录spark ui 任务页面
http://192.168.2.20:8080/

4、spark在提交任务时,出现如下错误
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory 15/03/26 22:29:51 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory 15/03/26 22:30:06 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory 15/03/26 22:30:21 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
从警告信息上看,初始化job时没有获取到任何资源;提示检查集群,确保workers可以被注册并有足够的内存资源。
如上问题产生的原因是多方面的,可能原因如下:
1.因为提交任务的节点不能和spark工作节点交互,因为提交完任务后提交任务节点上会起一个进程,展示任务进度,大多端口为4044,工作节点需要反馈进度给该该端口,所以如果主机名或者IP在hosts中配置不正确。所以检查下主机名和ip是否配置正确
2.有可能是内存不足
检查内存
conf.set("spark.executor.memory", "3000m")
Make sure to set SPARK_LOCAL_IP andSPARK_MASTER_IP.
workers保持Alive状态,确保 some cores 是可利用的。
好博客就要一起分享哦!分享海报
此处可发布评论
评论(0)展开评论
展开评论
