010-spark standalone模式Scala版本WordCount代码

奋斗吧

擅长邻域：未填写

标签： 010-spark standalone模式Scala版本WordCount代码博客 51CTO博客

2023-07-20 18:24:17 244浏览

010-spark standalone模式Scala版本WordCount代码，sparkstandalone模式WordCount代码计算，通过maven，scala2.10.5

1、服务器部署架构

010-spark standalone模式Scala版本WordCount代码_spark

/home/hadoop/app/spark/conf/spark-env.sh 的配置文件：

#!/usr/bin/env bash
  
 SPARK_MASTER_IP=mycluster  
 export JAVA_HOME=/home/hadoop/app/jdk1.7.0_76  
 export SCALA_HOME=/home/hadoop/app/scala  
 export HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0  
 export HIVE_HOME=/home/hadoop/app/hive  
 export SPARK_CLASSPATH=$HIVE_HOME/lib/mysql-connector-java-5.1.28.jar

slaves配置文件的内容
  
 # A Spark Worker will be started on each of the machines listed below.  
 192.168.2.20  
 192.168.2.33

2、统计单词Scala程序

2.1 引入Spark对应的jar包，添加pom.xml 中

<?xml version= "1.0" encoding ="UTF-8"?>
<project xmlns= "http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" >
    <modelVersion >4.0.0 </modelVersion>

    <groupId >testSpark </groupId>
    <artifactId >testSpark </artifactId>
    <version >1.0-SNAPSHOT </version>

    <repositories >
        <repository>
            <id> Akka repository </id>
            <url> http://repo.akka.io/releases</url >
        </repository>
        <repository>
            <id> cloudera</id >
            <url> https://repository.cloudera.com/artifactory/cloudera-repos/. </url>
        </repository>
        <repository>
            <id> jboss</id >
            <url> http://repository.jboss.org/nexus/content/groups/public-jboss </url>
        </repository>
        <repository>
            <id> Sonatype snapshots </id>
            <url> http://oss.sonatype.org/content/repositories/snapshots/</url >
        </repository>
    </repositories >

    <build >
        <sourceDirectory> src/spark/</sourceDirectory >
        <testSourceDirectory> src/test/</testSourceDirectory >
          <!-- 添加pluginManagement：解决 Maven报Plugin execution not covered by lifecycle configuration
         具体详见： 
          -->
          <pluginManagement>
            <plugins>
                <plugin>
                    <groupId> org.scala-tools</groupId >
                    <artifactId> maven- scala-plugin</artifactId >
                    <executions>
                        <execution>
                            <goals>
                                <goal> compile</ goal>
                                <goal> testCompile</goal >
                            </goals>
                        </execution>
                    </executions>
                    <configuration>
                        <scalaVersion> 2.10.5</ scalaVersion>
                    </configuration>
                </plugin>
    
                <plugin>
                    <groupId> org.apache.maven.plugins</groupId >
                    <artifactId> maven-shade- plugin</artifactId>
                    <version> 2.2</ version>
                    <executions>
                        <execution>
                            <phase> package</ phase>
                            <goals>
                                <goal> shade</ goal>
                            </goals>
                            <configuration>
                                <filters>
                                    <filter>
                                        <artifact> *:*</ artifact>
                                        <excludes>
                                            <exclude> *</ exclude>
                                        </excludes>
                                    </filter>
                                </filters>
                                <transformers>
    
                                    <transformer
                                            implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer" >
                                        <resource> reference.conf</resource >
                                    </transformer>
    
                                    <transformer
                                             implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer" >
                                    </transformer>
                                </transformers>
                            </configuration>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
          </pluginManagement>
    </build >

    <dependencies >
          <!-- 解决 maven引用jdk中的tools.jar报Missing artifact的问题 ,指定maven去本地寻找 tools.jar -->
          <dependency>
            <groupId> jdk.tools</groupId >
            <artifactId> jdk.tools</artifactId >
            <version> 1.7</ version>
            <scope> system</ scope>
            <systemPath> ${JAVA_HOME}/lib/tools.jar </systemPath>
        </dependency>
        <dependency>
            <groupId> org.apache.spark</groupId >
            <artifactId> spark-core_2.10</artifactId >
            <version> 1.2.0-cdh5.3.2</version >
        </dependency>
        <dependency>
            <groupId> org.apache.hadoop</groupId >
            <artifactId> hadoop-client </artifactId>
            <version> 2.6.0-cdh5.3.6</version >
        </dependency>
        <dependency>
            <groupId> org.apache.spark</groupId >
            <artifactId> spark-streaming_2.10</artifactId >
            <version> 0.9.0-cdh5.0.0</version >
        </dependency>
        <dependency>
            <groupId> org.apache.spark</groupId >
            <artifactId> spark-yarn_2.10</artifactId >
            <version> 1.2.0-cdh5.3.3</version >
        </dependency>
        <dependency>
            <groupId> org.apache.spark</groupId >
            <artifactId> spark-tools_2.10</artifactId >
            <version> 0.9.0-cdh5.0.0</version >
        </dependency>
     
    </dependencies >

</project>

注意：

（1）使用CDH集成的相关包，其中版本号位2.10 版本，故这里的scala使用2.10.x系列，若果使用高的版本，会出现无法编译。

（2）当网络很慢比较的情况下，会出现很多的错误，无论怎么样，请耐心等待

（3）org.apache.spark 下的包我们使用的是2.10，这里我们使用scala版本2.10 相对应。

（4）< pluginManagement > 解决插件报错问题

2.2 开发Spark的WordCount程序

Eclipse版本：4.4.2

scala版本：2.10.5。这里尝试使用过2.11.x，出现cdh集成的spar相关的包是无发编译

JDK版本：1.7.0

新建项目步骤：新建scala project->拷贝pom.xml 到scala project中->转成maven项目

010-spark standalone模式Scala版本WordCount代码_apache_02

2.3 导出jar包

（1）引入spark-assembly-1.4.0-hadoop2.6.0.jar

010-spark standalone模式Scala版本WordCount代码_apache_03

WordCountSA”，选择“Export”，并在弹出框中选择“Java” –> “JAR File”，进而将该程序编译成jar包，可以起名为“spark-wordcount-in-scala.jar”，我导出的jar包下载地址是 spark-wordcount-in-scala.jar

参考： http://dongxicheng.org/framework-on-yarn/spark-eclipse-ide/

3、spark standalone 模式执行wordCount

3.1 启动hdfs

start-dfs.sh

3.2 启动spark

cd $SPARK_HOME/sbin

start-all.sh

cd $SPARK_HOME/bin

spark-shell

3.3 执行wordCount

[hadoop@mycluster spark]$ ./bin/spark-submit --class spark.WordCountSA /home/hadoop/testSpark-1.0-SNAPSHOT.jar hdfs://mycluster:9000/wc.txt hdfs://mycluster:9000/output2

3.4 验证结果

[hadoop@mycluster ~]$ hdfs dfs -cat /output2/part*
(hello,4)
(you,2)
(china,1)
(me,1)

3.5 登录spark ui 任务页面

http://192.168.2.20:8080/

010-spark standalone模式Scala版本WordCount代码_apache_04

4、spark在提交任务时，出现如下错误

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory 15/03/26 22:29:51 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory 15/03/26 22:30:06 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory 15/03/26 22:30:21 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

从警告信息上看，初始化job时没有获取到任何资源；提示检查集群，确保workers可以被注册并有足够的内存资源。

如上问题产生的原因是多方面的，可能原因如下：

1.因为提交任务的节点不能和spark工作节点交互，因为提交完任务后提交任务节点上会起一个进程，展示任务进度，大多端口为4044，工作节点需要反馈进度给该该端口，所以如果主机名或者IP在hosts中配置不正确。所以检查下主机名和ip是否配置正确

2.有可能是内存不足

检查内存

conf.set("spark.executor.memory", "3000m")

Make sure to set SPARK_LOCAL_IP andSPARK_MASTER_IP.

workers保持Alive状态，确保 some cores 是可利用的。

好博客就要一起分享哦！分享海报

此处可发布评论

评论（0）展开评论

暂无评论，快来写一下吧