010-spark standalone模式Scala版本WordCount代码

奋斗吧
奋斗吧
擅长邻域:未填写

标签: 010-spark standalone模式Scala版本WordCount代码 博客 51CTO博客

2023-07-20 18:24:17 190浏览

010-spark standalone模式Scala版本WordCount代码,sparkstandalone模式WordCount代码计算,通过maven,scala2.10.5


1、服务器部署架构


010-spark standalone模式Scala版本WordCount代码_spark


/home/hadoop/app/spark/conf/spark-env.sh 的配置文件: 


#!/usr/bin/env bash
  
 SPARK_MASTER_IP=mycluster  
 export JAVA_HOME=/home/hadoop/app/jdk1.7.0_76  
 export SCALA_HOME=/home/hadoop/app/scala  
 export HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0  
 export HIVE_HOME=/home/hadoop/app/hive  
 export SPARK_CLASSPATH=$HIVE_HOME/lib/mysql-connector-java-5.1.28.jar



slaves配置文件的内容
  
 # A Spark Worker will be started on each of the machines listed below.  
 192.168.2.20  
 192.168.2.33


2、统计单词Scala程序


2.1 引入Spark对应的jar包,添加pom.xml 中


<?xml version= "1.0" encoding ="UTF-8"?>
<project xmlns= "http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" >
    <modelVersion >4.0.0 </modelVersion>

    <groupId >testSpark </groupId>
    <artifactId >testSpark </artifactId>
    <version >1.0-SNAPSHOT </version>

    <repositories >
        <repository>
            <id> Akka repository </id>
            <url> http://repo.akka.io/releases</url >
        </repository>
        <repository>
            <id> cloudera</id >
            <url> https://repository.cloudera.com/artifactory/cloudera-repos/. </url>
        </repository>
        <repository>
            <id> jboss</id >
            <url> http://repository.jboss.org/nexus/content/groups/public-jboss </url>
        </repository>
        <repository>
            <id> Sonatype snapshots </id>
            <url> http://oss.sonatype.org/content/repositories/snapshots/</url >
        </repository>
    </repositories >

    <build >
        <sourceDirectory> src/spark/</sourceDirectory >
        <testSourceDirectory> src/test/</testSourceDirectory >
          <!-- 添加pluginManagement:解决 Maven报Plugin execution not covered by lifecycle configuration
         具体详见: 
          -->
          <pluginManagement>
            <plugins>
                <plugin>
                    <groupId> org.scala-tools</groupId >
                    <artifactId> maven- scala-plugin</artifactId >
                    <executions>
                        <execution>
                            <goals>
                                <goal> compile</ goal>
                                <goal> testCompile</goal >
                            </goals>
                        </execution>
                    </executions>
                    <configuration>
                        <scalaVersion> 2.10.5</ scalaVersion>
                    </configuration>
                </plugin>
    
                <plugin>
                    <groupId> org.apache.maven.plugins</groupId >
                    <artifactId> maven-shade- plugin</artifactId>
                    <version> 2.2</ version>
                    <executions>
                        <execution>
                            <phase> package</ phase>
                            <goals>
                                <goal> shade</ goal>
                            </goals>
                            <configuration>
                                <filters>
                                    <filter>
                                        <artifact> *:*</ artifact>
                                        <excludes>
                                            <exclude> *</ exclude>
                                        </excludes>
                                    </filter>
                                </filters>
                                <transformers>
    
                                    <transformer
                                            implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer" >
                                        <resource> reference.conf</resource >
                                    </transformer>
    
                                    <transformer
                                             implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer" >
                                    </transformer>
                                </transformers>
                            </configuration>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
          </pluginManagement>
    </build >

    <dependencies >
          <!-- 解决 maven引用jdk中的tools.jar报Missing artifact的问题 ,指定maven去本地寻找 tools.jar -->
          <dependency>
            <groupId> jdk.tools</groupId >
            <artifactId> jdk.tools</artifactId >
            <version> 1.7</ version>
            <scope> system</ scope>
            <systemPath> ${JAVA_HOME}/lib/tools.jar </systemPath>
        </dependency>
        <dependency>
            <groupId> org.apache.spark</groupId >
            <artifactId> spark-core_2.10</artifactId >
            <version> 1.2.0-cdh5.3.2</version >
        </dependency>
        <dependency>
            <groupId> org.apache.hadoop</groupId >
            <artifactId> hadoop-client </artifactId>
            <version> 2.6.0-cdh5.3.6</version >
        </dependency>
        <dependency>
            <groupId> org.apache.spark</groupId >
            <artifactId> spark-streaming_2.10</artifactId >
            <version> 0.9.0-cdh5.0.0</version >
        </dependency>
        <dependency>
            <groupId> org.apache.spark</groupId >
            <artifactId> spark-yarn_2.10</artifactId >
            <version> 1.2.0-cdh5.3.3</version >
        </dependency>
        <dependency>
            <groupId> org.apache.spark</groupId >
            <artifactId> spark-tools_2.10</artifactId >
            <version> 0.9.0-cdh5.0.0</version >
        </dependency>
     
    </dependencies >

</project>








注意: 


(1)使用CDH集成的相关包,其中版本号位2.10 版本,故这里的scala使用2.10.x系列,若果使用高的版本,会出现无法编译。


(2)当网络很慢比较的情况下,会出现很多的错误,无论怎么样,请耐心等待


(3)org.apache.spark 下的包我们使用的是2.10,这里我们使用scala版本2.10 相对应。


(4)< pluginManagement > 解决插件报错问题




2.2 开发Spark的WordCount程序



Eclipse版本:4.4.2 


scala版本:2.10.5。这里尝试使用过2.11.x,出现cdh集成的spar相关的包是无发编译


JDK版本:1.7.0 


新建项目步骤: 新建scala project->拷贝pom.xml 到scala project中->转成maven项目






010-spark standalone模式Scala版本WordCount代码_apache_02







2.3 导出jar包


(1) 引入spark-assembly-1.4.0-hadoop2.6.0.jar 


010-spark standalone模式Scala版本WordCount代码_apache_03

WordCountSA”,选择“Export”,并在弹出框中选择“Java” –> “JAR File”,进而将该程序编译成jar包,可以起名为“spark-wordcount-in-scala.jar”,我导出的jar包下载地址是  spark-wordcount-in-scala.jar



参考: http://dongxicheng.org/framework-on-yarn/spark-eclipse-ide/




3、spark standalone 模式执行wordCount


3.1 启动hdfs


start-dfs.sh 



3.2 启动spark


cd $SPARK_HOME/sbin


start-all.sh



cd $SPARK_HOME/bin


spark-shell



3.3 执行wordCount



[hadoop@mycluster spark]$ ./bin/spark-submit --class spark.WordCountSA /home/hadoop/testSpark-1.0-SNAPSHOT.jar hdfs://mycluster:9000/wc.txt hdfs://mycluster:9000/output2



3.4 验证结果


[hadoop@mycluster ~]$ hdfs dfs -cat /output2/part*
(hello,4)
(you,2)
(china,1)
(me,1)


 



3.5 登录spark ui 任务页面


http://192.168.2.20:8080/




010-spark standalone模式Scala版本WordCount代码_apache_04




4、spark在提交任务时,出现如下错误


WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory 15/03/26 22:29:51 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory 15/03/26 22:30:06 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory 15/03/26 22:30:21 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory




从警告信息上看,初始化job时没有获取到任何资源;提示检查集群,确保workers可以被注册并有足够的内存资源。

如上问题产生的原因是多方面的,可能原因如下:

1.因为提交任务的节点不能和spark工作节点交互,因为提交完任务后提交任务节点上会起一个进程,展示任务进度,大多端口为4044,工作节点需要反馈进度给该该端口,所以如果主机名或者IP在hosts中配置不正确。所以检查下主机名和ip是否配置正确

2.有可能是内存不足

  检查内存

conf.set("spark.executor.memory", "3000m")

Make sure to set SPARK_LOCAL_IP andSPARK_MASTER_IP.

workers保持Alive状态,确保 some cores 是可利用的。





好博客就要一起分享哦!分享海报

此处可发布评论

评论(0展开评论

暂无评论,快来写一下吧

展开评论

客服QQ 1913284695