Jerry Licun Pages

Week1 - Tutorial

RDD Object

• parallelize() distributes a local Python collection (such as a list or set) to form an RDD.
• textFile() reads a text file and returns it as an RDD of strings.
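
A rough illustration of what these two functions produce, written in plain Python so that it runs without a Spark installation. The helpers `split_into_partitions` and `read_lines` are hypothetical stand-ins, not Spark APIs. With a SparkContext `sc`, the corresponding calls would be `sc.parallelize(data, numSlices)` and `sc.textFile(path)`. The even, contiguous split used here is an assumption; Spark may assign elements to partitions differently.

```python
import os
import tempfile


def split_into_partitions(data, num_slices):
    """Hypothetical stand-in for sc.parallelize(data, num_slices):
    split a local collection into num_slices contiguous chunks."""
    items = list(data)
    n = len(items)
    return [items[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


def read_lines(path):
    """Hypothetical stand-in for sc.textFile(path):
    one string per line, without the trailing newline."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


partitions = split_into_partitions([1, 2, 3, 4, 5], 2)

# Write a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("hello spark\n\nhello rdd\n")
lines = read_lines(tmp.name)
os.remove(tmp.name)

print(partitions)  # the list split into two chunks
print(lines)       # one string per line of the file
```

Note that a blank line in the file becomes an empty string element, which is what makes counting blank lines possible later on.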

RDD Operations

• map() applies a function to each element of the RDD.
• flatMap() first applies a function to each element of the RDD and then flattens the results.
• mapValues() requires each element of the RDD to be a key/value pair. It applies a function to each value and leaves the key unchanged.
• flatMapValues() likewise operates only on the values, never the keys, and then flattens the results.
• reduce() aggregates the elements of the RDD with a function of two arguments; a common use is summing all the elements.
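
The differences between these operations are easiest to see side by side. The sketch below mimics each one with plain Python lists so it runs without Spark; the sample data and variable names are made up for illustration, and the PySpark call each line imitates is given in a comment.

```python
from functools import reduce

sentences = ["hello world", "hi"]
pairs = [("a", "x y"), ("b", "z")]
numbers = [1, 2, 3, 4]

# map: exactly one output element per input element.
# PySpark: rdd.map(lambda s: s.split(" "))
mapped = [s.split(" ") for s in sentences]

# flatMap: apply the function, then flatten the per-element results.
# PySpark: rdd.flatMap(lambda s: s.split(" "))
flat_mapped = [w for s in sentences for w in s.split(" ")]

# mapValues: transform only the value; the key is left unchanged.
# PySpark: rdd.mapValues(lambda v: v.upper())
map_values = [(k, v.upper()) for k, v in pairs]

# flatMapValues: transform only the value, then emit one (key, item)
# pair for every item the function returns.
# PySpark: rdd.flatMapValues(lambda v: v.split(" "))
flat_map_values = [(k, item) for k, v in pairs for item in v.split(" ")]

# reduce: combine all elements pairwise into a single result.
# PySpark: rdd.reduce(lambda a, b: a + b)
total = reduce(lambda a, b: a + b, numbers)

print(mapped)           # nested lists, one per sentence
print(flat_mapped)      # a single flat list of words
print(map_values)
print(flat_map_values)
print(total)
```

With map() the result keeps one list per sentence, while flatMap() yields a single flat list of words; the same contrast holds between mapValues() and flatMapValues().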

Spark Operations

• Global variable (accumulator):
blank_lines = sc.accumulator(0)  # Create an Accumulator[int] initialized to 0
• Shared variable (broadcast variable):
bad_words = {'crap', 'rascals', 'fuck'}
broadcast_bad_words = sc.broadcast(bad_words)
# Read the set through the broadcast variable instead of the local one:
# if word in bad_words:
if word in broadcast_bad_words.value:
    bad_words_count += 1
• Counting characters (summing the values for each key):
rdd_4 = rdd_3.reduceByKey(lambda a, b: a + b)
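
To make the roles of these three pieces concrete, here is a minimal plain-Python sketch that runs without Spark. It assumes that `rdd_3` holds (character, 1) pairs, which the notes do not state; `reduce_by_key` is a hypothetical stand-in for `rdd.reduceByKey(func)`, and the sample lines are invented. In real PySpark the counter would be an accumulator updated with `blank_lines.add(1)` and the set would be read through `broadcast_bad_words.value`.

```python
lines = ["oh crap", "", "you rascals", ""]

# Plays the role of broadcast_bad_words.value: read-only data shared by all tasks.
bad_words = {"crap", "rascals"}

# Plays the role of sc.accumulator(0): a counter that tasks only add to.
blank_lines = 0
bad_words_count = 0

for line in lines:
    if line == "":
        blank_lines += 1          # PySpark: blank_lines.add(1)
    for word in line.split():
        if word in bad_words:     # PySpark: word in broadcast_bad_words.value
            bad_words_count += 1


def reduce_by_key(pairs, func):
    """Hypothetical stand-in for rdd.reduceByKey(func):
    merge all values that share a key using func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Assumed shape of rdd_3: one (character, 1) pair per character.
char_pairs = [(ch, 1) for ch in "hello"]
char_counts = reduce_by_key(char_pairs, lambda a, b: a + b)

print(blank_lines, bad_words_count)
print(char_counts)
```

The plain loop hides the reason these variables exist in Spark: tasks run on separate workers, so an ordinary Python counter would not be updated on the driver, and a large lookup set would otherwise be shipped with every task.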