Week1 - Tutorial
RDD Object
• parallelize() distributes a local Python collection to form an RDD.
It converts a Python collection (list, set) into an RDD.
• textFile() reads a text file and returns it as an RDD of strings.
It reads a .txt file and converts it into an RDD.
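As a rough mental model that needs no Spark installation, the sketch below mimics in plain Python what the two constructors produce. The helper names split_into_partitions and read_lines are made up for illustration and are not part of PySpark, and the partition boundaries Spark actually chooses may differ.

```python
import os
import tempfile

def split_into_partitions(collection, num_slices):
    """Hypothetical stand-in for sc.parallelize(collection, num_slices):
    the elements are kept as they are and spread over num_slices chunks."""
    items = list(collection)
    size, extra = divmod(len(items), num_slices)
    partitions, start = [], 0
    for i in range(num_slices):
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions

def read_lines(path):
    """Hypothetical stand-in for sc.textFile(path): one string per line,
    without the trailing newline."""
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()

partitions = split_into_partitions([1, 2, 3, 4, 5], 2)
print(partitions)  # [[1, 2, 3], [4, 5]]

# Write a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("first line\nsecond line\n")
lines = read_lines(tmp.name)
os.remove(tmp.name)
print(lines)  # ['first line', 'second line']
```

In PySpark itself the corresponding calls would be sc.parallelize([1, 2, 3, 4, 5], 2) and sc.textFile(path), where sc is an existing SparkContext.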
RDD Operations
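• map() applies a function to each element of the RDD.
The function given in the parentheses is applied to every element of the RDD.
• flatMap() first applies a function to each element of an RDD and then flattens the results.
Each element is processed first, and the results are then flattened.
• mapValues() requires that each element in the RDD has a key/value pair structure. It applies
a function to each element's value. The element's key remains unchanged.
Every element of the RDD must have a key and a value associated with that key; the function acts on the value and leaves the key alone.
• flatMapValues() operates only on the values, not on the keys, and flattens the results.
• reduce() is used for aggregation; a common use is adding up all the elements.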
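The five operations above can be pictured with ordinary Python lists. This is only a sketch of the semantics, not Spark code: nothing here is distributed, the sample data is invented, and the PySpark call each line imitates is given in a comment on the assumption that rdd holds the same elements.

```python
from functools import reduce
from itertools import chain

lines = ["hello world", "hi"]  # invented sample data

# map(): exactly one output element per input element.
# PySpark: rdd.map(lambda line: line.split(" "))
mapped = [line.split(" ") for line in lines]

# flatMap(): apply the function, then flatten the per-element results.
# PySpark: rdd.flatMap(lambda line: line.split(" "))
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

pairs = [("a", "x y"), ("b", "z")]  # invented key/value sample data

# mapValues(): the function sees only the value; the key is carried over.
# PySpark: rdd.mapValues(lambda v: v.upper())
values_mapped = [(k, v.upper()) for k, v in pairs]

# flatMapValues(): every item produced from a value is paired with its key.
# PySpark: rdd.flatMapValues(lambda v: v.split(" "))
values_flat_mapped = [(k, item) for k, v in pairs for item in v.split(" ")]

# reduce(): combine all elements with a two-argument function.
# PySpark: rdd.reduce(lambda a, b: a + b)
total = reduce(lambda a, b: a + b, [1, 2, 3, 4])

print(mapped)              # [['hello', 'world'], ['hi']]
print(flat_mapped)         # ['hello', 'world', 'hi']
print(values_mapped)       # [('a', 'X Y'), ('b', 'Z')]
print(values_flat_mapped)  # [('a', 'x'), ('a', 'y'), ('b', 'z')]
print(total)               # 10
```

The contrast between mapped and flat_mapped is the key point: map() keeps one nested list per line, while flatMap() yields the words themselves.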
Spark Operations
• Global variable (Accumulator): blank_lines = sc.accumulator(0) # Create an Accumulator[int] initialized to 0
• Shared variable (Broadcast Variable):
bad_words = {'crap', 'rascals', 'fuck'}
broadcast_bad_words = sc.broadcast(bad_words)
# Instead of: if word in bad_words:
if word in broadcast_bad_words.value:
    bad_words_count += 1
• Counting characters:
rdd_4 = rdd_3.reduceByKey(lambda a, b: a + b)
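To see what the reduceByKey() call does, here is a plain-Python sketch of its semantics. The helper reduce_by_key is hypothetical and not part of PySpark, and the notes do not show how rdd_3 is built, so the (character, 1) pair shape used below is an assumption.

```python
def reduce_by_key(pairs, func):
    """Hypothetical plain-Python stand-in for rdd.reduceByKey(func):
    values that share a key are merged pairwise with func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    # Sorted only to make the printed result deterministic;
    # Spark does not promise any particular order.
    return sorted(merged.items())

# Assumed shape of rdd_3: one (character, 1) pair per character.
pairs_3 = [(ch, 1) for ch in "banana"]

counts_4 = reduce_by_key(pairs_3, lambda a, b: a + b)
print(counts_4)  # [('a', 3), ('b', 1), ('n', 2)]
```

Because the function only ever combines two values for the same key, it should be associative and commutative, as addition is here.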