++wythern++

[Collection] Spark partition-related things.

Partition:
Understanding:
1. http://www.dezyre.com/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297
2. http://dev.sortable.com/spark-repartition/ -- an example of using partition & repartition to avoid data imbalance.
3. http://acadgild.com/blog/partitioning-in-spark/ -- a real case using an existing partitioner & a self-created (custom) partitioner.
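
To make the data-imbalance point concrete, here is a minimal sketch in plain Python (no Spark needed). It only models the idea: `hash_partition` and `round_robin_repartition` are hypothetical helper names, the integer keys and the skewed data are made up, and round-robin placement is just a rough stand-in for what a repartition does.

```python
def hash_partition(records, num_partitions):
    """Place each (key, value) record in partition key % num_partitions."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[key % num_partitions].append((key, value))
    return parts


def round_robin_repartition(parts, num_partitions):
    """Spread records evenly, ignoring keys (rough model of a repartition)."""
    out = [[] for _ in range(num_partitions)]
    flat = [record for part in parts for record in part]
    for i, record in enumerate(flat):
        out[i % num_partitions].append(record)
    return out


# Hypothetical skewed data: key 0 appears 9 times, keys 1..3 once each.
records = [(0, "x")] * 9 + [(1, "x"), (2, "x"), (3, "x")]

by_key = hash_partition(records, 4)
balanced = round_robin_repartition(by_key, 4)

sizes_by_key = [len(p) for p in by_key]      # one hot partition
sizes_balanced = [len(p) for p in balanced]  # evenly sized partitions
```

Note the trade-off in this model: after the even redistribution, records with the same key are no longer guaranteed to sit in the same partition, so a later per-key operation would have to shuffle again.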

Programming guidance:
Avoid using groupByKey; prefer reduceByKey: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
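
The reason behind this advice can be sketched without a cluster. The plain-Python model below (not Spark code; both function names and the sample partitions are hypothetical) counts how many records would cross the shuffle when values are summed per key. groupByKey ships every record, while reduceByKey combines within each partition first.

```python
from collections import defaultdict


def shuffle_group_by_key(partitions):
    """Model groupByKey: every (key, value) record crosses the shuffle."""
    shuffled = [record for part in partitions for record in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def shuffle_reduce_by_key(partitions):
    """Model reduceByKey: combine inside each partition, then shuffle."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        # Only one record per distinct key leaves this partition.
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


# Hypothetical input: two map-side partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1), ("c", 1)],
]

grouped_totals, grouped_records = shuffle_group_by_key(partitions)
reduced_totals, reduced_records = shuffle_reduce_by_key(partitions)
```

Both paths give the same per-key sums; only the number of shuffled records differs (8 versus 5 here). In PySpark the corresponding call would be `rdd.reduceByKey(lambda a, b: a + b)`.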

Reference 1 says that some transformations return RDDs with a specific partitioner. Operations on RDDs that hold on to and propagate a partitioner include:
  • join
  • leftOuterJoin
  • rightOuterJoin
  • groupByKey
  • reduceByKey
  • foldByKey
  • sort
  • partitionBy
groupByKey is one of them. My understanding is that such operations may cause an extra shuffle, but repartitioning can also help relieve data imbalance if it is well considered, so use your head, please! :)
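
Whether one of these operations actually pays for a shuffle depends on whether its input already carries a matching partitioner. The toy model below is a sketch of that rule only: `ToyRDD`, its `shuffles` counter, and this simplified `HashPartitioner` are hypothetical stand-ins written in plain Python, not Spark classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HashPartitioner:
    """Toy partitioner; two instances are equal if their sizes match."""
    num_partitions: int


@dataclass
class ToyRDD:
    records: list
    partitioner: Optional[HashPartitioner] = None
    shuffles: int = 0  # shuffles performed so far in this lineage

    def partition_by(self, partitioner):
        # Already partitioned the requested way: nothing to move.
        if self.partitioner == partitioner:
            return self
        return ToyRDD(self.records, partitioner, self.shuffles + 1)

    def reduce_by_key(self, func, partitioner):
        base = self.partition_by(partitioner)
        merged = {}
        for key, value in base.records:
            merged[key] = func(merged[key], value) if key in merged else value
        # The result keeps (propagates) the partitioner it was built with.
        return ToyRDD(sorted(merged.items()), partitioner, base.shuffles)


p = HashPartitioner(4)
pairs = ToyRDD([(1, 10), (2, 20), (1, 5)])  # hypothetical data, no partitioner

once = pairs.reduce_by_key(lambda a, b: a + b, p)
twice = once.reduce_by_key(lambda a, b: a + b, p)
```

In this model the first `reduce_by_key` shuffles once, and the second reuses the propagated partitioner and adds no shuffle. That is the case where holding on to a partitioner pays off.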

posted on 2017-05-18 14:29 wythern

