[TreasureData] A Workflow for Importing Postal Code Data

December 3, 2022 / August 5, 2023 / TreasureData

Overview

Import the postal code CSV file published by Japan Post from S3 into TreasureData.

Processing steps

Import the CSV file into a temporary table
Run SQL against the temporary table
Store the SQL result in the destination table
Drop the temporary table
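
The dig file shown later in this post loads the CSV straight into the destination table. As a rough sketch only, the four steps above could be laid out in a dig file as follows. The tmp_table variable and the query file queries/transform_zip_code.sql are hypothetical names introduced for illustration; the query file would hold a SELECT against the temporary table.

```yaml
_export:
  td:
    dest_db: db_name
    dest_table: table_name
    tmp_table: tmp_table_name   # hypothetical name for the temporary table

# Create the temporary table, or empty it if it already exists
+prepare_tmp:
  td_ddl>:
  empty_tables: ["${td.tmp_table}"]
  database: ${td.dest_db}

# Step 1: import the CSV file into the temporary table
+load_tmp:
  td_load>: config/load_zip_code.yml
  database: ${td.dest_db}
  table: ${td.tmp_table}

# Steps 2 and 3: run SQL against the temporary table and
# write the result to the destination table (replaces it if it exists)
+transform:
  td>: queries/transform_zip_code.sql   # hypothetical query file
  database: ${td.dest_db}
  create_table: ${td.dest_table}

# Step 4: drop the temporary table
+cleanup:
  td_ddl>:
  drop_tables: ["${td.tmp_table}"]
  database: ${td.dest_db}
```

Using create_table replaces the destination table on every run. If rows should be appended instead, insert_into is the option to look at.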

CSV file

Placed under an AWS S3 bucket
Source
- https://www.post.japanpost.jp/zipcode/dl/oogaki-zip.html
- Use the nationwide bulk (全国一括) data
- The download is a zip archive, so extract it and upload the file as KEN_ALL.CSV

dig file

csv_import_test.dig
The file that defines the workflow

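yml file

csv_import_test.yml
The file that describes the input/output settings for the CSV file
Place it under the config directory, under the same name that td_load> in the dig file points to (the listing below uses config/load_zip_code.yml)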

Note

The CSV has no time column, which TreasureData tables require, so the yml file includes a setting that adds one at import time

Contents of the dig file

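timezone: Asia/Tokyo

# Run every day at 10:00 (Asia/Tokyo)
schedule:
  daily>: 10:00:00

# Variables shared by the tasks below; replace with your own names
_export:
  td:
    dest_db: db_name
    dest_table: table_name

# Create the destination database and table if they do not exist yet
+prepare_table:
  td_ddl>:
  create_databases: ["${td.dest_db}"]
  create_tables: ["${td.dest_table}"]
  database: ${td.dest_db}

# Load the CSV from S3 using the settings in the yml file
+load:
  td_load>: config/load_zip_code.yml
  database: ${td.dest_db}
  table: ${td.dest_table}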

Contents of the yml file

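in:
  type: s3
  # Credentials are read from the workflow's secrets
  access_key_id: ${secret:s3.access_key_id}
  secret_access_key: ${secret:s3.secret_access_key}
  bucket: bucket_name
  path_prefix: path/to/KEN_ALL.CSV
  use_modified_time: true
  incremental: true
  parser:
    # KEN_ALL.CSV is Shift_JIS-encoded, uses CRLF line endings, and has no header row
    charset: SJIS
    newline: CRLF
    type: csv
    skip_header_lines: 0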
    columns:
    - name: col_01
      type: string
    - name: col_02
      type: string
    - name: col_03
      type: string
    - name: col_04
      type: string
    - name: col_05
      type: string
    - name: col_06
      type: string
    - name: col_07
      type: string
    - name: col_08
      type: string
    - name: col_09
      type: string
    - name: col_10
      type: string
    - name: col_11
      type: string
    - name: col_12
      type: string
    - name: col_13
      type: string
    - name: col_14
      type: string
    - name: col_15
      type: string
# Add a time column, set to the time of the upload
filters:
  - type: add_time
    to_column:
      name: time
      type: timestamp
    from_value:
      mode: upload_time
out: {}
exec: {}
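
The 15 columns follow the record layout Japan Post documents for KEN_ALL.CSV. As a sketch, the generic col_01 to col_15 names could be swapped for descriptive ones so that queries are easier to read. The names below are hypothetical and chosen only for illustration; the order matches the layout of the file.

```yaml
    columns:
    - {name: jis_code, type: string}         # local government code
    - {name: old_zip_code, type: string}     # old 5-digit postal code
    - {name: zip_code, type: string}         # 7-digit postal code
    - {name: prefecture_kana, type: string}  # prefecture name (kana)
    - {name: city_kana, type: string}        # municipality name (kana)
    - {name: town_kana, type: string}        # town area name (kana)
    - {name: prefecture, type: string}       # prefecture name
    - {name: city, type: string}             # municipality name
    - {name: town, type: string}             # town area name
    - {name: multi_zip_flag, type: string}   # town area covered by two or more postal codes
    - {name: koaza_flag, type: string}       # lot numbers assigned per koaza
    - {name: chome_flag, type: string}       # town area has chome
    - {name: multi_town_flag, type: string}  # one postal code covers two or more town areas
    - {name: update_flag, type: string}      # update indicator
    - {name: change_reason, type: string}    # reason for the change
```

Keeping every column as string, as the original settings do, preserves leading zeros in the code columns.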


Posted by junichi