[Work Notes] Off-Site Data Conversion, the Low-Tech Way

I'm writing this down as a note to myself. My computer died a while ago and I lost all my data, so I'm keeping notes just in case.

First, export the data on the side that is being backed up or converted. Plain XML is good enough as the output format. For example:

<?xml version="1.0" encoding="UTF-8"?>
<item>
        <response>1</response>
        <id>23200</id>
        <title>新航線</title>
        <content><b>6月1日長榮航空即將首航日本九州東南部的宮崎縣，對於這個以日本開國神話以及陽光海洋著稱的美麗所在，將成為今年初夏最有魅力的旅遊新航線。</b><br /><br /><!pic1><br />宮崎還有一樣推動觀光的秘密武器，就是去年上任的<b>宮崎縣長－東&#22255;原  英夫</b>，原來是日本普受歡迎的搞笑藝人－&#12381;&#12398;&#12414;&#12435;&#12414;東，選上縣長之後就把藝名封存，將全副熱情投入縣政推動，更把焦點鎖定觀光旅遊的推廣，在宮崎到處都可看到名人縣長的Q版肖像，向來訪的觀光客推銷宮崎名物與觀光景點，使得這位名人縣長也成為宮崎的熱門景觀之一。<br /><br /><b>熱帶風情洋溢的宮崎料理</b><br /><br />位於南九州的宮崎縣，有幽邃神秘的高千&#31298;峽，也有明朗亮麗的陽光碧海，更有誘人的美味料理，讓來訪的味蕾都鼓舌稱快。<br /><br /><b>超爽口南蠻燒炸雞塊</b><br /><br /><!pic2>到宮崎一定要品嚐<b>「南蠻燒」</b>，尤其南蠻燒炸雞塊最為經典。</content>
        <pdate>2008-04-02 19:34:00</pdate>
</item>
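
On the exporting side, a document like the one above could be produced along these lines. This is only a minimal sketch: `buildItemXml` and the `$row` array keys are names made up for illustration, not the original xindex.php, and it assumes every field is escaped with htmlspecialchars so that HTML inside a field cannot break the XML.

```php
<?php
// Hypothetical helper (not from the original xindex.php):
// turn one row into an <item> document.
// Assumes $row has the keys id, title, content and pdate.
function buildItemXml(array $row) {
    // Escape &, <, > and quotes so HTML inside a field stays plain text in the XML
    $esc = function ($value) {
        return htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
    };
    return '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
        . "<item>\n"
        . "        <response>1</response>\n"
        . "        <id>" . $esc($row['id']) . "</id>\n"
        . "        <title>" . $esc($row['title']) . "</title>\n"
        . "        <content>" . $esc($row['content']) . "</content>\n"
        . "        <pdate>" . $esc($row['pdate']) . "</pdate>\n"
        . "</item>\n";
}
```

The exporting page would then only need to look up the row for the requested id and echo the result of buildItemXml($row).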

Next, prepare the receiving end. In a PHP environment, the curl module can be used to read the data from the remote side.

function getRemoteWithCurl($url) {
    // Fetch $url with the curl module; returns the response body, or false on failure
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    // Read the error state before closing, so the handle is always released
    $failed = curl_errno($ch) !== 0;
    curl_close($ch);
    return $failed ? false : $response;
}

Then, assuming the blog that emits the data above is at http://192.168.1.140/xindex.php?id=23200, the program looks like this:

$url = "http://192.168.1.140/xindex.php?id=23200";

$response = getRemoteWithCurl($url);
$xml = new XMLTree($response, "utf-8");
if($xml->getValue("/item/response")==1) {
    $post = array();
    /* strip the stray spaces around entities, tags and quotes */
    $patternStr = array(' & ',' & #','; ',' < ','< ',' > ',' >', '" ', ' "');
    $replaceStr = array('&','&#',';','<','<','>','>','"', '"');
    /* replace some html tags with their newer equivalents */
    $sourceHtml = array('<b>','</b>','<p>','</p>');
    $targetHtml = array('<strong>','</strong>','<div>','</div>');
    $post['id'] = $xml->getText("/item/id");
    $title = $xml->getText("/item/title");
    $post['title'] = strip_tags(str_replace($patternStr,$replaceStr, $title));
    
    $content = htmlspecialchars_decode($xml->getText("/item/content"), ENT_COMPAT);
    $content = str_replace($patternStr,$replaceStr, $content);
    $content = str_replace($sourceHtml,$targetHtml, $content);
    $post['content'] = htmlUnicode2Utf8($content);
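    /* split "YYYY-MM-DD HH:MM:SS" into its parts; explode() replaces split(), which was removed in PHP 7 */
    $created = explode(" ", $xml->getText("/item/pdate"));
    $dateline = explode("-", $created[0]);
    $timeline = explode(":", $created[1]);
    $post['created'] = mktime((int)$timeline[0],(int)$timeline[1],(int)$timeline[2],(int)$dateline[1],(int)$dateline[2],(int)$dateline[0]);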

    /* dump to check */
    var_dump($post);
}
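
XMLTree is not one of PHP's built-in classes, so the code above depends on whatever library provides it. Where that class is not available, roughly the same fields can be read with PHP's bundled SimpleXML extension. The following is a minimal sketch under a few assumptions: SimpleXML is enabled, the feed escapes the HTML inside the content element, and `parseItemXml` is a made-up name for illustration. The whitespace and tag clean-up steps are left out.

```php
<?php
// Hypothetical helper: parse one exported <item> document into a post array.
// Returns false when the XML is malformed or <response> is not 1.
function parseItemXml($xmlString) {
    // Collect parse errors internally instead of emitting PHP warnings
    libxml_use_internal_errors(true);
    $item = simplexml_load_string($xmlString);
    if ($item === false || (string) $item->response !== '1') {
        return false;
    }
    // pdate is expected in "YYYY-MM-DD HH:MM:SS" form, read in the default timezone
    $date = DateTime::createFromFormat('Y-m-d H:i:s', (string) $item->pdate);
    return array(
        'id'      => (string) $item->id,
        'title'   => strip_tags((string) $item->title),
        // SimpleXML already decodes &lt;, &gt;, &amp; and numeric references in text nodes
        'content' => (string) $item->content,
        'created' => $date ? $date->getTimestamp() : false,
    );
}
```

Because the XML parser decodes entities itself, this variant should not need htmlspecialchars_decode unless the exporter escapes the content twice.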

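One of the functions used there is htmlUnicode2Utf8. Its job is to turn HTML numeric character references such as &#12398; (の) into characters that we "humans" can read. I explained this in an earlier post, so I won't go over it again. It is built mainly on preg_replace_callback and needs roughly these functions: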

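/* Simple HEX to BIN (PHP 5.4+ ships its own hex2bin(), so define it only when missing) */
if (!function_exists('hex2bin')) {
    function hex2bin($data) {
        $len = strlen($data);
        return pack("H" . $len, $data);
    }
}
/* Convert a decimal UCS-2 code point to UTF-8 (code points up to 0xFFFF only) */
function ucs2toutf8($string) {
    /* pad to 4 hex digits so small code points still form one 2-byte unit */
    $hex = str_pad(base_convert($string, 10, 16), 4, "0", STR_PAD_LEFT);
    /* name the byte order explicitly; plain "ucs-2" is platform dependent */
    return iconv("UCS-2BE", "UTF-8", hex2bin($hex));
}
/* Callback: convert one matched HTML numeric reference to UTF-8 */
function htmlUnicodeToUtf8HEXConvert($m) {
    return ucs2toutf8($m[2]);
}
function htmlUnicode2Utf8($content) {
    return preg_replace_callback('/(&#)([0-9]+)(;)/', "htmlUnicodeToUtf8HEXConvert" , $content);
}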
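
On newer PHP versions the same conversion can probably be done without iconv. The sketch below is an alternative, not the method used in this post: it hands each decimal numeric reference to the built-in html_entity_decode, which also copes with code points above 0xFFFF, and leaves named entities such as &lt; untouched. `decodeNumericEntities` is a hypothetical name, and anonymous functions require PHP 5.3 or later.

```php
<?php
// Hypothetical alternative to htmlUnicode2Utf8: decode only decimal numeric
// references (&#NNNN;) to UTF-8 and leave everything else as it is.
function decodeNumericEntities($content) {
    return preg_replace_callback('/&#[0-9]+;/', function ($m) {
        // $m[0] is the whole reference, e.g. "&#12398;"
        return html_entity_decode($m[0], ENT_QUOTES, 'UTF-8');
    }, $content);
}
```

Calling decodeNumericEntities on the content field would replace the call to htmlUnicode2Utf8 in the import code.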

Finally, wrap all of this in a shell script, put it on the server, and let crontab run it.
