I wanted to improve my programming skills a bit.
At first I tried to build the data generator with Cygwin on Windows. Running the compileTpcds.sh found under /home/hadoop/flink-community/resource/tpcds complained that gcc and make could not be found, so I installed gcc from the Cygwin setup program. Once that finished I rushed back to the directory and ran the script again, only to hit errors.
The errors were as follows:
gcc -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -DYYDEBUG -DLINUX -g -Wall -c -o mkheader.o mkheader.c
In file included from mkheader.c:37:0:
porting.h:46:10: fatal error: values.h: No such file or directory
#include <values.h>
^~~~~~~~~~
compilation terminated.
make: *** [<builtin>: mkheader.o] Error 1
compileTpcds.sh: line 40: ./dsdgen: No such file or directory
compileTpcds.sh: line 41: ./dsqgen: No such file or directory
cp: cannot stat '/cygdrive/d/javaProgram/home/hadoop/flink-community-perf/resource/tpcds/tpc-ds-tool/tools/dsdgen': No such file or directory
cp: cannot stat '/cygdrive/d/javaProgram/home/hadoop/flink-community-perf/resource/tpcds/tpc-ds-tool/tools/tpcds.idx': No such file or directory
cp: cannot stat '/cygdrive/d/javaProgram/home/hadoop/flink-community-perf/resource/tpcds/tpc-ds-tool/tools/dsqgen': No such file or directory
cp: cannot stat '/cygdrive/d/javaProgram/home/hadoop/flink-community-perf/resource/tpcds/tpc-ds-tool/tools/tpcds.idx': No such file or directory
chmod: cannot access '/cygdrive/d/javaProgram/home/hadoop/flink-community-perf/resource/tpcds/querygen/dsqgen': No such file or directory
Compile SUCCESS...
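For what it's worth, values.h is a legacy System V header that glibc ships but Cygwin does not, and with -DLINUX the kit's porting.h apparently pulls it in. If one insisted on staying with Cygwin, a possible workaround (not attempted here, so purely a sketch) would be to patch porting.h to use limits.h and map whatever legacy macros the build then complains about, along the lines of:
sed -i 's|#include <values.h>|#include <limits.h>\n#define MAXINT INT_MAX|' porting.h
I didn't go down that road, though.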
Instead, I installed Docker Toolbox on my Windows machine and used the virtual Docker host inside it to run an Ubuntu environment.
Pulling Docker images is painfully slow, all the more so when the Docker host itself lives in a Windows VM. To speed things up, ssh into the boot2docker VM and run the following command:
sudo sed -i "s|EXTRA_ARGS='|EXTRA_ARGS='--registry-mirror=http://f1361db2.m.daocloud.io |g" /var/lib/boot2docker/profile
(http://f1361db2.m.daocloud.io can also be replaced with the mirror address you get from your own Aliyun account.)
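After the sed, the EXTRA_ARGS block in /var/lib/boot2docker/profile should look roughly like this (the exact remaining lines depend on the Toolbox version, and the mirror URL is just the example above):
EXTRA_ARGS='--registry-mirror=http://f1361db2.m.daocloud.io 
--label provider=virtualbox
'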
(Some people suggest instead creating /etc/docker/daemon.json inside the Docker host, putting {"registry-mirrors": ["https://registry.docker-cn.com"]} in it, and restarting the service to switch mirrors. I tried that, and, perhaps because of Toolbox, the physical machine could no longer start the docker daemon inside the VM afterwards.)
With the mirror configured, run docker-machine restart default on the physical machine to make it take effect. docker search gcc turns up the image rikorose/gcc-cmake; pull it with docker pull rikorose/gcc-cmake, then start a gcc container with docker run -itd -P rikorose/gcc-cmake (interactive, with a tty, detached, and with exposed ports published).
If the container has to be stopped along the way, it can be brought back later with docker start <container name>, followed by docker exec -it <container name> /bin/bash to get a shell inside it.
Files now need to be copied in, so run docker ps (or docker ps -a if the container is stopped):
docker@default:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c6b71328afa8 rikorose/gcc-cmake "bash" 7 days ago Up 5 hours serene_noyce
The container name is serene_noyce, so the next command is docker inspect -f '{{.Id}}' serene_noyce (the -f '{{.Id}}' part is a Go-template format string that prints only the Id field; without it, docker inspect dumps a huge blob of JSON):
docker@default:~$ docker inspect -f '{{.Id}}' serene_noyce
c6b71328afa828c9e4c62ac37dd9dede538c3999355189e44218290a2ae885d3
The next step is to copy the project sitting in the physical machine's Pictures folder into the container, using docker cp:
docker cp /c/Users/Administrator/Pictures/home c6b71328afa828c9e4c62ac37dd9dede538c3999355189e44218290a2ae885d3:/root
The format is docker cp <local path> <full container ID>:<path in container>. To go the other way and copy a file out of the container, just swap the source and destination arguments.
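For example, to pull the project back out of the container later, the direction simply flips (the destination path here is only an illustration):
docker cp c6b71328afa828c9e4c62ac37dd9dede538c3999355189e44218290a2ae885d3:/root/home /c/Users/Administrator/Pictures/home-copy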
With the project inside the container, I entered it, changed into the flink-community/resource/tpcds directory and ran compileTpcds.sh, which complained about yet another missing command: yacc
make: yacc: Command not found
Looks like apt-get is needed again. Fortunately apt can be pointed at a mirror as well, so the first step is to change the source list.
Since the image ships without vi, prepare the new sources.list in the boot2docker VM first:
deb http://mirrors.ustc.edu.cn/ubuntu/ xenial main restricted universe multiverse
deb http://mirrors.ustc.edu.cn/ubuntu/ xenial-security main restricted universe multiverse
deb http://mirrors.ustc.edu.cn/ubuntu/ xenial-updates main restricted universe multiverse
deb http://mirrors.ustc.edu.cn/ubuntu/ xenial-proposed main restricted universe multiverse
deb http://mirrors.ustc.edu.cn/ubuntu/ xenial-backports main restricted universe multiverse
deb-src http://mirrors.ustc.edu.cn/ubuntu/ xenial main restricted universe multiverse
deb-src http://mirrors.ustc.edu.cn/ubuntu/ xenial-security main restricted universe multiverse
deb-src http://mirrors.ustc.edu.cn/ubuntu/ xenial-updates main restricted universe multiverse
deb-src http://mirrors.ustc.edu.cn/ubuntu/ xenial-proposed main restricted universe multiverse
deb-src http://mirrors.ustc.edu.cn/ubuntu/ xenial-backports main restricted universe multiverse
Once the file is ready, back up the original inside the container with mv /etc/apt/sources.list /etc/apt/sources.list.bak, then from the boot2docker VM run
docker cp sources.list c6b71328afa828c9e4c62ac37dd9dede538c3999355189e44218290a2ae885d3:/etc/apt/sources.list
From the next apt-get update onwards, downloads will come from the new mirror.
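To double-check that the new list actually landed where apt expects it, a quick peek from the boot2docker VM (using the short container ID from docker ps) does the trick:
docker exec c6b71328afa8 head -n 3 /etc/apt/sources.list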
Running apt-get install bison (the command is called yacc but the package is bison; the reason is explained here) returned a "Package has no installation candidate" error. That suggested the apt index had never been refreshed, so I ran:
apt-get update
apt-get upgrade
apt-get install flex bison bc
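A quick sanity check that the required build tools are now all on the PATH (just a convenience, not part of the benchmark scripts):
command -v gcc make yacc bison flex bc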
After that, running compileTpcds.sh under flink-community/resource/tpcds again went through cleanly: compilation finished! (The build log is too long to paste here.)
Moving on to the datagen directory one level down and running generateTpcdsData.sh, it unexpectedly failed with: line 96, Syntax error: "(" unexpected. Yet line 96 of the script looks perfectly normal:
#!/bin/bash
##############################################################################
# TPC-DS data Generation
##############################################################################
export JAVA_HOME=/home/hadoop/java
# set data save path
targetPath=./data/
# set work threads ,initial value is 0
workThreads=0
# set data scale
if [ $# -lt 1 ]; then
    echo "[ERROR] Insufficient # of params"
    echo "USAGE: `dirname $0`/$0 <scaleFactor>"
    exit 127
fi
scaleFactor=$1
# random seed to build data,default value is 0
rngSeed=0
if [ $# -ge 2 ]; then
    rngSeed=$2
fi
### Check for target path
if [ -z $targetPath ]; then
    echo "[ERROR] HDFS target path was not configured"
    exit 127
fi
### Init
### in workload.lst, dimension table was configured parallel value 1,and fact table was configured bigger parallel
workFile=workloads/tpcds.workload.${scaleFactor}.lst
if [ ! -e $workFile ]; then
    echo "[INFO] generating Workload file: "$workFile
    echo "a call_center $((scaleFactor))" >>$workFile
    echo "b catalog_page $((scaleFactor))" >>$workFile
    echo "d catalog_sales $((scaleFactor))" >>$workFile
    echo "e customer_address $((scaleFactor))" >>$workFile
    echo "f customer $((scaleFactor))" >>$workFile
    echo "g customer_demographics $((scaleFactor))" >>$workFile
    echo "h date_dim $((scaleFactor))" >>$workFile
    echo "i household_demographics $((scaleFactor))" >>$workFile
    echo "j income_band $((scaleFactor))" >>$workFile
    echo "k inventory $((scaleFactor))" >>$workFile
    echo "l item $((scaleFactor))" >>$workFile
    echo "m promotion $((scaleFactor))" >>$workFile
    echo "n reason $((scaleFactor))" >>$workFile
    echo "o ship_mode $((scaleFactor))" >>$workFile
    echo "p store $((scaleFactor))" >>$workFile
    echo "r store_sales $((scaleFactor))" >>$workFile
    echo "s time_dim $((scaleFactor))" >>$workFile
    echo "t warehouse $((scaleFactor))" >>$workFile
    echo "u web_page $((scaleFactor))" >>$workFile
    echo "w web_sales $((scaleFactor))" >>$workFile
    echo "x web_site $((scaleFactor))" >>$workFile
fi
### Basic Params
echo "[INFO] Data will be generated locally on each node at a named pipe ./<tblName.tbl.<chunk#>"
echo "[INFO] Generated data will be streamingly copied to the cluster at "$targetHSDFPath
echo "[INFO] e.g. lineitem.tbl.10 --> /disk/1/tpcds/data/SF100/lineitem/lineitem.10.tbl"
#Clear existing workloads
rm -rf writeData.sh
#Check Dir on disk
targetPath=${targetPath}/SF${scaleFactor}
rm -rf ${targetPath}
mkdir -p ${targetPath}
### Init Workloads
fileName=writeData.sh
echo "#!/bin/bash" >> $fileName
echo " " >> $fileName
echo "ps -efww|grep dsdgen |grep -v grep|cut -c 9-15|xargs kill -9" >> $fileName
echo "ps -efww|grep FsShell |grep -v grep|cut -c 9-15|xargs kill -9" >> $fileName
echo "ps -efww|grep wait4process.sh |grep -v grep|cut -c 9-15|xargs kill -9" >> $fileName
echo "rm -rf *.dat" >> $fileName
echo " " >> $fileName
mkdir -p ${targetPath}/catalog_returns
mkdir -p ${targetPath}/store_returns
mkdir -p ${targetPath}/web_returns
### Generate Workloads
while read line; do
    params=( $line )
    #Extracting Parameters
    #echo ${params[*]}
    tblCode=${params[0]}
    tblName=${params[1]}
    tblParts=${params[2]}
    echo "====$tblName==="
    mkdir -p ${targetPath}/$tblName
    # Assigning workload in round-robin fashion
    partsDone=1
    while [ $partsDone -le $tblParts ]; do
        if [ $tblParts -gt 1 ]; then
            echo "rm -rf ./${tblName}_${partsDone}_${tblParts}.dat" >> writeData.sh
            echo "mkfifo ./${tblName}_${partsDone}_${tblParts}.dat" >> writeData.sh
            if [ "$tblName" = "catalog_sales" ]; then
                echo "rm -rf ./catalog_returns_${partsDone}_${tblParts}.dat" >> writeData.sh
                echo "mkfifo ./catalog_returns_${partsDone}_${tblParts}.dat" >> writeData.sh
            fi
            if [ "$tblName" = "store_sales" ]; then
                echo "rm -rf ./store_returns_${partsDone}_${tblParts}.dat" >> writeData.sh
                echo "mkfifo ./store_returns_${partsDone}_${tblParts}.dat" >> writeData.sh
            fi
            if [ "$tblName" = "web_sales" ]; then
                echo "rm -rf ./web_returns_${partsDone}_${tblParts}.dat" >> writeData.sh
                echo "mkfifo ./web_returns_${partsDone}_${tblParts}.dat" >> writeData.sh
            fi
            echo "./dsdgen -SCALE $scaleFactor -TABLE $tblName -CHILD $partsDone -PARALLEL $tblParts -FORCE Y -RNGSEED $rngSeed &" >> writeData.sh
            echo "./copyAndDelete.sh ./${tblName}_${partsDone}_${tblParts}.dat ${targetPath}/$tblName &" >> writeData.sh
            if [ "$tblName" = "catalog_sales" ]; then
                echo "./copyAndDelete.sh ./catalog_returns_${partsDone}_${tblParts}.dat ${targetPath}/catalog_returns &" >> writeData.sh
            fi
            if [ "$tblName" = "store_sales" ]; then
                echo "./copyAndDelete.sh ./store_returns_${partsDone}_${tblParts}.dat ${targetPath}/store_returns &" >> writeData.sh
            fi
            if [ "$tblName" = "web_sales" ]; then
                echo "./copyAndDelete.sh ./web_returns_${partsDone}_${tblParts}.dat ${targetPath}/web_returns &" >> writeData.sh
            fi
        else
            echo "rm -rf ./${tblName}.dat" >> writeData.sh
            echo "mkfifo ./${tblName}.dat" >> writeData.sh
            if [ "$tblName" = "catalog_sales" ]; then
                echo "rm -rf ./catalog_returns.dat" >> writeData.sh
                echo "mkfifo ./catalog_returns.dat" >> writeData.sh
            fi
            if [ "$tblName" = "store_sales" ]; then
                echo "rm -rf ./store_returns.dat" >> writeData.sh
                echo "mkfifo ./store_returns.dat" >> writeData.sh
            fi
            if [ "$tblName" = "web_sales" ]; then
                echo "rm -rf ./web_returns.dat" >> writeData.sh
                echo "mkfifo ./web_returns.dat" >> writeData.sh
            fi
            echo "./dsdgen -SCALE $scaleFactor -TABLE $tblName -FORCE Y -RNGSEED $rngSeed &" >> writeData.sh
            echo "./copyAndDelete.sh ./${tblName}.dat ${targetPath}/$tblName &" >> writeData.sh
            if [ "$tblName" = "catalog_sales" ]; then
                echo "./copyAndDelete.sh ./catalog_returns.dat ${targetPath}/catalog_returns &" >> writeData.sh
            fi
            if [ "$tblName" = "store_sales" ]; then
                echo "./copyAndDelete.sh ./store_returns.dat ${targetPath}/store_returns &" >> writeData.sh
            fi
            if [ "$tblName" = "web_sales" ]; then
                echo "./copyAndDelete.sh ./web_returns.dat ${targetPath}/web_returns &" >> writeData.sh
            fi
        fi
        let partsDone=1+$partsDone
        let workThreads=1+workThreads
    done
done <$workFile;
echo "echo \"[INFO] this machine has ${workThreads} dsden thread\" ">> writeData.sh
echo "echo \"[INFO] Waiting until completion...\" ">> writeData.sh
echo "./wait4process.sh dsdgen 0 " >> writeData.sh
echo " " >> writeData.sh
echo "[INFO] Started Generation @ "`date +%H:%M:%S`
startTime=`date +%s`
echo "[INFO] Executing writeData.sh on "${worker}
chmod 755 writeData.sh
sh writeData.sh
endTime=`date +%s`
echo "[INFO] Completed Generation @ "`date +%H:%M:%S`
echo "[INFO] Generated and loaded SF"${scaleFactor}" in "`echo $endTime - $startTime |bc`" sec"
It turns out plenty of people have hit this problem, and it is not a simple typo. It comes down to which shell the sh command actually runs, and ls -l /bin/*sh makes that clear. In my container:
-rwxr-xr-x 1 root root 1099016 May 15 2017 /bin/bash
-rwxr-xr-x 1 root root 117208 Jan 24 2017 /bin/dash
lrwxrwxrwx 1 root root 4 May 15 2017 /bin/rbash -> bash
lrwxrwxrwx 1 root root 4 Jan 24 2017 /bin/sh -> dash
So /bin/sh is symlinked to dash rather than bash, and dash cannot parse bash-only syntax such as the array assignment params=( $line ) in the script, which is exactly where the unexpected "(" comes from. The fix is simply not to launch the script with sh but with bash, e.g. bash generateTpcdsData.sh <scaleFactor>.
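A minimal way to see the difference, using a throwaway test file (the path and contents are just an illustration, not part of the benchmark kit):
echo 'params=( a b c ); echo ${params[1]}' > /tmp/arraytest.sh
sh /tmp/arraytest.sh      # dash chokes with something like: Syntax error: "(" unexpected
bash /tmp/arraytest.sh    # bash happily prints: b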