Advanced Week 211
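The ScalersTalk Advanced Interpreting Group (ScalersTalk 口译进阶小组) grew out of the ScalersTalk Consecutive Interpreting Group (ScalersTalk 交传小组), which was founded in February 2015. The group currently continues to focus on advanced-level consecutive and simultaneous interpreting training, practicing a range of interpreting skills while consolidating language fundamentals, so as to lay a solid foundation for interpreting at formal events.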
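The topic for Week 211 is "What is big data". Practice steps: 1. Play the video once on your own, do a recall, and post it to the group as a voice message; then play it again and post a second retelling. 2. Check your version against the transcript, work through the content, and collect vocabulary and sentence patterns. 3. Do one or two more run-throughs, telling the story with the expressions you have just learned.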
We all use smartphones, but have you ever wondered how much data it generates in the form of texts, phone calls, emails, photos, videos, searches and music? Approximately 40 exabytes of data gets generated every month by a single smartphone user. Now, imagine this number multiplied by 5 billion smartphone users. That’s a lot for our mind to even process, isn’t it? In fact, this amount of data is quite a lot for traditional computing systems to handle, and this massive amount of data is what we term big data.
To process / to handle: 处理
Exabyte (EB, 艾字节): a unit of computer storage capacity. By one estimate, the total capacity of the entire internet in 2011 did not exceed 525 EB.
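Other storage units: the bit (比特) is the smallest unit of storage. Computer storage is usually expressed in bytes (B, 字节), kilobytes (KB, 千字节), megabytes (MB, 兆字节), gigabytes (GB, 吉字节), terabytes (TB, 太字节), petabytes (PB, 拍字节), exabytes (EB, 艾字节), zettabytes (ZB, 泽字节) and yottabytes (YB, 尧它字节). Conversions: 1 byte = 8 bits; 1 KB = 1024 B; 1 MB = 1024 KB; 1 GB = 1024 MB; 1 TB = 1024 GB; 1 PB = 1024 TB; 1 EB = 1024 PB; 1 ZB = 1024 EB; 1 YB = 1024 ZB.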
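To make the conversion chain above concrete, here is a minimal Python sketch. The function names `to_bytes` and `humanize` are made up for this note; the only facts used are the 1024-based steps listed above.

```python
# Binary storage units from the note above; each is 1024 times the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value, unit):
    """Convert a value in the given unit to bytes (1 KB = 1024 B, and so on)."""
    return value * 1024 ** UNITS.index(unit)


def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the number below 1024."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop when the value fits in this unit, or when there is no larger unit.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024  # move up to the next unit


print(to_bytes(1, "EB"))             # number of bytes in one exabyte (2**60)
print(humanize(to_bytes(40, "EB")))  # back to a readable figure
print(to_bytes(1, "KB") * 8)         # bits in one kilobyte, since 1 byte = 8 bits
```

Seeing that one exabyte is 2 to the 60th power bytes gives a feel for the scale of the figures quoted in the video.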
Let’s have a look at the data generated per minute on the internet. 2.1 million snaps are shared on Snapchat, 2.8 million search queries are made on Google, 1 million people log in to Facebook, 4.5 million videos are watched on YouTube, and 188 million emails are sent. That’s a lot of data. So how do you classify any data as big data? This is possible with the concept of the 5 Vs: Volume, Velocity, Variety, Veracity and Value. Let’s understand this with an example from the health care industry. Hospitals and clinics across the world generate massive volumes of data. 2,314 exabytes of data are collected annually in the form of patient records and test results. All this data is generated at a very high speed, which contributes to the velocity of big data. Variety refers to the various data types, such as structured, semi-structured and unstructured data. Examples include Excel records, log files and X-ray images. The accuracy and trustworthiness of the generated data is termed veracity. Analyzing all this data will benefit the medical center by enabling faster disease detection, better treatment and reduced costs. This is known as the value of big data.
Search query: 搜索查询; collocation: make a search query (on Google)
Patient record: 病历
Log files:日志文件
Doing A will benefit B by enabling…:A可以通过……让B受益
Disease detection:疾病诊断;疾病监测
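The five characteristics of big data:
Volume (容量) refers to the large scale of the data, which keeps growing. At present it generally means data sets of more than 10 TB, but the size that counts as big data will change as technology advances.
Velocity (速率) means that data is generated and moves quickly. Data flow rate refers to the speed at which data is collected, stored, and analyzed for valuable information.
Variety (多样性) means that big data includes data in many different formats and of many different types. Data comes both from people interacting with systems and from machines generating it automatically, and this variety of sources leads to a variety of data types. Depending on whether the data has a defined schema, structure and relationships, it falls into three basic types: structured, unstructured and semi-structured data.
Veracity (真实性) refers to the quality and fidelity of the data.
Value (价值) here means low value density: as the volume of data grows, the meaningful information in it does not grow in proportion.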
But how do we store and process this big data? To do this job, we have various frameworks, such as Cassandra, Hadoop and Spark. Let us take Hadoop as an example and see how Hadoop stores and processes big data. Hadoop uses a distributed file system, known as the Hadoop Distributed File System, to store big data. If you have a huge file, your file will be broken down into smaller chunks and stored on various machines. Not only that, when you break up the file, you also make copies of it, which go to different nodes. In this way, you store your big data in a distributed way and make sure that if one machine fails, your data is safe on another. The MapReduce technique is used to process big data. A lengthy Task A is broken into smaller tasks: B, C and D. Now instead of one machine, three machines each take up a task and complete it in a parallel fashion, and the results are assembled at the end. Thanks to this, processing becomes easy and fast. This is known as parallel processing.
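The "split the task, run the pieces in parallel, assemble the results" idea from the passage can be sketched in a few lines of Python. This is only an illustration of the map and reduce pattern on one computer, not Hadoop's actual API: threads stand in for the three machines, and the names `map_count`, `reduce_counts` and `word_count` are invented for the example. A real MapReduce job also groups intermediate results by key between the two steps, which this sketch skips.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_count(chunk):
    """Map step: count the words in one chunk of the input."""
    return Counter(chunk.split())


def reduce_counts(partials):
    """Reduce step: merge the partial counts into one final result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(chunks, workers=3):
    # Threads stand in for separate machines here (an assumption for the demo);
    # on a real cluster each map task would run on a different node.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_count, chunks))
    return reduce_counts(partials)


# Task A (count all words) is split into three smaller tasks, one per chunk.
chunks = ["big data is big", "data is valuable", "big value"]
result = word_count(chunks)
print(result.most_common(3))
```

Each worker only ever sees its own chunk, which is why the work can proceed in parallel and be combined at the end.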
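Distributed file system (分布式文件系统): when multiple networked computers (sometimes called a cluster) work together to solve a problem as if they were a single system, we call this a distributed system. Distributed file systems are a subset of distributed systems, and the problem they solve is data storage. In other words, they are storage systems that span multiple computers. Data stored in a distributed file system is automatically spread across different nodes.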
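Distributed file systems have broad application prospects in the big data era: they provide the scalability needed to store and process very large volumes of data from the web and elsewhere.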
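The storage idea described above (break a file into chunks, copy each chunk to more than one node, survive the loss of a machine) can be sketched as follows. This is a toy model under stated assumptions: the node names, the tiny chunk size, the round-robin placement and the replication factor of 2 are all chosen for illustration, and HDFS itself uses much larger blocks and its own placement policy.

```python
def split_into_chunks(data, chunk_size):
    """Break a byte string into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks, nodes, replication=2):
    """Assign every chunk to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replication)]
        for chunk_id in range(num_chunks)
    }


def readable_after_failure(placement, failed_node):
    """True if every chunk still has at least one copy on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


data = b"x" * 10                     # stand-in for a huge file
chunks = split_into_chunks(data, 4)  # three chunks of 4, 4 and 2 bytes
nodes = ["node-a", "node-b", "node-c"]  # hypothetical machine names
placement = place_replicas(len(chunks), nodes)

print(placement)
print(readable_after_failure(placement, "node-b"))
```

Because every chunk lives on two different nodes, losing any single node still leaves a full copy of the file available.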
In a parallel fashion:同步进行/同步发生
Assemble the results:集成结果
One machine fails:一部机器出现故障
Now that we have stored and processed our big data, we can analyze this data for numerous applications. In games like Halo 3 and Call of Duty, designers analyze user data to understand at which stage most of the users pause, restart or quit playing. This insight could help them rework the storyline and improve the user experience, which in turn reduces the customer churn rate. Similarly, big data also helped with disaster management during Hurricane Sandy in 2012. It was used to gain a better understanding of the storm’s effect on the east coast of the U.S., and necessary measures were taken. It could predict the hurricane’s landfall five days in advance, which wasn’t possible earlier. These are some of the clear indications of how valuable big data can be once it’s accurately processed and analyzed.
we can analyze this data for numerous applications:数据分析的各种应用场景;将数据分析应用在各种场合。
Rework the storyline:重新修改故事情节
Customer churn rate:客户流失率
Disaster management:灾难管理
To gain a better understanding of:更好地了解
Landfall: the land that you see or arrive at first after a journey by sea or by air(航海或飞行后)初见陆地,踏上陆地;搭配:make landfall on
E.g. After three weeks they made landfall on the coast of Ireland. 三个星期之后,他们登上了爱尔兰的海岸。
Which wasn’t possible earlier:这是过去不可能做到的
users pause, restart or quit playing:暂停、重启或退出游戏
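Depending on the purpose of the analysis, big data analytics can be divided into descriptive analytics (描述性分析), diagnostic analytics (诊断性分析), predictive analytics (预测性分析) and prescriptive analytics (规范性分析). All four modes are widely used across the processes of business operations.
Descriptive analytics: summarizing and analyzing historical data (the gaming example in the text, which analyzes player behavior data, falls here).
Diagnostic analytics: mainly aims to explore the causes behind past events.
Predictive analytics: forecasting what may happen in the future by analyzing historical data (in the text, predicting the hurricane's impact serves this purpose).
Prescriptive analytics: builds on the results of predictive analytics, digging deeper to explain the underlying causes and to recommend what action to take.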
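As a small illustration of descriptive analytics, the gaming example can be reduced to counting events per stage. Everything in this Python sketch is hypothetical: the event log, the player IDs and the helper names are invented, and the churn rate uses one common definition (customers lost during a period divided by customers at the start of it).

```python
from collections import Counter

# Hypothetical event log: (player_id, stage, action).
events = [
    ("p1", 1, "pause"), ("p1", 2, "quit"),
    ("p2", 2, "restart"), ("p2", 2, "quit"),
    ("p3", 3, "pause"),
    ("p4", 2, "quit"),
]


def actions_by_stage(events, action):
    """Count how often the given action happens at each stage."""
    return Counter(stage for _, stage, act in events if act == action)


def churn_rate(customers_at_start, customers_lost):
    """Share of customers lost over a period (one common definition)."""
    return customers_lost / customers_at_start


quits = actions_by_stage(events, "quit")
worst_stage, quit_count = quits.most_common(1)[0]
print(f"Most players quit at stage {worst_stage} ({quit_count} quits)")
print(f"Churn rate: {churn_rate(4, 3):.0%}")
```

A summary like this only describes what happened; explaining why players quit at that stage, or forecasting future churn, would fall under the diagnostic and predictive modes listed above.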